The Kubernetes container orchestration platform is a complex system made up of many components and more than 50 internal API object types. When issues arise in a cluster, it is important to know how to troubleshoot them. Many different health checks are available for a Kubernetes cluster and its components. Let's go over them today.
To connect to a Kubernetes cluster using the kubectl command-line utility, you need a kubeconfig configuration file that contains the settings for connecting to the cluster. By default, this file is located in the hidden .kube directory in the user's home directory. On the master node, the configuration file is located at /etc/kubernetes/admin.conf.
To copy the configuration file to the user's home directory, run the following command (create the $HOME/.kube directory first if it does not exist):
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
When using cloud-based Kubernetes clusters, you can download the file from the cluster's management panel, for example on Hostman.
After copying the file to the user's home directory, export the KUBECONFIG environment variable so that the kubectl utility can locate the configuration file. To do this, run:
export KUBECONFIG=$HOME/.kube/config
Now, the kubectl command will automatically connect to the cluster, and all commands will be applied to the cluster specified in the exported configuration file.
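The copy step above can be wrapped in a small helper. This is a minimal sketch, not part of kubectl: the function name install_kubeconfig is hypothetical, and it assumes the source file is readable by the current user (on a master node, /etc/kubernetes/admin.conf is normally readable only by root, so the copy may need to be run with sudo).

```shell
# Hypothetical helper: copy a kubeconfig into place and restrict its permissions.
# Usage: install_kubeconfig SOURCE [DESTINATION]
install_kubeconfig() {
  src=$1
  dest=${2:-$HOME/.kube/config}   # default location that kubectl looks in
  if [ ! -r "$src" ]; then
    echo "cannot read $src" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dest")" || return 1   # create the .kube directory if missing
  cp "$src" "$dest" || return 1
  chmod 600 "$dest"                            # the file contains credentials
}
```

After calling the function with the path to a kubeconfig file, the export KUBECONFIG=$HOME/.kube/config step works as described above.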
If you're using a kubeconfig file downloaded from the cluster's management panel, you can use the following command:
export KUBECONFIG=/home/user/Downloads/config.yaml
Here, /home/user/Downloads/config.yaml is the full path to the downloaded config.yaml file.
Checking the client and server versions might not seem obvious, but it plays a fundamental role in starting the troubleshooting process. For Kubernetes to function stably, the kubectl client version should be within one minor version of the server version; a larger difference can cause unexpected issues. This is mentioned in the official kubectl installation documentation.
To check the client and server version of your Kubernetes cluster, run the following command:
kubectl version
In the output of the command, pay attention to the Client Version and Server Version lines. If the client and server versions differ by more than one minor version, the following warning will appear in the command output:
WARNING: version difference between client (1.31) and server (1.29) exceeds the supported minor version skew of +/-1.
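The skew in that warning is simply the difference between the two minor version numbers. The sketch below shows the arithmetic; the function names minor_of and minor_skew are hypothetical, and the version strings are assumed to look like v1.31.0 or 1.29, as printed in the Client Version and Server Version lines.

```shell
# Extract the minor version number from a string such as "v1.31.0" or "1.29".
minor_of() {
  v=${1#v}                    # drop an optional leading "v"
  v=${v#*.}                   # drop the major version and the first dot
  printf '%s\n' "${v%%.*}"    # keep everything before the next dot
}

# Print the absolute difference between the client and server minor versions.
minor_skew() {
  client=$(minor_of "$1")
  server=$(minor_of "$2")
  diff=$((client - server))
  if [ "$diff" -lt 0 ]; then diff=$((0 - diff)); fi
  printf '%s\n' "$diff"
}
```

For the versions in the warning above, minor_skew v1.31.0 v1.29.0 prints 2, which is outside the supported skew of 1.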
During cluster health checks, it can be useful to know the IP address or domain name of the control plane, as well as the address of the cluster's built-in DNS server, CoreDNS. To do this, use the following command:
kubectl cluster-info
For the most detailed information about the cluster, you can obtain a cluster dump using the command:
kubectl cluster-info dump
Note that this command produces a huge amount of data. For further use and analysis, it's a good idea to save the data to a separate file. To do this, redirect the output to a file:
kubectl cluster-info dump > cluster_dump.txt
If you need to get a list of all the API objects available in the cluster, run the following command:
kubectl api-resources
Once you know the names of the cluster objects, you can perform various actions on them, from listing the existing objects to editing and deleting them. Instead of using the full object name, you can use its abbreviation (listed in the SHORTNAMES column), though abbreviations are not supported for all objects.
Let's go over how to check the health of various components within a Kubernetes cluster.
Start by checking the status of the cluster nodes. To do this, use the following command:
kubectl get nodes
In the output, pay attention to the STATUS column. Each node should display a Ready status.
If any node shows a NotReady status, you can view more detailed information about that node to understand the cause. To do this, use the command:
kubectl describe node <node_name>
In particular, pay attention to the Conditions and Events sections. Events lists recent events on the node, and these messages can help determine the cause of the node's unavailability.
Additionally, the Conditions section displays the status of the following node conditions:
NetworkUnavailable: Shows the status of the network configuration. If there are no network issues, the status will be False. If there are network issues, it will be True.
MemoryPressure: Displays the status of memory usage on the node. If sufficient memory is available, the status will be False; if memory is running low, the status will be True.
DiskPressure: Displays the status of available disk space on the node. If enough space is available, the status will be False. If disk space is low, the status will be True.
PIDPressure: Shows whether too many processes are running on the node. If the number of processes is within limits, the status will be False. If the node is running too many processes, the status will be True.
Ready: Displays the overall health of the node. If the node is healthy and ready to run pods, the status will be True. If any issues are found (e.g., memory or network problems), the status will be False.
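On clusters with many nodes, it can help to filter the node list down to the ones that need attention. The following is a minimal sketch; the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where the node name is the first column and STATUS is the second.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column does not start with "Ready".
# A status such as "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}
```

A typical call would be kubectl get nodes --no-headers | not_ready_nodes, and each printed name can then be passed to kubectl describe node.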
In Kubernetes, you can track resource consumption for cluster nodes, pods, and the containers inside them. The kubectl top commands below require the Metrics Server to be installed in the cluster.
To display resource usage consumed by the cluster nodes, use the command:
kubectl top node
The top node command shows how much CPU and memory each node is consuming. CPU is displayed in millicores (for example, 250m) and memory with a unit suffix (for example, 1024Mi); both are also shown as percentages.
To display resource usage of all pods across all namespaces running in the cluster, use the command:
kubectl top pod -A
If you need to display resource usage for pods in a specific namespace, specify the namespace with the -n flag:
kubectl top pod -n kube-system
To view the resource consumption specifically by containers running in pods, use the --containers option:
kubectl top pod --containers -A
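The output of the top commands can be filtered to highlight heavy consumers. The sketch below is one way to do it; the function name pods_over_cpu is hypothetical, and it assumes the column layout of kubectl top pod -A --no-headers (namespace, pod name, CPU, memory), with CPU normally printed in millicores such as 250m.

```shell
# Read `kubectl top pod -A --no-headers` output on stdin and print the pods
# whose CPU usage is above the given limit in millicores.
# Usage: pods_over_cpu LIMIT_MILLICORES
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $3
    if (cpu ~ /m$/) {
      sub(/m$/, "", cpu)    # "750m" -> 750 millicores
      cpu += 0
    } else {
      cpu *= 1000           # assume whole cores if there is no "m" suffix
    }
    if (cpu > limit) print $1 "/" $2, cpu "m"
  }'
}
```

For example, kubectl top pod -A --no-headers | pods_over_cpu 500 lists every pod currently using more than half a CPU core.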
To view cluster events, use the following command (without a namespace flag it shows events from the current namespace; add -A to include all namespaces):
kubectl get events
It will display all events, regardless of the type of object in the cluster. The following columns are used:
LAST SEEN: How long ago the event last occurred, displayed in seconds, minutes, hours, or days.
TYPE: Indicates the event's type, which is akin to a severity level. The supported types are Normal and Warning.
REASON: Represents the cause of the event. For example, Starting indicates that an object in the cluster was started, and Pulling means that a container image is being pulled.
OBJECT: The cluster object that triggered the event, including nodes in the cluster (e.g., during initialization).
MESSAGE: Displays the detailed message of the event, which can be useful for troubleshooting.
To narrow down the list of events, you can use a specific namespace:
kubectl get events -n kube-system
For more detailed event output, use the -o wide option:
kubectl get events -o wide
The wide option adds additional columns of information, including:
SUBOBJECT: Displays the subobject related to the event (e.g., container, volume, secret).
SOURCE: The source of the event, which could be components like kubelet, node-controller, etc.
FIRST SEEN: The timestamp when the event was first recorded in the cluster.
COUNT: The number of times the event has been repeated since it was first seen in the cluster.
NAME: The name of the object (e.g., pod, secret) associated with the event.
To view events in real time, use the -w flag:
kubectl get events -w
The get events command also supports filtering via the --field-selector option, where you specify a field from the get events output. For example, to display all events with a Warning type in the cluster:
kubectl get events --field-selector type=Warning -A
Additionally, sorting by timestamp is supported. To display events in the order they were created, pass the .metadata.creationTimestamp field to the --sort-by option:
kubectl get events --sort-by='.metadata.creationTimestamp'
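When the event list is long, a per-reason summary shows which problems dominate. The following is a minimal sketch; the function name count_reasons is hypothetical, and it expects one event reason per line on stdin.

```shell
# Read one event reason per line on stdin and print "COUNT REASON" pairs,
# most frequent first. Blank lines are ignored.
count_reasons() {
  awk 'NF { count[$1]++ } END { for (r in count) print count[r], r }' | sort -rn
}
```

One way to produce that input is the custom-columns output format, for example: kubectl get events -A --field-selector type=Warning -o custom-columns=REASON:.reason --no-headers | count_reasons.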
The API server is a critical component and the "brain" of Kubernetes: it processes all requests to the cluster. The API server should always be available to respond to requests. To check its status, you can use special API endpoints: livez and readyz.
To check the liveness status of the API server, use the following command:
kubectl get --raw '/livez?verbose'
To check the readiness status of the API server, use the following command:
kubectl get --raw '/readyz?verbose'
If every check in both the livez and readyz outputs is reported as ok, it means the API server is running and ready to handle requests.
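The verbose output contains one line per individual check, so failed checks can be picked out automatically. The sketch below assumes the usual verbose format, in which passing checks are prefixed with [+] and failing checks with [-]; the function name failed_checks is hypothetical.

```shell
# Read the verbose /livez or /readyz output on stdin and print the names
# of the checks that failed (lines starting with "[-]").
# Prints nothing when every check passed.
failed_checks() {
  awk '/^\[-\]/ { sub(/^\[-\]/, ""); print $1 }'
}
```

For example, kubectl get --raw '/readyz?verbose' | failed_checks prints nothing on a healthy API server and one check name per line otherwise.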
To quickly display the status of all cluster components, use the following command:
kubectl get componentstatuses
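If the STATUS and MESSAGE columns show Healthy and ok, it means the components are running successfully. If any component encounters an error or failure, the STATUS column will display Unhealthy, and the MESSAGE column will provide an error message. Note that the ComponentStatus API has been deprecated since Kubernetes 1.19, so kubectl prints a deprecation warning with this command, and on newer clusters the livez and readyz endpoints are the more reliable check.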
Kubernetes itself does not run containers. Instead, the kubelet on each node talks to an external component, the container runtime, through the Container Runtime Interface (CRI). It's important to ensure that the container runtime is functioning correctly. At the time of writing, the container runtimes supported by Kubernetes include CRI-O and containerd, which are covered below.
It's worth mentioning that built-in support for Docker Engine as a container runtime (dockershim) was removed in Kubernetes version 1.24.
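Before checking a runtime, it helps to know which one a node actually uses. Kubernetes reports this in the CONTAINER-RUNTIME column of kubectl get nodes -o wide, with values such as containerd://1.7.2 or cri-o://1.29.1 (the version numbers here are only illustrative). The sketch below maps such a value to the systemd unit name used in the commands that follow; the function name runtime_unit is hypothetical.

```shell
# Map a container runtime string, as reported by Kubernetes for a node,
# to the name of the systemd unit to inspect on that node.
runtime_unit() {
  case $1 in
    containerd://*) echo containerd ;;
    cri-o://*)      echo crio ;;
    *)              echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}
```

The result can be passed to systemctl status or journalctl -u on the affected node.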
For CRI-O, first check the status of the container runtime service. To do this, on the node where the error appears, run the following command:
systemctl status crio
In the Active line, the status should show active (running). If it shows failed, further investigation is needed in the crio information messages and log files. To display the basic information about crio, including the latest error messages, use the command:
crictl info
If an error occurs while using crio, the message parameter will show a detailed description. crio also logs all its activities to log files, typically found in the /var/log/crio/pods directory.
Additionally, you can use the journalctl logs. To display all logs for the crio unit, run:
journalctl -u crio
For containerd, as with crio, start by checking the status of the container runtime service. On the node where the error appears, run:
systemctl status containerd
In the Active line, the status should show active (running). If the status shows failed, the systemctl status output also includes the unit's most recent log lines, including errors. To print them in full, without truncation or paging, run:
systemctl status containerd -l --no-pager
Alternatively, you can view the logs using journalctl. To display logs for the containerd unit, run:
journalctl -u containerd
You can also check the configuration parameters for containerd using two commands (the output is usually quite large):
containerd config default: Displays the default configuration. If errors occur after the configuration has been changed, this output can be used as a baseline for rollback.
containerd config dump: Displays the current configuration, which may have been modified.
Kubernetes operates with pods, the smallest deployable units in the cluster, in which application containers run. A healthy pod shows Running in the STATUS column (or Completed for finished jobs), and its READY column shows all of its containers as ready, for example 1/1. To display a list of all pods in the cluster and their statuses, use the following command:
kubectl get po -A
To display pods in a specific namespace, use the -n flag followed by the namespace name:
kubectl get po -n kube-system
For more detailed information about a pod, including any possible errors, use the kubectl describe pod command:
kubectl describe pod coredns-6997b8f8bd-b5dq6 -n kube-system
All events related to the pod, including errors, are displayed in the Events section.
The kubectl describe command is a powerful tool for finding detailed information about an object, including searching for and viewing various errors. You can apply this command to all Kubernetes objects that are listed in the output of the kubectl api-resources command.
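To decide which pods to describe, the pod list can be reduced to the ones that look unhealthy. The following is a minimal sketch under the assumption that the input has the column layout of kubectl get po -A --no-headers (namespace, name, READY, STATUS, and so on); the function name unhealthy_pods is hypothetical.

```shell
# Read `kubectl get po -A --no-headers` output on stdin and print pods that
# are not Running, or that are Running with fewer ready containers than
# expected. Pods in the Completed state are skipped.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")               # READY column, e.g. "1/2"
    if ($4 == "Completed") next         # finished jobs are not a problem
    if ($4 != "Running" || ready[1] != ready[2]) {
      print $1 "/" $2, $3, $4
    }
  }'
}
```

For example, kubectl get po -A --no-headers | unhealthy_pods prints the namespace, name, READY value, and status of each suspicious pod, which can then be inspected with kubectl describe pod.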
Deployments are widely used when deploying applications in a Kubernetes cluster. They allow you to control the state of service deployments, including scaling application replicas. To display the statuses of all available deployments in the Kubernetes cluster, use the command:
kubectl get deployments -A
It is important that the READY, UP-TO-DATE, and AVAILABLE columns display the same number of pods as specified in the deployment file. If the READY column shows 0 or fewer pods than specified, some or all of the application's pods are not running. To find the cause of the error, use the describe command with the type of object, in this case deployment:
kubectl describe deployment coredns -n kube-system
Just like when using describe for pods, the deployment's current state is displayed in the Conditions section, and related events, including errors, are displayed in the Events section.
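The comparison of ready and desired replicas described above can be automated. The sketch below assumes the column layout of kubectl get deployments -A --no-headers, where the third column is READY in the form ready/desired; the function name degraded_deployments is hypothetical.

```shell
# Read `kubectl get deployments -A --no-headers` output on stdin and print
# deployments that have fewer ready replicas than desired.
degraded_deployments() {
  awk '{
    split($3, ready, "/")                       # READY column, e.g. "1/3"
    if (ready[1] + 0 < ready[2] + 0) print $1 "/" $2, $3
  }'
}
```

For example, kubectl get deployments -A --no-headers | degraded_deployments lists only the deployments worth passing to kubectl describe deployment. A deployment intentionally scaled to zero (0/0) is not reported.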
Checking the health of a Kubernetes cluster is an important step in troubleshooting and resolving issues. Kubernetes consists of many different components, each of which is checked in its own way. It is important to know what to check and how to check it in order to identify and fix errors quickly.