Kubernetes Cluster Health Checks
Kubernetes is a complex container orchestration platform consisting of many components and more than 50 internal API objects. When issues arise in the cluster, it is important to know how to troubleshoot them. There are many health checks available for a Kubernetes cluster and its components, so let's go over them today.
Connecting to a Kubernetes Cluster with kubectl
To connect to a Kubernetes cluster using the kubectl command-line utility, you need a kubeconfig configuration file that contains the settings for connecting to the cluster. By default, this file is located in the hidden .kube directory in the user's home directory. On clusters deployed with kubeadm, the administrator's configuration file is located on the master node at /etc/kubernetes/admin.conf.
To copy the configuration file to the user's home directory and make it readable by your user, run the following commands:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Without the chown step, the copied file remains owned by root, and kubectl will not be able to read it when run as a regular user.
When using cloud-based Kubernetes clusters, you can download the kubeconfig file from the cluster's management panel instead. On Hostman, for example, it is available for download on the cluster's dashboard.
After copying the file to the user's home directory, you need to export the environment variable so that the kubectl utility can locate the configuration file. To do this, run:
export KUBECONFIG=$HOME/.kube/config
Now, the kubectl command will automatically connect to the cluster, and all commands will be applied to the cluster specified in the exported configuration file.
If you're using a kubeconfig file downloaded from the cluster's management panel, you can use the following command:
export KUBECONFIG=/home/user/Downloads/config.yaml
Where /home/user/Downloads/ is the full path to the config.yaml file.
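After the variable is set, you can quickly verify that kubectl is talking to the right cluster by checking the active context and requesting the node list:
kubectl config current-context
kubectl get nodes
If both commands succeed, the connection is working and you can proceed with the checks below.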
Checking the Client and Server Versions of Kubernetes
Although this check might seem trivial, it plays a fundamental role in starting the troubleshooting process. The reason is that kubectl is only supported within one minor version (older or newer) of the cluster's API server; a larger version skew can lead to unexpected issues. This is mentioned in the official kubectl installation documentation.
To check the client and server version of your Kubernetes cluster, run the following command:
kubectl version
In the output of the command, pay attention to the Client Version and Server Version lines. If the client and server versions differ by more than one minor version, the following warning will appear in the command output:
WARNING: version difference between client (1.31) and server (1.29) exceeds the supported minor version skew of +/-1.
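If you need the versions in machine-readable form, for example in a monitoring script, kubectl can print them as JSON. A minimal sketch, assuming the jq utility is installed for parsing:
kubectl version --output=json | jq -r '.clientVersion.gitVersion, .serverVersion.gitVersion'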
Retrieving Basic Cluster Information
During cluster health checks, it might be useful to know the IP address or domain name of the control plane component, as well as the address of the embedded Kubernetes DNS server — CoreDNS. To do this, use the following command:
kubectl cluster-info
For the most detailed information about the cluster, you can obtain a cluster dump using the command:
kubectl cluster-info dump
Note that this command produces a huge amount of data. For further use and analysis, it's a good idea to save the data to a separate file. To do this, redirect the output to a file:
kubectl cluster-info dump > cluster_dump.txt
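If you prefer separate files over one large dump, the same command can write its output into a directory, one file per namespace and object type, using the --output-directory flag:
kubectl cluster-info dump --output-directory=/tmp/cluster-dump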
Retrieving All Available Cluster API Objects
If you need to get a list of all the API objects available in the cluster, run the following command:
kubectl api-resources
Once you know the names of the cluster objects, you can perform various actions on them — from listing the existing objects to editing and deleting them. Instead of using the full object name, you can use its abbreviation (listed in the SHORTNAMES column), though abbreviations are not supported for all objects.
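For example, deployments can be shortened to deploy, so the following two commands are equivalent:
kubectl get deployments -A
kubectl get deploy -A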
Cluster Health Check
Let's go over how to check the health of various components within a Kubernetes cluster.
Nodes Health Check
Checking Node Status
Start by checking the status of the cluster nodes. To do this, use the following command:
kubectl get nodes
In the output, pay attention to the STATUS column. Each node should display a Ready status.
Viewing Detailed Information
If any node shows a NotReady status, you can view more detailed information about that node to understand the cause. To do this, use the command:
kubectl describe node <node_name>
In particular, pay attention to the Conditions and Events sections, which show all events on the node. These messages can help determine the cause of the node's unavailability.
Additionally, the Conditions section displays the status of the following node components:
NetworkUnavailable — Shows the status of the network configuration. If there are no network issues, the status will be False. If there are network issues, it will be True.
MemoryPressure — Displays the status of memory usage on the node. If sufficient memory is available, the status will be False; if memory is running low, the status will be True.
DiskPressure — Displays the status of available disk space on the node. If enough space is available, the status will be False. If disk space is low, the status will be True.
PIDPressure — Shows whether too many processes are running on the node. If the number of processes is well within the node's limit, the status will be False. If the node is close to exhausting its available process IDs, the status will be True.
Ready — Displays the overall health of the node. If the node is healthy and ready to run pods, the status will be True. If any issues are found (e.g., memory or network problems), the status will be False.
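On clusters with many nodes, running describe against each one is tedious. As a quick alternative, the following jsonpath query (standard kubectl jsonpath syntax, no extra tools required) prints each node's name together with the status of its Ready condition:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'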
Monitoring Resource Usage
In Kubernetes, you can track resource consumption for cluster nodes, pods, and individual containers. Note that the kubectl top command relies on the Metrics Server add-on; if Metrics Server is not installed in the cluster, these commands will return an error.
To display resource usage consumed by the cluster nodes, use the command:
kubectl top node
The top node command shows how much CPU and memory each node is consuming. CPU is displayed in millicores (m) and memory in mebibytes (Mi), alongside the percentage of the node's capacity that this represents.
To display resource usage of all pods across all namespaces running in the cluster, use the command:
kubectl top pod -A
If you need to display resource usage for pods in a specific namespace, specify the namespace with the -n flag:
kubectl top pod -n kube-system
To view the resource consumption specifically by containers running in pods, use the --containers option:
kubectl top pod --containers -A
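To spot the heaviest consumers quickly, the output can be sorted: kubectl top supports the --sort-by option with the values cpu and memory. For example:
kubectl top pod -A --sort-by=memory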
Viewing Events in the Cluster
To view all events within the cluster, use the following command:
kubectl get events
It will display all events, regardless of the type of object in the cluster. The following columns are used:
LAST SEEN — How long ago the event last occurred, displayed in seconds, minutes, hours, days, or months.
TYPE — Indicates the event's type, which is akin to a severity level. The supported types are Normal and Warning.
REASON — Represents the cause of the event. For example, Starting indicates that an object in the cluster was started, and Pulling means that an image for a container was pulled.
OBJECT — The cluster object that triggered the event, including nodes in the cluster (e.g., during initialization).
MESSAGE — Displays the detailed message of the event, which can be useful for troubleshooting.
To narrow down the list of events, you can use a specific namespace:
kubectl get events -n kube-system
For more detailed event output, use the wide option:
kubectl get events -o wide
The wide option adds additional columns of information, including:
SUBOBJECT — Displays the subobject related to the event (e.g., container, volume, secret).
SOURCE — The source of the event, which could be components like kubelet, node-controller, etc.
FIRST SEEN — The timestamp when the event was first recorded in the cluster.
COUNT — The number of times the event has been repeated since it was first seen in the cluster.
NAME — The name of the object (e.g., pod, secret) associated with the event.
To view events in real-time, use the -w flag:
kubectl get events -w
The get events command also supports filtering via the --field-selector option, where you specify a field from the get events output. For example, to display all events with a Warning type in the cluster:
kubectl get events --field-selector type=Warning -A
Additionally, filtering by timestamps is supported. To display events in the order they first occurred, use the .metadata.creationTimestamp parameter:
kubectl get events --sort-by='.metadata.creationTimestamp'
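Filtering and sorting can be combined. For example, to list all warnings in the cluster ordered by the time each event was last seen:
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'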
Monitoring Kubernetes API Server
The API server is a critical component, the "brain" of Kubernetes that processes all requests to the cluster. It must always be available to respond to requests. To check its status, you can use the dedicated API endpoints livez and readyz.
To check the live status of the API server, use the following command:
kubectl get --raw '/livez?verbose'
To check the readiness status of the API server, use the following command:
kubectl get --raw '/readyz?verbose'
If both the livez and readyz requests return an ok status, it means the API server is running and ready to handle requests.
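Both endpoints also expose their individual checks, which helps when the aggregate status is not ok. For example, to query only the etcd check of the readiness endpoint:
kubectl get --raw '/readyz/etcd'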
Kubernetes Cluster Components
To quickly display the status of all cluster components, use the following command:
kubectl get componentstatuses
If the STATUS and MESSAGE columns show Healthy and ok, it means the components are running successfully. If any component encounters an error or failure, the STATUS column will display Unhealthy, and the MESSAGE column will provide an error message. Note that the componentstatuses API has been deprecated since Kubernetes 1.19 and may not reflect the real state of the control plane; the livez and readyz endpoints described above are the recommended checks.
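On kubeadm-based clusters (an assumption here; managed cloud clusters usually hide these pods), you can also inspect the control plane components directly, since kubeadm runs them as static pods labeled tier=control-plane:
kubectl get pods -n kube-system -l tier=control-plane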
Container Runtime
As is known, Kubernetes itself does not run containers. Instead, it delegates this work to an external container runtime, which it communicates with through the Container Runtime Interface (CRI). It's important to ensure that the container runtime is functioning correctly. At the time of writing, Kubernetes supports the following container runtimes:
containerd
CRI-O
It's worth mentioning that Docker Engine is no longer supported directly starting from Kubernetes 1.24, when the dockershim component was removed; it can still be used through the third-party cri-dockerd adapter.
CRI-O
First, check the status of the container runtime. To do this, on the node where the error appears, run the following command:
systemctl status crio
In the Active line, the status should show active (running). If it shows failed, investigate further using CRI-O's status information and log files. To display basic information about CRI-O, including the latest error messages, use the command:
crictl info
If an error occurs while CRI-O is running, the message field in the crictl info output will contain a detailed description. CRI-O also writes container logs to files, typically located in the /var/log/crio/pods directory.
Additionally, you can use the journalctl logs. To display all logs for the crio unit, run:
journalctl -u crio
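When the log is large, journalctl's standard filtering options help narrow it down. For example, to show only error-level messages from the crio unit for the last hour:
journalctl -u crio -p err --since "1 hour ago"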
Containerd
As with crio, start by checking the status of the container runtime. On the node where the error appears, run:
systemctl status containerd
In the Active line, the status should show active (running). If the status shows failed, examine the most recent log entries shown at the bottom of the systemctl status output; they usually contain the error that caused the failure.
Alternatively, you can view the logs using journalctl. To display logs for the containerd unit, run:
journalctl -u containerd
You can also check the configuration file parameters for containerd using two commands (the output is usually quite large):
containerd config default — Displays the default configuration file. Use this if no changes have been made to the file. If errors occur, this file can be used for rollback.
containerd config dump — Displays the current configuration file, which may have been modified.
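A common recovery technique, assuming the default configuration path /etc/containerd/config.toml, is to regenerate the default file over the broken one and restart the service:
containerd config default | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd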
Pods Health Check
Kubernetes runs workloads in pods, the smallest deployable units in the cluster, where the application containers run. A healthy pod shows the Running status (or Completed for finished jobs), and its READY column shows all of its containers ready, for example 1/1. To display a list of all pods in the cluster and their statuses, use the following command:
kubectl get po -A
To display pods in a specific namespace, use the -n flag followed by the namespace name:
kubectl get po -n kube-system
For more detailed information about a pod, including any possible errors, use the kubectl describe pod command, which provides the most detailed information about the pod:
kubectl describe pod coredns-6997b8f8bd-b5dq6 -n kube-system
All events related to the pod, including errors, are displayed in the Events section.
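Besides describe, the container logs often contain the actual application error. Using the same example pod as above, you can print its current logs and, if the pod has restarted, the logs of the previous container instance:
kubectl logs coredns-6997b8f8bd-b5dq6 -n kube-system
kubectl logs coredns-6997b8f8bd-b5dq6 -n kube-system --previous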
Getting Information About Objects with kubectl describe
The kubectl describe command is a powerful tool for finding detailed information about an object, including searching for and viewing various errors. You can apply this command to all Kubernetes objects that are listed in the output of the kubectl api-resources command.
Deployment files are widely used when deploying applications in a Kubernetes cluster. They allow you to control the state of service deployments, including scaling application replicas. To display the statuses of all available deployments in the Kubernetes cluster, use the command:
kubectl get deployments -A
It is important that the READY, UP-TO-DATE, and AVAILABLE columns show the same number of pods as specified in the deployment manifest. If the READY column shows 0 or fewer pods than expected, some application pods have failed to start. To find the cause of the error, use the describe command with the object type, in this case deployment:
kubectl describe deployment coredns -n kube-system
Just as with describe for pods, all events, including errors, are displayed in the Events section, while the Conditions section summarizes the deployment's overall state.
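A useful complement to describe is the rollout status command, which waits until the deployment finishes rolling out or reports why it is stuck:
kubectl rollout status deployment/coredns -n kube-system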
Conclusion
Checking the health of a Kubernetes cluster is an important step in troubleshooting and resolving issues. Kubernetes consists of many components, each checked in its own way, and knowing what to check and how helps you identify and fix errors quickly.