Kubernetes Cluster Health Checks

Hostman Team
Technical writer
Kubernetes
28.01.2025
Reading time: 11 min

Kubernetes is a complex container orchestration platform made up of many components and more than 50 types of internal API objects. When a cluster misbehaves, you need to know where to look. This article walks through the health checks available for a Kubernetes cluster and its components.

Connecting to a Kubernetes Cluster with kubectl

To connect to a Kubernetes cluster with the kubectl command-line utility, you need a kubeconfig file that contains the connection settings for the cluster. By default, kubectl looks for this file at .kube/config in the user's home directory. On clusters set up with kubeadm, the administrator's configuration file is located on the master node at /etc/kubernetes/admin.conf.

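To copy the configuration file to the user's home directory, create the .kube directory if it does not exist, copy the file, and make your user its owner (a file copied with sudo belongs to root and would otherwise be unreadable for kubectl):

mkdir -p $HOME/.kube && sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config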

When using cloud-based Kubernetes clusters, you can download the file from the cluster's management panel, for example on Hostman.

kubectl reads $HOME/.kube/config by default, so for that location the next step is optional. To point kubectl at the configuration file explicitly, export the KUBECONFIG environment variable:

export KUBECONFIG=$HOME/.kube/config

Now, the kubectl command will automatically connect to the cluster, and all commands will be applied to the cluster specified in the exported configuration file.

If you're using a kubeconfig file downloaded from the cluster's management panel, you can use the following command:

export KUBECONFIG=/home/user/Downloads/config.yaml

Where /home/user/Downloads/config.yaml is the full path to the downloaded file.
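
Before exporting the variable, it can help to confirm that the file actually exists and is readable. The following is a minimal sketch; the helper name use_kubeconfig is made up for illustration and is not part of kubectl:

```shell
# Point kubectl at a kubeconfig file, but only if the file is readable.
# use_kubeconfig is a hypothetical helper name, not a kubectl feature.
use_kubeconfig() {
  cfg="$1"
  if [ ! -r "$cfg" ]; then
    echo "kubeconfig not found or not readable: $cfg" >&2
    return 1
  fi
  export KUBECONFIG="$cfg"
  echo "KUBECONFIG set to $KUBECONFIG"
}
```

For example, use_kubeconfig "$HOME/.kube/config" && kubectl config current-context sets the variable and then prints the context kubectl will use.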

Checking the Client and Server Versions of Kubernetes

This check may not seem obvious, but it is a sensible first step in troubleshooting. For Kubernetes to work reliably, the kubectl client must be within one minor version of the cluster's API server. For example, a v1.31 client can talk to v1.30, v1.31, and v1.32 control planes. This is mentioned in the official kubectl installation documentation.

To check the client and server version of your Kubernetes cluster, run the following command:

kubectl version

In the output of the command, pay attention to the Client Version and Server Version lines. If the client and server differ by more than one minor version, the following warning appears in the command output:

WARNING: version difference between client (1.31) and server (1.29) exceeds the supported minor version skew of +/-1.
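
The skew in that warning is simply the difference between the two minor version numbers. As a rough sketch, the helper below (minor_skew is an illustrative name) computes it from two version strings. The usage comment assumes jq is installed, which the article does not require:

```shell
# Print the absolute difference between the minor versions of two
# Kubernetes version strings such as v1.31.2 and v1.29.0.
# minor_skew is a hypothetical helper name.
minor_skew() {
  client_minor=$(echo "$1" | cut -d. -f2)
  server_minor=$(echo "$2" | cut -d. -f2)
  diff=$((client_minor - server_minor))
  if [ "$diff" -lt 0 ]; then
    diff=$((-diff))
  fi
  echo "$diff"
}

# Possible usage against a live cluster (assumes jq is available):
#   client=$(kubectl version -o json | jq -r '.clientVersion.gitVersion')
#   server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   minor_skew "$client" "$server"
```

A result greater than 1 corresponds to the unsupported skew that kubectl warns about.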

Retrieving Basic Cluster Information

During cluster health checks, it might be useful to know the IP address or domain name of the control plane component, as well as the address of the embedded Kubernetes DNS server — CoreDNS. To do this, use the following command:

kubectl cluster-info

For the most detailed information about the cluster, you can obtain a cluster dump using the command:

kubectl cluster-info dump

Note that this command produces a huge amount of data. For further use and analysis, it's a good idea to save the data to a separate file. To do this, redirect the output to a file:

kubectl cluster-info dump > cluster_dump.txt
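
As an alternative to one large text file, the dump command can write a directory tree with one set of files per namespace through its --output-directory flag. A small sketch, where the function name and the default directory name are arbitrary choices for illustration:

```shell
# Write a cluster dump for all namespaces into a directory.
# dump_cluster_state and the default directory name are illustrative only.
dump_cluster_state() {
  out_dir="${1:-cluster-dump-$(date +%Y%m%d-%H%M%S)}"
  kubectl cluster-info dump --all-namespaces --output-directory="$out_dir"
}
```

Calling dump_cluster_state without arguments creates a timestamped directory in the current working directory. On a large cluster, dumping all namespaces can take a while and produce a lot of data.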

Retrieving All Available Cluster API Objects

If you need to get a list of all the API objects available in the cluster, run the following command:

kubectl api-resources

Once you know the names of the cluster objects, you can perform various actions on them — from listing the existing objects to editing and deleting them. Instead of using the full object name, you can use its abbreviation (listed in the SHORTNAMES column), though abbreviations are not supported for all objects.

Cluster Health Check

Let's go over how to check the health of various components within a Kubernetes cluster.

Nodes Health Check

Checking Node Status

Start by checking the status of the cluster nodes. To do this, use the following command:

kubectl get nodes

In the output, pay attention to the STATUS column. Each node should display a Ready status.
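
On clusters with many nodes, scanning the STATUS column by eye gets tedious. The filter below is a sketch that reads the node list on standard input and prints only the nodes that need attention; not_ready_nodes is an illustrative name:

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the name
# and status of every node whose STATUS column is not exactly "Ready".
# A cordoned node shows "Ready,SchedulingDisabled" and is listed as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}
```

It can be used as kubectl get nodes --no-headers | not_ready_nodes. Empty output means every node reports Ready.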

You can complement node and pod health checks with metric alerts, as described in the Kubernetes Monitoring tutorial. It explains how to install VictoriaMetrics and set up monitoring with Prometheus and Grafana for RAM, CPU, and disk usage, which helps you spot abnormal resource patterns during cluster health assessments.

Viewing Detailed Information

If any node shows a NotReady status, you can view more detailed information about that node to understand the cause. To do this, use the command:

kubectl describe node <node_name>

In particular, pay attention to the Conditions and Events sections, which show all events on the node. These messages can help determine the cause of the node's unavailability.

Additionally, the Conditions section displays the status of the following node conditions:

  • NetworkUnavailable — Shows the status of the network configuration. If there are no network issues, the status will be False. If there are network issues, it will be True.

  • MemoryPressure — Displays the status of memory usage on the node. If sufficient memory is available, the status will be False; if memory is running low, the status will be True.

  • DiskPressure — Displays the status of available disk space on the node. If enough space is available, the status will be False. If disk space is low, the status will be True.

  • PIDPressure — Shows the status of process "overload." If there are only a few processes running, the status will be False. If there are many processes running, the status will be True.

  • Ready — Displays the overall health of the node. If the node is healthy and ready to run pods, the status will be True. If any issues are found (e.g., memory or network problems), the status will be False.
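
The rule described in the list above (Ready should be True, every other condition should be False) can be checked mechanically. The sketch below assumes the conditions arrive as Type=Status lines, which the jsonpath expression in the usage note produces; abnormal_conditions is an illustrative name:

```shell
# Read "Type=Status" lines on stdin and print the conditions that are not in
# their healthy state: Ready should be True, every other condition False.
abnormal_conditions() {
  awk -F= '
    $1 == "Ready" && $2 != "True"  { print; next }
    $1 != "Ready" && $2 != "False" { print }
  '
}
```

To feed it from a live node, replace <node_name> and run: kubectl get node <node_name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | abnormal_conditions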

Monitoring Resource Usage

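In Kubernetes, you can track resource consumption for cluster nodes, pods, and individual containers. The kubectl top commands below rely on the Metrics API, which is usually provided by the metrics-server add-on, so they return an error if it is not installed.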

To display resource usage consumed by the cluster nodes, use the command:

kubectl top node

The top node command shows how much CPU and memory each node is consuming. CPU is displayed in millicores (for example, 250m) and memory in mebibytes (Mi), and both are also shown as percentages.

To display resource usage of all pods across all namespaces running in the cluster, use the command:

kubectl top pod -A

If you need to display resource usage for pods in a specific namespace, specify the namespace with the -n flag:

kubectl top pod -n kube-system

To view the resource consumption specifically by containers running in pods, use the --containers option:

kubectl top pod --containers -A
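
To find the heaviest consumers quickly, kubectl top accepts --sort-by=cpu or --sort-by=memory. For a simple threshold check, the output can also be filtered, as in the sketch below. The helper name pods_over_cpu and the threshold are illustrative, and the filter assumes the CPU column carries the usual "m" (millicores) suffix:

```shell
# Read `kubectl top pod -A --no-headers` output on stdin and print pods whose
# CPU usage is at or above a threshold given in millicores.
# Assumes the CPU column (3rd with -A) uses the usual "m" suffix, e.g. 250m.
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $3
    sub(/m$/, "", cpu)
    if (cpu + 0 >= limit + 0) print $1 "/" $2, $3
  }'
}
```

For example, kubectl top pod -A --no-headers | pods_over_cpu 200 lists pods using 200 millicores or more.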

Viewing Events in the Cluster

To view events, use the following command. It lists events from the current namespace (add the -A flag to list events from all namespaces):

kubectl get events

The output includes events for every type of object. The following columns are used:

  • LAST SEEN: How long ago the event last occurred, shown as a relative duration (for example, 30s, 5m, or 2d).
  • TYPE: The event type, which is akin to a severity level. Kubernetes events have two types: Normal and Warning.
  • REASON: A short cause of the event. For example, Starting indicates that a component is starting, and Pulling means that a container image is being pulled.
  • OBJECT: The cluster object the event relates to, including nodes in the cluster (e.g., during initialization).
  • MESSAGE: The detailed message of the event, which can be useful for troubleshooting.

To narrow down the list of events, you can use a specific namespace:

kubectl get events -n kube-system

For more detailed event output, use the wide option:

kubectl get events -o wide

The wide option adds additional columns of information, including:

  • SUBOBJECT — Displays the subobject related to the event (e.g., container, volume, secret).
  • SOURCE — The source of the event, which could be components like kubelet, node-controller, etc.
  • FIRST SEEN — The timestamp when the event was first recorded in the cluster.
  • COUNT — The number of times the event has been repeated since it was first seen in the cluster.
  • NAME — The name of the object (e.g., pod, secret) associated with the event.

To view events in real-time, use the -w flag:

kubectl get events -w

The get events command also supports filtering via the --field-selector option, where you specify a field from the get events output. For example, to display all events with a Warning type in the cluster:

kubectl get events --field-selector type=Warning -A

Additionally, sorting by timestamp is supported. To display events in the order in which they were created, sort by the .metadata.creationTimestamp field:

kubectl get events --sort-by='.metadata.creationTimestamp'
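
When a cluster produces many warnings, a summary by reason often shows the dominant problem faster than the raw list. The sketch below counts identical lines on standard input; count_by_reason is an illustrative name, and the jsonpath expression in the usage note prints one event reason per line:

```shell
# Count identical lines on stdin and print "<count> <value>" pairs,
# most frequent first.
count_by_reason() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

A possible invocation: kubectl get events -A --field-selector type=Warning -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | count_by_reason. Field selectors can also be combined with commas, for example --field-selector type=Warning,involvedObject.kind=Pod.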

Monitoring Kubernetes API Server

The API server is a critical component, the "brain" of Kubernetes, which processes all requests to the cluster. It should always be available to respond to requests. To check its status, you can use two dedicated API endpoints: livez and readyz.

To check the liveness of the API server, use the following command:

kubectl get --raw '/livez?verbose'

To check the readiness status of the API server, use the following command:

kubectl get --raw '/readyz?verbose'

If both the livez and readyz requests return an ok status, it means the API server is running and ready to handle requests.
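
Both checks can be wrapped in a small loop for scripts or cron jobs. The sketch below relies only on the exit code of kubectl get --raw, which is non-zero when the endpoint does not return a success response; check_endpoints is an illustrative name:

```shell
# Query each given API server health endpoint and report ok or FAILED
# based on the exit code of `kubectl get --raw`.
check_endpoints() {
  for ep in "$@"; do
    if kubectl get --raw "$ep" >/dev/null 2>&1; then
      echo "$ep ok"
    else
      echo "$ep FAILED"
    fi
  done
}
```

For example, check_endpoints /livez /readyz /readyz/etcd also queries the individual etcd check, one of the checks listed in the verbose output.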

Kubernetes Cluster Components

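To quickly display the status of the control plane components, use the following command. Note that the ComponentStatus API has been deprecated since Kubernetes v1.19, so kubectl prints a deprecation warning and the results may be incomplete on some clusters: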

kubectl get componentstatuses

If the STATUS and MESSAGE columns show Healthy and ok, it means the components are running successfully. If any component encounters an error or failure, the STATUS column will display Unhealthy, and the MESSAGE column will provide an error message.

Container Runtime

Kubernetes does not run containers itself. Instead, the kubelet on each node talks to a container runtime through the Container Runtime Interface (CRI). It's important to ensure that the container runtime is functioning correctly. At the time of writing, the most widely used CRI-compatible runtimes are:

  • containerd
  • CRI-O

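It's worth mentioning that dockershim, the built-in adapter for Docker Engine, was removed in Kubernetes version 1.24. Since then, Docker Engine can only be used as a runtime through the external cri-dockerd adapter.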

CRI-O

First, check the status of the container runtime. To do this, on the node where the error appears, run the following command:

systemctl status crio

In the Active line, the status should show active (running). If it shows failed, further investigation is needed in the crio information messages and log files. To display the basic information about crio, including the latest error messages, use the command:

crictl info

If an error occurs while using crio, the message parameter will show a detailed description. crio also logs all its activities to log files, typically found in the /var/log/crio/pods directory.

Additionally, you can use the journalctl logs. To display all logs for the crio unit, run:

journalctl -u crio

Containerd

As with crio, start by checking the status of the container runtime. On the node where the error appears, run:

systemctl status containerd

In the Active line, the status should show active (running). If the status shows failed, check whether the daemon still responds on its socket. The containerd binary has no status subcommand. Instead, query the daemon with the bundled ctr client, which prints the client and server versions, or an error if the server cannot be reached:

sudo ctr version

Alternatively, you can view the logs using journalctl. To display logs for the containerd unit, run:

journalctl -u containerd
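
The full journal of a runtime unit can be very long. As a sketch, the helper below narrows it to error-level entries from the last hour using standard journalctl options; the name runtime_errors and the one-hour window are arbitrary choices:

```shell
# Print error-level (and more severe) journal entries from the last hour
# for a systemd unit. runtime_errors is a hypothetical helper name;
# pass crio or containerd as the unit.
runtime_errors() {
  journalctl -u "$1" -p err --since "1 hour ago" --no-pager
}
```

Run it on the affected node as runtime_errors containerd or runtime_errors crio, with sudo if your user cannot read the system journal.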

You can also check the configuration file parameters for containerd using two commands (the output is usually quite large):

  • containerd config default — Displays the default configuration file. Use this if no changes have been made to the file. If errors occur, this file can be used for rollback.

  • containerd config dump — Displays the current configuration file, which may have been modified.

Pods Health Check

Kubernetes operates with pods, the smallest deployable units in the cluster, in which application containers run. For a healthy pod, the STATUS column should show Running (or Completed for finished jobs), and the READY column should show all containers as ready, for example 1/1. To display a list of all pods in the cluster and their statuses, use the following command:

kubectl get po -A

To display pods in a specific namespace, use the -n flag followed by the namespace name:

kubectl get po -n kube-system

For more detailed information about a pod, including any errors, use the kubectl describe pod command:

kubectl describe pod coredns-6997b8f8bd-b5dq6 -n kube-system

All events related to the pod, including errors, are displayed in the Events section.
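
To decide which pods are worth describing, it helps to list only those that are not fully ready. The filter below is a sketch that reads the pod list on standard input; unready_pods is an illustrative name:

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print pods whose
# READY column (3rd with -A) shows fewer ready containers than expected.
# Pods in the Completed state are skipped, since finished jobs report 0/1.
unready_pods() {
  awk '$4 != "Completed" {
    split($3, r, "/")
    if (r[1] + 0 < r[2] + 0) print $1 "/" $2, $3, $4
  }'
}
```

It can be used as kubectl get pods -A --no-headers | unready_pods. A related built-in filter is kubectl get pods -A --field-selector=status.phase!=Running, which selects by pod phase rather than by container readiness.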

Getting Information About Objects with kubectl describe

The kubectl describe command is a powerful tool for finding detailed information about an object, including searching for and viewing various errors. You can apply this command to all Kubernetes objects that are listed in the output of the kubectl api-resources command.

Deployments are widely used to roll out applications in a Kubernetes cluster. They let you control the state of a service, including the number of application replicas. To display the statuses of all available deployments in the Kubernetes cluster, use the command:

kubectl get deployments -A

It is important that the READY, UP-TO-DATE, and AVAILABLE columns show the number of pods specified in the deployment manifest. If the READY column shows fewer ready pods than desired (for example, 0/2), some or all of the application pods have failed to start or are not passing their readiness checks. To find the cause, use the describe command with the type of object, in this case, deployment:

kubectl describe deployment coredns -n kube-system

Just like when using describe for pods, all events, including errors, are displayed in the Events section. The Conditions section shows whether the deployment is Available and Progressing.
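
The comparison between desired and ready replicas can also be scripted. The sketch below expects one "namespace name desired ready" line per deployment, which the jsonpath expression in the usage note produces; underreplicated is an illustrative name:

```shell
# Read "namespace name desired ready" lines on stdin and print deployments
# with fewer ready replicas than desired. A missing ready field counts as 0.
underreplicated() {
  awk '{
    ready = ($4 == "" ? 0 : $4)
    if (ready + 0 < $3 + 0) print $1 "/" $2, ready "/" $3
  }'
}
```

A possible invocation: kubectl get deployments -A -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name} {.spec.replicas} {.status.readyReplicas}{"\n"}{end}' | underreplicated. To wait for a single rollout to finish, kubectl rollout status deployment/coredns -n kube-system can be used instead.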

Conclusion

Checking the health of a Kubernetes cluster is an important step in troubleshooting and resolving issues. Kubernetes consists of many different components, each of which is checked in its own way. Knowing what to check, and how, lets you identify and fix errors quickly.

Similar

Kubernetes

Liveness, Readiness, and Startup Probes in Kubernetes: Complete Guide

Kubernetes is a powerful container orchestration platform that automates application deployment, scaling, and management. One of the key tasks in container management is ensuring that containers are healthy and ready to handle requests.  In Kubernetes, there are mechanisms known as probes: Liveness, Readiness, and Startup. With their help, Kubernetes monitors container states and makes decisions about restarting them, routing traffic, or waiting for initialization to complete. In this article, we’ll take a detailed look at each probe, how to configure them, common mistakes when using them, and best practices. Each probe will be accompanied by a practical example. We’ll be working through practical examples, which requires a Kubernetes cluster. You can rent a ready-made cluster using a cloud Kubernetes service. For basic service operation, one master node and one worker node with minimal configuration is enough. What are Kubernetes Probes? Probes in Kubernetes are diagnostic checks performed by the kubelet (an agent running on each Kubernetes node) to assess the state of containers. They help determine whether a container is functioning correctly, whether it is ready to accept network traffic, or whether it has finished initialization. Without such checks, Kubernetes cannot know for sure whether an application is in a healthy state, which can lead to service disruptions or incorrect request routing. There are three main types of probes in Kubernetes, each solving its own task: Liveness Probe checks whether the container is “alive,” i.e., working correctly. If the check fails, Kubernetes will automatically restart the container. Readiness Probe determines whether the container is ready to accept incoming network traffic. If the container is not ready, it is excluded from load balancing. Startup Probe is used for applications that require a long startup time, helping to avoid premature restarts or removal from routing. 
These checks are necessary to ensure fault tolerance and application stability. In particular, they allow Kubernetes to: Automatically restart stuck containers. Exclude containers from handling requests if they are temporarily unready. Control startup of applications with long initialization times. Reduce the likelihood of errors caused by incorrect traffic routing. Probes support three execution mechanisms: HTTP: Sending an HTTP request to a container endpoint. A response code in the range 200–399 is considered successful. TCP: Checking whether a TCP connection can be opened on a specified port. Command: Executing a command inside the container. A return code of 0 means success. Now, let’s look at each type of probe in more detail. Purpose of Liveness Probe The Liveness Probe determines whether a running container is functioning correctly. If the check fails, Kubernetes considers the container unhealthy and automatically restarts it. This is useful in situations where the application hangs, consumes too many resources, or encounters an internal error, but the container process itself continues running. How the Liveness Probe Works The kubelet periodically performs the check defined in the Liveness Probe configuration. If the check fails (for example, an HTTP request returns a 500 code or a command returns a non-zero exit code), Kubernetes increases the counter of failed attempts. After reaching the threshold (set by the failureThreshold parameter), the container will be restarted automatically. Configuration Parameters Below are the parameters used when configuring a Liveness Probe: initialDelaySeconds: Delay before the first check after the container starts (in seconds). periodSeconds: Interval between checks (in seconds). timeoutSeconds: Timeout for waiting for a response (in seconds). successThreshold: Minimum number of successful checks for the container to be considered “healthy” (usually set to 1). 
failureThreshold: Number of failed checks after which the container is considered “unhealthy.” Practical Example Let’s see how to use a Liveness Probe in practice. Below is the manifest: --- apiVersion: v1 kind: Namespace metadata: name: test-liveness-probe --- apiVersion: apps/v1 kind: Deployment metadata: name: test-liveness-probe-http namespace: test-liveness-probe spec: replicas: 1 selector: matchLabels: app: nginx-liveness template: metadata: labels: app: nginx-liveness spec: containers: - name: nginx--test-container image: nginx:1.26.0 livenessProbe: httpGet: path: / port: 80 initialDelaySeconds: 15 periodSeconds: 10 timeoutSeconds: 3 failureThreshold: 3 In this configuration, an Nginx web server image is used, and a Liveness Probe is set up to periodically perform an HTTP request to the root / endpoint on port 80 to monitor application health. The first check starts 15 seconds after the container launches (initialDelaySeconds) and is performed every 10 seconds (periodSeconds). If the HTTP request to /healthz returns a response code in the range 200–399, the check is considered successful. However, if three consecutive checks fail (failureThreshold), the container will be restarted automatically. Save the configuration above to a file named test-liveness-probe.yaml and apply it: kubectl apply -f test-liveness-probe.yaml Check that the pod has started successfully: kubectl get pods -n test-liveness-probe As you can see in the screenshot above, the pod started successfully and is in Running status. Now, let’s verify that the probe works. Check the pod logs with: kubectl logs test-liveness-probe-http-6bf85d548b-xc9lf -n test-liveness-probe (Remember to replace the pod name test-liveness-probe-http-6bf85d548b-xc9lf with the one displayed by the command above.) In the output, we will see that the probe sends a request every 10 seconds and successfully receives a response. Next, let’s test how the pod behaves if we change the probe to use a non-existent endpoint. 
Update the configuration like this: livenessProbe:   httpGet:     path: /nonexistent     port: 80 Save the changes and apply them: kubectl apply -f test-liveness-probe.yaml Check the pod status: kubectl get pods -n test-liveness-probe The pod is running, but note the RESTARTS column, which shows the number of pod restarts. In this case, the pod has restarted twice and will continue restarting. This is because of the Liveness Probe settings: if three consecutive checks fail, the container is restarted automatically. Typical Use Cases Liveness probes are vital for ensuring the continuous availability and reliability of applications running in Kubernetes. For instance, in a high-traffic production environment, a liveness probe ensures that if a container becomes unresponsive or encounters an issue, it is automatically restarted, minimizing downtime and improving the user experience. It helps ensure that faulty containers do not remain active for prolonged periods, which could affect application performance or lead to outages. Action on Liveness Probe Failure If a Liveness probe fails, Kubernetes takes action by restarting the container. This is crucial for scenarios like deadlocks, where the application within the container becomes unresponsive, or if the application is in a hung state, unable to recover on its own. The restart mechanism allows Kubernetes to recover the container and restore service availability without manual intervention. Purpose of Readiness Probe The Readiness Probe determines whether a container is ready to accept incoming network traffic. If the check fails, Kubernetes removes the container from routing (for example, from a Service or Ingress object), but does not restart the container. This allows temporary isolation of a container that is not ready to handle requests, for example, during a database update or cache loading. How the Readiness Probe Works The kubelet performs the check similarly to the Liveness Probe. 
If the check succeeds, the container is considered ready, and Kubernetes includes it in traffic routing. If the check fails, the container is excluded from routing but continues running. Configuration Parameters The Readiness Probe uses the same parameters as the Liveness Probe, but their values may differ. Practical Example Let’s look at a practical example of using a Readiness Probe. Below is a manifest that configures a readiness check using HTTP: apiVersion: v1 kind: Namespace metadata: name: test-readiness-probe --- apiVersion: apps/v1 kind: Deployment metadata: name: nginx-test-deployment namespace: test-readiness-probe labels: app: nginx spec: replicas: 1 selector: matchLabels: app: nginx template: metadata: labels: app: nginx spec: containers: - name: nginx-web-server image: nginx:1.14.2 ports: - containerPort: 80 readinessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 10 timeoutSeconds: 3 successThreshold: 1 failureThreshold: 3 In this configuration, we use an Nginx web server image. The check is performed every 10 seconds on port 80 (the default port for Nginx). If the probe fails three consecutive times on port 80, the container is removed from routing. The first check begins 5 seconds after the container starts. Save the configuration above into a file named test-readiness-probe.yaml and apply it: kubectl apply -f test-readiness-probe.yaml Verify that the pod has started successfully: kubectl get pods -n test-readiness-probe As shown in the screenshot above, the pod started successfully and is in Running status. Now let’s check probe behavior in case of a failure. To “break” the container, connect to the pod and remove the index file index.html: kubectl exec -it nginx-deployment-5c8b8b9669-q8vrh -n test-readiness-probe -- rm /usr/share/nginx/html/index.html Next, check the pod status: kubectl get pods -n test-readiness-probe As seen in the screenshot above, the pod will be marked as not ready (0/1 in the READY column). 
Check the pod events with: kubectl describe pod nginx-deployment-5c8b8b9669-q8vrh -n test-readiness-probe In the Events section, a message appears indicating that the Readiness Probe failed and returned a 403 error. To make the pod accept incoming traffic again, simply delete it: kubectl delete pod nginx-deployment-5c8b8b9669-q8vrh -n test-readiness-probe Check pod status again: kubectl get pods -n test-readiness-probe The pod is now ready to work and can accept incoming network traffic. Typical Use Cases Readiness probes are particularly important for applications that require time for initialization, such as database services or applications that need to load data or establish connections before becoming operational. For example, in a microservices architecture, a service may depend on other services being fully up and running before it can start accepting traffic. Readiness probes prevent such services from being overwhelmed by traffic requests during initialization, allowing for a smoother deployment process. Action on Readiness Probe Failure If the readiness probe fails, Kubernetes temporarily removes the Pod from the list of endpoints for the associated service, ensuring that traffic is not routed to the container while it is still in the process of becoming ready. The container will only be reintroduced to the service once the readiness probe succeeds, confirming that it can now handle incoming traffic. Purpose of Startup Probe The Startup Probe is intended for applications that require more time to start. It allows Kubernetes to wait for container initialization to complete before starting Liveness or Readiness checks. Without the Startup Probe, a slow application may be restarted due to failed Liveness checks, even though it simply hasn’t finished starting yet. How the Startup Probe Works The Startup Probe runs until it either succeeds or exceeds the failure threshold. 
Once it succeeds, Kubernetes begins running the Liveness and Readiness Probes (if configured). If the check fails, the container is restarted. Configuration Parameters The Startup Probe uses parameters similar to other probes but typically with higher values for initialDelaySeconds and failureThreshold to account for the application’s long startup time. Practical Example Let’s see how to use a Startup Probe in practice. Here’s a configuration for an application with a long initialization time: apiVersion: v1 kind: Namespace metadata: name: test-startup-probe --- apiVersion: apps/v1 kind: Deployment metadata: name: startup-demo namespace: test-startup-probe spec: replicas: 1 selector: matchLabels: app: startup-demo template: metadata: labels: app: startup-demo spec: containers: - name: demo-container image: nginx:alpine ports: - containerPort: 80 lifecycle: postStart: exec: command: ["/bin/sh", "-c", "sleep 30 && touch /usr/share/nginx/html/ready"] startupProbe: exec: command: ["cat", "/usr/share/nginx/html/ready"] failureThreshold: 10 periodSeconds: 5 livenessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 5 readinessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 5 In this configuration, a container based on an Nginx image is launched, simulating a slow startup. Kubernetes uses three probes to manage container state: Startup Probe: Checks if the container has finished starting by running cat /usr/share/nginx/html/ready every 5 seconds (periodSeconds: 5). If the ready file exists and is accessible (command succeeds), the container is considered started. The probe is retried up to 10 times (failureThreshold: 10), giving a maximum of 50 seconds for a successful startup. If all attempts fail, the container restarts. 
Liveness Probe: Checks whether the container continues running correctly by sending an HTTP GET request to / on port 80 every 5 seconds (periodSeconds: 5) after an initial delay of 5 seconds (initialDelaySeconds: 5). If the server responds with a code in the 200–399 range, the check is successful. Otherwise, the container is considered unhealthy and restarted. Readiness Probe: Determines whether the container is ready to accept traffic by also sending an HTTP GET request to / on port 80 every 5 seconds (periodSeconds: 5) after a 5-second delay. If successful (response code 200–399), the container is included in load balancing. If it fails, the container is excluded from routing but not restarted. Save the configuration to a file named test-startup-probe.yaml and apply it: kubectl apply -f test-startup-probe.yaml Check pod status: kubectl get pods -n test-startup-probe As shown in the screenshot above, the pod starts but is not ready (0/1 in the READY column). During the first 30 seconds (because of sleep 30 in postStart), the pod remains in Running status but not Ready, since the Startup Probe is waiting for the /usr/share/nginx/html/ready file to appear. After 30 seconds, the pod becomes ready for work. Typical Use Cases Startup probes are designed for applications that have a long or complex initialization process. For example, a database system might need several minutes to fully initialize before it can serve traffic or respond to requests. Without a startup probe, Kubernetes might prematurely restart the container based on failed liveness or readiness probes, causing unnecessary restarts and delays in the application becoming operational. The startup probe ensures the container is given sufficient time to complete its startup sequence before being evaluated by other probes. Common mistakes Using incorrect values. Small values for periodSeconds or timeoutSeconds can lead to false positives due to temporary delays. 
You should set reasonable values (for example, periodSeconds: 10, timeoutSeconds: 3). Missing endpoints for checks. If the application does not define endpoints (e.g., /healthz or /ready), HTTP checks will fail. Implement the necessary endpoints in the application during development. Overloading the container. Frequent checks can overload the application, especially if they perform complex operations. Use lightweight checks, such as TCP instead of HTTP, if that is sufficient. Ignoring the Startup Probe. Without a Startup Probe, slow applications may be restarted because of failed Liveness checks. You need to use and properly configure a Startup Probe for applications with long startup times. Best practices Separate Liveness and Readiness probes. The Liveness Probe should check whether the application is running, while the Readiness Probe should check whether it is ready to accept network traffic. For example, Liveness can check for process existence, while Readiness can check the availability of external dependencies. Use a Startup Probe for slow applications. If an application takes more than 10–15 seconds to start, configure a Startup Probe to avoid premature restarts. Implement lightweight checks. HTTP checks should return minimal data to reduce application load. TCP checks are preferable if checking port availability is sufficient. Take dependencies into account. If the application depends on a database or another service, configure the Readiness Probe to check the availability of those dependencies. Key Differences Liveness vs. Readiness: Liveness probes are concerned with the container's overall health and are designed to trigger restarts if the container is unresponsive. In contrast, Readiness probes focus on whether the container is ready to serve traffic, preventing requests from being routed to a container that isn't fully initialized or capable of handling them. 
Considerations

When configuring probes, select the probe type based on your application's behavior and requirements. Consider how long the application takes to become responsive and which initialization tasks it needs to complete. Properly configured probes produce accurate health checks, minimize unnecessary restarts, and make better use of container resources.

Conclusion

Liveness, Readiness, and Startup Probes in Kubernetes are critical tools that allow you to:

Monitor container health

Automatically restart failed instances

Exclude unready containers from routing

Give slow applications enough time for initialization

Proper probe configuration requires understanding how the application works and carefully tuning the parameters. Using probes in Kubernetes not only increases application stability but also simplifies infrastructure management by automating responses to failures and state changes.
18 September 2025 · 15 min to read
Kubernetes

How to Install Kubecost: Full Installation Guide

Kubecost is a tool for monitoring and managing costs in Kubernetes. It shows in real time how many resources (CPU, RAM, storage, etc.) each component (pod, service, namespace, deployment) consumes and how that translates into money. It is mainly used to monitor costs per service and optimize resource usage. Kubecost brings cost transparency, letting you see how much each application or namespace costs, and it automatically identifies unused resources.

This tool is useful for DevOps engineers in managing and optimizing resources, financial analysts in tracking infrastructure spending, and project managers in allocating costs across teams and projects.

In this article, we'll go through the installation, integration, and initial configuration of Kubecost.

Installing Kubecost

Let's walk through the installation of Kubecost step by step.

Step Zero: Create and Connect to a Kubernetes Cluster

To use Kubecost, you'll need:

A Kubernetes cluster with a supported version (1.16 or newer).

Sufficient resources in the cluster (a minimum of 2 CPUs and 4 GB RAM is recommended for Kubecost pods).

A cluster management tool like kubectl.

Hostman's cloud infrastructure lets you create a Kubernetes cluster with a recommended configuration (2 CPUs @ 3.3 GHz, 4 GB RAM, 60 GB NVMe). We described the process of creating a cluster in the documentation. For easier monitoring, you can also install the Kubernetes Dashboard with a single click.

Once the cluster is created, connect to it. We recommend using Lens. The connection process is also described in detail in our docs.

You'll need a terminal with the cluster's context. To access it, navigate to the Overview tab in Lens and click the Terminal button located at the bottom. All command-line operations will be performed in this terminal.

Step One: Choose a Storage Type

Kubecost requires persistent storage to function properly. For development, Local Path Provisioner is a good option; for production, we recommend an external fault-tolerant storage solution.

Local Path Provisioner

This is convenient in test and local environments where a single node and low fault tolerance are sufficient. However, in clusters with multiple nodes under active testing, it may not be enough since it's limited to local disks.

Here's how to install it using Rancher's ready-made manifest:

curl -s https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml | kubectl apply -f -

Expected output:

namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
role.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
rolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created

Ensure the pod is running:

kubectl get pods -n local-path-storage

Expected output:

NAME                               READY   STATUS    RESTARTS   AGE
local-path-provisioner-xxx         1/1     Running   0          68s

After installation, a StorageClass named local-path should appear:

kubectl get sc

Expected output:

NAME         PROVISIONER              ... VOLUMEBINDINGMODE     AGE
local-path   rancher.io/local-path    ... WaitForFirstConsumer  5s

To set the created local-path as the default storage class:

kubectl patch storageclass local-path \
  -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Expected output:

storageclass.storage.k8s.io/local-path patched

External Storage

For production use, where highly available volumes and automatic node-failure recovery are important, choose more reliable solutions than Local Path Provisioner, such as S3 storage.

Step Two: Install Kubecost

Add the Kubecost Helm repository and update it:

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update

Now use one of the following Helm commands.

If the cluster has a default StorageClass:

helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace

If the cluster does NOT have a default StorageClass:

helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set global.storageClass=<STORAGECLASS>

Expected output:

Kubecost 2.x.x has been successfully installed.

Step Three: Verify Installation

Check that the PersistentVolumeClaims (PVCs) created by Kubecost are in Bound status:

kubectl get pvc -n kubecost

Expected output (trimmed for clarity):

NAME                         STATUS
kubecost-cost-analyzer       Bound
kubecost-prometheus-server   Bound

Make sure each PVC shows Bound. Next, ensure all pods are running and error-free:

kubectl get pod -n kubecost

Expected output:

NAME                              READY   STATUS    RESTARTS
kubecost-cost-analyzer-xxx        4/4     Running   0
kubecost-forecasting-xxx          1/1     Running   0
kubecost-grafana-xxx              2/2     Running   0
kubecost-prometheus-server-xxx    1/1     Running   0

If you see this, Kubecost is installed correctly.

Step Four: Port Forwarding

To manage Kubecost and view its metrics, you need to forward its service port to your local machine.
First, identify the service used by Kubecost:

kubectl get svc -n kubecost

Expected output (trimmed):

NAME                       TYPE        CLUSTER-IP        EXTERNAL-IP   PORT
kubecost-cost-analyzer     ClusterIP   10.111.138.113    <none>        9090/TCP

The desired service is typically kubecost-cost-analyzer, and its port is 9090. Forward it:

kubectl port-forward -n kubecost service/kubecost-cost-analyzer 9090:9090

Expected output:

Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

Now you can use Kubecost via the web UI at http://localhost:9090.

How to Configure Kubecost

Note: The Kubecost UI may vary by version; button labels, metrics, and other elements may change.

Go to the Settings tab (on the left, might be hidden) for initial configuration.

Cost Model Configuration

Filling out this section is recommended for accurate cost calculations.

Scroll to the Pricing section and enable the Enable Custom Pricing toggle. The app will prompt you to enter resource pricing manually.

If using Hostman, you can find this pricing info on the Create Cluster page, under section 3. Worker Nodes Configuration, tab Custom. There, sliders will display the cost of the selected configuration.

Note: In Hostman, the cost of fixed-configuration worker nodes is lower than that of equivalent custom-configured ones.

Example field entry:

Field | Value | Description
Monthly CPU Price | $1.80 | Price per 1 vCPU
Monthly Spot CPU Price* | 0 | Price per 1 Spot vCPU
Monthly RAM Price | $1.50 | Price per 1 GB of RAM
Monthly Spot RAM Price* | 0 | Price per 1 GB of Spot RAM
Monthly GPU Price* | 0 | Price per 1 GPU
Monthly Storage Price | $0.04 | Price per 1 GB of storage

* Not used in Hostman.

Custom Labels

Labels are used to identify, group, and detail costs associated with Kubernetes resources.

Scroll to the Labels section. It's similar in layout to the cost model section.
Name | Description | Default Value
Owner Label / Annotation | Indicates resource owner (e.g., user or team)* | owner
Team Label | Defines the team using the resource* | team
Department Label | Links the resource to a department or cost center* | department
Product Label | Specifies the app/product the resource is for* | app
Environment Label | Indicates the environment (dev, prod, staging, etc.) | env
GPU Label | Node-level label indicating GPU type | (none)
GPU Label Value | Label value indicating GPU presence | (none)

* Supports CSV format.

Prometheus Status Check

Kubecost retrieves metrics from Prometheus. Scroll down to Prometheus Status, near the bottom of the Settings page.

You should see green checkmarks for each metric. If metrics are missing, Kubecost may not work as expected. For full diagnostics, visit: http://localhost:9090/diagnostics.

Alert Configuration

Kubecost can notify users of unexpected events. Alerts can be sent via email, Slack, webhooks, or Microsoft Teams.

Go to the Alerts tab. Under Global Recipients, enter the contacts for global alert delivery. Below that, you can define alert types and specific recipients. Each type is described below:

Name | Description
Allocation Budget | Budget for cost allocation at namespace/team/project level. Notifies on overage.
Allocation Efficiency | Resource usage efficiency (e.g., CPU, RAM) within budgets or namespaces.
Allocation Recurring Update | Regular updates on resource allocation and costs.
Allocation Spend Change | Notifies of significant changes in resource spend.
Asset Budget | Budget for physical/virtual resources (nodes, GPUs, disks). Alerts on overage.
Asset Recurring Update | Regular updates on physical/virtual resource usage.
Cloud Cost Budget | Budget for cloud costs. Alerts when exceeded.

Uninstalling and Reinstalling Kubecost

Sometimes, full uninstallation is required to fix issues, for example if no default StorageClass was set during the initial install.
To remove Kubecost completely:

helm uninstall kubecost -n kubecost
kubectl delete ns kubecost

To reinstall, follow Step Two again.

Troubleshooting Common Issues

Error | Symptoms | Solution
Out of memory | Pods show OOMKilled or CrashLoopBackOff status | Add new worker nodes via Hostman (Resources tab). Kubernetes will reschedule the pods.
Lack of CPU or disk | Pods stuck in CrashLoopBackOff; Prometheus shows incomplete data | Add more resources; check Prometheus logs for retention or WAL errors.
Prometheus out of disk space | Logs show Storage retention limit reached, WAL write errors | Resize the disk (for external storage), or add a new disk and migrate Prometheus data (local).
UI slow / Graphs timing out | Graphs load slowly or time out | Increase resources.requests/limits; optimize Prometheus retention and use recording rules.
No PersistentVolume for PVC | Error: 0/2 nodes ... no available persistent volumes to bind | Refer to Step One, then reinstall Kubecost with proper storage.
PVC stuck in Pending | kubectl get pvc shows Pending; no PV or no StorageClass | Ensure a storage class exists or set one manually.
Missing metrics in UI | No data/graphs; logs show Unable to query Prometheus | Verify Prometheus is running and has enough disk.
Helm install fails | Errors like chart not found, or failed resource creation | Retry Step Two; ensure you have proper RBAC permissions.
UI inaccessible via port | Port-forward runs, but http://<node_ip>:9090 fails | Use http://localhost:9090 if running locally; otherwise configure NodePort or LoadBalancer access.
Zero dollar cost in UI | Cost Allocation shows $0 or no data | Manually define the cost model under Settings > On-Prem.

Conclusion

Kubecost is a powerful tool for monitoring and optimizing Kubernetes costs. It helps make infrastructure spending transparent and manageable. This guide covered the full installation and configuration process, including cluster preparation, choosing a storage class, Helm-based deployment, cost model setup, and Prometheus integration.
Effective use of Kubecost not only helps reduce expenses but also improves resource management across teams, projects, and applications. By following this guide, you’ll be able to deploy and tailor Kubecost to suit your infrastructure needs.
25 July 2025 · 10 min to read
Kubernetes

Kubernetes Backup

The Kubernetes containerization platform processes and stores large volumes of data from various cluster components, including persistent storage blocks (Persistent Volumes), various manifests, and configuration files such as Deployments, ConfigMaps, and Secrets. It is important to organize backups to protect this data.

There are various solutions for simplifying the Kubernetes backup process. One of them is Velero, specifically designed to create Kubernetes cluster backups. Today, we will take a detailed look at the process of creating backups using Velero.

Prerequisites

A deployed and running Kubernetes cluster. It can be a self-hosted cluster or a Kubernetes cluster in the Hostman cloud.

Object storage for backup files. In this guide, we will use Hostman S3 object storage.

A server or a computer from which we will manage the cluster and install Velero. We'll use a machine with Ubuntu 24.04.

kubectl utility installed. The kubectl minor version should differ from the cluster's by no more than one. For instance, if the cluster version is 1.31, you can use kubectl versions from 1.30 to 1.32. To download a specific version of kubectl, specify it in the URL, for example:

curl -LO https://dl.k8s.io/release/v1.32.0/bin/linux/amd64/kubectl

After installation, check the version:

kubectl version --client

Helm package manager installed. Helm simplifies installing, upgrading, and managing applications within a Kubernetes cluster. Helm organizes complex Kubernetes configurations into manageable packages called charts.

Creating S3 Storage

S3 is an object storage service for reliable storage of large datasets. Since Velero requires object storage, let's create one in the S3 Storage section of the Hostman management panel.

Click the Create button.

For this guide, we'll select the minimum storage size of 10 GB. In practice, you should choose a size that meets your needs. Set the storage type to Public. You can also rename the bucket if needed.
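Returning to the kubectl prerequisite above: the "within one minor version" rule can be checked with a few lines of shell. This is only a sketch; the helper names minor_of and skew_ok are hypothetical, and the comparison assumes both versions share the same major number, as Kubernetes 1.x releases do.

```shell
#!/usr/bin/env sh
# Extract the minor number from a version string such as "v1.31.4" or "1.31".
minor_of() {
  echo "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Succeed when the client and server minor versions differ by at most one.
# Assumes both versions have the same major number.
skew_ok() {
  local client server diff
  client="$(minor_of "$1")"
  server="$(minor_of "$2")"
  diff=$((client - server))
  # ${diff#-} strips a leading minus sign, giving the absolute value.
  [ "${diff#-}" -le 1 ]
}

# Example with hard-coded versions (replace with your own):
if skew_ok "v1.32.0" "v1.31.2"; then
  echo "kubectl and cluster versions are compatible"
else
  echo "kubectl is more than one minor version away from the cluster"
fi
```

On a machine that is already connected to the cluster, the two version strings can be read from the clientVersion.gitVersion and serverVersion.gitVersion fields printed by kubectl version -o json.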
Velero Overview

Velero is an open-source client-server utility for creating backups and restoring Kubernetes cluster resources. It works with Kubernetes objects (such as Pods, Deployments, and Services) and saves them as snapshots. Additionally, it can back up data from Persistent Volume (PV) objects.

Velero Key Features:

Backup Creation: Save the state of the Kubernetes cluster, including manifests and Persistent Volumes.

Data Restoration: Restore the entire cluster or individual resources from a backup.

Data Migration: Move resources between Kubernetes clusters.

Velero Architecture

The Velero architecture consists of the following key components:

Velero Server (deployed inside the Kubernetes cluster): The server component runs as a Deployment object within the Kubernetes cluster. It handles backup and recovery tasks.

CLI (deployed outside the cluster): The client component provides a command-line interface for managing Velero and sends commands to the Velero server.

Cloud Storage Provider Plugins: Used to interact with data storage services (e.g., Amazon S3, Google Cloud Storage, and Azure Blob Storage).

Preparing the kubeconfig File

To connect to a cluster, you need the kubeconfig file, a YAML file containing connection details for the cluster. If you are using a Kubernetes cluster from Hostman, you can download the kubeconfig file from the Dashboard of your cluster.

Next, export the KUBECONFIG environment variable, specifying the full path to the kubeconfig file.

Linux and macOS

In the terminal, run the following command:

export KUBECONFIG=/root/Daring_Linnet_config.yaml

Windows

In Windows PowerShell, use this command:

$env:KUBECONFIG = "C:\Users\alex\plugins\container-service\clusters\customername\Daring_Linnet_config.yaml"

Replace Daring_Linnet_config.yaml with the name of your kubeconfig file.
After exporting the environment variable, check the connection to the cluster by listing all available nodes:

kubectl get nodes

If the command returns a list of nodes, we have successfully connected to the cluster.

Installing Velero

Installing the Client Component

As mentioned earlier, Velero consists of a client (CLI) and a server component. We'll start by installing the client, which provides a command-line interface.

Download the .tar.gz archive for the Velero client and extract it. We'll use version 1.15.1:

curl -L https://github.com/vmware-tanzu/velero/releases/download/v1.15.1/velero-v1.15.1-linux-amd64.tar.gz | tar -xz

The output will be a directory named velero-v1.15.1-linux-amd64 (where v1.15.1 is the version used).

Move the velero binary from that directory to /usr/local/bin:

mv velero-v1.15.1-linux-amd64/velero /usr/local/bin/

Check that the utility works by displaying its version:

velero version

If the version is displayed, the client component has been successfully installed. Now we will proceed with the installation of the server component.

Installing the Server Component

One way to install the server component of Velero is through a Helm chart. To install Velero using Helm, follow these steps:

Create a new namespace named velero:

kubectl create namespace velero

Create a new Kubernetes Secret object to store the aws_access_key_id and aws_secret_access_key variables. These keys are essential for authenticating and authorizing access to S3 storage.

S3 Access Key: A public identifier used to identify the user or application making the request.

S3 Secret Access Key: A private key used to digitally sign requests. Keep this key confidential.

To find the S3 Access Key and S3 Secret Access Key, go to the S3 Storage section in the Hostman management panel and click on the bucket.
Copy these values and create a new file named velero-credentials-secret.yaml:

nano velero-credentials-secret.yaml

Add the following content:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id = UOY3beX5A3bV9Ly
    aws_secret_access_key = F3x78pH1d5BOu4BfVv

Create the secret in Kubernetes:

kubectl apply -f velero-credentials-secret.yaml

Add the official vmware-tanzu Helm repository:

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts

Update the repository list:

helm repo update

List the repositories to confirm the addition:

helm repo ls

Install Velero using the following command:

helm install velero vmware-tanzu/velero \
  --namespace velero \
  --set credentials.existingSecret=cloud-credentials \
  --set 'configuration.backupStorageLocation[0].name=default' \
  --set 'configuration.backupStorageLocation[0].provider=aws' \
  --set 'configuration.backupStorageLocation[0].bucket=f60e2023-bucket-for-velero' \
  --set 'configuration.backupStorageLocation[0].config.region=us-2' \
  --set 'configuration.backupStorageLocation[0].config.s3ForcePathStyle=true' \
  --set 'configuration.backupStorageLocation[0].config.s3Url=https://s3.hostman.com' \
  --set 'configuration.volumeSnapshotLocation[0].name=default' \
  --set 'configuration.volumeSnapshotLocation[0].provider=aws' \
  --set 'configuration.volumeSnapshotLocation[0].config.region=us-2' \
  --set 'initContainers[0].name=velero-plugin-for-aws' \
  --set 'initContainers[0].image=velero/velero-plugin-for-aws:v1.7.0' \
  --set 'initContainers[0].volumeMounts[0].mountPath=/target' \
  --set 'initContainers[0].volumeMounts[0].name=plugins'

In the configuration.backupStorageLocation[0].bucket parameter, specify the bucket name, which you can find in the Hostman control panel.

Run the installation command. If there are no errors, a message will confirm that Velero has been deployed in the cluster.
To monitor its status, use:

kubectl get deployment/velero -n velero

The Deployment is running when the READY and UP-TO-DATE columns show the expected number of replicas.

You can also check the status of the Velero pod:

kubectl get pods -n velero

If the pod is running, you can optionally check its logs (where velero-7bb8d5c5f-jwg5c is the Velero pod name):

kubectl logs velero-7bb8d5c5f-jwg5c -n velero

The Velero installation is now complete.

Backup Using Velero

To test the backup process, we will create a new namespace and several Kubernetes objects within it.

Create a namespace named test-velero:

kubectl create ns test-velero

Create a manifest file that defines a Deployment with two replicas of the NGINX web server and a LoadBalancer service:

nano nginx-dev.yaml

Add the following configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-dev
  namespace: test-velero
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.17.6
        name: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx-test-service
  namespace: test-velero
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

Apply the file and create the resources:

kubectl apply -f nginx-dev.yaml

Verify the status of the created resources:

kubectl get all -n test-velero

Creating a Backup

To create a backup for all resources in the test-velero namespace, run the following command:

velero backup create nginx-test-backup --include-namespaces test-velero

If the backup request was accepted, you will see the following message:

Backup request "nginx-test-backup" submitted successfully.
Run `velero backup describe nginx-test-backup` or `velero backup logs nginx-test-backup` for more details.

You can check the status with the describe command:

velero backup describe nginx-test-backup

If successful, the status will be Completed.
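Because the backup request is processed asynchronously, a script usually has to wait for the final status before moving on. The sketch below polls the Backup object that Velero creates in the velero namespace. The function names get_backup_phase and wait_for_backup are hypothetical helpers, and the default of 30 attempts with a 10-second delay is an arbitrary assumption.

```shell
#!/usr/bin/env sh
# Read the current phase of a Velero Backup object.
# Assumes Velero is installed in the "velero" namespace, as in this guide.
get_backup_phase() {
  kubectl get backups.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}

# Poll until the backup reaches a final phase or the attempts run out.
# Arguments: backup name, max attempts (default 30), delay in seconds (default 10).
# Returns 0 on Completed, 1 on a failed phase, 2 on timeout.
wait_for_backup() {
  local name="$1" attempts="${2:-30}" delay="${3:-10}" phase="" i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(get_backup_phase "$name")"
    case "$phase" in
      Completed)
        echo "backup $name: Completed"
        return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "backup $name: $phase" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "backup $name: timed out, last phase: ${phase:-unknown}" >&2
  return 2
}

# Example usage against a real cluster:
#   wait_for_backup nginx-test-backup && echo "safe to continue"
```

For simple cases, velero backup create also accepts a --wait flag that blocks until the operation finishes; the polling loop is mainly useful when the backup was started elsewhere, for example by a schedule.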
Listing Backups

To view all backups in the storage, run:

velero backup get

The output will display the status (STATUS), number of errors (ERRORS), warnings (WARNINGS), creation time (CREATED), and expiration time (EXPIRES) for each backup.

Restoring a Backup

To test the restoration process, first delete the previously created namespace and all objects within it:

kubectl delete namespace test-velero

Restore the backup by specifying its name (nginx-test-backup):

velero restore create --from-backup nginx-test-backup

Check the restoration status using the following command, providing the name of the restore (obtained from the velero restore create output):

velero restore describe nginx-test-backup-20250114155656

If successful, the status will be Completed.

Viewing Backup Files

To view backup files, navigate to the Objects tab in the S3 Storage section in your Hostman control panel.

Velero creates separate directories for:

Backups: containing backup data for the respective resources.

Restorations: containing details about restored objects.

Each directory contains the corresponding Kubernetes objects for backup and restoration purposes.

Useful Commands for Backup with Velero

Velero offers extensive backup functionality, allowing you to create backups for specific objects or configurations.
Below are some useful examples:

Scheduled Backup for Specific Namespaces

To automatically create backups for all objects in the default and my-namespace namespaces every day at 2:00 AM:

velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces default,my-namespace

Backup for Specific Resources

To create a backup only for objects of type deployment in the default namespace:

velero backup create my-backup2 --include-resources deployments --include-namespaces default

Full Cluster Backup

To back up the entire Kubernetes cluster, including cluster-scoped resources such as ClusterRole, ClusterRoleBinding, CustomResourceDefinition (CRD), PersistentVolume, and StorageClass:

velero backup create full-cluster-backup

Backup by Label Selector

To back up only objects with a specific label, for instance, those with the selector app=nginx:

velero backup create backup-with-label-nginx --selector "app=nginx"

Backup Excluding a Label Selector

To back up only objects that do not match a label selector, for instance everything except objects labeled app=nginx, use the inequality operator:

velero backup create backup-with-no-label-nginx --selector "app!=nginx"

Excluding a Specific Namespace

To exclude the kube-system namespace and all its objects from the backup:

velero backup create backup-exclude-kube-system --exclude-namespaces kube-system

Excluding Specific Resources

To exclude all secrets from the backup:

velero backup create backup-exclude-secrets --exclude-resources secrets

Before running production backups, validate node, pod, and volume health as described in Kubernetes Cluster Health Checks, which covers viewing detailed information about resources and cluster components, to make sure all resources are ready.

Conclusion

In this practical guide, we covered how to install Velero and how to use it to create Kubernetes backups and restore data. Velero's rich functionality makes backup-related tasks quick and straightforward, which makes it a valuable tool for maintaining data safety and cluster reliability.
04 February 2025 · 11 min to read
