Cluster autoscaling in Kubernetes allows for flexible resource management by automatically adding or removing nodes and pods based on the workload.
The maximum number of nodes in a group cannot exceed the maximum allowable number of worker nodes in the cluster. For example, if the cluster limit is set to 100 nodes and one of the groups is already using 30 nodes, the new group can have a maximum of 70 nodes.
To enable autoscaling during cluster creation, select the Autoscaling option in the Worker nodes configuration section.
The minimum number of nodes at the cluster creation stage is 1.

If the cluster has already been created, you can also enable autoscaling for an existing worker node group in its settings. In this case, the minimum number of nodes is 0.

The following parameters are used to manage the creation and deletion of nodes:
--scale-down-unneeded-time 5m0s
--max-scale-down-parallelism 1
--provisioning-request-max-backoff-time 2m0s
--max-inactivity 3m0s
--scale-down-delay-after-add 2m0s
--max-node-provision-time 10m0s
--scan-interval 2m0s
--scale-up-from-zero=true
--scale-down-unready-enabled=true
--scale-down-unready-time=30m0s
| Parameter | Description |
|---|---|
| --scale-down-unneeded-time | Time a node must remain idle before being removed (here, 5 minutes). |
| --max-scale-down-parallelism | Maximum number of nodes that can be removed simultaneously (here, 1). |
| --provisioning-request-max-backoff-time | Maximum backoff time between retries while waiting for a node to be provisioned (here, 2 minutes). |
| --max-inactivity | Maximum inactivity time before scale-down can begin (here, 3 minutes). |
| --scale-down-delay-after-add | Delay after adding a node before scale-down can begin (here, 2 minutes). |
| --max-node-provision-time | Maximum time allowed for creating a new node (here, 10 minutes). |
| --scan-interval | How often the cluster is re-evaluated for scaling up or down (here, 2 minutes). |
| --scale-up-from-zero | Whether a node group can be scaled up from zero nodes (here, enabled). |
| --scale-down-unready-enabled | Whether unready nodes may be removed (here, enabled). |
| --scale-down-unready-time | Time an unready node must be unneeded before it becomes eligible for removal (here, 30 minutes). |
These parameters determine how quickly nodes are added or removed in response to the current workload. For instance, if the workload decreases and a node remains idle for more than 5 minutes, it is removed. However, node removal happens gradually to avoid a sudden drop in capacity. When the workload increases, new nodes are added; this process is also controlled to prevent long delays, with node provisioning limited to a maximum of 10 minutes.
In this way, autoscaling aims to maintain a balance between resource availability and cost-efficiency.
Autoscaling works only if requests and limits are specified in the deployment. Below is an example of a deployment file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example-container
          image: example-image
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "400m"
              memory: "512Mi"
CPU resources are specified in millicores (m), where 1000m equals one core. Memory is specified in mebibytes (Mi) or gibibytes (Gi).
If a pod manifest uses nodeSelector or nodeAffinity, Kubernetes schedules the pod only on nodes that match the specified conditions. This also affects autoscaling: the autoscaler can create a new node only in a node group where the pod can be scheduled.
If no suitable node group exists, the pod will remain in the Pending state, even if autoscaling is enabled in the cluster.
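When a pod stays in Pending, the scheduler's events usually explain why. A quick way to check might look like this (the pod name is a placeholder):

```shell
# List pods that the scheduler has not been able to place yet
kubectl get pods --field-selector=status.phase=Pending

# Inspect scheduling events for a specific pod; look for messages such as
# "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector"
kubectl describe pod <pod-name>
```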
nodeSelector is suitable for simple cases where pods must run only on nodes with a specific label. For example, if an application should run only in a particular node group, you can use the following manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      nodeSelector:
        workload-type: background
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "sleep infinity"]
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "200m"
              memory: "256Mi"
In this example, the pod can be scheduled only on nodes with the label workload-type=background. If no suitable nodes exist in the cluster, the autoscaler will be able to add a new node only to a group that uses this label.
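For this to work, the nodes in the target group must actually carry the label. In managed clusters the label is usually configured in the node group settings; on a self-managed node it could be added manually (the node name is a placeholder):

```shell
# Attach the label that the nodeSelector expects
kubectl label nodes <node-name> workload-type=background

# Verify which nodes carry the label
kubectl get nodes -l workload-type=background
```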
nodeAffinity is used when more flexible scheduling rules are required. Unlike nodeSelector, it allows you to define expressions with different operators and combine multiple conditions.
Below is an example similar to the previous nodeSelector one (here targeting nodes labeled workload-type=api), but implemented using nodeAffinity:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values:
                      - api
      containers:
        - name: api
          image: nginx:latest
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "400m"
              memory: "512Mi"
In this case, the pod will be scheduled only on nodes with the label workload-type=api. This example uses the In operator, which allows specifying one or more acceptable values for a label.
nodeAffinity also supports other operators, for example:
- In: the label value must be in the specified list
- NotIn: the label value must not be in the specified list
- Exists: the label must be present on the node
- DoesNotExist: the label must not be present
- Gt: the label value must be greater than the specified value (for numeric values)
- Lt: the label value must be less than the specified value (for numeric values)

These operators allow defining more advanced scheduling rules compared to nodeSelector.
Use nodeSelector when simple matching by one or more fixed values is sufficient. For more complex placement requirements, use nodeAffinity.
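As an illustration of the other operators, the following hypothetical affinity block (the disk-type label and its values are assumptions for the example) schedules a pod only onto nodes that have a disk-type label with any value except hdd:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: disk-type      # the label must be present on the node
              operator: Exists
            - key: disk-type      # and its value must not be "hdd"
              operator: NotIn
              values:
                - hdd
```

Conditions inside a single matchExpressions list are combined with a logical AND, while multiple nodeSelectorTerms entries are combined with OR.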
The Horizontal Pod Autoscaler (HPA) is used to dynamically adjust the number of pods based on resource consumption, such as CPU or memory. HPA scales pods to maintain a specified level of resource usage and adapt to changing workload conditions. Below is an example of an HPA manifest:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 50
HPA parameters:
- apiVersion: the API version used to create the resource; in this case, autoscaling/v1.
- kind: the type of resource being created; here, HorizontalPodAutoscaler.
- metadata: metadata for the resource, including the name of the HPA (example-hpa).
- scaleTargetRef: the object to scale, with the following details:
  - apiVersion: the API version of the resource being scaled.
  - kind: the type of resource to scale; in this case, Deployment.
  - name: the name of the resource to scale (example-deployment).
- minReplicas: the minimum number of pod replicas to maintain; here, 1.
- maxReplicas: the maximum number of pod replicas to scale up to; in this case, 5.
- targetCPUUtilizationPercentage: the CPU usage percentage at which the number of replicas is adjusted. In this example, it is set to 50, meaning replicas are added or removed to maintain an average CPU utilization of 50%.
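An equivalent HPA can also be created imperatively with kubectl; a sketch, assuming the example-deployment from above already exists:

```shell
# Create an HPA for example-deployment: 1-5 replicas, 50% CPU target.
# The resulting HPA is named after the deployment.
kubectl autoscale deployment example-deployment --cpu-percent=50 --min=1 --max=5

# Inspect the HPA and its current utilization
kubectl get hpa example-deployment
```

Note that the HPA relies on cluster metrics (for example, from metrics-server) to read CPU usage; without a metrics source, the current utilization shows as unknown.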
To set up autoscaling, let’s create an Nginx deployment with defined resource limits:
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
This deployment creates an Nginx pod with specified resource limits, essential for the proper functioning of HPA.
To demonstrate pod autoscaling with Nginx, use the HorizontalPodAutoscaler (HPA) object. HPA automatically adjusts the number of pod replicas based on resource utilization.
hpa.yaml:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 20
In this example, HPA adjusts the number of replicas for nginx-deployment to keep average CPU utilization at around 20%.
To distribute load evenly among pods, it is recommended to use a Service object of the LoadBalancer type.
service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
This service exposes the pods externally and distributes incoming traffic evenly across the pod replicas.
To apply the YAML files, use the following commands:
kubectl apply -f deployment.yaml
kubectl apply -f hpa.yaml
kubectl apply -f service.yaml
These commands will deploy the configurations for the Nginx deployment, HPA, and load balancer, creating all the necessary components for autoscaling.
Before generating load, ensure all resources are created and in the Running state. To observe autoscaling in action, generate load on the service using the hey tool, which sends HTTP requests to test performance.
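A quick pre-flight check might look like this:

```shell
# The Deployment, HPA, and Service should all exist
kubectl get deployment nginx-deployment
kubectl get hpa nginx-hpa
kubectl get service nginx-service   # EXTERNAL-IP is the load balancer address

# All nginx pods should be in the Running state
kubectl get pods -l app=nginx
```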
Generate Load:
hey -z 10m -c 20 http://<load_balancer_ip>
This command generates 10 minutes of load with 20 parallel connections to the Nginx service, increasing resource usage and triggering scaling of pod replicas and, if necessary, worker nodes.
After some time, you can observe the scaling:
Check node scaling in the cluster:
kubectl get nodes
Check pod scaling, which will increase up to 4 replicas as defined in hpa.yaml:
kubectl get pods
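To watch the scaling happen in near real time, you can keep an eye on the HPA and the pods while the load runs:

```shell
# Watch replica count and CPU utilization change as load is applied
kubectl get hpa nginx-hpa --watch

# Watch pods being added; once the load stops, replicas scale back
# down after the autoscaler's stabilization delay
kubectl get pods -l app=nginx --watch
```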