In containerized environments, dynamic, on-demand scaling is crucial to ensure optimal resource usage, performance, and cost-efficiency. Kubernetes offers a dedicated API resource for this purpose: the Horizontal Pod Autoscaler (HPA).
In this article, we will delve into Kubernetes HPA, covering its fundamental concepts, architecture, and practical implementation. From the initial setup and testing to troubleshooting common issues, our goal is to guide you through the entire process. By the end of this article, you should have a comprehensive understanding of how to leverage HPA to auto-scale your clusters.
Kubernetes HPA is a controller that can auto-scale (both up and down) the number of pod replicas in a Deployment or StatefulSet in near real time, based on configured metrics. In simpler terms, it automates the horizontal scaling process by adding or removing pod replicas to match the desired performance metrics. This ensures that a cluster is able to cater to fluctuating workloads without over-provisioning or wasting resources.
Let’s suppose you have a web application running inside a Kubernetes cluster, and it experiences a sudden increase in traffic due to a promotion or an event. Without Kubernetes HPA, you'd need to manually increase the number of running pods to handle the higher load.
However, if you configure HPA, it will detect that your application pods’ CPU, memory, or other configured metrics are breaching their thresholds and will automatically add more pod replicas to distribute the load. Once the traffic decreases, HPA will scale down the number of pods to prevent resource wastage.
HPA offers several benefits for today’s complex and resource-intensive distributed infrastructures.
Kubernetes offers three primary autoscaling mechanisms: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler. Each approach serves a distinct purpose and addresses different scaling scenarios. Here’s a table summarizing their differences:
| | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| Purpose | Scales pods horizontally | Scales individual pods vertically | Scales nodes in a cluster |
| Mechanism | Adds or removes pod replicas based on predefined metrics | Adjusts the CPU and memory resource requests and limits for individual pods | Adds or removes nodes based on resource utilization across the cluster |
| Scaling unit | Pods | Pods (internally) | Nodes |
| Responsiveness | Near real time (checks every 15 seconds by default) | Less responsive than HPA (may restart pods to apply changes) | Slowest (provisioning a node can take minutes) |
| Suitable for | Applications with fluctuating workloads and moderate resource requirements | Applications with unique resource requirements or those sensitive to pod startup time | Applications with dynamic workloads and high resource requirements |
Kubernetes HPA uses a periodic control loop to check resource utilization and adjust the number of pods. By default, the control loop runs every 15 seconds, but you can set a custom interval by passing the “--horizontal-pod-autoscaler-sync-period” flag to the kube-controller-manager daemon.
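For example, on a cluster where you control the control plane, the flag could be set as follows. This is only a sketch of the relevant flag; a real kube-controller-manager invocation carries many other flags, and on managed Kubernetes services this setting is typically not adjustable at all:

```shell
# Sketch: run the control loop every 30 seconds instead of the 15-second default
kube-controller-manager --horizontal-pod-autoscaler-sync-period=30s
```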
During each iteration, the HPA controller gathers resource utilization metrics for the target resources defined in the HPA definitions, either from the resource metrics API for per-pod resource metrics or from the custom metrics API for other metrics.
For per-pod resource metrics (e.g., CPU), if a target utilization value is defined, the controller calculates the percentage of utilization based on the containers' resource requests in each pod. When a target raw value is set, it directly uses the raw metric values. Next, the controller computes the mean of either the utilization or the raw values across all targeted Pods, and the ratio of that mean to the target value determines the desired number of replicas.
For per-pod custom metrics, the controller follows a process similar to per-pod resource metrics but uses raw values instead of utilization values. For object metrics and external metrics, a single metric describing the object is fetched and compared to the target value. The generated ratio is then used to adjust the number of replicas.
As per the official Kubernetes docs, the HPA uses the following algorithm:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
For example, consider a scenario where the current metric value is 300m and the desired value is 150m. In this case, the number of replicas adjusts to twice the current count as 300 / 150 equals 2. Similarly, if the current metric value were 75m, the replicas would be halved, as 75 / 150 equals 0.5.
Before performing any scaling actions, the control plane analyzes missing metrics and the readiness status of Pods. Pods with a deletion timestamp set or failed Pods are not considered in the calculations. Pods that don’t have metrics are ignored for the time being but may be examined later.
It’s important to note that the control plane skips scaling when the ratio is close to 1.0, within a configurable tolerance that defaults to 0.1 (i.e., ratios between roughly 0.9 and 1.1 result in no change).
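The loop described above can be sketched as a simplified Python model. The function and parameter names are illustrative, not from the Kubernetes source, and the real controller additionally handles pod readiness, missing metrics, and stabilization windows:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Simplified model of the HPA formula:
    desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
    """
    ratio = current_metric / target_metric
    # Within the tolerance band around 1.0, the controller skips scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 300m observed vs. a 150m target doubles the replica count.
print(desired_replicas(2, 300, 150))   # 4
# 75m observed vs. a 150m target halves it.
print(desired_replicas(4, 75, 150))    # 2
# 155m observed vs. a 150m target: ratio is ~1.03, inside the 0.1 tolerance.
print(desired_replicas(4, 155, 150))   # 4
```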
Custom metrics allow you to go beyond the built-in scaling metrics that Kubernetes offers. Let’s look at some reasons why they are important:
We will explore how to set up custom metrics in a later section.
To set up HPA in a cluster, we will have to define a YAML file that outlines the desired HPA configurations. Consider the following example:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```
In the above file, the “scaleTargetRef” field specifies the workload the HPA acts on, here a Deployment named “my-deployment”. The “minReplicas” and “maxReplicas” fields specify the minimum and maximum number of replicas. The “metrics” section defines the metric used for autoscaling, which is “cpu” in our case.
The “type” field under “target” signifies that the target is expressed as a utilization percentage, and the “averageUtilization” parameter sets the target value: here, 50% of the pods’ requested CPU.
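Because utilization is computed relative to each container’s resource requests, the target Deployment’s pod template must declare them; otherwise HPA cannot compute a percentage and reports missing metrics. A minimal sketch of such a Deployment (the image, labels, and request values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: web
        image: nginx:1.25    # illustrative image
        resources:
          requests:
            cpu: 200m        # a 50% utilization target means ~100m average usage per pod
            memory: 256Mi
```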
Once you have created the HPA configuration file, you can apply it to your cluster using this command:
```shell
kubectl apply -f hpa.yaml
```
To verify that the HPA has been created successfully, run this command:
```shell
kubectl get hpa
```
This should produce an output equivalent to the following:
```
NAME         REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   40%/50%   2         5         2          10m
```
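For quick experiments, a roughly equivalent HPA can also be created imperatively, without writing a manifest:

```shell
kubectl autoscale deployment my-deployment --cpu-percent=50 --min=2 --max=5
```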
Now that you have applied HPA to your cluster, the next step is to see how it reacts to fluctuating traffic. You can test this by intentionally increasing the load on your cluster. For example, with a web application, you can use a tool like Apache Bench (ab) to send a high volume of HTTP requests to your application.
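For instance, assuming your application is reachable over HTTP, a burst of traffic can be generated like this. The URL and request counts below are placeholders; substitute the address where your service is actually exposed:

```shell
# 100,000 requests, 100 concurrent, against a placeholder application URL
ab -n 100000 -c 100 http://my-app.example.com/
```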
As you increase the load, run the following command in a separate terminal window to monitor how HPA reacts in real time:
```shell
kubectl get hpa sample-hpa --watch
```
After a few seconds, you should be able to notice increased CPU usage. For example:
```
NAME         REFERENCE                  TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   200%/50%   2         5         2          12m
```
A few moments later, you should notice the number of replicas increasing to cater to the increased demand. For example:
```
NAME         REFERENCE                  TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   200%/50%   2         5         5          15m
```
After you notice HPA increase the number of pod replicas, stop generating the load. After a few seconds, you should be able to observe the CPU usage go down as well as the number of replicas. For example:
```
NAME         REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   0%/50%    2         5         2          25m
```
These steps should allow you to verify that your HPA is performing as expected.
Note that the actual CPU usage and the increase in the number of replicas can vary based on different factors, such as the complexity of your application and the available resources on your system.
For most practical use cases, it’s important to use multiple metrics for HPA. This is due to the diverse ways in which applications can react to increased workloads. For example, a memory-intensive app may load significant data for processing without raising the CPU usage excessively. Another app may consume many CPU cycles without using extensive memory. Consider the following HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```
In the above configuration, we set the target memory utilization to 70% and the target CPU utilization to 50%. When multiple metrics are specified, the HPA controller computes a desired replica count for each metric separately and scales to the largest of them.
Custom metrics are a great way to implement granular, personalized scaling workflows. However, it’s important to note that custom metrics require an advanced, tailored monitoring setup in which the required custom metrics are aggregated and made available to HPA for consumption.
Kubernetes supports two main types of custom metrics: pod metrics and object metrics. Pod metrics are averaged across all of the pods in a deployment or StatefulSet and then compared with a target value to determine the desired replica count.
Object metrics are custom metrics that describe the performance of objects other than pods in the same namespace, such as a Kubernetes Ingress or Service. They can measure any aspect of the object's performance, be it the number of requests it receives or its response latency.
Let’s look at a sample HPA configuration that uses both pod and object custom metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: network_latency
      target:
        type: AverageValue
        averageValue: 20
  - type: Object
    object:
      metric:
        name: responses_per_second
      describedObject:
        kind: Service
        name: service-name
      target:
        type: Value
        value: 4k
```
In the above example, we define a pod metric named “network_latency” and an object metric named “responses_per_second”.
Kubernetes HPA offers unparalleled flexibility in autoscaling clusters. However, this degree of flexibility can occasionally result in misconfigurations and issues in scaling.
Here are some common HPA issues and errors:
FailedGetResourceMetric
This error means that the HPA controller is unable to fetch the required resource metrics from the server. The root cause is often a problem with the metrics server itself or with the HPA configuration.
Unable to get metrics for resource cpu
This error indicates that the HPA controller is unable to fetch CPU metrics for the target pods. The root cause for this problem can be an issue in the metrics server, incorrect permissions, or a misconfigured metric.
Desired replica count is outside of the specified range
This error indicates that the HPA controller has calculated a desired replica count that doesn’t fall in the range defined by the minimum and maximum number of replicas. This typically occurs due to an HPA misconfiguration or a sudden spike in resource usage.
Pods not scaling up or down
Sometimes you may notice that your cluster is not scaling up or down at all. An issue like this typically happens because of a misconfigured HPA controller or a problem with the pods themselves.
For the above issues, or any other HPA issues in general, you can use the following troubleshooting tips to identify and resolve the root cause:
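Whatever the symptom, a few standard commands usually help narrow down the root cause. The resource names below are from the earlier examples, and the metrics-server deployment name may vary by installation:

```shell
# Inspect the HPA's events, conditions, and current vs. target metrics
kubectl describe hpa sample-hpa

# Confirm the metrics pipeline is serving data
kubectl top pods
kubectl get apiservice v1beta1.metrics.k8s.io

# Check the metrics server itself for errors
kubectl logs -n kube-system deployment/metrics-server
```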
In addition to the aforementioned troubleshooting tips, adhere to the following best practices to avoid common HPA issues and pitfalls:
Kubernetes HPA is a valuable tool to scale clusters up and down seamlessly based on built-in and custom metrics. It can improve performance, agility, availability, and resilience. In this article, we discussed how HPA works, how to configure and test it, how to troubleshoot common issues, and some best practices to follow while using it.