In containerized environments, dynamic, on-demand scaling is crucial to ensure optimal resource usage, performance, and cost-efficiency. Kubernetes offers a dedicated API resource for this purpose: the Horizontal Pod Autoscaler (HPA).
In this article, we will delve into Kubernetes HPA, covering its fundamental concepts, architecture, and practical implementation. From the initial setup and testing to troubleshooting common issues, our goal is to guide you through the entire process. By the end of this article, you should have a comprehensive understanding of how to leverage HPA to auto-scale your clusters.
Kubernetes HPA is a controller that can auto-scale (both up and down) the number of pod replicas in a Deployment or StatefulSet in near real time, based on configured metrics. In simpler terms, it automates the horizontal scaling process by adding or removing pod replicas to match the desired performance metrics. This ensures that a cluster is able to cater to fluctuating workloads without over-provisioning or wasting resources.
Let’s suppose you have a web application running inside a Kubernetes cluster, and it experiences a sudden increase in traffic due to a promotion or an event. Without Kubernetes HPA, you'd need to manually increase the number of running pods to handle the higher load.
However, if you configure HPA, it will detect that your application pods’ CPU, memory, or other configured metrics are breaching their thresholds and will automatically add more pod replicas to distribute the load. Once the traffic decreases, HPA will scale down the number of pods to prevent resource wastage.
HPA offers several benefits for today’s complex and resource-intensive distributed infrastructures.
Kubernetes offers three primary autoscaling mechanisms: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler. Each approach serves a distinct purpose and addresses different scaling scenarios. Here’s a table summarizing their differences:
| | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| Purpose | Scales pods horizontally | Scales individual pods vertically | Scales nodes in a cluster |
| Mechanism | Adds or removes pod replicas based on predefined metrics | Adjusts the CPU and memory resource requests and limits for individual pods | Adds or removes nodes based on resource utilization across the cluster |
| Scaling unit | Pods | Pods (internally) | Nodes |
| Responsiveness | Near real time (checks every 15 seconds by default) | Less responsive than HPA (may restart pods to apply changes) | Slowest (provisioning a node can take minutes) |
| Suitable for | Applications with fluctuating workloads and moderate resource requirements | Applications with unique resource requirements or those sensitive to pod startup time | Applications with dynamic workloads and high resource requirements |
Kubernetes HPA uses a periodic control loop to check resource utilization and adjust the number of pods. By default, the control loop runs every 15 seconds, but you can set a custom interval by passing the “--horizontal-pod-autoscaler-sync-period” flag to the kube-controller-manager daemon.
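For example, on a cluster where you control the control plane, the flag could be set as follows. This is only a sketch of the relevant flag; a real kube-controller-manager invocation carries many other flags, and on managed Kubernetes services this setting is typically not adjustable at all:

```shell
# Sketch: run the control loop every 30 seconds instead of the 15-second default
kube-controller-manager --horizontal-pod-autoscaler-sync-period=30s
```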
During each iteration, the HPA controller gathers resource utilization metrics for the target resources defined in the HPA definitions, either from the resource metrics API for per-pod resource metrics or from the custom metrics API for other metrics.
For per-pod resource metrics (e.g., CPU), if a target utilization value is defined, the controller calculates the percentage of utilization based on the containers' resource requests in each pod. When a target raw value is set, it directly uses the raw metric values. Next, the controller computes the mean of either the utilization or the raw values across all targeted Pods, and the ratio of that mean to the target value determines the desired number of replicas.
For per-pod custom metrics, the controller follows a process similar to per-pod resource metrics but uses raw values instead of utilization values. For object metrics and external metrics, a single metric describing the object is fetched and compared to the target value. The generated ratio is then used to adjust the number of replicas.
As per the official Kubernetes docs, the HPA uses the following algorithm:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
For example, consider a scenario where the current metric value is 300m and the desired value is 150m. In this case, the number of replicas adjusts to twice the current count as 300 / 150 equals 2. Similarly, if the current metric value were 75m, the replicas would be halved, as 75 / 150 equals 0.5.
Before performing any scaling actions, the control plane analyzes missing metrics and the readiness status of Pods. Pods with a deletion timestamp set or failed Pods are not considered in the calculations. Pods that don’t have metrics are ignored for the time being but may be examined later.
It’s important to note that the control plane skips scaling when the ratio is close to 1.0, within a configurable tolerance that defaults to 0.1 (i.e., ratios between roughly 0.9 and 1.1 result in no change).
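The loop described above can be sketched as a simplified Python model. The function and parameter names are illustrative, not from the Kubernetes source, and the real controller additionally handles pod readiness, missing metrics, and stabilization windows:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Simplified model of the HPA formula:
    desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
    """
    ratio = current_metric / target_metric
    # Within the tolerance band around 1.0, the controller skips scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 300m observed vs. a 150m target doubles the replica count.
print(desired_replicas(2, 300, 150))   # 4
# 75m observed vs. a 150m target halves it.
print(desired_replicas(4, 75, 150))    # 2
# 155m observed vs. a 150m target: ratio is ~1.03, inside the 0.1 tolerance.
print(desired_replicas(4, 155, 150))   # 4
```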
Custom metrics allow you to go beyond the built-in scaling metrics that Kubernetes offers. Let’s look at some reasons why they are important:
We will explore how to set up custom metrics in a later section.
To set up HPA in a cluster, we will have to define a YAML file that outlines the desired HPA configurations. Consider the following example:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```
In the above file, the “scaleTargetRef” field specifies the workload the HPA acts on, here a Deployment named “my-deployment”. The “minReplicas” and “maxReplicas” fields specify the minimum and maximum number of replicas. The “metrics” section defines the metric used for autoscaling, which is “cpu” in our case.
The “type” field under “target” signifies that the target is expressed as a utilization percentage, and the “averageUtilization” parameter sets the target value: here, 50% of the pods’ requested CPU.
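Because utilization is computed relative to each container’s resource requests, the target Deployment’s pod template must declare them; otherwise HPA cannot compute a percentage and reports missing metrics. A minimal sketch of such a Deployment (the image, labels, and request values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: web
        image: nginx:1.25    # illustrative image
        resources:
          requests:
            cpu: 200m        # a 50% utilization target means ~100m average usage per pod
            memory: 256Mi
```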
Once you have created the HPA configuration file, you can apply it to your cluster using this command:
```shell
kubectl apply -f hpa.yaml
```
To verify that the HPA has been created successfully, run this command:
```shell
kubectl get hpa
```
This should produce an output equivalent to the following:
```
NAME         REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   40%/50%   2         5         2          10m
```
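For quick experiments, a roughly equivalent HPA can also be created imperatively, without writing a manifest:

```shell
kubectl autoscale deployment my-deployment --cpu-percent=50 --min=2 --max=5
```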
Now that you have applied HPA to your cluster, the next step is to see how it reacts to fluctuating traffic. You can test this by intentionally increasing the load on your cluster. For example, with a web application, you can use a tool like Apache Bench (ab) to send a high volume of HTTP requests to your application.
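For instance, assuming your application is reachable over HTTP, a burst of traffic can be generated like this. The URL and request counts below are placeholders; substitute the address where your service is actually exposed:

```shell
# 100,000 requests, 100 concurrent, against a placeholder application URL
ab -n 100000 -c 100 http://my-app.example.com/
```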
As you increase the load, run the following command in a separate terminal window to monitor how HPA reacts in real time:
```shell
kubectl get hpa sample-hpa --watch
```
After a few seconds, you should be able to notice increased CPU usage. For example:
```
NAME         REFERENCE                  TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   200%/50%   2         5         2          12m
```
A few moments later, you should notice the number of replicas increasing to cater to the increased demand. For example:
```
NAME         REFERENCE                  TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   200%/50%   2         5         5          15m
```
After you notice HPA increase the number of pod replicas, stop generating the load. After a few seconds, you should be able to observe the CPU usage go down as well as the number of replicas. For example:
```
NAME         REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
sample-hpa   Deployment/my-deployment   0%/50%    2         5         2          25m
```
These steps should allow you to verify that your HPA is performing as expected.
Note that the actual CPU usage and the increase in the number of replicas can vary based on different factors, such as the complexity of your application and the available resources on your system.
For most practical use cases, it’s important to use multiple metrics for HPA. This is due to the diverse ways in which applications can react to increased workloads. For example, a memory-intensive app may load significant data for processing without raising the CPU usage excessively. Another app may consume many CPU cycles without using extensive memory. Consider the following HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```
In the above configuration, we set the target memory utilization to 70% and the target CPU utilization to 50%. When multiple metrics are specified, the HPA controller computes a desired replica count for each metric separately and scales to the largest of them.
Custom metrics are a great way to implement granular, personalized scaling workflows. However, it’s important to note that custom metrics require an advanced, tailored monitoring setup in which the required custom metrics are aggregated and made available to HPA for consumption.
Kubernetes supports two main types of custom metrics: pod metrics and object metrics. Pod metrics are averaged across all of the pods in a deployment or StatefulSet and then compared with a target value to determine the desired replica count.
Object metrics are custom metrics that describe the performance of objects other than pods in the same namespace, such as a Kubernetes Ingress or Service. They can measure any aspect of the object's performance, be it the number of requests it receives or its response latency.
Let’s look at a sample HPA configuration that uses both pod and object custom metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: network_latency
      target:
        type: AverageValue
        averageValue: 20
  - type: Object
    object:
      metric:
        name: responses_per_second
      describedObject:
        kind: Service
        name: service-name
      target:
        type: Value
        value: 4k
```
In the above example, we define a pod metric named “network_latency” and an object metric named “responses_per_second”.
Kubernetes HPA offers unparalleled flexibility in autoscaling clusters. However, this degree of flexibility can occasionally result in misconfigurations and issues in scaling.
Here are some common HPA issues and errors:
FailedGetResourceMetric
This error means that the HPA controller is unable to fetch the required resource metrics from the server. The root cause is often a problem with the metrics server itself or with the HPA configuration.
Unable to get metrics for resource cpu
This error indicates that the HPA controller is unable to fetch CPU metrics for the target pods. The root cause for this problem can be an issue in the metrics server, incorrect permissions, or a misconfigured metric.
Desired replica count is outside of the specified range
This error indicates that the HPA controller has calculated a desired replica count that doesn’t fall in the range defined by the minimum and maximum number of replicas. This typically occurs due to an HPA misconfiguration or a sudden spike in resource usage.
Pods not scaling up or down
Sometimes you may notice that your cluster is not scaling up or down at all. An issue like this typically happens because of a misconfigured HPA controller or a problem with the pods themselves.
For the above issues, or any other HPA issues in general, you can use the following troubleshooting tips to identify and resolve the root cause:
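Whatever the symptom, a few standard commands usually help narrow down the root cause. The resource names below are from the earlier examples, and the metrics-server deployment name may vary by installation:

```shell
# Inspect the HPA's events, conditions, and current vs. target metrics
kubectl describe hpa sample-hpa

# Confirm the metrics pipeline is serving data
kubectl top pods
kubectl get apiservice v1beta1.metrics.k8s.io

# Check the metrics server itself for errors
kubectl logs -n kube-system deployment/metrics-server
```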
In addition to the aforementioned troubleshooting tips, adhere to the following best practices to avoid common HPA issues and pitfalls:
Kubernetes HPA is a valuable tool to scale clusters up and down seamlessly based on built-in and custom metrics. It can improve performance, agility, availability, and resilience. In this article, we discussed how HPA works, how to configure and test it, how to troubleshoot common issues, and some best practices to follow while using it.