What is Kubernetes Horizontal Pod Autoscaler (HPA)?

HPA is a Kubernetes component that automatically updates workload resources such as Deployments and StatefulSets, scaling them to match demand for applications in the cluster. Horizontal scaling means deploying more pods in response to increased load. It should not be confused with vertical scaling, which means assigning more resources (such as memory and CPU) to the pods that are already running.

When load decreases and the number of pods exceeds the configured minimum, HPA notifies the workload resource, for example the Deployment object, to scale down.

Kubernetes HPA Use Cases

HPA is widely used in scenarios where dynamic scaling is required. Here are some common use cases:

  1. Microservices architectures: For microservices deployed in a Kubernetes cluster, HPA can independently scale each service based on its specific load and performance metrics, ensuring optimal resource utilization and performance.
  2. Web applications with variable traffic: For applications like eCommerce sites that experience traffic spikes during sales events, HPA can automatically adjust the number of pods to handle the load, ensuring high availability and performance.
  3. Batch processing jobs: In scenarios where jobs are queued and need to be processed in parallel, HPA can scale the number of worker pods to process the jobs faster based on the queue length or other custom metrics.
  4. CI/CD pipelines: Continuous Integration/Continuous Deployment pipelines often have varying workloads. HPA can scale the necessary pods for build and test processes, ensuring faster execution times without manual intervention.
  5. IoT and data streaming applications: For applications processing data streams from IoT devices or other real-time data sources, HPA can adjust the processing capacity based on the incoming data rate, ensuring that data is processed without delays.

How Does Kubernetes HPA Calculate the Replica Count?

HorizontalPodAutoscaler calculates the number of replicas required for a deployment based on the metrics in the HPA configuration. Here is a step-by-step explanation of the calculation process:

  1. Metrics collection: Gathers real-time metrics from the Kubernetes metrics server. These metrics can be CPU utilization, memory usage, or custom metrics provided by external monitoring systems.
  2. Desired replicas calculation: Uses a formula to calculate the desired number of replicas. For example, when scaling based on CPU utilization, the formula is:

Desired Replicas = ceil(Current Replicas × Current CPU Utilization / Target CPU Utilization)

If the current CPU utilization is higher than the target, more replicas will be needed to balance the load. If it’s lower, fewer replicas will suffice.
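For illustration, assume a hypothetical deployment running 4 replicas with a target CPU utilization of 60% and a current average utilization of 90%:

Desired Replicas = ceil(4 × 90 / 60) = 6

HPA would scale the deployment from 4 to 6 replicas. The result is rounded up so that average utilization drops to or below the target after scaling.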

  3. Smoothing and stabilization: To avoid rapid scaling up and down (thrashing), HPA employs a stabilization window, meaning that changes in the number of replicas are smoothed over a certain period. This window is configurable to ensure that scaling actions are stable and do not lead to performance issues.
  4. Scaling limits: The HPA configuration specifies the minimum and maximum number of replicas. Even if the calculated desired replicas fall outside these bounds, HPA will not scale beyond the specified limits.
  5. Cooldown periods: These prevent scaling actions from occurring too frequently. This helps stabilize application performance and avoid unnecessary resource allocation.

Kubernetes HPA Example

Horizontal pod autoscaling has been a feature of Kubernetes since version 1.1, meaning that it is a highly mature and stable API. However, the API objects used to manage HPA have changed over time. The V2 API (autoscaling/v2) added support for custom metrics, as well as external metrics from sources outside the cluster. This lets you scale workloads based on metrics like HTTP request throughput or the length of a message queue.

You can define scaling characteristics for your workloads in the HorizontalPodAutoscaler YAML configuration. You can create a configuration for each workload or group of workloads. Here is an example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

A few important points about this configuration:

  • It targets the my-app Deployment object and scales it between 2 and 10 replicas, depending on load.
  • Load is measured by CPU utilization. HPA will add or remove pods until the average pod in the deployment utilizes 70% of its requested CPU. If the average utilization is higher than 70%, HPA adds pods; if it is lower, it scales down.

You can use other metrics, such as memory utilization, instead of CPU. You can also define several metrics to represent application load, and the HPA algorithm adjusts the number of pods to satisfy the most demanding metric across all pods.
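For example, a single HPA could target both CPU and memory by listing two Resource metrics in the same manifest; the 80% memory target below is an arbitrary illustrative value:

  # under spec: in the HPA manifest
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

With multiple metrics, HPA computes a desired replica count for each metric and uses the highest of them.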

Related content: Read our guide to Kubernetes Cluster Autoscaler

Drawbacks of HPA

Although HPA is a powerful tool, it is not suitable for all use cases and does not solve all cluster resource problems. Here are a few examples of use cases that are less suitable for HPA:

  • HPA cannot be used together with Vertical Pod Autoscaler (VPA) on the same metrics. However, you can combine them by using custom metrics for HPA.
  • HPA is only suitable for stateless applications that support parallel execution, or StatefulSets that provide persistence for stateful applications.
  • The HPA algorithm does not take IOPS, network bandwidth, and storage into account when scaling, so in some cases it could affect application performance or lead to crashes.
  • HPA can detect under-utilization at the pod level, but not at the container level. If individual containers running within a pod have unused requested resources, HPA cannot detect this, and you will need third-party tooling to identify this type of wasted resources.

Kubernetes Autoscaling: Horizontal Pod Autoscaler vs Kubernetes Vertical Pod Autoscaler

The primary difference between HPA and VPA is the scaling method: HPA scales by adding or removing pods, while VPA scales by allocating additional CPU and memory resources to existing pod containers, or reducing the resource available to them. Another difference is that VPA only supports CPU and memory as scaling metrics, while HPA supports additional custom metrics.

HPA and VPA are not necessarily two competing options – they are often used together. This can help you achieve a good balance, optimally distributing workloads across available nodes, while fully utilizing the computing resources of each node. 

However, you should be aware that HPA and VPA can conflict with each other – for example, if they both use memory as the scaling metric, they can try to scale workloads vertically and horizontally at the same time, which can have unpredictable consequences. To avoid such conflict, make sure that each mechanism uses different metrics. Typically, you will set VPA to scale based on CPU or memory, and use custom metrics for HPA.

Best Practices to Optimize Kubernetes Horizontal Pod Autoscaler

Here are some of the ways you can ensure the most effective use of HPA in Kubernetes.

Set Appropriate Metrics

Start with CPU and memory utilization, as they are built-in and widely supported. To tailor HPA to your application’s needs, consider adding custom metrics that reflect your application’s workload more accurately. For example, web applications might benefit from metrics like HTTP request rates, response times, or the number of active sessions. 

For data processing applications, metrics such as the length of the message queue or processing time per task might be more appropriate. Custom metrics can be integrated using tools like Prometheus, together with an adapter that exposes them to HPA through the Kubernetes custom metrics API. Ensure that your monitoring system is reliable and that the metrics are collected and reported with low latency. 
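For illustration, here is a minimal sketch of the metrics section of an HPA manifest that scales on a per-pod custom metric. It assumes an adapter (such as the Prometheus Adapter) already serves a metric named http_requests_per_second through the custom metrics API; the metric name and target value are hypothetical:

  # under spec: in the HPA manifest
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

With this target, HPA adds replicas whenever the average request rate per pod exceeds 100 requests per second.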

Configure Min and Max Replicas

Configuring the minimum and maximum number of replicas ensures that your application maintains a balance between performance and resource usage. The minimum number of replicas should be set to handle the baseline traffic and to provide fault tolerance. For example, if your application should always be highly available, set a higher minimum number of replicas.

The maximum number of replicas should be based on the maximum expected load and the capacity of your Kubernetes cluster. If you set the maximum too low, the application might not scale enough to handle peak traffic, leading to performance degradation. Setting it too high can exhaust cluster resources. Analyze historical traffic patterns and resource utilization to decide.

Leverage Scaling Delays

Scaling delays prevent HPA from reacting too quickly to short-term fluctuations, a phenomenon known as thrashing. Thrashing can lead to unnecessary scaling actions, increased resource consumption, and instability. Older Kubernetes versions exposed the --horizontal-pod-autoscaler-upscale-delay and --horizontal-pod-autoscaler-downscale-delay flags for this purpose; current versions instead use the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization and the behavior field in the HorizontalPodAutoscaler spec to control the delay before applying scaling actions.

The upscale delay should be set to allow the system to confirm that increased load is sustained before adding new pods. The downscale delay should be long enough to ensure that a drop in metrics is not temporary, avoiding premature pod termination.
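Under the autoscaling/v2 API, these delays can be set per workload through the behavior field. A minimal sketch with illustrative values:

spec:
  behavior:
    scaleUp:
      # only scale up after high load has been observed for 60 seconds
      stabilizationWindowSeconds: 60
    scaleDown:
      # consider the last 5 minutes before removing pods
      stabilizationWindowSeconds: 300

With these values, a transient spike shorter than 60 seconds will not trigger a scale-up, and scale-downs take the previous five minutes into account, avoiding premature pod termination.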

Monitor and Adjust Thresholds Regularly

Monitoring requires continuous effort to ensure your HPA configuration remains effective. Use tools like Prometheus and Grafana to visualize performance metrics and identify trends. Regularly review and analyze these metrics to understand how your application behaves under different loads.

Adjust the target utilization thresholds based on this analysis. For example, if the current target CPU utilization is set at 70%, but you notice that performance issues arise when utilization exceeds 60%, lower the threshold to 60%. If you find that your application is underutilized, raising the threshold might reduce unnecessary scaling actions.
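In the HPA manifest, adjusting the threshold is a one-line change to the metric target. For example, lowering the CPU target from the earlier example to 60% (an illustrative value):

  # under spec: in the HPA manifest
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60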

Use Predictive Scaling

Predictive scaling involves anticipating future load based on historical data and known events. Tools like the Kubernetes Event-Driven Autoscaler (KEDA) allow you to scale applications based on event data from various sources, including message queues, databases, or custom metrics.

For example, if you know that your web application experiences a traffic spike every day at noon, predictive scaling can proactively increase the number of pods just before the spike occurs. This helps maintain performance and avoid the latency associated with reactive scaling.
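As a sketch of schedule-based scaling, KEDA's cron scaler can raise the replica count shortly before a known daily spike. The resource names, times, and replica counts below are illustrative:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaledobject
spec:
  scaleTargetRef:
    name: my-app
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
  - type: cron
    metadata:
      timezone: UTC
      # keep extra capacity in place from 11:45 until 14:00 UTC
      start: 45 11 * * *
      end: 0 14 * * *
      desiredReplicas: "8"

This keeps at least 8 replicas running across the noon spike, so the capacity is already in place when traffic arrives. Note that KEDA creates and manages its own HPA object for the target workload, so a ScaledObject should replace, not coexist with, a manually defined HPA for the same Deployment.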

Optimize Resource Requests and Limits

Resource requests specify the minimum resources required for a pod, influencing how Kubernetes schedules the pod. Limits define the maximum resources a pod can use, preventing it from consuming more than its fair share and potentially impacting other pods.

To set accurate resource requests and limits, profile your application to understand its resource usage under different loads. Use this data to configure requests that reflect typical usage and limits that accommodate peak usage without causing resource contention. This ensures that HPA scales your application based on realistic resource needs.
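As an illustrative sketch, a container spec might declare requests sized for typical load and limits sized for peaks; the values below are hypothetical and should come from profiling:

    # in the Deployment pod template
    containers:
    - name: my-app
      image: my-app:1.0
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi

Because Utilization targets in HPA are calculated as a percentage of these requests, inaccurate requests skew every scaling decision.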

Integrate with Cluster Autoscaler

The Cluster Autoscaler adjusts the number of nodes in your cluster based on resource requirements, adding nodes when pods cannot be scheduled due to resource constraints and removing nodes when they are underutilized. To integrate effectively, configure the Cluster Autoscaler to work alongside your HPA. 

Ensure that the autoscaler settings, such as the minimum and maximum number of nodes, align with your cluster’s capacity and workload demands. This integration prevents situations where the HPA cannot scale pods due to insufficient node resources, maintaining application performance and availability.

Fine-Tune Readiness and Liveness Probes for Smooth Scaling

Readiness probes and liveness probes help ensure that only healthy pods receive traffic and that unresponsive pods are restarted. Properly configured probes help maintain application stability during scaling operations.

Readiness probes determine if a pod is ready to handle requests. They should be configured to reflect the actual conditions required for your application to serve traffic, such as successful connections to dependent services or initialization of necessary resources. This ensures that new pods are not added to the load balancer until they are fully ready.

Liveness probes detect if a pod is still running and responsive. They should be configured to catch conditions where the pod is stuck or unresponsive, triggering a restart to restore functionality. Fine-tuning these probes helps avoid disruptions and ensures smooth scaling transitions.
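A minimal sketch of both probes, assuming the application exposes hypothetical /ready and /healthz HTTP endpoints on port 8080:

    # in the Deployment pod template
    containers:
    - name: my-app
      image: my-app:1.0
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20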

Troubleshooting Kubernetes HPA

Insufficient Time to Scale

A common challenge with HPA is that it takes time to scale up a workload by adding another pod. Loads can sometimes change sharply, and during the time it takes to scale up, the existing pod can reach 100% utilization, resulting in service degradation and failures.

For example, consider a pod that runs at just under 80% CPU utilization while serving 800 requests per second, with HPA configured to scale up when the 80% CPU threshold is reached. Let’s say it takes 10 seconds for the new pod to start up. 

If the load then grows by 100 requests per second every second, the pod will reach 100% utilization within 2 seconds, while it takes 8 more seconds before the second pod starts receiving requests. 

Possible solutions

  • Reducing the scaling threshold to keep a safety margin, so that each pod has some spare capacity to deal with sudden traffic spikes. Keep in mind that this has a cost, which is multiplied by the number of pods running your application.
  • Always keeping one extra pod in reserve to account for sudden traffic spikes.

Brief Spikes in Load

When a workload experiences brief spikes in CPU utilization (or any other scaling metrics), you might expect that HPA will immediately spin up an additional pod. However, if the spikes are short enough, this will not happen. 

To understand why, consider that: 

  • When an event like high CPU utilization happens, HPA does not directly receive the event from the pod. 
  • HPA polls for metrics every few seconds from the Kubernetes Metrics Server (unless you have integrated a custom component).
  • The Kubernetes Metrics Server, in turn, periodically collects and aggregates metrics from the pods.
  • The --metric-resolution flag specifies the time window that is evaluated, typically 30 seconds. 

For example, assume HPA is set to scale when CPU utilization exceeds 80%. If CPU utilization suddenly spikes to 90%, but this occurs for only 2 seconds out of a 30 second metric resolution window, and in the rest of the 30-second period utilization is 20%, the average utilization is: 

(2 × 90% + 28 × 20%) / 30 ≈ 25%

When HPA polls for the CPU utilization metric, it will observe a value of roughly 25%, which is not even close to the scaling threshold of 80%. This means HPA will not scale – even though in reality, the workload experienced high load.

Possible solutions

  • Increase metric resolution—you can set the --metric-resolution flag to a lower number. However, this might cause unwanted scaling events because HPA will become much more sensitive to changes in load.
  • Use burstable QoS on the pod—if you set the limits parameter significantly higher than the requests parameter (for example, 3-4 times higher), in the example above more resources will be allocated to the pod, if available. This can preclude the need to scale horizontally using HPA. This solution does not guarantee scaling, and also risks that the pod will be evicted from the node due to resource pressure.
  • Combine HPA with VPA—if you expect resources to be available on the node to provide more resources in case of brief spikes in load, you can use VPA in combination with HPA. Make sure to configure VPA on a separate metric from HPA, and one that immediately responds to increased loads.

Related content: Read our guide to Readiness Probes

Slow Pod Startup

Scaling only helps once new pods are actually ready to receive traffic. Large container images, long initialization procedures, and overly strict readiness checks can significantly delay the point at which a new pod starts serving requests, leaving the existing pods overloaded in the meantime.

Possible solutions

  • Keep container images small
  • Keep the initialization procedures short
  • Identify Kubernetes readiness checks and ensure they are not overly strict

Excessive Scaling

In some cases, HPA might scale an application so much that it could consume almost all the resources in the cluster. You could set up HPA in combination with Cluster Autoscaler to automatically add more nodes to the cluster. However, this might sometimes get out of hand. 

Consider these scenarios:

  • A denial of service (DoS) attack in which an application is flooded with fake traffic. 
  • An application is experiencing high loads, but is not mission critical and the organization cannot invest in additional resources and scale up the cluster.
  • The application is using excessive resources due to a misconfiguration or design issue, which should be resolved instead of automatically scaling it on demand.

In these, and many similar scenarios, it is better not to scale the application beyond a certain limit. However, HPA does not know this and will continue to scale the application even when this does not make business sense.

Possible solutions

The best solution is to limit the number of replicas that can be created by HPA. You can define this in the spec.maxReplicas field of the HPA configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  minReplicas: 2
  maxReplicas: 10

In this configuration, maxReplicas is set to 10. Calculate the maximum expected load of your application and set a realistic maximum replica count, with some buffer for unexpected traffic peaks.

Horizontal Pod Autoscaler Troubleshooting with Komodor

Kubernetes troubleshooting relies on the ability to quickly contextualize the problem with what’s happening in the rest of the cluster. More often than not, you will be conducting your investigation during fires in production. The major challenge is correlating service-level incidents with other events happening in the underlying infrastructure.

Komodor can help with its ‘Node Status’ view, built to pinpoint correlations between service or deployment issues and changes in the underlying node infrastructure. With this view you can rapidly:

  • See service-to-node associations
  • Correlate service and node health issues
  • Gain visibility over node capacity allocations, restrictions, and limitations
  • Identify “noisy neighbors” that use up cluster resources
  • Keep track of changes in managed clusters
  • Get fast access to historical node-level event data

Beyond node error remediations, Komodor can help troubleshoot a variety of Kubernetes errors and issues. As the leading Continuous Kubernetes Reliability Platform, Komodor is designed to democratize K8s expertise across the organization and enable engineering teams to leverage its full value.

Komodor’s platform empowers developers to confidently monitor and troubleshoot their workloads while allowing cluster operators to enforce standardization and optimize performance. Specifically when working in a hybrid environment, Komodor reduces the complexity by providing a unified view of all your services and clusters.

By leveraging Komodor, companies of all sizes significantly improve reliability, productivity, and velocity. Or, to put it simply – Komodor helps you spend less time and resources on managing Kubernetes, and more time on innovating at scale.

If you are interested in checking out Komodor, use this link to sign up for a Free Trial
