What is Kubernetes Horizontal Pod Autoscaler (HPA)?

HPA is a Kubernetes component that automatically updates workload resources such as Deployments and StatefulSets, scaling them to match demand for applications in the cluster. Horizontal scaling means deploying more pods in response to increased load. It should not be confused with vertical scaling, which means allocating more Kubernetes node resources (such as memory and CPU) to pods that are already running.

When load decreases and the number of pods exceeds the configured minimum, HPA notifies the workload resource, for example the Deployment object, to scale down.

How HPA Works: Kubernetes HPA Example

Horizontal pod autoscaling has been a feature of Kubernetes since version 1.1, meaning that it is a highly mature and stable API. However, the API objects used to manage HPA have changed over time: 

  • The original implementation could only scale pods based on the difference between the desired and observed CPU utilization metrics. These simple metrics were collected using the deprecated Heapster collector. 
  • In the v2 API (autoscaling/v2), HPA was upgraded to support custom metrics, as well as external metrics from sources outside the Kubernetes cluster. This lets you scale workloads based on metrics like HTTP request throughput or the size of a message queue.

You can define scaling characteristics for your workloads in the HorizontalPodAutoscaler YAML configuration. You can create a configuration for each workload or group of workloads. Here is an example:

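# autoscaling/v2 is the stable API (Kubernetes 1.23+); v2beta2 was removed in 1.26
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  # The workload to scale
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70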

A few important points about this configuration:

  • It targets the my-app Deployment object and scales it between 2 and 10 replicas, depending on load.
  • Load is measured by CPU utilization. HPA adds or removes pods so that average CPU usage across the Deployment's pods stays close to 70% of the CPU each pod requests (utilization is measured against the pod's resource requests, not the node's capacity). If average utilization is higher than 70%, HPA adds pods, and if it is lower, it removes pods.
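
The scaling decision can be written out with the formula given in the Kubernetes documentation. As a rough worked example, assume (hypothetically) that 4 replicas are running at an average of 90% utilization against the 70% target:

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{90}{70} \right\rceil
  = \lceil 5.14 \rceil
  = 6
\]
```

The result is then clamped to the range between minReplicas and maxReplicas. HPA also skips scaling when the ratio of current to desired value is within a tolerance of 1.0 (0.1 by default), which avoids reacting to small fluctuations.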

You can use other metrics, such as memory utilization, instead of CPU. You can also define several metrics to represent application load. In that case, HPA calculates a desired replica count for each metric and uses the largest one, so the most demanding metric determines the scale.
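
A multi-metric configuration might look like the following sketch. The memory target and the http_requests_per_second metric are illustrative assumptions. A Pods-type metric like this only works if a custom metrics adapter is installed and actually exposes a metric with that name.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # CPU, as in the example above
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory, measured against the pods' memory requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Hypothetical custom metric served by a custom metrics adapter
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```

With this configuration, HPA would compute three replica counts on each evaluation and apply the highest.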

Drawbacks of HPA

Although HPA is a powerful tool, it is not suitable for all use cases and does not solve all cluster resource problems. Here are a few examples of use cases that are less suitable for HPA:

  • HPA cannot be used together with Vertical Pod Autoscaler (VPA) on the same metrics (CPU or memory). However, you can combine them by using custom metrics for HPA.
  • HPA is only suitable for stateless applications that support parallel execution, or StatefulSets that provide persistence for stateful applications.
  • The HPA algorithm does not take IOPS, network bandwidth, and storage into account when scaling, so in some cases it could affect application performance or lead to crashes.
  • With standard resource metrics, HPA evaluates utilization averaged at the pod level, not for individual containers. If containers running within a pod have unused requested resources, HPA cannot detect this, and you will need additional tooling to identify this type of wasted resources.

Kubernetes Autoscaling: Horizontal Pod Autoscaler vs Kubernetes Vertical Pod Autoscaler

The primary difference between HPA and VPA is the scaling method: HPA scales by adding or removing pods, while VPA scales by allocating additional CPU and memory resources to existing pod containers, or reducing the resources available to them. Another difference is that VPA only supports CPU and memory as scaling metrics, while HPA supports additional custom metrics.

HPA and VPA are not necessarily two competing options – they are often used together. This can help you achieve a good balance, optimally distributing workloads across available nodes, while fully utilizing the computing resources of each node. 

However, you should be aware that HPA and VPA can conflict with each other – for example, if they both use memory as the scaling metric, they can try to scale workloads vertically and horizontally at the same time, which can have unpredictable consequences. To avoid such conflict, make sure that each mechanism uses different metrics. Typically, you will set VPA to scale based on CPU or memory, and use custom metrics for HPA.
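
One way to keep the two mechanisms on separate metrics is sketched below. It assumes the VPA add-on (which provides the autoscaling.k8s.io/v1 API) is installed in the cluster, and it reuses the hypothetical http_requests_per_second custom metric, which needs a custom metrics adapter to exist.

```yaml
# VPA adjusts CPU and memory requests for the my-app containers
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
---
# HPA scales replicas on a custom metric only, so it does not
# react to the same CPU/memory signals that VPA is changing
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "100"
```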

Troubleshooting Kubernetes HPA

Insufficient Time to Scale

A common challenge with HPA is that it takes time to scale up a workload by adding another pod. Loads can sometimes change sharply, and during the time it takes to scale up, the existing pod can reach 100% utilization, resulting in service degradation and failures.

For example, consider a pod that handles 800 requests per second at 80% CPU utilization, with HPA configured to scale up when the 80% CPU threshold is reached. Assume it takes 10 seconds for a new pod to start up.

If the load then grows by 100 requests per second every second, the pod reaches 100% utilization (about 1,000 requests per second) within 2 seconds, leaving 8 more seconds before the second pod starts receiving requests.

Possible solutions

  • Reducing the scaling threshold to keep a safety margin, so that each pod has some spare capacity to deal with sudden traffic spikes. Keep in mind that this has a cost, which is multiplied by the number of pods running your application.
  • Always keeping one extra pod in reserve to account for sudden traffic spikes.
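
Both options can be expressed in the HPA configuration. The sketch below lowers the CPU target and raises minReplicas by one compared with the earlier example. The specific values (60% and 3 replicas) are assumptions for illustration and should be derived from your own traffic patterns.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  # One pod above the baseline of 2, kept in reserve for spikes
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Lower threshold leaves more headroom per pod while new pods start
          averageUtilization: 60
```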

Brief Spikes in Load

When a workload experiences brief spikes in CPU utilization (or any other scaling metric), you might expect HPA to immediately spin up an additional pod. However, if the spikes are short enough, this will not happen.

To understand why, consider that: 

  • When an event like high CPU utilization happens, HPA does not directly receive the event from the pod. 
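  • HPA polls for metrics periodically (every 15 seconds by default, controlled by the --horizontal-pod-autoscaler-sync-period flag of kube-controller-manager) from the Kubernetes Metrics Server, unless you have integrated a custom component.
  • The Kubernetes Metrics Server collects aggregate metrics from pods.
  • The Metrics Server --metric-resolution flag sets how often those metrics are collected, and therefore the time window each sample is averaged over, for example 30 seconds.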
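
For orientation, the --metric-resolution flag is passed as a container argument to Metrics Server. The fragment below is a partial, patch-style sketch rather than a complete manifest. It assumes Metrics Server runs as a Deployment named metrics-server in the kube-system namespace, which is common but depends on how it was installed.

```yaml
# Partial manifest: only the fields relevant to the flag are shown
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: metrics-server
          args:
            # Collection interval used in the example that follows
            - --metric-resolution=30s
```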

For example, assume HPA is set to scale when CPU utilization exceeds 80%. If CPU utilization suddenly spikes to 90%, but this occurs for only 2 seconds out of a 30-second metric resolution window, and in the rest of the 30-second period utilization is 20%, the average utilization is:

(2 * 90% + 28 * 20%) / 30 ≈ 24.7%

When HPA polls for the CPU utilization metric, it will observe a metric of about 25%, which is not even close to the scaling threshold of 80%. This means HPA will not scale, even though in reality the workload experienced high load.

Possible solutions

  • Increase metric resolution: set the --metric-resolution flag to a shorter interval. However, this might cause unwanted scaling events because HPA will become much more sensitive to changes in load.
  • Use burstable QoS on the pod: if you set the limits parameter significantly higher than the requests parameter (for example, 3-4 times higher), then in the example above more resources will be allocated to the pod, if they are available on the node. This can remove the need to scale horizontally using HPA. This solution does not guarantee that the extra resources will be available, and it also risks the pod being evicted from the node under resource pressure.
  • Combine HPA with VPA: if you expect the node to have spare resources during brief spikes in load, you can use VPA in combination with HPA. Make sure to configure VPA on a separate metric from HPA, and one that immediately responds to increased loads.
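
The burstable QoS option could look like the following sketch. The image name and the resource values are hypothetical. The point is only that the CPU limit is about four times the CPU request, which places the pod in the Burstable QoS class.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0   # hypothetical image
          resources:
            requests:
              cpu: 250m        # what the scheduler reserves for the pod
              memory: 256Mi
            limits:
              cpu: "1"         # 4x the request, usable only if the node has spare CPU
              memory: 256Mi
```

Keep in mind that HPA measures utilization against requests, so a pod bursting to its full limit here would report roughly 400% CPU utilization if the burst lasts long enough to show up in the averaged metric.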

Scaling Delay Due to Application Readiness

It can often happen that HPA correctly issues a scaling request, but for various reasons, it takes time for the new container to be up and running. These reasons can include:

  • Image downloads: some images are large, and network conditions might result in a long download time.
  • Initialization procedures: some applications require a complex initialization or warmup, and cannot serve traffic while it is taking place.
  • Readiness checks: a pod's readiness probe might be configured with a long initialDelaySeconds, meaning that Kubernetes will not send traffic to the pod until the delay is over and the probe succeeds, even if in reality the container is ready for work earlier.

Possible solutions

  • Keep container images small
  • Keep the initialization procedures short
  • Identify Kubernetes readiness checks and ensure they are not overly strict
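
As one possible way to avoid a long fixed delay, the container fragment below replaces initialDelaySeconds with a startupProbe that polls frequently until the application is up, followed by a regular readinessProbe. The /healthz and /ready endpoints, the port, and the image are assumptions about the application, not something Kubernetes provides.

```yaml
# Fragment of a pod template's container list
containers:
  - name: my-app
    image: registry.example.com/my-app:1.0   # hypothetical image
    ports:
      - containerPort: 8080
    # Polls every 2s for up to 60s (30 attempts) while the app starts;
    # other probes begin only after this one succeeds
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 2
      failureThreshold: 30
    # Once started, the pod receives traffic as soon as this check passes
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
```

With this approach a fast-starting container becomes ready within a couple of seconds, while a slow start is still tolerated up to the startup probe's budget.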

Excessive Scaling

In some cases, HPA might scale an application so much that it could consume almost all the resources in the cluster. You could set up HPA in combination with Cluster Autoscaler to automatically add more nodes to the cluster. However, this might sometimes get out of hand. 

Consider these scenarios:

  • A denial of service (DoS) attack in which an application is flooded with fake traffic. 
  • An application is experiencing high loads, but is not mission critical and the organization cannot invest in additional resources and scale up the cluster.
  • The application is using excessive resources due to a misconfiguration or design issue, which should be resolved instead of automatically scaling it on demand.

In these, and many similar scenarios, it is better not to scale the application beyond a certain limit. However, HPA does not know this and will continue to scale the application even when this does not make business sense.

Possible solutions

The best solution is to limit the number of replicas that can be created by HPA. You can define this in the spec.maxReplicas field of the HPA configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  # scaleTargetRef and metrics omitted for brevity
  minReplicas: 2
  maxReplicas: 10

In this configuration, maxReplicas is set to 10. Calculate the maximum expected load of your application and set a realistic maximum scale, with some buffer for surprise peaks in traffic.
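
In addition to the hard cap, the autoscaling/v2 API has a behavior field that can slow down how quickly HPA approaches that cap. The sketch below is an optional complement to maxReplicas, and the chosen rate (at most 2 new pods per minute) is an illustrative assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10          # hard upper bound on replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      # Use the lowest recommendation seen over the last 60s before scaling up
      stabilizationWindowSeconds: 60
      policies:
        # Add at most 2 pods in any 60-second period
        - type: Pods
          value: 2
          periodSeconds: 60
```

A rate limit like this trades responsiveness for predictability, so it works against the earlier goal of reacting quickly to legitimate spikes. Whether it is appropriate depends on the workload.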

Horizontal Pod Autoscaler Troubleshooting with Komodor

Kubernetes troubleshooting relies on the ability to quickly contextualize the problem with what’s happening in the rest of the cluster. More often than not, you will be conducting your investigation during fires in production. The major challenge is correlating service-level incidents with other events happening in the underlying infrastructure.

When using Horizontal Pod Autoscaler, there can be a variety of issues related to existing nodes or new nodes added to the cluster. Komodor can help with our new ‘Node Status’ view, built to pinpoint correlations between service or deployment issues and changes in the underlying node infrastructure. With this view you can rapidly:

  • See service-to-node associations
  • Correlate service and node health issues
  • Gain visibility over node capacity allocations, restrictions, and limitations
  • Identify “noisy neighbors” that use up cluster resources
  • Keep track of changes in managed clusters
  • Get fast access to historical node-level event data

Beyond node error remediations, Komodor can help troubleshoot a variety of Kubernetes errors and issues, acting as a single source of truth (SSOT) for all of your K8s troubleshooting needs. Komodor provides:

  • Change intelligence: Every issue is a result of a change. Within seconds we can help you understand exactly who did what and when. 
  • In-depth visibility: A complete activity timeline, showing all code and config changes, deployments, alerts, code diffs, pod logs, and more, all within one pane of glass with easy drill-down options.
  • Insights into service dependencies: An easy way to understand cross-service changes and visualize their ripple effects across your entire system. 
  • Seamless notifications: Direct integration with your existing communication channels (e.g., Slack) so you’ll have all the information you need, when you need it.

If you are interested in checking out Komodor, use this link to sign up for a Free Trial.