Beyond Karpenter: The True Limits of Node Autoscaling

Kubernetes is built for elasticity. Containers scale in seconds, deployments roll out continuously, and the scheduler places workloads wherever capacity exists. But all of that agility would quickly hit a ceiling if the underlying nodes were static.

Manual provisioning can’t respond to weekend traffic spikes. Overallocating capacity to buy headroom burns budget. Underallocating risks incidents. For any team running production Kubernetes, node autoscaling is what unlocks the platform’s elasticity.

The question isn’t whether to automate node scaling, but how well you understand its constraints and what it’s quietly leaving on the table.

Cluster Autoscaler: The Bridge Built on Pre-Kubernetes Primitives

Cluster Autoscaler has been the default answer to node autoscaling since 2016, and works across AWS, GCP, and Azure.

The core mechanic is straightforward: by default, it checks for unschedulable pods roughly every 10 seconds. When it finds one, it simulates adding a node from each of your predefined node groups, identifies which groups could host the pod, and instructs the cloud provider API to increase the desired capacity of the winning group. The cloud provider spins up a new VM, the VM joins the cluster, and the scheduler takes over.
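The scale-up loop described above can be sketched in a few lines. This is a simplified illustration, not Cluster Autoscaler’s actual code: the NodeGroup and Pod types, the group names, and the "smallest node wins" tie-break are all assumptions made for the example (the real tool delegates that choice to a configurable expander).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    cpu: float         # allocatable CPU cores on one node of this group
    memory_gib: float  # allocatable memory on one node of this group
    size: int          # current desired capacity
    max_size: int      # upper bound configured for the group


@dataclass(frozen=True)
class Pod:
    name: str
    cpu: float
    memory_gib: float


def groups_that_fit(pod, groups):
    """Simulate one fresh node per group; keep the groups where the pod fits."""
    return [
        g for g in groups
        if g.size < g.max_size and pod.cpu <= g.cpu and pod.memory_gib <= g.memory_gib
    ]


def scale_up(pod, groups):
    """Return (group name, new desired capacity), or None if no group can host the pod."""
    candidates = groups_that_fit(pod, groups)
    if not candidates:
        return None
    # Assumption: prefer the smallest node shape among the candidates.
    winner = min(candidates, key=lambda g: (g.cpu, g.memory_gib))
    return winner.name, winner.size + 1


# Hypothetical node groups: "small" is already at its maximum size.
GROUPS = [
    NodeGroup("small", cpu=2, memory_gib=4, size=3, max_size=3),
    NodeGroup("general", cpu=8, memory_gib=16, size=2, max_size=10),
    NodeGroup("highmem", cpu=16, memory_gib=128, size=0, max_size=5),
]

decision = scale_up(Pod("api", cpu=1, memory_gib=2), GROUPS)
print(decision)
```

Note that the pod ends up on an 8-CPU group only because the 2-CPU group is capped, which is exactly the kind of outcome predefined groups produce.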

On the scale-down side, the logic is deliberately conservative. With default settings, a node is only eligible for removal if all of the following hold:

  1. Its utilization (the sum of pod resource requests relative to the node’s allocatable capacity) is below 50%
  2. Every pod on it can relocate to another node
  3. No pod has a PDB or annotation preventing eviction
  4. The first three conditions have held continuously for at least 10 minutes

The priority is stability. Cost is secondary.
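As a rough sketch, the scale-down conditions above amount to a single boolean check per node. The types and field names here are hypothetical, only CPU requests are considered, and the thresholds are the defaults quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass
class PodInfo:
    name: str
    cpu_request: float
    evictable: bool = True     # False if a PDB or annotation blocks eviction
    can_relocate: bool = True  # True if some other node could host the pod


@dataclass
class NodeInfo:
    name: str
    allocatable_cpu: float
    pods: list = field(default_factory=list)
    minutes_unneeded: float = 0.0  # how long the node has looked removable


UTILIZATION_THRESHOLD = 0.5  # the 50% threshold from the text
UNNEEDED_MINUTES = 10        # the 10-minute window from the text


def utilization(node):
    """Requested CPU as a fraction of allocatable CPU."""
    return sum(p.cpu_request for p in node.pods) / node.allocatable_cpu


def removable(node):
    """All four conditions must hold before the node is a scale-down candidate."""
    return (
        utilization(node) < UTILIZATION_THRESHOLD
        and all(p.can_relocate for p in node.pods)
        and all(p.evictable for p in node.pods)
        and node.minutes_unneeded >= UNNEEDED_MINUTES
    )


idle = NodeInfo("idle", 8, [PodInfo("web", 2)], minutes_unneeded=12)
pinned = NodeInfo("pinned", 8, [PodInfo("db", 2, evictable=False)], minutes_unneeded=30)
recent = NodeInfo("recent", 8, [PodInfo("web", 2)], minutes_unneeded=3)
busy = NodeInfo("busy", 8, [PodInfo("api", 5)], minutes_unneeded=30)

for node in (idle, pinned, recent, busy):
    print(node.name, removable(node))
```

Only the first node passes. One blocked pod, a short dip, or utilization a little above the threshold is enough to keep a node running.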

Cluster Autoscaler’s entire architecture sits on top of cloud VM group APIs (AWS ASGs, Azure VMSS), a model that dates back to AWS Auto Scaling in 2009 and was designed to scale fleets of identical machines behind load balancers. Kubernetes didn’t exist yet. Cluster Autoscaler repurposes those primitives for pod scheduling. Every limitation downstream traces back to that mismatch.

Where Cluster Autoscaler Runs Out of Road

Operational overhead

Cluster Autoscaler can only scale groups you defined in advance. To cover real workload diversity, you need a separate group for every combination of architecture (x86 vs. ARM), capacity type (Spot vs. On-Demand), and node size. With just three node sizes, that’s already 12 groups before you factor in availability zones, and each one is a resource your team owns permanently. Most teams cope by collapsing into a small number of broad, general-purpose groups, which sets up the next problem.
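To make the combinatorics concrete, here is a small sketch that enumerates the groups. The three node sizes and three zones are hypothetical choices for illustration; real clusters often have more of both.

```python
from itertools import product

architectures = ["amd64", "arm64"]
capacity_types = ["spot", "on-demand"]
node_sizes = ["large", "xlarge", "2xlarge"]  # hypothetical size tiers
zones = ["zone-a", "zone-b", "zone-c"]       # hypothetical availability zones

# One node group per combination of architecture, capacity type, and size.
base_groups = ["-".join(combo) for combo in product(architectures, capacity_types, node_sizes)]

# One group per zone on top of that, if groups are split by availability zone.
zonal_groups = ["-".join(combo) for combo in product(base_groups, zones)]

print(len(base_groups), "groups before zones")
print(len(zonal_groups), "groups with one group per zone")
```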

Structural waste

When your only groups are large general-purpose nodes, small pods pay full price for capacity they don’t use. A pod requesting 1 CPU and 2 GiB of memory might land on a c5.2xlarge with 8 vCPUs and 16 GiB. That’s roughly 88% of the instance idle and fully billed. Cluster Autoscaler simply chose from the groups that could fit the pod. The waste is structural, not a misconfiguration.
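The arithmetic behind that figure, as a quick check. The sketch treats the full 8 vCPUs and 16 GiB as usable, which is an assumption: real allocatable capacity is slightly lower once system reservations are subtracted.

```python
def idle_fraction(requested, capacity):
    """Share of a node's capacity left unused by the requests placed on it."""
    return 1 - requested / capacity


# c5.2xlarge: 8 vCPUs, 16 GiB. Pod: 1 CPU, 2 GiB.
cpu_idle = idle_fraction(requested=1, capacity=8)
memory_idle = idle_fraction(requested=2, capacity=16)

print(f"CPU idle: {cpu_idle:.1%}, memory idle: {memory_idle:.1%}")
```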

Reactive scale-down

Consolidation is purely about removal, never replacement. A node has to drop below 50% utilization and stay there for 10 minutes before Cluster Autoscaler touches it. In the example in Figure 1 below, it won’t attempt to merge Node C (30% utilization, one unevictable pod) with Node D (60% utilization) onto a single machine. Node C is pinned by a single pod. Node D is above threshold. The system sees neither as a candidate. A human looking at the same cluster would consolidate immediately. Cluster Autoscaler moves on.
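A minimal sketch of the Node C and Node D case. It assumes both nodes are the same size and that Node D’s pods are all evictable; the function names are hypothetical.

```python
def scale_down_candidate(utilization, has_unevictable_pod, threshold=0.5):
    """Removal-only logic: the node must be under the threshold and fully evictable."""
    return utilization < threshold and not has_unevictable_pod


def could_absorb(target_utilization, source_utilization):
    """Would the source node's pods fit on the target? Assumes equal node sizes."""
    return target_utilization + source_utilization <= 1.0


node_c = {"utilization": 0.30, "has_unevictable_pod": True}
node_d = {"utilization": 0.60, "has_unevictable_pod": False}

c_candidate = scale_down_candidate(**node_c)  # blocked by the unevictable pod
d_candidate = scale_down_candidate(**node_d)  # blocked by the 50% threshold

# A human would move Node D's pods onto Node C and remove Node D.
merge_possible = could_absorb(node_c["utilization"], node_d["utilization"])

print(c_candidate, d_candidate, merge_possible)
```

Both nodes fail the removal test, yet their combined load fits on one machine. The removal-only model never asks the second question.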

Provisioning latency

Scale-up has to travel through the VM group API layer before it reaches the instance fleet API, which introduces overhead even on a clean path. Configure your groups with prioritized fallback (Spot first, On-Demand second), and a failed Spot attempt has to time out entirely before the next group is tried. Stack a few groups and pods can sit pending for several minutes. Those delays add up quickly, threatening SLAs for customer-facing services.
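A back-of-the-envelope model of sequential fallback shows how the minutes accumulate. The group names, the three-minute timeouts, and the one-minute boot time are hypothetical numbers chosen for illustration, not measured values.

```python
def pending_seconds(attempts, boot_seconds):
    """
    attempts: list of (group_name, succeeds, timeout_seconds) in priority order.
    Each failed attempt must time out in full before the next group is tried.
    Returns the total time the pod waits, or None if every group fails.
    """
    waited = 0
    for _name, succeeds, timeout_seconds in attempts:
        if succeeds:
            return waited + boot_seconds
        waited += timeout_seconds
    return None


# Two Spot groups without capacity, then an On-Demand group that works.
attempts = [
    ("spot-large", False, 180),
    ("spot-xlarge", False, 180),
    ("on-demand-large", True, 180),
]

total = pending_seconds(attempts, boot_seconds=60)
print(f"pod pending for {total / 60:.0f} minutes")
```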

Figure 1: How Cluster Autoscaler handles scale-down

Karpenter: Pod-First Provisioning

Karpenter rethought the question rather than tuning the existing answer. Where Cluster Autoscaler asks “which of my predefined groups can fit this pod,” Karpenter asks “what does this pod actually need?”

It reads the pod’s full scheduling requirements (CPU and memory requests, tolerations, node selectors and affinity including capacity type, topology spread) and provisions the lowest-cost instance from the entire cloud catalog that satisfies them. No predefined node groups. No VM group API in the middle. Karpenter calls the instance fleet API directly, which removes the group-by-group fallback and cuts the launch decision from minutes to seconds.
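The pod-first idea can be illustrated with a toy selection function. This is a sketch of the concept, not Karpenter’s implementation: the catalog is a tiny hand-written sample, the prices are illustrative rather than real quotes, and only CPU, memory, architecture, and capacity type are considered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gib: float
    arch: str
    capacity_type: str
    hourly_usd: float  # illustrative price, not a real quote


# Hypothetical slice of a cloud catalog.
CATALOG = [
    InstanceType("c5.large", 2, 4, "amd64", "on-demand", 0.085),
    InstanceType("c6g.large", 2, 4, "arm64", "on-demand", 0.068),
    InstanceType("m5.large", 2, 8, "amd64", "spot", 0.035),
    InstanceType("m5.xlarge", 4, 16, "amd64", "on-demand", 0.192),
    InstanceType("c5.2xlarge", 8, 16, "amd64", "on-demand", 0.340),
]


def cheapest_fit(cpu, memory_gib, catalog, arch=None, capacity_types=("spot", "on-demand")):
    """Return the lowest-priced instance type that satisfies the pod's requirements."""
    fits = [
        i for i in catalog
        if i.vcpu >= cpu
        and i.memory_gib >= memory_gib
        and (arch is None or i.arch == arch)
        and i.capacity_type in capacity_types
    ]
    return min(fits, key=lambda i: i.hourly_usd, default=None)


# The 1 CPU / 2 GiB pod from earlier, with and without constraints.
unconstrained = cheapest_fit(1, 2, CATALOG)
on_demand_x86 = cheapest_fit(1, 2, CATALOG, arch="amd64", capacity_types=("on-demand",))

print(unconstrained.name, on_demand_x86.name)
```

Each added constraint shrinks the candidate list, which is why broad requirements tend to produce cheaper nodes.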

Consolidation works differently too. Cluster Autoscaler removes underutilized nodes. Karpenter can replace them. An m5.2xlarge (8 vCPUs) whose pods request roughly 3 CPUs, under 40% of the node, isn’t just a removal candidate. Karpenter can launch an m5.xlarge, drain the 2xlarge, and let the evicted pods reschedule onto the smaller node. The result is a cheaper node and denser packing, dynamically maintained as the cluster evolves.
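The difference between removing and replacing can be sketched as a three-way decision. This is a simplified model under stated assumptions: only CPU is considered, the full vCPU count is treated as usable, and the catalog prices are illustrative.

```python
# Hypothetical catalog: name -> (vCPUs, illustrative hourly price in USD).
CATALOG = {
    "m5.large": (2, 0.096),
    "m5.xlarge": (4, 0.192),
    "m5.2xlarge": (8, 0.384),
}


def consolidation_plan(current_type, requested_cpu, spare_cpu_elsewhere, catalog):
    """
    Decide what to do with one node:
      ("delete", None)   if its pods fit into spare capacity on other nodes
      ("replace", name)  if a cheaper instance type can hold its pods
      ("keep", None)     otherwise
    """
    if requested_cpu <= spare_cpu_elsewhere:
        return ("delete", None)
    _, current_price = catalog[current_type]
    cheaper = [
        (price, name)
        for name, (vcpu, price) in catalog.items()
        if vcpu >= requested_cpu and price < current_price
    ]
    if cheaper:
        return ("replace", min(cheaper)[1])
    return ("keep", None)


# The example from the text: pods requesting about 3 CPUs on an m5.2xlarge.
plan = consolidation_plan("m5.2xlarge", requested_cpu=3, spare_cpu_elsewhere=1, catalog=CATALOG)
print(plan)
```

A removal-only autoscaler stops after the first branch. The replace branch is where the extra savings come from.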

Configuration is split cleanly across two objects. The NodePool declares intent: which instance families are allowed, disruption policies, resource limits. The NodeClass (EC2NodeClass on AWS) manages cloud mechanics: subnets, IAM roles, disk configuration. One NodeClass can back multiple NodePools, so you define cloud wiring once and reuse it.

A few settings that matter in practice:

Keep instance type requirements broad. Every constraint you add in NodePool requirements reduces Karpenter’s ability to bin-pack efficiently. Narrow requirements are the most common source of underperformance.

Choose your consolidation policy deliberately. WhenEmpty only removes nodes with no running workloads: safe, minimal disruption, lower savings. WhenEmptyOrUnderutilized (WhenUnderutilized in the older v1beta1 API) also repacks underused nodes onto fewer, cheaper alternatives: more savings, more pod movement. The right choice depends on your workload sensitivity.

Set consolidateAfter to match your traffic patterns. Zero seconds is maximally aggressive. For spiky workloads, a window of a few minutes prevents Karpenter from acting on transient dips that will recover on their own.
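Put together, the settings above might look like the following NodePool, built here as a Python dictionary and printed as JSON (which kubectl accepts alongside YAML). The sketch assumes the Karpenter v1 API on AWS; the pool name, the NodeClass name, the five-minute window, and the CPU limit are hypothetical values to adapt, not recommendations.

```python
import json

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general"},  # hypothetical pool name
    "spec": {
        "template": {
            "spec": {
                # Cloud wiring lives in a separate NodeClass object.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",  # hypothetical NodeClass name
                },
                # Broad requirements leave more room for bin-packing.
                "requirements": [
                    {"key": "kubernetes.io/arch", "operator": "In",
                     "values": ["amd64", "arm64"]},
                    {"key": "karpenter.sh/capacity-type", "operator": "In",
                     "values": ["spot", "on-demand"]},
                ],
            }
        },
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "5m",  # wait out short dips before repacking
        },
        "limits": {"cpu": "200"},  # hypothetical cap on total provisioned CPU
    },
}

manifest = json.dumps(node_pool, indent=2)
print(manifest)
```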

Where Karpenter Still Falls Short

Karpenter is a genuine leap forward. It also has structural limits that configuration alone can’t fix.

The cost-reliability tradeoff

Aggressive consolidation of underutilized nodes means frequent pod evictions. For stateless web workloads that tolerate restarts, this is manageable. For stateful workloads, long-running batch jobs, or CI/CD pipelines mid-execution, repeated evictions introduce real reliability risk. Tighter packing also creates noisy-neighbor conditions, where one pod that spikes steps on others sharing its node. The more aggressively you consolidate, the more exposure you carry. Teams end up softening their disruption settings to protect reliability, and those same settings are what keep capacity underutilized.

The upstream scheduling problem

Karpenter is downstream of the Kubernetes scheduler. The scheduler places pods greedily. It finds a node that fits right now and places the pod there. It has no awareness of future cluster states or consolidation candidates. By the time Karpenter sees the cluster, placement decisions have already been made. Unevictable pods end up scattered across nodes Karpenter wants to drain, and there’s nothing Karpenter can do about it. The two systems optimize independently, and they frequently work against each other.

The guardrail trap

Even when a node is underutilized, it can’t be drained if any pod on it can’t relocate. Anti-affinity rules, topology spread constraints, or a PDB that currently allows no disruptions: any of these pins the node for as long as the constraint holds. Teams add these guardrails for valid operational reasons. But they accumulate silently across a cluster until 30-40% of capacity can end up structurally stranded, trapped by constraints the autoscaler can’t work around.
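One way to reason about stranded capacity is to add up the free CPU on nodes that are both underutilized and pinned. The node data below is made up for illustration, the 50% cutoff is an assumption, and the definition of "stranded" is this sketch’s own rather than a standard metric.

```python
# Hypothetical cluster: (name, allocatable CPU, requested CPU, pinned by a constraint?)
NODES = [
    ("node-a", 8, 2, True),
    ("node-b", 8, 3, True),
    ("node-c", 8, 7, False),
    ("node-d", 8, 2, False),
    ("node-e", 8, 6, True),
]


def stranded_fraction(nodes, underutilized_below=0.5):
    """Free CPU on pinned, underutilized nodes as a share of total cluster CPU."""
    total_cpu = sum(cpu for _, cpu, _, _ in nodes)
    stranded_cpu = sum(
        cpu - requested
        for _, cpu, requested, pinned in nodes
        if pinned and requested / cpu < underutilized_below
    )
    return stranded_cpu / total_cpu


fraction = stranded_fraction(NODES)
print(f"{fraction:.1%} of cluster CPU is stranded")
```

node-d is underutilized but not pinned, so an autoscaler can still reclaim it. node-a and node-b are the ones no amount of tuning will free.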

What Closing the Gap Actually Requires

Moving from Cluster Autoscaler to Karpenter solves the provisioning layer. It doesn’t solve what happens upstream, or the structural waste that accumulates inside a running cluster over time.

The remaining gap has three dimensions:

Pod resource requests are often wrong. Overprovisioned pods inflate node requirements and create headroom waste. Underprovisioned pods cause throttling and OOMs. Accurate rightsizing, informed by actual usage patterns, spike behavior, and QoS class, is a prerequisite for any autoscaler to work optimally.
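As a minimal illustration of usage-informed rightsizing, the sketch below recommends a CPU request from a high percentile of observed usage plus headroom. The 90th percentile, the 20% headroom, and the sample data are assumptions chosen for the example; this is a generic heuristic, not a description of any particular product’s algorithm.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def recommend_cpu_request(usage_samples, pct=90, headroom=1.2):
    """Recommended request: a high percentile of observed usage plus headroom."""
    return round(percentile(usage_samples, pct) * headroom, 3)


# Hypothetical CPU usage samples (cores) for one container, including one spike.
usage = [0.20, 0.25, 0.30, 0.22, 0.28, 0.90, 0.24, 0.26, 0.21, 0.27]
current_request = 1.0

recommended = recommend_cpu_request(usage)
print(f"current: {current_request} cores, recommended: {recommended} cores")
```

How much weight to give the spike is a policy decision: a latency-sensitive Guaranteed pod may warrant a higher percentile than a Burstable batch job.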

The scheduler-autoscaler coordination problem is real. Komodor’s Predictive Placement addresses this by running continuous simulations of cluster drain states, classifying nodes by consolidation likelihood, and steering new pods away from drain candidates before bad placement happens. Unevictable pods get steered onto dedicated keeper nodes, concentrating blockers and freeing the rest of the cluster for consolidation. The scheduling layer stops working against the autoscaling layer.

Optimization blockers need to be surfaced and dealt with without affecting reliability. PDB-caused waste, excess nodes from anti-affinity misconfiguration, and nodes that fail to terminate and block consolidation don’t show up in standard autoscaler metrics. Capacity Intelligence identifies them, quantifies their cost impact, and surfaces remediation in plain language.

Together, these capabilities extend optimization from the provisioning layer into the cluster itself, reaching the 30-40% of capacity that Karpenter, however well-tuned, structurally can’t reclaim.


Want a deeper look at how Predictive Placement works? Check out The Two-Sided Scheduling Problem.