Kubernetes is built for elasticity. Containers scale in seconds, deployments roll out continuously, and the scheduler places workloads wherever capacity exists. But all of that agility would quickly hit a ceiling if the underlying nodes were static.
Manual provisioning can’t respond to weekend traffic spikes. Overallocating capacity to buy headroom burns budget. Underallocating risks incidents. For any team running production Kubernetes, automated node autoscaling is the key to unlocking that elasticity.
The question isn’t whether to automate node scaling, but how well you understand its constraints and what it’s quietly leaving on the table.
Cluster Autoscaler has been the default answer to node autoscaling since 2016, and works across AWS, GCP, and Azure.
The core mechanic is straightforward: it polls for pending pods roughly every 10 seconds. When it finds one, it simulates adding a node from each of your predefined node groups, identifies which groups could host the pod, and instructs the cloud provider API to increase the desired capacity of the winning group. The cloud spins up a new VM, it joins the cluster, and the scheduler takes over.
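The scale-up loop described above can be sketched in a few lines. This is a simplified illustration, not Cluster Autoscaler's actual code: the `NodeGroup` and `Pod` types, the group names, and the "smallest group that fits" rule are assumptions made for the example. The real autoscaler chooses among candidate groups with a configurable expander strategy and runs a full scheduler simulation rather than a two-resource check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeGroup:
    name: str
    cpu: float          # CPUs available on one node of this group
    memory_gib: float   # memory available on one node of this group
    current_size: int
    max_size: int


@dataclass(frozen=True)
class Pod:
    name: str
    cpu: float
    memory_gib: float


def fits(pod: Pod, group: NodeGroup) -> bool:
    """Simulate placing the pod on a fresh, empty node from this group."""
    return (
        group.current_size < group.max_size
        and pod.cpu <= group.cpu
        and pod.memory_gib <= group.memory_gib
    )


def pick_group(pod: Pod, groups: list[NodeGroup]) -> Optional[NodeGroup]:
    """Return the smallest group that could host the pod, or None."""
    candidates = [g for g in groups if fits(pod, g)]
    return min(candidates, key=lambda g: (g.cpu, g.memory_gib), default=None)


# Hypothetical node groups for illustration only.
groups = [
    NodeGroup("general-large", cpu=8, memory_gib=16, current_size=3, max_size=10),
    NodeGroup("general-xlarge", cpu=16, memory_gib=32, current_size=1, max_size=5),
]

chosen = pick_group(Pod("api", cpu=1, memory_gib=2), groups)
print(chosen.name if chosen else "no group fits")
```

Once a group is chosen, the only action available is to raise that group's desired capacity by one. The instance type was fixed when the group was defined.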
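On the scale-down side, the logic is deliberately conservative. A node is only eligible for removal if all of the following conditions are met: its utilization is below the threshold (50% by default), it has stayed there for the full waiting period (10 minutes by default), and every pod on it can be evicted and rescheduled elsewhere.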
The priority is stability. Cost is secondary.
Cluster Autoscaler’s entire architecture sits on top of cloud VM group APIs (AWS ASGs, Azure VMSS) that were designed in 2009 to scale fleets of identical machines behind load balancers. Kubernetes didn’t exist yet. Cluster Autoscaler repurposes those primitives for pod scheduling. Every limitation downstream traces back to that mismatch.
Cluster Autoscaler can only scale groups you defined in advance. To cover real workload diversity, you need a separate group for every combination of architecture (x86 vs. ARM), capacity type (Spot vs. On-Demand), and node size. That’s numerous groups before you factor in availability zones, and each one is a resource your team owns permanently. Most teams cope by collapsing into a small number of broad, general-purpose groups, which sets up the next problem.
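To get a feel for how quickly the group count grows, here is a small sketch that enumerates the combinations. The three node sizes and three zones are assumptions picked for illustration. Real clusters often have more of both.

```python
from itertools import product

architectures = ["x86", "arm"]
capacity_types = ["spot", "on-demand"]
node_sizes = ["large", "xlarge", "2xlarge"]   # hypothetical size set
zones = ["zone-a", "zone-b", "zone-c"]        # hypothetical zones

# One node group per combination of architecture, capacity type, and size.
groups = list(product(architectures, capacity_types, node_sizes))

# The same set again, split per availability zone.
groups_per_zone = list(product(architectures, capacity_types, node_sizes, zones))

print(len(groups), len(groups_per_zone))
```

Every entry in those lists is a group someone has to create, tune, and keep in sync with the others.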
When your only groups are large general-purpose nodes, small pods pay full price for capacity they don’t use. A pod requesting 1 CPU and 2GB of memory might land on a c5.2xlarge with 8 CPUs. That’s roughly 88% of the instance idle and fully billed. Cluster Autoscaler simply chose the smallest group that could fit the pod. The waste is structural, not a misconfiguration.
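The arithmetic behind that figure is simple enough to check. The sketch below assumes the pod is alone on the node and ignores capacity reserved for system daemons. A c5.2xlarge has 8 vCPUs and 16 GiB of memory.

```python
def idle_fraction(requested: float, capacity: float) -> float:
    """Share of a resource that is paid for but not requested by any pod."""
    return 1 - requested / capacity


# Pod from the example: 1 CPU and 2 GB of memory on a c5.2xlarge.
cpu_idle = idle_fraction(requested=1, capacity=8)
memory_idle = idle_fraction(requested=2, capacity=16)

print(f"CPU idle: {cpu_idle:.1%}, memory idle: {memory_idle:.1%}")
```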
Consolidation is purely about removal, never replacement. A node has to drop below 50% utilization and stay there for 10 minutes before Cluster Autoscaler touches it. In the example in Figure 1 below, it won’t attempt to merge Node C (30% utilization, one unevictable pod) with Node D (60% utilization) onto a single machine. Node C is pinned by a single pod. Node D is above threshold. The system sees neither as a candidate. A human looking at the same cluster would consolidate immediately. Cluster Autoscaler moves on.
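The removal rule can be expressed as a per-node predicate, which is exactly why the merge never happens. The sketch below is a simplification under stated assumptions: `NodeState` is a hypothetical type, the thresholds are the defaults quoted above, and the real autoscaler additionally verifies that the evicted pods fit on other nodes.

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.50   # node must be below 50% utilization
UNNEEDED_SECONDS = 10 * 60     # and stay there for 10 minutes


@dataclass(frozen=True)
class NodeState:
    name: str
    utilization: float             # requested / allocatable, 0.0 to 1.0
    seconds_below_threshold: int
    unevictable_pods: int


def removable(node: NodeState) -> bool:
    """Each node is judged in isolation; no cross-node repacking is considered."""
    return (
        node.utilization < UTILIZATION_THRESHOLD
        and node.seconds_below_threshold >= UNNEEDED_SECONDS
        and node.unevictable_pods == 0
    )


node_c = NodeState("node-c", utilization=0.30, seconds_below_threshold=3600, unevictable_pods=1)
node_d = NodeState("node-d", utilization=0.60, seconds_below_threshold=0, unevictable_pods=0)

candidates = [n.name for n in (node_c, node_d) if removable(n)]
print(candidates)
```

Nothing in the predicate asks whether two nodes together would fit on one machine, so that option is never evaluated.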
Scale-up has to travel through the VM group API layer before it reaches the instance fleet API, which introduces overhead even on a clean path. Configure your groups with prioritized fallback (Spot first, On-Demand second), and a failed Spot attempt has to time out entirely before the next group is tried. Stack a few groups and pods can sit pending for several minutes. Those cold starts quickly add up, threatening SLAs and putting customer-facing services in jeopardy.
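A rough model of sequential fallback shows where the minutes go. The group names, timeouts, and provisioning times below are made-up numbers for illustration, not measured values or Cluster Autoscaler defaults.

```python
def time_to_capacity(attempts):
    """Walk groups in priority order and add up the waiting time.

    attempts: ordered list of (group_name, succeeds, seconds), where seconds
    is the provisioning time on success or the full timeout on failure.
    Returns (winning_group_or_None, total_seconds_pending).
    """
    elapsed = 0
    for name, succeeds, seconds in attempts:
        elapsed += seconds
        if succeeds:
            return name, elapsed
    return None, elapsed


# Hypothetical scenario: two Spot groups run out of capacity before On-Demand works.
attempts = [
    ("spot-a", False, 180),
    ("spot-b", False, 180),
    ("on-demand", True, 90),
]

winner, pending_seconds = time_to_capacity(attempts)
print(winner, pending_seconds)
```

Because the attempts are serial, every failed group adds its whole timeout to the time the pod spends pending.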
Karpenter rethought the question rather than tuning the existing answer. Where Cluster Autoscaler asks “which of my predefined groups can fit this pod,” Karpenter asks “what does this pod actually need?”
It reads the full pod spec (CPU, memory, taints, topology spread, capacity type) and provisions the lowest-cost instance from the entire cloud catalog that satisfies those requirements. No predefined node groups. No VM group API in the middle. Karpenter calls the instance fleet API directly, cutting provisioning time from minutes to seconds.
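In spirit, that selection is a filter followed by a minimum over price. The sketch below is a toy version under loud assumptions: the catalog entries, their names, and their prices are invented, and only CPU, memory, architecture, and capacity type are checked. Karpenter itself batches pending pods and also evaluates taints, topology spread, and other constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu: float
    memory_gib: float
    arch: str
    capacity_type: str
    hourly_price: float   # invented prices, for illustration only


@dataclass(frozen=True)
class PodRequirements:
    cpu: float
    memory_gib: float
    arch: str
    capacity_types: frozenset


def cheapest_fit(req: PodRequirements, catalog: list[InstanceType]) -> Optional[InstanceType]:
    """Pick the lowest-priced instance type that satisfies every requirement."""
    candidates = [
        i for i in catalog
        if i.cpu >= req.cpu
        and i.memory_gib >= req.memory_gib
        and i.arch == req.arch
        and i.capacity_type in req.capacity_types
    ]
    return min(candidates, key=lambda i: i.hourly_price, default=None)


# Hypothetical catalog; real catalogs contain hundreds of instance types.
catalog = [
    InstanceType("type-a", 2, 4, "amd64", "on-demand", 0.10),
    InstanceType("type-b", 2, 4, "amd64", "spot", 0.04),
    InstanceType("type-c", 8, 16, "amd64", "spot", 0.12),
    InstanceType("type-d", 2, 4, "arm64", "spot", 0.03),
]

req = PodRequirements(cpu=1, memory_gib=2, arch="amd64",
                      capacity_types=frozenset({"spot", "on-demand"}))
choice = cheapest_fit(req, catalog)
print(choice.name if choice else "nothing fits")
```

The wider the set of allowed instance types and capacity types, the more candidates survive the filter and the cheaper the minimum tends to be.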
Consolidation works differently too. Cluster Autoscaler removes underutilized nodes. Karpenter can replace them. An m5.2xlarge sitting at 30% utilization with pods consuming roughly 3 CPUs isn’t just a removal candidate. Karpenter can launch an m5.xlarge, migrate the pods, and drain the 2xlarge. The result is a cheaper node and denser packing, dynamically maintained as the cluster evolves.
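The replacement decision can be sketched as "find a cheaper instance that still fits everything on this node." The vCPU and memory sizes below are the published m5 sizes. The prices are relative units invented for the example, the 10 GiB memory total is an assumption, and system-reserved capacity is ignored.

```python
from typing import Optional


def replacement_for(current: dict, pod_cpu: float, pod_memory_gib: float,
                    catalog: list[dict]) -> Optional[dict]:
    """Return the cheapest instance that is cheaper than the current node
    and still fits the node's total pod requests, or None if there is none."""
    cheaper = [
        i for i in catalog
        if i["price"] < current["price"]
        and i["cpu"] >= pod_cpu
        and i["memory_gib"] >= pod_memory_gib
    ]
    return min(cheaper, key=lambda i: i["price"], default=None)


# Prices are relative units for illustration, not real AWS prices.
catalog = [
    {"name": "m5.large", "cpu": 2, "memory_gib": 8, "price": 1.0},
    {"name": "m5.xlarge", "cpu": 4, "memory_gib": 16, "price": 2.0},
    {"name": "m5.2xlarge", "cpu": 8, "memory_gib": 32, "price": 4.0},
]
current_node = catalog[2]

# Pods on the node request roughly 3 CPUs; 10 GiB of memory is assumed.
target = replacement_for(current_node, pod_cpu=3, pod_memory_gib=10, catalog=catalog)
print(target["name"] if target else "keep current node")
```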
Configuration is split cleanly across two objects. The NodePool declares intent: which instance families are allowed, disruption policies, resource limits. The NodeClass manages cloud mechanics: subnets, IAM roles, disk configuration. One NodeClass can back multiple NodePools, so you define cloud wiring once and reuse it.
A few settings that matter in practice:
Keep instance type requirements broad. Every constraint you add in NodePool requirements reduces Karpenter’s ability to bin-pack efficiently. Narrow requirements are the most common source of underperformance.
Choose your disruption policy deliberately. WhenEmpty only removes nodes with no running workloads: safe, minimal disruption, lower savings. WhenUnderutilized (named WhenEmptyOrUnderutilized since Karpenter v1) repacks underused nodes onto fewer, cheaper alternatives: more savings, more pod movement. The right choice depends on your workload sensitivity.
Set consolidateAfter to match your traffic patterns. Zero seconds is maximally aggressive. For spiky workloads, a window of a few minutes prevents Karpenter from acting on transient dips that will recover on their own.
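To make those settings concrete, here is a sketch of a NodePool built as a Python dictionary and printed as JSON, which `kubectl apply -f` accepts just like YAML. It assumes the Karpenter v1 API on AWS, where the underutilized policy is spelled `WhenEmptyOrUnderutilized` and the delay field is `consolidateAfter`. The pool name, NodeClass name, CPU limit, and 5-minute window are placeholders, so check the field names against the Karpenter version you actually run.

```python
import json

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general"},              # placeholder name
    "spec": {
        "template": {
            "spec": {
                # Cloud wiring lives in the NodeClass and is only referenced here.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",            # placeholder NodeClass name
                },
                # Broad requirements leave room for bin-packing.
                "requirements": [
                    {"key": "kubernetes.io/arch", "operator": "In",
                     "values": ["amd64", "arm64"]},
                    {"key": "karpenter.sh/capacity-type", "operator": "In",
                     "values": ["spot", "on-demand"]},
                ],
            }
        },
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "5m",             # wait out short traffic dips
        },
        "limits": {"cpu": "1000"},                # placeholder ceiling for the pool
    },
}

manifest = json.dumps(node_pool, indent=2)
print(manifest)
```

Note that no instance family or size appears in the requirements. Leaving them out is what keeps the whole catalog available to the provisioner.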
Karpenter is a genuine leap forward. It also has structural limits that configuration alone can’t fix.
Aggressive underutilized consolidation means frequent pod evictions. For stateless web workloads that tolerate restarts, this is manageable. For stateful workloads, long-running batch jobs, or CI/CD pipelines mid-execution, repeated evictions introduce real reliability risk. Tighter packing also creates noisy neighbor conditions where one pod that spikes steps on others sharing its node. The more aggressively you consolidate, the more exposure you carry. Teams end up softening their disruption settings to protect reliability, and those same settings are what keep capacity underutilized.
Karpenter is downstream of the Kubernetes scheduler. The scheduler places pods greedily. It finds a node that fits right now and places the pod there. It has no awareness of future cluster states or consolidation candidates. By the time Karpenter sees the cluster, placement decisions have already been made. Unevictable pods end up scattered across nodes Karpenter wants to drain, and there’s nothing Karpenter can do about it. The two systems optimize independently, and they frequently work against each other.
Even when a node is underutilized, it can’t be drained if any pod on it can’t relocate. Anti-affinity rules, topology spread constraints, or an unsatisfied PDB – any of these pins the node permanently. Teams add these guardrails for valid operational reasons. But they accumulate silently across a cluster until 30-40% of capacity is structurally stranded, trapped by constraints the autoscaler can’t work around.
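One way to put a number on this is to add up the idle capacity sitting on nodes that cannot be drained. The definition of "stranded" used here (unrequested CPU on blocked nodes, as a share of total cluster CPU) and the node data are assumptions for illustration.

```python
def stranded_fraction(nodes: list[dict]) -> float:
    """Idle CPU on undrainable nodes as a fraction of total cluster CPU."""
    total_cpu = sum(n["cpu"] for n in nodes)
    stranded_cpu = sum(n["cpu"] - n["requested"] for n in nodes if n["blocked"])
    return stranded_cpu / total_cpu


# Hypothetical cluster: four 8-CPU nodes, two of them pinned by a pod
# that anti-affinity, topology spread, or a PDB prevents from moving.
nodes = [
    {"name": "node-1", "cpu": 8, "requested": 2, "blocked": True},
    {"name": "node-2", "cpu": 8, "requested": 3, "blocked": True},
    {"name": "node-3", "cpu": 8, "requested": 7, "blocked": False},
    {"name": "node-4", "cpu": 8, "requested": 6, "blocked": False},
]

fraction = stranded_fraction(nodes)
print(f"{fraction:.1%} of cluster CPU is stranded")
```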
Moving from Cluster Autoscaler to Karpenter solves the provisioning layer. It doesn’t solve what happens upstream, or the structural waste that accumulates inside a running cluster over time.
The remaining gap has three dimensions:
Pod resource requests are often wrong. Overprovisioned pods inflate node requirements and create headroom waste. Underprovisioned pods cause throttling and OOMs. Accurate rightsizing, informed by actual usage patterns, spike behavior, and QoS class, is a prerequisite for any autoscaler to work optimally.
The scheduler-autoscaler coordination problem is real. Komodor’s Predictive Placement addresses this by running continuous simulations of cluster drain states, classifying nodes by consolidation likelihood, and steering new pods away from drain candidates before bad placement happens. Unevictable pods get steered onto dedicated keeper nodes, concentrating blockers and freeing the rest of the cluster for consolidation. The scheduling layer stops working against the autoscaling layer.
Optimization blockers need to be surfaced and dealt with without affecting reliability. PDB-caused waste, excessive nodes from anti-affinity misconfiguration, and non-terminated nodes blocking consolidation don’t show up in standard autoscaler metrics. Capacity Intelligence identifies them, quantifies their cost impact, and surfaces remediation in plain language.
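Of the three dimensions above, rightsizing is the easiest to illustrate. The sketch below derives a CPU request from observed usage with a nearest-rank percentile plus headroom. It is a generic heuristic, not Komodor's method: the 95th percentile, the 20% headroom, and the sample data are assumptions, and it does not model spike behavior or QoS class.

```python
def recommend_request(usage_millicores: list[int], percentile: int = 95,
                      headroom_pct: int = 20) -> int:
    """Recommend a CPU request (in millicores) from observed usage samples.

    Takes the nearest-rank percentile of the samples and adds headroom.
    Integer arithmetic keeps the result exact.
    """
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_millicores)
    # Nearest-rank percentile: ceil(percentile * n / 100), at least 1.
    rank = max(1, -(-percentile * len(ordered) // 100))
    return ordered[rank - 1] * (100 + headroom_pct) // 100


# Hypothetical samples: steady around 200m with two short bursts.
samples = [200] * 18 + [450, 900]
current_request = 2000   # hypothetical configured request, in millicores

recommended = recommend_request(samples)
print(f"current: {current_request}m, recommended: {recommended}m")
```

A request sized this way shrinks the capacity the autoscaler has to provision, but a workload that regularly hits its peak would call for a higher percentile or more headroom.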
Together, these capabilities extend optimization from the provisioning layer into the cluster itself, reaching the 30-40% of capacity that Karpenter, however well-tuned, structurally can’t reclaim.
Want a deeper look at how Predictive Placement works? Check out The Two-Sided Scheduling Problem.