Kubernetes v1.35 dropped a couple of weeks ago, and while the headlines focus on gang scheduling and in-place resizing going GA, there’s a bigger story here that every platform team needs to understand: Kubernetes is finally acknowledging that cluster utilization is fundamentally broken.
At Komodor, we work with hundreds of organizations running Kubernetes at scale. Across every customer, every vertical, and every cloud provider, we see the same pattern: average cluster utilization hovers between 20% and 40%. This isn’t anecdotal; CNCF surveys consistently report that achieving high cluster utilization in Kubernetes is one of the hardest operational challenges teams face.
Think about what that means: If you’re spending $1M annually on cloud infrastructure, $600-800K of that is essentially waste. Your nodes are provisioned, your cores are allocated, but the actual workloads are using a fraction of what’s been reserved.
For GPU workloads, the problem becomes catastrophic. We’re seeing GPU utilization rates in the 10-25% range across customer clusters, despite GPUs costing 5-10x more per hour than standard compute. When a single H100 node costs $30K/month, that 15% utilization rate isn’t just inefficient, it’s business-breaking.
The root problem isn’t that Kubernetes can’t pack workloads efficiently. It’s that Kubernetes was built with assumptions that don’t match how real production workloads actually behave.
The industry has been scrambling to solve this, and we’re seeing a proliferation of solutions attempting to fill the utilization gap.
But here’s what’s significant about v1.35: Kubernetes itself is starting to provide native primitives to solve these problems.
The big one is in-place pod resize finally going GA. After years in alpha and beta, you can now modify CPU and memory requests/limits for running pods without restarting them.
Why this matters for utilization: You can now right-size workloads dynamically without downtime. That stateful database that you over-provisioned “just to be safe”? You can tune it while it’s running. Your ML inference service that needs different resources during business hours vs. overnight? Adjust it live.
The gap: Kubernetes gives you the primitive, but not the intelligence. You still need something (VPA, custom controllers, or platforms like ours) to decide when and how much to resize. The syscall exists; the control loop is your problem.
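For a taste of what that control loop can look like with stock components, here is a minimal Vertical Pod Autoscaler sketch; the Deployment name is a placeholder, and the in-place update mode is only available in newer VPA releases (still alpha at the time of writing), so treat it as an assumption to verify against your VPA version:

```yaml
# Sketch: let VPA recommend and apply resource changes for a Deployment.
# "inference-service" is a placeholder; "InPlaceOrRecreate" is the in-place
# update mode in recent VPA releases and may require enabling a feature flag.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inference-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  updatePolicy:
    updateMode: "InPlaceOrRecreate"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "4"
        memory: 8Gi
```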
The other headline feature, gang scheduling, introduces Workload as a core type alongside PodGroups, enabling the scheduler to treat groups of pods atomically.
Why this matters for GPU utilization: Distributed AI training jobs fail catastrophically if workers start out of sync. Previously, you’d spin up 8 GPU pods for a training job, 7 would start immediately, the 8th would be pending, and you’d burn $200/hour on 7 idle GPUs waiting for capacity. Gang scheduling ensures all 8 start together, or none start at all.
The gap: This is still alpha. Multi-tenant GPU clusters with complex placement requirements (topology awareness, fabric-attached accelerators) need more than basic gang semantics; they need intelligent queuing and preemption policies.
Opportunistic batching means the scheduler can now reuse decisions across identical pods when processing large queues.
Why this matters: If you’re running inference services that spawn thousands of similar pods, scheduler overhead is becoming a bottleneck. This makes pod scheduling dramatically faster for homogeneous workloads.
The real impact: Faster scheduling means less time between “I need capacity” and “I have capacity”, which means less overprovisioning for burst scenarios.
Dynamic Resource Allocation (DRA) continues graduating features (binding conditions to Beta), making GPUs and specialized accelerators first-class schedulable resources.
Why this matters for GPU utilization: The scheduler can now understand GPU topology, fabric connectivity, and device-specific capabilities when placing workloads. No more manually debugging why your multi-GPU training job has 10x slower communication because pods landed on GPUs across different NUMA domains.
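To make the pod-side wiring concrete, here is a hedged sketch that assumes a DRA driver (for example, NVIDIA’s) is installed and has a ResourceClaimTemplate named single-gpu defined for its device class; the template name and image are placeholders:

```yaml
# Sketch only: assumes an installed DRA driver and an existing
# ResourceClaimTemplate named "single-gpu" (both are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: dra-training-pod
spec:
  resourceClaims:
  - name: gpu                                 # claim handle referenced below
    resourceClaimTemplateName: single-gpu     # hypothetical template name
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      claims:
      - name: gpu                             # attach the claimed device here
```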
Here’s what Kubernetes v1.35 gets right: it provides the kernel-level primitives for better utilization. In-place resize, gang scheduling, and structured device claims are the building blocks.
But Kubernetes is intentionally staying at the kernel layer. It’s giving you mmap and cgroups, not a database. The project won’t ship the decision-making layer that tells you when to resize, which jobs to prioritize, or how to keep expensive GPUs busy.
That’s the user-space problem. And for most platform teams, building that user-space layer is not their core business.
If you’re running Kubernetes at scale, v1.35 is an important release, but it’s not magic:
Good news: The primitives are finally mature enough to build real utilization optimization on top of.
Reality check: You’ll need something, whether it’s building custom controllers, adopting ecosystem tools, or using platforms like Komodor, to actually use these primitives effectively.
The GPU economics: With GPU costs 5-10x higher than standard compute and utilization rates in the basement, the ROI on optimization tooling is obvious. If you’re spending $500K/year on GPUs at 20% utilization, even a 10-point improvement in utilization saves $125K annually.
If Kubernetes v1.35 provides the “kernel” primitives, Komodor provides the operating system required to actually manage them. While the new in-place resize and gang scheduling features are powerful, they are passive tools; they do not know when to resize a pod or which job to prioritize.
Komodor bridges the gap between these raw primitives and business value, supplying the intelligence and automation the primitives themselves leave out.
Before in-place resize, changing the resources field in a Pod spec was forbidden; you had to destroy the Pod and create a new one. This disrupted stateful connections and caused cold-start latency. With In-Place Pod Resizing hitting General Availability, the resources field of a running Pod is now mutable (applied through the pod’s resize subresource).
How It Works
The resizePolicy Field
You can control how the resizing behaves using the resizePolicy list. For example, you might want to allow CPU to change without a restart, but require a restart for memory changes (if your app crashes on memory shifts).
YAML Example: Defining a Resizable Pod
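A minimal sketch of such a Pod (the image and resource values are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resizable-app
spec:
  containers:
  - name: app
    image: nginx:1.27                 # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # CPU can change without a restart
    - resourceName: memory
      restartPolicy: RestartContainer # memory changes restart the container
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
```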
Command Example: Triggering a Resize
To resize this running pod without killing it, you patch the running pod directly.
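A sketch of the command (recent Kubernetes versions route resizes through a dedicated resize subresource, and the new CPU values here are arbitrary):

```bash
# Bump the CPU request/limit of the running pod without a restart.
# Recent clusters expect resizes to go through the "resize" subresource;
# the values below are placeholders.
kubectl patch pod resizable-app --subresource=resize -p \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"2"}}}]}}'
```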
If the node has capacity, the CPU shares are updated instantly. If not, the resize is marked as Deferred until space frees up.
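To check whether the resize was applied or deferred, you can inspect the pod’s status; this is a sketch, and exact field and condition names have shifted slightly across releases:

```bash
# Resources actually applied to the running container after the resize.
kubectl get pod resizable-app \
  -o jsonpath='{.status.containerStatuses[0].resources}'

# Pod conditions surface a pending (e.g. deferred) or in-progress resize.
kubectl describe pod resizable-app | grep -A 5 'Conditions:'
```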
Standard Kubernetes scheduling is pod-centric: it schedules one pod at a time. This is fatal for distributed training (e.g., PyTorch DistributedDataParallel), where all workers must be active simultaneously to establish a communication ring. If you need 4 GPUs and only 3 are available, standard Kubernetes will schedule 3 and leave them idle forever, waiting for the 4th (Deadlock).
Gang Scheduling introduces the concept of “All-or-Nothing.”
The PodGroup Concept
In v1.35’s alpha implementation (largely driven by the kubernetes-sigs/scheduler-plugins repo and the new queueing initiatives), you define a group of pods that must be treated as a single atomic unit.
YAML Example: The PodGroup
First, you define the group requirements.
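A sketch using the PodGroup CRD from the kubernetes-sigs/scheduler-plugins coscheduling plugin (the name and numbers are placeholders, and other gang-scheduling implementations use slightly different APIs):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 4                 # all 4 workers must be placeable, or none are bound
  scheduleTimeoutSeconds: 300  # give up waiting for the full gang after 5 minutes
```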
YAML Example: The Job
Then, you link your workload to this group via a label or annotation (depending on the specific scheduler plugin configuration used with v1.35).
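A hedged sketch of the workload side, wired to the group above via the pod-group label the coscheduling plugin watches; the scheduler name and image are placeholders for your environment:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-ddp-training
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        scheduling.x-k8s.io/pod-group: distributed-training  # link to the PodGroup
    spec:
      schedulerName: scheduler-plugins-scheduler  # placeholder: your gang-aware scheduler
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/pytorch-trainer:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per worker
```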
The Scheduler’s Logic
With a gang defined, the scheduler treats it as one atomic decision: pods in the group are bound only once the whole group (or at least its declared minimum) can be placed at the same time. If the full gang doesn’t fit, every pod stays pending, so no GPU sits allocated but idle waiting for stragglers.
Kubernetes v1.35 represents maturity rather than revolution. The Gang Scheduling and Opportunistic Batching features signal that Kubernetes is serious about being the de facto platform for AI/ML workloads, not just stateless microservices. The graduation of In-Place Vertical Scaling to GA shows the project’s commitment to supporting stateful, long-running workloads with less disruption.
For platform engineering teams, this release reduces the gap between “it works in a demo” and “it works in production at scale.” The focus on scheduler intelligence, performance optimization, and security hardening reflects the real-world challenges of operating Kubernetes clusters that support hundreds of teams and thousands of workloads.
For teams running AI/ML workloads or operating large-scale clusters, it’s good practice to test these features in staging environments first so you can start taking advantage of them.
What utilization rates are you seeing in your clusters? How are you tackling the cost problem? Let us know in the comments or reach out; we’d love to hear what’s working (and what’s not) in your environment.
Komodor reduces the cost and complexity of managing large-scale Kubernetes environments by automating day-to-day operations, health management, and cost optimization. The Komodor Platform proactively identifies risks that can impact application availability, reliability, and performance, while providing AI-assisted root-cause analysis, troubleshooting, and automated remediation playbooks. Fortune 500 companies across a wide range of industries, including financial services and retail, rely on Komodor to empower developers, reduce TicketOps, and harness the full power of Kubernetes to accelerate their business. The company has received $67M in funding from Accel, Felicis, NFX Capital, OldSlip Group, Pitango First, Tiger Global, and Vine Ventures. For more information, visit the Komodor website, join the Komodor Kommunity, and follow us on LinkedIn and X.
To request a demo, visit the Contact Sales page.
Media Contact:
Marc Gendron
Marc Gendron PR for Komodor
[email protected]
617-877-7480