Kubernetes v1.35 dropped a couple of weeks ago, and while the headlines focus on gang scheduling and in-place resizing going GA, there’s a bigger story here that every platform team needs to understand: Kubernetes is finally acknowledging that cluster utilization is fundamentally broken.
At Komodor, we work with hundreds of organizations running Kubernetes at scale. Across every customer, every vertical, and every cloud provider, we see the same pattern: average cluster utilization hovers between 20% and 40%. This isn’t anecdotal; CNCF surveys consistently report that achieving high cluster utilization in Kubernetes is one of the hardest operational challenges teams face.
Think about what that means: If you’re spending $1M annually on cloud infrastructure, $600-800K of that is essentially waste. Your nodes are provisioned, your cores are allocated, but the actual workloads are using a fraction of what’s been reserved.
For GPU workloads, the problem becomes catastrophic. We’re seeing GPU utilization rates in the 10-25% range across customer clusters, despite GPUs costing 5-10x more per hour than standard compute. When a single H100 node costs $30K/month, that 15% utilization rate isn’t just inefficient, it’s business-breaking.
The root problem isn’t that Kubernetes can’t pack workloads efficiently. It’s that Kubernetes was built with assumptions that don’t match how real production workloads actually behave.
The industry has been scrambling to solve this, and we’re seeing a proliferation of solutions attempting to fill the utilization gap.
But here’s what’s significant about v1.35: Kubernetes itself is starting to provide native primitives to solve these problems.
The big one is in-place pod resize finally going GA. After years in alpha and beta, you can now modify CPU and memory requests/limits for running pods without restarting them.
Why this matters for utilization: You can now right-size workloads dynamically without downtime. That stateful database that you over-provisioned “just to be safe”? You can tune it while it’s running. Your ML inference service that needs different resources during business hours vs. overnight? Adjust it live.
The gap: Kubernetes gives you the primitive, but not the intelligence. You still need something (VPA, custom controllers, or platforms like ours) to decide when and how much to resize. The syscall exists; the control loop is your problem.
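For a taste of what that control loop can look like with stock components, here is a minimal Vertical Pod Autoscaler sketch; the Deployment name is a placeholder, and the in-place update mode is only available in newer VPA releases (still alpha at the time of writing), so treat it as an assumption to verify against your VPA version:

```yaml
# Sketch: let VPA recommend and apply resource changes for a Deployment.
# "inference-service" is a placeholder; "InPlaceOrRecreate" is the in-place
# update mode in recent VPA releases and may require enabling a feature flag.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inference-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  updatePolicy:
    updateMode: "InPlaceOrRecreate"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "4"
        memory: 8Gi
```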
The other headline feature, gang scheduling, introduces Workload as a core type alongside PodGroups, enabling the scheduler to treat groups of pods atomically.
Why this matters for GPU utilization: Distributed AI training jobs fail catastrophically if workers start out of sync. Previously, you’d spin up 8 GPU pods for a training job, 7 would start immediately, the 8th would be pending, and you’d burn $200/hour on 7 idle GPUs waiting for capacity. Gang scheduling ensures all 8 start together, or none start at all.
The gap: This is still alpha. Multi-tenant GPU clusters with complex placement requirements (topology awareness, fabric-attached accelerators) need more than basic gang semantics; they need intelligent queuing and preemption policies.
Opportunistic batching means the scheduler can now reuse decisions across identical pods when processing large queues.
Why this matters: If you’re running inference services that spawn thousands of similar pods, scheduler overhead is becoming a bottleneck. This makes pod scheduling dramatically faster for homogeneous workloads.
The real impact: Faster scheduling means less time between “I need capacity” and “I have capacity”, which means less overprovisioning for burst scenarios.
Dynamic Resource Allocation (DRA) continues graduating features (binding conditions to Beta), making GPUs and specialized accelerators first-class schedulable resources.
Why this matters for GPU utilization: The scheduler can now understand GPU topology, fabric connectivity, and device-specific capabilities when placing workloads. No more manually debugging why your multi-GPU training job has 10x slower communication because pods landed on GPUs across different NUMA domains.
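To make the pod-side wiring concrete, here is a hedged sketch that assumes a DRA driver (for example, NVIDIA’s) is installed and has a ResourceClaimTemplate named single-gpu defined for its device class; the template name and image are placeholders:

```yaml
# Sketch only: assumes an installed DRA driver and an existing
# ResourceClaimTemplate named "single-gpu" (both are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: dra-training-pod
spec:
  resourceClaims:
  - name: gpu                                 # claim handle referenced below
    resourceClaimTemplateName: single-gpu     # hypothetical template name
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      claims:
      - name: gpu                             # attach the claimed device here
```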
Here’s what Kubernetes v1.35 gets right: it provides the kernel-level primitives for better utilization. In-place resize, gang scheduling, and structured device claims are the building blocks.
But Kubernetes is intentionally staying at the kernel layer. It’s giving you mmap and cgroups, not a database. The project won’t ship the decision-making layer that tells you when to resize, which jobs to prioritize, or how to keep expensive GPUs busy.
That’s the user-space problem. And for most platform teams, building that user-space layer is not their core business.
If you’re running Kubernetes at scale, v1.35 is an important release, but it’s not magic:
Good news: The primitives are finally mature enough to build real utilization optimization on top of.
Reality check: You’ll need something, whether it’s building custom controllers, adopting ecosystem tools, or using platforms like Komodor, to actually use these primitives effectively.
The GPU economics: With GPU costs 5-10x higher than standard compute and utilization rates in the basement, the ROI on optimization tooling is obvious. If you’re spending $500K/year on GPUs at 20% utilization, even a 10-point improvement in utilization saves $125K annually.
If Kubernetes v1.35 provides the “kernel” primitives, Komodor provides the operating system required to actually manage them. While the new in-place resize and gang scheduling features are powerful, they are passive tools; they do not know when to resize a pod or which job to prioritize.
Komodor bridges the gap between these raw primitives and business value, supplying the intelligence and automation the primitives themselves leave out.
Before in-place resize, changing the resources field in a Pod spec was forbidden; you had to destroy the Pod and create a new one. This disrupted stateful connections and caused cold-start latency. With In-Place Pod Resizing hitting General Availability, the resources field of a running Pod is now mutable (applied through the pod’s resize subresource).
How It Works
The resizePolicy Field
You can control how the resizing behaves using the resizePolicy list. For example, you might want to allow CPU to change without a restart, but require a restart for memory changes (if your app crashes on memory shifts).
YAML Example: Defining a Resizable Pod
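A minimal sketch of such a Pod (the image and resource values are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resizable-app
spec:
  containers:
  - name: app
    image: nginx:1.27                 # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # CPU can change without a restart
    - resourceName: memory
      restartPolicy: RestartContainer # memory changes restart the container
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
```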
Command Example: Triggering a Resize
To resize this running pod without killing it, you patch the running pod directly.
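A sketch of the command (recent Kubernetes versions route resizes through a dedicated resize subresource, and the new CPU values here are arbitrary):

```bash
# Bump the CPU request/limit of the running pod without a restart.
# Recent clusters expect resizes to go through the "resize" subresource;
# the values below are placeholders.
kubectl patch pod resizable-app --subresource=resize -p \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"2"}}}]}}'
```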
If the node has capacity, the CPU shares are updated instantly. If not, the resize is marked as Deferred until space frees up.
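To check whether the resize was applied or deferred, you can inspect the pod’s status; this is a sketch, and exact field and condition names have shifted slightly across releases:

```bash
# Resources actually applied to the running container after the resize.
kubectl get pod resizable-app \
  -o jsonpath='{.status.containerStatuses[0].resources}'

# Pod conditions surface a pending (e.g. deferred) or in-progress resize.
kubectl describe pod resizable-app | grep -A 5 'Conditions:'
```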
Standard Kubernetes scheduling is pod-centric: it schedules one pod at a time. This is fatal for distributed training (e.g., PyTorch DistributedDataParallel), where all workers must be active simultaneously to establish a communication ring. If you need 4 GPUs and only 3 are available, standard Kubernetes will schedule 3 and leave them idle forever, waiting for the 4th (Deadlock).
Gang Scheduling introduces the concept of “All-or-Nothing.”
The PodGroup Concept
In v1.35’s alpha implementation (largely driven by the kubernetes-sigs/scheduler-plugins repo and the new queueing initiatives), you define a group of pods that must be treated as a single atomic unit.
YAML Example: The PodGroup
First, you define the group requirements.
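A sketch using the PodGroup CRD from the kubernetes-sigs/scheduler-plugins coscheduling plugin (the name and numbers are placeholders, and other gang-scheduling implementations use slightly different APIs):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 4                 # all 4 workers must be placeable, or none are bound
  scheduleTimeoutSeconds: 300  # give up waiting for the full gang after 5 minutes
```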
YAML Example: The Job
Then, you link your workload to this group via a label or annotation (depending on the specific scheduler plugin configuration used with v1.35).
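A hedged sketch of the workload side, wired to the group above via the pod-group label the coscheduling plugin watches; the scheduler name and image are placeholders for your environment:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-ddp-training
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        scheduling.x-k8s.io/pod-group: distributed-training  # link to the PodGroup
    spec:
      schedulerName: scheduler-plugins-scheduler  # placeholder: your gang-aware scheduler
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/pytorch-trainer:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per worker
```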
The Scheduler’s Logic
With a gang defined, the scheduler treats it as one atomic decision: pods in the group are bound only once the whole group (or at least its declared minimum) can be placed at the same time. If the full gang doesn’t fit, every pod stays pending, so no GPU sits allocated but idle waiting for stragglers.
Kubernetes v1.35 represents maturity rather than revolution. The Gang Scheduling and Opportunistic Batching features signal that Kubernetes is serious about being the de facto platform for AI/ML workloads, not just stateless microservices. The graduation of In-Place Vertical Scaling to GA shows the project’s commitment to supporting stateful, long-running workloads with less disruption.
For platform engineering teams, this release reduces the gap between “it works in a demo” and “it works in production at scale.” The focus on scheduler intelligence, performance optimization, and security hardening reflects the real-world challenges of operating Kubernetes clusters that support hundreds of teams and thousands of workloads.
For teams running AI/ML workloads or operating large-scale clusters, it’s good practice to test these features in staging environments first so you can start taking advantage of them.
What utilization rates are you seeing in your clusters? How are you tackling the cost problem? Let us know in the comments or reach out; we’d love to hear what’s working (and what’s not) in your environment.
Komodor reduces the cost and complexity of managing large-scale Kubernetes environments by automating day-to-day operations, health management, and cost optimization. The Komodor Platform proactively identifies risks that can impact application availability, reliability, and performance, while providing AI-assisted root-cause analysis, troubleshooting, and automated remediation playbooks. Fortune 500 companies across a wide range of industries, including financial services and retail, rely on Komodor to empower developers, reduce TicketOps, and harness the full power of Kubernetes to accelerate their business. The company has received $67M in funding from Accel, Felicis, NFX Capital, OldSlip Group, Pitango First, Tiger Global, and Vine Ventures. For more information, visit the Komodor website, join the Komodor Kommunity, and follow us on LinkedIn and X.
To request a demo, visit the Contact Sales page.
Media Contact:
Marc Gendron
Marc Gendron PR for Komodor
[email protected]
617-877-7480