Last week's KubeCon Atlanta made one thing abundantly clear: Kubernetes is quickly becoming the de facto platform for AI workloads. The event lineup was chock full of talks, workshops, and even co-located events dedicated to AI, machine learning, and running data natively on Kubernetes, with approximately 50 (!) sessions in total focused on AI, ML, LLM, and GenAI topics. What was until now mostly PoCs and aspiration is now truly delivering in production. Running data-intensive AI/ML workloads, particularly Large Language Model (LLM) inference, on Kubernetes is reshaping how infrastructure teams operate and what they're expected to deliver, and the tooling is finally catching up to support these evolving needs.

This signals a fundamental shift in what Kubernetes means for organizations. Data on Kubernetes is becoming the operational default in a world where AI applications are expected to grow exponentially. The reasons are compelling, but not surprising: the scalability, flexibility, resilience, openness, and cost efficiency that we've learned to love for running our applications and systems have become equally attractive for data-intensive workloads.

LLMs, GPUs, and Scale

The sheer size and complexity of modern AI models, some with over 600 billion parameters, coupled with the critical need for accelerators like GPUs and TPUs, ensures that cloud-native infrastructure for AI will only grow larger and more complex. Deploying these models for inference presents significant infrastructure challenges: platform teams must grapple with high latency, spiraling costs, GPU scarcity, and the necessity of highly efficient multi-node, multi-accelerator architectures. The GA release of Dynamic Resource Allocation (DRA) represents a major step forward, becoming the key enabler for advanced GPU and accelerator management.

The AI/ML workflow is also pushing what the platforms themselves must support to its limits.
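To make the DRA point above concrete, here is a rough sketch of what a DRA-based GPU request looks like: a ResourceClaimTemplate describes the device needed, and the Pod references the claim instead of a classic `nvidia.com/gpu` limit. This assumes a cluster with a DRA driver installed; the DeviceClass name, image, and exact field names are illustrative and vary by driver and Kubernetes version.

```yaml
# Sketch only: deviceClassName and image are hypothetical placeholders.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.example.com   # provided by your DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # hypothetical image
      resources:
        claims:
          - name: gpu          # consume the allocated device
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
```

The key shift from the older device-plugin model is that the scheduler now allocates devices via structured claims, which is what makes finer-grained sharing and selection of accelerators possible.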
Organizations now need to manage the entire specialized ML lifecycle, from pipeline orchestration and distributed training to fine-tuning and high-performance serving, often across diverse GPU fleets. Projects like llm-d are emerging as Kubernetes-native distributed-inference stacks built around tools like vLLM, with KV cache offloading and routing delivering up to 5x throughput improvements.

The Data Layer Evolution

Beyond stateless serving, Kubernetes is increasingly the home for stateful data applications. Projects like LanceDB are unifying vector search and SQL analytics on Kubernetes for hyper-scalable AI data lakes. The SIG Storage team continues to enhance support for file, block, and object storage to meet these demanding workloads.

This convergence of compute and data on Kubernetes creates both opportunity and complexity. Organizations gain the ability to run their entire AI stack on a unified platform, but they also inherit the operational burden of managing stateful workloads alongside ephemeral inference services.

A New Challenge for Platform Teams

Making Kubernetes self-serve for application developers is already a challenge that most platform teams haven't fully solved. Now add a new user persona to the mix: the data scientist or ML engineer who needs self-service capabilities without being burdened by Kubernetes complexity. These users need to deploy models, run experiments, and iterate quickly. They don't want to learn kubectl, and they shouldn't have to.

This expansion of the Kubernetes user base beyond traditional application developers compounds an already significant operational challenge. Data engineers and data scientists will inevitably encounter issues they can't resolve themselves, leading to more escalations to SRE teams. Without more efficient troubleshooting capabilities, platform teams risk becoming bottlenecks to AI initiatives, on top of the bottlenecks they're already managing for their existing users.
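One way to picture the self-service experience described above is a thin platform wrapper: the data scientist supplies only model-level inputs, and the platform renders the Kubernetes objects on their behalf. This is a minimal sketch, not a real product; the function name, image, labels, and GPU resource key are all assumptions.

```python
# Sketch of a curated self-service abstraction: high-level inputs in,
# Kubernetes manifest out. No kubectl knowledge required from the user.

def deploy_model(name: str, model_uri: str, gpus: int = 1, replicas: int = 1) -> dict:
    """Render a Deployment manifest from Kubernetes-free, model-level inputs."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name, "managed-by": "ml-platform"}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": "server",
                        # The serving image is chosen by the platform, not the user.
                        "image": "registry.example.com/inference-server:latest",
                        "env": [{"name": "MODEL_URI", "value": model_uri}],
                        "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                    }]
                },
            },
        },
    }

# The user's entire mental model: a name, a model location, a GPU count.
manifest = deploy_model("sentiment-v2", "s3://models/sentiment/v2", gpus=2)
```

The point of the design is that guardrails (image choice, labels, resource policy) live inside the wrapper, so every workload that comes through it is automatically compliant.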
The goal remains the same: maximize developer productivity while maintaining operational health, security, and cost control. But the stakes are now higher. Platform abstractions must be built to welcome newcomers and empower power users simultaneously, across an even wider spectrum of technical backgrounds and use cases.

The Self-Service Imperative

Many KubeCon sessions focused on what might be called the "Not-Code Challenge": how to make platforms accessible to users who are not, and don't want to become, infrastructure experts. The goal is turning platform expertise into tradable assets: instead of every team needing to understand the underlying complexity, they consume capabilities through curated, self-service experiences.

The vision emerging from these sessions is evolving platforms from centralized services into Internal Developer Marketplaces. Users browse and deploy infrastructure capabilities the way they'd install an app, without needing to understand what's happening underneath. For AI workloads specifically, this means providing abstractions for model deployment, GPU allocation, and inference scaling that don't require deep Kubernetes knowledge. The teams that succeed will be those that democratize access to infrastructure while maintaining the guardrails necessary for production operations.

But self-service at scale creates its own set of problems: when more users deploy more workloads, more things break in new ways. If you're handing data scientists the keys to GPU clusters without also rethinking how you handle incidents, you're just accelerating toward a support bottleneck.

Fighting Fires with Intelligence & Automated Remediation

That bottleneck is inevitable, because traditional troubleshooting methods are still too slow for dynamic AI environments. When an inference service degrades or a training job fails, the cost in both time and compute can be substantial.
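A back-of-envelope calculation shows why that compute cost adds up so quickly: while a GPU-backed service is degraded, the accelerators keep billing without doing useful work. All prices, fleet sizes, and durations below are illustrative assumptions, not benchmarks.

```python
# Back-of-envelope: wasted compute during an incident on a GPU-backed service.
# Every number here is an illustrative assumption.

def incident_compute_cost(gpus: int, gpu_hourly_usd: float, mttr_minutes: float) -> float:
    """GPU spend burned while the service is degraded and unproductive."""
    return gpus * gpu_hourly_usd * (mttr_minutes / 60)

# Same 8-GPU service, two resolution speeds.
manual = incident_compute_cost(gpus=8, gpu_hourly_usd=4.0, mttr_minutes=90)
automated = incident_compute_cost(gpus=8, gpu_hourly_usd=4.0, mttr_minutes=10)

print(f"90-minute MTTR: ${manual:.2f} of idle GPU time per incident")
print(f"10-minute MTTR: ${automated:.2f} per incident")
```

At these assumed prices the difference is $48 versus about $5.33 per incident for a single small service, before counting lost requests or engineer time, which is why cutting MTTR is the lever the next section focuses on.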
The solution lies in building intelligent systems that automate operations and reduce the need for human intervention. This is where AI-powered operations entered the KubeCon conversation in full force. Model Context Protocol (MCP) and agentic AI emerged as key enabling technologies, allowing intelligent agents to automate diagnosis and remediation workflows. Rather than waiting for an SRE to investigate an alert, work through runbooks, and escalate as needed, AI SRE agents can diagnose issues in real time and either resolve them automatically or present operators with validated remediation options.

The emphasis on "validated" matters. These aren't black-box systems making arbitrary changes - they're implementing guardrail workflows that ensure remediation actions are secure and auditable. The goal is reducing Mean Time to Resolve (MTTR) for critical cluster issues while maintaining the operational controls that production environments require. Salesforce showcased an AIOps system built on these principles that now manages over 1,000 Kubernetes clusters.

This shift from reactive to proactive operations is essential as AI workloads scale. More users, more complex workloads, and higher costs per minute of downtime demand operational approaches that can match the speed and scale of the systems they support.

What This Means for Platform Teams

KubeCon 2025 confirmed that AI on Kubernetes isn't a future state - it's the current reality. Platform, DevOps, and SRE teams are being asked to support workloads that are larger, more expensive, and used by personas who don't speak Kubernetes. Success in this environment requires rethinking how platforms are built and operated. Self-service abstractions must be robust enough to empower all the new personas that will need to consume them, while maintaining operational guardrails. Troubleshooting must become intelligent and automated to handle the volume and velocity of issues that AI workloads generate.
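The "validated remediation" pattern discussed above can be sketched in a few lines: an agent may propose any action, but only allowlisted actions execute, and every proposal, allowed or not, is audited. This is a toy illustration under assumed names; the action list, class, and fields are all hypothetical, and a real system would call the Kubernetes API where the comment indicates.

```python
# Sketch of a guardrail workflow: proposals from an AI agent pass through an
# allowlist check, and every decision is recorded for audit.
from dataclasses import dataclass, field

# Actions the platform team has pre-approved as safe to automate.
ALLOWED_ACTIONS = {"restart_deployment", "scale_deployment", "cordon_node"}

@dataclass
class Remediator:
    audit_log: list = field(default_factory=list)

    def execute(self, action: str, target: str) -> bool:
        """Run a proposed remediation only if it passes the guardrail."""
        allowed = action in ALLOWED_ACTIONS
        # Audit everything, including rejected proposals.
        self.audit_log.append({"action": action, "target": target, "allowed": allowed})
        if not allowed:
            return False
        # A real system would call the Kubernetes API here; the sketch just records it.
        return True

r = Remediator()
ok = r.execute("restart_deployment", "default/inference-gateway")   # allowlisted
blocked = r.execute("delete_namespace", "default")                   # rejected
```

The design choice worth noting is that the guardrail sits outside the agent: the agent can be as creative as it likes, but the blast radius is bounded by what operators have explicitly approved.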
And cost management must be baked into every layer, because GPU minutes are expensive. The organizations that get this right will be those that treat their platform as a product: one that serves an expanding user base with diverse needs and curates the experience without sacrificing safety. The infrastructure challenges of AI are significant, but so is the opportunity to build platforms that truly accelerate what organizations can accomplish.