KubeCon Atlanta 2025 & the AI-Native Shift

KubeCon + CloudNativeCon North America 2025 in Atlanta marked a defining moment for cloud-native infrastructure. Over four days, celebrating the 10th anniversary of both the CNCF and Kubernetes, more than 9,000 attendees witnessed the ecosystem’s evolution from container orchestration to AI-native operations. The conference delivered a clear message: AI workloads are no longer experimental. The focus has shifted to standardization, production-ready tooling, and building platforms that can orchestrate intelligent, autonomous systems at scale.

Three major themes dominated the conversation. 

  • First, the launch of the Certified Kubernetes AI Conformance Program established baseline requirements for running AI workloads, with Dynamic Resource Allocation (DRA) becoming table stakes. 
  • Second, platform engineering matured from buzzword to standardized discipline, with tools like Crossplane, Backstage, and Argo CD forming the backbone of self-service Internal Developer Platforms. 
  • Third, the retirement of Ingress NGINX and the rise of the Gateway API signaled a forced evolution in how teams manage traffic routing for modern workloads.

If you missed the event or couldn’t attend every session, here are the talks that captured some interesting (IMO) technical shifts happening in the Kubernetes ecosystem.

Top 10 Must-Watch Talks on AI and Platform Engineering

1. A Journey To Zero-Downtime Upgrades With Keycloak

  • Speakers: Martin Bartoš & Ryan Emerson (IBM)
  • Summary: The speakers detail the critical challenge of upgrading Keycloak—an identity management service where downtime disrupts business workflows—without service interruption. They introduce a “rolling update” strategy that replaces instances one at a time while handling complex issues like Infinispan clustering compatibility and database schema changes. The team presents a solution using a compatibility metadata provider and a new CLI command (update compatibility) integrated into a Kubernetes operator to automatically reject unsafe upgrades.
  • Why Watch: This session is essential for DevOps engineers and Keycloak administrators who need a technical blueprint for achieving high availability during stateful application upgrades, specifically solving the “split brain” and version compatibility issues inherent in clustered identity services.

2. AI Models Are Huge, but Your GPUs Aren’t: Mastering Multi-Node Distributed Inference

  • Speakers: Ernest Wong & Joshin Shan
  • Summary: This talk addresses the challenge of serving massive LLMs (100B+ parameters) that physically cannot fit on a single GPU, using distributed inference strategies. The speakers explain essential techniques like Tensor Parallelism (splitting model layers within a node) and Pipeline Parallelism (splitting layers across nodes) to manage memory footprints. Crucially, they advocate for “Prefill-Decode Disaggregation,” a technique that separates compute-heavy initial processing from memory-bound token generation to prevent interference and reduce latency. (A minimal sketch of the parallelism settings follows this list.)
  • Why Watch: You should watch this to understand the architectural patterns required to self-host massive AI models cost-effectively on Kubernetes, particularly if you need to optimize for “time to first token” and overall throughput.
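
The summary above is engine-agnostic, so here is a minimal sketch of how those two parallelism knobs are typically expressed, using vLLM's offline API as one common serving engine; the model name and parallel sizes are placeholders rather than the speakers' configuration, and prefill-decode disaggregation is a separate deployment-level concern not shown here.

```python
# Rough sketch of the two parallelism settings described above, using vLLM's
# offline API as one common serving engine. The model name and parallel sizes
# are placeholders, not the speakers' configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder: any model too big for one GPU
    tensor_parallel_size=8,    # tensor parallelism: shard each layer's weights across 8 GPUs
    pipeline_parallel_size=2,  # pipeline parallelism: split the layer stack into 2 stages
)

outputs = llm.generate(
    ["Summarize the Gateway API in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Multi-node pipeline parallelism also needs a distributed runtime underneath (vLLM typically leans on Ray for this), which is where Kubernetes-level scheduling and the talk's architectural patterns come into play.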

3. Beyond ChatOps: Agentic AI in Kubernetes

  • Speakers: Panel (Pavneet Ahluwalia, Idit Levine, Arik Alon, Valeria Ortiz)
  • Summary: This panel discussion explores the evolution from passive chatbots to proactive AI agents (like Holmes GPT) that can autonomously troubleshoot, detect issues, and perform Root Cause Analysis (RCA) within Kubernetes clusters. The speakers debate the significant challenges of this approach, including the risks of non-deterministic model behavior, the difficulty of managing security permissions (RBAC) for agents, and the problem of AI hallucinations lacking broader business context.
  • Why Watch: It is worth watching to get a realistic view of the current state of “Agentic AI” in Ops, offering valuable insights into the necessary guardrails—such as sandboxing and human-in-the-loop workflows—required before trusting AI to actively manage production infrastructure.
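
As a concrete illustration of the human-in-the-loop guardrail the panel calls for, here is a minimal, tool-agnostic sketch, not tied to any panelist's product: the agent may run read-only kubectl commands freely, but anything that could mutate the cluster requires explicit operator approval.

```python
# Minimal human-in-the-loop guardrail (illustrative, not any specific agent
# framework): the agent may run read-only kubectl commands freely, but any
# potentially mutating command requires explicit operator approval.
import shlex
import subprocess

READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain"}

def run_agent_command(command: str) -> str:
    """Run an agent-proposed kubectl command, gating mutations on approval."""
    args = shlex.split(command)
    if len(args) < 2 or args[0] != "kubectl":
        raise ValueError("agent may only propose kubectl commands")

    if args[1] not in READ_ONLY_VERBS:
        answer = input(f"Agent wants to run {command!r}. Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return "denied by operator"

    result = subprocess.run(args, capture_output=True, text=True, timeout=60)
    return result.stdout or result.stderr

# Example: a diagnostic step an RCA agent might propose during an incident.
print(run_agent_command("kubectl get pods -n production"))
```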

4. Economics of Platforms: Building Marketplaces Beyond Golden Paths

  • Speaker: Atulpriya Sharma
  • Summary: The speaker proposes shifting platform engineering from a centralized “Golden Path” model—where a small team builds everything—to an internal “Marketplace” model inspired by Yelp, where various teams contribute capabilities. In this model, the platform team acts as a curator that enforces quality through automated checks and governance, allowing specialized teams (e.g., security or ML) to publish their own validated templates for others to consume.
  • Why Watch: This session is valuable for platform leaders struggling with scaling issues; it offers a sustainable organizational strategy to unclog the bottleneck of platform delivery by incentivizing “inner sourcing” and distributing expertise across the company.

5. GitOps Without Variables

  • Speakers: Brian Grant & Alexis Richardson
  • Summary: Brian Grant (original architect of Kubernetes) and Alexis Richardson argue that the complexity of templating tools like Helm and Kustomize leads to configuration sprawl and “blast radius” risks. They propose a paradigm shift to fully rendered, explicit (“wet”) configuration stored as data in a database rather than as code, allowing for safer, atomic changes across thousands of targets using programmatic functions. (A bare-bones rendering sketch follows this list.)
  • Why Watch: This is a must-watch for those experiencing “YAML fatigue” or operational incidents caused by complex variable overrides; it presents a radical alternative (“ConfigHub”) that simplifies debugging and auditing by making the desired state explicit and queryable.
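
ConfigHub itself is a product, but the underlying "fully rendered" idea can be illustrated with nothing more than a render-ahead-of-time loop; the chart name, values files, and output paths below are placeholders for whatever your repository actually contains.

```python
# Illustrative only: the simplest form of the "fully rendered" idea is to
# expand templates ahead of time and commit the concrete manifests, so the
# thing reviewed and applied is exactly the thing that runs. The chart name,
# values files, and output paths are placeholders (this is not ConfigHub).
import pathlib
import subprocess

ENVIRONMENTS = ["dev", "staging", "prod"]

for env in ENVIRONMENTS:
    rendered = subprocess.run(
        ["helm", "template", "my-app", "./charts/my-app",
         "--values", f"values/{env}.yaml"],
        check=True, capture_output=True, text=True,
    ).stdout

    out_file = pathlib.Path("rendered") / env / "my-app.yaml"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(rendered)

# `git diff rendered/` now shows the literal change each environment will receive.
```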

6. GitOps and the Manifest Dilemma: Helm, Kustomize, Crossplane, Kro, and Beyond

  • Speaker: Dag Bjerre Andersen
  • Summary: The speaker categorizes the chaotic landscape of Kubernetes templating tools into “out-of-cluster rendering” (CLI tools like Helm and Kustomize) and “in-cluster rendering” (operators like Crossplane and Kro). He highlights the operational trade-offs, noting that while in-cluster tools offer dynamic capabilities, they often break the GitOps feedback loop by masking health checks and complicating local validation. (A short local-validation sketch follows this list.)
  • Why Watch: This talk is essential for architects choosing a templating stack, as it provides a clear framework for evaluating tools based on their “day two” operational reality—specifically how easily developers can validate changes locally and how reliably ArgoCD or Flux can detect deployment failures.
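
To make the "local validation" criterion concrete: with out-of-cluster rendering the manifests exist as plain files before anything reaches the cluster, so a developer or CI job can lint them with a client-side dry run. The sketch below assumes a rendered/ directory like the one produced in the previous example.

```python
# Client-side validation of already-rendered manifests (paths are placeholders,
# matching the rendered/ directory from the previous sketch). A server-side
# dry run (--dry-run=server) gives stricter checks but needs cluster access.
import subprocess
import sys

result = subprocess.run(
    ["kubectl", "apply", "--dry-run=client", "-R", "-f", "rendered/"],
    capture_output=True, text=True,
)

if result.returncode != 0:
    print("manifest validation failed:\n" + result.stderr, file=sys.stderr)
    sys.exit(1)

print(result.stdout)  # e.g. "deployment.apps/my-app created (dry run)"
```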

7. In-Place Pod Resize in Kubernetes: Dynamic Resource Management Without Restarting

  • Speakers: Mofi Rahman & Tim Allclair
  • Summary: The speakers introduce “In-Place Pod Resize,” a Kubernetes feature that allows vertical scaling of CPU and memory without restarting the pod, a major improvement over the restart-driven behavior of the standard Vertical Pod Autoscaler (VPA). They explain the technical flow, from the resize request updating the pod spec to the kubelet and container runtime actuating the change, and discuss current limitations, such as the inability to change Quality of Service (QoS) classes and how specific runtimes like Java might still require restarts to recognize memory changes. (A sample resize request follows this list.)
  • Why Watch: This is worth watching for anyone running stateful workloads, game servers, or expensive “start-up boost” applications where restarting containers to adjust resources is disruptive or cost-prohibitive.
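
As a rough illustration of what a resize request looks like in practice (not the speakers' exact demo), the sketch below bumps a running pod's CPU in place; on Kubernetes 1.33+ the change goes through the pod's resize subresource, while older clusters with the InPlacePodVerticalScaling feature gate accept the same patch against the pod spec directly. Pod and container names are placeholders.

```python
# Illustrative resize request (not the speakers' demo): raise a running pod's
# CPU without recreating it. On Kubernetes 1.33+ this goes through the pod's
# `resize` subresource; older clusters with the InPlacePodVerticalScaling
# feature gate accept the same patch against the pod spec (drop --subresource).
# Pod and container names are placeholders.
import json
import subprocess

patch = {
    "spec": {
        "containers": [
            {
                "name": "app",
                "resources": {"requests": {"cpu": "1"}, "limits": {"cpu": "2"}},
            }
        ]
    }
}

subprocess.run(
    ["kubectl", "patch", "pod", "my-app-0",
     "--subresource", "resize",
     "--patch", json.dumps(patch)],
    check=True,
)

# Once the kubelet has actuated the change, the pod's status reports the new
# values, e.g.:
#   kubectl get pod my-app-0 -o jsonpath='{.status.containerStatuses[0].resources}'
```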

8. Keynote: Supply Chain Reaction: A Cautionary Tale in K8s Security

  • Speakers: S. Potter & A.G. Veytia
  • Summary: Through a comedic skit involving a compromised compiler image, this keynote demonstrates how a supply chain attack works and how to prevent it using the “Avengers” of CNCF security tools: SLSA, Sigstore, and Kyverno. The presenters walk through securing a pipeline by adding attestations, signing images, and enforcing policies that block untrusted builds, ultimately pitching the “Open Source Project Security Baseline” as a map for incremental security improvements. (One slice of such a pipeline is sketched after this list.)
  • Why Watch: It is a highly accessible entry point for understanding complex security concepts, effectively showing how various open-source projects integrate to create a hardened, verifiable software supply chain.
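
The keynote covers the whole pipeline; as one small, self-contained slice, the sketch below shows a deploy gate that refuses to ship an image whose keyless Sigstore signature does not verify. The image reference, workflow identity, and OIDC issuer are placeholder values, and in-cluster the equivalent rule would typically be enforced by a Kyverno verifyImages policy at admission time.

```python
# One small slice of the pipeline from the keynote (all values are placeholders):
# a deploy gate that refuses to ship an image unless its keyless Sigstore
# signature verifies. In-cluster, a Kyverno verifyImages policy would enforce
# the equivalent rule at admission time.
import subprocess
import sys

IMAGE = "registry.example.com/team/app:1.4.2"
IDENTITY = "https://github.com/example-org/app/.github/workflows/release.yaml@refs/tags/v1.4.2"
ISSUER = "https://token.actions.githubusercontent.com"  # GitHub Actions OIDC issuer

check = subprocess.run(
    ["cosign", "verify",
     "--certificate-identity", IDENTITY,
     "--certificate-oidc-issuer", ISSUER,
     IMAGE],
    capture_output=True, text=True,
)

if check.returncode != 0:
    print("signature verification failed, refusing to deploy:", file=sys.stderr)
    print(check.stderr, file=sys.stderr)
    sys.exit(1)

print("signature verified, proceeding with deployment")
```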

9. Ulysses’ Odyssey Through Platform Engineering

  • Speaker: William Rizzo
  • Summary: The speaker uses metaphors from Homer’s Odyssey to illustrate the emotional and cultural hurdles of building internal developer platforms, equating “Lotus Eaters” to distraction and “Cyclops” to integration complexity. He argues against blindly following trends (“Sirens”) and emphasizes that platform success depends on measurable business goals (like MTTR) and collaboration with users, rather than just technical implementation.
  • Why Watch: This session is worth watching for a non-technical, philosophical perspective that focuses on the human and strategic pitfalls of platform engineering, reminding teams to avoid “hubris” and focus on the journey of user adoption.

10. Your Kubernetes Playbook at Your Fingertips: Advanced Troubleshooting with AI

  • Speakers: David vonThenen & Yash Sharma
  • Summary: The speakers demonstrate how to integrate organizational runbooks (specifically an NSA hardening guide) into an AI troubleshooting workflow using k8sgpt and Retrieval-Augmented Generation (RAG). They show a live demo using the Model Context Protocol (MCP) to connect an LLM with Kubernetes tools, allowing the agent to analyze a cluster against specific, private compliance documents rather than just generic knowledge. (The retrieval step is sketched after this list.)
  • Why Watch: This is valuable for teams wanting to move beyond generic “ChatGPT for Ops” to building tailored AI agents that respect strict internal security policies and specific operational playbooks.
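
This is not k8sgpt's internal implementation, but the retrieval half of the workflow is easy to picture: chunk the private runbook, embed the chunks, and pull the most relevant ones into the prompt for a specific cluster finding. The embed() function below is a toy stand-in for a real embedding model.

```python
# Minimal RAG sketch (not k8sgpt's actual implementation): embed runbook
# chunks once, then pull the most relevant ones into the prompt for a
# specific cluster finding. embed() is a toy stand-in for a real embedding model.
import math

def embed(text: str) -> list[float]:
    """Toy stand-in: hashed bag-of-words. Replace with a real embedding model."""
    vec = [0.0] * 64
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# 1. Index the private runbook (e.g. sections of a hardening guide).
runbook_chunks = [
    "Containers must not run as root; set runAsNonRoot in the securityContext.",
    "Avoid hostPath volumes; they expose the node filesystem to the workload.",
    "Network policies should default-deny ingress in every namespace.",
]
index = [(chunk, embed(chunk)) for chunk in runbook_chunks]

# 2. Retrieve the chunks most relevant to a specific cluster finding.
finding = "Pod in namespace payments runs as root and mounts a hostPath volume"
query = embed(finding)
top_chunks = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)[:2]

# 3. Ground the LLM in the retrieved policy text instead of generic knowledge.
prompt = (
    "Using only the runbook excerpts below, explain whether this finding violates policy.\n\n"
    + "\n---\n".join(chunk for chunk, _ in top_chunks)
    + f"\n\nFinding: {finding}"
)
print(prompt)
```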

The Bigger Picture

These sessions represent more than individual technical topics. Together, they illustrate how Kubernetes is evolving from a container orchestration platform into the operating system for AI-native infrastructure.

The Certified Kubernetes AI Conformance Program and stable DRA provide the foundation for predictable accelerator scheduling. The emergence of agent-specific controllers shows the platform adapting to manage autonomous, reasoning systems. Crossplane and platform engineering patterns deliver the self-service experience developers need without sacrificing the control platform teams require. The Gateway API evolution and policy-as-code extensions demonstrate the ecosystem maturing to handle the unique requirements of AI workloads.

For platform engineering teams, the message is clear: the tools and standards have arrived. The Gateway API provides modern networking primitives, DRA handles complex hardware scheduling, Crossplane enables governed self-service, and policy-as-code extends to model governance. The foundation is set for the next phase: operating intelligent systems at massive scale.

Despite some vendor fatigue and logistical issues, including a memorable WiFi outage, the community demonstrated strong alignment around practical, value-driven solutions over hype. The focus on standardization, conformance programs, and proven patterns suggests the ecosystem is maturing past the experimentation phase into production-grade infrastructure for AI-native organizations.