7 Kubernetes Predictions for 2026 – AI Will Push SRE to its Limit

Note: This is a reprint of an article published on VMBLOG.

As AI workloads shift from training to massive-scale inference, SRE teams are about to feel even more pressure. GPU-heavy computing is breaking the assumptions today’s clusters were built on, enterprises are beginning to trust autonomous operations, and cost pressure is pushing consolidation across the cloud-infrastructure stack. Based on these forces, here are my Kubernetes predictions for 2026, along with some best-practice recommendations to help platform teams prepare for what reliable operations will mean next year.

  1. As AI/ML use continues to increase, more workloads will move from training to inference. Even the new GKE experiments are showing signs of this: the huge node counts they scale up to already carry a significant share of inference workloads.
  2. AI SRE will make a significant adoption impact. As more organizations deploy cloud-native infrastructure and GenAI cuts time to market for their competitors, platform teams will realize that to keep innovating and leading, they need to scale up their SRE teams. With Kubernetes experts at a premium, AI SRE will prove to be the missing ingredient that lets them adapt.
  3. Cloud operations will start to move toward autonomy. As more AI-powered tooling is adopted, and as users come to trust it, we will see traditionally conservative enterprises begin allowing some operations to be autonomously managed by AI.
  4. Cloud-native job-queueing systems like Kueue will see a major uptick in adoption as the race to deploy HPC, AI/ML, and even quantum applications heats up. Because earlier queueing systems were not built for this scale, new tooling will quickly be adopted across the industry.
  5. With applications and workloads relying on more compute than ever before, Kubernetes scheduling will require a makeover. The current pod-centric approach will not handle this increased scale, so a more workload-specific approach to scheduling will be required. The community is actively working on this through KEP-4671: Gang Scheduling, which brings workload-level scheduling natively into Kubernetes.
  6. GPU overprovisioning will become a more pressing problem. As the macroeconomic climate continues to push toward greater efficiency, organizations will have to find ways to optimize their GPU monitoring and usage.
  7. FinOps tools will start to consolidate with other products in the cloud-infrastructure stack. Similar to what is happening in cloud security, products will consolidate different capabilities, including observability, insights, tracing, cost optimization, and troubleshooting, into a single platform. This will remove cognitive load from teams struggling to keep up with too many dashboards and products.

These trends point to a 2026 where Kubernetes complexity, AI-driven operations, and compute-heavy workloads reshape what “good” SRE looks like. To stay ahead of the curve, platform teams should consider the following steps:

  1. Prepare your clusters for AI-driven autonomy
    Standardize telemetry, event schemas, and operational APIs so AI SRE agents can reliably diagnose and execute actions. Wrap all automated operations in policy-as-code, dry-run workflows, and auditability to ensure safe incremental automation.
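    One way to wrap automated operations in policy-as-code is an admission policy that rejects changes lacking an audit trail. A minimal sketch using Kyverno (assuming Kyverno is installed in the cluster; the annotation key is hypothetical):

    ```yaml
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-change-audit-trail
    spec:
      validationFailureAction: Enforce
      rules:
        - name: require-audit-annotation
          match:
            any:
              - resources:
                  kinds: ["Deployment"]
          validate:
            message: "Automated changes must carry an audit annotation identifying the actor."
            pattern:
              metadata:
                annotations:
                  # Hypothetical annotation an AI SRE agent would stamp on every change
                  ops.example.com/change-source: "?*"
    ```

    Pairing a policy like this with server-side dry runs (`kubectl apply --dry-run=server`) lets an agent validate a proposed change against live admission policies before it ever mutates cluster state.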
  2. Modernize scheduling for GPU- and HPC-heavy workloads
    Begin testing Gang Scheduling and Kueue-like job orchestration. Update autoscaling, quotas, and node pools to support workload-level guarantees rather than pod-level heuristics; this will matter as inference and HPC workloads come to dominate compute demand.
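    As a sketch of what Kueue-style orchestration looks like in practice, the manifests below define a cluster-level GPU quota and submit a suspended Job against it. Kueue admits the Job only when quota is available; the queue names, namespace, image, and the `default-flavor` ResourceFlavor are illustrative and assume Kueue is installed:

    ```yaml
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: gpu-queue
    spec:
      namespaceSelector: {}  # admit workloads from any namespace
      resourceGroups:
        - coveredResources: ["nvidia.com/gpu"]
          flavors:
            - name: default-flavor   # an existing ResourceFlavor is assumed
              resources:
                - name: "nvidia.com/gpu"
                  nominalQuota: 8
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: team-a-queue
      namespace: team-a
    spec:
      clusterQueue: gpu-queue
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: inference-batch
      namespace: team-a
      labels:
        kueue.x-k8s.io/queue-name: team-a-queue
    spec:
      suspend: true   # Kueue unsuspends the Job once quota is granted
      parallelism: 4
      completions: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: registry.example.com/inference:latest  # placeholder image
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
    ```

    The key design point is that admission happens at the Job level, not per pod: either all four workers get capacity or none are started, which avoids the partial-admission deadlocks that pod-centric scheduling produces for batch and HPC workloads.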
  3. Treat GPU efficiency and capacity as SLOs
    Instrument GPU usage, enforce right-sizing at admission, and integrate GPU saturation, fragmentation, and queue depth into autoscaling signals. Optimizing GPU utilization must become a core reliability responsibility, not just a cost exercise.
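    As a sketch of treating GPU utilization as an alertable reliability signal, the rule below assumes the NVIDIA DCGM exporter (which exposes the `DCGM_FI_DEV_GPU_UTIL` metric) and the Prometheus Operator are running; the 30% threshold and rule names are illustrative:

    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: gpu-utilization-slo
    spec:
      groups:
        - name: gpu-slo
          rules:
            - alert: GPUUnderutilized
              # Sustained low utilization signals overprovisioned GPU capacity
              expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL) < 30
              for: 30m
              labels:
                severity: warning
              annotations:
                summary: "GPU utilization below 30% for 30m on {{ $labels.node }}"
    ```

    Once signals like this exist, the same queries can feed autoscaler decisions and admission-time right-sizing, which is what moves GPU efficiency from a monthly cost report into a live reliability loop.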