Komodor is an autonomous AI SRE platform for Kubernetes. Powered by Klaudia, it’s an agentic AI solution for visualizing, troubleshooting and optimizing cloud-native infrastructure, allowing enterprises to operate Kubernetes at scale.
Part 2 of the AI SRE in Practice series. In this post we discuss resolving GPU hardware failures in seconds.
This series demonstrates what an AI SRE trained on real workloads actually looks like in practice. We walk through real troubleshooting scenarios our customers encounter daily, showing the before and after of AI-powered investigations.
SRE teams are about to feel even more pressure. GPU-heavy computing is breaking the assumptions today's clusters were built on, enterprises are beginning to trust autonomous operations, and cost pressure is pushing consolidation across the cloud-infrastructure stack. Based on these forces, here are my 2026 Kubernetes predictions, along with best-practice recommendations to help platform teams prepare for what reliable operations will mean next year.
There's a bigger story here that every platform team needs to understand: K8s is finally acknowledging that cluster utilization is fundamentally broken.
The teams that learn to build and coordinate AI agent capabilities alongside human expertise will be the ones that thrive in the increasingly complex world of cloud-native infrastructure and recover faster when AI-driven incidents become more common.
KubeCon 2025 confirms AI on Kubernetes is a production reality. This post explores the platform challenges, from managing large LLMs and GPU resources to empowering new personas like data scientists, and the shift toward self-service and intelligent, automated operations.
With autonomous self-healing and continuous optimization, we're flipping the script on the traditional management model: organizations can move from firefighting to proactive resilience. The reactive model can't scale with the complexity and pace of modern cloud-native infrastructure. Teams that adopt autonomous operations gain compounding advantages: more time for innovation, lower operational costs, better reliability, and SRE teams focused on building the future instead of fighting fires in the present.
Kubernetes v1.34, codenamed “Of Wind & Will (O' WaW)”, brings a wide range of enhancements aimed at making clusters more efficient, secure, and easier to manage.