Komodor Blog

Troubleshooting articles
Welcome to Komodor's blog, your go-to resource for insights on all things Kubernetes. Stay tuned for expert advice, in-depth tutorials, and the latest industry trends to help you throughout your K8s journey.

When AI Writes the Code, Who Keeps Production Running?

6 min read

The acceleration of AI-assisted development has created an asymmetric problem. Developers got their force multiplier. SREs are still using the same playbook they had five years ago, except now they're responsible for exponentially more code, written by tools that prioritize speed over operational clarity.

AI SRE in Practice: Accelerating Engineer Onboarding with Contextual Expertise

6 min read

Part 7 of our AI SRE in Practice Series. This scenario walks through how AI-augmented knowledge transfer changes the onboarding experience, using a real example from a containers team implementing changes to HiveMQ infrastructure.

AI SRE in Practice: Diagnosing AWS CNI IP Exhaustion Before Widespread Outage

6 min read

Part 6 of our AI SRE in Practice Series. In this scenario we walk through an AWS CNI IP exhaustion incident where 15 services experienced outages before platform teams identified the root cause.

AI SRE in Practice: Tracing Policy Changes to Widespread Pod Failures

6 min read

Part 5 of our AI SRE in Practice Series. This scenario walks through a policy enforcement incident where a seemingly minor configuration change caused widespread pod failures that required deep investigation across the cluster to understand the scope and root cause.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

6 min read

When a new, competing open-source Kubernetes troubleshooting agent launched, we put both tools through identical real-world failure scenarios that our customers typically encounter. The objective was to benchmark Klaudia Agentic AI against the open-source AI agent and compare their performance across these scenarios.

AI SRE in Practice: Resolving Node Termination Events at Scale

6 min read

Part 4 of our AI SRE in Practice Series. In this part we examine what happens when a node terminates unexpectedly, and tackle the harder questions of why it happened and how to prevent it from happening again.

AI SRE in Practice: Diagnosing Configuration Drift in Deployment Failures

5 min read

Part 3 of our AI SRE in Practice Series. In this part we cover how an AI SRE helps diagnose configuration drift in deployment failures.

AI SRE in Practice: Resolving GPU Hardware Failures in Seconds

4 min read

Part 2 of our AI SRE in Practice Series. In this post we discuss resolving GPU hardware failures in seconds.

When is it ok or not ok to trust AI SRE with your production reliability?

3 min read

This series demonstrates what AI SRE trained on real workloads actually looks like in practice. We're going to walk through real troubleshooting scenarios that our customers encounter daily, showing the before and after of AI-powered investigations.