For years, AI in operations was plagued by noise: overwhelming alerts, false positives, and a lack of actionable insights. The tools available promised much but often delivered little, leading to a loss of trust. However, with the groundbreaking work by platforms like OpenAI and the emergence of trustworthy AI tools like Copilot, the potential of AI in operations has never been nearer or clearer. AIOps doesn’t just sound plausible; it could actually deliver meaningful value, helping us solve problems faster and more effectively.

We’ve written about this extensively in the past, but troubleshooting in complex cloud native systems assumes (as noted in this InfoQ article on what cloud native systems require beyond observability) that:

- You have the required permissions to all of the relevant systems
- You understand the entire stack and all of the technologies within all these systems
- You have the experience required to understand the issue sufficiently to solve it

AIOps has been what many see as the promise to help make this happen ‘automagically’.

At Komodor, as at every technology-driven company today, this new wave of AI piqued our interest. We envisioned a tool that could transform AIOps from its current track record into something far more useful, one that not only detects issues in K8s environments but also provides precise, actionable steps to resolve them. Today, I can confidently state that we have managed to deliver on the promise of AIOps for Kubernetes.

Introducing KlaudiaAI

We’re excited to announce today our latest product, KlaudiaAI, designed from the ground up to tackle the unique challenges of Kubernetes operations. Klaudia is Komodor’s AI agent, which helps identify the root cause of issues in Kubernetes. It not only pinpoints their source but also provides meaningful context and explanations, helping teams understand why issues occur and how to prevent them in the future.
To do so, Klaudia leverages Komodor’s comprehensive dataset of past investigation flows, historical changes, events, and metrics to power precise diagnostics and actionable insights, with AI enhancing the ability to scale across the entire Kubernetes stack. This data was gathered from hundreds of companies of diverse sizes, and the analysis and engineering built on top of it amount to hundreds of developer years, all aimed at optimizing real-world Kubernetes operations at modern speed and scale.

Why We Created Klaudia

When we first set out to develop Klaudia, we weren’t entirely convinced AI could provide the tangible value we had hoped for in complex Kubernetes-based systems. We also weren’t sure where AI would have the most impact on our platform and on Kubernetes operations in general. Initial experiments, like a free-text conversational advisor, resulted in spectacular failures: hallucinations, irrelevant recommendations, and overall poor performance.

But we weren’t ready to give up. The potential was there, and we were willing to put in the research and effort to find it. We dug deeper. We realized that when we narrowed the criteria and created guardrails for refining the data and fine-tuning the query, GenAI performed significantly better than when it had an overwhelming amount of data to sort through all at once, even though the same data was ultimately being fed to the AI. As part of our research, we found that feeding a large data set all at once versus feeding it piecemeal, query by query (as we do), produced significantly different results in terms of precision, accuracy, and hallucinations.

Another interesting thing to note was the remarkable improvement with each version upgrade of the models we tested against. As we moved from one AI model to another, starting with ChatGPT and evolving through Claude 3 and Bedrock, each change brought a marked boost in accuracy and reliability.
By refining our approach and feeding the model only relevant pieces of data in a piecemeal and gradual manner, we discovered that the very specific and narrow use case of root cause analysis (RCA) delivered immense value. In the traditional “5 Whys” practice of root cause analysis, human engineers keep asking “But why?” until they reach the root cause. It turns out the machines work in very much the same way, and understanding this quirk elevates their capabilities considerably. It plays to the AI’s strengths.

How Klaudia Works

Klaudia is integral to the Komodor platform. Once our system detects an issue or a bad state within your Kubernetes environment, Klaudia springs into action. It conducts an in-depth investigation (in the background, so there is no need to waste precious time in a holding pattern), systematically requesting additional data points and providing step-by-step guidance on how to resolve the issue.

At Komodor we developed a custom RAG (Retrieval-Augmented Generation) model to fetch data directly from Kubernetes and from our extensive Komodor telemetry and historical data, ensuring that Klaudia’s recommendations are both relevant and precise. This also gives us control over the data being fed to the AI (a critical aspect we’ll get to shortly) and greater customizability through fine-tuning, not to mention speed and scale.

Through an iterative process, Klaudia continues to query and refine the data set until it determines the investigation is complete (or until it reaches 10 iterations), which keeps the analysis thorough without overwhelming the user. Finally, the results are presented to the user, complete with supporting evidence. This lets the user follow what was detected, validate it, or implement the fix manually, and as an added bonus learn from the AI’s RCA. The real-world example below shows how this works in practice.
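To make the iterative loop described above more concrete, here is a minimal sketch in Python. It is an illustration only, not Klaudia’s actual implementation: the `investigate` function, the `Step` and `Investigation` types, and the stubbed `fake_fetch` and `fake_model` helpers are all hypothetical names invented for this example, and the sample data is made up. Only the overall shape (fetch one data point at a time, ask the model, stop when it reaches a conclusion or after 10 iterations) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

MAX_ITERATIONS = 10  # iteration cap mentioned in the text


@dataclass
class Step:
    """Hypothetical model reply: either a conclusion or the next data point to fetch."""
    conclusion: Optional[str] = None
    next_request: Optional[str] = None


@dataclass
class Investigation:
    """Evidence gathered so far, plus the final conclusion (if any)."""
    evidence: Dict[str, str] = field(default_factory=dict)
    conclusion: Optional[str] = None
    iterations: int = 0


def investigate(
    ask_model: Callable[[Dict[str, str]], Step],
    fetch: Callable[[str], str],
    first_request: str,
    max_iterations: int = MAX_ITERATIONS,
) -> Investigation:
    """Feed the model one data point per round until it concludes or the cap is hit."""
    result = Investigation()
    request: Optional[str] = first_request
    while request is not None and result.iterations < max_iterations:
        result.iterations += 1
        # Fetch only the data point that was asked for, not the whole data set.
        result.evidence[request] = fetch(request)
        step = ask_model(result.evidence)
        if step.conclusion is not None:
            result.conclusion = step.conclusion
            break
        request = step.next_request
    return result


# Stubs standing in for a real data source and a real model (assumptions).
def fake_fetch(name: str) -> str:
    data = {
        "pod_events": "CrashLoopBackOff",
        "pod_logs": "failed to parse config key 'db_url'",
        "recent_changes": "ConfigMap app-config changed",
    }
    return data.get(name, "")


def fake_model(evidence: Dict[str, str]) -> Step:
    # Mimics "But why?": keep asking for one more data point until the cause is clear.
    if "recent_changes" in evidence:
        return Step(conclusion="Malformed key in changed ConfigMap app-config")
    if "pod_logs" in evidence:
        return Step(next_request="recent_changes")
    return Step(next_request="pod_logs")


report = investigate(fake_model, fake_fetch, "pod_events")
print(report.iterations, report.conclusion)
```

The evidence dictionary doubles as the supporting evidence shown to the user at the end, and the iteration cap keeps a model that never reaches a conclusion from looping indefinitely.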
Real-World Example

A pod failure occurred due to a corrupted key inside a changed ConfigMap. Although the ConfigMap was properly mounted, it contained malformed data. This was not immediately obvious and required deep analysis to diagnose. The Klaudia agent took just a couple of seconds to digest logs, historical changes, and K8s events, and flag the root cause. It then provided clear instructions for remediation, including a direct link to the right ConfigMap.

Why Klaudia Stands Out

Compared to other tools on the market (and we’ve tried them all, and also spoken to the engineers developing them), Klaudia has proven to be a game changer, particularly for developer autonomy and productivity. While other tools struggle with generalization and often overwhelm users with unnecessary information and errors, Klaudia is narrowly focused on Kubernetes RCA, ensuring that only the most relevant data is considered.

We’ve monitored the results closely and gathered the following findings:

Experienced ops engineers can save anywhere from minutes to hours of troubleshooting time by receiving suggestions in mere seconds, without having to invest much manual toil. While that’s great, it’s not really the game changer.

The biggest advantage is the autonomy afforded to non-experts. We’ve found that non-experts (typical developers, who make up the bulk of engineering teams) don’t need to escalate common issues at all. Instead of bogging down already overwhelmed operations engineers, they can troubleshoot autonomously, without having to be experts in the tools, stacks, or interdependencies. 🤯

So if we refer back to the three bullets above on why troubleshooting is hard and assumes a lot of prerequisite knowledge, even with great platforms (like Komodor) that help on the operations side, Klaudia now closes that prerequisite knowledge gap for all engineers. That is a huge leap for AIOps, and it restores trust in its original founding purpose.
We’ve been dogfooding Klaudia ourselves since its early design, and it has been delivering significant value to our very own teams, as confirmed by feedback from design partners and early users.

What Sets Klaudia Apart

Klaudia makes this possible through a unique and proprietary approach. Ultimately, one of the most common problems in operations is the “garbage in, garbage out” challenge: if the data being reviewed or fed to the model is subpar, the recommendations will be subpar too. Building upon a foundation of clean, well-structured data, and investing heavily in algorithms with error thresholds that keep the data relevant and fine-tuned, allows us to minimize hallucinations and ensure that Klaudia delivers consistent, trustworthy results.

While this approach results in longer calculation times (and higher costs for us), the trade-off is clearly worth it: we’ve seen more accurate results that save both time and resources and that can be applied immediately.

Just look at the outputs of four different K8s AI agents that were tasked with solving the same ConfigMap issue as Klaudia. None of them came near Klaudia’s precision in root cause detection and remediation instructions. They either failed to zero in on the actual cause or offered no context or insight regarding the next steps. If you had a severe incident in production, who would you trust to give you the right answers?

We feel we’re finally onto something and on the path to breaking the cycle of broken trust in AIOps by developing a tool that our own teams at Komodor use every day. Unlike other platforms whose AI features are largely bypassed or disabled due to reliability concerns, we rely on Klaudia for every investigation, and it has become an indispensable part of our operations.

How to Get Started with Klaudia

Klaudia is baked into our platform, ready to assist you every time you click on an unhealthy resource.
Once an investigation is initiated, it runs seamlessly in the background, and you’ll receive an in-platform notification when it’s complete, saving you valuable time.

Ready to experience the future of AIOps? Get started with a 14-day free trial today and see how Klaudia can revolutionize your Kubernetes operations.

Or, join us for a deeper dive into Klaudia at our upcoming webinar on Sept 25, 1pm ET. Register now and learn more about how Klaudia is setting new standards for AIOps in Kubernetes management. Can’t make it? Don’t worry, the webinar will be available on-demand after the event.