The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a "naive" implementation of Large Language Models (LLMs) is not just ineffective; it can be dangerous.

This article, based on my live session, Building Quality-Driven Agentic AI in Noisy Big Data Environments, explores the technical realities of building Klaudia, an agentic AI solution for cloud-native infrastructure. We will dissect the architectural patterns, the "Swiss Cheese" validation model, and the hybrid intelligence strategies required to bridge the gap between hallucination and reliability.

## The Problem with "Naive" LLM Usage in Ops

Before discussing solutions, we must address why standard GenAI approaches fail in operations. SRE is inherently complicated; finding human experts is difficult, and automating their intuition is even harder. When developers attempt to simply wrap an LLM around their logs, they encounter three primary friction points:

- **The "Pleasing" Bias (Hallucinations):** LLMs are designed to satisfy the user. In a chat context, this is a feature; in an SRE context, it is a bug. As Shwartz notes, "Sometimes it hallucinates. Sometimes it lies. Sometimes... I get things that look simply unreal." An agent without guardrails might confidently recommend a command that exacerbates an outage.
- **Context Window Saturation:** Even models with 200k-token windows fail when flooded with the massive data streams typical of Kubernetes clusters. "For complicated domains, you can’t really reach that close to the context window because your LLM will start to hallucinate," Shwartz warns.
- **Garbage In, Garbage Out:** Without rigorous data engineering, the noise of big data overwhelms the signal.
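To make the context-saturation point concrete, one common mitigation is a simple token-budget guard that caps how much raw log data ever reaches the model. The sketch below is a hedged illustration, not Komodor's implementation; the chars-per-token heuristic and the budget value are assumptions for demonstration (a real system would use the model's actual tokenizer).

```python
# Illustrative token-budget guard: trim noisy log input before prompting.
# The 4-chars-per-token heuristic and the budget are illustrative assumptions.

APPROX_CHARS_PER_TOKEN = 4
TOKEN_BUDGET = 8_000  # stay well below the model's advertised context window


def estimate_tokens(text: str) -> int:
    """Cheap token estimate; a production system would use the real tokenizer."""
    return len(text) // APPROX_CHARS_PER_TOKEN


def fit_to_budget(log_lines: list[str], budget: int = TOKEN_BUDGET) -> list[str]:
    """Keep the most recent log lines that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for line in reversed(log_lines):  # recent lines usually carry the most signal
        cost = estimate_tokens(line) + 1  # +1 for the newline/separator
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Keeping the newest lines is itself a design choice; curating *which* lines survive the cut is exactly the data-engineering work the article argues cannot be skipped.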
If you feed an LLM raw, uncurated data, you will receive a nonsensical response. To solve these issues, we must move away from a single, monolithic model prompt and toward a multi-agent architecture.

## The Architecture: Orchestrators and Subject Matter Experts

To handle the complexity of cloud-native infrastructure, Komodor structured their AI, Klaudia, as a family of agents rather than a single bot. This mimics a human SRE team, where generalists coordinate with specialists.

### The Orchestrator

At the top level sits the orchestrator (or detector) agent. Its primary role is not to solve every problem but to understand the context and delegate. It identifies that an issue exists and determines which specialist is required to diagnose it.

### The Subject Matter Experts (SMEs)

The orchestrator calls upon Subject Matter Expert (SME) agents. These are highly specialized agents trained on specific domains, such as:

- AWS Expert
- GPU Expert
- Istio Expert
- vLLM Expert

This modularity allows for reuse and precision. The same AWS SME agent can be utilized by different workflows, ensuring that deep domain knowledge is encapsulated and maintained centrally. By narrowing the scope of each agent, you reduce the noise they process and significantly lower the probability of hallucination.

## The Quality Framework: Trust Over Breadth

When building agentic AI, engineers often obsess over how many things the AI can do. Shwartz argues the focus must first be on trust. Komodor defines success by asking, "What would a really senior SRE do?" This leads to a strict hierarchy of needs for the AI:

**Do No Harm:** The first rule of autonomous SRE is safety. An agent should never suggest deleting a namespace or editing a sensitive secret without extensive validation. It must explain its reasoning and ensure its proposed remediation won't crash the system.

**Depth and Precision:** Rather than trying to cover the entire Kubernetes landscape immediately, developers should focus on deep vertical integration.
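The "do no harm" rule can be enforced mechanically as a final gate, independent of the LLM's own judgment. Below is a minimal sketch of such a guard, assuming a deny-list of destructive command patterns; the patterns and function names are illustrative assumptions, not Komodor's actual safety policy.

```python
import re

# Illustrative deny-list: remediation commands an agent should never propose
# without human review. These patterns are examples, not an official policy.
DESTRUCTIVE_PATTERNS = [
    r"\bkubectl\s+delete\s+(namespace|ns)\b",
    r"\bkubectl\s+(edit|delete)\s+secret\b",
]


def is_safe_remediation(command: str) -> bool:
    """Return False if a proposed command matches any destructive pattern."""
    return not any(re.search(p, command) for p in DESTRUCTIVE_PATTERNS)
```

A deterministic gate like this is deliberately dumb: it cannot hallucinate, so it serves as one of the independent "Swiss Cheese" layers discussed below rather than another probabilistic opinion.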
Komodor spent months solely mastering Pod issues, the most common unit of work, before expanding to Deployments, StatefulSets, and Nodes.

**Coverage:** Only after trust and precision are established should the scope expand to edge cases.

## The "Swiss Cheese" Validation Model

Perhaps the most critical technical takeaway from the webinar is the evaluation strategy. Because LLMs are non-deterministic, a single testing method is insufficient. Borrowing from the "Swiss Cheese Model" of risk management (referenced in a recent Anthropic paper), effective validation requires layering multiple imperfect defenses to create a solid barrier against failure. Komodor employs a four-tier validation suite:

### 1. Local Development (The Speed Layer)

"Speed equals quality," Shwartz asserts. Building agents involves massive trial and error. Developers must be able to iterate fast, often running 50 to 100 iterations involving prompt engineering, data engineering, and tool-calling adjustments for a single agent capability.

Critical requirement: you cannot use mock data. "If you fake the data you feed into the LLM, you are going to get a fake response." Local environments must use real, messy, complicated data to simulate reality.

### 2. Golden Standards (The Regression Layer)

As LLMs evolve, they drift. A prompt change that fixes a network-error diagnosis might break a memory-leak diagnosis. To combat this, you need a set of Golden Standards: a library of over 100 specific failure scenarios (e.g., OOMKilled, ImagePullBackOff) used as regression tests. This ensures that as the AI gets "smarter," it doesn't forget basic competencies.

### 3. Shadow Agents (The Production A/B Layer)

"Production is too late to fail," yet production is also the only place where reality truly exists. To bridge this, Komodor uses Shadow Agents: running multiple versions of an agent (or multiple distinct agents) on the same production issue simultaneously.
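The shadow-agent pattern can be sketched as a small wrapper: both agents see the same issue, only the primary's answer reaches the user, and disagreements are recorded for offline review. This is a hedged illustration under assumed names, not Komodor's actual implementation.

```python
from typing import Callable

# An agent is modeled here as any callable from an issue description to a diagnosis.
Agent = Callable[[str], str]

# Disagreements are collected for engineers to review offline.
discrepancies: list[tuple[str, str, str]] = []


def run_with_shadow(primary: Agent, shadow: Agent, issue: str) -> str:
    """Run both agents on the same issue; users only ever see the primary's output."""
    primary_answer = primary(issue)
    try:
        shadow_answer = shadow(issue)  # experimental code must never break prod
    except Exception:
        shadow_answer = "<shadow failed>"
    if shadow_answer != primary_answer:
        discrepancies.append((issue, primary_answer, shadow_answer))
    return primary_answer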
By comparing the outputs of the primary agent against a shadow agent, engineers can identify discrepancies and measure performance without exposing the user to experimental code. 4. LLM as a Judge (The QA Layer) Evaluating natural language responses at scale is impossible for humans. The solution is using an LLM as a Judge. A separate, high-reasoning LLM is tasked with grading the output of the SRE agents. It evaluates the response based on specific criteria: Did it identify the root cause? Is the evidence sound? Is the remediation safe?,. Hybrid Intelligence: The Unsung Hero While Generative AI gets the headlines, it is not always the right tool for the job. Shwartz admits that for complex patterns like cascading failures, pure LLM approaches initially failed. The context was too broad, and the connections were too subtle. The solution was a Hybrid Approach: Traditional Machine Learning (ML): Used first to filter, cluster, and correlate the massive intake of raw logs and events. Generative AI (LLM): Sprinkled on top of the curated data provided by the traditional ML layer. "Traditional machine learning is very useful for filtering and clustering noisy big data before it reaches the LLM," preventing context overflow and hallucinations. This combination allowed Klaudia to achieve precision levels comparable to traditional Root Cause Analysis tools but with the explainability of an LLM. The User Experience: Explainability and Remediation The output of this rigorous architecture is visible in the agent's interaction with the user. When Klaudia diagnoses a crash loop or an availability issue, it provides a structured response akin to a human incident report: What Happened: A clear summary of the bad state. Related Evidence: The logs, events, or metrics that prove the diagnosis. Shwartz emphasizes that "it’s critical to have that sort of evidence" to build trust. Suggested Remediation: Actionable steps to fix the issue. 
Rejected Alternatives: Crucially, the AI explains what solutions it considered and rejected, offering insight into its reasoning process. Conclusion Building an agentic AI for production infrastructure is 20% prompt engineering and 80% custom tooling, evaluation, and monitoring. It requires a shift-left mentality where testing happens locally on real data, a multi-agent architecture that respects domain expertise, and a humility that prioritizes "doing no harm" over showing off. As Itiel Shwartz summarized, "It’s better to be silent than to give the wrong answer". By implementing hierarchical agents, rigorous "Swiss Cheese" validation, and hybrid intelligence, engineering teams can finally build AI SREs that don't just chat, but actually work. About Komodor Komodor reduces the cost and complexity of managing large-scale Kubernetes environments by automating day-to-day operations. As well as health and cost optimization. The Komodor Platform proactively identifies risks that can impact application availability, reliability and performance, while providing AI-assisted root-cause analysis, troubleshooting and automated remediation playbooks. Fortune 500 companies in a wide range of industries including financial services, retail and more. Rely on Komodor to empower developers, reduce TicketOps, and harness the full power of Kubernetes to accelerate their business. The company has received $67M in funding from Accel, Felicis, NFX Capital, OldSlip Group, Pitango First, Tiger Global, and Vine Ventures. For more information visit Komodor website, join the Komodor Kommunity, and follow us on LinkedIn and X. To request a demo, visit the Contact Sales page. Media Contact:Marc GendronMarc Gendron PR for Komodormarc@mgpr.net617-877-7480