Building Quality-Driven Agentic AI in Noisy Big Data Environments

Itiel Shwartz
Komodor CTO and Co-Founder

In this webinar, Itiel Shwartz, CTO of Komodor, presents strategies for building reliable, production-grade AI agents capable of troubleshooting complex Kubernetes environments without hallucinating. The session details the architecture of “Klaudia,” the agentic AI technology powering Komodor's autonomous AI SRE, emphasizing a multi-agent system that combines traditional machine learning for data filtering with LLMs for reasoning. The webinar concludes with a practical demonstration and a breakdown of Komodor's rigorous validation pipeline, which includes shadow agents and an “LLM as a judge.”

TL;DR

  • Speaker: Itiel Shwartz, Komodor CTO and Co-Founder
  • Focus: The technical challenges and architectural solutions involved in creating trustworthy “Agentic AI” for noisy, big-data infrastructure environments.
  • Core Concepts: Transitioning from naive LLM usage to a structured multi-agent system utilizing orchestrators and domain-specific experts to ensure precision and safety.
  • Includes: A deep dive into the “Quality Framework,” a breakdown of validation techniques (local development, golden standards, shadow agent testing), and a live demo of the AI SRE diagnosing a Kubernetes crash loop.
  • Wrap-up: The session closes with key takeaways emphasizing that a hybrid approach (Traditional ML + GenAI) and extensive custom tooling are non-negotiable for success in production.

Key Takeaways

  • Structure Agents Hierarchically: Effective production AI uses an ‘orchestrator’ agent to manage specialized subject-matter expert agents (e.g., AWS expert, GPU expert) rather than a single generic model.
  • Prioritize Trust Over Breadth: The “Quality Framework” dictates that an AI SRE must first “do no harm” and provide explainability before attempting to cover a wide breadth of edge cases.
  • Implement “Swiss Cheese” Validation: Relying on a single test method fails with LLMs; instead, use a layered approach including local testing, “golden standard” benchmarks, and shadow agents running in parallel to production.
  • Adopt Hybrid Intelligence: Traditional machine learning is very useful for filtering and clustering noisy big data before it reaches the LLM, preventing context window overflow and hallucinations.
  • Use LLM as a Judge: Because LLM outputs can be non-deterministic, using a separate, high-reasoning LLM to evaluate the accuracy and relevance of agent responses is highly effective for QA.

Webinar Transcript 

Please note that the following text may have slight differences or mistranscription from the audio recording.

Ilan: Thank you everyone for your patience. Sorry for the technical difficulties. We’re going to go ahead and get started with today’s webinar: “How to build quality-driven agentic AI in noisy big data environments.”

Ilan: Yes, the webinar is being recorded. It should be about 30 minutes or so. If you have any questions, feel free to use the Q&A section to ask them. Before I let Itiel talk, I’ll introduce him, we’ll run a short poll, and then I’ll hand over the keys.

Ilan: Itiel is the Komodor co-founder and CTO. Komodor, for all of you that don’t know, is the autonomous AI SRE for cloud-native infrastructure. Itiel has been leading our AI initiatives at Komodor for the last couple of years, and he’ll definitely go into more details on that. If you haven’t had a chance to hear Itiel, he also hosts a podcast called “Kubernetes for Humans,” which comes out at least once a month. He has very interesting conversations with leaders from across the cloud-native and Kubernetes communities. Lastly, Itiel is obviously a big fan of Kubernetes and LLMs. Follow him on LinkedIn; he shares great posts, especially around these two topics.

Ilan: The agenda: Itiel is going to talk to us about how we built an agentic AI architecture within Komodor. Before that, I’m going to run a quick poll. One of the main hurdles we hear from people is creating AI agents that they can trust in production. These are some of the things that will be covered in the webinar today: where they can develop the agents, dealing with hallucinations, guardrails—especially for things that run in production—and consistency. A lot of the time, we don’t see consistency in answers because of the way GenAI works.

Ilan: I’m going to give the poll 10 more seconds, and then we’ll see the results. All right. So, only one person voted—I’m going to call that a small sample size—but they said they’re dealing with agent development time.

Ilan: Having said all that, I’m going to hand over the keys to Itiel, and he will get us started.

Itiel Shwartz: Okay, that’s good. At Komodor, our mission is to help our customers overcome the challenges of Kubernetes. That means we want to take all of those manual tasks and jobs, the lack of knowledge, and the lack of trust that customers experience with Kubernetes, and address them in an autonomous way.

Itiel Shwartz: The problem is that it’s quite hard to build an autonomous AI SRE, mainly because the SRE job itself is super complicated. It’s very hard to find experts in that specific domain. Trying to automate something that complicated poses a lot of challenges. In the call today, my goal is to talk about those challenges, how we overcame them, and give my two cents on building agents in a very noisy and complicated environment.

Itiel Shwartz: The problem—and this is a general problem with LLMs and any kind of SRE work—is that LLMs really love to please users. On one hand, that makes them great: when I chat with Claude or ChatGPT, I get really good answers. But sometimes they hallucinate. Sometimes they lie. Sometimes, instead of a carefully curated response, I get things that are simply unreal. Using an LLM without guardrails and validation is almost absurd when it comes to SRE production use cases.

Itiel Shwartz: Even the best models have a very limited context window; for current agent-grade models it’s on the order of 200k tokens. We observed that for complicated domains, you can’t actually get close to that limit, because the LLM will start to hallucinate well before you reach it.
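
To make that concrete, here is a minimal sketch of budgeting an agent’s input well below the advertised window. The 4-characters-per-token heuristic and the 50% safety margin are illustrative assumptions, not Komodor’s actual numbers.

```python
# Minimal sketch: keep an agent's input far below the model's context window.
# The chars-per-token ratio is a rough heuristic, not a real tokenizer;
# swap in an actual tokenizer for production use.

MAX_CONTEXT_TOKENS = 200_000   # typical advertised window
SAFETY_FRACTION = 0.5          # stay well below the limit, per the talk

def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return len(text) // 4

def fit_to_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent chunks that fit inside the token budget."""
    kept: list[str] = []
    used = 0
    for chunk in reversed(chunks):       # newest evidence first
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return list(reversed(kept))          # restore chronological order

budget = int(MAX_CONTEXT_TOKENS * SAFETY_FRACTION)
context = fit_to_budget(["pod events ...", "container logs ...", "describe output ..."], budget)
```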

Itiel Shwartz: Other than that, just like with humans: garbage in, garbage out. If you give your LLM bad data, you are going to get a very bad response. You need to do a lot of work and training even before you feed the data into the LLM so it won’t get lost.

Itiel Shwartz: In reality, naive LLM usage tends not to work. I know there are a couple of very cool open-source projects in our domain, and K8sGPT is one of them. But from what we observe, both in internal benchmarks and in reality, those things don’t really scale into production use cases.

Itiel Shwartz: For that, we built Klaudia. I’m going to use the term “Klaudia” quite a lot in this conversation. Klaudia is actually a family of agents working together to solve complicated issues. Because Kubernetes is very complicated and the usage needs to be very tight in terms of data in and data out, we separated our AI SRE into a family of agents.

Itiel Shwartz: One of them is the orchestrator agent. For example, the detector: its goal is to detect issues on top of Kubernetes. But this detection doesn’t happen in a vacuum. The detection agent can also call the Subject Matter Expert (SME) agents. Those agents are experts in their domain. A domain might be something like GPU, AWS, Istio, or vLLM. The same SME agent can be used by different agent flows. So, on one hand, we built flows that mirror actual tasks a real SRE would do, and on the other, those flows utilize very small agents, each with its own domain.
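
As a rough illustration of this layout (the class names, routing logic, and expert agents here are hypothetical, not Komodor’s code), an orchestrator can hold a registry of domain experts and consult only the relevant ones:

```python
# Illustrative orchestrator / subject-matter-expert sketch.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SMEAgent:
    """A small agent that is an expert in exactly one domain."""
    domain: str                          # e.g. "gpu", "aws", "istio"
    investigate: Callable[[dict], str]   # runs the domain-specific check

class Orchestrator:
    """Routes an investigation to the SME agents relevant to the issue."""
    def __init__(self, experts: list[SMEAgent]):
        self.experts = {e.domain: e for e in experts}

    def handle(self, issue: dict) -> list[str]:
        findings = []
        for domain in issue.get("domains", []):
            expert = self.experts.get(domain)
            if expert:                   # only consult experts we actually have
                findings.append(expert.investigate(issue))
        return findings

gpu_expert = SMEAgent("gpu", lambda issue: f"GPU check on {issue['pod']}: OK")
aws_expert = SMEAgent("aws", lambda issue: f"AWS check on {issue['pod']}: OK")
orchestrator = Orchestrator([gpu_expert, aws_expert])
print(orchestrator.handle({"pod": "balancer-7d9f", "domains": ["gpu", "aws"]}))
```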

Itiel Shwartz: Let’s say you want to build your AI SRE. You understand the general flow of how it should look. Now, what do you do? I think the first thing is to understand what success looks like for you. Start with something that makes sense for you and your users regarding the proper way to define a “win.”

Itiel Shwartz: For us, the key metric was and always is: “What would a really senior SRE do?” The first thing is that we shouldn’t do any harm, and we should explain ourselves. That means an SRE wouldn’t tell you, “Hey, just delete this entire namespace,” or “Edit that very sensitive secret.” Instead, most SREs will try to understand, “Okay, how can I make sure that I’m not going to mess up my system instead of fixing it?”

Itiel Shwartz: Second is depth and precision. Once we understood that we shouldn’t do any harm, we started to specialize in specific use cases. At Komodor, for the first couple of months, the only thing we focused on was how to solve Kubernetes pod issues in the best way possible. Think about Kubernetes; it is huge and has so many moving pieces. But we decided to start with one single entity, which we believed was the best entity to start with because it interests pretty much everyone: the Pod. Only after we felt confident enough with pods—and again, it took us a couple of months—did we start to add more resources like Deployments, StatefulSets, Nodes, PVCs, and so on.

Itiel Shwartz: Having the right mindset is one thing, but in reality, there are different approaches you need to take to really evaluate an LLM. There’s a really good Anthropic article released a couple of days ago that talks about the “Swiss Cheese Model.” Because LLMs are so tricky, you can’t use a single way to evaluate them. Instead, you need different tactics.

Itiel Shwartz: At Komodor, we have a combination of local development testing, golden standard benchmarks, shadow agent validation, and AI-powered validation. We have a lot of different tools and mechanisms, each with its own pros and cons. Together, this full suite lets us confidently bring our users a very accurate AI SRE. Our current success rate, or accuracy, is around 93-94%.

Itiel Shwartz: First things first: local development. We discovered—and I think everyone who has tried developing with LLMs knows this—that LLMs tend to be unreliable and unclear. It’s not obvious what will happen once you change a prompt. We are strong believers in local development. Keep as much interesting data as possible in your local environment or lab so you can iterate very fast. The number here is 50 different iterations, but in reality, for a complicated agent, it’s hundreds of iterations. In each iteration, we might change the prompt, the data engineering, or the tool calling.

Itiel Shwartz: That way, once the agent meets production, we get highly accurate results. LLMs require reality, not mocks. One of the first things we tried was simply having scenarios named “out-of-memory scenario” or “ImagePullBackOff scenario.” But the LLM picked up on the name immediately. If you fake the data you feed into the LLM, you are going to get a fake response. You need to invest a lot in how you build those test scenarios and in your ability to feed very complicated scenarios into the LLM.
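
A small sketch of that idea, assuming a hypothetical fixture format: strip or neutralize every field that names the failure, so the model only ever sees the raw evidence.

```python
# Sketch: remove tell-tale names from test fixtures so the model cannot
# shortcut the diagnosis from the scenario label. Field names are hypothetical.
import copy

def anonymize_fixture(fixture: dict) -> dict:
    """Return a copy with every human-readable scenario hint removed."""
    clean = copy.deepcopy(fixture)
    clean.pop("scenario_name", None)       # never show "oom-scenario" etc.
    clean["namespace"] = "test-ns"         # neutral, non-leading names
    clean["pod_name"] = "workload-0"
    return clean

fixture = {
    "scenario_name": "image-pull-backoff-scenario",   # would leak the answer
    "namespace": "imagepull-demo",                    # so would these names
    "pod_name": "broken-image-pod",
    "events": ["Failed to pull image 'registry/app:missing-tag': not found"],
}
print(anonymize_fixture(fixture))   # only the raw evidence remains
```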

Itiel Shwartz: Second, speed equals quality. I don’t think the market currently offers very good tooling for building agentic systems. It involves a lot of trial and error, and to do a lot of trial and error, you have to be very fast.

Itiel Shwartz: Third is debugging with full context. When things fail in your LLM, it’s not clear if it’s because the LLM had a hiccup, if your data engineering was incorrect, or if there was a problem with the tool calling. You need to have all the things we are used to as developers—like debugging and diving into a particular area. This is even more acute with LLMs because the development is trickier.

Itiel Shwartz: Lastly, production is too late to fail. This is true for any kind of development, but when your users know they are interacting with an LLM, they will quickly lose trust if it repeatedly lies or hallucinates. You need to catch most of the hallucinations and problems before going into production.

Itiel Shwartz: Next is the “Golden Standards.” We have around 100 different failure scenarios in Komodor right now. The goal is to move fast while still having regression testing. When you are developing with an LLM, you see it become smarter over time, but you often forget that every prompt change aimed at fixing a new problem might break problems you already solved. At Komodor, we simulate a lot of interesting use cases and use this data to make sure we have enough coverage as we move forward. We have regression tests that help us stay safe even when moving at high velocity.
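
A minimal sketch of such a regression harness, assuming a directory of JSON “golden” cases and a simple substring pass criterion (both are simplifications; real scoring is richer):

```python
# Replay a "golden standard" corpus and report the pass rate.
import json
from pathlib import Path

def run_agent(scenario: dict) -> str:
    """Placeholder for the real agent call; returns a root-cause summary."""
    return "OOMKilled: container exceeded its memory limit"

def regression_suite(golden_dir: Path) -> float:
    """Replay every stored scenario and report the fraction that pass."""
    results = []
    for path in sorted(golden_dir.glob("*.json")):
        case = json.loads(path.read_text())
        answer = run_agent(case["scenario"])
        results.append(case["expected_root_cause"] in answer)
    return sum(results) / len(results) if results else 0.0

# Example gate in CI: fail the build if a prompt change causes regressions.
# assert regression_suite(Path("golden/")) >= 0.93
```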

Itiel Shwartz: Shadow agents and experiments are essentially a fancy name for A/B testing. Because you can’t really trust your LLM, you need to have multiple LLMs running on the same task. Only then are you really able to understand if the first one or the second one did a good job. Even as a human, evaluating an LLM call is quite hard. Once you have more than one sample or agent trying to solve the same problem, you start seeing much better, more accurate evaluations.
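
Conceptually, it looks something like this sketch: two agent variants run on the same issue, only the production answer is served, and the shadow answer is logged for later comparison. The agent functions here are stand-ins.

```python
# Shadow-agent A/B sketch: run a candidate variant alongside production.
from concurrent.futures import ThreadPoolExecutor

def production_agent(issue: dict) -> str:
    return "root cause: liveness probe misconfigured"

def shadow_agent(issue: dict) -> str:       # new prompt / model under test
    return "root cause: liveness probe timeout too low"

def run_shadowed(issue: dict) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        prod = pool.submit(production_agent, issue)
        shadow = pool.submit(shadow_agent, issue)
        return {
            "served": prod.result(),    # only this answer reaches users
            "shadow": shadow.result(),  # this one is logged for comparison
        }

print(run_shadowed({"pod": "api-5c4d", "symptom": "CrashLoopBackOff"}))
```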

Itiel Shwartz: We have different scores for different areas of the investigation. You can see here we have querying, root cause, and evidence. We give different scores to different areas to achieve really high precision.

Itiel Shwartz: Lastly, and this relates to the previous point: “LLM as a Judge.” Because we have so much data and the world is so flaky, we use an LLM internally as a judge to make sure we are capturing all the different scenarios that happen in production. The judge we built specifically for this purpose is able to evaluate them in a safe, reliable manner.
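
Putting the two previous points together, a judge can score each area of an investigation separately (querying, root cause, evidence). This is a minimal sketch; the prompt wording and the call_llm helper are hypothetical stand-ins for a real model client.

```python
# "LLM as a judge" sketch: grade an investigation area by area.
import json

JUDGE_PROMPT = """You are grading an AI SRE investigation.
Score each area from 0 to 10 and answer in JSON:
{{"querying": n, "root_cause": n, "evidence": n}}

Investigation transcript:
{transcript}

Ground truth (if known):
{ground_truth}
"""

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (OpenAI, Anthropic, etc.)."""
    return '{"querying": 9, "root_cause": 8, "evidence": 9}'

def judge(transcript: str, ground_truth: str = "unknown") -> dict:
    raw = call_llm(JUDGE_PROMPT.format(transcript=transcript,
                                       ground_truth=ground_truth))
    return json.loads(raw)

scores = judge("fetched pod events; concluded OOMKilled; cited the event log")
print(scores)   # e.g. {'querying': 9, 'root_cause': 8, 'evidence': 9}
```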

Itiel Shwartz: The last thing I want to say here, which is a bit less sexy, is that traditional ML also has its place. I know this talk is about how to build the best LLM-based system in the market, but to build a really good one, sometimes you need to go back to legacy approaches like classic ML. For a long time, we tried to solve cascading failures for our customers—detect, investigate, and remediate them. We saw that the LLM kept failing; we couldn’t reach a good enough state. So we decided to use a combination of traditional ML to filter and cluster some of the data, and on top of that we sprinkled a bit of LLM. The results were much better. Thanks to this hybrid approach, we were able to achieve precision very close to traditional Root Cause Analysis (RCA), but powered by LLMs.
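
As an illustration of the “filter and cluster first” idea (using simple template-based clustering rather than Komodor’s actual ML pipeline, and with helper names invented for the example), noisy events can be collapsed into clusters so the LLM sees one representative per cluster plus a count instead of the full flood:

```python
# Cheap, deterministic clustering of noisy events before anything hits the LLM.
import re
from collections import defaultdict

def template_of(message: str) -> str:
    """Collapse variable parts (pod-name hashes, numbers) so similar events cluster."""
    msg = re.sub(r"-[0-9a-f]{4,10}\b", "-<id>", message)   # pod-name suffixes
    return re.sub(r"\d+", "<n>", msg)                      # remaining numbers

def cluster_events(events: list[str]) -> dict[str, list[str]]:
    clusters: defaultdict[str, list[str]] = defaultdict(list)
    for event in events:
        clusters[template_of(event)].append(event)
    return clusters

events = [
    "Pod api-7f9c2 OOMKilled (memory limit 512Mi)",
    "Pod api-1b3d4 OOMKilled (memory limit 512Mi)",
    "Node node-3 NotReady",
]
# Feed only one representative per cluster to the LLM, plus the counts:
for template, members in cluster_events(events).items():
    print(f"{len(members)}x {template}")
```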

Itiel Shwartz: So, how does it look? I’ll do a quick demo. Give me one second.

Itiel Shwartz: This is a Komodor test environment. You are able to see all of your different clusters and an overview of each. “Workload Health” shows workloads that are currently not running as expected. I can simply click on any unhealthy resource and get Klaudia’s response. What we can see here is a balancer deployment that is unavailable because of some issue. Klaudia always produces “What happened”—how we reached that bad state—and “Related Evidence.” As I told you, it’s critical to have that sort of evidence when working on these kinds of use cases. It also provides “Suggested Remediation”—how can I solve that particular problem—as well as a full explanation and rejected alternative remediations.

Itiel Shwartz: In reality, it looks very similar to this. This is how the investigation process looks, along with the end result and explanation. We have quite a lot of things here designed to give us more confidence.

Itiel Shwartz: If I go to a workload and an unhealthy pod, I should be able to see things quite similar to what I saw here. This is how Klaudia looks when it starts to investigate issues. It does everything a real human SRE would do. It asks the relevant questions. The goal is to be on par with the steps an SRE would take. If Klaudia takes too many steps or fetches too much data, it’s going to get lost. Too much data is bad; too little data is also bad because Klaudia won’t have enough information to operate. You always need to find the middle ground.

Itiel Shwartz: I see now, as with any live demo, it’s taking a bit longer than normal. That can happen; since we’re relying on an LLM, it might take a bit longer than a normal investigation. Let’s see what happened here: a pod out of memory due to insufficient memory allocation. We have the “What happened,” the suggested remediation, and I can also chat with Klaudia. I think developers are now much more used to talking with agents than they were a year ago. I can ask, “Why did this happen?”

Itiel Shwartz: Hopefully, the LLM will not only help me have something generic but also allow me to deep dive into that particular issue and how to solve it. The goal here is to give our customers and ourselves everything needed to solve very complicated problems.

Itiel Shwartz: Okay, I’ll go back to the slides. Just to summarize the main takeaways: There’s a hierarchy you need to build: trust, precision, coverage. I really believe that in a domain like ours, it’s better to be silent than to give the wrong answer.

Itiel Shwartz: Local development is key—shifting left is key—because trial and error is your only way of building trust in the system. Custom tools are a must. At Komodor, 80% of the time we spend adding a new agent to the system goes into making sure it has the right evals, tests, tools, and monitoring. Without that, you’re going to fail. I think a hybrid approach is the best way to handle very complicated scenarios. Also, LLMs can be very fun. Even I, having built Klaudia from the ground up, am sometimes shocked by the accuracy and quality of the results it provides to our customers.

Itiel Shwartz: Questions? Awesome. I see we don’t have any questions. If there are any questions later, feel free to reach out to us. Stay tuned for next month; we’re going to have another very interesting webinar on managing applications hosted across different environments with one of our tech leads, Michael Alil. That should be in late February, so be on the lookout for that.

Ilan: Thank you so much, Itiel. Again, Itiel Shwartz is on LinkedIn; you can follow him there. He shares lots of interesting posts, like his recent mention of the Anthropic “Swiss Cheese” model and using evals with Anthropic. Thank you so much to everyone who joined us this evening or afternoon.