Komodor is an autonomous AI SRE platform for Kubernetes. Powered by Klaudia, it’s an agentic AI solution for visualizing, troubleshooting and optimizing cloud-native infrastructure, allowing enterprises to operate Kubernetes at scale.
In this webinar, Asaf (Team Lead, CTO Team) and Udi (K8s Advocate) from Komodor, explore whether AI SRE agents are ready for prime time and how to build trust in autonomous operations. They discuss the growing complexity of site reliability engineering, the challenges of evaluating AI tools, and introduce Klaudia, Komodor’s AI SRE agent. Asaf shares how Komodor uses a Failure Playground of real Kubernetes issues and LLM-based evaluation to benchmark performance and build confidence. A live demo compares Klaudia to an open-source tool, highlighting its ability to identify root causes, provide evidence, and recommend remediations. The session concludes with audience Q&A on autonomous capabilities, integrations, and adoption strategies.
TL;DR: Why “Are AI SRE Agents Ready for Prime Time? Building Trust in Autonomous Operations” matters:
Key Takeaways from the Webinar:
Webinar Transcript:
Please note that the following text may have slight differences or mistranscription from the audio recording.
Udi: Welcome everyone to this month’s Komodor webinar titled Are AI SRE Agents Ready for Prime Time? Building Trust in Autonomous Operations. With us today is Asaf, who I’ll introduce in just a moment. A bit of housekeeping before we begin: this webinar is being recorded and will be shared with all of you, including the deck. We’ll have a Q&A session at the end, but if something urgent comes up during the session, we’ll try to address it live.
Today we’re going to answer the question: are AI SRE agents ready for prime time, and how do we build trust in autonomous operations, which is what we all dream about? Let’s get started. Asaf, can you please share your screen and bring up the deck?
Asaf: My screen is up.
Udi: Perfect. Let’s quickly go over the agenda. We’ll start with introductions, then Asaf will take us through the AI SRE landscape and talk about the many tools and projects that have been appearing in recent months. We’ll discuss why it’s so hard to evaluate AI SRE tools, and then Asaf will explain how we’re doing it here at Komodor with our own AI SRE agent, Klaudia. After recapping, we’ll give you some practical tips and a demo both of Klaudia in action and our evaluation techniques, including the Failure Playground and other cool features that Asaf will show.
So, without further ado, Asaf, please introduce yourself.
Asaf: Thanks, Udi, and thanks everyone for joining. For those of you attending a Komodor webinar for the first time, I hope this will be the best one you’ve had so far. My name is Asaf, I’m the CTO here at Komodor, responsible for driving innovation and pushing the company forward.
In my previous positions, I co-founded GI, served as a Director of Engineering at both Kubia and M, and I’m also what I call an ice bath enthusiast. By that, I mean I did it once and will probably never do it again, but it was an experience!
Let me start with a short story. A few years ago, I joined a startup as Head of R&D. On my very first day, as I was about to finish up for the evening, production went completely down. All authentication processes failed, and we were 100% down.
For me, it was a nightmare. First, because I was excited to start and wanted to make a good impression, and everything broke right away. Second, I didn’t know who to reach out to, as knowledge was scattered among different people. We were a small team, but still it was overwhelming.
Udi: And this wasn’t your first day as a developer or DevOps engineer; you had experience.
Asaf: Exactly. I had already managed large R&D organizations and handled many production incidents. But this one caught me completely off guard. There had been no deployments that day, and no single person was available to investigate. Naturally, R&D blamed Dev, Dev blamed R&D. I’m sure that sounds familiar to many here.
We eventually discovered that the authentication provider we used had made changes on their end, which required us to update our systems. The actual fix took minutes, but collecting all the context (logs, dependencies, service relationships) took hours. It was two hours of downtime, and I can’t stress enough how damaging that was.
It was very difficult, but maybe today we could have done something differently if AI agents had been available.
Udi: That’s a great lead-in. Let’s talk about the AI SRE landscape. There are so many tools out there: open source, proprietary, chatbots, Slack-native, UI-based, niche solutions focused on networking or other domains. It can be overwhelming. Before we go into families of tools, though, I think it would help if you explained how you define Site Reliability Engineering and why it was missing in the story you just shared.
Asaf: Good point. SRE, as I see it, is responsible for production reliability. Everything that happens in production may be caused by many different people in the organization, but the SREs are the ones accountable for making sure production environments are stable and reliable.
It’s a very demanding job. Even before AI, SREs had to juggle a massive amount of context: monitoring tools, logging tools, tracing, plus deep knowledge of the application itself. Think about your own organizations: when something major happens, how many people truly know how to resolve it? Usually just one, two, maybe three.
It’s becoming impossible to sustain this model. The complexity grows daily. We believe there’s no future for SRE without AI SRE agents. They must go hand-in-hand.
To recap: SRE is a very tough, demanding job. There are very few SREs compared to engineers. Even the best SREs, who know the systems and tribal knowledge deeply, are overwhelmed because the context just keeps expanding. They aren’t scaling in numbers, but the complexity is scaling rapidly. That’s why what we’re doing today isn’t sustainable.
Udi: Definitely. And in terms of the landscape, we’ve seen an explosion of AI SRE tools.
Asaf: Exactly. The number of tools grows every day. Every week we investigate and find new ones. It’s both exciting and overwhelming.
There are a few broad families. Some are open source, some proprietary. Some act as chatbots in Slack, some come with dedicated UIs. Some try to do everything, while others are very specialized, for example focusing only on networking failures.
But the common promise across all of them is: “We’ll help you solve SRE issues using AI.” Not necessarily 100% of the issues, but maybe 70%, with humans covering the rest. That’s still extremely valuable.
Udi: But we’re not at the point where AI can completely replace SREs.
Asaf: Exactly, we’re not there yet.
Udi: So why is evaluation so difficult?
Asaf: Because the tasks SREs do are the hardest in the company. They’re handled by the most technical, experienced people. If it were easy, we’d have more of them. Mimicking their decision-making is extremely difficult.
Even the best SRE from one company won’t immediately succeed in another because context is so critical. That’s what makes AI in this space uniquely difficult.
Udi: Which brings us to Klaudia. Tell us how Komodor is approaching this.
Asaf: We’re building Klaudia, our AI SRE agent. For quite some time now, we’ve been working on techniques to ensure its quality. We asked ourselves: how do we validate that our AI SRE agent is top-notch, accurate, and up-to-date? With AI, this is especially hard because it’s far less deterministic than code. Even with code, validation is hard; with AI, it’s even more complex.
So, we created several approaches. One of the most important is what we call the Failure Playground.
The Failure Playground is essentially a system for simulating failures in any Kubernetes cluster of your choice. We built a repository full of common Kubernetes failures: out-of-memory errors, networking issues, GPU problems, image pull errors, and many more. These aren’t fake or mocked issues. They’re real failures deployed into real clusters.
Udi: And why is that so important?
Asaf: Because it allows us to test Klaudia end-to-end. From the moment the failure happens to the moment it’s resolved, we can see if Klaudia handles it properly. It also allows us to compare Klaudia to other vendors, since these are plain Kubernetes failures that any AI SRE agent should be able to handle.
Another advantage is that customers can try Klaudia without waiting for a real failure to happen in their system. Maybe they only have two or three issues at a given time, but with the Failure Playground, we can provide dozens of demo failures. That really builds their confidence in the product.
And for us, it provides an immediate feedback loop. If a new version of Klaudia can’t handle one of the known failures, it won’t be deployed. It makes it easy to automate testing and ensure reliability.
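The release gate Asaf describes can be sketched in a few lines. This is an illustrative assumption, not Komodor’s actual harness: `ScenarioResult` and the scenario names are hypothetical, standing in for whatever a real Failure Playground run would report.

```python
# Sketch of a release gate over known failure scenarios: if any
# previously-handled failure regresses, the new version is not shipped.
# The result shape and scenario names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    name: str
    root_cause_found: bool  # did the agent identify the real root cause?

def gate_release(results: list[ScenarioResult]) -> bool:
    """Return False (block deployment) if any known failure regressed."""
    regressions = [r.name for r in results if not r.root_cause_found]
    if regressions:
        print(f"Release blocked; regressions in: {regressions}")
        return False
    print("All known failures handled; release allowed.")
    return True

results = [
    ScenarioResult("oom-kill", True),
    ScenarioResult("bad-image-tag", True),
    ScenarioResult("secret-misconfig", False),  # regression: do not ship
]
gate_release(results)
```

The point of the pattern is the immediate feedback loop: the scenario suite doubles as an automated acceptance test for every new agent version.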
Udi: That’s a really clever approach. And beyond the Failure Playground, you also use models as judges, right?
Asaf: Yes, that’s the second part of our evaluation. The world of AI moves incredibly fast. A model can be great today, and then a new one comes out tomorrow and suddenly we need to rethink everything. So we need a way to move quickly.
What we did was create a predefined set of rules about what makes a good AI SRE agent. It’s a long document, pages of rules, based on what we’ve seen over years of helping companies troubleshoot Kubernetes issues.
We feed this into a large language model and use it as a judge. It evaluates whether Klaudia’s answers meet those rules. But instead of relying on numeric scores, which LLMs tend to inflate, we always use comparisons. For example, is this new model better than the previous one? That’s much more consistent.
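The comparison-based setup can be sketched as follows. Everything here is an assumption for illustration: `ask_judge` is a stub standing in for a real LLM call, and the rules text and answer format are invented, not Komodor’s actual evaluation prompts.

```python
# Sketch of pairwise LLM-as-judge evaluation. The judge returns a
# preference ("A" or "B"), never a numeric score, since pairwise
# preferences are more consistent than absolute ratings.
RULES = "A good AI SRE answer names the root cause and cites evidence."

def ask_judge(rules: str, answer_a: str, answer_b: str) -> str:
    """Stub for an LLM call; a real judge would be prompted with `rules`
    and both answers. Here a trivial heuristic keeps the sketch runnable:
    prefer the answer that cites evidence."""
    return "A" if "evidence" in answer_a.lower() else "B"

def compare_models(cases: list[tuple[str, str]]) -> dict[str, int]:
    """Tally judge preferences across a set of incident cases."""
    wins = {"A": 0, "B": 0}
    for answer_a, answer_b in cases:
        wins[ask_judge(RULES, answer_a, answer_b)] += 1
    return wins

cases = [
    ("Root cause: bad secret. Evidence: pod logs show ...",
     "Delete the config map."),
    ("Restart the pod.",
     "Raise the memory limit. Evidence: OOMKilled events."),
]
print(compare_models(cases))  # per-model win counts, e.g. {"A": 1, "B": 1}
```

The design choice worth noting is the output contract: asking “which answer is better?” sidesteps the score inflation that plagues absolute 1-to-10 grading.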
Udi: And Komodor has a lot of experience to back this up.
Asaf: Definitely. Komodor has been around for five years, and in that time we’ve seen countless failures across many large organizations. We’ve learned a lot about how different setups look: financial institutions, automotive companies, enterprises of all kinds.
This experience is priceless. Feeding all that context into our evaluation process gives us a unique edge.
Udi: That makes sense. And benchmarking against other vendors is also part of it.
Asaf: Yes. We benchmark Klaudia against other tools regularly. There are so many out there, and the number keeps growing. We welcome competition; it’s good for the market and pushes everyone forward.
Results vary a lot. Even without going deep, you can see that different tools take very different approaches to the same root cause analysis.
Our tip for anyone evaluating AI SRE tools is: simulate failures yourself. Don’t just trust the promises. Deploy failures in a staging environment and see how the tools handle them.
Udi: And today, we’ll show a demo where you compare Klaudia to K-GPT, a really interesting open-source tool.
Asaf: Right. We want to emphasize, this isn’t about proving which one is better. It’s about showing how you can evaluate these tools effectively.
Udi: Before we get to the demo, let’s pause for a second. Why are we even doing all this? Why did we start evaluating AI SRE agents in the first place?
Asaf: Good question. I don’t really know… No, I’m kidding. There are some very clear reasons.
Think back to the story I told at the beginning. That outage cost us a lot, both in money and in reputation. Uptime is critical. We want to be as close to five nines as possible. This is a matter of professionalism.
Beyond reputation, downtime hits operational efficiency. If teams are constantly firefighting, they can’t focus on building. They can’t hit their goals.
And then there’s cost. Every production issue costs money. Sometimes, because we don’t get to the root cause, we just throw more resources at the problem: larger nodes, bigger databases, keeping years of logs. Those workarounds add up.
An AI SRE agent like Klaudia means we don’t need 20 people firefighting. We don’t need to overspend on infrastructure just to stay afloat. We gain efficiency, reduce costs, and keep our focus on what really matters: innovation and moving the company forward.
Udi: Exactly, achieving business goals instead of just keeping the lights on.
Asaf: Right. And that brings us to the demo.
Udi: Perfect. Let’s dive into it.
Asaf: For this demo, we’ll use the Failure Playground. This environment has dozens of failures: bad image deploys, bad values in config maps, memory leaks, networking issues, storage errors, GPU issues, secrets misconfigurations. All very common Kubernetes failures.
Let’s pick one. I’ve prepared a simple local cluster. Here’s the scenario:
We deploy an app that starts with a secret set to “safe.” The app runs fine. Then, the secret is changed to “broken.” The deployment restarts to pick up the new secret value, and it fails because “broken” is not a legitimate value. Very simple, very real.
Udi: And without AI, how long would it take you to troubleshoot this?
Asaf: First thing, I’d check logs, if there are any. Then I’d review changelogs to see what was modified around the time of failure. But there could be many changes, and that can take time. It’s not trivial.
Udi: Okay, let’s run it.
Asaf: Here’s the app. It crashes. Now let’s analyze it first with K-GPT, an open-source tool.
K-GPT says: the config map exists but isn’t being used. It recommends deleting the config map. That’s not correct in this case. It’s a nice suggestion, but it misses the real root cause.
Now let’s ask Klaudia.
Klaudia starts by gathering context, analyzing the logs, and reviewing the changes. It reports: “The pod is crashing due to an invalid app mode value after a secret content error.”
It explains what happened, shows the evidence it used to reach this conclusion, and provides a remediation: “Change the secret to app mode with a value set to ‘safe.’”
Udi: That’s a huge difference.
Asaf: Exactly. And the key is the evidence. Users don’t just want to be told what to do. They want to see how the AI reached its conclusion. That transparency builds confidence.
Udi: Right. Building confidence is everything. You wouldn’t trust a human who acted without explanation. You need the reasoning.
Asaf: Exactly. And Klaudia not only explains and recommends but can also execute. For now, most organizations prefer to keep AI in copilot mode rather than full autopilot, but the capability is there.
Udi: Let’s open it up for audience questions.
Audience: Can Klaudia detect errors, alert, and even fix them on its own without being asked?
Asaf: Very good question. The short answer: yes, we have this capability. Klaudia can detect issues without human intervention, investigate, and even execute a fix.
But the bigger challenge is cultural. Trusting AI to fully autonomously handle production is not easy. For now, we focus on enabling organizations to adopt AI in safe ways, for example limiting autonomy to dev environments, or specific namespaces, or certain categories of issues.
We do have this in beta, and stay tuned for KubeCon; we’ll be announcing exciting updates about this capability.
Udi: So, the capability exists, but adoption is gradual, based on confidence.
Asaf: Exactly.
Audience: What do you think is the next phase? If AI SRE is the hype of 2025, what comes next?
Asaf: Interesting question. I think the next big thing is true autonomy. We’ve been talking about it since ChatGPT came out: what if it could handle incidents end to end without humans?
We’re closer than ever. All the pieces are coming together. Imagine an incident at 3 a.m. being solved without waking anyone up. That’s where we’re heading.
Beyond that, I think prevention is the future. Not just fixing incidents but preventing them. An AI that reviews your infrastructure, identifies risky areas based on past incidents, and proactively strengthens them. That’s where we’ll go.
Udi: Prevention as the next frontier. Makes sense.
Audience: Can Klaudia detect issues beyond Kubernetes?
Asaf: Yes. Kubernetes is the cornerstone, but not the whole body. We also integrate with Git, ArgoCD, Jenkins, GitHub Actions, and more. Issues often show up in Kubernetes, but the root cause may lie elsewhere, and Klaudia connects the dots.
Udi: And what about CRDs and non-native Kubernetes objects?
Asaf: Good question. Even if they aren’t native, once they’re represented as Kubernetes resources, Klaudia can treat them like any other pod or workload. That means broader coverage across your workflows.
Audience: How do we set up Klaudia with our infrastructure?
Asaf: Easy. Go to komodor.com, click “Get Started,” and you can start a free two-week trial. You don’t need to talk to sales; you can try it yourself. Of course, for best results, it helps to work with one of our engineers. Then you can stress test Klaudia with real-life issues.
Audience: Can Klaudia interact with MCP servers?
Asaf: We experimented with this. Some customers tried connecting their MCPs. Results weren’t as good as we hoped, mainly because sometimes you need a mediation layer to normalize the data. Without it, results are less robust.
So right now we don’t offer it generally. But we are always adding integrations: metrics systems, Git, Argo, Prometheus, and more. And we do hear from customers who want to connect their organizational knowledge bases, which is something we’re exploring.
Audience: What about Prometheus? Can it use custom monitoring data?
Asaf: Absolutely. Prometheus is one of our native integrations. By pulling in those metrics, Klaudia has more context and can provide better answers and remediations.
Udi: Great. Any final questions?
Audience: Just one how are you so great at what you do?
Asaf: I’m not! We just have a very strong team at Komodor. My job is mainly to stay out of their way. But seriously, it’s an amazing time to be in the SRE space. Things are moving fast, new companies are rising, and it’s very exciting.
Udi: Yes, it really is. And at least in our space, the developments are positive and exciting, unlike some other areas in the world.
Thank you so much, Asaf, for sharing your knowledge and experience. And thank you to everyone who joined and asked questions. Please check out Klaudia at Komodor, or use what you’ve learned here to evaluate other AI SRE tools.
Asaf: Thank you all.