GenAI Meets K8s: The Role of Generative AI in Overcoming Kubernetes Complexity

Itiel Shwartz
Komodor's CTO and Co‑Founder
Andrei Pokhilko
Open Source Dev Lead
Udi Hofesh
Komodor's K8s Advocate

Please note that the text may have slight differences or mistranscription from the audio recording.

Udi Hofesh: Hello and welcome to another webinar by Komodor, and we’ll just wait a couple of minutes to let everyone join in, and then we’ll start the conversation. Welcome Serio, Subash, Sunita, Prashant, Prasad, Oer, Phil Edwards, Mohini M, and Sh. Welcome, Rafhael and Robert, I see we have some returning guests here. Alright, let’s kick it off and let’s go. Okay, so, welcome again to our webinar on Kubernetes and GenAI, a topic that’s been talked about a lot, and you hear it everywhere, but we have an interesting angle to discuss today. We just recently launched Klaudia AI, an advanced troubleshooting agent for Kubernetes.

Andrei Pokhilko: Yeah, it’s actually our first experiment—or not an experiment, but a product—at that scale of applying AI. This is when we got really serious, so it’s a bit special for us. Before that, we didn’t invest that much into AI because we were cooking it in our lab, and now, after we released it, it’s a special occasion for us, and we’re glad to finally speak about it after so many months of silence.

Udi Hofesh: Speaking of our lab, we have two mad scientists behind this project. I’m lucky to share this space with Andrei, who came all the way from Portugal just for this live webinar. Andrei has a history of contributing to open-source projects. He’s the brain behind Helm Dashboard and Komoplane, popular open-source projects, and also Klaudia, which we will discuss. Itiel is Komodor’s co-founding CTO. He used to be a developer, turned DevOps, and has seen everything from small startups to large-scale enterprises. Itiel isn’t with us here physically, but we do have his image here with us. So yeah, happy to be here. Before we get started, I just want to thank everyone for joining in. If you have any questions, feel free to drop them in the chat below, and we’ll address them as we go on. Everything will be recorded, and we will share the recording and any other materials and links with you after the webinar. And I will moderate this session for us and lead the conversation. I will do my best. Yes, I’m Udi, DevOps Advocate at Komodor, and we also have with us Nikki behind the scenes, taking care of everything that you cannot see and some things that you will see. So, let’s get right into it. Itiel, Andrei, when did you first start paying attention to AI or GenAI? It’s been a thing for a while, but when did it catch your eye?

Itiel Shwartz: I think that when we looked at AI—even when we started the company, actually, AI was always there, mainly because AI always sounded like the most logical thing to do. In IT, you have a lot of data and a lot of repetitive work, so using AI to solve that makes total sense. But when we started the company, we made a decision not to use AI, mainly because we felt it was overselling and underdelivering. AI solutions weren’t that great in helping to solve DevOps issues. We resurfaced our interest in AI when OpenAI became popular, with Copilot. That was actually the turning point for us—seeing if we could actually use AI to make our lives easier and our customers’ lives easier. OpenAI changed the picture for us. I use Copilot, I’m happy with Copilot, and I also use Claude and OpenAI quite a lot, which has been great. But with both Claude and OpenAI, you still have a human in the middle. When we built Klaudia, we aimed to have no human in the loop, meaning a human is not expected to interact with the AI during troubleshooting. That’s very challenging because most AI products assume a human will always review the results. We built Klaudia for a very specific domain—Kubernetes troubleshooting. Our uniqueness at Komodor lies in the large set of data that’s unique to us, combined with our Kubernetes experience. We narrowed the focus to Kubernetes troubleshooting to ensure Klaudia works out of the box. We’re trying to automate the steps a developer or DevOps engineer would take, starting from detecting changes, logs, events, and eventually providing actionable insights. It’s about simplifying the complex and making the developer’s life easier. As time progresses, we’ll continue automating more scenarios, moving from basic troubleshooting to even more complex cases.

Andrei Pokhilko: I was skeptical of AI at first. I didn’t see immediate value, and I thought it might just be a buzzword, but when they released ChatGPT and made it freely available, I started experimenting. That’s when I began to see the potential, even though I still think it’s not that smart. The technology behind it, though, is very advanced and massive, and I realized it could replace traditional search engines. It’s impressive how much knowledge it can process. While I’m still not a big fan of Copilot for coding because it interferes with how I write, I can appreciate how AI is evolving. I was finally convinced after we built Klaudia. This tool has proven AI can build something game-changing for the market. We’re just scratching the surface, and I’m excited about the future of AI and the products that will use it. I’m particularly interested in AI as an agent, not just a generator. We need to move from simple chatbot interactions to a goal-oriented, task-oriented approach. Give AI a goal, and let it figure out how to achieve it, rather than being tied to conversations alone. We can do much more with AI if we give it “eyes” and “hands.”

Udi Hofesh: That’s an interesting notion. You know, Marshall McLuhan said back in the ‘70s that humans are the reproductive organs of the machine world, and now you’re saying this is exactly what we are. That’s smart—or maybe not even reproductive organs, just the limbs and eyes. Okay, let’s bring it back to our world. We talk a lot about the skill and knowledge gap between SREs, DevOps, platform engineers, and application teams—or developers. How can GenAI help bridge that gap? And what potential is there for GenAI in the world of Kubernetes, where there are numerous challenges and the knowledge gap is just one of them?

Itiel Shwartz: I think the promise is always to have this AI assistant that will do everything for you, right? It’s true for DevOps, it’s true for home assignments, it’s true for everything—how can we automate the boring, non-creative stuff, basically? When it comes to platform engineers versus developers, we’re starting at the bottom—like developers doing troubleshooting. When DevOps is doing troubleshooting, it’s a very complex situation. A lot of the time it’s a combination of, “I have a problem with the ingress, and with the PVC, and with the nodes,” and so on, so it’s quite hard to automate that. On the other side, when we look at the tasks developers are doing when it comes to troubleshooting, it’s often much simpler. You go, you see what changed—because most issues originate from changes—then you read the logs, you read the events, and you do a revert or increase limits, or something like that. That’s the honest work of developers when it comes to Kubernetes troubleshooting. Currently, even that is quite hard for developers. What I just described requires expertise, experience, access, and so on. What we’re trying to do is look at this pyramid of what developers are expected to do versus what DevOps engineers are doing. We want to start automating each part of the way, meaning we’ll take some of the hard parts that can be automated and do the work for you. In Komodor, we started focusing on developers and building tools that are perfect for developers to use when troubleshooting, essentially taking them from a novice level to an expert level using GenAI. I think as time progresses, we’ll see more and more complex scenarios being handled by GenAI. At the end of the day, DevOps engineers don’t want to troubleshoot a very complex scenario—that’s life. It will probably be harder to automate, but that’s the ambition.

Andrei Pokhilko: For me, Kubernetes is a very good field to apply GenAI because Kubernetes requires a lot of knowledge. There are so many things to know, and that’s just one part of it. The second part is the amount of information you need to deal with. We know that humans don’t like dealing with huge amounts of information—machines are fine with processing enormous volumes of data, but humans are bad at it. We get a really good combination when the machine’s brain effectively remembers the whole internet. That’s the neural network behind the language models—it retains whatever it was trained on, which is amazing. This is why I’m replacing Google with it. Then you take the volume of information that happens in Kubernetes—the complexity of relationships between objects, all of that difficulty—and it becomes a perfect place to apply the intelligence of language models. I don’t know how to overcome the complexity of Kubernetes, which expands and evolves in such a dynamic way. I don’t know how we can keep up at scale without offloading these intellectual tasks. We’re not just offloading manual, tedious tasks but also difficult intellectual tasks. You need to know a lot and process a lot of information, and now we’re offloading that. This step we’re taking is very natural. Without it, we’d be stuck because there are limits to our squishy human brains and bodies regarding how much we can process in a given time period. So to be short, this is the best place to apply AI, in my opinion.

Udi Hofesh: Yeah, especially given that Kubernetes environments are only growing and growing in scale and becoming more complicated, with new versions coming out every six months or so. So, what do you think is the key to simplifying Kubernetes? Where do you put your focus? Is it simplifying the interface so you don’t need to learn kubectl commands, or is it something else? Is it just eliminating toil? Where is the silver bullet, if it exists? If there’s one area where you can target your AI agent to make it simple?

Andrei Pokhilko: In my opinion, the best strategy is not to keep processing all the information yourself or trying to address everything. The good systems these days around Kubernetes, in my opinion, are not about eliminating kubectl. It’s about eliminating the need to constantly look everywhere. You need to be event-driven. Something needs to happen that will attract your attention. So, it’s either a problem that is happening in your cluster or some report presented to you periodically, or based on some events, that shows you the potential areas where you should apply your efforts. Just sitting in front of the computer—even with the Komodor platform rather than kubectl—is still just a way to present your clusters and all the volumes of information. You don’t want to go there just to observe in case anything bad happens. You would like to be notified, to be presented with the most interesting things in your cluster: something that is really standing out, violating your internal or industry standards, misbehaving, consuming too many resources, costing too much money, or offering a good amount of savings if you apply some changes. So, these insights, when they come to you at the right moments, that’s when you should get involved with Kubernetes. Otherwise, the number of things to just go and look at is not processable by humans. So, the ideal thing is something that detects the problems or areas of interest for you first, and then you involve AI or a human to act on it. That’s my take.

Udi Hofesh: Alright, so let’s show some visuals, and…

Andrei Pokhilko: Yeah, yeah, basically, this is our vision. The things I’ve just said can be illustrated with this diagram because one of the learnings we have from implementing Klaudia, experimenting a lot with it, and finding the right way to apply AI that worked well for us, is that you do not start with AI. You start with something deterministic, classical algorithms that you use to detect the areas of interest, detect the problems, detect the events that deserve to be investigated by human or machine. Then, of course, you don’t want to spend human time, so you start to spend machine time. You come to Klaudia when the deterministic automation has found something worth investigating. Klaudia does the deep investigation and does its thing.
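The "deterministic first, AI second" flow described above can be sketched in a few lines. This is a minimal illustration, not Komodor's actual code: the function names and the workload shape are all hypothetical.

```python
# A minimal sketch of the "deterministic first, AI second" flow: a cheap,
# classical check finds suspects, and only suspects reach the AI stage.
# All names and the workload shape here are illustrative, not Komodor's API.

def detect_unhealthy(workloads):
    """Deterministic triage: no AI is needed to spot a failing workload."""
    return [w for w in workloads if w["ready_replicas"] < w["desired_replicas"]]

def investigate(workload):
    """Placeholder for the expensive AI investigation stage (Klaudia's job)."""
    return f"investigating {workload['name']}"

workloads = [
    {"name": "checkout", "ready_replicas": 3, "desired_replicas": 3},
    {"name": "payments", "ready_replicas": 1, "desired_replicas": 3},
]

# Only workloads flagged by the deterministic pass consume machine (AI) time.
for suspect in detect_unhealthy(workloads):
    print(investigate(suspect))  # → investigating payments
```

The point of the split is cost and focus: the deterministic pass is cheap enough to run continuously, while the AI stage only runs on a narrowed-down target.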

Udi Hofesh: So, let’s go through this diagram. Do we have a version that’s bigger, like on other slides?

Itiel Shwartz: Yeah, I think let’s put it up, and I’ll give some segue. After all this talk, at Komodor we bet quite a lot of resources on working on a GenAI solution for Kubernetes. We had a lot of trial and error to perfect it, and what we’re going to show now is the internal process of how we harness GenAI to help users solve issues in Kubernetes. Basically, for every kind of Kubernetes issue—every time a pod is restarting for no apparent reason, your deploy fails, your node is misbehaving—Komodor with Klaudia, which is the name of the GenAI agent, will allow you to understand what happened and, hopefully, help find the root cause. We built Klaudia in a way that is trustworthy and also explains to the user the evidence that led us to do what we’re doing. Andrei is now going to talk about the different mechanisms, but this is the mechanism of GenAI designed to solve issues.

Andrei Pokhilko: Absolutely. Once again, I will share a bigger view, and I will zoom in as we continue with the flow. So, let’s start with the input and describe the principle of how Klaudia works. The stages are crucial for the results’ quality. We found that it’s important not to start with a language model or to apply AI to a broad task, which many players in the market do. They try to give a broad task definition, a lot of information, and say, “AI, please find whatever deserves to be investigated.” But our approach is the reverse of that. We use existing methods. This is something we have in Komodor from the very beginning—we are able to detect problems in the cluster, and that doesn’t require intelligence. You can use deterministic ways of doing that. But once you narrow down to a specific workload, pod, or job, it’s much easier to explain the task to AI. We explain that we are working in the context of this workload or Kubernetes resource and that we want to investigate, find the problem, and give suggestions. We want to do that in a newbie-friendly manner. That’s also important because we assume the person consuming the results of Klaudia has less experience in Kubernetes. The Komodor platform simplifies Kubernetes for consumption, and democratizes the technology. So, we identify the task, give it to Klaudia, and then Klaudia does its thing.
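The routing step mentioned above, where a known problem area like out-of-memory gets its own investigation flow, could look roughly like this dispatch sketch. Every name below is hypothetical.

```python
# Hypothetical sketch of routing a narrowed-down issue to a specialised
# investigation flow before the iterative AI stage begins; problem types
# without a dedicated flow fall back to a generic one.

def investigate_oom(issue):
    return f"{issue['resource']}: check memory limits and recent usage"

def investigate_generic(issue):
    return f"{issue['resource']}: start generic iterative investigation"

# Known problem areas mapped to their dedicated flows.
FLOWS = {"OOMKilled": investigate_oom}

def route(issue):
    flow = FLOWS.get(issue["reason"], investigate_generic)
    return flow(issue)

print(route({"resource": "pod/web-1", "reason": "OOMKilled"}))
# → pod/web-1: check memory limits and recent usage
```

Routing deterministically by problem type keeps the task definition handed to the model narrow, which is the approach described in this turn.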

Udi Hofesh: So, what does it do? Can you explain?

Andrei Pokhilko: Yes, it’s the principle of giving it eyes and hands. We’re trying to apply this successfully. When we let the machine query the data, there is this RAG (retrieval-augmented generation) pattern of using AI. We are using the RAG pattern, but we modify it. We make it iterative; it repeats. It’s the right thing to do—go and repeat the pattern. After we’ve identified the problem, there’s a small step, but it’s simple: we direct the flow based on the area. Is it the workload of a certain type? Is the context of the problem something we know? For example, out-of-memory deserves its own investigation flow. After we’ve chosen the big route of investigation, we do iterative RAG. The machine iterates, and it helps us understand which piece of data to bring in next. This is something you would see if you used Klaudia on the screen. It shows you the steps it’s taking during the investigation. For example, if you ask it to investigate a deployment, it would start by checking the status of it, querying the YAML or describing the deployment. Then it chooses the unhealthy pod in the deployment, looks at its events, then the next iteration checks its logs. Maybe then it sees that the config map is involved, so it starts querying the config map. Basically, it can go anywhere in the Kubernetes cluster. That’s why it’s so good and powerful—it’s not limited by a narrow scope. It goes wherever it needs to find the information. All that information accumulates in the context of the language model, and with each step, it gets closer to the conclusion of the root cause. Eventually, it gets to it, and we present that as a result. We do some additional clarification on its conclusions to make sure the result is final, and then we present it to the user. It takes a bit of time. You can hear me explaining the steps, and you can imagine it takes a bit of time. But only then, after spending this time and accumulating context, do we get really good quality results.
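The iterative RAG loop Andrei walks through can be illustrated with a short sketch. Here a scripted stand-in replaces the real language model and simple lambdas stand in for cluster queries; none of this is Komodor's actual implementation.

```python
# Illustrative sketch of the iterative RAG loop: the model chooses the next
# piece of evidence to fetch, the evidence is appended to its context, and
# the loop repeats until the model declares a conclusion.

def investigate(resource, ask_model, fetchers, max_steps=10):
    """Iterate: ask the model for the next action, fetch data on demand,
    and accumulate it in the context until a root cause is concluded."""
    context = [f"investigate {resource}"]
    for _ in range(max_steps):
        action, arg = ask_model(context)
        if action == "conclude":
            return arg  # the root-cause verdict
        context.append(fetchers[action](arg))  # accumulate evidence
    return "inconclusive"

# Scripted "model" mirroring the demo sequence: status -> events -> logs
# -> config map -> conclusion. A real system would call an LLM here.
script = iter([
    ("status", "deploy/web"),
    ("events", "pod/web-1"),
    ("logs", "pod/web-1"),
    ("configmap", "web-config"),
    ("conclude", "API rate limit misconfigured in web-config"),
])
fetchers = {
    "status": lambda t: f"{t}: 1/3 replicas ready",
    "events": lambda t: f"{t}: Readiness probe failed",
    "logs": lambda t: f"{t}: HTTP 429 Too Many Requests",
    "configmap": lambda t: f"{t}: RATE_LIMIT=10000",
}

result = investigate("deploy/web", lambda context: next(script), fetchers)
print(result)  # → API rate limit misconfigured in web-config
```

The key design choice is that the loop is open-ended: the model can request any resource in the cluster rather than being limited to a fixed, pre-fetched bundle of data.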

Udi Hofesh: I’ve seen the process, and it’s quite impressive. I’ve got to say, with each step, it seems to provide more accurate results. How does it compare to other products, though, like Splunk or BMC Helix?

Itiel Shwartz: Great question. Let me first answer how we’re different from Splunk, for example. With Splunk, you’ll see the log—the application is crashing, exit code one, a variable is missing or misconfigured. But a good SRE will not stop there. They will go to Kubernetes, use kubectl or K9s, check the pods, the YAML, the config map, maybe Vault. They’ll do a lot of things that aren’t in the logs. Klaudia and Komodor are built to take all this into account. You need the ability to view the config map, the YAML, the events, but you don’t want all that data upfront, because then you’d be confused by too much information. So, our approach is to start with a very simple problem and then give Klaudia the muscle to fetch more data points as needed, rather than throwing all the data on the model. We’re using an iterative process, fetching the data on demand. That’s how we compare to tools like Splunk or even to anomaly detection systems.

Udi Hofesh: I see. And in terms of other comparisons, like with BMC Helix?

Itiel Shwartz: To be honest, I don’t know much about BMC Helix. We’ve done comparisons with top-class solutions like Google’s open-source Gen project, Datadog, and Dynatrace, and I can confidently say Komodor is about 10x better when it comes to solving Kubernetes-related problems. Our answers are better, we don’t hallucinate, and we provide relevant evidence. We have a playground with about 30 different scenarios we keep testing, comparing our solution with others. We’re also constantly experimenting with the prompts and data inside Klaudia, releasing new versions on a weekly basis. As for Helix, I don’t know, but we will add it to our comparison metrics.

Udi Hofesh: That’s very interesting. Now, I know the audience is eager to see Klaudia in action, so let’s go ahead and do a live demo.

Itiel Shwartz: Sure! Let me share my screen. Here’s the Komodor platform. I’m not going to dive too much into all of the platform’s capabilities—how you can see all the clusters, their states, reliability, and issues over time. Instead, I’ll focus on Klaudia. Klaudia works whenever Komodor detects a problem or issue in your system. Let’s look at a service that’s currently unhealthy. This is a Kubernetes workload. You can see not just the current state but the entire history. We have this kind of time machine for your system, so you can go back in time and understand what happened. In this case, you see there was a successful deployment and then a failed one. Once I click on the failed deployment, I see a full analysis by Klaudia. It’s basically telling us what happened—there was an API rate limit misconfiguration causing health check failures, so the deployment doesn’t have all replicas available. The failed condition shows some readiness issues in the logs. Klaudia detected a problem with the API rate limit, checked the config map, and then told us to change this value here in the config map. So, Klaudia did all the analysis for us, simplifying a complex problem. Let’s also do a live investigation on one of the unhealthy pods.

Udi Hofesh: That’s great! What’s happening now?

Itiel Shwartz: Now, Klaudia is starting the investigation. It’s analyzing the YAML, fetching the events, getting the relevant logs from the container, and querying the config map. You can actually see Klaudia working in real time, thinking about what the correct root cause is. It’s now finalizing the conclusion—again, the issue is with the API rate limit. Klaudia tells us to reduce the value in the config map to solve it. Imagine having a tool that, whenever you have a problem, takes years of experience and the entire internet’s knowledge to help you solve the issue as quickly as possible. Whether you’re a Kubernetes expert or not, it works for everyone. This is our vision—to simplify troubleshooting for every kind of Kubernetes problem.

Andrei Pokhilko: That was a good demonstration. I hope everyone noticed that we gave the language model two different routes for the same problem—one from the deployment and one from the pod—but the model is flexible enough to come to the same conclusion through different paths. It quickly settles on the root cause, which is impressive because it shows the model is not deterministic and rigid. It’s flexible, which is cool. It means next time, when there’s a different problem, Klaudia will adapt to that new problem and speak the language of that problem.

Udi Hofesh: Yeah, and what’s really impressive is the speed. Klaudia did all the thinking in real time, and it took about, what, 10–15 seconds?

Itiel Shwartz: Yep, around 15–20 seconds. That’s super fast, and it’s amazing when you think that for most use cases, 20 seconds is just the time it takes to find the right kubectl context and log into the relevant cluster!

Andrei Pokhilko: When we were developing Klaudia, the investigation time was around a minute, and I was telling everyone it’s already cool. Instead of an SRE having to dig through different tools, they could just grab a coffee and come back to a full report. Now, with 15–20 seconds, you don’t even have time to grab coffee anymore!

Udi Hofesh: Well, you ruined the coffee break then!

Itiel Shwartz: Exactly. Maybe just have a quick chat with a colleague instead! Anyway, that’s the reality now. Let me address some questions from the audience—one was about logs. Where are we fetching the logs from? We fetch them from within the cluster. If your logs are routed somewhere else, like AWS CloudWatch or Splunk, we can fetch the data from there too. But natively, our core integration is directly from Kubernetes.

Andrei Pokhilko: And Klaudia doesn’t store all the logs. It doesn’t care about all the logs. It fetches a small snapshot of the relevant logs for the specific issue it’s investigating.

Itiel Shwartz: Exactly. Klaudia takes the relevant logs, config map, events, and other key data, and works with that. It doesn’t need to see everything. We’ve designed it to be efficient in its data collection.

Udi Hofesh: Perfect. Now that we’ve seen Klaudia in action and answered some questions, I want to ask you both—what is your vision or dream for Klaudia? No commitments, just dreams.

Itiel Shwartz: For me, it’s about solving as many problems as we can and providing relevant remediation. The world of troubleshooting is so vast, and we’ve just started poking around the edges. If we can improve investigation and remediation, that’s the dream—someone else doing the hard work for you.

Andrei Pokhilko: For me, as a developer, I want to apply the principle of turning the language model into an active agent, not just a chatbot. Applying it to more situations and use cases, that’s my vision. I love what Klaudia is doing right now. It’s resolving use cases and conditions really well. But I want to build something bigger that addresses even larger problems—capacity planning, for example. There’s so much power in the knowledge contained in these models. We’re only scratching the surface of what AI can do. I want to apply it to more problems, not just troubleshooting. Who knows? Maybe one day we’ll give the machine more freedom. The more I work with this technology, the more I see that the usage patterns are not obvious. We can get to that place at some point.

Udi Hofesh: That’s exciting! The future looks bright for Kubernetes and AI. Now, let’s address one final question from the audience. Does Klaudia have a feature to generate alerts and remediation scripts when there’s an issue with a network policy or communication with pods in a namespace?

Itiel Shwartz: We will be there. This is a core part of the Komodor platform. We have detection and notification features already built in. Klaudia focuses on investigation and recommendations. Komodor does the detection and alerting.

Andrei Pokhilko: Yes, we already have detection and notification features. Now, with Klaudia, we’re just adding that extra layer of investigation to make the content of those notifications much more actionable.

Udi Hofesh: So, we’ve covered the present, the future, and now even the past. I think this is a good place to end the webinar. Thanks again to Andrei and Itiel for showcasing Klaudia and for all the hard work you’ve put into building it. Thanks to Nikki for setting everything up. And for the folks back home, thank you for joining us. We’ll share the recording and deck with all of you. Until next time, see you!

All: Thank you, bye!