Accelerating Kubernetes Intelligence Webinar Deck
Webinar Transcript
Please note that the following text may have slight differences or mistranscription from the audio recording.
Udi: Okay, I think we’re ready to get started. Everyone who has already joined us, thank you and welcome to the Accelerating Kubernetes Intelligence webinar by Komodor. This session is recorded, and the deck and the recording will be sent out to all registrants. We’ll have some time at the end for Q&A, but if you have any questions throughout the session, feel free to drop them in the chat, and we’ll try to answer them as we go along. Today we have two very special guests leading this webinar. Both are from Cisco. Cisco is a Komodor enterprise customer, and they are leveraging Klaudia, our AI platform, within their CAIPE platform. To discuss Cisco’s agentic and platform engineering journey, we have with us Hasith, who’s a CISO and Director of Platform Engineering, and Arthur, who’s an Agentic AI Engineer. Before we dive in, do you want to tell us a bit more about yourself and what you’re doing specifically at Cisco?
Hasith: Yeah, hi, nice to meet everybody. My name is Hasith, and I lead the platform engineering and security functions for Cisco’s accelerator called Outshift. Arthur, do you want to introduce yourself?
Arthur: Yeah, sure. Hi, I’m Arthur. I’ve been at Outshift for around a year and a half, and my main focus is agentic systems. I worked on the initial version of Jarvis as we evolved it into CAIPE.
Udi: All right. Hasith will take us through Cisco’s journey, from the very idea of why they even went on this adventure, through the various stages of testing, evaluation, and implementation, and most importantly, the outcomes and impact the CAIPE platform has had. Do you want to start?
Hasith: Yeah. Feel free to ask questions; if appropriate, I’ll take them as we go. So first of all, what is Outshift? Let me start with that. We are Cisco’s incubation engine. We very much look at what’s coming up in the future, two to five years out, and at how to incorporate those technologies into various products and strategy at Cisco. We are currently involved in two areas. One is quantum, so there are various activities happening in the quantum space. You may have heard about Cisco doing a quantum entanglement chip; that was prototyped in our wider team. The other is agentic AI. We have been doing various things in agentic AI for the last two years, including the release of open-source efforts to help shape and facilitate how a future internet of collaborating agents will work.
Now, before I get into more detail, this is a very high-level overview of what the platform—the underlying SaaS platform powering our cloud delivery—looks like at Cisco for the incubation unit. We have AWS as a single-cloud-provider strategy for speed at the incubator, so we are using a lot of cloud services there. Then we have an edge compute strategy for when you have specialized compute workloads or data concerns, things like GPUs or content processing units, or if you have data stored at the edge. In all of that, there are a lot of Kubernetes clusters in play: in AWS in the form of EKS, and on the edge, we run MicroK8s variants. In a similar vein, you have control components, things like GitHub and GitHub Actions. We have a control and administration account with various functionality like Argo CD, Backstage, and some things around active security and vulnerability scanning. We use Splunk for all of our logs and telemetry. Then, at a high level, we have third-party SaaS functions. We’re using Lightning AI for a lot of MLOps work, so model training and whatnot. And then Komodor comes in to simplify Kubernetes and make it easier to consume across the board. That’s what it looks like at a high level. I’m sure you have very similar platform areas, as the cloud-native world has expanded.
Now let’s think about the whole platform, platform engineering, and where we have ended up. If you go back in time a little bit, we obviously had a split between Dev and Ops; then DevOps and SRE became very popular. You had containers, Docker, Kubernetes, cloud native—everything exploding with a lot of possibilities and a lot of diversity in the open-source space, along with the application architecture that evolved. Moving from a monolithic architecture to microservices, things become quite complex very easily. When you look at a modern application stack, it’s pretty complicated. Because it’s pretty complicated and people are doing things in different ways, platform engineering was introduced as a discipline to streamline this function, which is a bottleneck by design. That’s not necessarily a bad thing, but because it’s a bottleneck, it’s always challenging to make sure it’s working as effectively as possible. We are scaling this with humans at the end of the day, meaning platform engineers, SREs, etc. The important thing is that when this is not functioning well, you’re going to end up with slow releases, a lot of frustration, and burnout, particularly on platform teams. AI and the other changes happening now can really help this model scale further, because otherwise it’s very difficult to do; that’s why, while lots of enterprises are adopting platform engineering, it’s difficult to be successful at it.
This is our story over the last two years or so. I started in this role in January. What I was stepping into was a somewhat burnt-out SRE team, because a lot of things were being pulled in lots of different directions. We did instill a lot of good platform engineering. But what we realized was that for a small team like ours to scale with the complexity of the modern application stack and all the wonderful things we want to do at the incubator, it wasn’t going to be enough in terms of manpower. This is where we were very much thinking about how to apply AI, also because of the agentic focus. This wasn’t a top-down project; it was very much a grassroots project driven by our need to do things better. We started this project called Jarvis internally. A fun fact: it was literally in our heads back then. It was very much, “Hey, can we do platform engineering like Tony Stark building the suit in the first Iron Man?” That was the quick one-liner vision back then. We started very small. Arthur was doing some exploration work. We had a couple of internships. Then, when we had to change the workflow engine, we used LangGraph. We combined all of this and created a multi-agent system towards the end of 2024, which was very successful. Obviously, a lot happened at the beginning of this year: things like MCP exploding; AGNTCY, which is our own work around the Internet of Agents that has been open-sourced under the Linux Foundation; and A2A from Google. We decided to make this a community-driven initiative under the group we joined, Cloud Native Operational Excellence, by forming a Special Interest Group around it. Around July this year, we open-sourced this effort as CAIPE, pronounced ‘cape,’ as in a superhero’s cape, standing for Community AI Platform Engineering.
Let me quickly give you an overview of what this looks like internally. This is a quick view of our Backstage instance, our internal developer portal. If you click the icon on the bottom right-hand side, that gives you a chat window like this. Now you can interact with the agentic system. If you ask what it can do, it will reply saying it can handle a lot of different things, from dev and research elements to setting up your CI and CD pipelines, and other requests like “I need an LLM key,” “I need a development machine,” “I need a Kubernetes cluster,” “I need access to something,” “I need to go troubleshoot something,” “I’m trying to find information and want to ask a question, like an FAQ,” or “I’m trying to correlate different data and build some insights.” It’s very capable in terms of what we have been able to employ it for, and this spans the entire development and CI/CD life cycle and a lot of functionality around it.
On the open-source project, the vision is to do this with the community, for the community, staying with the cloud-native ethos. Every enterprise faces a certain level of these modern SaaS and workplace-productivity problems. Being able to share these things and learn and evolve together is beneficial. Do check out the project; it’s CAIPE, and you’ll find docs, GitHub, etc. At a high level, how does this look? You have the developer and other personas on the left-hand side. You can interact with the agentic system through many interfaces. It’s all standardized with A2A. We use Webex for internal communication, so you can access it through your instant messaging. The portal I showed you was the internal developer portal on Backstage. You can assign the system Jira tasks and go back and forth on those. Agentic systems are very good at that, because if the information is not complete, the system will go back and forth with the user in order to complete it. You also have more developer-friendly access through things like VS Code and the command line.
In terms of the functionality, as I mentioned, there is a lot of functionality around knowledge bases: things like documentation, playbooks, or anything like that that’s been ingested so RAG can be performed. There is live tool calling: things like accessing a Kubernetes cluster and finding information, PagerDuty to find out who’s on call, Argo CD to figure out the state of the deployment pipelines, or Git for PRs and other GitHub Actions-related things. We’ve also been very successful using Graph RAG. In a lot of cases, you have separate information sources that are fragmented. By putting all the important things in a graph database and capturing the relationships around them, you are able to derive a lot of insights automatically. In fact, the open-source project has a pattern around that, including the RAG piece, the Graph RAG piece, and agents for automatically building relationships between the data being ingested. And then there’s self-service. If you think about it, the holy grail of platform engineering used to be getting to a form the user can fill in and a button they can click. You should still do that and strive for it. But what an agent can really add is making sure the user gets to the right form. When there’s back-and-forth to fill in the right values, an agentic system can do wonders by providing that clarity, understanding the user’s needs, and making sure the form is filled in and the actual request is done properly. That’s where it’s a game-changer. It’s not that you stop doing all the good platform engineering work you have done; it’s about doing all of that, and then using agentic AI to elevate that capability.
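The Graph RAG idea described here can be sketched in a few lines. This is an illustrative toy, not CAIPE’s implementation: the node names, relations, and sources are hypothetical, and a real system would use a graph database and an ingestion pipeline rather than an in-memory dict.

```python
# Illustrative sketch (not CAIPE's actual code): a tiny in-memory "graph"
# linking fragmented records from different sources -- a service catalog,
# Argo CD, and PagerDuty -- so relationship questions can be answered
# with a simple hop-walk instead of manual cross-referencing.
from collections import defaultdict

class TinyGraph:
    def __init__(self):
        # node -> set of (relation, neighbor) edges
        self.edges = defaultdict(set)

    def add(self, src, relation, dst):
        self.edges[src].add((relation, dst))

    def neighbors(self, node, relation):
        return sorted(dst for rel, dst in self.edges[node] if rel == relation)

# Ingest records from separate, fragmented sources (all names hypothetical).
g = TinyGraph()
g.add("svc:checkout", "deployed_by", "argo:checkout-prod")   # from Argo CD
g.add("argo:checkout-prod", "owned_by", "team:payments")     # from a catalog
g.add("team:payments", "on_call", "user:arthur")             # from PagerDuty

def who_is_on_call_for(service):
    """Walk service -> app -> team -> on-call engineer."""
    for app in g.neighbors(service, "deployed_by"):
        for team in g.neighbors(app, "owned_by"):
            for person in g.neighbors(team, "on_call"):
                return person
    return None

print(who_is_on_call_for("svc:checkout"))  # -> user:arthur
```

The insight the talk points at is that none of the three source systems can answer this question alone; the relationship only exists once the records share a graph.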
Now, in terms of what we have been able to accomplish with this: we are not a very big team, about 10 of us in the SRE type of function. We used to have three engineers dedicated to a support desk for everything that was going on. We’ve been able to leverage the agentic systems to handle most of the toil and operational work, allowing the team to focus on innovative and creative work. Query responses, where it used to take hours for somebody to even notice a question had been asked, can happen within seconds, because you have an agentic assistant at your fingertips. People don’t usually go through FAQs; most of the time they will just ask, even if you have the documentation. Certain tasks like “I need an LLM key,” “I need a development machine,” “I need access to something,” “I need a new repository with the pipeline set up”—tasks that used to take half a day or a couple of days—are now end-to-end automated and done within minutes. Where the Kubernetes piece really comes in is when you leverage these troubleshooting capabilities. You can get up to an 80% reduction in mean time to recover, because most of the time the challenge is finding that needle in the haystack, and these agentic systems are really good at finding that information quickly. When you get woken up at 3:00 a.m., if an event-driven agentic system has already done the investigation and has a summary telling you what happened, you cut the mean time to detect, and the mean time to root-cause the problem and recover can really improve.
Moving on: back in February this year, we had this type of sandbox experience for the incubation unit. As a developer, you could prompt, “Hey, build this app for me, push this to my sandbox, generate the configuration for it, and then deploy it.” Then, obviously, if it failed, say because you didn’t get the ports right, or you didn’t do the resource allocation properly, or you have a bug, you’d typically have to go and troubleshoot it directly. You’d check it in the Komodor UI and troubleshoot with Klaudia, which we were among the first users of when it was released. That was the experience. Arthur will explain more. What we have done is: from a developer point of view, you still give high-level prompts like, “Hey, build me this app, push it to my sandbox, generate the configuration for it, deploy it, and I want a successful deployment.” What’s interesting is that when you hand it over to the system, it will obviously attempt it, but it also has access and the ability to talk to other intelligent systems like Klaudia. If the deployment failed for some reason, it would say something like, “Hey, the deployment was successful, but in this namespace these pods are starting but not staying in Running; they go into CrashLoopBackOff. Can you investigate and suggest what should be fixed?” I have paraphrased that, because agents typically give a lot more context in the prompts they use to make the answer more correct. Then there’s a sub-agent, implemented as part of the CAIPE effort, which talks to Klaudia and is able to troubleshoot that. Now I’m going to hand over to Arthur. He’s going to explain how some of that piece is done, which is today’s talk, and then we’ll go into more details.
Arthur: Sure. Sounds good. Thanks. Am I able to share my screen? Let me just get my screen sharing going. Cool. So this is a live demo of CAIPE. Something we often do as SREs is interact with Argo CD and get our deployments sorted if they’re failing. So I can ask, “Show my CAIPE apps in Argo CD and derive the service names; debug with Komodor if any are failing.” CAIPE is a deep agent, which means it can do longer-horizon tasks involving planning. You can see at the top it’s actually started streaming the requests for CAIPE, because we have three deployments across our environments. Because it’s a multi-agent system, it can call these different agents in sequence, starting with Argo CD; then, once it has the information, it can go to Komodor and Klaudia to debug. This involves several agents, so it might take a bit of time. We also have other agents like PagerDuty and Jira. You might ask, “Okay, who’s the SRE on call? Can we get their tickets?” And then Komodor can help with some of the troubleshooting, and Git for more complex scenarios.
In the streaming, we can see that it’s done some analysis. Once the agent responds fully, we should get a nicely formatted response. It is on the final step now, synthesizing the findings. We’re also using streaming, so you can see when the sub-agent tools are activated. We can see here it found three applications for CAIPE. They’re healthy, but one of them has some issues. If I look at the dev environment, which is typical of dev—there’s always some issue there—we can see that some of the containers aren’t ready, plus information from Komodor as well. We were able to implement this through the Komodor API. Komodor helpfully has an OpenAPI spec. What we’ve done is automate the generation of this agent, because the OpenAPI spec has a lot of information, and agents have come a long way since the initial days; they can reason with a lot more tools now. So we wrote this OpenAPI MCP code-gen tool. It processes the OpenAPI spec, generates an MCP server, and it can also generate an agent. From the OpenAPI spec of Komodor, we generated the agent. As part of this, we also implemented the ability to, for example, jump to the Komodor UI from the chat. This is really helpful if you run into an issue and want to explore more in the UI.
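The code-gen idea Arthur describes, deriving agent tools from an OpenAPI spec, might look roughly like this. This is a hedged sketch, not CAIPE’s actual generator: the output tool-dict shape and the example endpoint and parameter names are assumptions for illustration; only the OpenAPI field names (`paths`, `operationId`, `summary`, `parameters`) follow the real spec.

```python
# Illustrative sketch: walk an OpenAPI spec and emit one "tool"
# definition per operation, the kind of record an MCP server or agent
# framework could register. The tool-dict shape is an assumption.
def openapi_to_tools(spec):
    tools = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            tools.append({
                "name": op.get("operationId",
                               f"{method}_{path}".replace("/", "_")),
                "description": op.get("summary", ""),
                "method": method.upper(),
                "path": path,
                "params": [p["name"] for p in op.get("parameters", [])],
            })
    return tools

# Hypothetical fragment standing in for a real vendor spec.
spec = {
    "paths": {
        "/services": {
            "get": {
                "operationId": "listServices",
                "summary": "List monitored services",
                "parameters": [{"name": "clusterName", "in": "query"}],
            }
        }
    }
}
tools = openapi_to_tools(spec)
print(tools[0]["name"], tools[0]["params"])  # listServices ['clusterName']
```

Because every operation carries a machine-readable description and parameter list, the generation step is mechanical, which is what makes regenerating the agent whenever the spec changes practical.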
We’ve also done evals and benchmarking. One thing we introduced is creating a data set for Komodor. To benchmark this agent, we ran some test queries and made sure they got added to a test data set. You can see here I’ve done some evaluation runs for Komodor. I made some tweaks to ensure that the accuracy increases over time. I had 21 test cases at one point, because I was comprehensively ensuring that all aspects of the Komodor API are covered by this agent. We’re using Langfuse for the evaluation. The key here is that we built a golden data set with the queries we expect for the agent, and then we’re able to run the evaluation. This is an example of how we evaluated it. You can give some general knowledge, like “we want to use this cluster for evaluation,” and then you can go through each tool individually and give expected prompts and what the response should be. That’s how we build the data set.
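A golden-data-set evaluation of the kind described here can be sketched as below. This is illustrative only: the real setup uses Langfuse, and the queries, expected tool names, and the keyword-routing `fake_agent` stand-in are all hypothetical.

```python
# Illustrative sketch of a golden-data-set eval loop: each case pairs a
# user query with the tool the agent is expected to pick; accuracy is
# the fraction of cases where the agent's choice matches the expectation.
GOLDEN = [
    {"query": "list my services in the prod cluster", "expect": "listServices"},
    {"query": "who is on call right now?",            "expect": "getOnCall"},
    {"query": "show failing deployments",             "expect": "listDeployments"},
]

def fake_agent(query):
    """Stand-in for the real agent: naive keyword routing."""
    if "on call" in query:
        return "getOnCall"
    if "deployment" in query:
        return "listDeployments"
    return "listServices"

def evaluate(agent, cases):
    """Return accuracy of the agent's tool choices over the golden set."""
    hits = sum(agent(c["query"]) == c["expect"] for c in cases)
    return hits / len(cases)

print(evaluate(fake_agent, GOLDEN))  # -> 1.0
```

The value of the golden set is the trend line: rerunning the same cases after each prompt or tooling tweak shows whether accuracy is actually increasing over time, as Arthur describes.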
We also did a SLIM integration, which is Secure Low-Latency Messaging. That matters when your agents are distributed: say you have an enterprise agent, and your Komodor agent is running somewhere else, closer to the clusters. You want improved security on individual conversations instead of just HTTPS. If you want better encryption, you can use the SLIM protocol, which is also developed by AGNTCY. We implemented that with the Komodor agent as well. It’s been a really amazing tool for doing a lot of this exploration and investigation, which we’re using for debugging. My final points are around the architecture of CAIPE. If you’re interested, the way it works is: we have a supervisor agent that’s a deep agent, we have these sub-agents (one of them is Komodor with MCP tools), and there’s also a reflection agent. That’s quite useful if we want to make sure the whole task is complete. A few months ago, we migrated to deep agents. This is where the planning comes in, and that’s the direction we’re going. On that note, we have a question from the audience.
Udi: Venkata is asking what framework was used to build this multi-agent AI system.
Arthur: Sure. I can go to the repo. CAIPE, as a project, can be run as a distributed multi-agent system or in a monolithic approach. The important thing is that things are abstracted into A2A and MCP, so you can use different frameworks. We have used a lot of LangGraph in many places, because that was how the initial version was built. There are some examples with Strands as well. But it doesn’t matter which framework you write something in, as long as you abstract it to either A2A, or to MCP if you’re doing tool calling.
The structure is fairly straightforward. We have our agents and our multi-agents, as well as a knowledge base. You can see there are a lot of sub-agents, and it’s quite easy to add one for a particular function. We also have a common library to help you write your agents, and there’s a generator tool: if you have some functionality that’s not implemented yet, you can take an OpenAPI spec and generate an agent, and there’s example code showing how to do this. We use Komodor in the blog, but you can also use other OpenAPI specs. The main thing is the A2A integration, because we’re a distributed system, but we also support adding your agent in-process as a regular sub-agent, like in LangGraph. If it’s distributed, it can be in any framework. I hope that answers the question.
Udi: Is there anything else you wanted to show, Arthur? Because I have a couple of questions myself.
Arthur: No, I think that’s on my side.
Udi: Hasith, you mentioned a burnt-out SRE team when you joined Outshift. How has that changed for the SRE team? How is their life different now?
Hasith: That has changed pretty dramatically. Most of it boils down to good platform engineering. We had, and I’m not even joking, five different ways of doing Kafka. That’s a message bus; you don’t need that. You need a good pattern, but you should not be spending your innovation cycles trying to do things around message buses. We had a similar sprawl of AWS accounts. When I started, there were more AWS accounts than engineers in the wider organization—literally a couple of hundred. We’ve come back down to, I think, under 11 accounts right now, with three main accounts. These are patterns, but you have to think about it: just because you can do it doesn’t mean you should. You need more than one, but it’s not 200. There was a similar story with Kubernetes: we had a lot of clusters without much sharing, partly because people were trying to do things and didn’t want to step on each other. Those are the types of things we’ve changed. These are technical things; beyond that, it’s certainly how the team thinks and operates, a lot of functional things, and then the agentic and AI piece enables us further. That’s how this is playing out.
Udi: You also mentioned the growing complexity of Kubernetes. You have a lot of clusters on EKS, but also edge clusters. Firstly, what is a ballpark estimate of the scale of your Kubernetes infrastructure, and then how would you say Komodor and Klaudia helped simplify it for the wider team?
Hasith: The interesting thing is that when I started, there were a lot more clusters. We’ve actually brought the number down considerably. We still have a couple of tens of clusters, but we consciously keep it down because it’s a cost; even with automation working, it’s another thing that can go wrong. The main thing with Klaudia and Komodor is really the visual aspect: knowing where your clusters are, making it easy for people to log in. You also get access and the other pieces streamlined across the setup, so you can have a very good story around how RBAC and access management are done. When it comes to troubleshooting, people are able to self-service those things; that’s for the developer persona. When you look at the SRE and operations persona, a lot of teams don’t really keep Kubernetes up to date; when you scale really high, people find it hard to do. It certainly helps us stay nimble with that. We’ve been very good at having fewer clusters and making sure the operational things happen. It’s really helpful from that point of view.
Udi: Yeah. And what gave you the confidence to implement Klaudia? I know Arthur mentioned a lot of the validation steps and the LLM-as-a-judge component in there as well, but if you had to name a few key things that said, “Okay, I can run this in production, and this can be something useful and not dangerous”?
Hasith: I think the main thing for us was being in the incubator—you have a bit more risk tolerance, we can take risks to some degree. That was one part. Then, we also knew there was no other way out. Modern platforms are not getting simpler. Even some of the consolidation efforts were not small; they were very difficult to pull off in order to simplify the problem. We had to come up with creative and innovative ways to get those advantages, even if they’re a bit risky. The more we did that, the confidence built iteratively. If you look at 2024, we started very small. The efforts were separate; it wasn’t a grand vision. But then everything fit into place, and when you get some success and some failures, you can build on it and take more risk with it. For example, I don’t think Arthur had any doubts when he approached it, but go ahead.
Arthur: Yeah. No. Once I started using Klaudia through the agent, I realized how naturally CAIPE and Klaudia could collaborate to solve a problem, because the response from Klaudia is text, and CAIPE knows, “Okay, this is the information you need; you can trigger the RCA and then go back and forth.” I think it worked quite well, and it helped that the API was documented, because the LLM does need to know what it’s doing.
Udi: For a developer now, would you say that the response or the experience is that they are more self-sufficient, they can troubleshoot things on their own, whereas in the past it had to fall on the SREs or your team?
Hasith: Yeah. Because we had the sandbox environment set up, and we did workshops on “here’s how you deploy your container, this is how you use Klaudia to debug it,” I think we got good reviews. There are two elements there. Typically, when someone is trying to deploy something, people think, “Oh, it’s an SRE problem,” and they go ask our engineers, and you get those types of questions. That’s one. The agentic interface is really helpful there, because you have almost given everybody their own SRE they can ask questions. There’s no such thing as a stupid question; they can just bombard it, and it’s almost like it gives you pointers: “Hey, maybe you should try this. Let me help you with that. Just fill in this form for me if that’s what you’re trying to do.” That really makes it a very big game-changer. Absolutely.
Udi: What is your vision for what’s next for CAIPE, or for the future of agentic AI or platform engineering?
Hasith: A couple of aspects. Starting with Komodor and the interaction we showed, it’s really the native agent-to-agent collaboration. What we may end up seeing is intelligent systems talking bidirectionally within their parameters and safeguards, not just unidirectionally. You could almost imagine a future in which, if Klaudia is trying to do something, Klaudia could ask, “Hey, I’m trying to do this. There’s something funny going on here. Can you tell me a bit more about this?” You can have that collaboration. You could have these agents being discovered and things happening dynamically. The collaboration aspects of machines doing things are super interesting. What we have seen so far are assistant agents, whether it’s coding assistants or even the examples we showed; it’s primarily a human-assistant type of thing going on, which is good. We’ll see those patterns for years to come, and there is a place for them. Another thing we are seeing is completely autonomous and long-running agents. It’s almost like you delegate a specific objective, or it’s, “Hey, when this event happens, do this.” You’re going to delegate those types of activities to these autonomous systems. I think we are still some time away from all of this becoming mainstream, but the interesting thing is that there’s a lot of productivity to be gained from these patterns. That’s what I’m most excited about—this combination of doing the assistant type of things really well to get that 10x capability. If you do it wrong, you don’t realize that gain, because you can create a lot of mediocre code very quickly. Many people write code, and then projects become unsustainable, and there’s a lot of cleanup, or it blows up in some way. How do you realize the actual productivity? When you start delegating to autonomous systems, you really go beyond. I could almost say it’s like 100x, because you have things working 24/7 on whatever objectives you’re trying to achieve. That’s going to completely unlock a lot of possibilities and capabilities.
Udi: Yeah. This also aligns with Komodor’s vision for Klaudia: to make it more autonomous and gain confidence with customers, which will allow it to actually be self-healing and act as an extension of the Ops teams, not just an assistant. I very much agree that this is the future; at least, that’s what’s exciting about it for us. There’s another question from Venkata about how you fine-tune your agents to ensure they work as expected. Is it prompt engineering, or is there more to it?
Hasith: There are several things. Maybe Arthur has some really good ones he’s working on that he can touch on; I’ll add a bit more after. Go ahead.
Arthur: Yeah, sure. We do it at all levels of the stack, from system prompts to context engineering. Our RAG system is really helpful here, because if the agent needs to do a set of tasks, we’ve got our playbooks in our docs. The RAG fetches the playbook, puts it in the context, and now the agent knows what it needs to do to, say, debug an application. At the same time, we’re leveraging LangGraph in a more deterministic way. We’ve implemented this concept of a ‘task config,’ which means that if you have a known process, you can encode it as a set of steps the LLM is guaranteed to follow. That’s quite useful; it sounds similar to playbooks, but we’re doing it in an encoded format to give the LLM guarantees. There’s also exploration in neuro-symbolic AI, where we’re using approaches from symbolic AI to see whether we can get our deep agents to learn patterns from data and then encode them in a way that can be executed reliably. There’s a lot of work in that area.
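The ‘task config’ idea, encoding a known process as an ordered set of steps the system is guaranteed to follow rather than leaving it to free-form planning, might be sketched like this. The task name, step names, and handlers are all hypothetical; CAIPE’s actual encoding is not shown in the talk.

```python
# Illustrative sketch of a 'task config': a known process encoded as an
# ordered list of steps executed deterministically. An LLM could fill in
# each step's content, but the sequence itself is guaranteed.
TASK_CONFIG = {
    "debug_application": [
        "fetch_deployment_status",
        "collect_pod_events",
        "summarize_root_cause",
    ],
}

# Hypothetical step handlers; each reads/writes a shared context dict.
HANDLERS = {
    "fetch_deployment_status":
        lambda ctx: ctx.setdefault("status", "CrashLoopBackOff"),
    "collect_pod_events":
        lambda ctx: ctx.setdefault("events", ["OOMKilled"]),
    "summarize_root_cause":
        lambda ctx: ctx.setdefault(
            "summary", f"{ctx['status']} likely caused by {ctx['events'][0]}"),
}

def run_task(name, ctx=None):
    """Execute each configured step in order -- deterministic by design."""
    ctx = ctx or {}
    for step in TASK_CONFIG[name]:
        HANDLERS[step](ctx)
    return ctx

result = run_task("debug_application")
print(result["summary"])  # CrashLoopBackOff likely caused by OOMKilled
```

The contrast with a plain playbook in RAG is that a retrieved playbook only suggests steps in the prompt, while an encoded config is enforced by the executing code.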
Hasith: Touching on a couple of other things: it’s “garbage in, garbage out.” That’s true. If you just point everything at a ton of outdated wikis and all sorts of things, you’re not going to get a good outcome. In fact, when we initially did the talks, we had an effort to redo some of the docs. Less quantity and more quality is better than pointing it at a lot of outdated things. That curation role, making sure the data is good, could be agent-assisted, but it’s an important aspect. This even applies to building graph relationships: if you are doing it at very large scale and it’s very complicated, you may want oversight and to think about those aspects beyond simple automation. Other things: as Arthur mentioned, you obviously have the generative models and techniques. You can consider the type of model you use, like reasoning models for more complex things; figure out how you do the context; then use smaller models that end up being cheaper, faster, and more accurate when the problem is more defined, versus something very general purpose that has a higher probability of hallucination. The point is, it’s not just about an AI; there are a lot of other disciplines in machine learning. Things like symbolic AI can be used hand in hand to create better systems.
Arthur: Just one thing I’d add is evaluations. I think that’s one of the biggest tools in your toolbox. Langfuse and LangSmith are both tools we’ve used, and they work fairly well.
Udi: Another question for you both is how do you show the value internally of the platform you’ve built? How do you display that it was actually worth the effort and time and money invested into it?
Hasith: From our side, the value was clear, but this wasn’t funded separately. We had to find the time and resources to do it; it wasn’t like somebody wrote us a check. We’re in an incubation unit, but we were serving a lot of things and were too busy to even think about innovation. “How do we get ourselves out of it?” That was our motivation to pursue this. When you start small and show success, you’re able to get buy-in and opportunity. Particularly with how fast things are evolving, the main thing is: start small, show actual value, and iterate from there. Obviously, it helps to get the right leadership buy-in. Then, measure what matters; don’t measure the wrong things. Depending on the organizational culture and setup, sometimes before you start something you have to show how the KPIs will work and how the value will be created. If you’re trying to do something very new and very risky, that’s very difficult; the conversation is a non-starter. So, in some ways, avoid having that conversation: get some proof and ammunition first, and then you can win more investment by going in and showing value.
Udi: Yeah. Not easy, but nobody said it was easy. It looks nice on that slide you showed, with very clear milestones, but I imagine it was more complicated than that. Okay, so we’re almost out of time. Is there anything you want to say as closing remarks? Any summary or things you want to leave the viewers with?
Hasith: It was a pleasure talking to all of you. I would say the main thing is that there’s a lot of hype around AI, but it also has a lot of transformational properties. The important thing is to embrace, learn, and navigate, because while these systems are capable, humans will have to work alongside them for at least the next two decades. I think we’ll learn a lot, and the world will evolve in ways that make so many things possible, which is the optimistic side of things. Anything you want to add?
Arthur: I can echo that. Also, these tools in the SRE domain are super helpful. Do get involved, either on the open-source side or with Komodor, because anything that can help to take the burden off the SRE is really good for organizations.
Udi: What would you say to people from organizations similar to Cisco in scale or complexity of their infrastructure? What would you say to those who are evaluating an AI solution? We know it’s buzzworthy right now. Maybe they’re considering Komodor and Klaudia. Maybe they’re considering other solutions, building in-house, something open source. What would you say are the key things to consider, or why are you using Komodor and not something else?
Arthur: I can answer the Komodor side first. I think the quality of the tool matters. I can go out and find a hundred MCP servers that claim to do something, but when you actually try to use them, they don’t work. Whereas with Komodor, it works reliably. In the more general sense, my advice is to use an open-source framework like LangChain, Deep Agents, or CAIPE, because it gives you the flexibility of choosing the things that work for you.
Hasith: One thing I would say is, when you decide “build versus buy,” the main thing is doing a Proof of Concept (PoC). Once you get good at some of this, getting to 60% to 80% accuracy is not too bad when you have people who know what they’re doing. But getting from 80% to 95%, or from 97% to 99%, is hard. Getting the evaluation and all the other aspects of these probabilistic systems right is quite hard. That’s when I think the buy option makes a lot of sense, if you can spend the money. Going from a PoC that reaches a certain accuracy to a very well-functioning system is a challenge; it takes effort, and you need to be prepared with a strategy if you’re trying to do it on your own. This is where the buy option makes sense. Also, if there are opportunities to do it together, the community aspect makes sense as well, because you can learn together and converge on patterns that work. That’s good.
Udi: I think we’ll end on that note. Thank you so much for joining us. Thanks to everyone who is watching us live or watching the recording. As I said at the beginning, we’ll share the deck and the recording with everyone. Feel free to reach out to Arthur on LinkedIn. Or do you have any other places where people can find you that you want to share?
Arthur: LinkedIn is perfect. CNCF also works.
Udi: So thank you both again so much, and till next time. Bye now.
Hasith: Thank you. Bye.
Arthur: Thank you. Bye.