Komodor is an autonomous AI SRE platform for Kubernetes. Powered by Klaudia, it’s an agentic AI solution for visualizing, troubleshooting and optimizing cloud-native infrastructure, allowing enterprises to operate Kubernetes at scale.
Webinars
In this webinar, Komodor experts discuss the inherent complexities of deploying and monitoring self-hosted SaaS applications across diverse, customer-controlled Kubernetes environments. They explore the transition from traditional, API-driven observability methods to an “Agentic AI SRE” approach, highlighting how AI agents can autonomously troubleshoot and remediate issues when provided with the right context. Finally, the session covers practical solutions—like environment blueprints and localized knowledge bases—that give AI tools the necessary guardrails to move from reactive firefighting to proactive, automated incident resolution.
Speakers: The session is led by a host and features Mickael (R&D Tech Lead at Komodor) and Nir (from Komodor’s CTO and AI teams), who share their deep expertise in Kubernetes and AI-driven troubleshooting.
Focus: The core focus is addressing the steep observability and management challenges of self-hosted SaaS applications deployed in diverse, disparate environments.
Core Concepts: Key concepts include overcoming observability fatigue, the shift from traditional API-driven monitoring to Agentic AI SRE, and effectively managing LLM context limits using sub-agents.
Includes: The discussion includes real-world examples of Kubernetes configuration nightmares, strategies for building lean and modular agents, and deep dives into features like dynamic blueprints and knowledge bases.
Wrap-up: The webinar wraps up with a Q&A session on proactive self-healing capabilities, establishing trust and guardrails for autonomous AI actions, and an optimistic look at a future where AI handles routine firefighting.
Please note that the following text may have slight differences or mistranscription from the audio recording.
Udi: Welcome Pablo, welcome everyone. We have 260 people registered for this webinar, so it will be a packed discussion today. Welcome Ashkan, Amin, Nicola, Daniel, and Chi. I think we are ready to get started.
The title of this webinar is “Conquering the Complexity of Self-Hosted Apps with Agentic AI SRE.” Just a reminder that this session is recorded, and we will share the recording and the deck with everyone who registered. If you have any questions throughout this discussion, feel free to drop them in the Q&A below, and we’ll get to them either at the end or answer them live if it works out.
Today, we are going to discuss the many challenges of hosting a SaaS product in different environments. If you work at a SaaS company, work in infrastructure, or have ever been on call, this webinar is for you. To talk about this, we have Mickael and Nir, two highly experienced people on this topic. Please introduce yourselves and tell us what you do.
Mickael: Hi everyone. I am Mickael. I have worked for the past five and a half years at Komodor as the R&D Tech Lead. I come with over a decade of experience managing Kubernetes clusters and infrastructure, as well as architecting high-scale backend applications and data pipelines.
Nir: I’m Nir, part of the CTO team and the AI team. I work on Klaudia, our own agent that handles investigations, self-healing, and other AI-integrated flows within the system. Happy to be here.
Udi: Happy to have you both. Let’s set the stage. What is the big issue with self-hosted apps? What do we even mean when we say self-hosted, and who is this for?
Mickael: I can start with that. In the context of this discussion, self-hosted applications mean that, as a SaaS company, you have components that must be deployed on-premises: for example, on a customer’s own Kubernetes cluster. This is a standard pattern, known as Kubernetes operators or agents, and it is mostly what we are going to focus on today.
Our core issue is that, as a SaaS company, you try to ship one integrated product. However, because we require this Kubernetes operator or agent, it ends up running in hundreds of different environments that we do not control. That creates a massive burden of support. It requires support engineers, platform engineers, SREs, and constant communication with the person handling the environment to understand the issues that arise.
You might ask, “Isn’t there a finite set of ways to deploy things on Kubernetes? Can’t we just document everything?” The answer is very complex. Even if customers use the same cloud provider and the same Kubernetes version, your operator will behave differently due to networking, scale, security requirements, and a multitude of other factors that differ for every user.
Udi: Let’s talk more about how these environments differ. Can you share specifics on what exactly is different about each environment, why it is problematic, and any real-life stories of how it plays out?
Mickael: I definitely have hundreds of war stories from being on call. Environments differ in many ways: the Kubernetes version itself, CNI (networking plugins), storage drivers, underlying node architecture, machine sizes, and different permission schemes like configuring RBAC and service accounts.
If you are installing a Kubernetes operator, it will likely use a service account attached to a cluster role with predefined permissions. Customers can choose to allow or deny certain things, which affects what the operator can do. Networking is another major issue. Network policies, proxies, DNS configurations, and different cloud providers handle networking differently.
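The service-account pattern Mickael describes can be sketched as a minimal RBAC manifest. All names and the permission list here are illustrative, not Komodor's actual chart:

```yaml
# Hypothetical operator RBAC: the operator runs under a ServiceAccount bound
# to a ClusterRole. A customer who trims the verbs or resources below changes
# what the operator is able to do in their cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-operator
  namespace: example-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]   # read-only; customers may deny more
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-operator
subjects:
  - kind: ServiceAccount
    name: example-operator
    namespace: example-system
```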
For a real-life example, we have a customer using a non-standard CNI like Cilium or Calico. Every communication inside the cluster—even pod-to-pod communication—goes through that network interface. Depending on how it is configured, communication might just fail. Our operator would try to fetch data from the API server and randomly fail with a generic “connection refused” error. To fix it, you first have to realize they have a different network interface, and then you need the ability to read its configuration or logs to understand what went wrong. It becomes an intense, low-level investigation.
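A failure like the one Mickael describes can come from something as small as a default-deny policy. This illustrative manifest (names hypothetical) would silently cut off an operator's egress under a policy-enforcing CNI:

```yaml
# Hypothetical default-deny egress policy. With a CNI that enforces
# NetworkPolicy (Cilium, Calico, ...), pods in this namespace can no longer
# reach the API server unless a separate allow rule exists; the operator
# just sees a generic "connection refused".
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: example-system
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress: []             # no egress allowed at all
```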
Nir: Every cluster is different by design. You need to adapt it to your company’s own needs and tech stack. You need to know how to navigate these differences. For example, many customers use internal proxies for outbound traffic. You need to know how to handle that if you want to provide service to them, and that’s exactly what creates the problem we are talking about.
Mickael: It’s similar to development and staging environments. You always aim to have something as close to production as possible, but there is always a gap. Take storage, for example. You might have three different read access modes for your volume, plus five different variables and annotations for your specific driver. That immediately presents over 200 possible configurations just for one storage component. Combine that with every infrastructure component on a Kubernetes cluster, and it becomes a monstrosity.
Udi: What about monitoring tools? We use them, our customers use them. Shouldn’t we know exactly what’s going on all the time?
Nir: There are thousands of monitoring tools, and you have to know how to work with all of them. And even if you monitor heavily, network issues, for instance, are hard to extract metrics for.
Mickael: You might face different naming conventions, lack access to the metrics, or deal with different observability providers. Even within open-source standards like OpenTelemetry, you have different versions with breaking changes. We have metrics on our own operator, but we cannot collect everything by default. The first step of monitoring efficiently is knowing what you need to cover, but because there is so much, it’s difficult.
Udi: It sounds like a never-ending loop. The more tools you add to simplify things, the more complexity and fragmentation you introduce. How do we overcome this?
Nir: First, you need to be prepared. Understand the possibilities and the environments you will face. Test your deployments on bare Kubernetes, OpenShift, and different clouds. Find the weak points that will break your agent. Collecting this information and running these experiments makes your deployment more resilient.
Mickael: I completely agree, but I want to add that there is a real observability fatigue. It’s easy to assume you should collect as much data as possible, but that often causes more problems because you end up with massive amounts of irrelevant data.
I treat observability as a product. You need to think about what your application is doing, what the user flows are, and then answer a simple question: “How do I know it’s working?” For example, if I have an operator that sends data to an API, the main question is simply: “Am I able to send data?” If not, I only focus on collecting the data necessary to understand why. Everything else is irrelevant.
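Mickael's "am I able to send data?" framing can be sketched as a tiny health tracker that records only the context needed to explain a failure. Everything here is illustrative, not Komodor's code:

```python
import time

class SendHealth:
    """Track the one question that matters for an exporter: am I able to send data?"""

    def __init__(self):
        self.last_success = None
        self.last_error = None

    def record_success(self):
        self.last_success = time.time()
        self.last_error = None

    def record_failure(self, err, context):
        # Keep only the context needed to explain *why* sending failed.
        self.last_error = {"error": str(err), **context}

    def healthy(self, max_age_s=300):
        return self.last_success is not None and time.time() - self.last_success < max_age_s

h = SendHealth()
h.record_failure(ConnectionRefusedError("connection refused"),
                 {"endpoint": "api.example.com:443", "proxy": "none", "dns_ok": True})
assert not h.healthy()
h.record_success()
assert h.healthy() and h.last_error is None
```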
Nir: It is also a matter of cost. You don’t want the customer to pay for heavy network latency, and you don’t want to store too much data. By design, you should make the agent in the customer environment as lean as possible. Less code means less friction, fewer bugs, and more resilience.
Mickael: Exactly. Cloud monitoring solutions can be very expensive. Plus, you have to account for the human cost. If you make the observability data too complex, you either spend too much time trying to read it, or you have to pay a highly experienced engineer to decipher it.
Udi: Before we dive deeper, Serge from the audience asks: “What are the main observability challenges you encounter?”
Nir: First, not all customers have the metrics we want to collect. Second, they have different stacks and expose metrics differently. Collecting and managing all of this data at scale across many customers is incredibly challenging. Doing it for one customer is easy; doing it for hundreds requires serious infrastructure to ingest and gain meaningful insights from the data.
Mickael: For me, it is the unknowns. We try to prepare, but there are always things we didn’t anticipate. When an application is running on a differently managed cluster, and the customer isn’t immediately available to answer questions or upgrade, you are left in the dark.
Udi: What is the most creative workaround or solution you’ve found for this problem?
Mickael: We once had an issue with our agent that resulted in a memory leak. It was very hard to reproduce, and Kubernetes doesn’t keep historical logs or events for very long once a pod crashes. So, we added a second container—a sidecar supervisor—that watched the main container. Whenever the main container crashed, the supervisor immediately collected logs, metrics, heap memory data, and the cluster state, and automatically uploaded a crash report to an S3 bucket. That raw data gave us exactly what we needed to troubleshoot without having to guess.
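The sidecar-supervisor idea can be sketched as a small wrapper: run the main process, and on a crash persist a report before the evidence disappears. This toy version writes to local disk instead of S3, and all names are hypothetical:

```python
import datetime
import json
import pathlib
import subprocess
import sys
import tempfile

def run_with_crash_report(cmd, report_dir):
    """Supervisor sketch: run the main process; on a non-zero exit, write a
    crash report (exit code, captured output, timestamp) instead of losing it."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        report = {
            "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "exit_code": proc.returncode,
            "stdout_tail": proc.stdout[-2000:],   # keep only the tail, like log rotation
            "stderr_tail": proc.stderr[-2000:],
        }
        path = pathlib.Path(report_dir) / "crash-report.json"
        path.write_text(json.dumps(report, indent=2))
        return path
    return None

d = tempfile.mkdtemp()
p = run_with_crash_report([sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"], d)
data = json.loads(p.read_text())
assert data["exit_code"] == 3 and "boom" in data["stdout_tail"]
```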
Nir: In my previous job at Palo Alto Networks, we dealt with highly isolated environments with strict security compliance. We learned to create a generic, lean approach to the agent because upgrading it was incredibly difficult. Upgrades should ideally be rare, primarily for security fixes. You need to design the agent flexibly enough so that new platform features can be supported without requiring the customer to update their local agent version immediately.
Udi: That is a great segue into what you would do differently to avoid these issues.
Mickael: I can give three tips on how to design better from the start. First, design for the unknown. Collect environment-specific fingerprint data (versions, components) so you have context when troubleshooting. Second, use a modular architecture so you can easily reuse components and change configurations. Third, treat observability as a first-class citizen. Build it into the design from day one so that adding context—like request IDs or user context to logs—happens automatically.
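Mickael's first tip, collecting an environment fingerprint, can be sketched in a few lines. The field names are made up for illustration:

```python
def collect_fingerprint(cluster_info):
    """Design-for-the-unknown sketch: capture an environment 'fingerprint'
    (versions and key components) so every report carries troubleshooting context.
    Unknown fields are recorded explicitly rather than omitted."""
    keys = ["kubernetes_version", "cni", "storage_driver", "cloud", "node_arch"]
    return {k: cluster_info.get(k, "unknown") for k in keys}

fp = collect_fingerprint({"kubernetes_version": "1.29", "cni": "cilium", "cloud": "aws"})
assert fp["cni"] == "cilium" and fp["storage_driver"] == "unknown"
```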
Nir: And test aggressively. Deploy your new agent versions to replicas of the environments you know your customers have to catch issues early. Look at the CrowdStrike outage; thorough deployment testing is critical.
Udi: Moving on to what is on everybody’s mind: AI. Let’s talk about our evolution from API-driven SaaS to Agentic SaaS.
Nir: In the traditional API approach, you had to know all the environmental differences and code for them ahead of time. In the Agentic approach, the AI agent is smart. It already understands the basics of Kubernetes, OpenShift, and AWS. We just need to give it the missing context—the specific constraints, tools, and tech stack of the customer’s environment.
For example, if the agent knows the customer uses Calico and there is a network issue between two services, it knows exactly what to check next: the network policies. Instead of coding every possible “if/else” scenario, we provide the agent with a high-level mission (e.g., “You are an AI SRE”) and a set of tools. It dynamically queries the environment to collect the context it needs to find the solution.
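The contrast Nir draws, a mission plus tools instead of hard-coded if/else branches, can be sketched as a minimal tool-driven loop. The tool names and logic are hypothetical stand-ins for LLM-driven decisions:

```python
def diagnose(environment, tools):
    """Agentic-loop sketch: pick the next check based on gathered context
    (here, 'knowing' that Calico enforces NetworkPolicies) rather than a
    predefined decision tree."""
    context = {"stack": environment["stack"]}
    if "calico" in context["stack"]:
        # The agent knows Calico enforces NetworkPolicies, so it checks them next.
        context["network_policies"] = tools["list_network_policies"]()
        denied = [p for p in context["network_policies"] if p["action"] == "Deny"]
        if denied:
            return {"root_cause": "network policy blocks traffic", "evidence": denied}
    return {"root_cause": "unknown", "evidence": []}

tools = {"list_network_policies": lambda: [{"name": "default-deny", "action": "Deny"}]}
result = diagnose({"stack": ["calico"]}, tools)
assert result["root_cause"] == "network policy blocks traffic"
```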
Mickael: The shift from predefined APIs to open conversation with agents is massive. But my question to you, Nir, is how do we prevent scope creep? It is easy for an AI to get too much context and become unreliable.
Nir: Big context creates two problems: LLM token limits and agent confusion. To manage this, we use sub-agents. The main agent asks a question to a sub-agent. The sub-agent goes and reads a massive YAML file, keeping its own context isolated. Its only mission is to analyze that file and return a specific, focused answer to the main agent. This keeps the main agent’s context clean and relevant. We also do this dynamically; the agent gathers context as the investigation progresses rather than loading everything upfront.
Mickael: Before this, troubleshooting meant manually stepping through a static runbook. Runbooks are limited because you have to know all the steps in advance. An AI agent can poke through layers of unknown unknowns and adapt its investigation on the fly. It might take the agent a moment to run extra queries, but it is infinitely faster and more adaptable than manual human troubleshooting.
Nir: It acts like an extension of your team. Developers often lack cluster permissions or deep Kubernetes knowledge. They can just ask the agent, which handles their ticket, freeing up the actual SREs to do higher-level work. We also allow SREs to provide feedback to the agent, helping it learn and improve its own performance over time.
Udi: We have established that an AI agent is only as good as its context. How do you provide the right context when you have thousands of different environments?
Nir: We use dynamic and static context tools. The first is a “blueprint,” similar to a .cursorrules or claude.md file. It’s a short, static, one-pager explaining the customer’s specific environment stack, custom CRDs, and how things connect. This saves the LLM from having to figure it out blindly.
Mickael: We can also put existing manual runbooks into those blueprints. We tell the AI, “If you see this specific problem, we already know the exact steps to fix it, so just execute this.”
Nir: The second feature is a Knowledge Base. We connect the agent to a company’s Notion, Jira, or Confluence. The agent can read past retro notes and documentation. For a human, searching through a massive wiki is hard. For an agent using a vector database, it instantly correlates past incidents with the current problem and suggests a verified solution.
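The knowledge-base lookup Nir describes rests on vector similarity. A toy version with hand-made embeddings (a real system would use an embedding model and a vector database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy knowledge base: past incident retros as (text, embedding) pairs.
# These embeddings are invented for illustration only.
kb = [
    ("DNS outage retro: CoreDNS OOM, raised memory limit", [0.9, 0.1, 0.0]),
    ("Stuck PVC retro: wrong access mode on EBS driver", [0.1, 0.9, 0.1]),
]

def best_match(query_vec):
    """Return the past incident closest to the current problem's embedding."""
    return max(kb, key=lambda item: cosine(query_vec, item[1]))[0]

assert "CoreDNS" in best_match([0.8, 0.2, 0.0])
```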
Udi: We have a few questions from the audience. Patrick asks: “Does your agent implement proactive surveillance capabilities?”
Nir: Yes, we call it self-healing. It starts with a trigger—a violation created when a resource status degrades. This automatically triggers an RCA (Root Cause Analysis) agent. Once it finds the issue, it passes the job to a remediation agent. The remediation agent finds a short-term fix (like deleting a stuck pod so it reschedules) or a long-term fix (like creating a PR to update the CI/CD pipeline).
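The trigger-to-remediation flow Nir outlines can be sketched as a simple pipeline. All agents here are stubs; a real system would call LLM-backed agents:

```python
def self_heal(violation, rca_agent, remediation_agent):
    """Pipeline sketch of the flow described: violation -> RCA -> remediation."""
    root_cause = rca_agent(violation)
    fix = remediation_agent(root_cause)
    return {"violation": violation, "root_cause": root_cause, "fix": fix}

def rca(violation):
    # Stub RCA agent: map a degraded-resource violation to a root cause.
    return "pod stuck in Terminating" if violation == "degraded" else "unknown"

def remediate(root_cause):
    # Stub remediation agent: propose a short-term and a long-term fix.
    if root_cause == "unknown":
        return {}
    return {"short_term": "delete stuck pod so it reschedules",
            "long_term": "open a PR to update the CI/CD pipeline"}

result = self_heal("degraded", rca, remediate)
assert result["fix"]["short_term"].startswith("delete stuck pod")
```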
Mickael: To add to Serge’s question about GenAI, standard Machine Learning is great at predicting patterns, but GenAI goes further by telling you why it happened, how it happened, and exactly what to do about it.
Udi: Alan asks: “How do you make sure it doesn’t trespass and screw up everything?”
Nir: Customers have full control. They create strict policies dictating exactly what the agent is allowed to self-heal, and you can scope this down to a specific cluster or deployment. Furthermore, the agent prefers safe, short-term solutions. If a fix doesn’t work, it rolls back the changes automatically, using tools like Argo CD to ensure dependencies aren’t broken.
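A guardrail policy of this kind might look like the following. This is a purely illustrative schema, not Komodor's actual configuration format:

```yaml
# Hypothetical self-healing policy: which actions the agent may take, and where.
selfHealingPolicy:
  clusters:
    - name: prod-eu-1
      deployments: ["checkout", "payments"]
      allowedActions:
        - restart-pod          # safe, short-term fix
        - rollback-deployment  # via GitOps, e.g. Argo CD
      forbiddenActions:
        - scale-nodes
  approval:
    humanInLoop: true          # a human validates before changes apply
```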
Mickael: We also always have a human in the loop validating changes and thoroughly testing the agent’s behavior during development to act as a final guardrail.
Udi: To conclude, what does the future hold for Agentic AI SRE?
Nir: The future is incredibly exciting. We are building agents with long-term memory that learn from past chat logs, RCAs, and manual human fixes. This means the system will continuously improve its own context automatically. We will spend more time designing robust systems and less time putting out fires.
Mickael: I agree. Currently, we use agents to react to triggers. The future goal is prediction and prevention. The agent will constantly monitor patterns, fix issues silently while we are asleep, and just send a morning report. Moving from being a firefighter to an architect of AI agents is the ultimate goal.
Nir: And as we build better guardrails, the industry’s trust in these autonomous agents will grow, allowing us to delegate more and more tasks.
Udi: Thank you so much, Mickael and Nir. This has been a super interesting discussion. We will share the recording and highlights with everyone who registered. If you want to see Klaudia in action, check out komodor.com for a two-week free trial. Thanks everyone, see you next time.