
Multi-Agent AI SRE Has Landed and It’s Built for Your Most Complex Stacks

Once upon a time, a monolith running on a handful of servers meant that incident management, even at 2:17 AM, was something a single generalist could handle. One person with enough context across the stack could reasonably diagnose whether the database was choking, a config had changed, or a server was running hot. They’d fix it and go back to sleep.

The sheer scale, complexity, and interconnectedness of modern cloud-native infrastructure have broken that model entirely. Today you’re running hundreds of microservices across multi-cloud environments, with Kubernetes orchestrating workloads that depend on service meshes, GPUs, storage layers, IAM policies, ingress controllers, message queues, and cloud provider APIs, all of which can affect each other in ways that are invisible from any single vantage point. When something breaks, whether at 2:17 AM or 2:17 PM, the symptom surfaces in one place and the cause lives somewhere else entirely.

You don’t need a generalist anymore. You need a Kubernetes specialist, a networking engineer, a database admin, and a GPU expert, all working simultaneously, sharing context across domains, with someone coordinating the picture. That’s what actually resolves modern incidents.

And that collaborative, multi-specialist war room model is exactly what AI-driven site reliability engineering has been failing to replicate, until now.

At KubeCon Europe 2026, Komodor is unveiling a new extensible multi-agent architecture for Klaudia AI. To understand why it matters, it helps to start with why building AI for infrastructure is so fundamentally hard.

The Real Problem Is Context, Not the LLM

Most AI operations tools fail at infrastructure for the same reason: they’re trying to reason across a web of interconnected systems with either too much data or too little of the right kind. Dump the entire cluster state into a prompt and the model drowns in noise. Give it a narrow slice and it confidently fabricates conclusions from insufficient context. Neither produces trustworthy results in production.

The symptoms that surface during an incident almost never point directly to the root cause. A pod crashes because of a network policy. A request times out because of storage throttling. A training job fails because, three layers down, a GPU driver doesn’t match the CUDA version. At this scale and complexity, the symptom appears in one place while the cause lives somewhere else entirely. Most AI SRE tools today will show you one layer, while the actual problem spans ten.

This isn’t an LLM problem, and it never was; it’s a context engineering problem. The key insight behind Klaudia’s architecture is that the challenge isn’t collecting more data, it’s knowing exactly what matters when it matters: retrieving the right system context at the correct moment, and reasoning across it with domain-specific precision.

Building an AI That Thinks Like a Senior SRE

We’ve written before about what a real SRE war room looks like – the Slack channel that spins up within seconds, the Incident Commander getting tagged, the specialists pulled in from across the organization, each examining their slice of the stack while one person tries to synthesize the picture. That human model works precisely because it combines specialization with coordination. What it doesn’t do is scale, run at machine speed, or function reliably at 2:17 AM on a random Wednesday when the most critical engineer, with all the accumulated tribal knowledge, is on PTO.

Klaudia’s architecture is built to replicate what makes that model effective, not to replace the insight behind it. A senior SRE doesn’t look at everything when an incident fires. They work iteratively, form a hypothesis, gather targeted evidence, evaluate findings, refine or conclude. They know which signals matter for each domain, which failure modes to rule out first, and when they have enough context to act confidently.

Every agent Klaudia deploys is built around this same methodology and domain expertise. The team starts by defining what each agent’s goal is, which establishes its scope and focus. From there, they specify how it should investigate and what patterns to look for, encoding the domain reasoning that would otherwise live only in an expert’s head. They define what tools and data sources the agent can call on, and how it should format its findings so the broader investigation can actually use them. 

That last part matters more than it might seem: an agent that reaches the right conclusion but can’t communicate it clearly to the orchestrating workflow has done only half the job. This consistency of structure across every agent in the platform is what makes the architecture extensible and reliable rather than just complex.
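
Komodor hasn’t published its internal agent format, but the four-part recipe described above (goal, investigation patterns, tools, output format) can be sketched as a simple structure. All names and values below are illustrative assumptions, not Komodor’s actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of an SME agent definition; field names are
# assumptions based on the four elements described in the article.

@dataclass
class SMEAgentSpec:
    name: str                          # the agent's domain, e.g. "gpu-nvidia"
    goal: str                          # establishes scope and focus
    investigation_patterns: list[str]  # failure modes to check, in priority order
    tools: list[str]                   # data sources the agent may call on
    output_schema: dict                # structured finding format the orchestrator consumes

gpu_agent = SMEAgentSpec(
    name="gpu-nvidia",
    goal="Diagnose GPU-level failures (driver, CUDA, DCGM health)",
    investigation_patterns=[
        "driver/CUDA version mismatch",
        "XID errors in kernel logs",
        "DCGM health check failures",
    ],
    tools=["dcgm_metrics", "kernel_logs", "node_describe"],
    output_schema={"finding": "str", "evidence": "list", "confidence": "float"},
)
```

The fixed `output_schema` is the key design point: every agent reports findings in the same shape, which is what lets the orchestrating workflow consume results from fifty different domains uniformly.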

How the Architecture Actually Works

Klaudia’s platform is organized into three layers that work together in every investigation.

At the top sits the Domain Agnostic Core, which serves as the shared infrastructure that powers every workflow regardless of what’s being investigated. This includes the following components:

  • Planner
  • Enricher
  • Action Executor
  • Validation Engine
  • Guardrail Engine
  • Knowledge Graph
  • Eval Engine
  • Continuous Learning

These aren’t domain-specific; they’re the reasoning machinery that makes all agents work reliably at scale.

Below that are the Agentic Workflows: 

Detect →  Investigate → Remediate → Optimize → Prevent


These workflow agents are the orchestrators. Each one owns a specific phase of the reliability engineering process and coordinates the investigation flow. Their job is to make the right judgment calls at the right time: deciding what to examine, which specialists to consult, and how to synthesize findings before handing off to the next workflow stage or initiating remediation.

At the bottom layer sits the critical piece – Domain Specific Expertise.

This is the layer with the Subject Matter Expert Agents, or SMEs, that bring deep knowledge about specific technologies. 


Today Komodor ships more than 50 of these agents across the cloud-native infrastructure stack, covering domains like GPU/NVIDIA, AWS, ArgoCD, Istio, Cilium, Airflow, Redis, Kafka, Postgres, and dozens more. Each one is an autonomous module that is an expert in exactly one domain. Cloud-layer coverage is actively in progress, and APM support is planned next.

When the Investigator receives an incident, it forms an initial hypothesis and selects which SMEs to consult based on what it finds. Those SMEs run in parallel, each examining their own domain, and feed their findings back to the Investigator. If there’s enough to conclude, it surfaces a root cause. If the investigation shifts into action, it hands off to the Remediator. That handoff capability between workflow agents is a meaningful distinction from systems where a single agent tries to own the entire flow from detection through resolution.
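The Investigator loop described above — form a hypothesis, fan out to the relevant SMEs in parallel, collect structured findings, then conclude or hand off — can be sketched in a few lines. This is a minimal illustration under assumed names, not Komodor’s implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the Investigator's fan-out step. Each SME examines
# only its own domain slice and returns a structured finding.

def consult(sme, incident):
    # A real SME would query its tools here; this stub just names the domain.
    return {"agent": sme, "finding": f"{sme}: no anomaly detected"}

def investigate(incident, smes):
    # SMEs run in parallel; findings come back in submission order.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(consult, sme, incident) for sme in smes]
        findings = [f.result() for f in futures]
    # If the evidence suffices, surface a root cause; otherwise refine the
    # hypothesis and select a new set of SMEs (iteration elided here).
    return findings

results = investigate({"alert": "CrashLoopBackOff"}, ["network", "storage", "gpu"])
```

The handoff to the Remediator would happen after this loop concludes, which is why isolating each SME to its own context slice matters: a bad finding stays contained in one entry of `results` rather than polluting the shared investigation state.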

One key architectural property that makes this work in practice is isolation. A poorly performing or uncertain agent doesn’t contaminate the rest of the investigation. Each SME gets only the context it needs, reducing hallucination risk while keeping the overall system accurate.

The Knowledge Graph – the Thread that Weaves Across the Stack

One of the hardest problems in infrastructure intelligence is understanding how systems relate to each other across boundaries. Real incidents cascade; they don’t stay in the domain where the symptom appears. Without the ability to follow that chain, you’re debugging symptoms rather than causes.

Komodor’s relationship engine maintains a dynamic knowledge graph that maps entities and their connections across the entire cloud-native stack. Agents can follow chains in both directions: forward (A uses B) and reverse (B is used by A). When an investigation starts, agents don’t search blindly across the entire system; they follow the graph from alert to service to pod to node to GPU to deploy, retrieving only what’s connected and relevant.
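
Bidirectional traversal of this kind reduces to maintaining every edge in both a forward and a reverse index. A toy sketch, assuming a simple edge list (Komodor’s actual graph is far richer):

```python
from collections import defaultdict

forward = defaultdict(list)   # A uses B
reverse = defaultdict(list)   # B is used by A

def relate(a, b):
    # Record each edge in both directions so chains can be followed either way.
    forward[a].append(b)
    reverse[b].append(a)

# The alert → service → pod → node → GPU chain from the article.
relate("alert:latency", "service:api")
relate("service:api", "pod:api-7f9c")
relate("pod:api-7f9c", "node:gpu-node-1")
relate("node:gpu-node-1", "gpu:0")

def follow(start, direction=forward):
    # Walk the first edge at each hop until the chain ends.
    chain, current = [start], start
    while direction[current]:
        current = direction[current][0]
        chain.append(current)
    return chain

print(follow("alert:latency"))            # forward: alert down to the GPU
print(follow("gpu:0", direction=reverse)) # reverse: GPU back up to the alert
```

The payoff is that a traversal touches only the entities on the chain, which is exactly the “retrieve only what’s connected and relevant” property the article describes.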

This has two important effects. First, it preserves context window space by fetching precisely what’s needed rather than everything. Second, it compresses investigation time dramatically: agents follow the relationship graph directly to relevant data instead of scanning the environment for signals.

New agents that join the platform immediately plug into this relationship graph. 

When the ArgoCD agent was added, it naturally extended the existing Deployment → ReplicaSet → Pod mapping into Application CRDs. 

When the Airflow agent joined, the graph extended to DAG → TaskInstance → Worker Pod → Node. 

The graph grows with each new domain without requiring existing agents to be retrained or rebuilt.

Multi-Track Context Enrichment

What makes the architecture particularly capable for complex incidents is multi-track investigation. The Main RCA Track and the SME Agents Track run in parallel, enriching the investigation iteratively across multiple passes.

In each iteration, the main track accumulates evidence while SME agents surface domain-specific findings, for example:

  • The Secrets SME checks for certificate issues
  • The Storage and PVC SME examines mount state
  • The GPU agent pulls DCGM metrics and kernel logs

A Knowledge Base Query Agent runs alongside both tracks, pulling relevant content from indexed customer documentation, runbooks, and postmortems stored in a vector database. Historical learnings from past investigations are also surfaced through VectorDB queries, so patterns Klaudia has seen before inform each new investigation.
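
The knowledge-base lookup amounts to semantic retrieval: score each indexed document against the incident and return the best matches. The toy sketch below substitutes bag-of-words overlap for real embeddings purely for illustration; a production system would use a vector database, and all runbook text here is invented:

```python
# Toy semantic retrieval over indexed runbooks. Real systems embed documents
# and queries into vectors; word overlap stands in for that here.

def score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

runbooks = [
    "GPU driver mismatch: reinstall matching CUDA toolkit",
    "Certificate expiry: rotate TLS secrets",
    "PVC stuck in Pending: check StorageClass provisioner",
]

def top_match(query: str) -> str:
    # Return the runbook most relevant to the incident description.
    return max(runbooks, key=lambda d: score(query, d))

print(top_match("pod failing with expired certificate"))
```

The same mechanism serves both tracks the article mentions: customer documentation queried on demand, and historical investigation learnings surfaced when a new incident resembles an old one.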

This means Klaudia gets smarter with every incident it handles. Past root causes, remediations, and environment patterns are captured and automatically indexed per customer. The tribal knowledge that normally lives only in the heads of experienced engineers accumulates in the system over time.

What the Organizational Context Layer Changes

There’s a meaningful gap between understanding cloud-native infrastructure in general and understanding any specific organization’s infrastructure. Kubernetes manifests tell Klaudia what is deployed. They don’t explain why things are configured a certain way, what the blast radius of a given issue looks like, or how a specific team has historically handled similar incidents.

Komodor closes this gap through three context sources that work together. 

  • The Blueprint is always loaded and contains architectural truth: service dependencies, topology, constraints, and compliance rules specific to the customer’s environment. 
  • The Knowledge Base is queried on-demand, surfacing relevant content from customer Confluence pages, runbooks, and postmortems through semantic search. 
  • The Self-Learning Memory accumulates automatically over time, capturing root causes and remediation patterns from every investigation.
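
The three sources differ mainly in when they are loaded: the Blueprint always, the Knowledge Base on demand, the memory as it accumulates. A minimal sketch of assembling them into one investigation context, with all names and sample data invented for illustration:

```python
# Hypothetical context assembly from the three sources described above.

def build_context(incident, blueprint, knowledge_base, memory):
    ctx = {"blueprint": blueprint}                             # always loaded
    ctx["docs"] = knowledge_base.get(incident["kind"], [])     # queried on-demand
    ctx["history"] = memory.get(incident["kind"], [])          # accumulated learnings
    return ctx

blueprint = {"service_deps": {"api": ["db", "cache"]}}                   # architectural truth
kb = {"OOMKilled": ["runbook: raise memory limits cautiously"]}          # indexed docs
memory = {"OOMKilled": ["2024-11: leak in image-resize worker"]}         # past investigations

ctx = build_context({"kind": "OOMKilled"}, blueprint, kb, memory)
```

Keyed lookups stand in for semantic search here; the point is only that each incident gets architectural truth plus whatever documentation and history are relevant, and nothing more.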

Together, these transform Klaudia from a generic AI SRE into an expert that understands your infrastructure specifically, not just infrastructure in general.

The Backbone of Klaudia’s Agent Velocity

The most tangible proof point for platform and infrastructure teams evaluating this architecture is delivery velocity. Building a new Klaudia agent means encoding domain expertise into a structured, testable format that plugs into the existing platform, with no model retraining and no rebuilding of orchestration logic.

The GPU agent went from zero to GA in four weeks, moving from research into NVIDIA failure modes and DCGM metrics, to building GPU-specific tooling and a first prototype in Klaudia Lab, through shadow testing in production, and finally A/B validation and a customer beta before shipping. The ArgoCD agent reached GA in two weeks. The Airflow agent, which required mapping the full DAG → TaskInstance → Worker Pod → KubernetesExecutor relationship chain and integrating with the Airflow REST API, shipped in four weeks and delivered 55% faster pipeline failure diagnosis against baseline.

The reason this is possible is that the platform was already ready before any of these agents existed. Workflow agents already knew how to investigate, remediate, and learn. The relationship engine already understood how entities connect. Each new domain is an extension of a mature platform, not a rebuild from scratch.

Extensibility to Drive Autonomous Architectures

The extensibility isn’t just theoretical. Cisco Outshift built JARVIS, an AI Platform Engineer for automating developer workflows across their cloud-native environment. When a developer hits a CrashLoopBackOff deploying to Kubernetes, JARVIS calls Klaudia as a subagent via A2A protocol. Klaudia investigates, returns root cause and remediation, and the developer gets an answer without leaving their workflow. The result is up to 80% reduction in MTTR for Kubernetes issues, with query response time dropping from hours to seconds.

The formal Bring Your Own Agent capability extends this model to any customer. Organizations define their own agents via MCP or an OpenAPI specification: the trigger conditions for their services and tools, the external systems the agent can query, the expertise it encodes, and the format of its output. A single Python file, validation in Klaudia Lab, and the agent joins investigations alongside Komodor’s native SMEs, running in a sandboxed environment with all actions audited.
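
What might such a single-file agent look like? The sketch below is purely illustrative: the actual Bring Your Own Agent file format is Komodor-specific and not public, and every name, trigger, and URL here is an invented placeholder.

```python
# Hypothetical single-file custom agent. Structure, field names, and the
# endpoint URL are all assumptions for illustration only.

AGENT = {
    "name": "internal-billing-sme",
    "triggers": ["service:billing", "error:payment_timeout"],        # when to consult it
    "tools": {"openapi": "https://billing.example.internal/openapi.json"},  # systems it may query
    "expertise": "Payment-gateway timeout and retry-storm diagnosis",
}

def investigate(incident: dict) -> dict:
    # Findings must come back in the structured format the orchestrator expects,
    # so they can be synthesized alongside native SME results.
    suspected = "timeout" in incident.get("error", "")
    return {
        "agent": AGENT["name"],
        "finding": "retry storm suspected" if suspected else "no match",
        "confidence": 0.6 if suspected else 0.1,
    }
```

Because the output shape matches what native SMEs emit, the orchestrating workflow can treat a customer’s agent and a Komodor-shipped one identically, which is what makes the sandboxed, audited integration practical.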

The architecture is designed to keep growing. Cloud-layer coverage is actively in progress, and APM support covering observability integrations with tools like Datadog, Grafana, New Relic, Splunk, Prometheus, and OpenTelemetry is planned next. Application issues introduce different challenges from infrastructure ones: softer degradation rather than hard crashes, request paths cutting across many services, and user sessions as the primary unit of impact. Klaudia handles these by extending the same reasoning patterns into a new layer, with application-specific SMEs plugging into the same proven framework, requiring no changes to core orchestration and no rewrite.

The multi-agent framework for Klaudia AI is already available. Komodor will be demonstrating these capabilities live at KubeCon Europe in Amsterdam, March 23–26.