Cluster API (CAPI) is transforming how organizations deploy and manage fleets of Kubernetes clusters by introducing declarative, Kubernetes-style APIs to automate cluster provisioning and lifecycle management. While CAPI excels at creating consistent and repeatable cluster deployments across different infrastructure providers, operating it at a massive scale introduces unique day-to-day challenges.
Many organizations running highly customized CAPI deployments face operational hurdles, specifically around maintaining end-to-end visibility.
Komodor partnered with a leading AI Cloud Provider to tackle these operational hurdles. By integrating Komodor and our AI SRE, Klaudia, we successfully bridged the visibility gaps in their highly customized CAPI infrastructure.
Here is a look at the challenges they faced and how Komodor’s targeted support for Cluster API, together with an extensible framework for hundreds of bespoke Custom Resource Definitions (CRDs), solved them.
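The AI Cloud Provider operates a complex, multi-tier cluster hierarchy spanning several regions to support its infrastructure and managed customer clusters. A core component of their setup is a dual-cluster architecture, meaning every physical node has corresponding resources split across two separate environments: the management cluster, which holds the CAPI resources such as Machines, and the workload cluster, which holds the Kubernetes Nodes and the custom CRDs that govern them.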
While this architecture is highly scalable, standard CAPI tooling only has visibility into the management cluster. Because actual node health relies heavily on the custom CRDs located in the workload cluster, engineers frequently encountered a “hidden” status gap. A Machine might display a “Running” or “Healthy” status in the management cluster, but the actual Node could be technically “NotReady” or stuck with a SchedulingDisabled taint due to an ongoing maintenance workflow orchestrated by the NodeMaintenance CRD.
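The status gap described above can be illustrated with a minimal sketch. It assumes the Machine and Node objects have already been fetched from their respective clusters as plain dictionaries shaped like the Kubernetes API objects; the function names are hypothetical and this is not Komodor's implementation.

```python
def node_is_ready(node):
    """Return True if the Node reports a Ready condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False  # no Ready condition reported at all


def find_status_gaps(machines, nodes):
    """Compare management-cluster Machines with workload-cluster Nodes.

    Returns (machine_name, node_name, reason) tuples for every Machine
    whose phase is "Running" while its Node is missing, NotReady, or
    cordoned (shown by kubectl as SchedulingDisabled).
    """
    nodes_by_name = {n["metadata"]["name"]: n for n in nodes}
    gaps = []
    for machine in machines:
        status = machine.get("status", {})
        if status.get("phase") != "Running":
            continue  # only Machines that look healthy can hide a gap
        machine_name = machine["metadata"]["name"]
        node_name = (status.get("nodeRef") or {}).get("name")
        node = nodes_by_name.get(node_name)
        if node is None:
            gaps.append((machine_name, node_name, "node missing"))
        elif not node_is_ready(node):
            gaps.append((machine_name, node_name, "NotReady"))
        elif node.get("spec", {}).get("unschedulable"):
            gaps.append((machine_name, node_name, "SchedulingDisabled"))
    return gaps
```

Each tuple names a Machine that looks fine from the management cluster but whose Node tells a different story in the workload cluster.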
Because standard tools couldn’t see this complete picture, troubleshooting a single node lifecycle issue required engineers to manually correlate resources across both clusters, read operator logs, and check custom resources, a tedious process that took 20 to 40 minutes per incident.
To resolve this massive operational bottleneck, Komodor developed a purpose-built interface for managing and troubleshooting CAPI infrastructure that intrinsically understands dual-cluster architectures and custom CRDs.
Here is how Komodor and Klaudia eliminate the manual toil of cross-cluster investigations:
Standard metrics fail to capture the conceptual complexity of dual-cluster environments. Komodor solves this by providing a multi-tier infrastructure overview, a consolidated “map” of the AI Cloud Provider’s entire estate, organized by hierarchy and management relationships.
Furthermore, Komodor features interactive relationship graphs that visualize the complete topology from the top-level Cluster down to individual Nodes across both management and workload clusters. This cross-cluster relationship visualization allows engineers to instantly identify broken links between CAPI Machines and workload Nodes, making the complex architecture easily navigable.
Standard CAPI logic is insufficient for highly customized architectures. To accurately determine node health, Klaudia was enhanced to understand the AI Cloud Provider’s specific operational logic. By establishing a strict “contract” (utilizing annotations or direct references), Klaudia deterministically links standard Kubernetes Nodes to their governing NodeMaintenance and NodeAcceptance CRDs.
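As a rough illustration of such a contract, the sketch below resolves a Node's governing custom resources first by annotation and then by a direct reference. The annotation keys and the `spec.nodeName` field are assumptions made for this example; the actual contract used by the AI Cloud Provider is not described here.

```python
# Hypothetical annotation keys; the real contract is not public.
MAINTENANCE_ANNOTATION = "example.com/node-maintenance"
ACCEPTANCE_ANNOTATION = "example.com/node-acceptance"


def resolve_governing_resource(node, resources, annotation_key):
    """Find the custom resource that governs a Node, without guessing.

    1. If the Node carries the annotation, it must name the resource.
       A dangling annotation yields None instead of a fallback match.
    2. Otherwise use a direct reference: a resource whose spec.nodeName
       (hypothetical field) equals the Node's name.
    """
    annotations = node["metadata"].get("annotations") or {}
    wanted = annotations.get(annotation_key)
    if wanted is not None:
        by_name = {r["metadata"]["name"]: r for r in resources}
        return by_name.get(wanted)
    node_name = node["metadata"]["name"]
    for resource in resources:
        if resource.get("spec", {}).get("nodeName") == node_name:
            return resource
    return None


def link_node(node, maintenances, acceptances):
    """Return the NodeMaintenance and NodeAcceptance resources for a Node."""
    return {
        "maintenance": resolve_governing_resource(
            node, maintenances, MAINTENANCE_ANNOTATION),
        "acceptance": resolve_governing_resource(
            node, acceptances, ACCEPTANCE_ANNOTATION),
    }
```

Because the lookup follows explicit links only, the same inputs always resolve to the same resources, which is what makes the linkage deterministic rather than heuristic.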
Instead of relying on superficial management metrics, Klaudia proactively hunts for actual health indicators in the workload cluster. For instance, Klaudia utilizes targeted discovery logic to detect a SchedulingDisabled taint on a Node. It uses this specific signal as a trigger to traverse the established contract and investigate the related custom maintenance CRDs.
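A trigger of this kind might look like the following sketch, which flags Nodes that are cordoned or carry the standard `node.kubernetes.io/unschedulable` taint. The function names are hypothetical, and the Nodes are assumed to be plain dictionaries shaped like the Kubernetes Node object.

```python
# Taint key Kubernetes applies to cordoned (SchedulingDisabled) Nodes.
UNSCHEDULABLE_TAINT = "node.kubernetes.io/unschedulable"


def has_scheduling_disabled(node):
    """Return True if the Node is cordoned or carries the unschedulable taint."""
    spec = node.get("spec", {})
    taints = spec.get("taints") or []
    return bool(spec.get("unschedulable")) or any(
        taint.get("key") == UNSCHEDULABLE_TAINT for taint in taints
    )


def nodes_to_investigate(nodes):
    """Return names of Nodes whose maintenance resources should be inspected."""
    return [n["metadata"]["name"] for n in nodes if has_scheduling_disabled(n)]
```

Only the flagged Nodes then need their related maintenance resources fetched, which keeps the investigation targeted instead of scanning every custom resource in the cluster.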
When an issue occurs, Klaudia automatically gathers context from both environments simultaneously, pulling CAPI resources, maintenance operations, pod eviction statuses, and agent connectivity. It then analyzes this cross-cluster data to apply domain knowledge of failure patterns.
Instead of forcing engineers to decipher raw metrics, Klaudia instantly surfaces the actual problem (e.g., “12 nodes stuck in Joining phase”) and identifies the responsible component. This automation reduces the 20-to-40-minute manual troubleshooting process down to an actionable diagnosis in less than 30 seconds.
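To make the idea of surfacing a problem rather than raw data concrete, here is a small sketch that rolls per-node records up into one-line findings. The record shape (`phase`, `seconds_in_phase`), the ten-minute threshold, and the function name are illustrative assumptions, not Klaudia's actual logic.

```python
from collections import Counter


def summarize_stuck_nodes(records, transitional_phases=("Joining",),
                          threshold_seconds=600):
    """Group nodes that have sat in a transitional phase for too long.

    records: iterable of dicts with hypothetical keys "name", "phase"
    and "seconds_in_phase". Returns one human-readable line per phase.
    """
    counts = Counter(
        r["phase"]
        for r in records
        if r["phase"] in transitional_phases
        and r["seconds_in_phase"] >= threshold_seconds
    )
    findings = []
    for phase, count in sorted(counts.items()):
        noun = "node" if count == 1 else "nodes"
        findings.append(f"{count} {noun} stuck in {phase} phase")
    return findings
```

Fed twelve records that have been in the Joining phase past the threshold, the function returns a single finding instead of twelve separate status rows.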
By combining deep custom CRD awareness with interactive dual-cluster visualization, Komodor changes how platform teams operate complex Cluster API environments. For this AI Cloud Provider, Komodor’s automated intelligence layer turns 20 to 40 minutes of cross-cluster correlation per incident into an actionable diagnosis in under 30 seconds, allowing them to scale their infrastructure with confidence.
Komodor’s integrated system delivers critical operational confidence. It enables platform teams to:
In essence, Komodor doesn’t just offer observability; it offers automated operational control over complex, custom Cluster API environments, transforming reactive troubleshooting into proactive platform management.
Building on its specialized support for Cluster API and complex CRDs, Komodor is planning to rapidly expand its automated operational control framework in the near future. The primary focus is on extending the capabilities of the AI SRE, Klaudia, to provide deep, end-to-end visibility and autonomous troubleshooting across critical infrastructure-as-code tools.
This expansion includes upcoming support for Crossplane, which will enable platform teams to gain a unified view and rapid Root Cause Analysis (RCA) for their managed cloud resources, and for Terraform, ensuring that automated visibility and diagnostics span the entire infrastructure lifecycle, from initial provisioning to complex day-2 operations.
Want to learn more about Komodor’s Autonomous AI SRE Platform and see Klaudia in action? Book a demo today with one of our AI SRE experts!