Cluster API (CAPI) is transforming how organizations deploy and manage fleets of Kubernetes clusters by introducing declarative, Kubernetes-style APIs that automate cluster provisioning and lifecycle management. While CAPI excels at creating consistent, repeatable cluster deployments across different infrastructure providers, operating it at massive scale introduces unique day-to-day challenges. Many organizations running highly customized CAPI deployments face operational hurdles, particularly around maintaining end-to-end visibility.

Komodor partnered with a leading AI Cloud Provider to tackle these hurdles. By integrating Komodor and our AI SRE, Klaudia, we bridged the visibility gaps in their highly customized CAPI infrastructure. Here is a look at the challenges they faced and how Komodor's targeted support for Cluster API, together with an extensible framework for hundreds of bespoke Custom Resource Definitions (CRDs), solved them.

The Challenge: Dual-Cluster Architectures and the "Hidden" Status Gap

The AI Cloud Provider operates a complex, multi-tier cluster hierarchy spanning several regions to support its own infrastructure and managed customer clusters. A core component of their setup is a dual-cluster architecture: every physical node has corresponding resources split across two separate environments.

Management Cluster: Houses standard CAPI resources (such as Machines, MachineSets, and MachineDeployments) alongside custom hardware management CRDs.

Workload Cluster: Contains the actual Kubernetes Node objects, as well as critical custom lifecycle CRDs, specifically NodeMaintenance and NodeAcceptance.

While this architecture is highly scalable, standard CAPI tooling only has visibility into the management cluster. Because actual node health relies heavily on the custom CRDs that live in the workload cluster, engineers frequently encountered a "hidden" status gap.
A Machine might display a "Running" or "Healthy" status in the management cluster, while the actual Node is "NotReady" or cordoned (SchedulingDisabled) due to an ongoing maintenance workflow orchestrated by the NodeMaintenance CRD. Because standard tools couldn't see this complete picture, troubleshooting a single node lifecycle issue required engineers to manually correlate resources across both clusters, read operator logs, and inspect custom resources, a tedious process that took 20 to 40 minutes per incident.

The Solution: A Purpose-Built AI SRE for Cluster API

To resolve this operational bottleneck, Komodor developed a purpose-built interface for managing and troubleshooting CAPI infrastructure that intrinsically understands dual-cluster architectures and custom CRDs. Here is how Komodor and Klaudia eliminate the manual toil of cross-cluster investigations:

1. Multi-Tier Cross-Cluster Visualization

Standard metrics fail to capture the conceptual complexity of dual-cluster environments. Komodor solves this with a multi-tier infrastructure overview: a consolidated "map" of the AI Cloud Provider's entire estate, organized by hierarchy and management relationships. Komodor also features interactive relationship graphs that visualize the complete topology, from the top-level Cluster down to individual Nodes, across both management and workload clusters. This cross-cluster relationship visualization lets engineers instantly identify broken links between CAPI Machines and workload Nodes, making the complex architecture easily navigable.

2. Deterministic Mapping of Custom CRDs

Standard CAPI logic is insufficient for highly customized architectures. To accurately determine node health, Klaudia was enhanced to understand the AI Cloud Provider's specific operational logic.
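To make the "hidden" status gap concrete, here is a minimal sketch of how an effective node status might be derived by combining management-cluster Machine state with workload-cluster Node conditions and NodeMaintenance state. The field and phase names here are illustrative only; the provider's actual CRD schemas are not shown in this post.

```python
def effective_node_status(machine, node, maintenance):
    """Combine management-cluster and workload-cluster state into one status.

    `machine`, `node`, and `maintenance` are simplified dicts standing in for
    the CAPI Machine, the workload Node, and the custom NodeMaintenance CR.
    Field names are hypothetical, not the provider's real schema.
    """
    # An active maintenance workflow overrides everything else: the node may
    # be intentionally cordoned even though the Machine reports "Running".
    if maintenance and maintenance.get("phase") in ("InProgress", "Draining"):
        return "UnderMaintenance ({})".format(maintenance["phase"])

    # Next, trust the workload cluster's view of kubelet health.
    ready = next(
        (c["status"] for c in node.get("conditions", []) if c["type"] == "Ready"),
        "Unknown",
    )
    if ready != "True":
        return "NotReady"

    # A cordoned node is healthy but unschedulable (SchedulingDisabled).
    if node.get("unschedulable"):
        return "SchedulingDisabled"

    # Only if the workload cluster looks healthy does the Machine phase stand.
    return machine.get("phase", "Unknown")


machine = {"phase": "Running"}
node = {"conditions": [{"type": "Ready", "status": "False"}]}
print(effective_node_status(machine, node, None))  # → NotReady, though the Machine alone says "Running"
```

The point of the sketch is the ordering: workload-cluster signals (maintenance workflows, Node conditions, cordons) must be consulted before the management cluster's Machine phase, which is exactly the correlation engineers previously performed by hand.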
By establishing a strict "contract" (utilizing annotations or direct references), Klaudia deterministically links standard Kubernetes Nodes to their governing NodeMaintenance and NodeAcceptance CRDs.

3. Automated Root Cause Analysis (RCA) in
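The annotation-based "contract" described in step 2 could be resolved with logic along these lines. This is a hedged sketch: the annotation keys and lookup shape are invented for illustration and are not the provider's published convention.

```python
# Hypothetical annotation keys; the real "contract" keys are not published.
MAINTENANCE_ANNOTATION = "example.com/node-maintenance"
ACCEPTANCE_ANNOTATION = "example.com/node-acceptance"


def resolve_governing_crds(node, maintenance_crs, acceptance_crs):
    """Link a Node to its governing NodeMaintenance / NodeAcceptance CRs.

    The Node's annotations name the governing custom resources, and lookups
    are exact-match against CRs indexed by name, so the mapping is
    deterministic rather than heuristic.
    """
    annotations = node.get("metadata", {}).get("annotations", {})
    return {
        "maintenance": maintenance_crs.get(annotations.get(MAINTENANCE_ANNOTATION)),
        "acceptance": acceptance_crs.get(annotations.get(ACCEPTANCE_ANNOTATION)),
    }


node = {"metadata": {"annotations": {MAINTENANCE_ANNOTATION: "mnt-42"}}}
linked = resolve_governing_crds(node, {"mnt-42": {"phase": "Draining"}}, {})
print(linked["maintenance"])  # → {'phase': 'Draining'}
```

Because the link is an explicit reference rather than a name-matching heuristic, a missing or dangling annotation resolves to `None` instead of silently binding to the wrong resource, which is what makes the mapping safe to automate on top of.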