
AI SRE in Practice: Diagnosing AWS CNI IP Exhaustion Before Widespread Outage

IP address exhaustion in Kubernetes doesn’t announce itself with clear error messages. Pods fail to schedule, services degrade unpredictably, and the symptoms look like a dozen different problems before anyone realizes the cluster has run out of available IP addresses. By the time the root cause becomes clear, multiple services are affected and recovery requires coordination across infrastructure layers.

This scenario walks through an AWS CNI IP exhaustion incident where 15 services experienced outages before platform teams identified that the cluster had consumed all available IP addresses in its subnet allocation.

The Incident: Cascading Service Failures Without Clear Cause


Services start experiencing partial outages across the cluster. Some pods fail to start, others run but can’t communicate with dependencies, and autoscaling stops working despite clear demand for additional capacity. Each service team reports different symptoms, making it unclear whether these are related incidents or coincidental failures.

From the application teams’ perspective, their services are running fine until suddenly they aren’t. From the platform team’s perspective, the cluster has available compute capacity but pods aren’t scheduling properly. The disconnect between available resources and scheduling failures suggests a systemic issue, but the specific constraint isn’t obvious.

Before AI: The Layered Infrastructure Investigation

The on-call engineer starts with standard diagnostics. They check pod status and see multiple pods stuck in Pending or ContainerCreating states. The events show vague errors about network interface allocation failures, but these don’t immediately point to IP exhaustion as the root cause.
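A quick way to separate CNI allocation failures from unrelated pod errors is to scan event messages for IP-allocation signatures. This is a hypothetical sketch: the signature strings below are illustrative of what the AWS VPC CNI surfaces through kubelet events, and the exact wording varies by plugin version.

```python
# Illustrative signatures of CNI IP-allocation failure; exact message text
# varies by AWS VPC CNI version, so treat these as examples, not a spec.
CNI_IP_SIGNATURES = (
    "failed to assign an ip address to container",
    "no available ip addresses",
    "insufficientfreeaddressesinsubnet",
)

def looks_like_ip_exhaustion(event_message: str) -> bool:
    """Return True if a pod event message suggests CNI IP exhaustion."""
    msg = event_message.lower()
    return any(sig in msg for sig in CNI_IP_SIGNATURES)

# Sample event messages (illustrative, not captured from a real cluster):
events = [
    'Failed to create pod sandbox: add cmd: '
    'failed to assign an IP address to container',
    "Back-off restarting failed container",
]
flagged = [e for e in events if looks_like_ip_exhaustion(e)]
print(len(flagged))  # 1
```

Filtering events this way across namespaces is what turns “15 services with different symptoms” into one cluster-wide signal.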

They examine cluster capacity and see that CPU and memory are available, so it’s not a simple resource constraint. The investigation expands to check the continuous delivery pipeline for recent changes that might have affected networking. Nothing obvious appears in recent deployments.

At this point, the scope of the issue becomes clearer. The engineer checks logs across multiple services and realizes that 15 different services are experiencing similar failures. This suggests a cluster-wide networking problem rather than application-specific issues.

A senior engineer with networking expertise joins the investigation. They inspect metrics and traces to understand the failure pattern, look for errors that might indicate what’s constraining pod scheduling, and examine the AWS CNI configuration for network-layer issues.

The investigation eventually focuses on IP address allocation. They check the subnet configuration and discover that the cluster has consumed all available IP addresses. The AWS CNI plugin can’t assign IPs to new pods because the subnet is exhausted. Autoscaling can’t add capacity because new pods would face the same IP allocation failures.
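The per-node ceiling that makes this constraint concrete comes from ENI limits. With the default AWS VPC CNI behavior, each pod receives a secondary IPv4 address from an ENI, and EKS publishes the resulting formula. A minimal sketch of that math:

```python
def max_pods_per_node(max_enis: int, ipv4_per_eni: int) -> int:
    """EKS max-pods formula for the default AWS VPC CNI:
    ENIs * (IPv4 addresses per ENI - 1) + 2.
    One IP per ENI is the ENI's primary address, and the +2 covers
    host-networking pods that don't consume a pod IP."""
    return max_enis * (ipv4_per_eni - 1) + 2

# m5.large supports 3 ENIs with 10 IPv4 addresses each:
print(max_pods_per_node(3, 10))  # 29
```

Multiply the per-node pod IP demand across the node group and it becomes clear how a modest cluster can drain a small subnet.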

Now they need to understand how this happened. They examine the pod IP allocation patterns, check whether there’s IP address leakage from terminated pods, and look at the subnet sizing to determine if it was provisioned correctly for the cluster’s workload density.

The remediation requires multiple steps. They need to identify which services are most critical, use the error patterns to prioritize recovery, and coordinate across teams to determine which pods can be terminated to free up IPs for higher-priority services. They also need to add subnet capacity, but that requires infrastructure changes that take time to implement.

Result: 1-2 engineers, 3-6 hours for the initial investigation, 12 hours total to implement subnet expansion and restore all services; deep expertise in AWS networking and CNI configuration required.

The incident gets resolved by expanding subnet capacity and optimizing IP allocation, but 15 services experienced outages during the investigation and remediation process.

With AI SRE: Immediate CNI Resource Analysis

The same service failures trigger Klaudia’s detection as the pod scheduling pattern emerges. Instead of sequentially checking compute resources, then networking configuration, then subnet capacity, the AI simultaneously analyzes pod scheduling failures, CNI plugin behavior, subnet utilization, and IP allocation patterns.

Klaudia identifies the root cause immediately: AWS CNI IP exhaustion. The cluster has consumed all available IP addresses in its subnet allocation, which prevents new pods from starting and blocks autoscaling. The AI correlates this with the 15 affected services and recognizes this as a resource exhaustion issue that requires immediate subnet expansion.

The AI provides comprehensive root cause analysis with full context. It shows the current subnet utilization, identifies how many IP addresses are allocated versus available, maps which services are affected, and explains why autoscaling isn’t working (new pods can’t get IP addresses).
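The “allocated versus available” arithmetic is worth making explicit, because AWS subnets hold fewer usable addresses than their CIDR size suggests: AWS reserves five addresses in every subnet (the network address, the broadcast address, and three for internal use). A minimal sketch:

```python
def usable_ips(prefix_len: int) -> int:
    """Usable IPv4 addresses in an AWS subnet: AWS reserves 5 addresses
    per subnet (network, broadcast, and 3 internal), so a /24 yields 251."""
    return 2 ** (32 - prefix_len) - 5

print(usable_ips(24))  # 251
print(usable_ips(28))  # 11
```

A /24 that looks like 256 addresses on paper supports 251 pod IPs at most, before accounting for node primary IPs and the warm-pool addresses the CNI pre-allocates per ENI.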

Klaudia also provides the immediate remediation path:

  • Add additional subnet capacity to the cluster to allow new pod scheduling. 
  • Review IP allocation patterns to identify whether any terminated pods are holding IPs unnecessarily. 
  • Implement IP address management best practices to prevent future exhaustion.
  • Prioritize which services should be restored first based on criticality.
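The second bullet, checking whether terminated pods are still holding IPs, reduces to a set difference between the CNI’s view of assignments and the pods that actually exist. This is a hypothetical sketch with invented pod names and addresses, assuming you can export the CNI’s IP-to-pod assignments (for example from the plugin’s IP address management state) and the list of live pods:

```python
def find_leaked_ips(cni_assignments: dict[str, str],
                    live_pods: set[str]) -> list[str]:
    """IPs the CNI still tracks as assigned whose pod no longer exists."""
    return sorted(ip for ip, pod in cni_assignments.items()
                  if pod not in live_pods)

# Invented example data:
assignments = {
    "10.0.1.15": "checkout-7d4f9",
    "10.0.1.22": "payments-old-2xk1q",  # pod terminated, IP never released
}
live = {"checkout-7d4f9"}
print(find_leaked_ips(assignments, live))  # ['10.0.1.22']
```

Releasing leaked IPs is the fastest lever during the incident, since subnet expansion takes longer to land.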

The engineer handling the incident doesn’t need deep expertise in AWS networking architecture or CNI plugin internals. The AI has already identified the constraint and provided the specific actions needed to restore service and prevent recurrence.

Result: 1 engineer, 5 minutes to RCA and begin subnet expansion, no specialized AWS networking expertise required.

Why IP Exhaustion Is Hard to Diagnose

IP exhaustion creates symptoms that look like many other problems. Pods fail to schedule, which could be resource constraints. Network communication fails, which could be policy issues. Autoscaling doesn’t work, which could be cluster autoscaler configuration. Each symptom suggests different root causes, and teams often investigate multiple hypotheses before discovering the actual constraint.

This creates an investigation problem across cloud provider networking layers. Platform teams need to understand how Kubernetes networking integrates with AWS VPC networking, how the CNI plugin manages IP allocation, how subnet sizing affects cluster capacity, and how terminated pods release their IP addresses. Each layer requires specialized knowledge.

The blast radius assessment is complex because IP exhaustion affects services unpredictably. Services that need to scale during the incident can’t get new pods scheduled. Services that experience pod failures can’t recover because replacement pods can’t get IPs. The cascading effect makes it difficult to determine which services are primarily affected versus experiencing secondary impacts.

As cluster sizes grow and workload density increases, IP exhaustion becomes more likely. Teams provision subnets based on expected cluster size, but actual IP consumption depends on pod churn rate, IP allocation efficiency, and whether terminated pods release their IPs promptly. Predicting when exhaustion will occur requires understanding these dynamic factors.
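Even a crude burn-rate projection beats discovering exhaustion at incident time. A minimal sketch, assuming you track net daily growth in allocated IPs (a linear model; real consumption is burstier):

```python
def days_until_exhaustion(usable: int, allocated: int,
                          net_daily_growth: int) -> float:
    """Naive linear projection: remaining IPs / net new allocations per day.
    Returns infinity when allocation is flat or shrinking."""
    if net_daily_growth <= 0:
        return float("inf")
    return (usable - allocated) / net_daily_growth

# A /24 subnet (251 usable IPs) with 200 allocated, growing ~17 IPs/day:
print(days_until_exhaustion(usable=251, allocated=200, net_daily_growth=17))  # 3.0
```

The point isn’t precision; it’s that any projection at all turns a surprise outage into a routine capacity ticket.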

Productivity Gains for Platform Teams

The productivity gain is significant: reducing a 12-hour incident affecting 15 services to a 5-minute diagnosis with immediate subnet expansion. But the operational change matters more for teams managing Kubernetes on AWS.

When IP exhaustion gets identified immediately instead of after hours of investigation, platform teams can provision subnets more aggressively without risking extended outages. They get instant feedback on when clusters are approaching IP limits, which makes capacity planning more proactive.

For organizations running multiple clusters on AWS, this pattern recognition becomes critical. Each cluster might have different subnet configurations, different workload densities, and different IP consumption patterns. AI trained on cross-cluster telemetry understands these variations and identifies IP exhaustion regardless of the specific cluster configuration.

This enables more confident infrastructure scaling. Platform teams don’t need to overprovision subnets out of fear that IP exhaustion will cause extended outages. They can size subnets appropriately for expected workload while knowing that any exhaustion issues will be identified and resolved quickly.

Beyond AWS CNI

While this scenario focuses on AWS CNI IP exhaustion, the same investigation pattern applies to other networking constraints in Kubernetes. Azure CNI IP exhaustion in AKS clusters. Calico IP pool exhaustion in on-premise deployments. Service mesh sidecar injection failures that consume additional IPs. Load balancer limits that prevent service exposure.

All of these create similar symptoms: pods fail to schedule or communicate despite available compute resources. All of them require understanding how Kubernetes networking integrates with underlying infrastructure. All of them benefit from pattern recognition that identifies networking constraints immediately rather than after eliminating other possibilities.

Telemetry-trained AI handles these variations because it’s learned the underlying patterns for network resource exhaustion across different CNI implementations and cloud providers. It knows how to distinguish IP exhaustion from other networking issues and provides the appropriate remediation path for the specific environment.

Preventing Future Exhaustion

Beyond immediate incident resolution, AI-augmented investigation helps prevent recurrence. Klaudia identifies not just that IP exhaustion occurred, but why the cluster consumed its IP allocation faster than expected. Was it excessive pod churn? Inefficient IP allocation by the CNI plugin? Terminated pods holding IPs longer than necessary? Subnet sizing that didn’t account for actual workload density?

Understanding these factors helps platform teams implement better capacity planning. They can adjust subnet sizing based on actual consumption patterns rather than theoretical calculations. They can optimize IP allocation efficiency to reduce waste. They can implement monitoring that alerts before exhaustion occurs rather than after services are affected.
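The simplest form of “alert before exhaustion occurs” is a utilization threshold evaluated against subnet capacity. A minimal sketch, with the 80% warning level as an assumed default rather than a recommendation:

```python
def should_alert(allocated: int, usable: int, warn_at: float = 0.8) -> bool:
    """Fire before exhaustion, not after: alert once subnet IP
    utilization crosses the warn_at fraction (0.8 is an assumed default)."""
    return allocated / usable >= warn_at

# A /24 subnet has 251 usable IPs:
print(should_alert(210, 251))  # True  (~84% utilized)
print(should_alert(100, 251))  # False (~40% utilized)
```

In practice this check would run per subnet per cluster, feeding whatever alerting pipeline the team already uses.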

This proactive capability comes from observing how IP exhaustion manifests across hundreds of clusters. The AI doesn’t just know that subnets can be exhausted, it knows which specific patterns predict exhaustion and which remediation strategies actually prevent recurrence in production environments.

This is what AI-augmented investigation delivers for networking incidents. Not a tool that explains CNI architecture, but a system that knows when clusters are experiencing IP exhaustion, why it happened, and what actions restore service and prevent future occurrences. The knowledge comes from observing thousands of networking incidents across production Kubernetes infrastructure, not from reading AWS documentation.

This was part six of an ongoing series on AI SRE in actual production practice. If you missed the previous parts, you can find them here: