Kubernetes Disk Pressure: 4 Common Causes and How to Fix It

What Is Node Disk Pressure in Kubernetes? 

Node disk pressure is a condition in Kubernetes that indicates a node is running low on disk space. This situation can affect the node’s ability to host pods effectively, as Kubernetes relies on sufficient disk space for operations such as pulling images, running containers, and storing logs. 

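When available disk space or free inodes fall below the kubelet's eviction thresholds (by default on Linux, less than 10% available on the node filesystem or less than 15% on the image filesystem), Kubernetes marks the node with a DiskPressure condition, signaling that it’s in a state that could compromise its performance or functionality.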
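As a rough illustration of that threshold check, the sketch below computes the percentage of space still available on a filesystem from `df -P` output and compares it with 10%, the kubelet's default hard eviction threshold for `nodefs.available` on Linux. The function names `avail_percent` and `below_threshold` are made up for this example, and the kubelet reads its figures from its own stats rather than from `df`, so treat the result as an approximation.

```shell
# Hypothetical helpers; the names are illustrative, not part of any Kubernetes tool.

# Read `df -P <path>` output on stdin and print the percentage of space
# still available (Available / 1024-blocks), rounded down.
avail_percent() {
  awk 'NR == 2 { printf "%d\n", ($4 * 100) / $2 }'
}

# Succeed when the available percentage ($1) is below the threshold ($2).
below_threshold() {
  [ "$1" -lt "$2" ]
}

# Example on a node (kubelet data usually lives under /var/lib/kubelet):
#   pct=$(df -P /var/lib/kubelet | avail_percent)
#   below_threshold "$pct" 10 && echo "nodefs.available is under 10%"
```

The threshold is passed in as an argument because clusters can override the defaults in the kubelet configuration.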

Once a node is under disk pressure, Kubernetes taints it with node.kubernetes.io/disk-pressure:NoSchedule, so the scheduler stops placing new pods on that node (unless they tolerate the taint) to avoid making the condition worse. Existing pods may continue running, but the kubelet may evict pods that consume a lot of disk space to alleviate the pressure.

To quickly check whether your Kubernetes nodes are experiencing disk pressure, run kubectl describe node <node-name> for each node and look in the Conditions section for the condition type DiskPressure with status True.

Summarized across a cluster, the node conditions look something like this (an illustrative summary rather than verbatim kubectl output). Here node2 is experiencing disk pressure:

NAME    STATUS   ROLES    AGE   VERSION   CONDITIONS
node1   Ready    master   68d   v1.20.2   DiskPressure=False,MemoryPressure=False,PIDPressure=False,Ready=True
node2   Ready    <none>   68d   v1.20.2   DiskPressure=True,MemoryPressure=False,PIDPressure=False,Ready=True
node3   Ready    <none>   68d   v1.20.2   DiskPressure=False,MemoryPressure=False,PIDPressure=False,Ready=True
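One way to get a per-node view like the one above in a single call is a jsonpath query against the node list. This is a minimal sketch: the helper names `disk_pressure_by_node` and `nodes_under_pressure` are hypothetical, and it assumes kubectl is already configured for the cluster.

```shell
# Print "<node><TAB><DiskPressure status>" for every node in the cluster.
disk_pressure_by_node() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}

# Filter that output down to the names of nodes whose status is True.
nodes_under_pressure() {
  awk -F '\t' '$2 == "True" { print $1 }'
}

# Usage:
#   disk_pressure_by_node | nodes_under_pressure
```

Splitting the query from the filter keeps the filter usable on saved output, for example from a support bundle.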

This is part of a series of articles about Kubernetes versions.

Why Should You Care About Node Disk Pressure in Kubernetes?

A DiskPressure condition can result in a few serious issues:

Pod Eviction and Scheduling

Kubernetes may evict pods from a node experiencing disk pressure to reclaim resources and mitigate the risk of system failures. This eviction process is automated: the kubelet ranks pods by whether their usage exceeds their requests, by pod priority, and by how much of the resource they consume, so BestEffort and Burstable pods that exceed their requests tend to be evicted first. Consequently, critical pods may experience unexpected downtime or disruptions, affecting application availability.

The Kubernetes scheduler prevents new pods from being scheduled onto nodes marked with disk pressure to preserve the node’s stability and performance. This constraint can lead to scheduling delays or failures if multiple nodes in a cluster are under disk pressure, potentially impacting application deployment and scaling activities.
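The scheduling block is implemented as a taint with the key `node.kubernetes.io/disk-pressure`, so listing node taints shows which nodes are currently closed to new pods. The sketch below assumes a configured kubectl; the helper names are hypothetical.

```shell
# List every node with the keys of its taints (comma-separated, or <none>).
node_taints() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Keep only the nodes that carry the disk-pressure taint.
with_disk_pressure_taint() {
  awk '$2 ~ /node\.kubernetes\.io\/disk-pressure/ { print $1 }'
}

# Usage:
#   node_taints | with_disk_pressure_taint
```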

Cluster Performance and Stability

As nodes struggle with limited disk space, they may become less responsive or experience increased latencies, affecting the user experience of applications hosted on the cluster. Persistent disk pressure across multiple nodes can lead to a cascading effect, where the performance of the entire cluster is compromised.

Cluster stability is also at risk during disk pressure situations. Essential system components like etcd, the Kubernetes API server, and kubelet might be affected by disk space shortages, leading to cluster-wide issues such as failures in service discovery, networking problems, and delays in executing control commands. 

Kubernetes Operations

Node disk pressure impacts Kubernetes operations by limiting the cluster’s ability to scale and recover from failures. For instance, in a scenario where a cluster needs to scale up rapidly due to increased demand, disk pressure can prevent new pods from being scheduled, leading to service degradation or outages. 

Automated recovery processes, such as pod rescheduling after a node failure, may be hindered if alternative nodes are also experiencing disk pressure.

Possible Causes for Node Disk Pressure 

Here are some of the main conditions that may result in node disk pressure.

1. Application Logs and Data Stored on Local Node Storage

Storing application logs and data on a node’s local disk rather than a network-attached storage (NAS), shared file system, or other external storage solution, can lead to disk pressure. As applications run, they generate logs and data that can quickly accumulate, consuming significant disk space. Without proper log rotation or data management practices, the disk space on the node can be exhausted, triggering disk pressure conditions.
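To see which pods' logs take up the most space on a node, it can help to rank the entries under the kubelet's pod log directory, which is `/var/log/pods` on a standard installation. The helper below is a sketch with a hypothetical name, meant to be run on the node itself.

```shell
# Print the N largest entries (size in KiB, then path) directly under a directory.
# $1 = directory, $2 = how many entries to show (defaults to 5).
largest_entries() {
  dir=$1
  count=${2:-5}
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Usage on a node:
#   largest_entries /var/log/pods 10
```

The same function works for any other directory suspected of filling the disk.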

2. Node Is Running Too Many Pods

Running too many pods on a single node, especially if they generate a high volume of data, can cause disk pressure. Each pod can produce logs, temporary files, and persistent data, contributing to the overall disk usage on the node. As the number of pods increases, the cumulative disk space required can exceed the available capacity, leading to disk pressure. 

Administrators should implement pod resource limits and use node affinity and anti-affinity rules to distribute workloads evenly across the cluster’s nodes.
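A quick way to spot overloaded nodes is to count pods per node. This sketch assumes a configured kubectl and uses hypothetical helper names; pods that have not been scheduled yet appear under `<none>`.

```shell
# Print the node name of every pod in the cluster, one per line.
pod_nodes() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns='NODE:.spec.nodeName'
}

# Turn a list of node names into "<node><TAB><pod count>", busiest node first.
count_per_node() {
  sort | uniq -c | sort -rn | awk '{ print $2 "\t" $1 }'
}

# Usage:
#   pod_nodes | count_per_node
```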

3. Node Is Running Pods with Misconfigured Resource Limits

Without properly configured requests and limits for local storage (the ephemeral-storage resource), pods can consume more disk space than anticipated. This oversight can lead to a rapid exhaustion of disk resources.

Establishing clear resource limits and monitoring pod storage consumption are essential practices to prevent disk pressure. Kubernetes provides mechanisms like ResourceQuotas and LimitRanges to help administrators control resource usage across namespaces and pods, reducing the risk of disk pressure due to misconfiguration.
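As an illustration, the snippet below writes two example manifests to a local file: a pod whose container declares `ephemeral-storage` requests and limits, and a ResourceQuota that caps the total for a namespace. All names and sizes are hypothetical and need adjusting to the workload.

```shell
# Write the example manifests to a file in the current directory.
cat > ephemeral-storage-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        ephemeral-storage: "1Gi"   # local disk the scheduler reserves on the node
      limits:
        ephemeral-storage: "2Gi"   # the pod is evicted if it uses more than this
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota
spec:
  hard:
    requests.ephemeral-storage: "10Gi"
    limits.ephemeral-storage: "20Gi"
EOF

# Apply to a namespace of your choice once the values have been reviewed:
#   kubectl apply -n <namespace> -f ephemeral-storage-example.yaml
```

Keep in mind that once such a quota exists in a namespace, new pods there are generally expected to declare ephemeral-storage requests and limits themselves.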

4. Node Is Running Pods with Misconfigured Storage Requests

Storage requests that are set too low may not reflect the actual disk space requirements of an application, leading to under-provisioning. As the application operates, it may consume more disk space than reserved, contributing to disk pressure on the node.

Accurate configuration of storage requests is crucial for preventing disk pressure. Kubernetes offers PersistentVolumes and PersistentVolumeClaims as part of its storage orchestration, allowing for precise allocation and management of storage resources.
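A minimal PersistentVolumeClaim might look like the sketch below. The claim name and size are hypothetical, and it assumes the cluster has a default StorageClass. Whether the data actually leaves the node depends on that class, since some provisioners place volumes on the node's local disk.

```shell
# Write an example claim to a file in the current directory.
cat > app-data-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi   # size this from measured usage rather than a guess
EOF

# Create the claim, then reference it from the pod spec under volumes:
#   kubectl apply -f app-data-pvc.yaml
```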

Troubleshooting and Resolving Kubernetes Node Disk Pressure

The troubleshooting process for disk pressure involves the following steps.

Identifying Node Disk Pressure

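Administrators can use the kubectl get nodes command to list the nodes in the cluster. Its default output shows only the Ready status, so node conditions have to be requested explicitly, for example with kubectl get nodes -o yaml. Nodes experiencing disk pressure have a condition of type DiskPressure with status True.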

Additionally, examining the output of kubectl describe node provides detailed insights into node conditions, events, and allocated resources, helping identify potential causes of disk pressure.
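Because `kubectl describe node` prints a lot of output, it can be convenient to cut it down to the Conditions block and to pull the node's events separately. The helpers below are a sketch with hypothetical names; the first relies on the Conditions block being followed by the Addresses block, which is the layout kubectl uses.

```shell
# Read `kubectl describe node <name>` output on stdin and print only the
# Conditions block (everything from "Conditions:" up to "Addresses:").
conditions_section() {
  awk '/^Conditions:/ { on = 1 } /^Addresses:/ { on = 0 } on'
}

# List events recorded for one node ($1 = node name). Reasons such as
# NodeHasDiskPressure or EvictionThresholdMet point at disk pressure.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1"
}

# Usage:
#   kubectl describe node <node-name> | conditions_section
#   node_events <node-name>
```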

Analyzing Pod Disk Usage

The kubectl describe pod command shows a pod's configured resource requests and limits and its volumes, but not its live disk usage. For actual consumption over time, administrators can use logging and monitoring tools. Grafana, for example, can visualize disk usage metrics collected by Prometheus and highlight pods with high disk consumption.

Running a command such as du -sh inside a pod’s containers, via kubectl exec, also gives an immediate view of disk usage within the container.
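A small wrapper makes that check repeatable across pods. This is a sketch: `container_du` is a hypothetical name, and it only works when the container image ships a `du` binary, which minimal or distroless images often do not.

```shell
# Run du inside a container.
# $1 = namespace, $2 = pod, $3 = container, $4 = path inside the container.
container_du() {
  kubectl exec -n "$1" "$2" -c "$3" -- du -sh "$4"
}

# Usage:
#   container_du <namespace> <pod-name> <container-name> /var/log
```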

Resolving Node Disk Pressure

Resolving node disk pressure involves both immediate actions and long-term strategies. 

Short-Term Solutions

As an immediate measure, administrators can delete unused images and containers, either with docker system prune on nodes that run Docker (crictl rmi --prune is the counterpart on containerd or CRI-O nodes) or by relying on the kubelet's own image garbage collection. Manually evicting non-critical pods or adjusting pod resource limits can also free up disk space temporarily.
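Which cleanup command applies depends on the container runtime each node reports. The sketch below looks up the runtime and prints a matching command without running it, so nothing is deleted by accident. The helper names are hypothetical, and the printed command is meant to be run on the node itself after review.

```shell
# Print each node with its container runtime, e.g. "node1<TAB>containerd://<version>".
node_runtimes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}

# Map a runtime string ($1) to a cleanup command for that node.
# Prints the command instead of executing it.
cleanup_command() {
  case "$1" in
    docker://*)               echo "docker system prune -f" ;;
    containerd://*|cri-o://*) echo "crictl rmi --prune" ;;
    *)                        echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   node_runtimes
#   cleanup_command "containerd://<version>"
```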

Long-Term Solutions

For long-term prevention, implementing storage management practices is crucial. This includes configuring appropriate resource limits, using external storage solutions for logs and data, and enforcing pod anti-affinity rules to prevent overloading nodes. 

Tools and policies for automated log rotation and efficient image management further help maintain optimal disk usage across the cluster.
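On the node side, container log rotation and image garbage collection are controlled by the kubelet configuration. The fragment below is a sketch that spells out the documented default values as a starting point; the output file name is hypothetical, and the log settings apply to CRI runtimes such as containerd. All four eviction thresholds are listed because setting `evictionHard` replaces the defaults rather than merging with them.

```shell
# Write an example kubelet configuration fragment to the current directory.
cat > kubelet-config-fragment.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi          # rotate a container log once it reaches this size
containerLogMaxFiles: 5            # keep at most this many log files per container
imageGCHighThresholdPercent: 85    # start image garbage collection above this disk usage
imageGCLowThresholdPercent: 80     # free images until usage falls below this
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
EOF

# Merge the values into the node's kubelet configuration file and restart
# the kubelet; the file location depends on how the node was provisioned.
```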

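This command sets CPU and memory limits on a deployment's containers (to cap local disk usage, set an ephemeral-storage limit in the manifest instead):

kubectl set resources deployment my-deployment \
  --limits=memory=200Mi,cpu=1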

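Here is a Deployment manifest that defines pod anti-affinity, so that its replicas are spread across different nodes, reducing the likelihood of any single node facing disk pressure due to pod density. Because the rule is required at scheduling time, a replica stays Pending when no node without a matching pod is available.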

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: nginx-container
        image: nginx
      affinity:
        podAntiAffinity:
          # Never place two pods labeled app=my-app on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - my-app
            # One pod per distinct hostname, i.e. per node
            topologyKey: "kubernetes.io/hostname"
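After applying a manifest like this, a quick check can confirm that no node hosts more than one replica. The sketch below assumes the `app=my-app` label from the manifest and a configured kubectl; the helper names are hypothetical.

```shell
# Print the node of every pod matching a label selector ($1), one per line.
replica_nodes() {
  kubectl get pods -l "$1" --no-headers \
    -o custom-columns='NODE:.spec.nodeName'
}

# Print node names that appear more than once in the input.
shared_nodes() {
  sort | uniq -d
}

# Usage (no output means every replica runs on its own node):
#   replica_nodes app=my-app | shared_nodes
```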

Solving Kubernetes Node Errors Once and for All with Komodor

Kubernetes troubleshooting relies on the ability to quickly contextualize the problem with what’s happening in the rest of the cluster. More often than not, you will be conducting your investigation during fires in production. The major challenge is correlating service-level incidents with other events happening in the underlying infrastructure.

Komodor can help with its ‘Node Status’ view, built to pinpoint correlations between service or deployment issues and changes in the underlying node infrastructure. With this view you can rapidly:

  • See service-to-node associations
  • Correlate service and node health issues
  • Gain visibility over node capacity allocations, restrictions, and limitations
  • Identify “noisy neighbors” that use up cluster resources
  • Keep track of changes in managed clusters
  • Get fast access to historical node-level event data

Beyond node error remediations, Komodor can help troubleshoot a variety of Kubernetes errors and issues, acting as a single source of truth (SSOT) for all of your K8s troubleshooting needs.

In general, Komodor is the go-to solution for managing Kubernetes at scale. Komodor’s platform continuously monitors, analyzes, and visualizes data across your Kubernetes infrastructure, providing clear, actionable insights that make it significantly easier to maintain reliability, troubleshoot in real-time, and optimize costs.

Designed for complex, multi-cluster, and hybrid environments, Komodor bridges the Kubernetes knowledge gap, empowering both infrastructure and application teams to move beyond firefighting. By facilitating collaboration, we help improve operational efficiency, reduce Mean Time to Recovery (MTTR), and accelerate development velocity. 

If you are interested in checking out Komodor, use this link to sign up for a Free Trial.