A Kubernetes node is a physical or virtual machine that runs pods in a Kubernetes cluster. Each node is managed by the control plane and typically runs kubelet, a container runtime, and kube-proxy. Nodes provide the CPU, memory, storage, and networking resources that Kubernetes uses to run containerized workloads.
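There are two types of nodes: control plane nodes, which run the components that manage the cluster, and worker nodes, which run application workloads.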
This is part of an extensive series of guides about microservices.
A Kubernetes node is a single machine in a cluster that serves as an abstraction. Instead of managing specific physical or virtual machines, you can treat each node as pooled CPU and RAM resources on which you can run containerized workloads. When an application is deployed to the cluster, Kubernetes distributes the work across the nodes. Workloads can be moved seamlessly between nodes in the cluster.
A Kubernetes pod is the smallest unit of management in a Kubernetes cluster. A pod includes one or more containers, and operators can attach additional resources to a pod, such as storage volumes. Pods are ephemeral by design: they are disposable, and a failed pod is replaced by an identical unit. Each pod has its own IP address, allowing it to communicate with other pods on the same node or on other nodes.
The Kubernetes Scheduler, running on the control plane node, is responsible for finding an eligible worker node for each pod and assigning the pod to it. Workload resources such as Deployments combine a pod template with a replica count and optional scheduling constraints, which together define how many instances of the pod should run and on which types of nodes. When a node fails or has insufficient resources to run a pod, the pod is evicted and a replacement is scheduled on another node.
Itiel Shwartz
Co-Founder & CTO
In my experience, here are tips that can help you better manage Kubernetes nodes:
Keep your nodes updated with the latest security patches and Kubernetes versions.
Use tools like Prometheus and Grafana to monitor node health and performance.
Control pod placement on nodes using taints and tolerations.
Define rules to influence pod scheduling based on node labels.
Distribute workloads evenly across nodes to avoid overloading.
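Here are the primary software components that run on every Kubernetes node: the kubelet, kube-proxy, and the container runtime. The following commands are commonly used to check them: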
systemctl status kubelet
crictl ps
kubectl get pods -n kube-system -o wide
kubectl get pods -n kube-system
The kubelet is a software agent that runs on Kubernetes nodes and communicates with the cluster control plane. It allows the control plane to monitor the node, see what it is running, and deliver instructions to the container runtime.
When Kubernetes wants to schedule a pod on a specific node, it sends the pod’s PodSpecs to the kubelet. The kubelet reads the details of the containers specified in the PodSpecs, pulls the images from the registry and runs the containers. From that point onwards, the kubelet is responsible for ensuring these containers are healthy and maintaining them according to the declarative configuration.
kube-proxy enables networking on Kubernetes nodes, with network rules that allow communication between pods and entities outside the Kubernetes cluster. kube-proxy either forwards traffic directly or leverages the operating system packet filtering layer.
kube-proxy can run in three different modes: iptables, ipvs, and userspace (a deprecated mode that is not recommended for use). iptables, the default mode, is suitable for clusters of moderate size, but it evaluates network rules sequentially, which can slow routing as the number of Services grows. ipvs can support a larger number of Services because it looks up rules in hash tables rather than scanning a sequential list.
The container runtime, such as Docker, containerd, or CRI-O, is the software component responsible for running containers on the node. Kubernetes itself does not start or stop containers or manage the basic container lifecycle; it delegates that work to the runtime. The kubelet interfaces with any container engine that supports the Container Runtime Interface (CRI), giving it instructions according to the needs of the Kubernetes cluster.
Kubernetes no longer includes dockershim, which was removed in Kubernetes v1.24. Modern clusters typically use CRI-compatible runtimes such as containerd or CRI-O. If an environment still depends on Docker Engine as the node runtime, it requires a CRI-compatible adapter such as cri-dockerd.
You can use the kubectl command line to view the status of a Kubernetes node.
kubectl describe node [node-name]
Here is an example of the status returned by a node:
Name:               kubernetes-node-861h
Role:
Labels:             kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    kubernetes.io/hostname=kubernetes-node-861h
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:
CreationTimestamp:  Mon, 04 Sep 2017 17:13:23 +0800
Phase:
Conditions:
  ...
Addresses:          10.240.115.55,104.197.0.26
Capacity:
  ...
Allocatable:
  ...
System Info:
  ...
The most important parts of a node status report are: Addresses, Conditions, Capacity/Allocatable, and System Info. The node status report also shows the node’s taints, which tell the Kubernetes scheduler which pods should be kept off the node unless they carry a matching toleration. You can read more about node affinities, taints and tolerations below.
The Addresses section of the node status report can represent the hostname, as reported by the kernel of the node, the external IP of the node, and the internal IP that is routable within the cluster. The way these fields are displayed depends on whether the node is a bare-metal machine or a compute instance running in the cloud.
The Conditions section of the node status report looks like this:
...
Conditions:
  Type            Status   LastHeartbeatTime                LastTransitionTime               Reason             Message
  ----            ------   -----------------                ------------------               ------             -------
  OutOfDisk       Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
  MemoryPressure  Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure    Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
  Ready           Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
...
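Here are some of the common conditions that appear in a node status report:
Ready: True if the node is healthy and able to accept pods, False if it is not, and Unknown if the control plane has not heard from the node recently.
MemoryPressure: True if the node is running low on available memory.
DiskPressure: True if the node is running low on available disk space.
OutOfDisk: reported by older Kubernetes versions, such as the one in the example above, when there is insufficient free space on the node to add new pods.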
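As a rough illustration of how these conditions can be read programmatically, the Python sketch below classifies a node from a list of condition records. The function name, the return shape, and the dictionary layout are assumptions made for this example; they mirror the Type and Status columns of the report rather than any official client library.

```python
def summarize_node(conditions):
    """Return (state, pressures) for a list of {'type', 'status'} records.

    state is 'Ready', 'NotReady', or 'Unknown', taken from the Ready
    condition. pressures lists every other condition whose status is 'True'.
    This is an illustrative classification, not kubectl's own logic.
    """
    by_type = {c["type"]: c["status"] for c in conditions}
    ready = by_type.get("Ready", "Unknown")
    if ready == "True":
        state = "Ready"
    elif ready == "False":
        state = "NotReady"
    else:
        state = "Unknown"  # e.g. the kubelet stopped posting node status
    pressures = sorted(
        t for t, status in by_type.items() if t != "Ready" and status == "True"
    )
    return state, pressures


# Conditions copied from the sample report above: every status is Unknown.
sample = [
    {"type": "OutOfDisk", "status": "Unknown"},
    {"type": "MemoryPressure", "status": "Unknown"},
    {"type": "DiskPressure", "status": "Unknown"},
    {"type": "Ready", "status": "Unknown"},
]
print(summarize_node(sample))
```

For the sample report, the node is classified as Unknown with no confirmed pressure conditions, which matches the message that the kubelet stopped posting node status.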
The Capacity and Allocatable sections of the node status report look like this:
...
Capacity:
 cpu: 2
 hugePages: 0
 memory: 4046788Ki
 pods: 110
Allocatable:
 cpu: 1500m
 hugePages: 0
 memory: 1479263Ki
 pods: 110
...
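These parameters reflect the node’s resources, which determine how many pods can run on the node:
Capacity: the total resources on the node, including CPU, memory, and the maximum number of pods.
Allocatable: the portion of those resources that is available to regular pods, after resources reserved for system and Kubernetes daemons are set aside.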
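To make the gap between the two sections concrete, here is a minimal Python sketch that converts the quantities from the sample output into base units and computes how much is held back from pods. The helper names are illustrative, and only a subset of Kubernetes quantity suffixes is handled.

```python
# Subset of the suffixes Kubernetes accepts for memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def cpu_to_millicores(value):
    """Convert a CPU quantity such as '2' or '1500m' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def memory_to_bytes(value):
    """Convert a memory quantity such as '4046788Ki' to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain number of bytes


def reserved(capacity, allocatable):
    """Resources present on the node but not offered to regular pods."""
    return {
        "cpu_millicores": cpu_to_millicores(capacity["cpu"])
        - cpu_to_millicores(allocatable["cpu"]),
        "memory_bytes": memory_to_bytes(capacity["memory"])
        - memory_to_bytes(allocatable["memory"]),
    }


# Values copied from the sample report above.
capacity = {"cpu": "2", "memory": "4046788Ki"}
allocatable = {"cpu": "1500m", "memory": "1479263Ki"}
print(reserved(capacity, allocatable))
```

For the sample node, 500 millicores of CPU and 2567525 Ki of memory are not offered to regular pods.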
The System Info section of the node status report looks like this:
...
System Info:
 Machine ID: 8e025a21a4254e11b028584d9d8b12c4
 System UUID: 349075D1-D169-4F25-9F2A-E886850C47E3
 Boot ID: 5cd18b37-c5bd-4658-94e0-e436d3f110e0
 Kernel Version: 4.4.0-31-generic
 OS Image: Debian GNU/Linux 8 (jessie)
 Operating System: linux
 Architecture: amd64
 Container Runtime Version: docker://1.12.5
 Kubelet Version: v1.6.9+a3d1dfa6f4335
 Kube-Proxy Version: v1.6.9+a3d1dfa6f4335
ExternalID: 15233045891481496305
Non-terminated Pods: (9 in total)
  Namespace  Name  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------  ----  ------------  ----------  ---------------  -------------
  ...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits    Memory Requests   Memory Limits
  ------------  ----------    ---------------   -------------
  900m (60%)    2200m (146%)  1009286400 (66%)  5681286400 (375%)
Events: ...
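This provides useful information about hardware and software on the node, including the kernel version, OS image, architecture, container runtime version, and the kubelet and kube-proxy versions. It also lists the pods running on the node and the total resources they request.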
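The percentages in the Allocated resources block can be reproduced from the Allocatable values shown earlier (1500m CPU and 1479263Ki memory). The Python sketch below assumes the percentage is the total divided by the allocatable amount with the fraction dropped; that rounding rule is an assumption that happens to match the sample, not a statement about how kubectl computes it.

```python
def percent_of_allocatable(total, allocatable):
    """Share of the node's allocatable resource, truncated to a whole percent."""
    return total * 100 // allocatable


allocatable_cpu_m = 1500                  # 1500m from the Allocatable section
allocatable_mem_bytes = 1479263 * 1024    # 1479263Ki converted to bytes

report = {
    "cpu_requests": percent_of_allocatable(900, allocatable_cpu_m),
    "cpu_limits": percent_of_allocatable(2200, allocatable_cpu_m),
    "memory_requests": percent_of_allocatable(1009286400, allocatable_mem_bytes),
    "memory_limits": percent_of_allocatable(5681286400, allocatable_mem_bytes),
}
print(report)
```

Limits above 100 percent, as with CPU and memory limits here, mean the node is overcommitted: pods are allowed to burst to more than the node can deliver if they all peak at once.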
Here are three criteria you can use to determine the optimal number of nodes in your Kubernetes cluster:
Choosing the right number of Kubernetes nodes is only part of the planning process. You also need to decide how large each node should be, how much spare capacity to keep available, and how the cluster should scale when workloads change.
There is no single ideal node size for every Kubernetes cluster. Smaller nodes can reduce the impact of a single node failure, because fewer pods are affected when one node goes down. Larger nodes can improve resource efficiency and reduce operational overhead, but they also increase the blast radius of a failure. The right choice depends on your workload patterns, availability requirements, pod density, cloud provider limits, and cost strategy.
As a practical starting point, teams should evaluate node sizing across four areas:
Kubernetes supports large clusters, but there are still recommended scale boundaries. As of Kubernetes v1.36, Kubernetes is designed for clusters with no more than 5,000 nodes, 110 pods per node, 150,000 total pods, and 300,000 total containers. These limits are useful when designing high-scale clusters, but most teams should focus first on efficient node pools, resource requests, and autoscaling behavior.
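As a quick sanity check during planning, the boundaries above can be encoded as a small lookup. This Python sketch is only illustrative: the limit values are the ones quoted in the paragraph, while the function name and the plan dictionary are hypothetical.

```python
# Scale boundaries quoted in the text above.
LIMITS = {
    "nodes": 5_000,
    "pods_per_node": 110,
    "total_pods": 150_000,
    "total_containers": 300_000,
}


def exceeded_limits(plan):
    """Return the names of the boundaries a planned cluster would exceed."""
    return [name for name, limit in LIMITS.items() if plan.get(name, 0) > limit]


# Hypothetical plan: modest cluster, but very dense nodes.
plan = {"nodes": 200, "pods_per_node": 150, "total_pods": 30_000, "total_containers": 45_000}
print(exceeded_limits(plan))
```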
Kubernetes node planning should also account for autoscaling. The Cluster Autoscaler can add or remove nodes from preconfigured node groups. When pods cannot be scheduled because the current cluster does not have enough resources, the autoscaler can add nodes to the node group that best fits the pending pods. When nodes are underused and workloads can safely run elsewhere, it can remove unnecessary nodes.
For autoscaling to work well, pods need accurate CPU and memory requests. If requests are too low, Kubernetes may pack too many pods onto a node and create resource pressure. If requests are too high, the autoscaler may add more nodes than necessary and increase infrastructure costs.
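The effect of request sizes on node count can be seen with a simple packing estimate. The Python sketch below uses a first-fit-decreasing heuristic, which is a simplification chosen for illustration and not the Cluster Autoscaler's actual algorithm; the pod and node sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRequest:
    cpu_m: int    # CPU request in millicores
    mem_mi: int   # memory request in MiB


def nodes_needed(pods, node_cpu_m, node_mem_mi):
    """Estimate how many identical nodes are needed to fit all pod requests."""
    free = []  # remaining [cpu_m, mem_mi] on each node opened so far
    for pod in sorted(pods, key=lambda p: (p.cpu_m, p.mem_mi), reverse=True):
        if pod.cpu_m > node_cpu_m or pod.mem_mi > node_mem_mi:
            raise ValueError("a pod requests more than an empty node offers")
        for slot in free:
            if slot[0] >= pod.cpu_m and slot[1] >= pod.mem_mi:
                slot[0] -= pod.cpu_m
                slot[1] -= pod.mem_mi
                break
        else:
            # No existing node has room, so open a new one.
            free.append([node_cpu_m - pod.cpu_m, node_mem_mi - pod.mem_mi])
    return len(free)


# 20 replicas on nodes with 1500m CPU and 4096 MiB allocatable (hypothetical).
accurate = nodes_needed([PodRequest(250, 256)] * 20, 1500, 4096)
inflated = nodes_needed([PodRequest(500, 256)] * 20, 1500, 4096)
print(accurate, inflated)
```

In this example, doubling the CPU request from 250m to 500m raises the estimate from 4 nodes to 7, even though the workload itself has not changed.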
A strong node autoscaling setup usually includes:
Node sizing is not a one-time decision. As workloads grow, teams should regularly review node utilization, pod density, failed scheduling events, and autoscaler activity to make sure the cluster has enough capacity without wasting compute.
Kubernetes allows you to flexibly control which nodes should run your pods. It is possible to manually assign a pod to a node, but in most cases, you will define a mechanism that allows Kubernetes to dynamically assign pods to nodes. Two of these mechanisms are node selectors and node affinity.
Both node selectors and affinity are closely tied to Kubernetes labels. A label is a key-value piece of metadata you can attach to a Kubernetes resource, which lets you identify and manage it.
A node selector lets you specify which nodes the pod should be deployed on. The Kubernetes scheduler reads the pod template (also called pod specification), searches for eligible nodes and deploys the pod.
The simplest type of node selection is the nodeSelector field of the podSpec. It is a set of key-value pairs, which lets you define labels that the node needs to match in order to be eligible to run the pod. This is known as a label selector.
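The matching rule behind nodeSelector is simple enough to sketch in a few lines of Python: a node is eligible only if it carries every key-value pair in the selector. The node names and the disktype label below are hypothetical, and the function is an illustration of the rule, not scheduler code.

```python
def matches_node_selector(node_labels, node_selector):
    """True if the node has every label in the selector with the same value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical nodes and labels.
nodes = {
    "node-a": {"kubernetes.io/os": "linux", "disktype": "ssd"},
    "node-b": {"kubernetes.io/os": "linux", "disktype": "hdd"},
    "node-c": {"kubernetes.io/os": "linux"},
}
selector = {"disktype": "ssd"}

eligible = [name for name, labels in nodes.items() if matches_node_selector(labels, selector)]
print(eligible)
```

An empty selector matches every node, which is why pods without a nodeSelector can be placed anywhere the other scheduling rules allow.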
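Node affinities provide an expressive language you can use to define which nodes to run a pod on. You can define hard requirements, which a node must meet for the pod to be scheduled there (requiredDuringSchedulingIgnoredDuringExecution), and soft preferences, which the scheduler tries to satisfy but does not guarantee (preferredDuringSchedulingIgnoredDuringExecution).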
Node affinity is conceptually similar to nodeSelector – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
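To show how affinity rules are richer than a plain label match, the Python sketch below evaluates label expressions with the In, NotIn, Exists, and DoesNotExist operators. Expressions inside one term must all match, and a node is eligible if any required term matches. The data layout loosely follows the matchExpressions structure, while the function names, node names, and the disktype label are assumptions for this example.

```python
def expression_matches(labels, expr):
    """Evaluate one label expression against a node's labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return labels.get(key) not in values  # a missing label also matches
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {operator}")


def term_matches(labels, term):
    """All expressions in a term must match (logical AND)."""
    return all(expression_matches(labels, e) for e in term["matchExpressions"])


def is_eligible(labels, required_terms):
    """Hard requirement: at least one term must match (logical OR)."""
    return any(term_matches(labels, t) for t in required_terms)


def preference_score(labels, preferred):
    """Soft preference: sum the weights of the preferences the node satisfies."""
    return sum(p["weight"] for p in preferred if term_matches(labels, p["preference"]))


required = [{"matchExpressions": [
    {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}]}]
preferred = [{"weight": 50, "preference": {"matchExpressions": [
    {"key": "disktype", "operator": "In", "values": ["ssd"]}]}}]

nodes = {
    "node-a": {"kubernetes.io/os": "linux", "disktype": "ssd"},
    "node-b": {"kubernetes.io/os": "linux"},
    "node-c": {"kubernetes.io/os": "windows", "disktype": "ssd"},
}
ranking = {name: preference_score(labels, preferred)
           for name, labels in nodes.items() if is_eligible(labels, required)}
print(ranking)
```

Here node-c is excluded by the hard requirement, and node-a is preferred over node-b because it satisfies the weighted preference.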
Taints are the opposite of affinity – a taint is like defining that a node “doesn’t like” a certain set of pods and those pods will, if possible, not schedule on the node. A node can have one or more taints defined on it.
You can define tolerations in pod templates to indicate that, despite a taint, you want to allow (but not require) the pod to run on nodes that have a matching taint.
You can use taints and tolerations to ensure pods are not scheduled onto nodes that are not appropriate for them.
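The interaction between taints and tolerations can be sketched as a matching function. The Python example below follows the documented matching rules in simplified form: a toleration matches a taint when the effect agrees (or the toleration leaves it empty) and either the operator is Exists or the key and value are equal. The taint key dedicated, its value, and the function names are hypothetical.

```python
def tolerates(toleration, taint):
    """True if this toleration matches the given taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists matches every taint.
        return not key or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations, node_taints):
    """A pod may be placed only if every hard taint is tolerated.

    PreferNoSchedule is a soft effect, so it never blocks placement here.
    """
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


gpu_node_taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]
gpu_pod = [{"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}]
web_pod = []

print(can_schedule(gpu_pod, gpu_node_taints), can_schedule(web_pod, gpu_node_taints))
```

Note that a toleration only permits placement on the tainted node. To also steer the pod toward that node, it is usually combined with a node selector or node affinity rule.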
Kubernetes node errors indicate an issue on a machine participating in a Kubernetes cluster, which can affect its ability to run and manage pods. Below are two common errors and what you can do about them.
If a node has a NotReady status for over five minutes, the status of pods running on it becomes Unknown, and new pods assigned to it get stuck in the ContainerCreating state.
How to identify the issue
Resolving the issue
In some cases, this issue will be resolved on its own if the node is able to recover or the user reboots it. If this doesn’t happen, you can remove the failed node from the cluster using the kubectl delete node command.
Learn more about Node Not Ready issues in Kubernetes.
This error indicates that kubelet is not running properly on the node, so it cannot participate in the Kubernetes cluster.
How to identify the issue
Run systemctl status kubelet and look for the message node [node-name] not found.
Resolving the issue
A common way to resolve this issue is to reset the node using the kubeadm reset command, use kubeadm to create a new token, and then use the new token in a kubeadm join command.
Node issues rarely stay isolated. A NotReady node, DiskPressure condition, failed kubelet, noisy neighbor, or container runtime problem can quickly turn into pod failures, delayed deployments, failed scheduling events, and service-level incidents.
Komodor helps teams troubleshoot Kubernetes node issues by connecting node health, workload behavior, recent changes, events, alerts, logs, and ownership context in one place. Instead of investigating node problems manually across kubectl, dashboards, logs, alerts, and tickets, teams can use Komodor’s AI SRE platform to detect, investigate, and remediate issues faster.
With Komodor, teams can:
For example, if pods are stuck in ContainerCreating because a node is NotReady, Komodor can help connect the pod symptoms to the underlying node condition, recent cluster changes, related events, and impacted services. If a node is under MemoryPressure or DiskPressure, Komodor can help teams understand which workloads are consuming resources and whether the issue is isolated to one node, one node pool, or a larger cluster-wide pattern.
This is especially useful in multi-cluster environments, where node issues can be difficult to prioritize manually. Komodor gives teams a centralized view of Kubernetes health and reliability, while Klaudia helps investigate complex incidents, surface likely root causes, and recommend the next best action.
A Kubernetes node is a physical or virtual machine that runs workloads in a Kubernetes cluster. Nodes provide the CPU, memory, storage, and networking resources that pods need to run. Each node is managed by the Kubernetes control plane.
A Kubernetes node usually runs three main components: kubelet, a container runtime, and kube-proxy. Kubelet communicates with the control plane and makes sure pods are running as expected. The container runtime runs the containers, and kube-proxy helps manage network traffic for Kubernetes Services.
A node is the machine that provides compute resources. A pod is the smallest deployable workload unit in Kubernetes. In simple terms, pods run on nodes, and nodes provide the infrastructure that pods need.
A control plane node runs the components that manage the cluster, such as the API server, scheduler, controller manager, and etcd. A worker node runs application workloads. In production environments, the control plane and worker nodes are often separated for reliability and scalability.
Node NotReady means Kubernetes does not consider the node healthy enough to run workloads. This can happen because of kubelet problems, network issues, container runtime failures, resource pressure, or communication problems between the node and the control plane.
DiskPressure happens when a node is running low on available disk space. Common causes include too many container images, large logs, high ephemeral storage usage, or workloads writing too much data to the node filesystem. Kubernetes may evict pods if disk pressure becomes severe.
MemoryPressure happens when a node is running low on available memory. This is often caused by workloads using more memory than expected, missing or inaccurate memory requests and limits, memory leaks, or too many pods running on the same node.
The default recommended limit is usually 110 pods per node. The actual number can vary depending on cluster configuration, cloud provider limits, networking setup, workload resource usage, and operational requirements. Most teams should size nodes based on real CPU, memory, storage, and networking needs rather than aiming for the maximum pod count.
Kubernetes can support very large clusters, but the right number of nodes depends on workload size, availability needs, cost strategy, cloud provider quotas, and operational complexity. Instead of choosing a fixed node count, teams should design around workload requirements, failure domains, autoscaling behavior, and resource efficiency.
Fewer large nodes can improve resource efficiency and reduce management overhead, but each node failure can affect more workloads. More small nodes can reduce the impact of a single node failure and improve workload distribution, but they add more node-level overhead. The best choice depends on workload patterns, availability requirements, and cost goals.
Kubernetes node autoscaling adjusts the number of nodes in a cluster based on workload demand. When pods cannot be scheduled because there is not enough available capacity, autoscaling can add nodes. When nodes are underused and workloads can safely move elsewhere, autoscaling can remove unnecessary nodes.
Taints are applied to nodes to prevent pods from being scheduled there unless the pods have matching tolerations. They are commonly used for dedicated nodes, GPU nodes, infrastructure nodes, or workloads that need special isolation.
Pods can stay in Pending when Kubernetes cannot find a suitable node to run them. Common reasons include insufficient CPU or memory, node taints without matching tolerations, strict node affinity rules, unavailable node pools, or cluster autoscaling problems.
Pods may get stuck in ContainerCreating when the assigned node has problems pulling images, mounting volumes, connecting to the container runtime, configuring networking, or communicating with required services. Checking node events, pod events, kubelet logs, and container runtime status can help identify the cause.
AI SRE tools can help by correlating node health, pod behavior, recent changes, alerts, logs, and events in one place. For node issues like NotReady, DiskPressure, failed scheduling, or noisy neighbors, an AI SRE platform can help teams find the likely root cause faster and identify the next remediation steps.