Kubernetes troubleshooting is the process of identifying, diagnosing, and resolving issues in Kubernetes clusters, nodes, pods, or containers.
More broadly defined, Kubernetes troubleshooting also includes effective ongoing management of faults and taking measures to prevent issues in Kubernetes components.
Use this Kubernetes troubleshooting cheat sheet when something breaks and you need to quickly narrow down where the problem is: the application, pod, node, service, network, storage, or control plane. Start with the symptom, identify the most likely layer, run the first diagnostic command, and apply the safest fix only after checking events, logs, and recent changes.
Statuses and terms referenced in the cheat sheet: Pending, FailedScheduling, CrashLoopBackOff, ImagePullBackOff, ErrImagePull, imagePullSecret, CreateContainerConfigError, OOMKilled (exit code 137), Running, NotReady.
Effective Kubernetes troubleshooting depends on three things: understanding the issue, managing the incident safely, and preventing the same problem from recurring. Instead of starting with a long list of tools, start with the signals each tool or system gives you during an incident.
A Kubernetes incident usually leaves clues across events, logs, metrics, traces, deployments, Git changes, configuration, alerts, and ownership data. The faster you connect those signals, the faster you can move from “something is broken” to “this changed, this failed, and this is the safest fix.”
The first goal is to understand where the problem is happening and what changed before it started. In Kubernetes, the same symptom can come from the application, pod, node, service, network, storage layer, control plane, or a recent deployment.
Use the signals below to narrow the blast radius and identify the likely cause.
kubectl describe pod <pod-name>
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> --previous
kubectl rollout history deployment/<name>
This signal-based view helps teams avoid guessing. For example, a CrashLoopBackOff should usually start with previous container logs and recent deployment changes, while a Pending pod should start with events, scheduler messages, node capacity, taints, affinity, and quotas.
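As a rough illustration of the symptom-first approach, the sketch below maps a pod status to the first commands worth running. The function name first_checks is hypothetical, and the mapping only covers the two examples mentioned above; everything else falls back to kubectl describe pod.

```shell
# first_checks SYMPTOM: print the first diagnostic commands for a symptom.
# Hypothetical helper; placeholders such as <pod-name> are left for the operator.
first_checks() {
  case "$1" in
    CrashLoopBackOff)
      # Start with the previous container's logs and recent rollout changes.
      echo "kubectl logs <pod-name> -n <namespace> --previous"
      echo "kubectl rollout history deployment/<name> -n <namespace>"
      ;;
    Pending|FailedScheduling)
      # Start with pod events and scheduler messages.
      echo "kubectl describe pod <pod-name> -n <namespace>"
      echo "kubectl get events -n <namespace> --sort-by=.lastTimestamp"
      ;;
    *)
      # Default: describe the pod and read its Events section.
      echo "kubectl describe pod <pod-name> -n <namespace>"
      ;;
  esac
}
```

Calling first_checks CrashLoopBackOff prints the two commands to run first; the function only prints text and never touches the cluster.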
Once you understand the likely cause, the next step is to manage the incident without making the blast radius worse. The goal is not to apply the fastest possible command. The goal is to choose the safest action based on the evidence.
Use this troubleshooting workflow during active incidents:
For example, if a service starts returning 503 errors after a deployment, do not start by restarting random pods. Check the deployment history, Service selectors, EndpointSlices, readiness probes, pod events, and ingress/controller logs first. If the new rollout removed all ready endpoints, the safest fix may be a rollback or a readiness probe correction, not a cluster-wide restart.
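The evidence-gathering part of the 503 scenario could look something like the sketch below. The helper name check_503 and its three arguments are assumptions for illustration; the kubectl subcommands themselves are standard. The rollback is left as a comment so that it is only run once the evidence points at the rollout.

```shell
# check_503 DEPLOYMENT SERVICE NAMESPACE (hypothetical helper, read-only).
check_503() {
  deploy="$1"; svc="$2"; ns="$3"
  # What changed recently?
  kubectl rollout history "deployment/$deploy" -n "$ns"
  # Which labels does the Service select?
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  # Are there any endpoints behind the Service, and are they ready?
  kubectl get endpointslices -n "$ns" -l "kubernetes.io/service-name=$svc"
}

# Only if the evidence points at the new rollout:
#   kubectl rollout undo deployment/<name> -n <namespace>
```

Keeping the helper read-only means it can be run during an incident without widening the blast radius.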
After the incident is resolved, use the same signals to prevent recurrence. The best Kubernetes teams do not only ask “what fixed it?” They ask “what signal would have shown this earlier, who needed to see it, and how do we make the fix permanent?”
Use this prevention map after each incident:
This makes troubleshooting more repeatable. Events show what Kubernetes observed, logs show what the application experienced, metrics show resource and performance impact, traces show request flow, deployments and Git changes show what changed, configs show what the workload expected, alerts show how the issue surfaced, and ownership data shows who can fix it.
Itiel Shwartz
Co-Founder & CTO
In my experience, here are tips that can help you better troubleshoot Kubernetes:
Organize your clusters by namespaces to isolate different environments (e.g., dev, staging, prod) for more targeted troubleshooting.
Turn on Kubernetes audit logging to capture detailed events for security and debugging purposes.
Set resource quotas to prevent a single application from exhausting cluster resources, which can lead to easier identification of resource contention issues.
Implement automated health checks and alerts for node conditions such as disk pressure, memory pressure, and network availability.
Use operators to automate complex application lifecycles and maintain stateful applications more reliably.
Kubernetes is a complex system, and troubleshooting issues that occur somewhere in a Kubernetes cluster is just as complicated.
Even in a small, local Kubernetes cluster, it can be difficult to diagnose and resolve issues, because an issue can represent a problem in an individual container, in one or more pods, in a controller, a control plane component, or more than one of these.
In a large-scale production environment, these issues are exacerbated, due to the low level of visibility and a large number of moving parts. Teams must use multiple tools to gather the data required for troubleshooting and may have to use additional tools to diagnose issues they detect and resolve them.
To make matters worse, Kubernetes is often used to build microservices applications, in which each microservice is developed by a separate team. In other cases, there are DevOps and application development teams collaborating on the same Kubernetes cluster. This creates a lack of clarity about division of responsibility – if there is a problem with a pod, is that a DevOps problem, or something to be resolved by the relevant application team?
In short – Kubernetes troubleshooting can quickly become a mess, waste major resources and impact users and application functionality – unless teams closely coordinate and have the right tools available.
If you are experiencing one of these common Kubernetes errors, here’s a quick guide to identifying and resolving the problem:
CreateContainerConfigError is usually the result of a missing Secret or ConfigMap. Secrets are Kubernetes objects used to store sensitive information like database credentials. ConfigMaps store data as key-value pairs, and are typically used to hold configuration information used by multiple pods.
Run kubectl get pods.
Check the output to see if the pod’s status is CreateContainerConfigError:
$ kubectl get pods
NAME                 READY   STATUS                       RESTARTS   AGE
pod-missing-config   0/1     CreateContainerConfigError   0          1m23s
To get more information about the issue, run kubectl describe pod [name] and look for a message indicating which ConfigMap is missing:
$ kubectl describe pod pod-missing-config
Warning  Failed  34s (x6 over 1m45s)  kubelet  Error: configmap "configmap-3" not found
Now check whether the ConfigMap exists in the cluster. For example:
$ kubectl get configmap configmap-3
If the command returns a NotFound error, the ConfigMap is missing, and you need to create it. See the documentation to learn how to create a ConfigMap with the name requested by your pod.
Make sure the ConfigMap is available by running kubectl get configmap [name] again. If you want to view the content of the ConfigMap in YAML format, add the flag -o yaml.
Once you have verified the ConfigMap exists, run kubectl get pods again, and verify the pod is in status Running:
$ kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
pod-missing-config   0/1     Running   0          1m23s
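The check-then-create step for a missing ConfigMap might be sketched as follows. The helper name ensure_configmap and the key/value pair app.mode=production are placeholders, not values from this article; in practice the ConfigMap must contain whatever keys the pod manifest references.

```shell
# ensure_configmap NAME NAMESPACE (hypothetical helper).
ensure_configmap() {
  name="$1"; ns="$2"
  # kubectl get exits non-zero when the ConfigMap does not exist.
  if kubectl get configmap "$name" -n "$ns" >/dev/null 2>&1; then
    echo "configmap $name already exists"
  else
    # app.mode=production is a placeholder key/value for illustration only.
    kubectl create configmap "$name" -n "$ns" --from-literal=app.mode=production
  fi
}
```

For the example above, this would be called as ensure_configmap configmap-3 default, assuming the pod runs in the default namespace.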
An ImagePullBackOff or ErrImagePull status means that a pod could not run because it attempted to pull a container image from a registry, and failed. The pod refuses to start because it cannot create one or more containers defined in its manifest.
Run the command kubectl get pods.
Check the output to see if the pod status is ImagePullBackOff or ErrImagePull:
$ kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
mypod-1   0/1     ImagePullBackOff   0          58s
Run the kubectl describe pod [name] command for the problematic pod.
The output of this command will indicate the root cause of the issue. This can be one of the following:
docker pull
CrashLoopBackOff means a container started, crashed, and Kubernetes is delaying the next restart attempt because the container keeps failing repeatedly. It does not usually mean the pod cannot be scheduled. Scheduling problems typically show up as Pending or FailedScheduling.
Kubernetes restarts failed containers according to the pod’s restart policy. When the same container crashes again and again, Kubernetes applies an exponential backoff delay between restart attempts to avoid restarting the failing container too aggressively.
Run:
kubectl get pods -n <namespace>
Check whether the affected pod shows CrashLoopBackOff and whether the RESTARTS count keeps increasing.
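Once a pod is confirmed to be in CrashLoopBackOff, the next pieces of evidence are usually the logs of the previous (crashed) container and the reason it terminated. A minimal sketch, assuming a hypothetical helper named crashloop_evidence:

```shell
# crashloop_evidence POD NAMESPACE (hypothetical helper, read-only).
crashloop_evidence() {
  pod="$1"; ns="$2"
  # Logs from the container instance that crashed, not the one being restarted.
  kubectl logs "$pod" -n "$ns" --previous
  # Termination reason of the last run (for example Error or OOMKilled).
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
  # Events, probe failures, and restart count.
  kubectl describe pod "$pod" -n "$ns"
}
```

For multi-container pods, kubectl logs also needs -c with the container name; that is omitted here for brevity.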
When a worker node shuts down or crashes, all stateful pods that reside on it become unavailable, and the node status appears as NotReady.
If a node has a NotReady status for over five minutes (by default), Kubernetes changes the status of pods scheduled on it to Unknown, and attempts to schedule them on another node, where they appear with status ContainerCreating.
Run the command kubectl get nodes.
Check the output to see if the node status is NotReady:
NAME       STATUS     AGE   VERSION
mynode-1   NotReady   1h    v1.2.0
To check if pods scheduled on your node are being moved to other nodes, run the command kubectl get pods -o wide.
Check the output to see if a pod appears twice on two different nodes, as follows:
NAME      READY   STATUS              RESTARTS   AGE   IP       NODE
mypod-1   1/1     Unknown             0          10m   [IP]     mynode-1
mypod-1   0/1     ContainerCreating   0          15s   [none]   mynode-2
If the failed node is able to recover or is rebooted by the user, the issue will resolve itself. Once the failed node recovers and joins the cluster, the following process takes place:
If you have no time to wait, or the node does not recover, you’ll need to help Kubernetes reschedule the stateful pods on another, working node. There are two ways to achieve this:
kubectl delete node [name]
kubectl delete pods [pod_name] --grace-period=0 --force -n [namespace]
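Before deleting a node or force-deleting pods, it can help to capture the node's conditions and the list of pods bound to it. The sketch below assumes a hypothetical helper named inspect_node and uses only read-only commands.

```shell
# inspect_node NODE (hypothetical helper): gather evidence before removing a node.
inspect_node() {
  node="$1"
  # Node conditions (Ready, MemoryPressure, DiskPressure) and recent node events.
  kubectl describe node "$node"
  # Every pod currently bound to the node, across all namespaces.
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$node"
}
```

Force deletion with --grace-period=0 --force removes the pod object without waiting for confirmation from the node, so for stateful workloads it is worth confirming the node is really down first.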
Learn more about Node Not Ready issues in Kubernetes.
If you’re experiencing an issue with a Kubernetes pod, and you couldn’t find and quickly resolve the error in the section above, here is how to dig a bit deeper. The first step to diagnosing pod issues is running kubectl describe pod [name].
Here is example output of the describe pod command, provided in the Kubernetes documentation:
Name:           nginx-deployment-1006230814-6winp
Namespace:      default
Node:           kubernetes-node-wul5/10.240.0.9
Start Time:     Thu, 24 Mar 2016 01:39:49 +0000
Labels:         app=nginx,pod-template-hash=1006230814
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16" ...
Status:         Running
IP:             10.244.0.6
Controllers:    ReplicaSet/nginx-deployment-1006230814
Containers:
  nginx:
    Container ID:   docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    Image:          nginx
    Image ID:       docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    Port:           80/TCP
    QoS Tier:
      cpu:      Guaranteed
      memory:   Guaranteed
    Limits:
      cpu:      500m
      memory:   128Mi
    Requests:
      memory:   128Mi
      cpu:      500m
    State:          Running
      Started:      Thu, 24 Mar 2016 01:39:51 +0000
    Ready:          True
    Restart Count:  0
    Environment:    [none]
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  default-token-4bcbi:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-4bcbi
    Optional:   false
QoS Class:      Guaranteed
Node-Selectors: [none]
Tolerations:    [none]
Events:
  FirstSeen  LastSeen  Count  From                            SubobjectPath           Type    Reason     Message
  ---------  --------  -----  ----                            -------------           ------  ------     -------
  54s        54s       1      {default-scheduler }                                    Normal  Scheduled  Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
  54s        54s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Pulling    pulling image "nginx"
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Pulled     Successfully pulled image "nginx"
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Created    Created container with docker id 90315cc9f513
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Started    Started container with docker id 90315cc9f513
The most important sections in the describe pod output are:
Name
Status
Containers
Containers:State
Volumes
Events
Continue debugging based on the pod state.
If a pod’s status is Pending for a while, it could mean that it cannot be scheduled onto a node. Look at the describe pod output, in the Events section. Try to identify messages that indicate why the pod could not be scheduled. For example:
If a pod’s status is Waiting, this means it is scheduled on a node, but unable to run. Look at the describe pod output, in the ‘Events’ section, and try to identify reasons the pod is not able to run.
Most often, this will be due to an error when fetching the image. If so, check for the following:
If a pod is not running as expected, there can be two common causes: error in pod manifest, or mismatch between your local pod manifest and the manifest on the API server.
It is common to introduce errors into a pod description, for example by nesting sections incorrectly, or typing a command incorrectly.
Try deleting the pod and recreating it with kubectl apply --validate -f mypod1.yaml.
This command returns an error like the following if you misspelled a field name in the pod manifest, for example if you wrote continers instead of containers:
46757 schema.go:126] unknown field: continers
46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/5786
pods/mypod1
It can happen that the pod manifest, as recorded by the Kubernetes API Server, is not the same as your local manifest—hence the unexpected behavior.
Run this command to retrieve the pod manifest from the API server and save it as a local YAML file:
kubectl get pods/[pod-name] -o yaml > apiserver-[pod-name].yaml
You will now have a local file called apiserver-[pod-name].yaml. Open it and compare it with your local YAML. There are three possible cases:
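The export-and-compare step can be scripted roughly as below. The helper name compare_manifest is hypothetical; it writes the API server's copy next to your local file and prints a unified diff.

```shell
# compare_manifest POD LOCAL_FILE (hypothetical helper).
# Exit status follows diff: 0 when the files are identical, 1 when they differ.
compare_manifest() {
  pod="$1"; local_file="$2"
  # Save the manifest as recorded by the API server.
  kubectl get "pods/$pod" -o yaml > "apiserver-$pod.yaml"
  # Show the differences against the local manifest.
  diff -u "$local_file" "apiserver-$pod.yaml"
}
```

Expect some differences even for a healthy pod, because the API server adds status, defaults, and metadata fields. kubectl diff -f mypod1.yaml is an alternative that compares the local file against the live object server-side.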
kubectl debug
If logs, events, and kubectl describe do not reveal the cause of a pod, node, or networking issue, use kubectl debug for deeper investigation. Modern Kubernetes debugging is not limited to opening a shell in a running container. You can add ephemeral containers, copy pods with modified commands or images, inspect nodes through debug pods, capture traffic, and apply debug profiles that grant the right level of access for the investigation.
Use this approach carefully in production. Debug containers can expose sensitive runtime details, and privileged profiles should be limited to trusted operators and short-lived troubleshooting sessions.
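Common kubectl debug invocations (tools typically needed in the debug image include curl, ps, dig, and tcpdump):
Ephemeral container attached to a running pod:
kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.36 --target=<container-name>
Ephemeral container with network tools:
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container-name>
Copy of the pod with a different command:
kubectl debug <pod-name> -n <namespace> -it --copy-to=<pod-name>-debug --container=<container-name> -- sh
Copy of the pod with all container images replaced:
kubectl debug <pod-name> -n <namespace> --copy-to=<pod-name>-debug --set-image=*=ubuntu:latest
Debug pod on a node:
kubectl debug node/<node-name> -it --image=ubuntu:latest
Debug container with the sysadmin profile:
kubectl debug --profile=sysadmin pod/<pod-name> -n <namespace> -it --image=ubuntu:latest
Ephemeral container with the general profile:
kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.36 --target=<container-name> --profile=general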
Use an ephemeral container when the application container is running but does not include the tools you need. This is common with minimal, hardened, or distroless images.
For network-heavy debugging, use a tool image such as nicolaka/netshoot.
Inside the debug container, you can check processes, DNS, connectivity, routes, open ports, or files mounted into the pod:
ps aux
nslookup kubernetes.default
curl -v http://<service-name>.<namespace>.svc.cluster.local
ip addr
ip route
The first step to troubleshooting container issues is to get basic information on the Kubernetes worker nodes and Services running on the cluster.
To see a list of worker nodes and their status, run kubectl get nodes --show-labels. The output will be something like this:
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    [none]   1d    v1.13.0   ...,kubernetes.io/hostname=worker0
worker1   Ready    [none]   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    [none]   1d    v1.13.0   ...,kubernetes.io/hostname=worker2
To get information about Services running on the cluster, run:
kubectl cluster-info
The output will be something like this:
Kubernetes master is running at https://104.197.5.247
elasticsearch-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy
kibana-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kibana-logging/proxy
kube-dns is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kube-dns/proxy
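To diagnose deeper issues with nodes on your cluster, you will need access to logs on the nodes. The logs are found in the following locations:
Control plane node, API server: /var/log/kube-apiserver.log
Control plane node, scheduler: /var/log/kube-scheduler.log
Control plane node, controller manager: /var/log/kube-controller-manager.log
Worker node, kubelet: /var/log/kubelet.log
Worker node, kube-proxy: /var/log/kube-proxy.log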
Let’s look at several common cluster failure scenarios, their impact, and how they can typically be resolved. This is not a complete guide to cluster troubleshooting, but can help you resolve the most common issues.
Kubernetes troubleshooting is hard because incidents rarely come from one clean signal. A failed rollout, unhealthy pod, missing Secret, resource limit, node issue, policy change, or service dependency can all surface as the same user-facing problem. Engineers often have to jump between events, logs, metrics, traces, deployment history, Git changes, alerts, and ownership data before they can understand what broke and how to fix it safely.
Komodor helps teams move from manual Kubernetes troubleshooting to AI-assisted and autonomous incident response. Komodor’s AI SRE platform, powered by Klaudia, continuously detects, investigates, and helps remediate issues across cloud-native environments, reducing the time it takes to identify root cause and recover from production incidents.
Instead of giving teams another dashboard to inspect manually, Komodor correlates Kubernetes context across workloads, nodes, add-ons, CRDs, services, configs, logs, events, metrics, deployment history, and recent changes. This helps teams understand what failed, what triggered it, what else is affected, and what action to take next.
Komodor helps Kubernetes teams troubleshoot faster by providing:
For example, if a workload enters CrashLoopBackOff after a configuration change, Komodor can correlate the pod state, previous logs, deployment history, ConfigMap or Secret changes, related alerts, and service impact. Instead of manually checking each signal in isolation, the team can see the likely root cause, understand the blast radius, and apply a safer remediation path.
This is especially useful in large Kubernetes environments where one incident can cross multiple clusters, namespaces, services, and teams. By combining Kubernetes visibility with Klaudia’s AI-powered investigation and remediation, Komodor helps platform, DevOps, and SRE teams reduce MTTR, prevent repeat incidents, and keep production systems more reliable.
Want to see how AI SRE changes Kubernetes troubleshooting? Try Komodor or test drive Klaudia to see how automatic detection, root cause analysis, and remediation work in a real Kubernetes environment.
The first step in Kubernetes troubleshooting is to identify the scope of the issue. Check whether the problem affects one pod, one deployment, one node, one namespace, one service, or the entire cluster. Start with kubectl get pods -n <namespace>, kubectl get nodes, and kubectl get events -n <namespace> --sort-by=.lastTimestamp to quickly see pod status, node health, and recent warning events.
To troubleshoot a Kubernetes pod, start by checking its status with kubectl get pods -n <namespace>. Then describe the pod with kubectl describe pod <pod-name> -n <namespace> to review events, container state, restart count, scheduling issues, image pull errors, volume mount errors, and probe failures. If the pod is running or restarting, check logs with kubectl logs <pod-name> -n <namespace>. For restarted containers, use kubectl logs <pod-name> -n <namespace> --previous.
To troubleshoot Kubernetes networking, first check whether the issue is happening between pods, through a Service, through Ingress, or outside the cluster. Verify that the Service selector matches the pod labels, then check endpoints with kubectl get endpointslice -n <namespace> or kubectl get endpoints -n <namespace>. Test DNS and connectivity from inside the cluster using a temporary debug pod. Also review NetworkPolicies, ingress controller logs, CoreDNS health, service ports, targetPort values, and whether the destination pods are ready.
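A temporary debug pod for the DNS and connectivity test could be launched along the lines of the sketch below. The helper name test_service_from_cluster and the pod name net-test are hypothetical, and the sketch assumes the Service speaks plain HTTP and that the cluster uses the default cluster.local domain.

```shell
# test_service_from_cluster SERVICE NAMESPACE PORT (hypothetical helper).
test_service_from_cluster() {
  svc="$1"; ns="$2"; port="$3"
  # Assumes the default cluster domain, cluster.local.
  host="$svc.$ns.svc.cluster.local"
  # --rm deletes the pod when the command exits; --restart=Never runs it once.
  kubectl run net-test -n "$ns" --rm -i --restart=Never --image=busybox:1.36 -- \
    sh -c "nslookup $host && wget -qO- -T 5 http://$host:$port/"
}
```

If nslookup fails, the problem is likely DNS or the Service name; if DNS resolves but wget times out, look at endpoints, targetPort values, and NetworkPolicies next.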
AI SRE tools help Kubernetes troubleshooting by correlating signals that engineers usually have to inspect manually, such as events, logs, metrics, traces, deployment history, configuration changes, Git changes, alerts, and ownership data. Instead of jumping between disconnected tools, teams can use AI SRE to detect issues faster, identify likely root causes, understand blast radius, recommend safe remediation steps, and reduce MTTR. In Kubernetes environments, this is especially useful because incidents often span pods, services, nodes, configs, dependencies, and teams.