Kubernetes Troubleshooting – The Complete Guide

What is Kubernetes Troubleshooting?

Kubernetes troubleshooting is the process of identifying, diagnosing, and resolving issues in Kubernetes clusters, nodes, pods, or containers.

More broadly defined, Kubernetes troubleshooting also includes effective ongoing management of faults and taking measures to prevent issues in Kubernetes components.

This article will focus on:

  • Providing solutions to common errors, including CreateContainerConfigError, ImagePullBackOff, CrashLoopBackOff and Kubernetes Node Not Ready.
  • Explaining initial diagnosis of problems in Kubernetes pods and clusters.
  • Showing where to find logs and other information required for deeper analysis.

Kubernetes Troubleshooting Cheat Sheet: Symptoms, Checks, and Safe Fixes

Use this Kubernetes troubleshooting cheat sheet when something breaks and you need to quickly narrow down where the problem is: the application, pod, node, service, network, storage, or control plane. Start with the symptom, identify the most likely layer, run the first diagnostic command, and apply the safest fix only after checking events, logs, and recent changes.

Symptom | Likely layer | What to check | Safe fix
Pod is stuck in Pending | Scheduling / cluster capacity | Look at Events for FailedScheduling, insufficient CPU/memory, taints, affinity rules, node selectors, or quota limits. | Adjust requests, add capacity, fix taints/tolerations, relax affinity rules, or update quotas.
Pod shows CrashLoopBackOff | Application / container runtime | Check previous container logs, exit codes, failed probes, missing env vars, bad commands, or app startup errors. | Roll back the latest change, fix the app/config issue, tune probes, or correct the container command.
Pod shows ImagePullBackOff or ErrImagePull | Image registry / credentials | Look for wrong image name, wrong tag, private registry auth failure, missing imagePullSecret, or registry outage. | Correct the image reference, restore registry access, or fix the pull secret.
Pod shows CreateContainerConfigError | Config / Secret / volume reference | Check for missing ConfigMaps, Secrets, environment variable references, or invalid volume mounts. | Create or restore the missing object, fix the manifest reference, then redeploy through the normal pipeline.
Pod is OOMKilled or exits with code 137 | Memory / resource limits | Check memory limits, usage spikes, recent deployments, and whether the app has a memory leak. | Increase memory limits carefully, right-size requests, optimize the app, or roll back the change that caused the spike.
Pod is Running but not receiving traffic | Service / readiness / endpoints | Check Service selectors, ready endpoints, readiness probes, ports, targetPorts, and labels. | Fix labels/selectors, correct port mappings, repair readiness probes, or restart only the affected workload.
Service returns 502, 503, or timeout errors | Service / ingress / network | Check ingress rules, backend Service, endpoints, controller logs, NetworkPolicy, and upstream pod health. | Fix routing rules, restore healthy endpoints, correct NetworkPolicy, or roll back ingress/service changes.
DNS resolution fails inside pods | Cluster DNS / CoreDNS | Check CoreDNS pod health, DNS config, network policy, and whether failures affect one namespace or the whole cluster. | Restart unhealthy CoreDNS pods, fix DNS config, or remove the policy blocking DNS traffic.
Node shows NotReady | Node / kubelet / infrastructure | Check kubelet status, disk pressure, memory pressure, network availability, container runtime, and CNI health. | Cordon the node, drain only when safe, fix kubelet/runtime/network issues, or replace the node.
Pods are evicted | Node pressure / resource exhaustion | Look for disk pressure, memory pressure, ephemeral storage usage, and eviction messages. | Free node resources, adjust requests/limits, reduce ephemeral storage usage, or scale capacity.
PVC will not mount | Storage / CSI / permissions | Check PVC status, StorageClass, CSI driver events, access modes, zone mismatch, and volume attachment errors. | Fix StorageClass/PVC settings, restore CSI health, correct access mode mismatch, or reschedule in the right zone.
kubectl cannot reach the cluster | API server / kubeconfig / network | Check kubeconfig context, credentials, VPN, firewall, API server health, and cloud provider auth. | Switch context, refresh credentials, restore network access, or check managed Kubernetes control plane status.
Rollout is stuck | Deployment / ReplicaSet / probes | Check new ReplicaSet, unavailable pods, failed probes, image pulls, resource limits, and recent deployment changes. | Pause or roll back the rollout, fix the failing manifest, then redeploy safely.
Cluster is slow or unstable | Control plane / API server / etcd / load | Check API server readiness, scheduler/controller health, request latency, node count, and recent cluster-wide changes. | Reduce noisy workloads, investigate control plane metrics, roll back risky changes, or escalate to the managed Kubernetes provider.
Kubernetes Troubleshooting Cheat Sheet for Common Cluster Issues
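
The exit code 137 in the OOMKilled row follows a common convention: a code above 128 usually means the process was ended by a signal, and the signal number is the code minus 128 (137 is signal 9, SIGKILL). Here is a minimal sketch of that arithmetic. The function name explain_exit_code is hypothetical, and a code of 137 on its own does not prove an out-of-memory kill, so confirm the termination reason in the kubectl describe pod output.

```shell
# Hypothetical helper: interpret a container exit code.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 conventionally mean "terminated by signal (code - 128)".
    echo "exit $code: terminated by signal $((code - 128))"
  else
    echo "exit $code: process exited with an error"
  fi
}
```

For example, explain_exit_code 137 reports signal 9, and explain_exit_code 143 reports signal 15 (SIGTERM).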

The Three Pillars of Kubernetes Troubleshooting

Effective Kubernetes troubleshooting depends on three things: understanding the issue, managing the incident safely, and preventing the same problem from recurring. Instead of starting with a long list of tools, start with the signals each tool or system gives you during an incident.

A Kubernetes incident usually leaves clues across events, logs, metrics, traces, deployments, Git changes, configuration, alerts, and ownership data. The faster you connect those signals, the faster you can move from “something is broken” to “this changed, this failed, and this is the safest fix.”

1. Understanding: collect the right troubleshooting signals

The first goal is to understand where the problem is happening and what changed before it started. In Kubernetes, the same symptom can come from the application, pod, node, service, network, storage layer, control plane, or a recent deployment.

Use the signals below to narrow the blast radius and identify the likely cause.

Signal | What it reveals during an incident | Examples of what to check
Kubernetes events | Scheduling failures, image pull errors, probe failures, volume mount issues, node pressure, and object-level warnings. | kubectl describe pod <pod-name>, kubectl get events -n <namespace> --sort-by=.lastTimestamp
Pod and container logs | Application errors, startup failures, dependency issues, unhandled exceptions, failed migrations, and crash reasons. | kubectl logs <pod-name> -n <namespace>, kubectl logs <pod-name> --previous
Metrics | CPU pressure, memory spikes, OOM kills, latency, saturation, throttling, node pressure, and resource exhaustion. | CPU/memory usage, restart count, request latency, node disk pressure, network errors
Traces | Where requests slow down or fail across services, APIs, databases, queues, or external dependencies. | Failed spans, slow downstream calls, timeout patterns, service-to-service latency
Deployment history | Whether the issue started after a release, rollout, image change, config update, or scaling event. | kubectl rollout history deployment/<name>, recent Helm/Argo/CD changes
Git changes | The exact code, manifest, Helm value, policy, or infrastructure change that introduced the issue. | Recent commits, pull requests, merged deployment changes, IaC updates
Configuration | Missing or incorrect ConfigMaps, Secrets, environment variables, probes, resource limits, service selectors, and volume mounts. | Deployment YAML, ConfigMaps, Secrets, probe settings, labels, selectors
Alerts | The first symptom detected and whether the issue is user-facing, infrastructure-level, or isolated to one workload. | Alert timing, affected service, alert source, severity, repeated firing patterns
Ownership data | Which team owns the service, deployment, namespace, cluster, or dependency involved in the incident. | Service catalog, namespace owner, deployment labels, escalation path

This signal-based view helps teams avoid guessing. For example, a CrashLoopBackOff should usually start with previous container logs and recent deployment changes, while a Pending pod should start with events, scheduler messages, node capacity, taints, affinity, and quotas.
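
As a rough sketch of this signal-first habit, the helper below gathers the describe output, recent events, current logs, and previous container logs for one pod in a single pass. The function name collect_pod_signals and the 50-line log limit are assumptions for illustration, and the sketch assumes a single-container pod.

```shell
# Hypothetical helper: gather first-pass signals for one pod.
collect_pod_signals() {
  local ns="$1" pod="$2"
  echo "== describe =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== events =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
  echo "== current logs (last 50 lines) =="
  kubectl logs "$pod" -n "$ns" --tail=50
  echo "== previous container logs (last 50 lines) =="
  # --previous fails if the container has never restarted; ignore that case.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
}
```

Call it as collect_pod_signals <namespace> <pod-name> and read the sections in order before changing anything.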

2. Management: connect signals to safe actions

Once you understand the likely cause, the next step is to manage the incident without making the blast radius worse. The goal is not to apply the fastest possible command. The goal is to choose the safest action based on the evidence.

Use this troubleshooting workflow during active incidents:

Incident step | Signal to use | Safe action
Confirm the blast radius | Alerts, metrics, affected namespaces, affected services | Determine whether the issue affects one pod, one workload, one node, one namespace, or the whole cluster.
Check what changed | Deployment history, Git changes, config changes, ownership data | Identify recent releases, image changes, manifest updates, policy changes, scaling events, or infrastructure changes.
Inspect the failing object | Kubernetes events, pod status, container state, logs | Run kubectl describe, check events, check current logs, and check previous logs for restarted containers.
Validate resource health | Metrics, node conditions, restart counts, OOM events | Check CPU, memory, ephemeral storage, disk pressure, throttling, and node availability.
Verify traffic flow | Services, EndpointSlices, ingress, DNS, traces | Confirm selectors, ready endpoints, ports, targetPorts, ingress rules, DNS resolution, and downstream dependencies.
Apply the lowest-risk mitigation | Runbook, rollback data, ownership data | Roll back a bad deployment, restore missing config, scale carefully, cordon a bad node, or route traffic away from the failing component.
Confirm recovery | Alerts, metrics, logs, traces, user-facing checks | Make sure the symptom clears, error rates fall, latency recovers, and the workload remains stable.
Troubleshooting workflow during active incidents

For example, if a service starts returning 503 errors after a deployment, do not start by restarting random pods. Check the deployment history, Service selectors, EndpointSlices, readiness probes, pod events, and ingress/controller logs first. If the new rollout removed all ready endpoints, the safest fix may be a rollback or a readiness probe correction, not a cluster-wide restart.
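
One way to check whether a Service still has ready endpoints is to count them across its EndpointSlices, which Kubernetes labels with kubernetes.io/service-name. This is a minimal sketch. The function name count_ready_endpoints is hypothetical.

```shell
# Hypothetical helper: count ready endpoints behind a Service.
count_ready_endpoints() {
  local ns="$1" svc="$2"
  # Print one "true"/"false" line per endpoint, then count the "true" lines.
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the call safe.
  kubectl get endpointslices -n "$ns" -l "kubernetes.io/service-name=$svc" \
    -o jsonpath='{range .items[*].endpoints[*]}{.conditions.ready}{"\n"}{end}' \
    | grep -c '^true$' || true
}
```

If count_ready_endpoints <namespace> <service-name> prints 0 right after a rollout, look at readiness probes and selectors, and consider kubectl rollout undo deployment/<name> before restarting anything.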

3. Prevention: turn incident signals into durable fixes

After the incident is resolved, use the same signals to prevent recurrence. The best Kubernetes teams do not only ask “what fixed it?” They ask “what signal would have shown this earlier, who needed to see it, and how do we make the fix permanent?”

Use this prevention map after each incident:

Prevention area | Signal to preserve or improve | Durable fix
Runbooks | The commands, checks, and safe actions that worked | Update the runbook with the exact symptom, diagnostic steps, owner, rollback path, and verification checks.
Alerts | The signal that detected the problem, or failed to detect it early enough | Tune alert thresholds, add missing alerts, reduce noisy alerts, and connect alerts to service ownership.
Deployment safety | The release, config, or infrastructure change that caused the incident | Add rollout checks, canary steps, policy validation, manifest linting, or automated rollback triggers.
Configuration quality | The ConfigMap, Secret, probe, selector, limit, or policy that failed | Move the fix into Git, validate it in CI/CD, and prevent manual-only production changes.
Observability | The missing log, metric, trace, or event context that slowed diagnosis | Add better instrumentation, labels, dashboards, trace coverage, or log fields.
Ownership and escalation | The team or service boundary that slowed response | Add service owners, namespace labels, escalation paths, and dependency maps.
Automation | The repeated manual step that could be handled safely by automation | Automate low-risk actions such as enrichment, diagnosis, ticket routing, rollback suggestions, or known safe remediations.
Prevention map

This makes troubleshooting more repeatable. Events show what Kubernetes observed, logs show what the application experienced, metrics show resource and performance impact, traces show request flow, deployments and Git changes show what changed, configs show what the workload expected, alerts show how the issue surfaced, and ownership data shows who can fix it.

 

Tips from the expert

Itiel Shwartz

Co-Founder & CTO

Itiel is the CTO and co-founder of Komodor. He is a big believer in dev empowerment and moving fast, and has worked at eBay, Forter, and Rookout (as the founding engineer). Itiel is a backend and infra developer turned “DevOps”, and an avid public speaker who loves talking about cloud infrastructure, Kubernetes, Python, observability, and R&D culture.

In my experience, here are tips that can help you better troubleshoot Kubernetes:

Use namespaces for isolation

Organize your clusters by namespaces to isolate different environments (e.g., dev, staging, prod) for more targeted troubleshooting.

Enable audit logging

Turn on Kubernetes audit logging to capture detailed events for security and debugging purposes.

Utilize resource quotas

Set resource quotas to prevent a single application from exhausting cluster resources, which can lead to easier identification of resource contention issues.

Automate node health checks

Implement automated health checks and alerts for node conditions such as disk pressure, memory pressure, and network availability.

Employ Kubernetes Operators

Use operators to automate complex application lifecycles and maintain stateful applications more reliably.

Why is Kubernetes Troubleshooting so Difficult?

Kubernetes is a complex system, and troubleshooting issues that occur somewhere in a Kubernetes cluster is just as complicated.

Even in a small, local Kubernetes cluster, it can be difficult to diagnose and resolve issues, because an issue can represent a problem in an individual container, in one or more pods, in a controller, a control plane component, or more than one of these.

In a large-scale production environment, these issues are exacerbated, due to the low level of visibility and a large number of moving parts. Teams must use multiple tools to gather the data required for troubleshooting and may have to use additional tools to diagnose issues they detect and resolve them.

To make matters worse, Kubernetes is often used to build microservices applications, in which each microservice is developed by a separate team. In other cases, there are DevOps and application development teams collaborating on the same Kubernetes cluster. This creates a lack of clarity about division of responsibility – if there is a problem with a pod, is that a DevOps problem, or something to be resolved by the relevant application team?

In short – Kubernetes troubleshooting can quickly become a mess, waste major resources and impact users and application functionality – unless teams closely coordinate and have the right tools available.

Troubleshooting Common Kubernetes Errors

If you are experiencing one of these common Kubernetes errors, here’s a quick guide to identifying and resolving the problem:

CreateContainerConfigError

This error is usually the result of a missing Secret or ConfigMap. Secrets are Kubernetes objects used to store sensitive information like database credentials. ConfigMaps store data as key-value pairs, and are typically used to hold configuration information used by multiple pods.

How to identify the issue

Run kubectl get pods.

Check the output to see if the pod’s status is CreateContainerConfigError:

$ kubectl get pods
NAME                 READY   STATUS                       RESTARTS   AGE
pod-missing-config   0/1     CreateContainerConfigError   0          1m23s

Getting detailed information and resolving the issue

To get more information about the issue, run kubectl describe pod [name] and look for a message indicating which ConfigMap is missing:

$ kubectl describe pod pod-missing-config
Warning Failed 34s (x6 over 1m45s) kubelet  Error: configmap "configmap-3" not found

Now check whether the ConfigMap exists in the pod's namespace. For example:

$ kubectl get configmap configmap-3

If the command returns a NotFound error, the ConfigMap is missing, and you need to create it. See the documentation to learn how to create a ConfigMap with the name requested by your pod.

Make sure the ConfigMap is available by running kubectl get configmap [name] again. If you want to view the content of the ConfigMap in YAML format, add the flag -o yaml.

Once you have verified the ConfigMap exists, run kubectl get pods again, and verify the pod is in status Running:

$ kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
pod-missing-config   0/1     Running   0          1m23s
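
The manual check above can be sketched as a small helper that lists the ConfigMaps a pod references and reports which ones are missing. The function name check_pod_configmaps is hypothetical. The sketch only looks at volumes, envFrom, and env references on regular containers, and it does not cover Secrets or init containers.

```shell
# Hypothetical helper: report which ConfigMaps referenced by a pod exist.
check_pod_configmaps() {
  local ns="$1" pod="$2" name
  # Collect ConfigMap names from volumes, envFrom, and env valueFrom references.
  for name in $(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.volumes[*].configMap.name} {.spec.containers[*].envFrom[*].configMapRef.name} {.spec.containers[*].env[*].valueFrom.configMapKeyRef.name}'); do
    if kubectl get configmap "$name" -n "$ns" >/dev/null 2>&1; then
      echo "ok: $name"
    else
      echo "missing: $name"
    fi
  done
}
```

Running check_pod_configmaps default pod-missing-config against the example above would be expected to print a missing line for configmap-3 until the ConfigMap is created.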

ImagePullBackOff or ErrImagePull

This status means that a pod could not run because it attempted to pull a container image from a registry, and failed. The pod refuses to start because it cannot create one or more containers defined in its manifest.

How to identify the issue

Run the command kubectl get pods

Check the output to see if the pod status is ImagePullBackOff or ErrImagePull:

$ kubectl get pods
NAME       READY    STATUS             RESTARTS   AGE
mypod-1    0/1      ImagePullBackOff   0          58s

Getting detailed information and resolving the issue

Run the kubectl describe pod [name] command for the problematic pod.

The output of this command will indicate the root cause of the issue. This can be one of the following:

  • Wrong image name or tag: this typically happens because the image name or tag was typed incorrectly in the pod manifest. Verify the correct image name using docker pull, and correct it in the pod manifest.
  • Authentication issue in the container registry: the pod could not authenticate with the registry to retrieve the image. This could happen because of an issue in the Secret holding the credentials (referenced through imagePullSecrets), or because the node's own identity is not allowed to pull from the registry. Image pulls are not governed by Kubernetes RBAC. Ensure the pod and node have the appropriate Secrets and permissions, then try the operation manually using docker pull.
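
To see exactly which image reference the pod is trying to pull, and what the kubelet reported, a sketch like the following can help. The function name show_image_pull_details is hypothetical.

```shell
# Hypothetical helper: show each container's image and the pod's warning events.
show_image_pull_details() {
  local ns="$1" pod="$2"
  # One line per container: name, then the image reference from the manifest.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
  # Warning events for this pod usually include the registry's error message.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,type=Warning"
}
```

If the events point to an authentication failure on a private registry, a pull secret can be created with kubectl create secret docker-registry and referenced under imagePullSecrets in the pod spec.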

CrashLoopBackOff

CrashLoopBackOff means a container started, crashed, and Kubernetes is delaying the next restart attempt because the container keeps failing repeatedly. It does not usually mean the pod cannot be scheduled. Scheduling problems typically show up as Pending or FailedScheduling.

Kubernetes restarts failed containers according to the pod’s restart policy. When the same container crashes again and again, Kubernetes applies an exponential backoff delay between restart attempts to avoid restarting the failing container too aggressively.
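
To make the backoff concrete: the Kubernetes documentation describes the delay as starting at 10 seconds and doubling after each crash, up to a cap of five minutes. The sketch below reproduces that arithmetic. The function name crashloop_delay is hypothetical, and the exact values may differ by Kubernetes version or configuration.

```shell
# Hypothetical helper: approximate delay (seconds) before restart attempt N,
# assuming a 10s base that doubles each time and is capped at 300s.
crashloop_delay() {
  local restarts="$1" delay=10 i=1
  while [ "$i" -lt "$restarts" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}
```

Under these assumptions the delays run 10, 20, 40, 80, 160, and then stay at 300 seconds, which is why a crashing pod can appear idle for several minutes between attempts.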

How to identify the issue

Run:

kubectl get pods -n <namespace>

Check whether the affected pod shows CrashLoopBackOff and whether the RESTARTS count keeps increasing.
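
Once the status is confirmed, the previous container's termination details and logs usually explain the crash. This is a minimal sketch. The function name inspect_crashloop and the 100-line limit are assumptions for illustration.

```shell
# Hypothetical helper: show why each container last terminated, then its previous logs.
inspect_crashloop() {
  local ns="$1" pod="$2"
  # Columns: container name, restart count, last termination reason, last exit code.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Logs from the container instance that crashed, not the one currently starting.
  kubectl logs "$pod" -n "$ns" --previous --tail=100
}
```

A reason of OOMKilled points to memory limits, while a reason of Error with an application stack trace in the previous logs points to the app or its configuration.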

Kubernetes Node Not Ready

When a worker node shuts down or crashes, all stateful pods that reside on it become unavailable, and the node status appears as NotReady.

If a node has a NotReady status for over five minutes (by default), Kubernetes changes the status of the pods scheduled on it to Unknown and attempts to schedule replacements on another node, where they appear with status ContainerCreating.

How to identify the issue

Run the command kubectl get nodes.

Check the output to see if the node status is NotReady:

NAME        STATUS      AGE    VERSION
mynode-1    NotReady    1h     v1.2.0

To check if pods scheduled on your node are being moved to other nodes, run the command kubectl get pods -o wide.

Check the output to see if a pod appears twice on two different nodes, as follows:

NAME       READY    STATUS               RESTARTS      AGE    IP        NODE
mypod-1    1/1      Unknown              0             10m    [IP]      mynode-1
mypod-1    0/1      ContainerCreating    0             15s    [none]    mynode-2

Resolving the issue

If the failed node is able to recover or is rebooted by the user, the issue will resolve itself. Once the failed node recovers and joins the cluster, the following process takes place:

  1. The pod with Unknown status is deleted, and volumes are detached from the failed node.
  2. The pod is rescheduled on the new node, its status changes from Unknown to ContainerCreating and required volumes are attached.
  3. Kubernetes uses a five-minute timeout (by default), after which the pod will run on the node, and its status changes from ContainerCreating to Running.

If you have no time to wait, or the node does not recover, you’ll need to help Kubernetes reschedule the stateful pods on another, working node. There are two ways to achieve this:

  • Remove failed node from the cluster—using the command kubectl delete node [name]
  • Delete stateful pods with status unknown—using the command kubectl delete pods [pod_name] --grace-period=0 --force -n [namespace]
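
Before deleting a node or force-deleting pods, it helps to confirm which nodes are affected and which pods are bound to them. The helpers below are a sketch. The function names not_ready_nodes and pods_on_node are hypothetical.

```shell
# Hypothetical helper: list nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ {print $1}'
}

# Hypothetical helper: list every pod bound to a given node, in all namespaces.
pods_on_node() {
  local node="$1"
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$node"
}
```

A cautious sequence is to run not_ready_nodes, review pods_on_node <node-name>, and run kubectl cordon <node-name> so nothing new is scheduled there while you investigate.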

Learn more about Node Not Ready issues in Kubernetes.

Troubleshooting Kubernetes Pods: A Quick Guide

If you’re experiencing an issue with a Kubernetes pod, and you couldn’t find and quickly resolve the error in the section above, here is how to dig a bit deeper. The first step to diagnosing pod issues is running kubectl describe pod [name].

Understanding the Output of the kubectl describe pod Command

Here is example output of the describe pod command, provided in the Kubernetes documentation:

Name:           nginx-deployment-1006230814-6winp
Namespace:      default
Node:           kubernetes-node-wul5/10.240.0.9
Start Time:     Thu, 24 Mar 2016 01:39:49 +0000
Labels:         app=nginx,pod-template-hash=1006230814
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16" ...
Status:         Running
IP:             10.244.0.6
Controllers:    ReplicaSet/nginx-deployment-1006230814
Containers:
  nginx:
    Container ID:       docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    Image:              nginx
    Image ID:           docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    Port:               80/TCP
    QoS Tier:
      cpu:      Guaranteed
      memory:   Guaranteed
    Limits:
      cpu:      500m
      memory:   128Mi
    Requests:
      memory:           128Mi
      cpu:              500m
    State:              Running
      Started:          Thu, 24 Mar 2016 01:39:51 +0000
    Ready:              True
    Restart Count:      0
    Environment:        [none]
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  default-token-4bcbi:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-4bcbi
    Optional:   false
QoS Class:      Guaranteed
Node-Selectors: [none]
Tolerations:    [none]
Events:
  FirstSeen  LastSeen  Count  From                            SubobjectPath           Type    Reason     Message
  ---------  --------  -----  ----                            -------------           ------  ------     -------
  54s        54s       1      {default-scheduler }                                    Normal  Scheduled  Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
  54s        54s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Pulling    pulling image "nginx"
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Pulled     Successfully pulled image "nginx"
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Created    Created container with docker id 90315cc9f513
  53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal  Started    Started container with docker id 90315cc9f513

The most important sections in the describe pod output are:

  • Name: below this line are basic data about the pod, such as the node it is running on, its labels and current status.
  • Status: this is the current state of the pod, which can be:
    • Pending
    • Running
    • Succeeded
    • Failed
    • Unknown
  • Containers: below this line is data about the containers running in the pod (only one in this example, called nginx).
  • State (under Containers): this indicates the status of the container, which can be:
    • Waiting
    • Running
    • Terminated
  • Volumes: storage volumes, secrets or ConfigMaps mounted by containers in the pod.
  • Events: recent events occurring on the pod, such as images pulled, containers created and containers started.

Continue debugging based on the pod state.

Pod Stays Pending

If a pod’s status is Pending for a while, it could mean that it cannot be scheduled onto a node. Look at the describe pod output, in the Events section. Try to identify messages that indicate why the pod could not be scheduled. For example:

  • Insufficient resources in the cluster: the cluster may have insufficient CPU or memory resources. This means you’ll need to delete some pods, add resources on your nodes, or add more nodes.
  • Resource requirements: the pod may be difficult to schedule due to specific resource requirements. See if you can relax some of the requirements to make the pod eligible for scheduling on additional nodes.
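
A quick way to pull these scheduling messages for a whole namespace is sketched below. The function name pending_pod_report is hypothetical.

```shell
# Hypothetical helper: list Pending pods and the scheduler's reasons in one namespace.
pending_pod_report() {
  local ns="$1"
  echo "== pending pods =="
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  echo "== FailedScheduling events =="
  kubectl get events -n "$ns" \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp
}
```

The event messages typically state how many nodes were considered and why each group was rejected, which tells you whether to look at capacity, taints, or affinity rules first.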

Pod Stays Waiting

If a pod’s status is Waiting, this means it is scheduled on a node, but unable to run. Look at the describe pod output, in the ‘Events’ section, and try to identify reasons the pod is not able to run.

Most often, this will be due to an error when fetching the image. If so, check for the following:

  • Image name—ensure the image name in the pod manifest is correct
  • Image available—ensure the image is really available in the repository
  • Test manually—run a docker pull command on the local machine, ensuring you have the appropriate permissions, to see if you can retrieve the image

Pod Is Running but Misbehaving

If a pod is not running as expected, there are two common causes: an error in the pod manifest, or a mismatch between your local pod manifest and the manifest on the API server.

Checking for an error in your pod description

It is common to introduce errors into a pod description, for example by nesting sections incorrectly, or typing a command incorrectly.

Try deleting the pod and recreating it with kubectl apply --validate -f mypod1.yaml

This command will give you an error like this if you misspelled a field name in the pod manifest, for example if you wrote continers instead of containers:

46757 schema.go:126] unknown field: continers
46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/5786
pods/mypod1

Checking for a mismatch between local pod manifest and API Server

It can happen that the pod manifest, as recorded by the Kubernetes API Server, is not the same as your local manifest—hence the unexpected behavior.

Run this command to retrieve the pod manifest from the API server and save it as a local YAML file:

kubectl get pods/[pod-name] -o yaml > apiserver-[pod-name].yaml

You will now have a local file called apiserver-[pod-name].yaml. Open it and compare it with your local YAML. There are three possible cases:

  • Local YAML has the same lines as API Server YAML, and more—this indicates a mismatch. Delete the pod and rerun it with the local pod manifest (assuming it is the correct one).
  • API Server YAML has the same lines as local YAML, and more—this is normal, because the API Server can add more lines to the pod manifest over time. The problem lies elsewhere.
  • Both YAML files are identical—again, this is normal, and means the problem lies elsewhere.
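
As an alternative to comparing the two files by hand, kubectl diff shows what would change if the local manifest were applied. Its documented exit codes are 0 for no differences, 1 for differences found, and greater than 1 for an error. The wrapper below is a sketch. The function name manifest_drift is hypothetical.

```shell
# Hypothetical helper: summarize whether a local manifest differs from the live object.
manifest_drift() {
  local file="$1" rc=0
  # kubectl diff prints the differences itself; capture only its exit code here.
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no differences" ;;
    1) echo "differences found" ;;
    *) echo "kubectl diff failed (exit $rc)" ;;
  esac
}
```

For example, manifest_drift mypod1.yaml prints the diff followed by a one-line summary.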

Advanced Kubernetes Debugging with kubectl debug

If logs, events, and kubectl describe do not reveal the cause of a pod, node, or networking issue, use kubectl debug for deeper investigation. Modern Kubernetes debugging is not limited to opening a shell in a running container. You can add ephemeral containers, copy pods with modified commands or images, inspect nodes through debug pods, capture traffic, and apply debug profiles that grant the right level of access for the investigation.

Use this approach carefully in production. Debug containers can expose sensitive runtime details, and privileged profiles should be limited to trusted operators and short-lived troubleshooting sessions.

When to use kubectl debug

Debugging need | Best approach | Example command
The container is running but lacks tools like curl, ps, dig, or tcpdump | Add an ephemeral debug container | kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.36 --target=<container-name>
The container image is distroless or does not include a shell | Add an ephemeral debug container with a tool image | kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container-name>
The app crashes too quickly to inspect with kubectl exec | Copy the pod and change the command | kubectl debug <pod-name> -n <namespace> -it --copy-to=<pod-name>-debug --container=<container-name> -- sh
You need a different image with more debugging tools | Copy the pod and change the container image | kubectl debug <pod-name> -n <namespace> --copy-to=<pod-name>-debug --set-image=*=ubuntu:latest
You need to inspect node-level networking, processes, or filesystem paths | Create a debug pod on the node | kubectl debug node/<node-name> -it --image=ubuntu:latest
You need packet capture for a pod or node issue | Use a debug profile and capture traffic | kubectl debug --profile=sysadmin pod/<pod-name> -n <namespace> -it --image=ubuntu:latest
You need specific privileges for debugging | Apply a static or custom debug profile | kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.36 --target=<container-name> --profile=general

Debug a running pod with an ephemeral container

Use an ephemeral container when the application container is running but does not include the tools you need. This is common with minimal, hardened, or distroless images.

Run:

kubectl debug -it <pod-name> -n <namespace> --image=busybox:1.36 --target=<container-name>

For network-heavy debugging, use a tool image such as nicolaka/netshoot:

kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container-name>

Inside the debug container, you can check processes, DNS, connectivity, routes, open ports, or files mounted into the pod:

ps aux
nslookup kubernetes.default
curl -v http://<service-name>.<namespace>.svc.cluster.local
ip addr
ip route

Troubleshooting Kubernetes Clusters: A Quick Guide

Viewing Basic Cluster Info

The first step to troubleshooting cluster issues is to get basic information on the Kubernetes worker nodes and Services running on the cluster.

To see a list of worker nodes and their status, run kubectl get nodes --show-labels. The output will be something like this:

NAME      STATUS    ROLES    AGE     VERSION        LABELS
worker0   Ready     [none]   1d      v1.13.0        ...,kubernetes.io/hostname=worker0
worker1   Ready     [none]   1d      v1.13.0        ...,kubernetes.io/hostname=worker1
worker2   Ready     [none]   1d      v1.13.0        ...,kubernetes.io/hostname=worker2

To get information about Services running on the cluster, run:

kubectl cluster-info

The output will be something like this:

Kubernetes master is running at https://104.197.5.247
elasticsearch-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy
kibana-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kibana-logging/proxy
kube-dns is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kube-dns/proxy

Retrieving Cluster Logs

To diagnose deeper issues with nodes on your cluster, you will need access to logs on the nodes. The following table explains where to find the logs.

Node Type | Component | Where to Find Logs
Master | API Server | /var/log/kube-apiserver.log
Master | Scheduler | /var/log/kube-scheduler.log
Master | Controller Manager | /var/log/kube-controller-manager.log
Worker | Kubelet | /var/log/kubelet.log
Worker | Kube Proxy | /var/log/kube-proxy.log
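Not every node writes these files. On systemd-based hosts the kubelet usually logs to the journal instead. The sketch below, run on the node itself, tries the file from the table first and falls back to journalctl. The function name show_component_log and the LOG_DIR override are illustrative, and the fallback assumes the component runs as a systemd unit of the same name, which is often true for the kubelet but not for control plane components that run as pods.

```shell
# Print the last lines of a component log, for example kubelet or kube-proxy.
# LOG_DIR defaults to /var/log; override it if logs are written elsewhere.
show_component_log() {
  # $1 = component name, $2 = number of lines to show (default 50)
  if [ -f "${LOG_DIR:-/var/log}/$1.log" ]; then
    tail -n "${2:-50}" "${LOG_DIR:-/var/log}/$1.log"
  elif command -v journalctl >/dev/null 2>&1; then
    # Assumption: the component is a systemd unit named after itself.
    journalctl -u "$1" --no-pager -n "${2:-50}"
  else
    echo "no log file or journal found for $1" >&2
    return 1
  fi
}
```

For example, show_component_log kubelet 100 prints the last 100 kubelet log lines from whichever source is available.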

Common Cluster Failure Scenarios and How to Resolve Them

Let’s look at several common cluster failure scenarios, their impact, and how they can typically be resolved. This is not a complete guide to cluster troubleshooting, but can help you resolve the most common issues.

API Server VM Shuts Down or Crashes

  • Impact: If the API server is down, you will not be able to start, stop, or update pods and services.
  • Resolution: Restart the API server VM.
  • Prevention: Set the API server VM to automatically restart, and set up high availability for the API server.
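After restarting the VM, it is worth confirming that the API server is actually serving requests before moving on. A minimal sketch, assuming kubectl points at the affected cluster; the function names are illustrative, and /readyz is the API server's readiness endpoint.

```shell
# Succeed when the API server answers its /readyz endpoint with "ok".
apiserver_ready() {
  [ "$(kubectl get --raw='/readyz' 2>/dev/null)" = "ok" ]
}

# Poll until the API server is ready.
# $1 = maximum attempts (default 30), $2 = seconds between attempts (default 2)
wait_for_apiserver() {
  tries=0
  until apiserver_ready; do
    tries=$((tries + 1))
    [ "$tries" -ge "${1:-30}" ] && return 1
    sleep "${2:-2}"
  done
}
```

A call such as wait_for_apiserver 60 5 returns 0 as soon as the API server reports ready, and returns 1 if it is still failing after 60 attempts.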

Control Plane Service Shuts Down or Crashes

  • Impact: Services such as the Controller Manager and the Scheduler are typically collocated with the API Server, so if any of them shuts down or crashes, the impact is the same as a shutdown of the API Server.
  • Resolution: Same as API Server VM Shuts Down.
  • Prevention: Same as API Server VM Shuts Down.

API Server Storage Lost

  • Impact: API Server will fail to restart after shutting down.
  • Resolution: Ensure storage is working again, manually recover the state of the API Server from backup, and restart it.
  • Prevention: Keep a recent, readily available snapshot of the API Server’s backing store (etcd). Use reliable storage, such as Amazon Elastic Block Store (EBS), which survives shutdown of the API Server VM, and prefer highly available storage.
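The API Server keeps cluster state in etcd, so a snapshot of etcd is what you restore from. The sketch below wraps etcdctl snapshot save. It assumes etcdctl (v3 API) is installed on the control plane host, etcd listens on 127.0.0.1:2379, and certificates sit in the kubeadm default directory. The function name and the ETCD_ENDPOINT and ETCD_PKI overrides are hypothetical, so adjust them to your cluster.

```shell
# Save an etcd snapshot to the file given as $1.
# Endpoint and certificate paths are assumptions (kubeadm defaults).
backup_etcd() {
  etcdctl \
    --endpoints="${ETCD_ENDPOINT:-https://127.0.0.1:2379}" \
    --cacert="${ETCD_PKI:-/etc/kubernetes/pki/etcd}/ca.crt" \
    --cert="${ETCD_PKI:-/etc/kubernetes/pki/etcd}/server.crt" \
    --key="${ETCD_PKI:-/etc/kubernetes/pki/etcd}/server.key" \
    snapshot save "$1"
}
```

Running backup_etcd /var/backups/etcd-snapshot.db on a schedule, and copying the file off the host, gives you something to restore from if the storage is lost. The destination path here is only an example.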

Worker Node Shuts Down

  • Impact: Pods on the node stop running, and the Scheduler will attempt to run them on other available nodes. The cluster will now have less overall capacity to run pods.
  • Resolution: Identify the issue on the node, bring it back up and register it with the cluster.
  • Prevention: Run pods under a ReplicationController, ReplicaSet, or Deployment, and place a Service in front of them, so that users are not impacted by node failures. Design applications to be fault tolerant.
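To see which workloads are affected by a node failure, you can list the pods that were scheduled to that node. A minimal sketch, assuming kubectl access to the cluster; pods_on_node is an illustrative name.

```shell
# List every pod, in all namespaces, that is assigned to the node named in $1.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

For example, pods_on_node worker1 shows which namespaces and applications depended on worker1, which helps when deciding what to verify once the node is back.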

Kubelet Malfunction

  • Impact: If the kubelet crashes on a node, you will not be able to start new pods on that node. Existing pods may or may not be deleted, and the node will be marked unhealthy.
  • Resolution: Same as Worker Node Shuts Down.
  • Prevention: Same as Worker Node Shuts Down.
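When the kubelet stops reporting, the node's Ready condition typically changes from True to Unknown or False. The sketch below prints a node's conditions and tests for Ready=True. It assumes kubectl access, and both function names are illustrative.

```shell
# Print each condition of node $1 as TYPE=STATUS, one per line.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Succeed only when node $1 reports Ready=True.
node_is_ready() {
  node_conditions "$1" | grep -qx 'Ready=True'
}
```

A check such as node_is_ready worker1 || node_conditions worker1 prints the full condition list only when the node is not ready.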

Unplanned Network Partitioning Disconnecting Some Nodes from the Master

  • Impact: The master nodes think that nodes in the other network partition are down, and those nodes cannot communicate with the API Server.
  • Resolution: Reconfigure the network to enable communication between all nodes and the API Server.
  • Prevention: Use a networking solution that can automatically reconfigure cluster network parameters.

Human Error by Cluster Operator

  • Impact: An accidental command by a human operator, or misconfigured Kubernetes components, can cause loss of pods, services, or control plane components. This can result in disruption of service to some or all nodes.
  • Resolution: Most cluster operator errors can be resolved by restoring the API Server state from backup.
  • Prevention: Implement a solution to automatically review and correct configuration errors in your Kubernetes clusters.
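One lightweight guard against accidental changes is to preview a manifest against the live cluster before applying it. The sketch below wraps kubectl diff, which exits with 0 when there are no differences, 1 when differences were found, and a higher code on error. The function name preview_change is illustrative.

```shell
# Show what applying manifest $1 would change, then summarize the result.
preview_change() {
  rc=0
  kubectl diff -f "$1" || rc=$?
  case "$rc" in
    0) echo "no changes" ;;
    1) echo "changes detected; review the diff above before applying" ;;
    *) echo "kubectl diff failed" >&2; return 2 ;;
  esac
}
```

Running preview_change deployment.yaml before kubectl apply makes an unintended change visible while it is still easy to stop. The file name is only an example.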

Kubernetes Troubleshooting with Komodor’s AI SRE Platform

Kubernetes troubleshooting is hard because incidents rarely come from one clean signal. A failed rollout, unhealthy pod, missing Secret, resource limit, node issue, policy change, or service dependency can all surface as the same user-facing problem. Engineers often have to jump between events, logs, metrics, traces, deployment history, Git changes, alerts, and ownership data before they can understand what broke and how to fix it safely.

Komodor helps teams move from manual Kubernetes troubleshooting to AI-assisted and autonomous incident response. Komodor’s AI SRE platform, powered by Klaudia, continuously detects, investigates, and helps remediate issues across cloud-native environments, reducing the time it takes to identify root cause and recover from production incidents.

Instead of giving teams another dashboard to inspect manually, Komodor correlates Kubernetes context across workloads, nodes, add-ons, CRDs, services, configs, logs, events, metrics, deployment history, and recent changes. This helps teams understand what failed, what triggered it, what else is affected, and what action to take next.

Komodor helps Kubernetes teams troubleshoot faster by providing:

Capability | How it helps during Kubernetes troubleshooting
Automatic issue detection | Surfaces real-time Kubernetes issues across workloads, clusters, services, nodes, add-ons, and dependencies before teams waste time stitching signals together manually.
AI-driven root cause analysis | Klaudia analyzes events, logs, configurations, metrics, deployment history, and change context to explain what failed, what triggered it, and why it matters.
Change and drift context | Connects incidents to recent deployments, config changes, Git changes, policy changes, dependency changes, and Kubernetes drift so teams can identify the most likely cause faster.
Cascading failure analysis | Shows how one failure affects connected services and dependencies, helping teams separate the root cause from downstream symptoms.
Actionable remediation guidance | Provides recommended next steps, safe fixes, and context-aware troubleshooting guidance instead of leaving engineers to guess from raw logs and alerts.
One-click and autonomous remediation | Helps teams move from diagnosis to recovery with guided fixes, one-click actions, and autonomous self-healing for trusted remediation workflows.
MTTR reduction | Shortens the path from alert to root cause to fix by combining detection, investigation, RCA, and remediation in one Kubernetes-focused workflow.
Team and ownership context | Helps route incidents to the right service owner, team, or escalation path with the context needed to act quickly.

For example, if a workload enters CrashLoopBackOff after a configuration change, Komodor can correlate the pod state, previous logs, deployment history, ConfigMap or Secret changes, related alerts, and service impact. Instead of manually checking each signal in isolation, the team can see the likely root cause, understand the blast radius, and apply a safer remediation path.

This is especially useful in large Kubernetes environments where one incident can cross multiple clusters, namespaces, services, and teams. By combining Kubernetes visibility with Klaudia’s AI-powered investigation and remediation, Komodor helps platform, DevOps, and SRE teams reduce MTTR, prevent repeat incidents, and keep production systems more reliable.

Want to see how AI SRE changes Kubernetes troubleshooting? Try Komodor or test drive Klaudia to see how automatic detection, root cause analysis, and remediation work in a real Kubernetes environment.

FAQs About Kubernetes Troubleshooting

The first step in Kubernetes troubleshooting is to identify the scope of the issue. Check whether the problem affects one pod, one deployment, one node, one namespace, one service, or the entire cluster. Start with kubectl get pods -n <namespace>, kubectl get nodes, and kubectl get events -n <namespace> --sort-by=.lastTimestamp to quickly see pod status, node health, and recent warning events.
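These first checks can be bundled into one helper. The sketch below is one possible arrangement, assuming kubectl access; the function name triage_namespace is illustrative, and the type=Warning field selector is an addition that narrows the event list to warnings.

```shell
# Run the initial scope checks for namespace $1: node health, pod status,
# and recent warning events sorted by time.
triage_namespace() {
  kubectl get nodes
  kubectl get pods -n "$1"
  kubectl get events -n "$1" --field-selector type=Warning --sort-by=.lastTimestamp
}
```

For example, triage_namespace production prints the three views in order, so you can tell whether the problem is limited to a few pods or extends to the nodes.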

To troubleshoot a Kubernetes pod, start by checking its status with kubectl get pods -n <namespace>. Then describe the pod with kubectl describe pod <pod-name> -n <namespace> to review events, container state, restart count, scheduling issues, image pull errors, volume mount errors, and probe failures. If the pod is running or restarting, check logs with kubectl logs <pod-name> -n <namespace>. For restarted containers, use kubectl logs <pod-name> -n <namespace> --previous.
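The choice between current and previous logs can be made automatically from the restart count. The sketch below assumes a single-container pod, since it reads only the first container status, and kubectl access to the cluster. The function name pod_logs is illustrative.

```shell
# Show logs for pod $1 in namespace $2. If the first container has restarted,
# show the logs of the previous instance, which usually hold the crash reason.
pod_logs() {
  restarts="$(kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}')"
  if [ "${restarts:-0}" -gt 0 ]; then
    echo "container restarted $restarts time(s); showing previous logs"
    kubectl logs "$1" -n "$2" --previous
  else
    kubectl logs "$1" -n "$2"
  fi
}
```

For multi-container pods you would need to pick a container explicitly with the -c flag of kubectl logs.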

To troubleshoot Kubernetes networking, first check whether the issue is happening between pods, through a Service, through Ingress, or outside the cluster. Verify that the Service selector matches the pod labels, then check endpoints with kubectl get endpointslice -n <namespace> or kubectl get endpoints -n <namespace>. Test DNS and connectivity from inside the cluster using a temporary debug pod. Also review NetworkPolicies, ingress controller logs, CoreDNS health, service ports, targetPort values, and whether the destination pods are ready.
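A Service with no ready endpoints is a common cause of connection failures, usually because the selector does not match any pod labels or the matching pods are not ready. The sketch below checks for that case, assuming kubectl access; the function name service_has_endpoints is illustrative.

```shell
# Succeed when Service $1 in namespace $2 has at least one ready endpoint IP.
service_has_endpoints() {
  ips="$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  [ -n "$ips" ]
}
```

A check such as service_has_endpoints web production || echo "no ready pods behind the Service" points you at the selector and pod readiness before you start debugging DNS or NetworkPolicies. The Service and namespace names are only examples.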

AI SRE tools help Kubernetes troubleshooting by correlating signals that engineers usually have to inspect manually, such as events, logs, metrics, traces, deployment history, configuration changes, Git changes, alerts, and ownership data. Instead of jumping between disconnected tools, teams can use AI SRE to detect issues faster, identify likely root causes, understand blast radius, recommend safe remediation steps, and reduce MTTR. In Kubernetes environments, this is especially useful because incidents often span pods, services, nodes, configs, dependencies, and teams.