What is Kubernetes Service 503 (Service Unavailable)
The 503 Service Unavailable error is an HTTP status code that indicates the server is temporarily unavailable and cannot serve the client request. In a web server, this means the server is overloaded or undergoing maintenance. In Kubernetes, it means a Service tried to route a request to a pod, but something went wrong along the way:
- The Service could not find any pods matching its selector.
- The Service found some pods matching the selector, but none of them were Running.
- Pods are running but were removed from the Service endpoint because they did not pass the readiness probe.
- Some other networking or configuration issue prevented the Service from connecting with the pods.
503 errors are a severe issue that can result in disruption of service for users. Below we’ll show a procedure for troubleshooting these errors, and some tips for avoiding 503 service errors in the first place.
Keep in mind that it can be difficult to diagnose and resolve Service 503 messages in Kubernetes, because they can involve one or more moving parts in your Kubernetes cluster. It may be difficult to identify and resolve the root cause without proper tooling.
Troubleshooting Kubernetes Service 503 Errors
Step 1: Check if the Pod Label Matches the Service Selector
A possible cause of 503 errors is that a Kubernetes pod does not have the expected label, and the Service selector does not identify it. If the Service does not find any matching pod, requests will return a 503 error.
Run the following command to see the current selector:
kubectl describe service [service-name] -n [namespace-name]
The Selector field shows which label or labels are used to match the Service with pods.
Check if there are pods with this label:
kubectl get pods -n [namespace-name] -l "[label]"
- If you get the message no resources found—this means the Service cannot discover any pods, and clients will get an HTTP 503 error. Add the label to the relevant pods to resolve the problem.
- If there are pods with the required label—continue to the next step.
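As a minimal sketch, a Service selector and a matching pod might look like this (the names my-service, my-app, and the ports are illustrative, not taken from your cluster):

```yaml
# Hypothetical Service: spec.selector must match the pod's labels
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app          # pods must carry this exact label
  ports:
  - port: 80
    targetPort: 8080
---
# Matching pod: metadata.labels satisfies the selector above
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 8080
```

A missing label can also be added to a running pod in place, for example with kubectl label pod my-pod app=my-app.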
Step 2: Verify that Pods Defined for the Service are Running
In step 1 we checked which label the Service selector is using. Run the following command to ensure the pods matched by the selector are in a Running state:
kubectl -n [namespace-name] get pods -l "[label]"
The output will look like this:
NAME                      READY   STATUS             RESTARTS   AGE
my-pod-9ab66e7ee8-23978   0/1     ImagePullBackOff   0          5m10s
- If any pods are not in Running status—as in this example, where an ImagePullBackOff error is preventing the container from starting—investigate and resolve the pod-level issue first.
- If all matched pods are Running—continue to the next step.
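When a pod is stuck in a non-Running state, describing it usually reveals the underlying cause in its events. A sketch, reusing the pod name from the example output above (substitute your own pod and namespace):

```shell
# Inspect the failing pod's events for the underlying error
kubectl describe pod my-pod-9ab66e7ee8-23978 -n [namespace-name]
```

The Events section at the end of the output typically explains the failure—for an ImagePullBackOff, for example, it will show why the image could not be pulled.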
Step 3: Check Pods Pass the Readiness Probe for the Deployment
Next, we’ll check if a readiness probe is configured for the pod:
kubectl describe pod [pod-name] -n [namespace-name] | grep -i readiness
The output will look like this:
Readiness: tcp-socket :8080 delay=10s timeout=1s period=2s #success=1 #failure=3
If the probe is failing, the pod's events will include a warning like this:
Warning  Unhealthy  2m13s (x298 over 12m)  kubelet  Readiness probe failed:
- If the output indicates that the readiness probe failed—understand why the probe is failing and resolve the issue. Refer to our guide to Kubernetes readiness probes.
- If there is no readiness probe or it succeeded—proceed to the next step.
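For reference, a probe definition that would produce the Readiness: line shown above might look like this in the container spec (the port and thresholds are taken directly from that example output):

```yaml
readinessProbe:
  tcpSocket:
    port: 8080            # tcp-socket :8080
  initialDelaySeconds: 10 # delay=10s
  timeoutSeconds: 1       # timeout=1s
  periodSeconds: 2        # period=2s
  successThreshold: 1     # #success=1
  failureThreshold: 3     # #failure=3
```

Until the probe succeeds successThreshold times, the pod is excluded from the Service's endpoints, which is exactly what produces 503 responses.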
Step 4: Verify that Instances are Registered with Load Balancer
If all the above steps did not discover a problem, another common cause of 503 errors is that no instances are registered with the load balancer. Check the following:
- Security groups—ensure that worker nodes running the relevant pods have an inbound rule that allows port access, and that nothing is blocking network traffic on the relevant port ranges.
- Availability zones—if your Kubernetes cluster is running in a public cloud, make sure that there are worker nodes in every availability zone specified by the subnets.
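Before digging into cloud configuration, it is also worth confirming that the Service actually has backends registered. A quick check (service and namespace names are placeholders):

```shell
# List the pod IPs currently backing the Service
kubectl get endpoints [service-name] -n [namespace-name]
```

An empty ENDPOINTS column means no ready pods are behind the Service, and the load balancer has nothing to route to.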
This procedure will help you discover the most basic issues that can result in a Service 503 error. If you didn’t manage to quickly identify the root cause, you will need a more in-depth investigation across multiple components in the Kubernetes deployment.
To complicate matters, more than one component might be malfunctioning (for example, both the pod and the Service), making diagnosis and remediation more difficult.
Avoiding 503 with Graceful Shutdown
Another common cause of 503 errors is that when Kubernetes terminates a pod, containers on the pod drop existing connections. Clients then receive a 503 response. This can be resolved by implementing graceful shutdown.
To understand the concept of graceful shutdown, let’s quickly review how Kubernetes shuts down containers. When a user or the Kubernetes scheduler requests deletion of a pod, the kubelet running on the node first sends a SIGTERM signal via the Linux operating system.
The container can register a handler for SIGTERM and perform some cleanup activity before shutting down. Then, after a configurable grace period, Kubernetes sends a SIGKILL signal and the container is forced to shut down.
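The mechanism can be sketched with a plain shell entrypoint that traps SIGTERM and exits cleanly; here we simulate the kubelet sending the signal (the script path and messages are illustrative—a real server would finish its in-flight requests inside the handler):

```shell
#!/bin/sh
# Write a minimal entrypoint that handles SIGTERM, then simulate
# Kubernetes terminating it. All names here are illustrative.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/sh
shutdown() {
  echo "caught SIGTERM: finishing in-flight requests, then exiting"
  exit 0
}
trap shutdown TERM
echo "server started"
# Sleep in the background and wait, so the trap fires promptly
while true; do sleep 1 & wait $!; done
EOF

sh /tmp/entrypoint.sh &
pid=$!
sleep 1
kill -TERM "$pid"   # what the kubelet sends on pod deletion
wait "$pid"         # returns 0 because the handler exited cleanly
```

Without the trap, the shell would be killed mid-request and clients on open connections would see errors; with it, the process controls its own exit.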
Here are two ways to implement graceful shutdown in order to avoid a 503 error:
- Implement a handler for SIGTERM on the containers matched to the Service. This handler should capture the SIGTERM signal, and ensure that the server continues running until it completes all current requests, and then cleans up its activity and shuts down.
- Add a preStop hook—the container can define a hook that runs before Kubernetes sends the SIGTERM signal, delaying termination within the grace period. The hook can be used to keep serving existing connections until they complete, avoiding 503 errors.
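The preStop approach can be sketched as a pod spec fragment like the following (the container name, image, and durations are assumptions for illustration, not recommendations):

```yaml
spec:
  terminationGracePeriodSeconds: 30   # total time allowed before SIGKILL
  containers:
  - name: web
    image: my-app:1.0
    lifecycle:
      preStop:
        exec:
          # Keep serving while the endpoint controllers remove this pod
          # from the Service; SIGTERM is sent only after this completes.
          command: ["sh", "-c", "sleep 10"]
```

Note that the time spent in the preStop hook counts against terminationGracePeriodSeconds, so the grace period must be long enough to cover both the hook and the SIGTERM handler.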
Resolving 503 Errors with Komodor
Kubernetes troubleshooting relies on the ability to quickly contextualize the problem with what’s happening in the rest of the cluster. More often than not, you will be conducting your investigation during fires in production. The major challenge is correlating service-level incidents with other events happening in the underlying infrastructure. Service 503 errors are a prime example of an error that can occur at the service level, but can also represent a problem with underlying pods or nodes.
Komodor can help with our new ‘Node Status’ view, built to pinpoint correlations between service or deployment issues and changes in the underlying node infrastructure. With this view you can rapidly:
- See service-to-node associations
- Correlate service and node health issues
- Gain visibility over node capacity allocations, restrictions, and limitations
- Identify “noisy neighbors” that use up cluster resources
- Keep track of changes in managed clusters
- Get fast access to historical node-level event data
Beyond node error remediations, Komodor can help troubleshoot a variety of Kubernetes errors and issues, acting as a single source of truth (SSOT) for all of your K8s troubleshooting needs. Komodor provides:
- Change intelligence: Every issue is a result of a change. Within seconds we can help you understand exactly who did what and when.
- In-depth visibility: A complete activity timeline, showing all code and config changes, deployments, alerts, code diffs, pod logs, and more. All within one pane of glass with easy drill-down options.
- Insights into service dependencies: An easy way to understand cross-service changes and visualize their ripple effects across your entire system.
- Seamless notifications: Direct integration with your existing communication channels (e.g., Slack) so you’ll have all the information you need, when you need it.
If you are interested in checking out Komodor, use this link to sign up for a Free Trial.