When you’re using an application or tool, it’s important to make sure it is working as it should, which is why health checks are critical. A health check is when an application or tool checks its own components and dependencies and then reports any problem, either by publishing a notification or by exposing a status that others can query.
In this article, we’ll take a look at Kubernetes health checks, readiness checks, how to implement an effective health check, and why health plays an important part in troubleshooting.
Health Checks in VMs and Containers
In traditional deployments on virtual machines (VMs), health checks are configured on the load-balancer side, so that the load balancer can add or remove machines from its configuration and thus manage traffic.
Health checks are also important for Kubernetes to make sure containers are running as they should. One pod can have multiple health checks for the different containers running in it. In the past, there was no concept of readiness checks in traditional cluster management tools, like Apache Mesos. However, you can use readiness checks to manage traffic, only routing traffic to the pod if the readiness check passes; in this role, they are now a very important part of Kubernetes.
Health Check Internals
In a health check, you define the endpoint, interval, timeout, and grace period:
- Endpoint/CMD: This is the URL or command that you want to call or execute to perform the health check. For HTTP, a 200 status code is considered healthy (Kubernetes accepts any status code from 200 to 399). For a command, an exit status of 0 is considered successful. In the case of a TCP connection, if the application accepts the connection, then it is live.
- Interval: This is the time period between two health checks.
- Timeout: This is the amount of time the entity performing a health check will wait before determining that it’s a failure.
- Grace period: This is how long to wait after the application starts before the first health check runs.
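As a rough sketch, the four settings above map onto Kubernetes probe fields roughly as follows. The /health path, port 8080, and all of the timing values are assumptions chosen for illustration, not recommendations.

```yaml
# Fragment of a container spec. Path, port, and timings are assumed values.
livenessProbe:
  httpGet:                  # Endpoint: the URL that Kubernetes calls
    path: /health
    port: 8080
  periodSeconds: 10         # Interval: time between two checks
  timeoutSeconds: 2         # Timeout: how long to wait for a response
  initialDelaySeconds: 15   # Grace period: delay before the first check
  failureThreshold: 3       # Consecutive failures before the check counts as failed
```

The failureThreshold field has no counterpart in the list above; it is included because a single slow response is usually not a good reason to restart a container.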
When it comes to Kubernetes, there are two kinds of health checks: liveness probes and readiness probes. A liveness probe tells you that the application is up and running, while the readiness probe tells you that the application is ready to accept traffic.
Kubernetes also has a startup probe, which protects slow-starting containers: liveness and readiness probes are held off until the startup probe succeeds.
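A minimal sketch of a startup probe paired with a liveness probe might look like this. The /healthz path, port 8080, and the thresholds are assumptions; the idea is to give a slow application a generous startup window without loosening the liveness check that applies afterwards.

```yaml
# Fragment of a container spec. Path, port, and thresholds are assumed values.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10 = 300 seconds to finish starting
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # only starts running once the startup probe has succeeded
```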
Here are the types of readiness/liveness probes:
- Command: You can write a script or execute a bash command. When executed, this script or command will return an exit code. Exit code 0 means the application is healthy, while other exit codes mean the application is not healthy.
- HTTP: This is a simple HTTP request to the pod endpoint (for example, /health).
- TCP: If your application has an open port, Kubernetes can do a simple TCP connect check. If the connection is fine, the application passed the health check.
- gRPC: On Kubernetes versions before 1.24, you can bundle the grpc-health-probe binary in your container and run it as a command probe. Built-in gRPC probes were introduced as an alpha feature in version 1.23 and have been enabled by default since version 1.24. For information about how to enable this, read the official documentation.
- Named ports: You can use a port definition to define the HTTP and TCP health checks. This is not supported in gRPC.
Here is an example of putting a health check in HTTP mode:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: liveness
      name: liveness-http
    spec:
      containers:
      - name: liveness
        image: k8s.gcr.io/liveness
        args:
        - /server
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3
In the above example, the liveness probe calls the /healthz endpoint on port 8080. The first check runs after an initial delay of three seconds, and the check then repeats every three seconds.
For other ways of doing health checks, check out Kubernetes health check syntax.
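To give a feel for the non-HTTP probe types listed earlier, here is a hedged sketch of one pod with a command, a TCP, and a gRPC probe. The pod name, container names, images, ports, and the /tmp/healthy file are all hypothetical, and the gRPC container is assumed to implement the standard gRPC health checking protocol.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-types-demo               # hypothetical name
spec:
  containers:
  - name: command-check
    image: example.com/worker:1.0      # hypothetical image
    livenessProbe:
      exec:                            # Command: exit code 0 means healthy
        command: ["cat", "/tmp/healthy"]
      periodSeconds: 5
  - name: tcp-check
    image: example.com/cache:1.0       # hypothetical image
    ports:
    - name: cache-port
      containerPort: 6379
    readinessProbe:
      tcpSocket:                       # TCP: passes if the connection opens
        port: cache-port               # named port, allowed for HTTP and TCP
      periodSeconds: 10
  - name: grpc-check
    image: example.com/grpc-server:1.0 # hypothetical image
    ports:
    - containerPort: 9090
    livenessProbe:
      grpc:                            # gRPC: port must be a number, not a name
        port: 9090
      periodSeconds: 10
```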
How Kubernetes Performs Health Checks
In each worker node in Kubernetes, there is a component called the kubelet that is responsible for launching, managing, and deleting pods. The kubelet is also the component that runs health checks on the containers. It uses the results to determine whether a container is running fine or has to be killed and restarted.
The kubelet talks to the API server in Kubernetes to get relevant information about the pods it needs to launch and then informs the API server if there are any pod terminations. If the kubelet goes down, even for a few minutes, the node may go into a NotReady state, and its pods may be relaunched on other nodes.
How Do Health Checks Enable Faster Troubleshooting?
To understand how health checks enable faster troubleshooting, consider the following example. Let’s say there is an application deployment and a service in front of the deployment to balance the traffic. If the application is not coming up, start troubleshooting by looking at the health check. If you have set up a proper health check, containers that fail it will be killed and restarted, and you can put an alert on the pod status, which will tell you that the application pods are failing the health check.
Sometimes, your application passes a health check and the application pods are up, but the application is still not receiving any traffic. This can happen if you have implemented a readiness check, but it is not successful. If the readiness check is failing, Kubernetes will not add your pods to the service endpoint and your service will not have any pods to send traffic to.
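The following sketch shows how a readiness probe and a service fit together in this scenario. The names, image, ports, and the /ready and /healthz paths are hypothetical. A pod whose readiness probe is failing keeps running, but it is left out of the service's endpoints until the probe passes again.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:1.0   # hypothetical image
        ports:
        - containerPort: 8080
        livenessProbe:               # failing this restarts the container
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
        readinessProbe:              # failing this only stops traffic to the pod
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                         # only ready pods with this label receive traffic
  ports:
  - port: 80
    targetPort: 8080
```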
There are a few commands that can help you debug issues more quickly.
Use the below command to see if your containers are up and running. This command will show how many pods are up in your deployment.
kubectl get deployment deployment_name -n dep_namespace
If you see that the pods are not up, look at the deployment descriptions or events with the following command. This will show how many pods are up in the replica set.
kubectl describe deployment deployment_name -n dep_namespace
Then, you can take the replica set name and check what is happening in the ReplicaSet events. This will show if there is any issue bringing up your pods.
kubectl describe replicaset replicaset_name -n dep_namespace
Lastly, you can describe your pods to see if they are failing the health checks.
kubectl describe pod pod_name -n dep_namespace
You can also try looking at the pod’s logs to identify why it failed the health checks, using the below command.
kubectl logs -f pod_name -n dep_namespace
In addition, checking the events in deployments, replica sets, and pods, along with the pod logs, will tell you a lot about any issues. If you are running a StatefulSet, you can use the same commands for troubleshooting.
You can also check if the endpoint object in the service has your pod IPs or not.
kubectl describe service service_name -n dep_namespace
If you have a load-balancer service, you will be able to see your instances attached to the load balancer. If your health check is failing, the load balancer will remove the instances, and traffic won’t be forwarded to the instances.
Kubernetes events are very important when you are troubleshooting. Most of the time, you will be able to find the issue in one of the Kubernetes events. You can easily see events related to any Kubernetes object using the kubectl describe command.
Common Health Check Pitfalls
There are several common pitfalls you may run into when running Kubernetes health checks:
- HTTP applications should not have TCP health checks, as they will mark your application as healthy as soon as the port is bound, even if your actual HTTP service is not running. Instead, write a proper health endpoint that checks the application’s critical dependencies before reporting the application as live.
- Always put readiness checks in your servers. This ensures that your application does not prematurely receive traffic it cannot serve.
- Avoid TCP health checks for databases like Redis. A TCP check passes as soon as the Redis server is listening, but this doesn’t ensure that Redis has joined the cluster, taken its role as a master or replica, or loaded its final configuration. In these cases, use a command probe to make sure Redis or the database is in your desired state.
- Avoid verifying dependencies in your health check that are not necessary for the application to be running. Also, avoid health check loops. For example, application X needs applications Y and Z to be online, and Z needs X to be online. In this scenario, if one application goes down, they will all go down, as they are dependent on each other.
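For the Redis pitfall above, a command-based check could look roughly like the sketch below. It assumes that sh and redis-cli are available in the container image, that Redis listens on its default port without authentication, and, for the readiness probe, that Redis runs in cluster mode. The timings are illustrative only.

```yaml
# Fragment of a Redis container spec. Commands and timings are assumptions.
livenessProbe:
  exec:
    # Healthy only if the server actually answers PING with PONG
    command: ["sh", "-c", "redis-cli ping | grep -q PONG"]
  periodSeconds: 10
  timeoutSeconds: 3
readinessProbe:
  exec:
    # Cluster mode only: ready once the node reports a healthy cluster state
    command: ["sh", "-c", "redis-cli cluster info | grep -q 'cluster_state:ok'"]
  periodSeconds: 10
  timeoutSeconds: 3
```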
Health checks are clearly important for every application. The good news is that they are easy to implement and, if done properly, enable you to troubleshoot issues faster. If you log exactly why a health check failed, you can pinpoint and solve issues easily.