Kubernetes is currently the de facto container orchestration system on the market. Companies both small and large adopt it, and all major cloud providers offer it as a service. However, Kubernetes is a complex and layered platform, so you can't just jump into it. Every application goes through three essential stages: design, deployment, and operation. This blog post focuses on operation, where you need to monitor and troubleshoot your deployed applications. Specifically, there are four golden signals you need to consider: latency, traffic, errors, and saturation. We'll discuss them in more detail, show you how to set them up using Prometheus, and offer some best practices for using them to enhance your application's reliability.

The Four Golden Signals

Let's start with the basics of monitoring distributed systems. According to the well-known Google SRE book, the four golden signals are the metrics you must constantly observe to ensure the high availability of a distributed application.

1. Latency

Latency, also known as response time, is the time it takes to fulfill a request: the duration between a request (such as fetching an object from a REST API or a page from a web server) and the response from the server side. There are two critical points you need to keep in mind for latency:

- Do not use averages; analyze latency with histograms and percentile thresholds.
- Monitor the latency of errors separately, as failed requests can skew the overall numbers and cause you to misinterpret the results (a short PromQL illustration of this appears at the end of this overview).

2. Traffic

Traffic is the demand on your application, i.e., the volume of requests it receives. Traffic should be defined in terms that fit the system, such as:

- HTTP requests per second for web servers
- Network I/O rate for streaming servers
- Number of transactions per second for key/value stores

3. Errors

Errors show the rate of failing requests, whether caused by server errors, the wrong content type being returned, or actions that violate a policy. Errors are critical for revealing bugs in an application or misconfigurations in services. It's helpful to watch for errors in every application in your stack so you can pinpoint the faulty one more easily.

4. Saturation

Saturation shows how "full" your service is in terms of its most constrained resource, such as memory, I/O, or storage. Saturation is the trickiest of the golden signals, because you need utilization metrics in order to measure it. Critically, applications start to degrade as their utilization approaches 100%.

The USE and RED Methods

There are two widely accepted monitoring methods related to the golden signals:

- The USE (utilization, saturation, errors) method focuses on the utilization of resources. In other words, it treats the infrastructure and its limits as the source of problems.
- The RED (rate, errors, duration) method focuses on the application's own request metrics and does not consider the infrastructure or external systems.

In contrast to the four golden signals, the RED and USE methods each cover only one of the two views: infrastructure or application. Therefore, to create end-to-end observability, we suggest you focus on the golden signals for your distributed applications.
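To make the latency advice concrete, here is a short PromQL sketch. It assumes an HTTP latency histogram named http_request_duration_seconds with a code label, which is what the sample application used later in this post exposes; adjust the metric and label names to match your own instrumentation.

# 90th-percentile latency of successful requests over the last 10 minutes
histogram_quantile(0.9,
  sum by (le) (rate(http_request_duration_seconds_bucket{code=~"2.."}[10m])))

# 90th-percentile latency of failed requests, tracked as its own series
histogram_quantile(0.9,
  sum by (le) (rate(http_request_duration_seconds_bucket{code=~"5.."}[10m])))

Keeping the two series separate prevents a handful of very fast error responses (or very slow timeouts) from hiding inside the overall percentile.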
Monitoring the Golden Signals in Kubernetes with Prometheus

Now let's take a look at the golden signals in action with Kubernetes and Prometheus.

Prometheus is today's go-to monitoring system for Kubernetes clusters, thanks to its easy-to-use features, scalability, and integration with Kubernetes. In the following tutorial, we will use a Kubernetes cluster with Prometheus installed to monitor applications, query metrics, and create graphs.

First, deploy a sample application to the cluster and check its metrics. The source code of the application is available on GitHub, and you can deploy it with the following commands:

$ kubectl apply -f https://raw.githubusercontent.com/brancz/prometheus-example-app/master/manifests/deployment.yaml
$ kubectl apply -f https://raw.githubusercontent.com/brancz/prometheus-example-app/master/manifests/pod-monitor.yaml

The first command deploys the sample application as a Kubernetes deployment, and the second command creates a PodMonitor resource. PodMonitor resources declaratively define how Prometheus, via the Prometheus Operator, should discover and scrape the pods.

Now, connect to the sample application with port-forwarding and, from a second terminal, make some requests:

$ kubectl port-forward deployment/prometheus-example-app 8080:8080
$ curl localhost:8080
Hello from example application.
$ for i in {1..9}; do curl localhost:8080; done
Hello from example application.
...

You can see the metrics the application now exposes:

$ curl localhost:8080/metrics
# HELP http_request_duration_seconds Duration of all HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.005"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.01"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.025"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.05"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.1"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.25"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.5"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="1"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="2.5"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="5"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="10"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="+Inf"} 10
http_request_duration_seconds_sum{code="200",handler="found",method="get"} 0.00027571899999999995
http_request_duration_seconds_count{code="200",handler="found",method="get"} 10
# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 10
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.3.0"} 1
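Before calculating the signals, it is worth confirming in the Prometheus UI that the PodMonitor is working and the example application is actually being scraped. Here are two quick sanity-check queries; the pod label matcher assumes the default target labels added by the Prometheus Operator, so adjust it if your setup relabels targets differently.

# Scrape health of the example application's pods (1 = target is up)
up{pod=~"prometheus-example-app.*"}

# The request counter as seen by Prometheus; it should match the requests you just made
sum by (pod) (http_requests_total{pod=~"prometheus-example-app.*"})

If these queries return no results, check that the PodMonitor's label selector and port name match the deployment before moving on.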
Since you have HTTP server metrics ready, you can now work on the golden signals and calculate them in Prometheus.

Latency

Calculate the average latency using the following Prometheus query:

sum(http_request_duration_seconds_sum) / sum(http_requests_total)

Since the average latency can be misleading, it is better to use percentiles, such as the 90th-percentile latency:

histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[10m])))

Traffic

For this golden signal, collect the rate of HTTP requests handled by the server:

sum(rate(http_request_duration_seconds_count{}[10m]))

Errors

Similar to what you did for traffic, collect the rate of HTTP requests that returned error codes:

sum(rate(http_request_duration_seconds_count{code!="200"}[10m]))

Since there are no errors in the web server requests yet, "Empty query result" is the expected outcome.

Saturation

In order to calculate saturation, you need to use external infrastructure metrics, such as CPU utilization from the node exporter. Note the mode="idle" filter: subtracting the idle percentage from 100 yields the share of CPU actually in use.

100 - (avg by (node) (irate(node_cpu_seconds_total{mode="idle",node="minikube"}[5m])) * 100)

The resulting percentage shows how heavily the node's CPU is utilized; as it approaches 100%, the node and the applications running on it will start to degrade.

Best Practices

The four golden signals are essential for monitoring your Kubernetes applications. They are fairly straightforward to collect and calculate, and they are helpful for finding bottlenecks, troubleshooting, and improving performance. Here are some best practices for using the golden signals to improve the reliability of your applications:

- Enrich the golden signals with additional metrics related to your business requirements. Finding correlations between the golden signals and other metrics makes it easier to troubleshoot and discover the root cause of issues.
- Do not rely only on average values of metrics, as they can be misleading. For example, an average response time of 5 ms may look reasonable, but if the 99th percentile sits at 300 ms, the slowest slice of your users is having a very different experience and you need to analyze the underlying cause. Therefore, use percentiles of metric values together with averages.
- Metrics should lead to actionable alerts, so that you can intervene before your Kubernetes cluster spirals out of control. Focus on creating actionable alerts for essential metrics and make them visible, so you know where to look during debugging and troubleshooting (a sketch of such an alert expression appears at the end of this post).

Conclusion

While the golden signals and best practices we discussed in this post will help you operate your Kubernetes applications, they can fall short once your application stack becomes more complex and challenging to troubleshoot. In that case, you will need a Kubernetes-native troubleshooting solution, such as Komodor. Komodor tracks changes across all of your applications to analyze the ripple effect and provides the context you need to troubleshoot intelligently. To gain overall control and visibility into your Kubernetes applications, sign up for our free trial.
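Bonus: An Example Error-Rate Alert

As a starting point for the "actionable alerts" best practice above, here is a minimal sketch of a PromQL expression you could attach to a Prometheus alerting rule for the errors signal. The 5% threshold and the 10-minute window are illustrative assumptions rather than recommendations; tune them to your own service-level objectives.

# Fire when more than 5% of requests returned a non-200 code over the last 10 minutes
  sum(rate(http_request_duration_seconds_count{code!="200"}[10m]))
/
  sum(rate(http_request_duration_seconds_count[10m]))
> 0.05

Note that the expression returns nothing while there are no failed requests at all, which is fine for alerting: the rule simply stays inactive until errors start to appear.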