Kubernetes is currently the de facto container orchestration system on the market. Companies of all sizes have adopted it, and all major cloud providers offer it as a managed service. However, Kubernetes is a complex, layered platform, so you can't simply jump into it. Every application goes through three essential stages: design, deployment, and operation.
This blog post will focus on operation, where you need to monitor and troubleshoot your deployed applications. Specifically, there are four golden signals you need to consider: latency, traffic, errors, and saturation. We’ll discuss them in more detail, show you how to set them up using Prometheus, and offer some best practices for using them to enhance your application’s reliability.
Let’s start with the basics of monitoring distributed systems. According to the well-known Google SRE book, the four golden signals are the critical metrics you must constantly observe to keep a distributed application highly available.
Latency, also known as response time, is the time it takes to fulfill a request. In other words, it is the time that passes between a client’s request (such as fetching an object from a REST API or a page from a web server) and the server’s response.
There are two critical points to keep in mind for latency: first, distinguish the latency of successful requests from the latency of failed requests, because failed requests often return quickly and can distort your numbers; second, track the latency of errors themselves, since a slow error is even worse than a fast one.
Traffic measures the demand on your application, i.e., the volume of requests it receives. Traffic should be defined in terms that fit the system, such as HTTP requests per second for a web service, concurrent sessions for a streaming service, or transactions per second for a database.
Errors show the rate of requests that fail, whether explicitly through server errors, implicitly by returning the wrong content, or by violating a policy. Errors are critical for revealing bugs in an application or misconfigurations in its services. It’s helpful to watch for errors in every application in your stack so you can pinpoint the faulty component more easily.
Saturation shows how utilized your application is, meaning how “full” the service is with respect to its most constrained resources, such as CPU, memory, I/O, or storage. Saturation is the trickiest of the golden signals, because you need utilization metrics in order to measure it. Critically, applications start to degrade as their utilization approaches 100%.
There are two widely accepted monitoring methods related to the golden signals: the RED method (Rate, Errors, Duration), which focuses on application-level request metrics, and the USE method (Utilization, Saturation, Errors), which focuses on infrastructure resources.
In contrast to the four golden signals, the RED and USE methods each cover only one of these two views: the application or the infrastructure. Therefore, to get end-to-end observability, we suggest focusing on the golden signals for your distributed applications.
Now let’s take a look at the golden signals in action with Kubernetes and Prometheus. Prometheus is today’s go-to monitoring system for Kubernetes clusters, thanks to its ease of use, scalability, and tight integration with Kubernetes. In the following tutorial, we will use a Kubernetes cluster with Prometheus installed to monitor an application, query its metrics, and create graphs.
First, deploy a sample application to the cluster and check its metrics. The source code of the application is available on GitHub, and you can deploy it with the following commands:
$ kubectl apply -f https://raw.githubusercontent.com/brancz/prometheus-example-app/master/manifests/deployment.yaml
$ kubectl apply -f https://raw.githubusercontent.com/brancz/prometheus-example-app/master/manifests/pod-monitor.yaml
The first command deploys the sample application as a Kubernetes Deployment, and the second creates a PodMonitor resource. A PodMonitor declaratively defines how Prometheus should discover and scrape the application’s pods, along the lines of the sketch below.
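For reference, a minimal PodMonitor for this sample app might look like the following. This is a sketch, not the exact manifest from the repository: the label selector and the port name are assumptions and must match the labels and port name defined in the Deployment.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: prometheus-example-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus-example-app   # assumed label; must match the Deployment's pod labels
  podMetricsEndpoints:
  - port: web   # assumed port name; must match the container port serving /metrics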
Now, connect to the sample application and make some requests:
$ kubectl port-forward deployment/prometheus-example-app 8080:8080
$ curl localhost:8080
Hello from example application.
$ for i in {1..9}; do curl localhost:8080; done
Hello from example application.
...
You can now see the metrics the application has collected:
$ curl localhost:8080/metrics
# HELP http_request_duration_seconds Duration of all HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.005"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.01"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.025"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.05"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.1"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.25"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.5"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="1"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="2.5"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="5"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="10"} 10
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="+Inf"} 10
http_request_duration_seconds_sum{code="200",handler="found",method="get"} 0.00027571899999999995
http_request_duration_seconds_count{code="200",handler="found",method="get"} 10
# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 10
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.3.0"} 1
Since you have HTTP server metrics ready, you can now work on the golden signals and calculate them in Prometheus.
Calculate the average latency using the following Prometheus query:
sum(http_request_duration_seconds_sum)/sum(http_requests_total)
Since the average latency can be misleading, it is better to use percentiles, such as the 90th-percentile latency:
histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[10m])))
For the traffic signal, measure the rate of HTTP requests served by the application:
sum(rate(http_request_duration_seconds_count{}[10m]))
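If you want to see how the traffic is distributed, you can break the same rate down by the labels the application exposes, for example by response code and handler (both appear in the metrics output above):

sum by (code, handler) (rate(http_request_duration_seconds_count[10m]))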
Similar to what you did for traffic, you can measure the rate of HTTP requests that return error codes, as follows:
sum(rate(http_request_duration_seconds_count{code!="200"}[10m]))
Since the web server has not returned any errors yet, an “Empty query result” is the expected response.
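In practice, it is often more useful to track errors as a ratio of total traffic rather than as an absolute rate. A sketch of such a query, built from the same metric, would be:

sum(rate(http_request_duration_seconds_count{code!="200"}[10m])) / sum(rate(http_request_duration_seconds_count[10m]))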
To calculate saturation, you need infrastructure metrics rather than application metrics, such as the node-level CPU metrics exposed by the node exporter:
100 - (avg by (node) (irate(node_cpu_seconds_total{mode="idle", node="minikube"}[5m])) * 100)
The result is the node’s CPU utilization as a percentage; in this example, roughly 87% of the node’s CPU is in use.
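CPU is only one dimension of saturation. Assuming the node exporter is running (as in a standard kube-prometheus setup), you can sketch a similar query for memory saturation from its node_memory_* metrics:

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))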
The four golden signals are essential for monitoring your Kubernetes applications. They are straightforward to collect and calculate, and they help you find bottlenecks, troubleshoot issues, and improve performance.
Here are some best practices for using the golden signals to improve the reliability of your applications: alert on percentiles rather than averages, since averages can hide slow outliers; instrument every service in your stack so you can quickly locate the faulty component; treat saturation as a leading indicator and act well before utilization approaches 100%; and combine the application-level (RED) and infrastructure-level (USE) views rather than relying on only one of them. An alerting rule along these lines is sketched below.
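As one concrete example of alerting on a golden signal, here is a minimal sketch of a Prometheus alerting rule that fires when the 90th-percentile latency from the query above stays over 500 ms for five minutes. The alert name, threshold, and duration are illustrative assumptions, not values from this post; with the Prometheus Operator you would wrap this rule group in a PrometheusRule resource.

groups:
- name: golden-signals
  rules:
  - alert: HighRequestLatency   # hypothetical alert name
    expr: histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[10m]))) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "90th-percentile request latency has been above 500ms for 5 minutes"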
While the golden signals and best practices we discussed in this post will help you operate your Kubernetes applications, they can fall short once your application stack becomes more complex and harder to troubleshoot. In that case, you will need a Kubernetes-native troubleshooting solution, such as Komodor.
Komodor tracks changes across all of your applications to analyze the ripple effect and provides the context you need to troubleshoot intelligently. To gain overall control and visibility into your Kubernetes applications, sign up for our free trial.