EKS Monitoring: Tools, Metrics, and Best Practices

Guy Menachem

6 min read January 21st, 2024

What Is AWS EKS Monitoring?

Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service provided by Amazon Web Services (AWS). It helps manage your containerized applications in the cloud as well as on-premises.

EKS monitoring involves observing and tracking the performance and health of your EKS clusters. Effective monitoring is crucial for identifying problems before they escalate into major issues, ensuring high availability and optimal performance of your Kubernetes workloads, and understanding how to improve performance and utilization of your EKS deployments.

This is part of a series of articles about Kubernetes monitoring.

Observability in Amazon EKS

Observability is the ability to understand the state of a system by observing its external outputs. In the context of Amazon EKS, observability involves understanding the state of your EKS clusters by observing output such as logs, metrics, and traces. By ensuring EKS clusters generate the right signals, you can identify issues faster, troubleshoot efficiently, and optimize your clusters for better performance.

Observability in Amazon EKS also involves understanding the dependencies and interactions between workloads within your EKS clusters. This is crucial in a microservices architecture where multiple services are interacting with each other. By understanding these interactions, you can identify bottlenecks and optimize workloads for better performance and reliability.

Achieving observability in EKS requires addressing several layers: the EKS control plane, EKS worker nodes, and the workloads and applications running within your Kubernetes clusters.


Itiel Shwartz

Co-Founder & CTO

Itiel is the CTO and co-founder of Komodor. He’s a big believer in dev empowerment and moving fast, and has worked at eBay, Forter, and Rookout (as the founding engineer). Itiel is a backend and infra developer turned “DevOps” and an avid public speaker who loves talking about cloud infrastructure, Kubernetes, Python, observability, and R&D culture.

In my experience, here are tips that can help you better monitor EKS:

Integrate with Prometheus and Grafana

Use Prometheus for real-time metrics collection and Grafana for powerful visualization and alerting. This combination can provide more granular insights compared to native AWS tools.

Leverage Kube-state-metrics

Deploy Kube-state-metrics to expose Kubernetes cluster-level metrics such as pod counts, deployments, and resource limits. These metrics complement node and system metrics for a complete picture.

Implement Fluent Bit for Log Aggregation

Use Fluent Bit to aggregate logs from EKS clusters. It’s lightweight and can be configured to send logs to various destinations, enhancing log management and analysis.

Use Custom Metrics for Auto-scaling

Configure Horizontal Pod Autoscalers (HPAs) to use custom metrics instead of just CPU and memory. This can help scale applications more intelligently based on application-specific performance indicators.
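To make the scaling behavior concrete, here is a minimal sketch of the replica calculation that the Kubernetes documentation describes for the HPA: the desired count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the replica bounds, and the example metric (queued messages per pod) are assumptions for illustration; the 0.1 tolerance mirrors the documented default, and the real controller applies further rules (stabilization windows, multiple metrics, pods that are not ready) that are left out here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA scaling rule for a single metric.

    current_value and target_value are per-pod averages of a custom metric,
    for example queued messages per pod (a hypothetical metric).
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")

    ratio = current_value / target_value

    # When the metric is close enough to the target, leave the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # desired = ceil(current * observed / target), clamped to the HPA bounds.
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 4 replicas, a target of 100 queued messages per pod, and an observed average of 150, the rule asks for 6 replicas. An observed average of 105 falls inside the tolerance band and leaves the count at 4.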

Enable VPC Flow Logs

Turn on VPC Flow Logs to capture information about the IP traffic going to and from network interfaces in your VPC. This can help in diagnosing network issues and understanding traffic patterns.
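As a rough illustration of what you can do with these records, the sketch below parses flow log lines in the default (version 2) format and picks out rejected traffic, which is often the first thing to check when pods cannot reach a dependency. The field order follows the default format; the function names and the sample records are made up for this example, and custom log formats would need a different field list.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
NUMERIC_FIELDS = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")

# Illustrative records only; the account ID, ENI, and addresses are made up.
SAMPLE_LINES = [
    "2 123456789010 eni-0123456789abcdef0 10.0.1.15 10.0.2.20 44321 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-0123456789abcdef0 10.0.9.69 10.0.2.20 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]


def parse_flow_record(line):
    """Turn one space-separated flow log line into a dict."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for key in NUMERIC_FIELDS:
        # NODATA and SKIPDATA records use "-" for fields with no value.
        record[key] = None if record[key] == "-" else int(record[key])
    return record


def rejected_flows(lines):
    """Return the parsed records whose action is REJECT."""
    records = (parse_flow_record(line) for line in lines)
    return [record for record in records if record["action"] == "REJECT"]
```

Calling `rejected_flows(SAMPLE_LINES)` returns only the second record, the blocked connection to port 3389.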

Amazon EKS Monitoring Tools

Amazon provides several tools that can help you monitor EKS clusters and achieve observability.

CloudWatch Container Insights

CloudWatch Container Insights is a fully managed observability service that collects, aggregates, and summarizes metrics and logs from your containers. With Container Insights, you can monitor, troubleshoot, and set alarms for your Amazon EKS clusters.

Container Insights provides you with a detailed view of your EKS cluster’s performance, including CPU and memory utilization, network traffic, and disk I/O. It also provides insights into your cluster’s health, helping you identify issues before they affect your applications.
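Once Container Insights is enabled, its metrics can also be read programmatically. The sketch below builds a CloudWatch `get_metric_data` query for average node CPU utilization, assuming the metrics are published in the `ContainerInsights` namespace under the `node_cpu_utilization` metric with a `ClusterName` dimension. The function names and the cluster name you pass in are placeholders, and `fetch_node_cpu` needs the `boto3` package and valid AWS credentials to run.

```python
from datetime import datetime, timedelta, timezone


def node_cpu_query(cluster_name, period_seconds=300):
    """Build the MetricDataQueries payload for average node CPU utilization."""
    return [{
        "Id": "node_cpu",  # query IDs must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "ContainerInsights",
                "MetricName": "node_cpu_utilization",
                "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
            },
            "Period": period_seconds,
            "Stat": "Average",
        },
        "ReturnData": True,
    }]


def fetch_node_cpu(cluster_name, hours=1):
    """Return (timestamp, value) pairs for the last `hours` hours."""
    # Imported lazily so the query builder above works without the AWS SDK.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=node_cpu_query(cluster_name),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    result = response["MetricDataResults"][0]
    return list(zip(result["Timestamps"], result["Values"]))
```

The same query shape can be reused for other Container Insights metrics by changing the metric name and dimensions.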

AWS Distro for OpenTelemetry (ADOT)

AWS Distro for OpenTelemetry (ADOT) is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. With ADOT, you can collect, correlate, and export telemetry data (metrics, traces, and logs) from your applications and infrastructure, providing a 360-degree view of EKS cluster performance.

ADOT supports a wide range of AWS services and open source tools, allowing you to collect telemetry data from multiple sources and export it to various AWS monitoring tools such as CloudWatch, X-Ray, and more.

Amazon DevOps Guru

Amazon DevOps Guru is a fully managed operations service that uses machine learning to analyze your operational data and provide you with actionable insights. It identifies potential issues and their probable causes, allowing you to proactively address them before they impact your applications.

With DevOps Guru, you can set up anomaly detection for your EKS clusters, and receive alerts when abnormal behavior is detected. It also provides you with recommendations on how to address the detected issues, helping you reduce downtime and improve your application’s performance.

AWS X-Ray

AWS X-Ray is a distributed tracing service that helps you understand how your applications and services are performing and where bottlenecks are occurring. It provides you with an end-to-end view of requests as they travel through your EKS cluster, allowing you to trace their path and understand their impact on your users.

X-Ray’s service maps let you visualize your application’s architecture, showing how services are interconnected and where performance bottlenecks are occurring. This helps you identify issues faster, troubleshoot more efficiently, and optimize your clusters for better performance.

Amazon CloudWatch Observability Operator

The Amazon CloudWatch Observability Operator for Kubernetes (CW Operator), which is also available as an EKS add-on, makes it easy to set up CloudWatch monitoring for your EKS cluster. It installs and manages the CloudWatch agent and Fluent Bit on your cluster, so that Container Insights metrics and container logs reach CloudWatch without manual agent configuration.

With the CW Operator, you can automate the process of setting up CloudWatch data collection, saving you time and reducing the risk of errors. It also makes it easier to manage your monitoring setup, as you can use the same Kubernetes tools and workflows you are already familiar with.

Learn more in our detailed guide to Kubernetes observability

Metrics to Monitor in AWS EKS

Kubernetes Control Plane Metrics

Kubernetes control plane metrics are an essential part of EKS monitoring. They provide insights about the health and performance of the Kubernetes control plane components:

API server: This is the central part of the Kubernetes control plane. Monitoring API server metrics such as request count, request latency, and error rates can help identify performance issues or potential failures.
Kubernetes scheduler: Responsible for scheduling pods to the most appropriate cluster nodes. Key metrics include scheduling latency, error rates, and the number of unschedulable pods.
Controller manager: Responsible for reconciling the desired state of the cluster declared by the Kubernetes API with the actual state. Metrics like work queue depth and reconcile latency can help identify bottlenecks in the control loop.
etcd database: This is the reliable distributed data store which serves as the backbone of Kubernetes. Monitoring etcd database metrics like read/write latency, proposal commit latency, and leader changes can help detect issues that could affect cluster availability.
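As one way to turn the API server metrics above into a number you can alert on, the sketch below computes the share of requests that ended in a 5xx status from the Prometheus text output of the API server (for example, the output of `kubectl get --raw /metrics`). It relies on the `apiserver_request_total` counter and its `code` label; the function name and the sample text are illustrative. Because counters are cumulative, this gives the ratio since the API server started, whereas a production alert would normally use a rate over a time window in Prometheus.

```python
import re

# Matches samples such as: apiserver_request_total{code="200",verb="GET"} 42
SAMPLE_RE = re.compile(r'^apiserver_request_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Illustrative metrics text; the values are made up.
SAMPLE_METRICS = """\
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",resource="pods",verb="GET"} 900
apiserver_request_total{code="201",resource="pods",verb="POST"} 50
apiserver_request_total{code="500",resource="pods",verb="GET"} 30
apiserver_request_total{code="503",resource="nodes",verb="LIST"} 20
"""


def server_error_ratio(metrics_text):
    """Fraction of API server requests that returned a 5xx status code."""
    total = 0.0
    errors = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comment line or a different metric
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = float(match.group("value"))
        total += value
        if labels.get("code", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0
```

With the sample text, 50 of 1,000 requests are server errors. The simple regular expression assumes label values do not contain a closing brace, which holds for the labels shown here.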

Node and System Metrics for Kubernetes

Node and system metrics are another critical aspect of EKS monitoring. They provide insights about the health and performance of the worker nodes in the Kubernetes cluster.

CPU usage, memory usage, disk I/O, and network I/O are some of the key metrics to monitor at the node level. These metrics can help identify resource contention issues, potential bottlenecks, and any anomalies that might affect the performance of the applications running on the nodes.
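Node CPU usage is usually exposed as cumulative counters of seconds spent in each CPU mode (node_exporter, for instance, publishes `node_cpu_seconds_total` per mode). A minimal sketch of deriving utilization from two samples of such counters follows; the function name and the dictionary layout are assumptions for illustration.

```python
def cpu_utilization(before, after):
    """Fraction of CPU time spent outside the idle mode between two samples.

    `before` and `after` map a CPU mode ("idle", "user", "system", ...) to
    cumulative seconds, summed over all cores of the node.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed, or the counters were reset
    return 1.0 - deltas.get("idle", 0.0) / total
```

For example, if idle time grows by 60 seconds while user and system time grow by 30 and 10 seconds, the node was about 40% utilized over that interval.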

In addition to the node-level metrics, system metrics like system load, system uptime, and system error rates are also important. These metrics can provide early warning signs of potential system failures.

Application Metrics

Application metrics are crucial for understanding the behavior and performance of your applications running on EKS. They can help identify application-specific issues which might not be visible at the Kubernetes or node level.

Key application metrics include request count, error rates, response time, and throughput. These metrics can help identify performance bottlenecks, potential failures, and any anomalies in the application behavior.
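To show how these metrics relate to raw request data, here is a small sketch that summarizes one window of requests into an error rate, throughput, and 95th-percentile response time. The input shape (a list of status code and latency pairs) and the function name are assumptions; in practice an instrumentation library or your APM would usually compute these for you.

```python
def summarize_requests(requests, window_seconds):
    """Summarize (status_code, latency_ms) pairs observed in one time window."""
    count = len(requests)
    if count == 0:
        return {"count": 0, "error_rate": 0.0,
                "throughput_rps": 0.0, "p95_latency_ms": None}

    # Count server-side failures only; 4xx responses are treated as client errors.
    errors = sum(1 for status, _ in requests if status >= 500)
    latencies = sorted(latency for _, latency in requests)

    # Nearest-rank percentile: rank = ceil(0.95 * count), in integer arithmetic.
    rank = (95 * count + 99) // 100

    return {
        "count": count,
        "error_rate": errors / count,
        "throughput_rps": count / window_seconds,
        "p95_latency_ms": latencies[rank - 1],
    }
```

Tracking the 95th percentile alongside the error rate tends to surface slow outliers that an average response time would hide.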

Metrics for Amazon EKS on Fargate

If you’re running your EKS pods on AWS Fargate, there are no worker nodes for you to manage, so the metrics you monitor shift from the node level to the pod level. These metrics can provide insights about the performance and cost-efficiency of your Fargate deployments.

Key metrics for EKS on Fargate include CPU usage, memory usage, network I/O, and storage I/O. These metrics can help identify resource contention issues, potential bottlenecks, and any anomalies that might affect the performance and cost-efficiency of your Fargate deployments.

Best Practices for EKS Monitoring

Monitor Both Cluster and Application Health

Monitoring the health of your EKS clusters and applications is crucial for maintaining their performance and availability. This involves monitoring key health indicators at all levels—Kubernetes control plane, node, and application.

In addition to the key metrics mentioned earlier, it’s also important to monitor the status of the Kubernetes objects like pods, services, and deployments. This can help identify any issues with the Kubernetes objects which might affect the health of your clusters and applications.
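A simple way to check object status is to scan the pod list for pods that are not running, not ready, or restarting repeatedly. The sketch below works on a dictionary shaped like the output of `kubectl get pods -A -o json`; the function name and the restart threshold are arbitrary choices for illustration.

```python
def unhealthy_pods(pod_list, max_restarts=3):
    """Return (namespace/name, reason) pairs for pods that need attention."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        name = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
        phase = status.get("phase", "Unknown")

        if phase == "Succeeded":
            continue  # completed pods, such as finished Jobs, are fine
        if phase != "Running":
            problems.append((name, f"phase={phase}"))
            continue

        for container in status.get("containerStatuses", []):
            restarts = container.get("restartCount", 0)
            if not container.get("ready", False):
                problems.append(
                    (name, f'container {container.get("name")} not ready (restarts={restarts})'))
            elif restarts > max_restarts:
                problems.append(
                    (name, f'container {container.get("name")} restarted {restarts} times'))
    return problems
```

The same idea extends to deployments (comparing desired and available replicas) and to services without ready endpoints.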

Set Appropriate Log Retention

Log retention is another important aspect of EKS monitoring. By ensuring you have sufficient log retention, you can better troubleshoot issues, analyze trends, and maintain the security of your EKS clusters.

It’s important to configure log retention policies that meet your operational and compliance needs, but do not take up excessive storage space. These policies should define how long the logs should be retained, when they should be archived or deleted, and who should have access to them.
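If your EKS logs land in CloudWatch Logs, retention is set per log group, and only a fixed set of day counts is accepted. The sketch below picks the shortest accepted retention that still covers a required number of days and applies it with the CloudWatch Logs `put_retention_policy` call. The list of accepted values reflects the API documentation at the time of writing and should be checked against the current reference; the function names are illustrative, and `apply_retention` needs `boto3` and AWS credentials.

```python
# Retention periods (in days) accepted by CloudWatch Logs at the time of
# writing; verify against the current API reference before relying on them.
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
)


def pick_retention_days(required_days):
    """Smallest accepted retention period that covers `required_days`."""
    if required_days < 1:
        raise ValueError("required_days must be at least 1")
    for days in ALLOWED_RETENTION_DAYS:
        if days >= required_days:
            return days
    raise ValueError("requirement exceeds the longest fixed retention period")


def apply_retention(log_group_name, required_days):
    """Set the retention policy on one CloudWatch Logs log group."""
    # Imported lazily so pick_retention_days works without the AWS SDK.
    import boto3

    logs = boto3.client("logs")
    days = pick_retention_days(required_days)
    logs.put_retention_policy(logGroupName=log_group_name, retentionInDays=days)
    return days
```

For instance, a 45-day compliance requirement maps to a 60-day retention setting. EKS control plane logs are typically written to a log group named `/aws/eks/<cluster-name>/cluster`, which is the group you would pass to `apply_retention`.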

Monitor Kubernetes with eBPF

eBPF (extended Berkeley Packet Filter) is a powerful technology that can enhance EKS monitoring. It allows for deep visibility into the Linux kernel with low overhead, without changing kernel source code or loading kernel modules.

Monitoring Kubernetes with eBPF can provide insights about the network communications, file I/O, and system calls at the kernel level. This can help detect issues which might not be visible at the Kubernetes, node, or application level.

Kubernetes Monitoring with Komodor

Komodor’s platform streamlines the day-to-day operations and troubleshooting process of your Kubernetes apps. Regardless of which managed Kubernetes provider you may be using (and you may be using multiple!), Komodor acts as your single pane of glass for monitoring your Kubernetes workloads. It provides enhanced visibility into your clusters and integrates with popular monitoring tools like Datadog, Prometheus, or any of the EKS-specific tools mentioned above, for clear metric and event visualization. Additionally, it features static monitors that enforce best practices and prevent misconfigurations, and historical data retention that lets you see a complete timeline of events leading up to the current state.

Moreover, Komodor’s Workspace view feature reduces the cognitive load on K8s non-experts by filtering out irrelevant data, ensuring that they stay informed about their app’s performance data and can take swift action when issues arise. By mitigating the overwhelming flow of data that emerges from various dashboards and APMs, Komodor helps end users own their apps end to end and operate them independently.

To learn more about how Komodor can make it easier to empower you and your teams to troubleshoot K8s, sign up for our free trial.

