Komodor is a Kubernetes management platform that empowers everyone from platform engineers to developers to stop firefighting, simplify operations, and proactively improve the health of their workloads and infrastructure.
Hi, I’m Udi, and I’m going to show you Komodor’s cluster health management capabilities.
As you know, Kubernetes isn’t just one thing—it’s a whole ecosystem of distributed APIs, diverse workloads, jobs, operators, controllers, CRDs, add-ons, integrations—you name it. You can think of Kubernetes like a human body. When you’re sick, a bad doctor would only address the symptoms they see, maybe give you a pill that causes side effects. But a good doctor would go deeper—look into your diet, lifestyle, and medical history—and suggest a remediation that treats the root cause, not just the fever or whatever you’re feeling in the moment.
That’s why, when you’re thinking about your cluster’s health, you can’t just look at a plain aggregation of metrics. You need to think about the system holistically and correlate different, seemingly unrelated signals in an intelligent way that enables you to draw actionable insights. That way, you not only fix issues, but also address their root causes and prevent them from reoccurring in the future.
Komodor is like a good Kubernetes doctor—and I’ll show you why.
Let’s start with the cluster overview screen, where we can see all of our different clusters in a single place. For each cluster, we provide metadata and a health score, which is composed of the workload health and the infrastructure health. Each is divided into real-time issues, meaning issues that are affecting you right now, and reliability risks, which are potential issues that, if not addressed, may become bigger problems down the line.
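Komodor’s actual scoring model is internal; the weights and penalties below are illustrative assumptions. But a composite health score of this shape could be sketched as:

```python
# Hypothetical sketch: combine workload and infrastructure signals
# into a single cluster health score. The weights and penalty values
# are illustrative assumptions, not Komodor's actual model.

def domain_score(realtime_issues: int, reliability_risks: int) -> float:
    """Score one domain (workload or infrastructure) from 0-100.

    Real-time issues are penalized more heavily than reliability
    risks, since they are affecting the cluster right now.
    """
    score = 100.0 - 15.0 * realtime_issues - 5.0 * reliability_risks
    return max(score, 0.0)

def cluster_health(workload: tuple, infra: tuple) -> float:
    """Average the two domain scores into one cluster health score."""
    return (domain_score(*workload) + domain_score(*infra)) / 2

# A cluster with 7 workload issues scores poorly even with healthy infra.
print(cluster_health(workload=(7, 2), infra=(0, 0)))  # 50.0
```

The point of splitting the score by domain is that a cluster can look fine at the node level while its workloads are on fire, and vice versa.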
Let’s look into this AWS production cluster and see why it’s not doing so well. As mentioned, you can see the workload health and infrastructure health, broken down into real-time issues and reliability risks. The most pressing issues are flagged at the top, so I know what demands my immediate attention.
Looking at the workload health, we can see seven issues. Clicking on any one of them will take us to the affected service and reveal a full timeline of events—everything that happened and changed within this specific service appears on the timeline. Just by looking at it, I can already see that this availability issue started happening directly after a deployment. So I already have a clue as to what the root cause may be.
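The correlation hinted at here, noticing that an availability issue began right after a deployment, can be sketched as a timeline lookup. The event shapes below are assumptions for illustration, not Komodor’s API:

```python
# Hypothetical sketch: given a service's event timeline, find the
# deploy that immediately preceded an availability issue. The event
# dicts are illustrative, not Komodor's data model.
from datetime import datetime

def deploy_before_issue(events, issue_time, window_minutes=30):
    """Return the most recent deploy within `window_minutes` before
    the issue started, or None if no deploy is a plausible cause."""
    candidates = [
        e for e in events
        if e["kind"] == "deploy"
        and e["time"] <= issue_time
        and (issue_time - e["time"]).total_seconds() <= window_minutes * 60
    ]
    return max(candidates, key=lambda e: e["time"], default=None)

timeline = [
    {"kind": "deploy", "time": datetime(2024, 5, 1, 10, 0), "version": "v42"},
    {"kind": "config_change", "time": datetime(2024, 5, 1, 10, 5)},
]
issue_started = datetime(2024, 5, 1, 10, 12)
suspect = deploy_before_issue(timeline, issue_started)
print(suspect["version"])  # the deploy 12 minutes earlier is the prime suspect
```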
But I don’t have to think that hard because Klaudia, Komodor’s AI agent, already did the investigation for me behind the scenes and provided a full breakdown of the event, related evidence—so I don’t just have to take Klaudia’s word for it—and a suggested remediation, which I can take directly from the Komodor platform. In this case, I can either roll back the problematic deployment or edit the config map to set the correct API rate limit.
Now, going back to the workload health section, let’s have a look at some availability risks. We have two here, one of which is flagged as high severity. Let’s start with that one.
Here, we see that a Kafka service is failing with exit code 137, meaning its process was killed for running out of memory (OOMKilled). But Komodor also detected that three seemingly unrelated services are experiencing issues right now. Komodor does this intelligent correlation for you and tells you that these three services, and the issues they’re experiencing, are actually related: they all stem from the failing Kafka service. Komodor groups them as a single cascading issue and shows you where it’s stemming from.
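An out-of-memory kill shows up as exit code 137 because container exit codes above 128 encode the fatal signal: 128 + 9 (SIGKILL, which the OOM killer sends). A quick check:

```python
# Container exit codes above 128 mean the process was killed by a
# signal: exit_code = 128 + signal_number. 137 = 128 + SIGKILL (9),
# the signature of an OOM kill; 143 = 128 + SIGTERM (15).
import signal

def explain_exit_code(code: int) -> str:
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited normally with status {code}"

print(explain_exit_code(137))  # killed by SIGKILL (signal 9)
print(explain_exit_code(143))  # killed by SIGTERM (signal 15)
```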
As before, Klaudia provides a full sequence of events, logs to support the conclusion, and a suggested remediation, which you can again take directly from the Komodor platform.
Let’s go back to the cluster overview and check our infrastructure health. Here, we can see three issues at the node level and two reliability risks. Let’s look at the more severe one.
Here we see an example of a “noisy neighbor”—a misconfigured workload that’s hogging all the resources from other pods on the node, causing them to be evicted. The symptom is node pressure or pods getting evicted. But Komodor saves you the second-guessing and tells you explicitly: this is the culprit, these are the victims, and this is what you need to do to fix it. In this case, we can either set correct resource requests and limits for this pod, or open a Jira ticket for someone else to take action.
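A noisy neighbor is usually a container with missing or unbounded memory requests and limits. A minimal check for that, over a dict shaped like a Kubernetes pod spec (this is a sketch, not Komodor’s detector), might look like:

```python
# Hypothetical sketch: flag containers in a pod spec that lack memory
# requests or limits -- the usual root cause of a "noisy neighbor"
# evicting its peers. The dict mirrors the Kubernetes pod spec shape.

def unbounded_containers(pod_spec: dict) -> list:
    """Return names of containers missing memory requests or limits."""
    flagged = []
    for c in pod_spec.get("containers", []):
        resources = c.get("resources", {})
        has_request = "memory" in resources.get("requests", {})
        has_limit = "memory" in resources.get("limits", {})
        if not (has_request and has_limit):
            flagged.append(c["name"])
    return flagged

pod = {
    "containers": [
        {"name": "api", "resources": {
            "requests": {"memory": "256Mi"}, "limits": {"memory": "512Mi"}}},
        {"name": "worker", "resources": {}},  # no bounds: candidate culprit
    ]
}
print(unbounded_containers(pod))  # ['worker']
```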
To sum things up, in order to maintain your cluster’s health and ensure continuous reliability across a large number of clusters, you need to be everywhere, all the time, all at once. And since you can’t—Komodor is here to help you extend your knowledge, extend your capabilities across all clusters, and ensure that you’re always on top of what matters most.