Product Klip: Cluster Health Management

The following is an AI-generated transcript:

Hi, I’m Udi, and I’m going to show you Komodor’s cluster health management capabilities.

As you know, Kubernetes isn’t just one thing—it’s a whole ecosystem of distributed APIs, diverse workloads, jobs, operators, controllers, CRDs, add-ons, integrations—you name it. You can think of Kubernetes like a human body. When you’re sick, a bad doctor would only address the symptoms they see, maybe give you a pill that causes side effects. But a good doctor would go deeper—look into your diet, lifestyle, and medical history—and suggest a remediation that treats the root cause, not just the fever or whatever you’re feeling in the moment.

That’s why, when you’re thinking about your cluster’s health, you can’t just look at a plain aggregation of metrics. You need to think about the system holistically and correlate different, seemingly unrelated signals in an intelligent way that lets you draw actionable insights. That way, you not only fix issues, but also address their root causes and prevent them from recurring.

Komodor is like a good Kubernetes doctor—and I’ll show you why.

Let’s start with the cluster overview screen, where we can see all of our different clusters in a single place. For each cluster, we provide metadata and a health score, which combines workload health and infrastructure health. Each is divided into real-time issues, meaning issues that are affecting you right now, and reliability risks: potential issues that, if not addressed, may become bigger problems down the line.

Let’s look into this AWS production cluster and see why it’s not doing so well. As mentioned, you can see the workload health and infrastructure health, broken down into real-time issues and reliability risks. The most pressing issues are flagged at the top, so I know what demands my immediate attention.

Looking at the workload health, we can see seven issues. Clicking on any one of them will take us to the affected service and reveal a full timeline of events—everything that happened and changed within this specific service appears on the timeline. Just by looking at it, I can already see that this availability issue started happening directly after a deployment. So I already have a clue as to what the root cause may be.
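If you wanted to pull the raw material behind a timeline like this yourself, the Kubernetes event stream is the natural place to start. Here’s a minimal sketch using the official Python client; the namespace and deployment name are hypothetical, and Komodor’s timeline layers deploys, config changes, and correlation on top of raw events like these:

```python
# Sketch: list recent Kubernetes events for one workload and order them by time.
# The namespace and deployment name below are made up for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    namespace="payments",
    field_selector="involvedObject.kind=Deployment,involvedObject.name=checkout-api",
)

# Sort by the most recent timestamp available on each event.
for e in sorted(
    events.items,
    key=lambda e: e.last_timestamp or e.first_timestamp or e.metadata.creation_timestamp,
):
    print(e.last_timestamp, e.type, e.reason, "-", e.message)
```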

But I don’t have to think that hard because Klaudia, Komodor’s AI agent, already did the investigation for me behind the scenes and provided a full breakdown of the event, related evidence—so I don’t just have to take Klaudia’s word for it—and a suggested remediation, which I can take directly from the Komodor platform. In this case, I can either roll back the problematic deployment or edit the config map to set the correct API rate limit.
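To give a rough idea of what that second remediation amounts to outside the UI, here’s a sketch with the Kubernetes Python client that patches a rate-limit key in a ConfigMap. The ConfigMap name, namespace, key, and value are all hypothetical; in the demo the same fix (or the rollback) is applied straight from the platform.

```python
# Sketch: patch a single key in a ConfigMap, e.g. to correct an API rate limit.
# ConfigMap name, namespace, key, and value are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

v1.patch_namespaced_config_map(
    name="checkout-api-config",
    namespace="payments",
    body={"data": {"API_RATE_LIMIT": "100"}},  # strategic-merge patch of one key
)
```

Keep in mind that pods consuming the ConfigMap as environment variables only pick up a change like this after a restart, while mounted ConfigMap files refresh on their own.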

Now, going back to the workload health section, let’s have a look at some availability risks. We have two here, one of which is flagged as high severity. Let’s start with that one.

Here, we see that a Kafka service is failing with exit code 137, meaning it was killed for running out of memory. But Komodor also detected that three seemingly unrelated services are experiencing issues right now. Komodor does this intelligent correlation for you and tells you that these three services, and the issues they’re experiencing, all stem from the failing Kafka service. Komodor groups them into a single cascading issue and shows you exactly where it originates.
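For reference, that exit code is surfaced on the container status itself, so a bare-bones check, without any of the cross-service correlation Komodor adds, might look like this. The namespace is hypothetical:

```python
# Sketch: find containers whose last termination was an OOM kill (exit code 137).
# This only spots the per-pod symptom; it does not group downstream services
# into a cascading issue the way the platform does.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="streaming").items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated if cs.last_state else None
        if term and (term.exit_code == 137 or term.reason == "OOMKilled"):
            print(f"{pod.metadata.name}/{cs.name}: exit {term.exit_code} ({term.reason})")
```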

As before, Klaudia provides a full sequence of events, logs to support the conclusion, and a suggested remediation, which you can again take directly from the Komodor platform.

Let’s go back to the cluster overview and check our infrastructure health. Here, we can see three issues at the node level and two reliability risks. Let’s look at the more severe one.
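The node-level signals behind this view come from node conditions, which you can also read directly from the API. Here’s a minimal sketch with the Python client; it only surfaces the raw conditions, without the scoring and prioritization shown on the screen:

```python
# Sketch: surface node-level pressure signals straight from node conditions.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        unhealthy = (cond.type == "Ready" and cond.status != "True") or (
            cond.type in ("MemoryPressure", "DiskPressure", "PIDPressure")
            and cond.status == "True"
        )
        if unhealthy:
            print(f"{node.metadata.name}: {cond.type}={cond.status} ({cond.reason})")
```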

Here we see an example of a “noisy neighbor”: a misconfigured workload that’s hogging resources on the node and starving the other pods, causing them to be evicted. The symptoms you’d normally chase are node pressure and evicted pods, but Komodor saves you the second-guessing and tells you explicitly: this is the culprit, these are the victims, and this is what you need to do to fix it. In this case, we can either set correct resource requests and limits for this pod, or open a Jira ticket for someone else to take action.
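To make the first option concrete, here’s what setting requests and limits on the offending workload might look like with the Python client. The deployment name, container name, and resource values are hypothetical and would need to be sized from the workload’s actual usage:

```python
# Sketch: give a "noisy neighbor" explicit requests and limits so it can no
# longer starve the other pods on the node. Names and values are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment(
    name="batch-worker",
    namespace="data",
    body={
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "worker",  # matched by name in the strategic-merge patch
                            "resources": {
                                "requests": {"cpu": "250m", "memory": "512Mi"},
                                "limits": {"cpu": "1", "memory": "1Gi"},
                            },
                        }
                    ]
                }
            }
        }
    },
)
```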

To sum things up, in order to maintain your cluster’s health and ensure continuous reliability across a large number of clusters, you need to be everywhere, all the time, all at once. And since you can’t—Komodor is here to help you extend your knowledge, extend your capabilities across all clusters, and ensure that you’re always on top of what matters most.