Speaker deck available for download.
Udi: Hi everyone, and welcome to Kubernetes Health Management with Komodor. Today, we’re going to break down the concept of Kubernetes health—what it actually means to manage it—and then we’ll show you how we do it at Komodor. And there’s no better person to walk us through this topic than Danielle Inbar. So, welcome, Danielle!
Danielle: Thanks, happy to be here!
Udi: Danielle is the Director of Product at Komodor. She’s worked extensively in cloud-native environments, previously at Snyk, where she focused on container security and open-source products. She also worked at Spot.io, which was later acquired by NetApp, as a product manager specializing in cloud cost optimization, specifically for Kubernetes.
Danielle: That’s right. And way back, I was a software engineer at Motorola and started my career in QA at a company called Vingh. Outside of work, I’m a mom to two wonderful kids—Arel and Yuval—and our office dog, Charlie, who’s basically a celebrity at Komodor.
Udi: That’s awesome. Charlie is definitely the office star! So, let’s jump right into it. Kubernetes observability is a hot topic right now. Where do things stand in the industry, and why do people say that traditional monitoring solutions are broken when it comes to Kubernetes?
Danielle: Yeah, great question. So, we all know Kubernetes is complex. And it’s not just us saying this—Tim Hockin, co-founder of Kubernetes, has also said that its complexity is only increasing.
The problem is that organizations are struggling to move forward. They want to release features faster, but they’re spending more and more time managing Kubernetes itself. The complexity becomes a tax on innovation, velocity, and scale. If engineers are constantly firefighting Kubernetes issues, they’re not building new features or improving their products.
Udi: Right, it’s a double-edged sword. Kubernetes is powerful and flexible, but that power comes with a cost. You get all these capabilities, but you also have to pay the price of maintaining and troubleshooting it. Some companies even have entire teams dedicated just to managing Kubernetes.
Danielle: Exactly. And traditionally, organizations thought about infrastructure monitoring in two layers: the application layer and the infrastructure layer.
But Kubernetes changes everything because it doesn’t fit neatly into either layer. It sits in between. It has one foot in the application layer (with workloads like pods, deployments, and jobs) and another in the infrastructure layer. And it introduces new layers—like configuration, networking, storage, add-ons, operators, and CRDs—that make troubleshooting much harder.
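A quick way to get a feel for how many of these extra layers a single cluster carries is simply to count them. The sketch below does that with the official Kubernetes Python client; it assumes the client is installed and a local kubeconfig grants read access, and is only an illustration of the point, not part of the webinar demo.

```python
# Illustrative only: counting a few of the "extra layers" mentioned above,
# using the official Kubernetes Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig with read access

crds = client.ApiextensionsV1Api().list_custom_resource_definition().items
storage_classes = client.StorageV1Api().list_storage_class().items
network_policies = client.NetworkingV1Api().list_network_policy_for_all_namespaces().items

print("CRDs:", len(crds))
print("Storage classes:", len(storage_classes))
print("Network policies:", len(network_policies))
```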
Udi: And this problem only gets worse as organizations scale. If you’re running just one cluster, maybe you can manage it. But when you scale to dozens or even hundreds of clusters—especially in a hybrid or multi-cloud environment—troubleshooting becomes exponentially harder.
Danielle: Absolutely. That’s why so many companies struggle to maintain Kubernetes health. Once you scale up, small misconfigurations or issues that seem minor at first can cascade into major incidents. You start seeing pods restarting unexpectedly, jobs failing, nodes under high pressure—all because something upstream wasn’t configured properly.
And this is why the traditional approach to monitoring Kubernetes doesn’t work. Engineers spend hours correlating logs, metrics, and configurations across multiple layers, trying to piece together what went wrong. It’s overwhelming.
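To make those symptoms concrete, here is a minimal sketch, again using the official Kubernetes Python client, of the kind of manual sweep Danielle describes: non-running pods, failed jobs, and nodes under resource pressure. Cluster access via a local kubeconfig is assumed.

```python
# Illustrative only: a manual sweep for the symptoms described above.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
batch = client.BatchV1Api()

# Pods that are neither Running nor Succeeded, across all namespaces.
for pod in core.list_pod_for_all_namespaces().items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(f"pod {pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")

# Jobs that have reported failed runs.
for job in batch.list_job_for_all_namespaces().items:
    if (job.status.failed or 0) > 0:
        print(f"job {job.metadata.namespace}/{job.metadata.name}: {job.status.failed} failed")

# Nodes reporting memory or disk pressure.
for node in core.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type in ("MemoryPressure", "DiskPressure") and cond.status == "True":
            print(f"node {node.metadata.name}: {cond.type}")
```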
Udi: Right. And this brings us to the big question: What does Kubernetes health really mean?
Danielle: Kubernetes health isn’t just about whether a pod is running or a node is online. It’s about understanding the bigger picture—how all the components interact and how small misconfigurations can lead to cascading failures.
Think of Kubernetes as the operating system of the cloud. It’s a platform that runs other platforms. But most engineers only focus on what’s visible—the tip of the iceberg. Underneath, there’s a complex ecosystem of configurations, dependencies, and resources that all need to be in sync.
Udi: And if something is off, the whole system can become unstable.
Danielle: Exactly. Some of the biggest challenges in Kubernetes health come from those hidden layers: misconfigurations that quietly cascade across resources, dependencies between add-ons, operators, and CRDs, and the difficulty of correlating all of those signals once you're running many clusters.
Udi: Let’s walk through an example. Say a developer sees a web service is down. What’s the traditional troubleshooting process?
Danielle: First, the developer inspects the Kubernetes deployment and sees that all the pods are failing. They check the logs—something’s preventing the service from connecting to the database. The database itself looks fine, but the connections dropped to zero.
After an hour of frustration, they escalate the issue to DevOps. The DevOps engineer retraces all the steps, then checks the network policies—everything seems fine. Finally, they inspect the certificates and realize that a TLS certificate expired, causing authentication failures.
This entire process can take two hours or more. And all of it could have been avoided if they had better visibility into certificate health.
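For readers who want to map that story onto real API calls, below is a sketch of the same manual investigation scripted with the Kubernetes Python client. The production namespace and the app=web label are hypothetical placeholders for the affected service.

```python
# Illustrative only: the manual steps described above, scripted with the
# official Kubernetes Python client. Namespace and labels are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
net = client.NetworkingV1Api()

# Step 1: check the web service's pods.
pods = core.list_namespaced_pod("production", label_selector="app=web")
failing = [p for p in pods.items if p.status.phase != "Running"]
print(f"{len(failing)} of {len(pods.items)} pods are not Running")

# Step 2: pull recent logs from one failing pod to see the DB connection errors.
if failing:
    print(core.read_namespaced_pod_log(
        failing[0].metadata.name, "production", tail_lines=50))

# Step 3 (after escalation): review network policies in the namespace.
for policy in net.list_namespaced_network_policy("production").items:
    print("network policy:", policy.metadata.name)

# Step 4: only at this point would someone think to inspect the TLS
# certificate itself (a standalone expiry check is sketched further down).
```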
Udi: So, how does Komodor approach this differently?
Danielle: With Komodor, this investigation would take five minutes instead of two hours. Our Kubernetes Health Management platform automatically detects issues like failed certificate renewals and alerts you before they expire.
Instead of waiting for a service outage, teams get an early warning and can fix issues before they impact users.
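Komodor performs this detection automatically. Purely to make the idea concrete, here is a standalone sketch of a certificate-expiry check built on the Python standard library; it is not Komodor's implementation, and the hostname is a hypothetical placeholder.

```python
# Illustrative only: a standalone TLS expiry check (not Komodor's implementation).
import datetime
import socket
import ssl

def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of days before the server's certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=datetime.timezone.utc)
    return (expires - datetime.datetime.now(datetime.timezone.utc)).days

remaining = days_until_expiry("web.example.internal")  # hypothetical hostname
if remaining < 14:
    print(f"Warning: certificate expires in {remaining} days")
```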
Udi: And that’s the key difference. Most monitoring tools react after something breaks. Komodor helps teams be proactive, reducing downtime and improving reliability.
Udi: So, to wrap things up, we covered why traditional monitoring falls short for Kubernetes, what Kubernetes health really means, and how a proactive approach turns a two-hour investigation into a five-minute one.
Danielle: Exactly. And Komodor isn’t just about reliability—it also includes cost optimization, user management, and role-based access control to streamline Kubernetes operations.
Udi: Awesome. Thanks, Danielle, for the deep dive! And thanks to everyone for joining. We’ll now open it up for Q&A.