Product Klip: Add-On Support for cert-manager & External-DNS

The following is an AI-generated transcript:

Hi, I’m Udi, and in this video, I will show you how Komodor’s cluster health management capabilities can be extended into the wider Kubernetes ecosystem to provide valuable insights into various add-ons, operators, and CRDs.

In a previous video, I explained what cluster health management even means and how Komodor can help connect the dots between the infrastructure layer, the application layer, and the Kubernetes layer. In this video, we’ll focus specifically on External DNS and Cert Manager, and see how Komodor proactively promotes reliability across the system by continuously analyzing the behavior of these add-ons in conjunction with the app and infrastructure layers.

As you know, a lot of things can go wrong with both Cert Manager and External DNS. Maybe certificates are not being issued or are stuck pending due to misconfiguration or DNS validation issues. Maybe it’s a permissions issue, or the correct rules are missing. As for External DNS, it could be that you’re experiencing conflicting DNS entries or rate limiting by your DNS provider. All of those things can—and will—happen.

But with Komodor, you will know about them ahead of time and can prevent or fix them quickly.

Let’s see what that looks like.

Right now, I’m inside my dedicated workspace, and everything I’m seeing is within the context of my applications. If I scroll past the workload health and the infrastructure health dashboards, I reach a section that shows all the Kubernetes add-ons deployed on this specific cluster, and whether there are any issues with any of them.

Right away, I can see that Cert Manager has four issues. Clicking on it takes me to the Cert Manager tab, where I can see all of my certificates. The problematic ones are promoted to the top, so I immediately know which require attention.

Here, I can see that we have a workload in production and its certificate is due to expire in just four days. The current certificate is failing to renew. If I don’t do anything about it, in four days this service will fail.
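To make the idea of a certificate that is "due to expire in just four days" concrete, here is a minimal sketch of such a check. It assumes the certificate is available as a plain dictionary shaped like the `status` of a cert-manager Certificate resource (a `notAfter` timestamp and a `Ready` condition). The function names and the four-day threshold are hypothetical, chosen for illustration, and this is not a description of how Komodor performs the check.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_timestamp(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as '2025-01-05T00:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def needs_attention(status: dict, now: datetime, threshold_days: int = 4) -> bool:
    """Return True if the certificate is not Ready or expires within the threshold.

    `status` is assumed to look like the status block of a cert-manager
    Certificate; `threshold_days` is a hypothetical alerting window.
    """
    conditions = status.get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
    not_after = status.get("notAfter")
    if not_after is None:
        # No expiry recorded: treat the certificate as never issued.
        return True
    expires_soon = parse_k8s_timestamp(not_after) - now <= timedelta(days=threshold_days)
    return (not ready) or expires_soon
```

A certificate that is still marked Ready but expires within the window is flagged as well, which matches the situation in the demo: the current certificate is valid today, yet the service will fail if nothing is done.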

Clicking on it reveals a drawer that breaks down the issue, explaining exactly what happened. There’s an option to learn more if desired. We also have a visualization of the downstream effect of not addressing the certificate issue. So we can see that in four days, when Cert Manager fails to issue a new certificate to the PG-prod service, not only will that service fail, but three dependent services downstream will also fail.
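The downstream effect shown in the drawer can be thought of as a walk over a service dependency graph. The sketch below illustrates that idea with a breadth-first traversal. The service names other than `pg-prod` and the shape of the `dependents` mapping are assumptions made up for this example, not data from the demo or Komodor's actual model.

```python
from collections import deque


def downstream_services(dependents: dict, failed: str) -> list:
    """Return every service that transitively depends on `failed`, in BFS order."""
    seen = {failed}
    order = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, []):
            if svc not in seen:
                seen.add(svc)
                order.append(svc)
                queue.append(svc)
    return order


# Hypothetical topology: each key maps to the services that depend on it.
example_dependents = {
    "pg-prod": ["orders-api", "billing-api"],
    "orders-api": ["checkout"],
    "billing-api": ["checkout"],
}

affected = downstream_services(example_dependents, "pg-prod")
```

With this made-up topology, `affected` contains three services, mirroring the three dependent services that would fail along with `pg-prod` in the demo.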

This allows me to be proactive and resolve the issue ahead of time.

As always, we have Klaudia, our AI agent, doing the investigation behind the scenes. Here again, Klaudia has already reached a conclusion—and she brings the receipts. That way, we can trust her recommendations.

After showing the related evidence, Klaudia provides clear and simple step-by-step instructions for remediation. In this case, Klaudia concluded that the relevant DNS record is missing, which is why Cert Manager is failing to issue a new certificate to the PG-prod service. I can now follow these instructions to quickly fix the issue or share this insight with my admin so they can take action.

Let’s now look at the External DNS tab. Similarly to what we saw under the Cert Manager tab, I can see a list of all my DNS records, their status, and the last time they were synced.

Clicking on the top one reveals the drawer with all the relevant information. I can see a quick explanation of what happened and why: the DNS records are not synced with the DNS provider due to rate limits. There’s also a visual representation of the downstream effect of External DNS not syncing these records, showing the dependent services that will be affected if I don’t resolve the issue.

Once again, Klaudia, our trusted AI agent, has done the investigation behind the scenes and is already here with the solution—as well as the evidence supporting why she believes this is the root cause. As always, we have simple, step-by-step instructions for how to remediate this issue.

In this case, we need to raise the API usage and rate limits on the DNS provider, specifically Route 53. Komodor provides a few recommendations on how to do exactly that.
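The recommendations themselves are not spelled out in this transcript. As general background on handling provider rate limits, one common client-side pattern is to retry throttled calls with exponential backoff. The sketch below is a generic illustration of that pattern only; `RateLimited` and `call_with_backoff` are hypothetical names, and this is not what External DNS, Route 53, or Komodor does internally.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the provider throttles a request."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing delays.

    The delay doubles after each failed attempt and is capped at max_delay.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The `sleep` parameter is injectable so the function can be exercised in tests without actually waiting.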

These were just two simple examples of things that can go wrong with Cert Manager and External DNS, and how we can use Komodor to resolve them.

Obviously, real-life issues are more complex and often interrelated—especially in large-scale Kubernetes environments.

Imagine a slightly more complicated scenario: a developer receives an alert that the checkout service is down. They inspect the deployment and realize all the pods are failing. The next logical step is to check the logs, which reveal that the service is failing to connect to a database.

However, inspecting the database pods shows everything is up and running. The DB logs also show no issues. But checking the DB connections reveals that they’ve dropped to zero.

Only then, after perhaps two hours or more of downtime and stress, does the developer escalate to the Ops team. The Ops team runs through the same process again and dives deeper, maybe checking the network policies and the certificate. Only then do they realize the certificate has expired.

Then they realize Cert Manager is failing to renew the certificate because there’s an issuer problem. Only after inspecting the issuer do they realize that it’s all due to a DNS issue. Then they fix the DNS problem.
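The manual investigation described above amounts to following a chain of "caused by" links from the visible symptom back to the root cause. The sketch below models that walk. The `links` mapping is a hand-written, hypothetical encoding of the scenario in this story, not output from any real tool.

```python
def trace_root_cause(caused_by: dict, symptom: str) -> list:
    """Follow 'caused by' links from a symptom until no further cause is recorded."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:
            # Guard against cycles in the recorded links.
            break
        chain.append(cause)
        seen.add(cause)
    return chain


# Hypothetical encoding of the checkout scenario: effect -> cause.
links = {
    "checkout pods failing": "database connection failing",
    "database connection failing": "certificate expired",
    "certificate expired": "issuer failing to renew",
    "issuer failing to renew": "DNS issue",
}

chain = trace_root_cause(links, "checkout pods failing")
```

Here the last element of `chain` is the DNS issue, four hops away from the alert the developer first received, which is why the manual process takes so long.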

All of this results in hours of downtime and stressful troubleshooting involving multiple team members.

Imagine how simple it would be to detect the entire scenario I just described proactively, before the checkout service goes down. And it could all be handled by a single engineer, rather than an entire team scrambling to figure out what went wrong.