Kubernetes is Not Just a Platform – It’s a Whole Ecosystem

As someone building a platform intended to make Kubernetes operations easier for everyone, I’ve learned a lot about production Kubernetes operations. The main thing I’ve noticed folks getting wrong is treating Kubernetes as simply a platform, when it’s really an entire ecosystem. In the years since its launch, it has evolved into a jungle of add-ons, tools, and extensions that make your infrastructure smarter and faster, but that sometimes bring more chaos than expected.

That’s why pretty much everyone running Kubernetes uses add-ons, and while they’re incredibly powerful and useful, they can introduce a lot of complexity. From first-hand experience, we’ve learned that organizations simply aren’t managing their add-ons well today, which poses a real risk to the stability and performance of Kubernetes environments.  

In this post, we’re going to demystify some misconceptions about Kubernetes, explain why we think add-ons are critical for Kubernetes deployments, and show how to channel their power without overwhelming your systems.

Add-Ons: Kubernetes’ Secret Sauce

Kubernetes is inherently extensible, offering a variety of core features like autoscaling, persistent volumes, and load balancing, which meet essential operational needs. However, add-ons expand its capabilities significantly, enabling advanced functionality in areas like scaling, security, machine learning, and policy enforcement. Another key extension method is through Custom Resource Definitions (CRDs), allowing users to define new resource types beyond Kubernetes’ built-in objects, enabling tailored workflows and integrations for unique use cases. While add-ons and CRDs enhance the flexibility and power of Kubernetes clusters, they also introduce management challenges and potential risks if not properly maintained.

The Add-On Landscape: Opportunities and Risks

Kubernetes add-ons can make or break your cluster. Done right, they help your teams overcome real pains and challenges. But get sloppy, and you’re looking at a potential dumpster fire. Add-ons introduce risk through mismanagement and misconfiguration, and any new add-on or tool in your stack inherently adds complexity. Below, we’ll dive into how you can properly channel the potential they offer while avoiding the risks, so you don’t find yourself firefighting outages instead of shipping features.

Here are some examples of how they can impact your environment:

  • Scaling & Autoscaling: Tools like HPA, KEDA, and Karpenter handle resource scaling, ensuring your applications have what they need when they need it. But if misconfigured, you could end up with a cluster that’s either starving for resources – causing node pressure and pod evictions that affect workloads in unexpected ways, or on the other hand, clusters flooded with unused capacity – draining dollars in vain and complicating operations.
  • Data Workflows: Add-ons like Argo Workflows, Apache Airflow, Kubeflow, etc., help you automate and streamline complex workflows. Running them on Kubernetes allows you to leverage their dynamic nature, as well as their built-in scaling or scheduling capabilities. However, it also introduces complexity, which can be a disaster when combined with limited visibility and data engineers with no infra background.
  • Security & Policy Management: Policy engines like Kyverno and OPA enforce security policies and protect your environment. But be careful—they can lock you out of critical resources if policies are too strict or applied incorrectly. One bad update, and you’re watching helplessly from the outside.
  • Core Services: Tools like cert-manager and ExternalDNS provide essential services for managing external DNS records, automating TLS certificate issuance, and securely storing and managing sensitive data like secrets and credentials. If you don’t continuously check the health of your core services, you run the risk of downtime caused by DNS propagation delays or TLS certificate renewal failures.
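To make the scaling risk concrete, here’s a minimal HorizontalPodAutoscaler sketch (the names and thresholds are hypothetical, not a recommendation). A detail that’s easy to miss: CPU-utilization targets are computed against the pods’ CPU *requests*, so if the target workload doesn’t declare them, the autoscaler has nothing to work with.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2
  maxReplicas: 10               # too low starves you under load; too high drains dollars in idle capacity
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the pods' CPU requests
```

If the checkout Deployment’s containers don’t set `resources.requests.cpu`, the utilization target can’t be evaluated and the HPA simply won’t scale—a quiet misconfiguration that looks fine until traffic spikes.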

Let’s take cert-manager as an example – if certs go wrong, things get ugly fast. When a cert-manager issue arises in a Kubernetes environment, the fallout can lead to service disruptions and security vulnerabilities. Without the right tools, finding, troubleshooting and fixing the issue can become a time-consuming, multi-step process.

Let’s walk through a typical investigation flow:

  • Problem Discovery: It starts with an alert—the Checkout service is down. This prompts developers to investigate.
  • Initial Troubleshooting: Developers first check the K8s deployment and find that all pods are failing. They inspect the logs, discovering an error message: Failed to connect to the database.
  • Narrowing Down the Issue: They inspect the database pods and find everything is up and running. Next, they check the database connections and notice they have dropped to zero, but database logs show no issues.
  • Escalation: At this point, the developers need to escalate the issue to the ops team (as much as they hate to do so), as the developers cannot pinpoint the root cause.
  • Ops Team Troubleshooting: The ops team gets to work inspecting the potential culprits, working quickly to minimize downtime. They start by inspecting network policies and confirm no issues there. They then inspect the certificates and discover that they have expired. Further investigation shows that cert-manager failed to renew them, and the root cause was an issue with the DNS settings of the certificate issuer.
  • Resolution: After several rounds and many hours of around-the-clock troubleshooting and testing, the ops team fixes the DNS problem (it’s always DNS!), allowing cert-manager to renew the certificates. However, by the time the issue is resolved, there have been two hours of downtime, impacting critical business services.
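The root cause in the walkthrough above lives in configuration roughly like the following (all names, domains, and provider details are hypothetical). With a DNS01 solver, cert-manager has to create DNS records through the issuer’s configured provider, so a wrong DNS setting makes renewals fail quietly until the certificate expires:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod            # hypothetical issuer
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          route53:                  # wrong region or zone here breaks the DNS01 challenge
            region: us-east-1
            hostedZoneID: Z0EXAMPLE
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: checkout-tls
spec:
  secretName: checkout-tls          # the secret the Checkout service's TLS depends on
  dnsNames:
    - checkout.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

A quick `kubectl describe certificate checkout-tls` would surface the failed renewal in the resource’s status conditions—long before anyone needs to rule out network policies or database connections.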

With Komodor, there’s out-of-the-box detection and alerting so the team knows when cert-manager fails to renew certificates before it has an impact on other services or user experience. Instead of manually combing through logs and dashboards, Komodor identifies the problem, triggers an automated investigation, automatically analyzes the symptoms, and detects the root cause – sparing teams hours of manual troubleshooting. 

Komodor presents a clear visualization of the root cause, showing which services are affected, the expired certificates, and the misconfiguration. This visualization eliminates guesswork, helping teams immediately understand the impact and scope of the issue. With the root cause verified, the ops team can fix the problem before it causes any business impact.

And that’s just one add-on that can wreak havoc. What about the rest?

Add-Ons as a Double-Edged Sword 

Kubernetes add-ons can introduce complexity, dependencies, and resource overhead that can undermine the stability of your infrastructure.

  • Complexity: The more add-ons you introduce, the greater the operational complexity. Each add-on requires configuration and maintenance, and each one adds a layer of potential failure. The cognitive load on DevOps teams increases as they need to understand the intricate dependencies between these add-ons and the core Kubernetes components.
  • Add-On Dependencies: Some add-ons may introduce single points of failure. For example, a misconfigured autoscaler or storage driver can bring down critical parts of your infrastructure, leading to downtime or cascading failures across services.
  • Latency and Resource Overhead: While tools like service meshes offer enhanced reliability, they also introduce latency and consume cluster resources. If not properly managed, the overhead from these add-ons can degrade the very performance you’re trying to enhance.

Add-Ons Gone Wrong: True Horror Stories

Ever had a system outage and thought, “Well, that escalated quickly”?

Kubernetes add-ons can be great, but they can turn to the dark side. Here are some first-hand horror stories from our customers that should make anyone double-check their configs:

  1. Prometheus Overload: A company’s Prometheus setup starts out smoothly, but as more metrics are added, it gradually consumes more CPU and memory than anticipated. Before long, Prometheus becomes a resource hog, leaving critical applications struggling for compute. Monitoring the cluster ultimately turns into monitoring the resource bottleneck.
  2. Service Mesh Misadventures: Deploying Istio was meant to streamline microservices communication, but instead, inter-pod communication broke down. The culprit? A misconfigured rule that led to service failures. The takeaway: always check your sidecar configurations before diving into debugging application code.
  3. Storage Failures: An update to a CSI storage driver introduces incompatibility with Kubernetes, leading to data volumes failing to mount. The result? Downtime and potential data loss. When storage fails, everything else follows—and it’s rarely as simple as a “quirk” in the database.
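For the Prometheus story, one common guardrail is capping what each scrape job can ingest and bounding the server’s own resources. A sketch, with illustrative values only:

```yaml
# prometheus.yml (fragment): reject runaway targets at scrape time
scrape_configs:
  - job_name: kubernetes-pods
    scrape_interval: 30s      # longer intervals mean fewer samples to store
    sample_limit: 5000        # the scrape is dropped entirely if a target exposes more samples than this
---
# and on the Prometheus container itself (pod spec fragment)
resources:
  requests:
    memory: 4Gi
  limits:
    memory: 8Gi               # let the kubelet restart Prometheus before it starves neighbors
```

Limits like these turn “Prometheus quietly ate the cluster” into a visible, bounded failure you can alert on.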

The Komodor Approach: Keeping the Chaos in Check

So how do we tame this Kubernetes chaos? That’s where we at Komodor come in. We’ve seen it all, from clusters bursting at the seams with rogue add-ons to autoscalers misbehaving. Our approach is simple: visualize, operate, detect, investigate, optimize. In other words, we help you cut through the noise, so your Kubernetes cluster runs like the well-oiled machine it was always meant to be. As we’ve seen success in applying this approach to Kubernetes operations at scale, we are now extending this to the Kubernetes ecosystem as a whole to derive the same benefits.

What this means in practice:

  1. Visualize: Get a clear, no-nonsense view of what’s happening in your cluster. Komodor makes the invisible visible, whether it’s a rogue cert-manager deployment or an autoscaler gone wild.
  2. Operate: Automate the boring stuff, because nobody wants to spend their afternoon babysitting a deployment.
  3. Detect: Spot problems early by proactively monitoring for issues in real-time. Komodor identifies potential risks like misconfigurations or performance bottlenecks, before they become disasters.
  4. Investigate: When things do go wrong (and let’s face it, they will), Komodor will help you get to the root cause using any of our guided step-by-step playbooks, or with deep, actionable AI-powered insights, to help simplify the debugging process.
  5. Optimize: Fine-tune your cluster so it’s running efficiently, ensuring that your add-ons are delivering maximum value without introducing unnecessary complexity.

Curious to learn more about how we do this for all popular Kubernetes ecosystem add-ons (CRDs & Operators)? Read more about it here.

The Road Ahead: More Add-Ons = More Fun

As Kubernetes continues to evolve, the ecosystem will get even richer (yet more challenging). We’re seeing huge advancements in related add-ons such as autoscaling (HPA, Karpenter), data workflows (Argo, Kubeflow), and security (OPA, Kyverno). The challenge remains in keeping it all under control and understanding the interdependencies between different add-ons, CRDs, Operators, workloads, and K8s-native resources.

Kubernetes may have started as a platform, but now it’s a sprawling ecosystem, with a jungle of add-ons that can either make your life easier or set you up for a cluster meltdown. The key is managing that complexity without succumbing to the chaos. Just remember: Kubernetes might be wild, but with the right strategies and tools, you can tame the beast.

Looking to bring clarity to the complexity of your K8s environment & add-ons jungle? Then try out Komodor for free for 14 days!
