Hi, I’m Udi from Komodor. In this video, I’ll show you how Komodor gives data teams visibility into and operational understanding of the data pipelines they run on Kubernetes, and how they can troubleshoot those pipelines independently when something inevitably breaks.
Komodor’s capabilities for workflow automation engines like Kubeflow, Airflow, and Apache Spark are designed to tackle three main challenges:
Let’s see what life looks like for data engineers with Komodor.
If I go to my overview screen and scroll down to the Kubernetes add-ons tile, I can see that I have two issues with my workflow automation tools. Clicking on it takes me to the Workflows tab, where I can see all the workflow engines I’m currently using, along with some relevant metadata—and most importantly, their status.
Right away, we can see the two issues that were highlighted on the overview screen. The first is with Airflow. Clicking on it brings up a timeline of all the workflows running concurrently. I can see that a workflow failed at some point.
It’s important to note that Komodor differentiates between the different phases of a workflow. Just because something is pending or not ready doesn’t mean something is wrong. But when something is wrong, Komodor will let you know. You can count on Komodor not to flag every pending pod, but only the ones that require your attention.
Clicking on this failed pod shows that it failed because its hosting node was terminated—a fairly common event. Thanks to Klaudia’s AI analysis, we can see exactly what happened, when it happened, and why. In this case, the node was terminated due to a scale-down event triggered by Karpenter. This is a good example of how different add-ons can affect each other. We’ll have a separate video about cluster autoscalers, but for now, put yourself in the shoes of a data engineer: a scale-down event caused by Karpenter is likely well outside your scope and expertise.
But with Komodor, it’s easy for anyone to understand the sequence of events, what they mean, and what should be done to fix the issue. Without reading any Karpenter documentation, opening AWS support tickets, or escalating to your MLOps team, you can simply rerun the workflow with one click.
You can even take a proactive approach—adding an annotation to the pod to avoid a similar event in the future. Sending a screenshot of this to your MLOps engineer will definitely earn you points with the platform team.
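As an illustration of that proactive step, here is a minimal sketch of such an annotation, assuming Karpenter is the autoscaler behind the scale-down: recent Karpenter releases honor a karpenter.sh/do-not-disrupt pod annotation (older versions used karpenter.sh/do-not-evict), which tells Karpenter not to voluntarily disrupt the node while the pod is running. The pod name and image below are made up for the example.

```yaml
# Sketch: mark a pipeline pod so Karpenter won't voluntarily disrupt its node.
apiVersion: v1
kind: Pod
metadata:
  name: airflow-task-runner            # hypothetical pod name
  annotations:
    karpenter.sh/do-not-disrupt: "true" # older Karpenter versions: karpenter.sh/do-not-evict
spec:
  containers:
    - name: task
      image: apache/airflow:2.9.0       # illustrative image, not from the video
```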
Now, let’s take a look at the other failed workflow—this time it’s an Argo Workflows one. The UI is a bit different, but just like before, we can see when pods are pending, when they’re running, and most importantly, when they fail. Once again, this pod failed due to a scale-down event. Karpenter is, once again, being a bit naughty.
Thankfully, Klaudia is here to help. She provides clear, step-by-step instructions to not only remediate this specific failure but also prevent it from happening again. For example, increasing the node pool cooldown limit for Karpenter is a good start. But something even more advanced would be adding a taint to GPU-based nodes, so non-GPU workloads aren’t scheduled on them—ensuring those critical GPU nodes are always available for the jobs that truly need that computing power.
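As a rough sketch of what those two suggestions could look like in a Karpenter NodePool manifest (field names follow the karpenter.sh/v1 NodePool API; the pool name, timings, taint key, and instance-category values are assumptions for illustration, and provider-specific fields such as nodeClassRef are omitted):

```yaml
# Sketch: slow down scale-down of empty nodes and reserve GPU nodes via a taint.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool                          # hypothetical pool for GPU workloads
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 10m                 # wait longer before removing empty nodes
  template:
    spec:
      taints:
        - key: nvidia.com/gpu             # only pods tolerating this taint land here
          value: "true"
          effect: NoSchedule
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]              # AWS GPU instance families (assumption)
      # nodeClassRef and other provider-specific settings omitted from this sketch
```

GPU jobs would then carry a matching toleration (and a GPU resource request), while ordinary workloads, lacking the toleration, stay off those nodes.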
All of this can be done directly within the Komodor platform, without needing to switch tools or escalate to another team.
This capability gives data engineers full ownership of their workflows. If something breaks, they know how to fix it themselves—and they’ll know about it in time. As soon as something is suspicious or risky, Komodor will alert you and guide you through how to resolve it on your own.