Product Klip: Add-On Support for Workflow Automation Engines

The following is an AI-generated transcript:

Hi, I’m Udi from Komodor. In this video, I’ll show you how Komodor enables data teams to gain visibility and operational understanding of the data pipelines they are running on Kubernetes and how to troubleshoot them independently when something inevitably breaks.

Komodor’s capabilities for workflow automation engines like Kubeflow, Airflow, and Apache Spark are designed to tackle three main challenges:

  1. Lack of visibility into workflow failures – When a job fails, users often have no idea why. Logs are scattered across multiple pods, and if a pod is deleted, all of its historical data is gone, making troubleshooting nearly impossible.
  2. Runaway costs – As you know, ML workflows consume massive compute resources. Without proper monitoring, costs can spiral out of control before you even notice.
  3. Dependency on platform teams – The barrier to entry for data engineers is far greater than for application developers. So, when issues arise, they’re often escalated to DevOps, MLOps, or platform teams, which causes ticket overload, slower resolution times, and bottlenecks that affect development and overall business goals.

Let’s see what life looks like for data engineers with Komodor.

If I go to my overview screen and scroll down to the Kubernetes add-ons tile, I can see that I have two issues with my workflow automation tools. Clicking on it takes me to the Workflows tab, where I can see all the workflow engines I’m currently using, along with some relevant metadata—and most importantly, their status.

Right away, we can see the two issues that were highlighted on the overview screen. The first is with Airflow. Clicking on it brings up a timeline of all the workflows running concurrently. I can see that a workflow failed at some point.

It’s important to note that Komodor differentiates between the phases of a workflow. A pod that is pending or not yet ready isn’t necessarily a problem—but when something is genuinely wrong, Komodor will let you know. You can count on Komodor not to flag every pending pod, only the ones that require your attention.

Clicking on this failed pod shows that it failed because its hosting node was terminated—a fairly common event. Thanks to Klaudia’s AI analysis, we can see exactly what happened, when it happened, and why. In this case, the node was terminated due to a scale-down event triggered by Karpenter. This is a good example of how different add-ons can affect each other. We’ll have a separate video about cluster autoscalers, but for now, imagine this as a data engineer: a scale-down event caused by Karpenter is completely outside your scope and expertise.

But with Komodor, it’s easy for anyone to understand the sequence of events, what they mean, and what should be done to fix the issue. Without reading any Karpenter documentation, opening AWS support tickets, or escalating to your MLOps team, you can simply rerun the workflow with one click.

You can even take a proactive approach—adding an annotation to the pod to avoid a similar event in the future. Sending a screenshot of this to your MLOps engineer will definitely earn you points with the platform team.
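As a sketch of what that proactive annotation might look like (the pod name and image are hypothetical; recent Karpenter versions use the `karpenter.sh/do-not-disrupt` annotation to opt a pod out of voluntary node disruption, while older versions used `karpenter.sh/do-not-evict`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-task-pod            # hypothetical pod name
  annotations:
    # Tells Karpenter not to voluntarily disrupt (consolidate / scale down)
    # the node while this pod is still running.
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: task
      image: apache/airflow:2.9.0   # hypothetical image
```

This only blocks voluntary disruption; the node can still be removed for involuntary reasons such as spot interruptions.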

Now, let’s take a look at the other failed workflow—this time it’s an Argo Workflows one. The UI is a bit different, but just like before, we can see when pods are pending, when they’re running, and most importantly, when they fail. Once again, this pod failed due to a scale-down event. Karpenter is, once again, being a bit naughty.

Thankfully, Klaudia is here to help. She provides clear, step-by-step instructions to not only remediate this specific failure but also prevent it from happening again. For example, increasing the node pool cooldown limit for Karpenter is a good start. But something even more advanced would be adding a taint to GPU-based nodes, so non-GPU workloads aren’t scheduled onto them—ensuring those critical GPU nodes are always available for the jobs that truly need that computing power.
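Both of those remediations map to Karpenter configuration. A minimal sketch, assuming Karpenter’s `NodePool` API (the pool name and values are illustrative): `consolidateAfter` extends the cooldown before empty nodes are scaled down, and the taint keeps non-GPU workloads off GPU nodes:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool                    # hypothetical pool name
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 10m           # wait longer before scaling down empty nodes
  template:
    spec:
      taints:
        # Only pods with a matching toleration (i.e., GPU jobs) land here.
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
```

GPU workloads would then declare a matching toleration in their pod spec; everything else gets scheduled to non-GPU node pools.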

All of this can be done directly within the Komodor platform, without needing to switch tools or escalate to another team.

This capability gives data engineers full ownership of their workflows. If something breaks, they know how to fix it themselves—and they’ll know about it in time. As soon as something is suspicious or risky, Komodor will alert you and guide you through how to resolve it on your own.