Komodor Workflows: Automated Troubleshooting at the Speed of WHOOSH!

Itiel Shwartz, CTO & co-founder

5 min read October 11th, 2021

Today, just in time for KubeCon 2021, I am happy to announce the beta availability of Workflows. For me, this is our most exciting product announcement to date – a completely new capability that expands the definition of what Komodor is and charts the course for its next evolution.

Let me start with the feature first. In a nutshell, Workflows is a series of smart algorithms that operate within the “depths” of Komodor. Listening to the signals Komodor collects, Workflows algorithms can automatically:

Detect Kubernetes issues (e.g., health events, schedulable resources, etc.)
Correlate the information with data from external sources (e.g., cloud providers, source code and feature flags)
Run sequences of checks that quickly pinpoint the exact root cause
Use all of the information acquired to deliver made-to-measure instructions for remediation
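To make the first step in the list above a bit more concrete, here is a minimal sketch of what detecting an issue from a pod's status could look like. It is not Komodor's detection logic: the `detect_issues` function, the set of reasons, and the example pod are hypothetical. The only thing taken from Kubernetes itself is the shape of the pod status (`status.containerStatuses[].state.waiting.reason`).

```python
from typing import Any

# Waiting reasons that usually indicate a problem worth investigating.
# This set is an illustrative assumption, not Komodor's detection logic.
SUSPICIOUS_REASONS = {"ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff"}


def detect_issues(pod: dict[str, Any]) -> list[dict[str, str]]:
    """Return one finding per container stuck in a suspicious waiting state."""
    findings = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        waiting = status.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in SUSPICIOUS_REASONS:
            findings.append({
                "pod": pod.get("metadata", {}).get("name", "<unknown>"),
                "container": status.get("name", "<unknown>"),
                "reason": waiting["reason"],
                "message": waiting.get("message", ""),
            })
    return findings


# Hypothetical pod object, shaped like the JSON the Kubernetes API returns.
example_pod = {
    "metadata": {"name": "checkout-7d9f"},
    "status": {
        "containerStatuses": [
            {"name": "app", "state": {"waiting": {
                "reason": "ImagePullBackOff",
                "message": "Back-off pulling image",
            }}},
            {"name": "sidecar", "state": {"running": {}}},
        ]
    },
}

for finding in detect_issues(example_pod):
    print(finding)
```

In this sketch only the `app` container is reported, because the `sidecar` container is running rather than waiting.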

Typically, what I described above could take hours and would likely require the involvement of several team members working with multiple tools.

With Workflows, however, it takes a mere second for the entire process to complete, turning troubleshooting into an effortless experience – something that anyone can do on the fly.

As interesting as this all may sound, our vision for Workflows is much broader. In this post I’ll dive into what ‘Workflows’ is and the foundations it lays for the future.

Democratizing Troubleshooting

The goal of Komodor is to take the complexity out of Kubernetes troubleshooting. Setting out, this meant building a tool that would streamline the process of root cause analysis – a tool that quickly answers the “who did what?” question by taking inventory of all changes and pinpointing the thing that caused fires in production.

Having wasted untold hours on this exact question ourselves, we thought this was a good place to start. Turns out that many other folks felt the same way and, by now, we have gotten used to (but not tired of) hearing “…this used to take HOURS” from our customers.

But we heard other things as well. From talking to dozens of organizations using Kubernetes we learned about the knowledge gaps that prevented some of the developers from being fully autonomous. The most common themes were:

After encountering an issue, devs didn’t know what questions they needed to ask next for additional context (e.g., do I have a memory issue in other pods? Is this a new or ongoing problem?).
Even if they knew what they were looking for, developers couldn’t get the answer they needed, mostly due to a lack of tools or privileges.
Almost everyone we talked to found it extremely hard to correlate different information sources when dealing with more complex issues.
Even when the root cause was detected, not everyone knew what action they needed to take to solve the issue.

As a result, in many organizations, the last mile of troubleshooting still fell on the shoulders of a few domain experts (e.g., SRE or DevOps leads) who carried the responsibility for fixing all production issues – big or small.

This created a bottleneck, but also a perfect opportunity for us to step in and improve the process. After all, we had already built a tool that helped those experts troubleshoot at record speeds… Why not bake their expertise into the product, and have every developer troubleshoot common issues on the fly?

From Actionable Insights to Opinions

This is how Workflows came to be. It started with us sitting down and mapping the different actions we expected our users to take when faced with a certain K8s issue. Very quickly we saw that, with the benefit of the right insights, these actions fell into predictable patterns that could be distilled into a series of checks.

If A1, C3, E3 and F5 are true, then do X, and so forth… The actual algorithms, however, are infinitely more complex. To demonstrate, here is a very small sub-segment of the workflow for an ImagePullBackOff error:
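The "if these checks are true, then do X" pattern can also be sketched in a few lines of Python. This is only an illustration of the pattern, not Komodor's actual workflow: the `Rule` class, the check names, and the recommendations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A recommendation that applies when every required check is true."""
    required: frozenset[str]
    recommendation: str


# Hypothetical rules for an ImagePullBackOff investigation.
RULES = [
    Rule(
        frozenset({"registry_reachable", "registry_is_private", "pull_secret_missing"}),
        "Add an imagePullSecrets entry for the private registry to the pod spec.",
    ),
    Rule(
        frozenset({"registry_reachable", "image_tag_not_found"}),
        "Check the image tag for a typo or push the missing tag.",
    ),
]


def recommend(check_results: dict[str, bool]) -> list[str]:
    """Return the recommendation of every rule whose required checks all hold."""
    true_checks = {name for name, ok in check_results.items() if ok}
    return [rule.recommendation for rule in RULES if rule.required <= true_checks]


# Hypothetical results of the automated checks for one failing pod.
example_results = {
    "registry_reachable": True,
    "registry_is_private": True,
    "pull_secret_missing": True,
    "image_tag_not_found": False,
}

print(recommend(example_results))
```

Only the first rule fires here, since `image_tag_not_found` is false. Keeping each rule as plain data (a set of required checks plus a recommendation) makes it easy to see which checks led to which suggestion.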

After automating the steps an expert would take to fix the issue, our next question was: Just how opinionated did we want Komodor to be?

We definitely were not planning to be in the AIOps game. On the other hand, we were trying to automate troubleshooting and minimize the workload on our users.

With the above in mind, for the beta version, we settled on the following principles:

Don’t wrestle for control – Rather than directly fixing the issue, we decided to provide recommendations for the fix based on automated checks. This way, we could still save users a lot of time while avoiding invading their decision space by doing something they wouldn’t approve of.
Offer full transparency – Workflows should never be a black box. To build up trust and encourage user feedback – especially in the beta phase – we were going to be 100% clear about the checks that we run, their results, and how these informed our conclusion. As a side benefit, this would also educate users about K8s, providing a bit more value by addressing the aforementioned knowledge gaps.
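As a rough illustration of these two principles together, the sketch below keeps a record of every check that ran, its result, and the evidence behind it, then renders a summary that ends with a suggestion instead of applying a fix. The `CheckResult` class, the `render_report` function, and the sample data are hypothetical and do not reflect how Komodor stores or displays its results.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str     # what was checked
    passed: bool  # outcome of the check
    detail: str   # evidence shown to the user


def render_report(results: list[CheckResult], suggestion: str) -> str:
    """List every check with its outcome, then the suggested (not applied) fix."""
    lines = ["Checks run:"]
    for result in results:
        mark = "PASS" if result.passed else "FAIL"
        lines.append(f"  [{mark}] {result.name}: {result.detail}")
    lines.append(f"Suggested fix (not applied automatically): {suggestion}")
    return "\n".join(lines)


# Hypothetical check results for a pod stuck in ImagePullBackOff.
results = [
    CheckResult("registry reachable", True, "registry responded to a test request"),
    CheckResult("image tag exists", True, "tag found in the registry"),
    CheckResult("pull secret present", False, "no imagePullSecrets on the pod spec"),
]

report = render_report(results, "add an imagePullSecrets entry for the registry")
print(report)
```

Because the report lists passing checks as well as failing ones, a reader can see what was ruled out, which is where most of the educational value comes from.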

Example of Workflows recommendations

Above you can see the result – a detailed summary of checks run and the suggestion for the fix that appears in the context of our main dashboard.

This is just the first iteration, and this can and likely will change. But this feels like a good place to start. With the initial feedback from the beta validating our approach and concept, we look forward to expanding the functionality to cover more troubleshooting scenarios, while also improving the user experience.

If you are interested in Workflows and want to learn more, or even join our beta program, please use this link to apply or reach out.

(Much) More to Come

As I’ve mentioned, this is just the beginning. As to what’s next, I don’t want to reveal too much but I can share that we already have plans for customization that will allow admins to create their own playbooks, granularly addressing the specific needs of their organization.

Imagine having full control of what checks our platform executes, in what order, and how their results shape the suggestion for the end-users… That could be very powerful for any organization looking to streamline K8s troubleshooting and improve control over its processes. And who said we even have to stop at troubleshooting?

We are thinking big, and this is just a taste. Stay tuned!

About Komodor

Komodor reduces the cost and complexity of managing large-scale Kubernetes environments by automating day-to-day operations, as well as health and cost optimization. The Komodor Platform proactively identifies risks that can impact application availability, reliability and performance, while providing AI-assisted root-cause analysis, troubleshooting and automated remediation playbooks. Fortune 500 companies in a wide range of industries, including financial services, retail and more, rely on Komodor to empower developers, reduce TicketOps, and harness the full power of Kubernetes to accelerate their business. The company has received $67M in funding from Accel, Felicis, NFX Capital, OldSlip Group, Pitango First, Tiger Global, and Vine Ventures. For more information, visit the Komodor website, join the Komodor Kommunity, and follow us on LinkedIn and X.

To request a demo, visit the Contact Sales page.

Media Contact:
Marc Gendron
Marc Gendron PR for Komodor
617-877-7480
