We recently attended the 2021 Cloud-Native Days Summit, where our co-founding CEO Ben Ofiri gave a lightning round talk on How to Troubleshoot Kubernetes With Confidence.
In case you missed it, here’s a recording and transcript for your convenience.
Hi, everyone, my name is Ben, and I’m the CEO and co-founder of Komodor. And today, we’re going to talk a bit about how you can troubleshoot your Kubernetes system with confidence.
A bit about me: I founded Komodor a year and a half ago, together with my partner. Before that, I worked at Google as a software developer, and later on as a product manager. And obviously, I’m a Kubernetes fan. In fact, the only reason I co-founded Komodor was to solve the pain points I suffered from in my days as a software developer. What my partner and I realized is that when companies migrate to Kubernetes and microservices, even though it gives you a lot of advantages and you can move much faster, it usually also comes with significant downsides.
And one of the most painful downsides of using a highly distributed system that changes so rapidly is that every time there is an issue, every time there is an incident (which, as you probably know, happens a lot), the first question you ask yourself is: what changed in the system, right?
And unfortunately, answering this question has become a very, very hard mission for most R&D organizations. First of all, so many changes go unnoticed; there are a lot of changes you simply don’t know about. So, when you try to find the root cause, it’s very hard to know what might have caused the issue if you don’t know what changed.
Then, even if you have all of those changes in front of you, the data is so fragmented that you need to query different tools and be an expert in all of those different layers: Kubernetes, Datadog, GitHub, Jenkins, configuration management, AWS, etc. Basically, you need to be an expert in all of those tools just to fetch the right information and then piece it together.
Then you obviously need to handle the butterfly effect: how a specific change in one tool affected the different components that rely on it. Understanding those butterfly effects becomes a very hard task once your system is distributed and consists of thousands of different components.
And lastly, even if you overcome the three challenges above, you need so many people to know how to do all of that. You need your developers to be near-experts in troubleshooting: in understanding production, in tying together different pieces of information from all of those different tools, and in understanding the ripple effects. And it’s not enough to have 1% or 2% of your team knowing how to do that; you probably want your entire R&D organization to be able to troubleshoot efficiently. And we know that most R&D organizations just don’t have the right tools and capabilities to do that.
So, this is basically the reason why we founded Komodor. To give you a small illustration, this is how a typical troubleshooting session looks for an average organization. On the left, you get an alert from Datadog or from PagerDuty. On the right are all the tasks you need to do (and probably many more) just to try to understand what might have caused the issue and find the root cause ASAP. You need to go to GitHub, to Jenkins, to Datadog, to Kubernetes, and you need to perform different, very non-trivial tasks in each of those tools. In fact, this is how a deep dive into Kubernetes troubleshooting looks.
So, it’s only one layer, but you can see how complex it is. You can see how many steps are needed just to troubleshoot the easiest and most common things in Kubernetes, like a pod that keeps restarting. So you can understand how much time it takes, but also how much expertise it requires, to perform those kinds of troubleshooting tasks efficiently and independently.
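To make that concrete, here is a sketch of the manual steps such a flow typically involves for a restarting pod. The namespace and pod name below are hypothetical, and this is the generic kubectl diagnostic sequence, not anything Komodor-specific:

```shell
# Hypothetical namespace and pod name -- substitute your own.

# 1. Find pods that are restarting or not running, and their restart counts
kubectl get pods -n production

# 2. Check why the last container exited (OOMKilled, CrashLoopBackOff,
#    failed probes, image pull errors, etc.)
kubectl describe pod authenticator-6d9f7 -n production

# 3. Inspect the logs of the *previous* container instance, since the
#    current one may not have reproduced the crash yet
kubectl logs authenticator-6d9f7 -n production --previous

# 4. Scan recent cluster events for related clues (evictions, scheduling
#    failures, probe failures)
kubectl get events -n production --sort-by=.metadata.creationTimestamp
```

Each of these steps also assumes you already know which namespace and pod to look at, which in practice is often the hardest part.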
We at Komodor offer a different approach to troubleshooting Kubernetes systems: we reverse the order of troubleshooting. Instead of getting an alert and then going to all of those different tools, performing all of those tasks and queries to fetch the relevant information, digesting it, and only then finding the root cause, we change the order around. What we do is constantly track all of the interesting changes and events across your different tools and teams.
So, we collect code changes, configuration changes, Kubernetes deploys, health events, and alerts from third-party apps. We then digest all of this information and map it onto smaller business units like microservices or teams.
So, basically, our users have the entire context they need to efficiently troubleshoot their own team or their own service once they are alerted. We’re not only saving this precious time, but also lowering the bar of expertise needed to troubleshoot very complex cases, because solving a very complex issue in Kubernetes, end to end, can now be a few clicks away.
To show how it looks at a high level: we collect all of the events from code changes, configuration changes, alerts, and topology mapping, and we construct a coherent view of all of the services in your system and their current state. Are they healthy or not? Did they change recently? And once you want to deep dive into a specific service or a specific team, you get a full activity timeline of everything that changed across your toolchain and affected that specific service. So, in a single click, you can identify suspicious changes and take action to solve them. In the next couple of slides, we’re going to see a real example of how one of our users used Komodor to troubleshoot a real alert in their system a few weeks ago. It will give you a glimpse of the capabilities that we offer our users.
So, it started with a very common alert you probably all know: high latency in one of the backend services. Usually, at this point, the alert is routed to Slack, where the whole mess begins. People will ask, “Does anyone see something in the logs?” or “Who just deployed something?” You’ll probably ask people to help you figure out whether there was a recent change in production. We want to spare you those moments of chaos by offering the relevant context you need to start troubleshooting with a data-driven approach.
So, for that, we built a very lightweight Slack application that, from the alert, tries to understand which Kubernetes service is actually affected. In this example, it’s the service named authenticator. It also surfaces what happened elsewhere in the system that might be relevant or correlated to this specific service. In this example, the bot found a few things, and the one that gets the on-call developer’s attention is a deploy that happened right before the alert, on the same authenticator service. Obviously, that’s a good place to start troubleshooting.
So, the user clicks “Show me more details on the service in Komodor.” What you see here is all of the events we collected from Kubernetes, GitHub, Datadog, etc., that affected this specific service, the authenticator service, in the last hour.
What’s interesting is that you can see the alert that started this troubleshooting session, and the deploy that started and ended right before it. You can also notice another interesting thing: right after this deploy, the Kubernetes service became unhealthy, and it even says that there are not enough replicas available for this service to be healthy. Obviously, that makes the deployment even more of a suspect: there was a deploy, right after it Kubernetes became unhealthy, and Datadog sent the relevant alert.
At this point, we probably want to deep dive into what happened in that deploy, and in a single click, this is exactly what Komodor provides. It gives you the ability to see, automatically, for each one of your deploys, what happened in the code layer. You can see the PRs and commits that changed in GitHub. You can also see the logs of the relevant deploy job itself, the one that deployed this specific change; in this example, it’s a Buildkite job. You can check the logs to see if there’s anything interesting there.
More than that, because we have a very native integration with Kubernetes, you can even click on the diff and automatically get a diff-like view of everything that changed in the Kubernetes manifest before and after each deployment. So, basically, nothing can escape.
So, here, it’s very easy to see that only two things changed in the Kubernetes layer. The image changed, because there was a code change. But you can also very easily notice that the number of replicas decreased from 50 to 5, meaning someone thought it was a great idea to reduce the number of replicas by 90% for this specific service.
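As an illustration, a manifest diff of that kind might look roughly like the fragment below. Only the replicas change and the fact that the image changed come from the example; the registry and image tags are made up:

```diff
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: authenticator
 spec:
-  replicas: 50
+  replicas: 5
   template:
     spec:
       containers:
       - name: authenticator
-        image: registry.example.com/authenticator:v1.4.2
+        image: registry.example.com/authenticator:v1.5.0
```

Seen side by side like this, the 90% replica cut is hard to miss, which is exactly why a before/after manifest view shortens the hunt.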
Now, at this point, it’s very obvious to the on-call developer, who, if you remember, was trying to find the root cause of the high latency in the service. They not only found a specific deploy that happened right before the alert, but they now realize that this deploy reduced the number of replicas by 90%. This is probably the root cause of the symptom they’re seeing.
So, they probably want to take action. Luckily for them, Komodor also allows users to remediate the most common issues by taking actions that can solve the issue end to end. We offer users actions from a pre-selected menu, like ‘scale up’ or ‘revert’, or they can run more sophisticated playbooks that we also support.
In this example, the user chose to scale the number of replicas back up to 50, basically restoring the situation to the state before the alert hit. So, this gives you an end-to-end example of how our users use Komodor on a daily basis to troubleshoot their systems, and how much time and expertise Komodor saves even for a very simple use case.
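Outside of a tool like Komodor, the equivalent manual remediation would be something like the following kubectl commands. The deployment name and namespace are assumed from the example:

```shell
# Scale the deployment back up to its previous replica count
kubectl scale deployment/authenticator -n production --replicas=50

# Or roll back the deploy itself. Note that `rollout undo` restores the
# pod template (e.g. the image) but NOT spec.replicas, since replica
# changes are not part of a Deployment's rollout history -- so the
# explicit scale command above is still needed either way.
kubectl rollout undo deployment/authenticator -n production
kubectl rollout status deployment/authenticator -n production
```

The point of a one-click ‘scale up’ or ‘revert’ action is precisely that the on-call developer doesn’t have to remember these distinctions under pressure.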
So, if you’re wondering how long it takes until you can get your hands on this, the good news is it’s only 5 minutes away. To get Komodor and start seeing the value I just showed you, only two things are needed. First, we install a Kubernetes agent on your system, which takes roughly a few minutes; it’s a very standard Helm chart that you can install in two commands, and it’s very lightweight in its resource requirements. Besides that, to support GitHub, Slack, PagerDuty, Datadog, Centreon, and all of our other data integrations, we have read-only, non-intrusive integrations that take just a few clicks to set up. From that moment, you can already see the value I showed you earlier, and much more.
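As a sketch, a two-command Helm install of an agent typically looks like the following. The repository URL, chart name, release name, and flags below are placeholders, so take the exact values from Komodor’s installation docs:

```shell
# Placeholder repo URL, chart name, and values -- consult the official
# installation docs for the real ones.
helm repo add komodorio https://example-charts.komodor.invalid
helm install komodor-agent komodorio/komodor-agent \
  --set apiKey=<YOUR_API_KEY> \
  --namespace komodor --create-namespace
```

This is the standard `helm repo add` + `helm install` pattern that most cluster agents use.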
So, if you want more information, please visit our online booth at the Cloud-Native Virtual Summit, where you can meet our online reps and they can answer any question you might have. Also, please feel free to check out our website and social media for more content and updates on what we do. And if you feel you’re ready to troubleshoot Kubernetes like a pro, we offer all of our users a free trial that you can start using today. Please go to komodor.com, sign up, and ask for a free trial; we’ll get back to you and make sure you can start troubleshooting with ease. Thank you so much!