How To Empower Devs to Troubleshoot K8s Independently

Rona Hirsch
DevOps Engineer @Komodor

Rona Hirsch: Hi, everyone, welcome to my session here at DevOpsDays Tel Aviv. I’m super excited to be here. Today I’ll be talking about how DevOps can empower developers to troubleshoot Kubernetes independently.

This is actually my first time presenting, so please be kind. Before we get started, I would be happy to introduce myself. My name is Rona Hirsch. I’m a software developer and a DevOps engineer at Komodor. We’re a small startup that’s been creating a lot of buzz over the past year and a half now. We’re building the first Kubernetes native troubleshooting platform, which helps both developers and ops teams to troubleshoot Kubernetes efficiently and in a more empowered way.

Prior to this, I worked as a DevOps engineer for 5 years where I was developing and automating DevOps processes. If you ask me, I’m very passionate about Kubernetes and automation – otherwise I wouldn’t be working at Komodor right now. Specifically, I’m very passionate about this topic and I hope you find some of the insights and best practices that I’m about to share very useful and interesting.

So, without further ado, let’s get started. Essentially, I want to take you into a story, or more like a scene maybe. So, imagine that right now, it’s 3:20 in the morning. It’s probably dark out and not like the way it is right now. Suddenly you hear this PagerDuty alert, and unfortunately, you’re on call – that means that you have to immediately wake up, get out of bed and start investigating what the hell happened.

Now, there may be 2 possible scenarios here. The first one is that it’s a pretty straightforward problem and solution – you find out what the issue is in like 2 minutes. You fix it right away, you go back to sleep, and everything is good. The second option is not that pretty: it takes you the whole night to troubleshoot, trying to figure out what the issue is, and only after 4 hours do you find the problem, fix it, and have no time left to sleep. So, which one would you prefer, the 2-minute troubleshooting process, or the very, very long 4-hour process?

The bottom line here is that troubleshooting Kubernetes can be very complex. And why is that exactly? Well, there are a few reasons for it. Let’s dive into them.

So first of all, there are a lot of blind spots, right? I mean, changes are often unaudited or done manually, and you can’t really tell who did what or when. On top of that, your data is fragmented all over the place. Each person’s stack here is different. But if we go back to our PagerDuty alert, what you’ll do is you will go and look at your logging solution, your monitoring solution, your CI/CD, your repos, your feature flags, and so on. You’ll try to connect the dots and make sense of it all… and that’s not easy. Also, there’s the butterfly effect: at the end of the day, 1 minor change in 1 service can affect other services dramatically and you can’t really pinpoint where it’s coming from.

Now, Kubernetes complexity also comes in terms of lack of knowledge. If you look at your R&D teams, you’ll see that not everyone has the same amount of knowledge. Some are maybe experts. Some have more expertise. Some are maybe beginners or have never even worked with Kubernetes before. There are a lot of gaps there. In addition, there’s the issue of lack of permissions and access. People often complain that they don’t have access. And how can you troubleshoot efficiently if you do not have the required access? And what happens if you do have access, but you’re lacking the Kubernetes knowledge? What do you do then? Can you troubleshoot efficiently? Probably not.

And that brings me to the whole main point of this presentation – how can DevOps simplify Kubernetes troubleshooting for developers? And how can we bridge the gap between them? So, I’ll be giving you 5 best practices to help simplify the day-to-day troubleshooting process. I’ll start with the most important one maybe, and that’s monitoring. So, basically, if you don’t monitor your system, you really don’t have anything, right? You can’t troubleshoot. You can’t do anything. So, as a side note, if you haven’t picked a monitoring solution yet, now is the time to do that. There are a lot of options from open-source projects to managed ones. So, just do your research and pick one based on your level of expertise, and your requirements.

So now that we all have a monitoring solution, we need to start thinking about continuous monitoring. That means you do not want to monitor only your production environment, you always want to monitor all of your environments. This will give better insights to you and your devs about what’s going on in your system, and how frequent code changes are affecting your environment.

The next point may be basic, but a lot of companies nowadays run on cloud environments and are using their managed services. It’s really, really important to monitor them all because you want to know if your pods are crashing given that Elasticsearch is low on disk space and you need to add more nodes to it, or if your messages are stuck in a queue, for example, and maybe you need to add more replicas to your service in order to handle them. So, pay attention to that and just monitor them all.

Next, it’s really important to choose the right metrics for you. And what does that mean? What do I mean by right metrics? It means – don’t just monitor something because everyone else is monitoring it, or because it maybe sounds sexy to monitor latency, for example, but you don’t really know what that means and you don’t really know how it helps you. Monitor things based on your business logic, and things that make sense to you and your product. That will decrease the level of noise that you have – and noise is something that we really want to avoid because eventually, if you and your devs are having a lot of noise, you’ll end up muting your Slack channel, and then you’ll just miss all the relevant alerts. So, don’t do that.

My next point ties up really nicely with metrics, and that’s dashboards and specifically building useful dashboards. So, what is a useful dashboard? Is a useful dashboard one that contains a lot of graphs inside and has a lot of information? Or is a useful one a dashboard that contains fewer graphs, but each one of them is precise and you know that, once it fluctuates or spikes, something is wrong in your system? So, think about the dashboards that you’re using, and let your devs know how to use them and what their purpose is, and give them all of the information.

And last but not least is – don’t keep the monitoring tasks ‘til the end. We always tend to do that until it really blows up in our face, because now something is not working, production is down, and everyone is asking, “Why aren’t you monitoring this?” And then you remember to take the task out of the backlog and do it. So, it’s important to build a strong foundation from the get-go. It will save you a lot of money and time in the future.

Okay. Moving on to another important subject that is closely related to monitoring, and that’s logging. So, logging, as many of you know, is a really good tool for troubleshooting different things in your apps. When talking about logging in the Kubernetes landscape, you don’t need to give all your devs access to kubectl or production clusters. You can use logging in a controlled environment where different people have different access permissions. So, you don’t need to set up crazy RBAC or permissions on Kubernetes to make that happen.

So, my first tip here will be to centralize all your logs into one location and manage the access from there. My second tip will be to optimize your logs by tagging and labeling them. It can be as easy as the service which is currently running, the version you’re running, some cluster information such as which node you are running on, which environment, the cluster name maybe, and all business-specific data, such as which account is currently making the request or which client, and so on. Essentially, everything that applies to you in troubleshooting.

Let me give you a small example from our actual logs of Komodor. You’ll see here that we’ve added a lot of metadata to our logs, including some of the things that I’ve mentioned now. That really helps us when troubleshooting because we can just filter out the relevant logs pretty quickly and easily and we don’t need to waste a lot of time going through irrelevant logs.
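The slide with the Komodor logs is not part of this transcript, so the following is only a minimal sketch of the tagging idea, using Python's standard `logging` module. The field names and values (`service`, `version`, `cluster`, `node`, `account_id`, and so on) are illustrative assumptions, not Komodor's actual schema.

```python
import json
import logging
import sys

# Hypothetical metadata. In a real pod these values would typically be read
# from environment variables rather than hard-coded.
STATIC_CONTEXT = {
    "service": "billing-api",
    "version": "1.4.2",
    "environment": "staging",
    "cluster": "staging-eu-1",
    "node": "node-a",
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with shared metadata attached."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            **STATIC_CONTEXT,
            # Per-request, business-specific fields passed via `extra=`.
            **getattr(record, "context", {}),
        }
        return json.dumps(entry, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes tagged JSON lines to `stream`."""
    logger = logging.getLogger("tagged-logs-example")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log.info("payment failed", extra={"context": {"account_id": "acct-123"}})
```

Because every line carries the same keys, a centralized logging tool can filter on something like `service` plus `account_id` instead of searching free text.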

Now, we can’t really talk about Kubernetes without talking about YAML files. So, Kubernetes is based on YAML files. With them, we actually describe the desired state of our cluster or resources, if you will. So, for example, if I want to deploy a new application, I would just write a YAML file, I would specify which Docker image I want to use, how many replicas I want, the configuration, and so on.

Now, for those of you who worked with YAML files before, you know that they can be hideous, but we can use them to our advantage when troubleshooting. How do we do that? If we look at the example here, at the top in the labels section, we’ll see that, the same as with logging, there’s just a lot of metadata here. And everything is listed, such as the team, even the git commit. So, using labels and annotations can really help you when troubleshooting.
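The slide itself is not reproduced here, so as a rough sketch of the same idea, the snippet below builds the `metadata` section of a manifest as a Python dictionary (kubectl accepts JSON as well as YAML). The `app.kubernetes.io/*` keys are the common labels recommended in the Kubernetes documentation; `team`, `environment`, and the `example.com/git-commit` annotation are hypothetical names chosen for illustration.

```python
import json


def deployment_metadata(name, team, version, git_commit, environment):
    """Build the metadata section of a Deployment manifest."""
    return {
        "name": name,
        "labels": {
            # Common labels recommended by the Kubernetes docs.
            "app.kubernetes.io/name": name,
            "app.kubernetes.io/version": version,
            # Hypothetical organization-specific labels.
            "team": team,
            "environment": environment,
        },
        "annotations": {
            # Hypothetical annotation key; annotations suit non-identifying data.
            "example.com/git-commit": git_commit,
        },
    }


def selector(labels):
    """Format labels as a selector string for `kubectl get ... -l <selector>`."""
    return ",".join(f"{key}={value}" for key, value in labels.items())


if __name__ == "__main__":
    meta = deployment_metadata(
        name="billing-api",
        team="payments",
        version="1.4.2",
        git_commit="0a1b2c3",
        environment="staging",
    )
    print(json.dumps(meta, indent=2))
    print(selector({"team": "payments", "environment": "staging"}))
```

With labels like these in place, something like `kubectl get pods -l team=payments,environment=staging` narrows an investigation to one team's workloads.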

On top of that, what can really help to know when something is wrong is using environment variables, config maps, and secrets. Let’s talk for a second about environment variables and config maps.

So, config maps are basically key-value pairs in Kubernetes. What a lot of companies do is just list the environment variables in plain text inside a YAML file. And then, when trying to separate environments such as production and staging, or staging and dev, they just copy/paste the app’s YAML file – that creates a problem because it’s hard to track pod spec changes across the copies. What I would suggest is taking all the environment variables out of the YAML file and storing them inside a config map, and then you’ll just have 2 sets of config maps and 1 YAML file to keep track of.
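One way to picture this suggestion, as a sketch only: two ConfigMaps that share a name but hold different values, and a single Deployment that pulls its environment variables in through `envFrom`. The manifests are built as Python dictionaries here; the application name, image, and settings are all hypothetical.

```python
import json

# Hypothetical per-environment settings that would otherwise be hard-coded
# as `env:` entries in two copies of the same Deployment YAML.
ENV_SETTINGS = {
    "staging": {"LOG_LEVEL": "debug", "API_URL": "https://api.staging.example.com"},
    "production": {"LOG_LEVEL": "info", "API_URL": "https://api.example.com"},
}
CONFIG_MAP_NAME = "my-app-config"


def config_map(environment):
    """One ConfigMap per environment; same name, different data."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": CONFIG_MAP_NAME},
        "data": dict(ENV_SETTINGS[environment]),
    }


def deployment():
    """A single Deployment shared by all environments."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "my-app"},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": {"app": "my-app"}},
            "template": {
                "metadata": {"labels": {"app": "my-app"}},
                "spec": {
                    "containers": [
                        {
                            "name": "my-app",
                            "image": "registry.example.com/my-app:1.4.2",
                            # Every key in the ConfigMap becomes an env variable.
                            "envFrom": [{"configMapRef": {"name": CONFIG_MAP_NAME}}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    for env in ENV_SETTINGS:
        print(json.dumps(config_map(env), indent=2))
    print(json.dumps(deployment(), indent=2))
```

Each ConfigMap is applied to its own cluster or namespace, so a diff of the Deployment shows only real pod spec changes and never environment-specific values.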

And the last thing I’ll say about YAML files is probes usage. So, probes are probably one of the most basic concepts of Kubernetes. If you define and use them correctly, they can make it a lot more enjoyable to use Kubernetes. So, the liveness and readiness probes actually tell your cluster if your service is, one, ready to receive any requests or traffic, and two, if it’s still alive and can continue receiving requests. If it isn’t ready, the readiness probe will fail and Kubernetes will stop sending it traffic; if it isn’t alive, the liveness probe will fail and Kubernetes will just restart the container. So, I think that what I’ve mentioned here in this section is really important when defining YAML files. You can encourage your devs to keep up with these standards and write better, clearer YAML files with more useful data for troubleshooting.
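To make the two probes concrete, here is a minimal sketch of a service exposing separate liveness and readiness endpoints, together with the matching probe settings written as a Python dictionary. The paths `/healthz` and `/readyz`, the port, and the timing values are assumptions chosen for illustration; the probe field names (`httpGet`, `periodSeconds`, `failureThreshold`) are standard Kubernetes fields.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    alive: bool = True   # False once the process is in an unrecoverable state
    ready: bool = False  # True once startup work (caches, connections) is done


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status code a probe request should receive."""
    if path == "/healthz":  # liveness: a failure leads to a container restart
        return 200 if state.alive else 503
    if path == "/readyz":   # readiness: a failure removes the pod from traffic
        return 200 if state.alive and state.ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    state = AppState()

    def do_GET(self):
        self.send_response(probe_status(self.path, self.state))
        self.end_headers()


# Matching container-level probe settings, shown as a dictionary.
PROBES = {
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
    "readinessProbe": {
        "httpGet": {"path": "/readyz", "port": 8080},
        "periodSeconds": 5,
    },
}


def serve(port: int = 8080) -> None:
    """Start the probe server; blocks until interrupted."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the two endpoints separate lets a service report "alive but not ready yet" during startup, so it is held out of traffic without being restarted.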

Now, I want to talk a bit about stateless versus stateful applications. For those of you who don’t know, Kubernetes was built for stateless applications. It is based on the premise that everything should be volatile, and not to worry about persistency. I think it’s a lot easier to write stateless applications. You don’t ever need to fear any loss of data or data corruption. When building stateful applications, you’ll always be in fear of what happens if, for example, you now need to scale up your service and add 5 more replicas to it. Are you using some kind of persistency or volumes? Do you have to spin up more volumes to use? What about your service? Can it handle it now that it has 10 instances of it instead of 5? Is it synchronized? When building stateless applications, you can just do whatever you want. You can scale up thousands of replicas and you’ll be fine.

So, what I’ll say here is encourage your devs to write stateless applications. But if you are using stateful applications, in a lot of cases, just let the devs know what it means to run stateful applications on Kubernetes, what they should be expecting, what the common errors are, etc. That really goes a long way when troubleshooting.

My next and final topic here will be separating environments. So, I would always suggest to keep dev and production environments separated and not on the same cluster. Just think about when you’re getting an alert from dev, it’s usually not a big deal. But when you’re getting an alert from production, it’s like, “Oh, my gosh, oh, my gosh, I have to fix it right away.” Right? So, just keep the 2 apart. And, in my opinion, always keep the production environment isolated.

So, how do we do that? We have 2 options. One is using separate clusters, or even separate accounts. Kubernetes does a really good job at managing different clusters and contexts. But there’s always the issue of cost. So, another option is using namespaces. Namespaces are a mechanism for separating groups of resources within a single cluster. So, it’s a really common way to segregate environments. You can choose whichever option is good for you, and it will be fine.
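As a small sketch of how the two options can combine, the snippet below maps each environment to a kubectl context (a cluster) and a namespace, with production isolated on its own cluster. The context and namespace names are hypothetical; `--context` and `--namespace` are standard kubectl flags.

```python
from typing import List, NamedTuple


class Target(NamedTuple):
    context: str    # kubectl context, i.e. which cluster to talk to
    namespace: str  # namespace inside that cluster


# Hypothetical layout: dev and staging share one cluster and are separated
# by namespace, while production lives on its own isolated cluster.
TARGETS = {
    "dev": Target(context="shared-cluster", namespace="dev"),
    "staging": Target(context="shared-cluster", namespace="staging"),
    "production": Target(context="prod-cluster", namespace="my-app"),
}


def kubectl_command(environment: str, *args: str) -> List[str]:
    """Build a kubectl command pinned to one environment's cluster and namespace."""
    target = TARGETS[environment]
    return [
        "kubectl",
        "--context", target.context,
        "--namespace", target.namespace,
        *args,
    ]


if __name__ == "__main__":
    print(" ".join(kubectl_command("dev", "get", "pods")))
    print(" ".join(kubectl_command("production", "get", "pods")))
```

Pinning the context and namespace explicitly in tooling means nobody depends on whatever their current kubectl context happens to be when an alert fires.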

The best practices that I’ve shared with you today are really the foundation, at the end of the day, of what I think is needed in order to make the troubleshooting experience for developers a bit easier and more intuitive. But let’s not forget that it’s not only these best practices that count, it’s also the enablement work that you do with your devs, getting the buy-in on the solution that you’re choosing, training them on the tools that you’re using, and iterating best practices together with them in terms of how to troubleshoot efficiently, etc. All of this really goes a long way. All of these best practices, plus the joint work and enablement work that you do with devs, are the secret sauce for making the troubleshooting experience for developers very good, confident and easy.

And with that, I’d like to thank you all for listening to this session today. Just before you go, my awesome company, Komodor, is running a daily raffle here at DevOpsDays Tel Aviv. So, if you’re feeling lucky today, please make sure to sign up and stand a chance to win. You can see my fellow Komodorians walking around the Expo floor, because they are wearing these awesome t-shirts. So, best of luck and have a great day.