Below are the main highlights from a recent Clubhouse talk featuring Elad Aviv, a software engineer at Komodor. The session was hosted by Kubernetes heavy-hitters Mauricio Salatino, Staff Engineer at VMware, and Salman Iqbal, Co-founder of Cloud Native Wales.
How do you figure out what’s gone wrong while using Kubernetes?
As a developer building cloud-native applications, I tend to look first of all at some frameworks and projects in the CNCF space. It’s all about observability, right? Understanding what’s running in your cluster, whether it’s running correctly, and how you expect it to be running, for example in terms of CPU or memory consumption, to make sure that your applications are not doing something weird.
And then, for example, looking into projects like OpenTelemetry, which can give you insights into the communication between the different services, and whether something is really slow on the network side or something is crashing.
So in general, first of all I will check that all my containers are running in Kubernetes. Those run inside what are usually called pods, so you take a look at the pods and make sure that they are running, and that they are not failing in a very simple way, for example that the readiness probes and liveness probes are not failing.
And then I would try to look from a slightly higher level, maybe with some kind of dashboard, to understand what is running and, if there are failures, which things are constantly failing and causing my applications to go down.
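The first check described above (are the pods running, and are their containers ready?) can be sketched in a few lines. This is a minimal illustration, not a replacement for a dashboard: it assumes input shaped like the output of `kubectl get pods -o json`, and the sample data and the helper name `unhealthy_pods` are made up for the example. It also treats any phase other than `Running` as a problem, which would wrongly flag completed Jobs.

```python
import json

# Made-up sample in the shape of `kubectl get pods -o json` (heavily trimmed).
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "web-1"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "web", "ready": true, "restartCount": 0}]}},
  {"metadata": {"name": "web-2"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "web", "ready": false, "restartCount": 7}]}},
  {"metadata": {"name": "worker-1"},
   "status": {"phase": "Pending"}}
]}
""")


def unhealthy_pods(pod_list):
    """Return (pod name, reason) pairs for pods that are not Running
    or that have a container whose readiness check is not passing."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase != "Running":
            problems.append((name, f"phase is {phase}"))
            continue
        for cs in status.get("containerStatuses", []):
            if not cs.get("ready", False):
                restarts = cs.get("restartCount", 0)
                problems.append((name, f"container {cs['name']} not ready, restarts={restarts}"))
    return problems


for pod_name, reason in unhealthy_pods(SAMPLE):
    print(f"{pod_name}: {reason}")
```

A high restart count next to a container that is not ready is usually the cue to look at that pod's events and logs next.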
What are the challenges in finding and debugging issues?
You have many tools you would like to use, or need to use, to really understand your system and what its current status is. But it’s kind of hard to even know what the right status is, right?
Like, you have liveness probes and readiness probes, but you also mentioned CPU consumption and memory consumption. Sometimes you get an alert and you don’t really know whether it’s actually something bad. It might be just a fluke, or something that is not actually wrong.
And even when you have this visibility, it can be quite hard to filter the noise out of the mess that you might get. So that’s one of the biggest challenges.
To me, basically, when I look at it as a person who doesn’t really need to understand the operations level down to the implementation, it’s kind of hard to know what exactly is wrong with the current state. Why is it wrong to have 100% CPU for five minutes, once every hour? It might be okay. So that’s one of the challenges for me.
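One common way to cut down on this kind of noise is to alert only when a threshold is breached for a sustained period, rather than on every spike. The sketch below is an illustration of that idea, not something described in the talk: it assumes one CPU reading per minute, and the function name, the 90% threshold, and the ten-minute window are all arbitrary choices for the example.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if `samples` contains at least `min_consecutive`
    readings in a row that are strictly above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# One reading per minute over an hour (made-up numbers, CPU in percent).
hourly_spike = [30] * 20 + [100] * 5 + [30] * 35   # 100% CPU for five minutes
stuck_high = [30] * 20 + [100] * 15 + [30] * 25    # 100% CPU for fifteen minutes

print(sustained_breach(hourly_spike, threshold=90, min_consecutive=10))  # False
print(sustained_breach(stuck_high, threshold=90, min_consecutive=10))    # True
```

With these settings the five-minute spike from the example above stays quiet, while a service pinned at 100% for fifteen minutes raises an alert. What counts as "too long" still has to come from someone who knows the workload.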
Where do we start when we have to debug?
So usually, for an application in Kubernetes, if you’re familiar with Kubernetes, you’ll see there are lots of terms which are similar to other things. You have a pod, which is like the server that runs your application, or rather your container, inside it. On top of that, you have the Service, which acts as an internal load balancer. And then on top of that, you’ve got an Ingress, which acts as, let’s say, an external load balancer, because it’s a website.
So we’ve got these three things. They’re not the only things; we’ve also got secrets, volumes, and all that sort of stuff. But if we just focus on, let’s say, a website and these three things, then usually if I want to debug something, I’ll start at the bottom level. Is my pod running? Is the container running fine? So I’ll check that first.
Of course, I have to use kubectl commands, or I can use one of the tools that are out there. Then I’ll work my way up: I’ll check if the Service is correct, and then I’ll check if the Ingress is correct. So that’s how it is.
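The bottom-up order described above (pod, then Service, then Ingress) can be sketched as a small check over the three manifests. This is a simplified illustration under stated assumptions: the inputs are plain dictionaries shaped like the JSON that `kubectl get ... -o json` returns, the Ingress uses the `networking.k8s.io/v1` layout, and the function names and sample objects are hypothetical. It only checks that the pieces point at each other, not ports or endpoints.

```python
def selector_matches(selector, labels):
    """A Service selects a pod when every selector key/value is in the pod's labels."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def debug_bottom_up(pods, service, ingress):
    """Walk pod -> Service -> Ingress and return a list of findings."""
    findings = []
    svc_name = service["metadata"]["name"]

    # 1. Bottom level: is any pod actually running?
    running = [p for p in pods if p["status"].get("phase") == "Running"]
    if not running:
        findings.append("no pod is Running")

    # 2. One level up: does the Service selector match a running pod?
    selector = service["spec"].get("selector", {})
    selected = [p for p in running
                if selector_matches(selector, p["metadata"].get("labels", {}))]
    if not selected:
        findings.append(f"service {svc_name} selects no running pod")

    # 3. Top level: does the Ingress route to this Service?
    backends = {
        path["backend"].get("service", {}).get("name")
        for rule in ingress["spec"].get("rules", [])
        for path in rule.get("http", {}).get("paths", [])
    }
    if svc_name not in backends:
        findings.append(f"ingress {ingress['metadata']['name']} does not route to service {svc_name}")
    return findings


# Made-up objects: the Service selector has a typo ("website" instead of "web").
pods = [{"metadata": {"name": "web-1", "labels": {"app": "web"}},
         "status": {"phase": "Running"}}]
service = {"metadata": {"name": "web"}, "spec": {"selector": {"app": "website"}}}
ingress = {"metadata": {"name": "site"},
           "spec": {"rules": [{"http": {"paths": [
               {"path": "/", "backend": {"service": {"name": "web", "port": {"number": 80}}}}
           ]}}]}}

print(debug_bottom_up(pods, service, ingress))
```

Here the pod is fine and the Ingress points at the right Service, so the only finding is the selector mismatch in the middle layer.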
How do you solve your debugging issues?
So the hypothesis of Komodor, and I hope we can all relate to it, because I hope we’re right about it, is that you usually want to find out when something started and what was changed at that point.
Like, you get a bunch of incidents or a bunch of alerts from PagerDuty, or Datadog, or whatever tool you use to monitor your production environment. And when you know that some application has been in a bad state for a few hours now, or a few days, or whatever, you really want to know what change caused this issue. You ask yourself: when did it start, and what happened at that point?
And that’s what we’re really trying to solve here. So for me, I try to use Komodor like this: when we get an alert on one of our services, we can get it through Komodor as well, because it actually pipes the alerts through Komodor and shows them on the timeline.
And the point is, I can actually relate incidents to some kind of change and try to work my way out from there: understand the logs better, or just revert the thing and go back to sleep if it’s 2am. So that’s the flow I try to use, obviously because I have Komodor available in my organisation, and it works well.
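The underlying idea, relating an alert to the changes that happened shortly before it, can be illustrated with a small sketch. This is not Komodor's API or implementation; the function name, the event fields (`at`, `what`), the one-hour default window, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta


def changes_before(alert_time, changes, window=timedelta(hours=1)):
    """Return the changes made within `window` before the alert fired,
    most recent first, as candidate causes to inspect or revert."""
    candidates = [c for c in changes
                  if alert_time - window <= c["at"] <= alert_time]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)


# Made-up change events, e.g. collected from deploys and config updates.
changes = [
    {"at": datetime(2021, 5, 1, 0, 10), "what": "config map update"},
    {"at": datetime(2021, 5, 1, 1, 40), "what": "deploy checkout v42"},
    {"at": datetime(2021, 5, 1, 2, 30), "what": "deploy cart v7"},
]
alert_time = datetime(2021, 5, 1, 2, 0)

for change in changes_before(alert_time, changes):
    print(change["at"].isoformat(), change["what"])
```

The deploy that landed twenty minutes before the 2am alert is the only candidate inside the window, which makes it the first thing to look at or roll back.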
Does everyone in charge of fixing Kubernetes have access to the cluster?
In general, they do not have access to the cluster; they will probably only have access to the monitoring dashboard. And even if they have access to that, what they are really going to be interested in is the alerting mechanisms that will notify them when something goes wrong, right?
And, for example, having a message in Slack saying that your production cluster is consuming too many resources or doing something crazy, or that the service is not working, is really useful for these teams that are monitoring the production environment 24/7, and definitely not going in and changing stuff manually with kubectl. They need nice visibility and alerting mechanisms to make sure that we understand when things are going wrong.
What metrics should be checked?
You should check, for example, that the new version of a service that you just deployed is not slower than the previous one.
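A check like this can be sketched by comparing a latency percentile between the two versions. The example below is an assumption-laden illustration: the choice of the 95th percentile, the 10% tolerance, the function names, and the latency samples are all made up, and in practice the samples would come from whatever monitoring tool is in use.

```python
from statistics import quantiles


def p95(latencies_ms):
    """95th percentile of a list of latencies (needs at least two samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(latencies_ms, n=20)[-1]


def is_slower(previous_ms, current_ms, tolerance=0.10):
    """True if the new version's p95 latency exceeds the previous
    version's p95 by more than `tolerance` (10% by default)."""
    return p95(current_ms) > p95(previous_ms) * (1 + tolerance)


# Made-up request latencies in milliseconds for the two versions.
previous_version = list(range(100, 200))
new_version = [x * 1.5 for x in previous_version]

print(is_slower(previous_version, new_version))       # True
print(is_slower(previous_version, previous_version))  # False
```

Comparing a high percentile rather than the average keeps a handful of very slow requests from being hidden by many fast ones.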
Then, of course, that the service is not crashing, or, for example, that you have imports all over the place, that kind of stuff. I think in the end we are going to define these recipes, so that when you install these metrics and create the alerts for them, you know how to proceed and how to pick out the problems.
And I do see the Kubernetes community evolving into that space now, where we create best practices and then create solutions for common issues.
What is your experience with service meshes?
I tend to stay away from service meshes, because I’ve had unfortunate experiences trying to debug what’s going on when things don’t work. I try to stay away from complexity, if possible.
But we should definitely discuss it at some point, and we’ve got quite a few up there. And yeah, it does definitely help. I don’t know who mentioned it, but it does definitely help with tracing, because they will give you that.
I agree with sweetie, with what she said a few minutes ago: everything should start with making sure your application is actually developed in a cloud-native way. So, you know, following the twelve-factor app principles: everything is logged correctly, all the requests are logged correctly, and all those things are fine. So yeah, we definitely will not service most of that.
What is the main pain point Kubernetes solves?
With Kubernetes, you can actually give everyone the ability to do anything they want. So I really think that’s the point where we want to be as a community: where everyone can have this autonomy and be independent enough to actually own incidents and issues and fix things in their organization, without having to rely on the teams that you are in.
Like, you can provide the infrastructure, while developers like me can use that infrastructure to actually fix stuff. So yeah, that’s the main point.
How do you control who gets access to which cluster?
I’m just going to say that it all comes down to user accounts and what kind of permissions we’ve given them within Kubernetes.
Like what kind of cluster roles we’ve given them, because you can define roles within Kubernetes. And you can have these roles live outside; you can use whatever, AWS IAM or whatever it is.
And then you define beforehand what permissions each group of users has, and you implement that in your cluster as a Kubernetes resource.
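The idea of defining roles up front and binding groups of users to them can be modeled in a few lines. This is a deliberately simplified toy model, not how the Kubernetes authorizer works: real RBAC rules also cover API groups, namespaces, and resource names, and live in Role, ClusterRole, and binding resources. The role names, group names, and the `is_allowed` helper are hypothetical.

```python
# Roles: a named list of rules, each granting some verbs on some resources.
ROLES = {
    "pod-reader": [
        {"resources": {"pods", "pods/log"}, "verbs": {"get", "list", "watch"}},
    ],
    "deployer": [
        {"resources": {"deployments"}, "verbs": {"get", "list", "update", "patch"}},
    ],
}

# Bindings: which roles each group of users has been given.
GROUP_BINDINGS = {
    "developers": ["pod-reader"],
    "platform": ["pod-reader", "deployer"],
}


def is_allowed(group, verb, resource):
    """True if any role bound to `group` has a rule granting `verb` on `resource`."""
    for role_name in GROUP_BINDINGS.get(group, []):
        for rule in ROLES[role_name]:
            if verb in rule["verbs"] and resource in rule["resources"]:
                return True
    return False


print(is_allowed("developers", "get", "pods/log"))      # True
print(is_allowed("developers", "patch", "deployments")) # False
print(is_allowed("platform", "patch", "deployments"))   # True
```

In this model developers can read pods and their logs but cannot change deployments, while the platform group can do both. Anything not explicitly granted is denied, which matches the additive nature of RBAC rules.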