Making Peace With The Grim Reaper – Liveness & Readiness Probes Done Right

Anaïs Urlichs

Developer Advocate, Aqua Security

Guy Menahem

Solution Architect, Komodor

You can also view the full presentation deck here.

[Transcript]

Anais: Hi everybody, hello. Thank you so much for joining this live stream here on the Aqua open-source YouTube channel. For those who are new to this channel, my name is Anais, and I’m the open-source developer advocate at Aqua.

Usually on this channel we have content related to our open-source projects, which include Tracee, Starboard and Trivy. But today we have a special guest, with a presentation that’s not specific to our open-source projects. So, everybody, please welcome Guy to the live stream. Hello.

Guy M.: Hello, thank you. It’s good to be here.

Anais: Give an overview of who you are.

Guy M.: I’m Guy from Komodor, and today we are going to talk about liveness and readiness probes, and it’s going to be exciting and interesting. I hope so.

Anais: Awesome. So we have a presentation, right? And then we have a live demo as well. To make it easier for people who are watching the recording afterwards to follow along and skip to the parts they are most interested in, here is what we’re going to do.

We’re going to run through the presentation all in one go, with two points where we stop to take questions and have a conversation about the content. Then we’re going to have the demo, also all in one go, followed by another Q&A and more conversation.

So yes, feel free to interact in the chat. I’m going to be in the chat during the entire presentation, so you can ask questions there and we will take them at the points when we open up for questions. Great, do you want to share your screen?

Guy M.: Yes, sure.

Anais: Awesome. Welcome, Engin, thank you so much for joining. He’s everywhere. Okay, awesome. I will put your screen up, great.

Guy M.: Yes. So today we are going to talk about making peace with the Grim Reaper: liveness and readiness probes, how to get them right, and how to make them work for you. Basically, Kubernetes is some kind of Grim Reaper. On one hand it is always looking for a way to kill our pods and collect their souls; on the other hand it wants to keep us safe and preserve the balance of application availability.

We really want to make sure that we know how to interact with it and how Kubernetes is going to act. Sometimes we want to keep our application safe, and at other times we want to sacrifice our pod. If you want to learn how to act on, configure, and code these liveness and readiness probes, stay with us today.

So what we are going to cover today is: what are probes? What can possibly go wrong? Why should you care about probes? Then best practices, and, as I said before, a live demo of how you can test your own readiness and liveness probes and play with them a little before you implement them. It’s a kind of educational tool for you.

So first of all, a little introduction about myself. I’m Guy, I’m a solution architect. Before that I was a developer, and I’ve been around the DevOps area for a long time, something like 10 years. I work at Komodor, a startup building the first Kubernetes-native troubleshooting platform, which means I get to hear about many customers’ troubleshooting issues and things that are not configured well, and we have a lot of knowledge that we want to share with you today about how to get them right.

I really love infrastructure, and one thing that has been with me for a long time is that I want to sleep well. If we want to sleep well, we want to plan well and make sure everything is in place. So, talking about probes: probes are Kubernetes’ way of determining our application’s health, whether it’s alive and whether it’s ready.

Based on that, Kubernetes knows how to act with our application. It can stop the traffic to it, and in some cases it can restart the container or the pod, so that if something is recoverable by a restart, we can recover pretty fast. How is it done? With probes, the kubelet periodically goes to each one of the containers inside the pod and asks: are you alive? Are you ready?

And if the answer is no, the kubelet is going to either restart our container, or stop the traffic, or do some weird things to our containers that we don’t want. We definitely want to understand this process and how it works, and make sure everything is fine, because we don’t want Kubernetes to kill our pods. We have two types of probes, or actually three, but we are going to focus on only two of them today.

First of all is the liveness probe. Liveness is a way to tell Kubernetes whether our container is alive or whether we want it restarted. A readiness probe is more about when we want requests to be sent to that pod; it’s the readiness status that you see in most kubectl get commands, whether you look at a deployment or a specific pod. And last we have the startup probe. Startup probes are usually a way to make sure your application is ready after the initial setup.

We are not going to cover it today, and if you really want to deep dive into that, you can send me a private message on LinkedIn, Twitter or anywhere you want.

So we understand in general what probes are, and that the kubelet is going to periodically ask our containers, but what does it really look like under the hood? Every so often the kubelet is going to ask our container: are you alive? And if we are in the happy flow, our container is going to answer yes.

The next thing is that the kubelet is going to ask the container: are you ready? If the answer is yes, the kubelet is going to mark the container as ready, and if all the containers in the pod are ready, the pod is ready and traffic is routed to it. This is the happy flow; usually you are in the happy flow, and this is where you should be most of the time. But what happens when something goes wrong? What happens when your application is misbehaving, or one of the services that you’re working with is down?

Then the kubelet is going to ask the container: are you alive? And if the answer is no, the kubelet is going to restart this container. Actually, when you run kubectl get pods and you see the restart count, this is the count of the kubelet restarting your containers. It’s not a restart of the whole pod, just a restart of that container, which is a little misleading when you look at the output of the command.

The second thing is that after the kubelet marks your container as alive, it asks the container: are you ready? If the answer is no, the kubelet is going to mark this container as not ready, so the whole pod is not ready, and no traffic is going to be sent to this pod.

This is the bad flow, and we want to understand it pretty well, because we are going to use that life cycle, understand how the kubelet is going to interact with our containers and pods, and take advantage of this behavior of Kubernetes.
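
As an illustration of the two flows Guy describes, here is a minimal Python sketch of the decision logic. It is a simplification and not the kubelet's actual implementation: the names `ContainerState` and `apply_probe_results` are made up for this example, and it ignores failure thresholds and timing.

```python
from dataclasses import dataclass


@dataclass
class ContainerState:
    """Hypothetical bookkeeping for one container (illustration only)."""
    restarts: int = 0
    ready: bool = False


def apply_probe_results(state: ContainerState, alive: bool, ready: bool) -> str:
    """Apply one round of probe answers and return the action taken."""
    if not alive:
        # Bad flow, liveness: the container (not the whole pod) is restarted.
        state.restarts += 1
        state.ready = False
        return "restart container"
    if not ready:
        # Bad flow, readiness: the container stays up but gets no traffic.
        state.ready = False
        return "mark not ready, stop traffic"
    # Happy flow: alive and ready.
    state.ready = True
    return "mark ready, route traffic"
```

Calling it with `alive=False` increments the restart counter, which mirrors the restart count mentioned for `kubectl get pods`.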

So you may ask yourself: why should you care? Why should you care about probes? Why should you care about this kind of under-the-hood deep dive and flow? First of all, all these things have a direct impact on your application availability. If your container is receiving traffic and Kubernetes decides to restart it for some reason, then users are not going to get a response.

And if your containers are not ready, users are not going to be served. So any type of the flow, when something is going wrong, you have direct impact on application availability, and you really need to take care of that. And the second thing is sometimes things are going to go wrong.

No pod is always alive and always ready. Something is going to go wrong, and you want to have breadcrumbs and hints that make it easier for you to troubleshoot. Make sure that you have all the hints in place, so that when you get into an application availability issue or incident, you have the tools that help you troubleshoot.

When talking about what can go wrong, we divided it into three buckets, three kinds of situations where something is wrong and you need to know how to act. The first is undetected downtime: the application is running, everything seems ready, Kubernetes routes traffic to it, but actually the container and the pod are down, returning errors all of the time.

We can look at the example just below. The user asks Kubernetes for some web page, and Kubernetes asks the container: are you alive and ready? Let’s say this container is faulty, but it still responds yes. So Kubernetes is going to route the traffic to it, which is obviously what we want it to do.

But when the request enters that container, the container returns something like a 500 error, which we obviously don’t want to reach the users, and the user is actually going to leave the system. We call it undetected downtime because the container is down and Kubernetes has asked it whether it is ready and alive, but at the same time users are still being sent to that container.

We call this “you forgot to call the Grim Reaper”. Because when something is faulty in this container, you want to sacrifice this pod and make sure no requests go to it. You want to make sure that the Grim Reaper comes in time, that you call it in time, so that you preserve the balance of healthy pods in the cluster and your application runs well.

So this is the first thing. The next is unwanted downtime, which means that the application runs well, but for some reason Kubernetes restarts it continuously because it thinks it’s not alive, or doesn’t send traffic to it. Why is that so important? Because unwanted downtime means that your application is fine, but Kubernetes doesn’t think it’s fine, and it is going to act with its own automation to help you fix that, and it’s actually going to crash your application and hurt your users.

So when your application runs well, you want to make sure that all the probes pass. In this case your pod is actually fine, but you sacrificed it to the Reaper. We obviously don’t want to sacrifice any innocent pod to the Reaper, so you want to make sure that the liveness and readiness probe results match the actual state.

And the last thing is unexpected behavior. It means that we planned some kind of behavior, we configured it, well or not so well, and we have code to match the state, but for some reason it doesn’t behave as we expected. Unexpected behavior is when we think that we are ready, but actually we’re not.

Or we think we are alive, and actually we’re not. It means that our application is not consistent, and it’s hard for us to predict how it is going to act, so you actually don’t know when to call the Reaper. And if you don’t know when to call the Reaper, you don’t know when to sacrifice a pod and when you should make the Reaper keep your pod’s soul safe.

So in that case, you definitely want to know what the behavior is, how to expect it and how to act on it. So the three things are undetected downtime, unwanted downtime and unexpected behavior. Now it’s question time. Do we have any questions from the audience?

Anais: Let me just come up. There are no questions yet. If you have a question, feel free to post it in the comments. Engin was sharing his experience of setting liveness probes equal to readiness probes. And yes, you just have to subscribe to the channel if you want to get involved in the chat.

You can have been subscribed for any amount of time, and then you can get involved. Yes, feel free to post any questions that you have. Maybe you can elaborate on the problem in Engin’s example: what will happen if you set the liveness probe equal to the readiness probe?

Guy M.: What happens if we set both of them to the same endpoint?

Anais: Yes, just to elaborate for everybody.

Guy M.: Yes. So actually, one of the best practices that we are going to talk about covers what happens when you configure both of them to the same endpoint. The problem with that is that you cannot distinguish between the different conditions: are you alive? Are you ready? Are you both, or neither?

It means that Kubernetes is not going to act as you expect, so always keep them on different endpoints. We are going to talk about how you should configure them well, to make sure you distinguish between the states.

Anais: Awesome. So yes, if you have any questions, feel free to post them. Are you actively using liveness and readiness probes? Because for a lot of my demos, I don’t. Yes, we have all been there, Engin, we’ve all done something like that.

For like the first three months, learning about Kubernetes, I didn’t know the different types of services, I didn’t understand the differences, like some things take time more than others, but yes, awesome. Also make sure to follow Komodor on Twitter, and yes, let’s jump into the live demo.

Guy M.: We will cover I think the best practices first.

Anais: Okay, there’s more in the presentation. Okay, let’s go back to the presentation.

Guy M.: Yes, cool. And if you have any questions later, you can always DM us or ask us on Twitter or LinkedIn, and we will be happy to help you overcome these liveness and readiness probe challenges. Moving to best practices, we really want to make sure that you have all the tools and the cookbook to make sure that your liveness and readiness probes are configured properly and that you can use them well.

So we want to take advantage of this life cycle. This life cycle is not that simple, but not that complex either, and we want to make sure that we understand each step of it and how to use it. There are three steps that we really recommend. The first is relief. If a request gets to a faulty pod, something is wrong. So first of all we want to cut off all the requests, which means making sure the readiness probe fails as fast as it can and the pod is marked as not ready.

The next thing is that we want to remediate the issue. We want to use automatic restarts, a mechanism that comes out of the box, because maybe a restart can fix it. It’s better to wake up in the morning and see that your container has made a lot of restarts than to wake up in the middle of the night and see that all the containers of your application are not ready.

Sometimes it’s not going to help you, but when it does, it’s going to save you a lot of time. So you want to remediate the issue as fast as you can.

And the last thing is to investigate. You really want to make sure that you have all the breadcrumbs and hints to pinpoint the root cause. If your code knows the state of the application and knows what is wrong, then when something goes wrong and you go into the system trying to figure out what happened, you want everything to be in the logs to point you to the root cause.

We find that this makes the difference for applications that are down and sending error codes to users, and it reduces the MTTR. So if you relieve, remediate and investigate, it means that you have either already solved the issue or you understand what the issue is pretty well and pretty fast.

The next thing is to adapt the probes to your service and system. No two systems are alike, which raises a somewhat controversial question: whether I have only web requests, or a worker-type application, or an event-driven application, how can I use the probes?

In my opinion, there is a use and an implementation of probes for any kind of application. You just need to adjust them to your own needs, and you really want to take advantage of that. So if you have a different type of application, don’t think: okay, probes are made for web applications or web services, and my application is a little bit different. Think about how you can adjust them to your own usage. There are many examples that you can use.

So this is the first part of the recipe, and how to think on the way to success. The next thing is to really plan what you want to use. The liveness probe, as we said, determines whether the pod is alive, and when we are alive we want to make sure that everything is loaded and runs fine.

Why is it so important to understand the state for liveness? Because if something is not loaded, or one of the processes of our application is down and not answering any request, even internally, we want to make sure that Kubernetes restarts our container and it comes back alive. And the next thing is that we want to recover.

So if we are alive and we want to move to ready, we want to make sure that we can either recover or we are going to be restarted. Why is that so important? Because there are transitions between liveness and readiness. Sometimes your application is not ready for a while, so it is alive but not ready, and it’s going to recover eventually.

For example, if you are using external services such as a database or a queue, you may be unable to connect to them for some time. It can be a node issue, it can be a networking issue, it can even be a database issue. But you want to make sure that if you are not ready because you could not connect to that database or queue, then while you stay alive the application retries the connection, and you can recover from that external failure.

That will make your application really reliable, and you can leave it as it is: even if there is a disruption in one of your services, it can recover quickly and easily. So you can see that any state can be recoverable, and there are transitions between the states. The next thing is readiness: you want to make sure that you can serve the users. A user can be an end user, someone at home on a web or mobile application, or it can be a different service that uses your service.

And if you cannot serve these users, do not answer OK to the readiness probe; return an error. Only you understand the business case, what the service should do, and what the main job of the service is. Only you can know what good service looks like for your users, so you need to plan it in advance and understand when you should fail and when you should not.

The second thing is external failures, as we said previously, such as databases and queues. You want to make sure that you fail immediately when you cannot connect to them, because if you stall on connections for a long time or wait for timeouts, you are going to impact your application availability, and any kind of external failure that produces errors means that something is wrong.

So make sure you fail on external failures. If you plan it well, I think you have made most of the progress: you understand what you want to do, you can picture what a good plan is and how to take advantage of this life cycle. And then we need to configure it.
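
To make "fail immediately on external failures" concrete, here is a small Python sketch of a dependency check that a readiness handler could call. It is only one possible approach, assuming a plain TCP connection attempt is a good enough signal; the function name `dependency_reachable` and the one-second default timeout are choices made for this example.

```python
import socket


def dependency_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    A short timeout keeps the readiness answer fast instead of stalling
    on a dependency that is down or unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

A readiness endpoint could return an error status whenever this check fails for a dependency the service cannot work without.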

Configuring readiness and liveness probes is pretty easy and straightforward. You can see an example on the right side, which has very long initial delays and thresholds, so don’t copy its values, but use its structure to configure yours. You can configure a lot of things, like thresholds, delays and the period between requests. You can use an HTTP GET, but you can also use gRPC or a TCP connection.

This is very extensible, and you want to configure it based on your business and service needs. If you planned well, you will immediately understand what you need to put into that configuration. And the last thing is that we want to add a specific endpoint for each probe.
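
The slide itself is not reproduced in this transcript, so as a stand-in here is a Python sketch that builds the probe section of a container spec and prints it as JSON (Kubernetes manifests can be written in JSON as well as YAML). The field names (`httpGet`, `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `failureThreshold`) are the standard probe fields; the container name, image, port, paths and chosen values are assumptions for illustration, not the values from the talk.

```python
import json


def http_probe(path, port, *, initial_delay=0, period=10, timeout=1, failure_threshold=3):
    """Build an HTTP GET probe; the keyword defaults mirror the Kubernetes defaults."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay,
        "periodSeconds": period,
        "timeoutSeconds": timeout,
        "failureThreshold": failure_threshold,
    }


# Hypothetical container: name, image, port and paths are placeholders.
container = {
    "name": "probe-demo",
    "image": "example.com/probe-demo:latest",
    "ports": [{"containerPort": 8080}],
    # Liveness: a separate endpoint, with more room before a restart.
    "livenessProbe": http_probe("/healthz", 8080, period=10, failure_threshold=3),
    # Readiness: a separate endpoint, checked often and failed quickly.
    "readinessProbe": http_probe("/ready", 8080, period=2, failure_threshold=1),
}

if __name__ == "__main__":
    print(json.dumps(container, indent=2))
```

Probes are set per container, so a multi-container pod would repeat this structure for each container.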

Someone asked about that, and distinguishing between the states is very important. Why? Because if you mix the two states, you are missing something about the configuration and the life cycle of Kubernetes and your pod.

As we said previously, if you really want the Grim Reaper, or Kubernetes, to help you, and you want to take advantage of this life cycle, you need to distinguish between the states. Giving each state its own endpoint, its own configuration and its own abilities really makes the difference: it helps you understand what’s wrong and helps you avoid errors and troubleshooting.
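
A minimal sketch of "one endpoint per probe", using only the Python standard library. It is not the demo application from the talk: the paths `/healthz` and `/ready`, the `AppState` class and the `start_server` helper are assumptions made for this example. Each endpoint reports its own state, and a failing probe is logged so there is a breadcrumb to follow later.

```python
import json
import logging
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

log = logging.getLogger("probes")


class AppState:
    """Hypothetical application state; real code would derive these flags."""

    def __init__(self):
        self.alive = True   # process is healthy enough to keep running
        self.ready = True   # process can serve users right now


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":      # liveness endpoint
            self._reply(STATE.alive, "liveness")
        elif self.path == "/ready":      # readiness endpoint
            self._reply(STATE.ready, "readiness")
        else:
            self.send_error(404)

    def _reply(self, ok, name):
        if not ok:
            log.warning("%s probe failing", name)
        body = json.dumps({"probe": name, "ok": ok}).encode()
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence the default per-request access log


def start_server(port=0):
    """Start the probe server in a background thread and return it."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Setting `STATE.ready = False` makes `/ready` return 503 while `/healthz` keeps returning 200, which is the "alive but not ready" state that a shared endpoint cannot express.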

So this is about configuration. We are going to show a sample of that in the live demo, and it’s really easy to configure. Just make sure you remember that the configuration is per container, and not per pod. So if you have a multi-container pod, you need to make sure that you have these two endpoints for each container in the pod.

After we’re done configuring, or in the meantime, we really want to code it. So we go back to the plan and work out what the state is. What state do we want to report for each kind of probe? Is the service alive? Is the service ready? What is the main job that has to be done? What are the external services that we are using?

Maybe we have three external services, like a database, a queue and configuration management, or, I don’t know, authorization management, that we cannot work without. And maybe we have two services that we don’t strictly need, for example metrics or some kind of user tracking, where if there is a failure we don’t want to fail our service and stop responding to the user.
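
The split between dependencies the service cannot work without and those it can could be coded along these lines. This is a sketch under assumptions: `evaluate_readiness`, the `checks` mapping and the dependency names are invented for the example, and each check is simply a callable that returns a truthy value when the dependency is healthy.

```python
import logging

log = logging.getLogger("readiness")


def evaluate_readiness(checks, critical):
    """Return True if every critical dependency check passes.

    checks:   mapping of dependency name -> zero-argument callable
    critical: set of names whose failure must fail readiness
    Failures of non-critical dependencies are logged but do not fail readiness.
    """
    ready = True
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        if ok:
            continue
        if name in critical:
            log.error("critical dependency %s is failing", name)
            ready = False
        else:
            log.warning("optional dependency %s is failing", name)
    return ready
```

For example, with `critical={"database", "queue"}`, a failing `metrics` check leaves the service ready, while a failing `database` check does not.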

So basically, we want to understand the state, code that understanding of the state, and really make sure we distinguish between the states. The second thing is: can we auto-recover, or do we need a little push? We can take advantage of Kubernetes and this restart life cycle, and we might use it as an extra push from Kubernetes.

Or maybe we want to code a lot of retries and auto-recovery, and maybe exit at some point to force Kubernetes to restart us. It depends on the configuration, but you need to understand whether you can recover on your own or you need a little push.
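
One way to combine retries with the "little push" is to retry a few times and then exit, so that the container is restarted. The sketch below assumes a caller-supplied `connect` function that raises `ConnectionError` on failure; the function name, attempt count and linear backoff are illustrative choices, not something prescribed in the talk.

```python
import time


def connect_with_retries(connect, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call `connect` up to `attempts` times, then exit the process.

    Exiting with an error hands recovery over to the container restart
    mechanism instead of retrying forever.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # simple linear backoff
    raise SystemExit(f"giving up after {attempts} attempts: {last_error}")
```

The `sleep` parameter exists only so the waiting can be replaced in tests.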

And the last thing about coding is that you want to log everything. This is really what makes the difference when you have trouble: you troubleshoot faster, solve it faster, and you don’t spend two hours troubleshooting and working with a lot of people to understand the issue. Is it an application issue? Is it an infrastructure issue? That takes a lot of time.

So logging can really make the difference and save you a lot of time. And that’s it. We talked first about the cookbook and how we want to think, then what the plans are and how we should plan the [Inaudible 00:27:32.16], how we can configure them, and last, what we want to code. So do we have any questions? Does someone have questions?

Anais: Well, we have one question by Engin actually. What if you have to call expensive endpoints? Such as if you have to check that everything is working correctly with your database? I think that’s how I understand the question.

Guy M.: Do you mean, how do I check whether something is wrong with my database right now?

Anais: The way I understand it, instead of checking the liveness endpoint, maybe there are alternative ways of checking whether the service or container is up and running. Maybe you want to elaborate on that. And then: what is the sweet spot between waiting for the next call on failure and the delay threshold?

Guy M.: Okay, yes. Something I did not mention, but it’s good to understand, is that Kubernetes is going to run the liveness and readiness probes at the same time. It means that your container is going to get requests on the liveness endpoint and on the readiness endpoint at the same time.

What you usually do is make the readiness probe threshold very tight and very low. If there is a failure and you can detect it in one or two seconds, you don’t want to wait five more seconds while all your users get errors. So usually the threshold for the readiness probe will be very tight. For sure it depends on your service and your application, but it will be very tight.

For the liveness probe we might give it some more time, because the liveness probe usually impacts the workloads of the application. You can keep it much longer, just to let the application auto-recover, because auto-recovery is usually faster than startup, it’s safer, and there are things in the cache that you want to take advantage of. So usually the liveness probe threshold will be much larger.

I don’t think it should be something like a two-minute liveness probe. I think you should tighten the readiness to a few seconds and make the liveness a little bit longer, maybe 10, 20 or 30 seconds, depending on your use case and your application availability needs. I think that can be a good fit for most applications.
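
To get a feel for these numbers, here is a rough back-of-the-envelope estimate in Python of how long a failure can go unnoticed. It assumes that detection takes `failureThreshold` consecutive failed probes spaced `periodSeconds` apart, plus one probe timeout; this is an approximation for planning, not an exact description of kubelet scheduling, and the function name is made up for the example.

```python
def worst_case_detection_seconds(period_seconds, failure_threshold, timeout_seconds=1):
    """Rough upper estimate of the time from a failure to the probe being marked failed."""
    return period_seconds * failure_threshold + timeout_seconds


# Tight readiness: probe every 2 s, fail on the first miss.
readiness_delay = worst_case_detection_seconds(period_seconds=2, failure_threshold=1)

# Looser liveness: probe every 10 s, restart after 3 misses.
liveness_delay = worst_case_detection_seconds(period_seconds=10, failure_threshold=3)
```

With these example values, traffic is cut off within a few seconds while a restart only happens after roughly half a minute, which matches the ranges mentioned above.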

Anais: And then Engin elaborated on a previous question. Some calls for readiness can be costly, for example on the database. So when the database is down, I want the readiness probe to fail too. So?

Guy M.: Yes, sure. If I understand the question correctly: when you get a readiness probe, you really want to make an actual, active call to the database. If you are able to make that call, it means that you have an end-to-end test of that database connection. It can even be a simple demo table that you just select all the values from.

If you get a result back, it means that the database is working and you have the connection. If you want to be more precise, you can query the actual tables or database that this service is responsible for. That would be better. Then you can actually understand, during the readiness probe, what type of application it is and how a database failure can impact your probes. I hope I understood that; if not, we can talk afterwards.
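
A cheap active check of this kind can be as small as a single trivial query. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for whatever database the service really talks to; the function name `database_ready` and the `SELECT 1` query are choices made for this example.

```python
import sqlite3


def database_ready(conn: sqlite3.Connection) -> bool:
    """Return True if a trivial query round-trips on the given connection."""
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        # Any database error (including a closed connection) means not ready.
        return False
```

A stricter variant could query a table the service actually owns, as suggested above, at the cost of a slightly more expensive probe.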

Anais: I would assume, just to add to that, that in those specific scenarios, when you really want to ensure that your database is working, you wouldn’t want to do that through readiness and liveness probes. I assume that your database is not necessarily a StatefulSet in Kubernetes; for most people, it’s an external data source. So the container that you run within Kubernetes calls that external endpoint and makes the connection to the database.

If it then discovers that nothing is returned, I think that should be caught beforehand, for example through Prometheus or another alert management tool; there should be more rigorous checks than liveness and readiness probes. Those would be my thoughts, but yes. Awesome, that’s a great question, thank you for asking. If anybody else has a question, subscribe to the channel and then you can hop into the chat and share your thoughts or questions. And now we have a live demo, right?

Guy M.: Yes, sure.

Anais: Awesome.

Guy M.: So a few things about the live demo. What we are going to show you is a kind of educational tool that you can use to test your liveness and readiness probes, and just play and understand how they work, because usually it’s hard to get that from your own application. So we have this repo. I pre-made a repo that is open source, so feel free to use it; we are going to send the link in the chat.

It’s a really simple application: the only thing it does is answer liveness and readiness probes. And we can turn off the probes of the application from our terminal or through an external endpoint. What I really want to show you is how you can play with it. So I have the repo cloned here, and if you take a look, we have a Dockerfile; it’s a simple application with Docker, and you can push the image to your registry for your cluster.

And then we have two deployments, one without the probes and the other with the probes. Taking a quick look at the one without the probes, it’s a really simple application, and there is no real need for any advanced configuration here. What I’m going to show you next is that you can take this YAML to an open-source project by Komodor called ValidKube.

What ValidKube can do is help you validate your YAML. It can check that the YAML is valid, clean it, and run a security audit on it; I will show that shortly. So we can see that the YAML is fine, and it’s pretty clean because I did not use any export. We have this Trivy tab, which Anais helped us to integrate. Here we can see all the security aspects of this deployment.

I made this a simple and not secured deployment for the demo. But you can put your YAMLs in here and make sure they are fine and secure. Anais, if you want to add something.

Anais: Just something from myself, yes. ValidKube is a really great tool that you can use ad hoc to scan the containers in your Kubernetes manifests. So if you don’t want to go ahead and run Trivy commands in the CLI, or you want to use one of the other scanning tools, you can do that pretty quickly through ValidKube, so you don’t have to use the tools separately.

It’s also a way to play around and get started with tools such as Trivy. Yes, it’s a great tool. I put it in the chat, so you can see it there as well. And Komodor is going to post the repository that you referenced.

Guy M.: Yes. And feel free to star it and contribute; I know some people made PRs for it and added new tools to it, so it’s a pretty cool open-source, community-based project. What I’m going to do next is apply the deployment without the probes, okay. So we apply it, and I’m going to open a port forward to it so that I have a connection from outside the cluster to the pod. So we have the port forward, and now we are going to test the liveness probe.

I will go with the watch one, so here we got a test of the liveness probe of that container specifically, and here, we are going to watch the readiness probe. And we can see that both of them are fine, the liveness is okay, readiness is okay. And we can take a look in Komodor and see that we got this new service in the system.

We can see it’s healthy, which means it’s ready and alive, and all the replicas are ready. We also have this best practices section where we can see where this service misses the best practices. Clicking on it, we can see that we have a critical issue that the liveness probe is missing, and we also got a warning that the readiness probe is missing.

So we understand that this service is not following best practices, which is really bad. The next thing I want to do is turn off the probes, one at a time. And as you remember, I will go with readiness first.

So immediately after turning off the readiness probe, you can see that the readiness probe started to fail. We might guess that at this point, if we had probes configured, the container would not be ready, and the pod would not be ready. But because we did not configure them, when we check the pod, we can see that it’s still ready, but it doesn’t serve any user.

What we’re going to do next is actually turn off the liveness probe, which means the application is really broken. After turning off the liveness probe, we can see that the liveness is in error as well. But when taking a look at the pod again, it seems like it is still ready and there are no restarts in the restart count.

And we can check the events to see if there are any events around the liveness probe. But because we didn’t configure the probes, it means that right now the application is down, yet it seems healthy to Kubernetes and seems healthy in Komodor, and we really don’t want to be in that state, because we have undetected downtime, as we talked about before. So the next thing that I’m going to do is to show you the deployment with probes.

The deployment with probes uses the same image, but what we did is actually configure the readiness and liveness probes with some very long thresholds, just to make sure this demo goes smoothly and we have enough time to show you the differences. But when you configure it on your own, you can always change these values. And so what I’m going to do next is apply it to the cluster.
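To make the effect of "long thresholds" concrete, here is a minimal Python sketch of how probe settings translate into reaction time. The field names mirror the standard Kubernetes probe fields (initialDelaySeconds, periodSeconds, failureThreshold), but the numbers are illustrative assumptions, not the values used in the demo, and the calculation is only a rough estimate that ignores probe timeouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Subset of Kubernetes probe fields (values below are assumed, not from the demo)."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int


def approx_seconds_until_action(probe: ProbeSettings) -> int:
    """Rough time between the app breaking and Kubernetes reacting.

    A probe has to fail failure_threshold times in a row, with one check
    every period_seconds, so the reaction time is roughly their product.
    """
    return probe.period_seconds * probe.failure_threshold


# Hypothetical settings: a longer liveness threshold leaves more time to observe.
readiness = ProbeSettings(initial_delay_seconds=5, period_seconds=10, failure_threshold=3)
liveness = ProbeSettings(initial_delay_seconds=5, period_seconds=10, failure_threshold=6)

readiness_delay = approx_seconds_until_action(readiness)
liveness_delay = approx_seconds_until_action(liveness)
```

With these assumed values the pod would be marked not ready after about half a minute of failures and restarted after about a minute. Shorter periods or thresholds detect problems faster but tolerate fewer transient hiccups.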

Okay, so I need to switch the port forward to the new pod, because a new pod spawned and now we are using the new one. Okay, so this is the new one, we have the port forward, and we can see the liveness and readiness are okay. We can go back to Komodor and actually see that. First of all, let’s check the pod; we can see that it’s ready and has not been running for long. Going back to Komodor, we can see that it’s healthy, and the number of replicas is one of one.

And we reduced the number of critical issues and warnings in the best practices. So if you remember, we saw before that there was a critical issue because of the readiness and liveness probes. And right now, both of them are configured, so we are following best practices, and we know that we are safer than before. And I guess you are asking yourself what’s going to happen when I turn off the readiness probe. So first of all, what I’m going to do is turn off the readiness probe. Let me just clear the screen, I think it will be much easier to see.

Okay, readiness probe: we can immediately see that the readiness probe failed. We can see that the liveness is probably still fine. And when we use kubectl get pods, we need to wait some time because the threshold is set very high. Okay, and now you can see that our pod is not ready but running, there are no restarts, and the readiness probe is down. Right now, Kubernetes is not going to send traffic to that application if you use a Service with it.

Because I’m port forwarding directly to the pod, I still have access to both of them. But if you had five replicas of the same pod, and only one of them were not ready, and I examined the readiness probe through the Service, I would not get this error at all. It means that I communicated with the Grim Reaper, we made sure that this pod is not part of the system, and everything is fine for our users.
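The five-replica scenario can be pictured with a small Python sketch. This is a simplified model of the idea that a Service only routes to ready pods, not how Kubernetes is implemented; the Pod class, the pod names, and the service_endpoints function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    ready: bool  # result of the readiness probe


def service_endpoints(pods: list[Pod]) -> list[str]:
    """Return the pods a Service would route to: only the ready ones."""
    return [pod.name for pod in pods if pod.ready]


# Five replicas, where the readiness probe of app-2 is failing.
pods = [Pod(name=f"app-{i}", ready=(i != 2)) for i in range(5)]
endpoints = service_endpoints(pods)
```

Here app-2 keeps running but receives no traffic through the Service, so users only reach the four healthy replicas. A direct port forward bypasses this filtering, which is why the failing probe is still visible in the demo.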

So going back to Komodor, we are unhealthy, with zero out of one replicas ready, and we have this health event going on in the system: not enough ready replicas, a readiness probe might fail. Okay, now we are fine. So that state can be cool for us. But what’s going to happen if I turn off the liveness probe as well?

So turning off the liveness probe, and now we can see we got an error on the liveness probe. So after some time, as we said before, the Grim Reaper is going to come to that pod and kill it with SIGTERM. After that kill, it is going to be restarted, and from that point, the service is going to be back, ready and alive. So we just need to wait for the liveness probe threshold to pass, and now you can see it.

Liveness is okay, readiness is okay, the application is fine, users get traffic. And when running kubectl get pods, there is a little delay for the readiness until the application is really ready. Sometimes we configure the initial delay to be a little bit longer, to allow for warm-up of the application, especially after failures. And so let’s wait for it, but you can see that the restart count increased.

It used to be zero, we failed the liveness probe and now it’s one, and then the readiness probe turned back on. And our application is healthy and ready. So it’s pretty fine to get that understanding. And what you should do next is actually go into the repo; you can clone it on your own.

You can use this liveness and readiness probe demo on your own, just to understand the context and understand the behavior, and then if you don’t have probes in your application, then it’s a good time to add them and maybe learn a little bit more.
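The restart behavior seen in the demo, where the restart count went from zero to one, can be sketched in Python as follows. This is a toy model under stated assumptions: the Container class, the record_liveness_result function, and the failure threshold of 3 are illustrative and are not taken from the demo repository or from the kubelet.

```python
from dataclasses import dataclass


@dataclass
class Container:
    restart_count: int = 0
    consecutive_failures: int = 0


def record_liveness_result(container: Container, healthy: bool,
                           failure_threshold: int = 3) -> bool:
    """Record one liveness check; return True if it triggers a restart."""
    if healthy:
        container.consecutive_failures = 0
        return False
    container.consecutive_failures += 1
    if container.consecutive_failures >= failure_threshold:
        # The "Grim Reaper" step: the container is killed and restarted.
        container.restart_count += 1
        container.consecutive_failures = 0
        return True
    return False


container = Container()
# One healthy check, three failures in a row, then healthy again after the restart.
results = [True, False, False, False, True]
restarts = [record_liveness_result(container, ok) for ok in results]
```

Only the third consecutive failure triggers the restart, which is why the demo needed some waiting time before the restart count changed.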

Anais: Stay there, go back to Komodor. I tried to jump in. So we have a question from Spacecadet: can Komodor let me know if I misconfigure my probes? What other best practices can I verify?

So it tells me clearly if I don’t have readiness and liveness probes set in my configurations. Does it also tell me if I’ve set something wrong, like if my manifest is just wrong by itself? I guess that’s what they’re asking about, right? And then yes, maybe you can walk us afterwards through some of the other things that Komodor can do, since there’s interest.

Guy M.: Yes. So we have many checks, for example if you don’t have enough replicas, so you don’t have high availability in your deployment. Here we deployed just one pod, but if you’re running in production, you definitely don’t want that. We check your limits and CPU, the image pull policy, and whether the latest tag is used or no tag is specified, which can harm your application and make you run inconsistent images.

So it’s pretty simple, and we are going to add more checks to that. It’s really good, as a DevOps engineer, to understand that your YAML files are following best practices, and if you’re using some third-party tool, that it is also following best practices. And if you are an SRE or any other engineer responsible for your own YAML, you can use that to understand that you are following best practices more closely, and you can avoid errors and mistakes.
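The kinds of checks Guy lists can be sketched as a small Python function over a Deployment manifest loaded as a dictionary. This is an illustrative assumption of what such checks might look like, not Komodor’s implementation; the check_deployment function, the finding messages, and the example/app image name are all hypothetical.

```python
def check_deployment(manifest: dict) -> list[str]:
    """Return best-practice findings for a Deployment manifest (as a dict)."""
    findings = []
    spec = manifest.get("spec", {})
    # Kubernetes defaults to one replica when the field is omitted.
    if spec.get("replicas", 1) < 2:
        findings.append("single replica: no high availability")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                findings.append(f"{name}: {probe} is missing")
        if "limits" not in container.get("resources", {}):
            findings.append(f"{name}: resource limits are missing")
        # Look for a tag only in the last path segment, so a registry
        # port such as registry:5000/app is not mistaken for a tag.
        image = container.get("image", "")
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.rsplit(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            findings.append(f"{name}: image tag is missing or 'latest'")
    return findings


# A deployment similar in spirit to the first one in the demo: no probes at all.
demo = {
    "spec": {
        "replicas": 1,
        "template": {
            "spec": {"containers": [{"name": "app", "image": "example/app:latest"}]}
        },
    }
}
findings = check_deployment(demo)
```

Running a check like this before deploying gives the same kind of signal as the best practices panel, only earlier in the workflow.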

Anais: I quickly want to jump in. Komodor is great when you have already deployed the application, right? Komodor is tracking your running cluster, and it’s providing you with great suggestions on best practices, like: you shouldn’t just run one replica, because if it’s down, your application’s down, things like that. If you want to check your application before you deploy it, you can use Trivy from Aqua Security.

Trivy is not only used to scan for vulnerabilities within your containers, such as is done here, where it’s directly scanning your manifest for vulnerabilities. You can also use Trivy, with the trivy config command, to scan your infrastructure as code, such as Kubernetes manifests, for misconfigurations.

So it’s telling you if you’ve misconfigured something, like before you deploy it, if the Yaml looks odd basically. So then you can deploy it and see like are there better best practices that you should add to it, right? Awesome. Thank you for that question.

Guy M.: There’s a link to Trivy, so anyone can use ValidKube with Trivy in place, or can download it from the repo here.

Anais: Amazing. If you have questions, jump into the chat and let us know and yes. Just have to subscribe, and then you can get involved in the chat, awesome.

Guy M.: Yes, sure. So thank you everyone for listening. If you want to learn about Komodor, you can definitely go to our website or follow me or Komodor on Twitter, and it’s good to have you here.

Anais: I just wanted to know, who came up with the metaphor? Like who came up with this? Because it’s brilliant. It tells such a nice story and I just wonder like did you come up with it? Like all kudos to you.

Guy M.: So actually, Woody Hoppers and I worked on this presentation, and my initial idea was this kind of doctor killing his patient, if they have like something wrong.

Anais: That’s dark.

Guy M.: That is the Grim Reaper.

Anais: Amazing. Thank you so much for joining, and thank you, Guy, so much for giving the presentation. I know it was given once before; you gave it at a cloud native meetup, right?

Guy M.: Yes.

Anais: Now we have it on YouTube as well, which is amazing. And on the open-source channel, great. So if there’s still any questions, I’m sure you can reach out to Guy directly on Twitter. What’s your Twitter handle?

Guy M.: The_good_guym.

Anais: Let me post it, I have it here. Let me just post the link, so people can follow you on Twitter as well.

Guy M.: Yes, it would be nice. And if you have questions, feel free to follow me or DM me. I really like to hear people’s issues and help them if I can. This is why I do what I do.

Anais: That’s brilliant, awesome. Thank you so much for taking the time and coming on to chat.

Guy M.: Yes, thank you for hosting me, like it’s a great opportunity.

Anais: Yes, definitely. If there’s somebody watching who wants to present about a topic, which could be anything related to Kubernetes, then please let me know and yes, we are happy to host. Amazing, thank you.

Have a lovely day everybody, and yes, let us know your thoughts on the chat if you have any other questions, comments, if you would like to see more of these kinds of streams. Yes, thank you so much, Guy, again.

Guy M.: Thank you. Thank you everyone for joining.

Anais: Thank you. Bye-bye.
