Chaos & Order: Breaking and Fixing Things in K8s Environments w/ Gremlin

You can also view the full presentation deck here.

Udi: Hi, everyone, thank you for joining us. So, we’re going to take a couple of minutes just to let everybody join in, and then we’ll begin.

Julie: Hi, everyone, I’ve been telling the team all about the snow that we’ve been getting this morning. So far, it’s just been 2 inches, but it’s supposed to snow here pretty much straight through until Christmas.
Udi: And over on the Komodor side, we had a really sunny day.
Julie: Very nice.

Udi: So, we are all about opposites today. Okay, so I think we’re ready to start. So, hi, everyone. I’m Udi, DevRel & Community Manager at Komodor, and welcome to the Chaos & Order webinar, where we’re going to break things and then fix them. So, before we begin, I’ll just introduce our 2 speakers and breakers and fixers for today. From Gremlin, we have Julie Gunderson. She’s a Senior Reliability Advocate. And from Komodor, we have Rona Hirsch, a DevOps and Software Engineer currently at Komodor. And what we’re going to do is showcase Gremlin’s chaos engineering platform and then Komodor’s Kubernetes troubleshooting platform. And then we’ll get to the fun part, where Julie will inject chaos and break things in front of your very eyes, and then Rona will use Komodor’s platform to quickly find the root cause and fix the issue live. And we’re going to begin in a minute. If you have any questions, we’re going to have a short Q&A at the end of this webinar. So, drop your questions below and we’ll get to them at the end. So, Julie, Rona, take it away.

Julie: Alright, Rona, I’ll pass it over to you.

Rona: Okay. Thank you.

Julie: You want to walk us through a little.

Rona: So, hi, everyone, and welcome. I’m Rona. I’m a software engineer at Komodor. And I’m going to show you a quick overview of our SaaS-based, Kubernetes-native troubleshooting and observability platform that enables both developers and ops teams to independently and efficiently troubleshoot incidents by tracking changes across your entire Kubernetes landscape.

So, let’s talk first about why it’s so hard to troubleshoot. So, first of all, there are a lot of blind spots. Right? I mean, changes are often unaudited or done manually, and you can’t really know who did what and when. On top of that, your data is fragmented all over the place. I mean, once you have an issue, you have to go and look at your logging solution and your monitoring solution, your CI/CD, your repos, your feature flags, and so on. And you have to gather all the data from all these tools and make sense of it all. And that’s not easy.

And also, there’s always the butterfly effect: at the end of the day, 1 minor change in 1 service can affect other services dramatically, and you can’t really pinpoint where it’s coming from. So, that’s the complexity of troubleshooting in Kubernetes today. So, what troubleshooting in Kubernetes looks like today is that, whenever you get an alert from your favorite alerting tool, like PagerDuty, you have to go through all of these different tools. You probably have a lot more logos in your stack than fit into the slide. But you have to spend so much time and so many resources just to answer one simple question, “Who changed what and when?”

So, that’s where Komodor comes in. Komodor collects all changes from across your system, from every tool in your stack, gathers it all in one place, makes it very intuitive and easy to understand, and really helps you find the root cause across all your systems in a very simple way.

So, how does it work? Komodor collects all your changes across your system and presents them in an intuitive, easy-to-understand timeline, which enables you to draw contextual insights and pinpoint the root cause.

So, we have a very easy pod-based agent installation. It takes just several minutes to install, you get out-of-the-box value, and you can see all your services and your whole stack with just 1 simple installation.
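
For readers who want to try this themselves, the agent install is typically done with Helm. The sketch below is illustrative only: the repository URL, chart name, and value keys are assumptions rather than Komodor’s exact, current instructions, so copy the real command from the in-app install page.

```sh
# Illustrative sketch only -- repo URL, chart name, and value keys are assumptions;
# use the exact command shown on Komodor's install page instead.
helm repo add komodorio https://helm-charts.komodor.io
helm repo update
helm upgrade --install komodor-agent komodorio/komodor-agent \
  --set apiKey=<YOUR_API_KEY> \
  --set clusterName=<CLUSTER_NAME>
```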

Okay, so now we can just… Julie, maybe I can just show what it looks like on my end.

Julie: Absolutely. Take it away.

Rona: Thanks. Okay. So, here you can see the main services view, where we can see an aggregated multi-cluster view of your Kubernetes resources. Here, I have only 1 cluster, but you can connect several clusters and they will appear here on the left side. And here, we find services such as Deployments, StatefulSets, and DaemonSets. For each service, you can see its health status and its replica status. You can use these filters on the left side to navigate through your services. You can filter by namespace, by cluster, by health status, by kind, and by additional filters that you can customize on your own with a simple configuration.

So, let’s drill down to a specific service. For example, we’ll take the balance reader. Okay. So, this is the service view, where you can find all the data needed to troubleshoot and determine the root cause of an incident. Here, you have 3 main components. The first one is the service metadata here. The second one is the timeline of the events that happened. And the third one is the related services section, where you can see which services are related to this one and which of them might be affecting an issue on this service. Sometimes we don’t really know what’s connected to what, and which service is affecting the other. But here with Komodor, we can just use the related services and see an aggregated view of the events that happened on several services, such as we see here: we have several services, and all the aggregated events that happened on each of them.

Okay. So, from here, I want to present the events view. Okay, in the events screen, we can see events across all of our clusters and correlate between timelines. This is useful for catching events that aren’t mapped to a specific cluster and often fall between the cracks. So, you’ll see all of the events that happened on all of the clusters and all of the services, chronologically stacked on top of each other, and you can really make sense of it all here. And you can also filter, the same as in the services view: as before, you can filter the events using the filters here, or your own customizable filters.

So, let’s go over an example of 1 event that happened. So, in this event, we can see that there was a health change event. And the reason was that there were not enough ready replicas. There was supposed to be 1, but there were 0. So, we got an alert, and we got a health event of not enough healthy ready replicas. We can see here when the event started and when it ended. We can also see that the notifications were sent on a specific channel that I defined before. We can see the deployment that was related to it. And we can also go over and view the pods and logs status.
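
The same “not enough ready replicas” signal can be cross-checked with raw kubectl. A minimal sketch, assuming the Bank of Anthos deployment name balancereader and a placeholder namespace:

```sh
# READY will show something like 0/1 while the deployment is unhealthy
kubectl get deployment balancereader -n <namespace>

# Conditions and recent events for the deployment
kubectl describe deployment balancereader -n <namespace>

# Recent events across the namespace, newest last
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20
```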

Okay. So, here we have the pods and logs screen. And it shows that we have 1 out of 1 pods ready. We can see its state, that it’s running, we can see how long it has been running for, the number of containers, the node that it’s running on, and its name, of course. As we can see here, it’s the same as if I ran kubectl describe pod and specified the pod name. I can see the ‘describe’ of the pod, everything that is in the describe command. And I can see the events. I can see that it is running. And I can also see the pod logs here.
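
The pods-and-logs screen corresponds roughly to these kubectl commands (the pod name, label, and namespace below are placeholders):

```sh
# Status, restart count, and the node each pod is running on
kubectl get pods -n <namespace> -l app=balancereader -o wide

# Full describe output: container states, conditions, and events
kubectl describe pod <pod-name> -n <namespace>

# The application logs / stack traces for the pod
kubectl logs <pod-name> -n <namespace> --tail=100
```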

So, okay. So, I can see the logs here of this pod. I can see all this stack trace here and all of the application logs. So, that’s a quick overview of Komodor. And Julie, we can go back to you.

Julie: Excellent. Thanks, Rona. If I can remember to take myself off of mute. Let me go ahead and take that screenshare back. So, can you see my screen right now?

Rona: Yes.

Julie: Well, excellent. So, when we talk about chaos engineering, Gremlin is a platform that allows you to practice chaos engineering by safely injecting failures into your systems, so that you can understand how they’re going to behave. And look, failures are everywhere; you can pick a date and search for an outage, and you’re going to find something. As a matter of fact, if you picked a date like, I don’t know, last Tuesday, for example, we would have seen major outages everywhere due to that AWS us-east-1 incident that we had.

So, failures are inherent to complex systems. And when we talk about complex systems, our systems have completely changed from the monolith to the microservice, as we moved from on-prem to the cloud. And that brought some great things, like the ability to scale. But one of the drawbacks is that it made our systems so much more complex that we needed new ways to build and test our applications.

And so, when we talk about what chaos engineering is, what Gremlin is, we talk about chaos engineering and the fact that reliability is no accident. It’s a practiced way of making sure that our systems are resilient. And chaos engineering brings that together with thoughtful and planned experiments that reveal weaknesses, both in our systems, and within our human systems as well.

And we want to look at, maybe, where is our tech broken or insufficient? Like, does the user experience break? How’s monitoring and alerting working? Does auto-scaling work? And for human systems, let’s look at our alert rotation. Were our alerts working? Were the documentation and playbooks up to date? How did the escalation process work?

So, with all of that, when we talk about chaos engineering, one of the things that we’d like to talk about here at Gremlin is the fact that you’re following the scientific method. So, you’re observing your system. You’re creating a hypothesis. You’re saying that, “If I do this, we think this is going to happen. We think this is going to be the end-user experience.” And then you’re testing that hypothesis. Then you’re looking at the data that comes back to you, and you’re analyzing that. And then you repeat. And you’re making sure to share the results of those experiments out to the rest of the organization.

When you practice chaos engineering, there are a few things that you want to do. You want to start small. You want to be careful. You want to start with a single service or host, not necessarily the whole application or fleet, in a controlled environment with a ready team, and with tools that are going to allow you to see what’s going on and to troubleshoot it. And that’s what we’re going to show you later on.

Once you’ve started small, then you expand that blast radius. You also want to make sure that you can safely schedule or automate your chaos engineering experiments, so that you can expand that blast radius out. Now, we’re not going to show that to you today, because we’ve got some fun attacks that we’re planning on doing. But you can definitely learn more about scenarios and status checks over at gremlin.com.

When we talk about adopting the practice of chaos engineering, you can adopt the practice in development, so that engineers are architecting for failure and you get confident testing in development; then you move to staging. You start small in staging and then expand your blast radius. And then finally, you can move into production, where you start small and then increase. And if you think about it, it’s very similar to how you do development. So, you don’t need to overthink it. You want to work iteratively like you do with code, and move up the environments like you do with code. You already know how to do this.

And so, I just want to cover some of the different types of failures that you can run. And we’re going to go over some of these today. So, there are failures that you can start with, such as resource failures, so that your testing can progress as well. You can start with CPU, disk, memory, and IO. I mean, that’s what the cloud was built for. Does your auto-scaling work? Then you can also look at service failures. So, we’re going to run a few of these today. I’m really excited to see what happens. There are process killer attacks, or host, pod, and container shutdowns. You can even run attacks that skew your clock. So, the questions you want to be thinking about are, can your service restart itself without manual intervention? Is traffic automatically routed to the restarted service? You want to make sure your own stuff is resilient.
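
Those two questions (can the service restart itself, and is traffic kept away from it until it recovers?) map to liveness and readiness probes in Kubernetes. A minimal, illustrative Deployment sketch; the name, image, paths, and timings are placeholders and not part of the demo:

```sh
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service        # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: nginx:1.25    # placeholder image
          ports:
            - containerPort: 80
          livenessProbe:       # the kubelet restarts the container if this fails
            httpGet: { path: /, port: 80 }
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:      # the Service stops routing traffic to the pod until this passes
            httpGet: { path: /, port: 80 }
            periodSeconds: 5
EOF
```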

And then we can also look at dependency failures. So, these are tests like blackhole, DNS, latency and packet loss. So, we want to look at, what happens if a dependency is unavailable? What happens when the network is bad? Can your service handle things asynchronously? And then there’s also application failures. You can move up the stack to test the application. You can inject latency into the code. You can throw errors.

And so, when we talk about chaos engineering, what we want is continuous chaos, so that you can have confidence in your resilience to a particular failure mode. We want to automate it to prevent drift into failure. And so, today, let’s go ahead and kick off some attacks and see what we will see in Komodor and how we can use that to troubleshoot. Now, I’m going to go over to Gremlin here. We’ve got Komodor loaded. And I think we should go ahead and start off with, what do you say, maybe a blackhole attack, Rona?

Rona: Mm-hmm.

Julie: Yeah, let’s do that. So, I had to ask you a question right as you were taking a drink. So, the blackhole attack is going to drop the IP packets at the transport layer. So, let’s say we blackhole the load generator. What you’re seeing right now is the Gremlin app. And there are many different ways that you can run an attack, so we’ll do this a couple of different ways. Right now, let’s just go ahead and see. Let’s go to our services, let’s see if we see the load generator over here, which we could go that way. We’re going to go ahead and select our load generator. And we’re going to select Attack this Service. So, as you can see, we’re attacking the load generator. We’ve got 1 replica set, 1 pod, and we’re going to choose our Gremlin. And as I mentioned, we’re going to run a blackhole attack. So, we are going to block that traffic.
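
Gremlin injects this at the network level on the target. For a very rough Kubernetes-native analogue you can experiment with yourself, a deny-all egress NetworkPolicy produces a similar “traffic goes nowhere” symptom for matching pods. This requires a CNI that enforces NetworkPolicy, and the name and label below are placeholders:

```sh
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: blackhole-loadgenerator   # placeholder name
spec:
  podSelector:
    matchLabels:
      app: loadgenerator          # placeholder label
  policyTypes:
    - Egress
  # no egress rules listed, so all egress from matching pods is dropped
EOF
```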

Now, you can see right here that Gremlin is whitelisted so that it can communicate with the control plane at any time. If that connection is lost, it’s going to abort the attack, because we want to revert to safety. So, Rona, I say let’s go ahead and run this attack for 120 seconds.

Rona: Sure.

Julie: And as we do this, I’m going to unleash our Gremlin so that we can see what happens within our service. And we’re using the Bank of Anthos, which is an open-source app that you can find over at Google Cloud. And I’ll grab a link to that to share with you. So, as we can see, our attack is running. And Rona, did you want to talk about what you think is going to be happening within our Bank of Anthos app?

Rona: Yeah. So, I’m thinking that we’re going to have issues with our Bank of Anthos app. We’re not going to be able to deposit funds or send payments, and we’re going to see a lot of critical issues in the Bank of Anthos application. And if we can’t send payments or deposit funds in our bank account, we don’t really have anything. Right? So, I’m guessing that that’s what we’re going to see.

Julie: Okay. Did you want to show our users maybe what, or folks on here what you’re seeing?

Rona: Yeah.

Julie: Alright. Let me stop the share and pass it back to you. And then, there you go.

Rona: Just a second. Yeah. Okay. So, as you can see, I just got an alert in Slack, an alert from Komodor. And it says that someone did a deployment. So, it’s very suspicious, because no one is supposed to be working now, and there shouldn’t be any deployments made right now. So, I’m going to go ahead and look at it. So, there was indeed a deployment made here. I see that its status is completed. And if I look here, I can see the event, when it started and when it completed, and it does match the time of your attack, Julie. And I can see here the Kubernetes diff. So, I can see that the replicas were scaled down from 1 to 0, and also that the resources, the CPU limits, have changed.

But this is indeed suspicious. And if I don’t have the load generator service running right now, I will have an issue with my app. So, I should solve it pretty quickly by just scaling my service back up to 1 instead of 0. So, that’s what I’m going to do now. So, I’m scaling up my app. It’s still 0 of 0, as we can see here. And now it changed to 1 of 1. So, I scaled up my service, the load generator that Julie attacked. And if we go to our bank service, we’ll see. Let’s check. I can do a deposit. And we’ll see if I can send payments to someone. And I’m able to send a payment.
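
The fix Rona applies through the Komodor UI is equivalent to scaling the Deployment back up with kubectl; the deployment name is assumed from the Bank of Anthos demo and the namespace is a placeholder:

```sh
kubectl scale deployment loadgenerator --replicas=1 -n <namespace>

# Watch until READY shows 1/1 again
kubectl get deployment loadgenerator -n <namespace> -w
```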

Yeah. So, that solved my issue pretty quickly. And it brought back the load generator service, and now my application is healthy again. So, Julie, let’s go back to you to your next attack.

Julie: Excellent. Okay. So, now that we’ve seen kind of what a blackhole attack might look like, let me pop over and steal the screenshare from you. Can you see my screen?

Rona: Yes.

Julie: Excellent. So, I think that we should go ahead and run a memory attack. And let’s go ahead and see what happens when we consume memory, let’s say from the balance reader and the front end. So, once I click on attacks, I’ve got my services and my infrastructure over here, and I’m going to go over to Kubernetes. There’s a couple of ways that I can do this. I can either type in here, or I can go down to my deployments. I’m going to select balance reader and front end. So, as you can see, these are selected here. You can see with the blast radius, there’s 2 out of 30 targets. I’m going to choose the Gremlin. Memory is a resource attack. So, we’re going to make sure that our system is resilient under memory pressure. Now, Rona, as I do this, I think I’m going to run this attack for, let’s say, 120 seconds. Do you think that works? 180? What do you want?

Rona: 300, I think.

Julie: Let’s do 300. Okay. And let’s get wild and crazy here. So, as I mentioned to folks, you want to start small and expand your blast radius, but we’re not necessarily following all of our own rules today, because we’re going to consume 100% of memory in these 2 services. So, I’m going to go ahead and unleash that Gremlin. While this is pending and getting ready to start, one of the things that you can do is actually create a scenario. So, let’s say you start by consuming 10% of memory, then 15%, 20%; you can actually work your way all the way up to 100% of memory consumption. But because we have a short period of time today, we’re just going straight for it. Now Rona, what do you think is going to happen?

Rona: So, since you attacked the balance reader, I think that I’m not going to see any balance on my bank account.

Julie: Well, why don’t we go ahead and let you show us?

Rona: And it’s going to be empty.

Julie: Alright, let’s let you show us what it looks like.

Rona: Okay.

Julie: And while Rona is pulling up that screenshare, for anybody that is interested in learning more about different types of attacks, we do have some free Gremlin certifications. So, you can go over to gremlin.com/certification. And we have a free practitioner and professional certification, which will teach you a lot more about all of this. And, Rona, that attack is running right now.

Rona: Okay. Great. Let’s see if we get something interesting here.

Julie: And then, on the Bank of Anthos, are we seeing any issues on that side?

Rona: Let’s see. Currently, we’re not seeing any issues. The balance is okay.

Julie: I’m wondering if we should…

Rona: I’m expecting to see an empty balance.

Julie: Maybe we should deposit some funds and see what happens.

Rona: Maybe.

Julie: How about like $1,000? Let’s just give us some money today. I always have way too much fun depositing money in here. So, it does look like the deposit was successful.

Rona: It is successful. And we still haven’t got any alerts from Komodor. So, let’s try maybe… yeah, the payment is indeed successful. So, my bank account is currently healthy and working and everything is good. So, I imagine it will take several minutes to see.

Julie: And that is possible. And so, things that we want to think about, too, are what we call abort conditions over here at Gremlin. At what point do we halt our attack because it impacts the customer experience? And what’s nice is that you can generally see that directly. Now, I don’t know if it’s the snow today or what. We did actually run this attack earlier today, and we got some really interesting results from it.

Rona: Yeah.

Julie: But maybe the snow is… let’s just blame it on the snow, because that’s kind of fun.

Rona: Maybe the snow makes everything better.

Julie: I think it does. And also, Kubernetes is rather resilient too.

Rona: Yeah.

Julie: The attack is still running in the background. And it looks like you have some questions here. So, I’m just going to be your little moderator and ask you these questions. “Is Komodor able to do slight changes to remediate the detected issues?”

Rona: We’re not doing any changes to your application or infrastructure. You have to do it on your own. But we do show you the root cause and we do make sure that you understand the root cause really quickly and easily. Then you can fix it on your own in a very short time.

Julie: Excellent. Why don’t we check the Bank of Anthos again real quick and see while this is still running?

Rona: Yeah, it’s still healthy. It’s still working. Everything is working. So, we can try it 1 more time.

Julie: Yeah, it looks like it was successful.

Rona: Yeah.

Julie: Now, one of the things that we see, too, though, is that the target container is actually configured to be killed if it runs out of memory. So, that can actually affect the running of Gremlin as well. So, why don’t we try something a little bit different here? I am going to shut some things down. And I’m not going to tell you what.
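
For context on that behavior: when a container has a memory limit set, it is OOM-killed once its usage exceeds that limit, which is why a 100% memory attack can end itself early. A hedged way to check whether such a limit exists (the deployment name and namespace are placeholders):

```sh
# Prints the container's memory limit, if one is configured
kubectl get deployment balancereader -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
```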

Rona: Okay.

Julie: And let’s go ahead and see what happens on your end. Now, just to let you know, because we’re having a little bit of fun today, normally, again, we form our hypothesis, and normally we’re not trying to catch our folks off guard. So, I’m having a little bit of fun: I’m going to shut down some of our nodes. So, bear with me really quick. I don’t want to show everybody what I’m doing, but in the Gremlin app, which you can get with the free trial at app.gremlin.com, I am running a state attack, which is shutdown. So, we’re going to be testing the resilience to host failures.

Rona: Great.

Julie: And I actually am rerunning that attack, because one of the things that we have in the shutdown attack is a reboot. So, that indicates the host should reboot after shutting down, and I forgot to toggle that off. So, one of the great things about Gremlin is there is a great halt button so you can stop your attack, especially if you are running up against those abort conditions. So, I halted it and I reran it, and let’s see if you can troubleshoot this.

Rona: Let’s see.

Julie: So, let’s give that just a minute. You could probably start to see something coming through here very soon. Looks like you have another question, “Is Komodor able to track changes that are not only code changes, for example, config/features?”

Rona: Yeah, we can track config changes and we can track… I don’t know if I showed you, but, one second.

Julie: Alright, Rona, I’m starting…

Rona: Okay.

Julie: Are you starting to get any alerts in?

Rona: No, I just wanted to show that we can show Git changes, for example. So, we can track down your repo changes, and you can see, for each deploy, which Git PR was related to it. So, in this case, this was my PR. And we also have a really cool feature of tracked files. So, we track important file changes, such as YAML files, Dockerfiles, and config files. And you can specify on your own which files are important to you. And once you have a change in these files, you’ll just see them here in the tracked files section. Oh, here we go, something’s happening.

Julie: Okay. So, now you get to figure out what I did. That’s a lot of red, or pink.

Rona: Yeah, a lot of red.

Julie: Yeah.

Rona: Okay. So, I can see here in the Slack channel that I have a lot of services that became unhealthy, probably because of your attack. Yeah. So, I got a health event of not enough ready replicas, the same as before. It’s still not ready. I can see that I have 0 out of 1 replicas. And I can also see that a lot of services went down. Right? So, my guess here would be that your shutdown actually shut down 1 of my nodes or hosts. That’s what I can figure out from this issue. I can see that it’s all going back to normal. But since a lot of services went down, this is my hypothesis regarding your attack. And what I would do in this case is maybe just contact my DevOps and see what happened with my nodes. Or maybe I’m running on some cloud provider and there’s an issue with it, maybe it’s AWS us-east or something. So, I don’t know. But maybe. But we can see that it all goes back to normal. And maybe we can run the previous attack again. Right?

Julie: Absolutely. So, you’d like me to run the shutdown attack again?

Rona: Yeah.

Julie: Excellent. So, I’m going to share my screen. I’m going to steal it from you really quick, so I can show folks what I actually did. So, what I did was I went and created a new attack. Now, here’s a great feature within Gremlin: I can actually rerun this attack directly from here. You can see the shutdown attack and what was going on with that attack. And I killed 2 out of 3 of your nodes. So, I’m just going to show how we start that attack from the beginning. So, I’d hit new attack here. I’d go over to our infrastructure. And on our hosts here, I’m going to pick exact, because you can go by tag, or you can go exact. Exact is kind of easy because we only have 3. I’m going to pick these 2. And I’m going to choose a Gremlin. So, I am going to run this shutdown attack. And remember I said we have a reboot option that’s automatically toggled on, which is why I had to restart it? I’m going to turn that off. And I’m going to unleash that attack again for everybody.
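
From the cluster side, a host shutdown shows up as the node flipping from Ready to NotReady or Unknown. A rough way to watch it happen with kubectl (the node name is a placeholder):

```sh
# Watch node status change as the hosts go down
kubectl get nodes -w

# Conditions reported by a specific node
kubectl describe node <node-name> | grep -A 8 Conditions

# Pods that were scheduled on the affected node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
```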

As I mentioned earlier, we have a lot of safety built in. So, if you see a degraded customer experience, you can hit the halt button here. And then we’re seeing this attack as it’s running through. So, the attack is currently in a pending state. And, Rona, where you started to see the issues was towards the end of the attack. So, would you like me to flip this back over to you?

Rona: One second. Yeah.

Julie: Alright. Let me go ahead and do that. There you go. And we did see that it took a moment for that to come over. We’re seeing some interesting things here on the attack. Actually, Rona, I am going to steal it back from you, because I shouldn’t have given it away so soon.

Rona: Yeah. Okay. It’s okay.

Julie: So, this is what’s going on right now. So, as we’re shutting down those hosts, we can see that we were down to 1. We’re going to see this eventually hopefully bottom out here at the end of the attack. So, while we’re doing this and while everybody’s watching the fun here, another question for you, “Can I integrate my alerts into Komodor?”

Rona: Yes, you can. We have a lot of available integrations at Komodor; that’s actually our strength. And you can integrate whatever alerting system you’re using with Komodor and send your alerts to Komodor. So, you have the integrations page here. And you can choose whatever alerting tool you’re using and send your alerts to Komodor. And it will all show up nicely in the services and events views.

Julie: Excellent. Now you can see that we are down to 0 healthy hosts. I am guessing that you are probably going to get some alerts now. So, I will go ahead and stop my share and give it back to you. And while we’re flipping that over to you, you have one more question, which is, “Can you please explain how Komodor is different than Lens IDE?” I don’t know what Lens IDE is, but can you explain how it’s different?

Rona: Lens IDE is basically a tool for you for… 1 second. I’m just going to [unclear].

Julie: Yeah, sorry. I can re-ask you that.

Rona: Okay. Okay. So, we can see here that we got an alert from… we actually got the banner here saying that a node, with the node name, is unhealthy. And here, okay, it went back, it went live again. But once I open the nodes view, I can see the node status here. And because you made a shutdown attack on 1 of my nodes, or 2 of my nodes, they were in an unknown status. So, you see here that the status will be unknown instead of ready. And given that, the node was not able to post any status. So, that’s what appeared here in the banner. And now, since it all went back to normal, the nodes are ready again. And I’m expecting the services to go back and be healthy again. So, yeah.

Julie: And we should have shown what the Bank of Anthos looked like during that. I’m wondering, if you go over there, whether there will be any residual issues.

Rona: I can see here, as I said, now I’m expecting my services to go back and be healthy. And I can see that this is indeed what happened. For every service here, I get an alert, and I see that it recovered. So, that was what I was expecting to happen, and it did happen.

Julie: That’s excellent. That’s what we love, too, with chaos engineering: validating that hypothesis. We actually want to validate what the hypothesis is. And then if we find something, it’s like, “Yes, we found something we weren’t expecting,” right?

Rona: Right. Yeah, so I’m thinking maybe we can do the memory attack again, and we’ll finish off with this one, I think.

Julie: Excellent. Okay, I will go ahead and run that memory attack. And I’ll just do it while you can still share your screen. And while I’m setting that up, can you explain how Komodor is different than Lens IDE?

Rona: Yeah.

Julie: Again, I don’t know what Lens IDE is, but how are you different? Thanks for the questions, everybody, too. I’m really loving them.

Rona: Yeah. So, in Lens IDE, you can see the Kubernetes status, the status of all of the resources, the pods, the deployments, the replicas, etc. And we show that as well in Komodor. But in Lens, you don’t have any history of the events, and you don’t have a way to tell what happened the day before, or 30 minutes before, etc. You don’t have any history at all. As well, you don’t have a way of integrating, as we said before, a lot of the tools you’re using in your system into the same location. You have to go and check everything on its own, and that’s really hard to do. So, Komodor just integrates everything into 1 location.

And also, in Komodor, you get insights and you get context. We give you insight into what’s going on in your system. We give you context about what’s going on in your system. That is not happening in Lens. It just gives you the Kubernetes resource status at the moment. So, that’s the difference.

Julie: Well, thank you. And thanks for the explanation. And that memory attack is running. I’m not seeing anything on your end. And I can attack more things as well.

Rona: Yeah, go for it.

Julie: Yeah, but does anybody have anything we’d like to attack? I mean, we can…

Rona: I think maybe it just takes some time until it happens. Maybe we can run the memory attack on more resources. Because I think that everything is working fine now. Let’s see. I can see my balance. I can maybe deposit funds. Oh, I’m not authorized to deposit funds.

Julie: Wow.

Rona: Yeah.

Julie: Not authorized to deposit funds.

Rona: So, something happened.

Julie: Oh, something happened. Yeah. That’s really frustrating too if you’re trying to deposit funds in your own account, and you would get that attack.

Rona: Right.

Julie: That response. This is actually something interesting to point out, just for folks. When you’re looking at those error messages that folks get, are they readable? And I do kind of like this one, because it makes sense to us. But do you think the regular user is going to understand things like servers? So, we just want to think about how our error messages are targeted to our customers, not engineers. That’s something I think we always need to remember when we’re showing those error messages. Well, the attack is running.

Rona: I’m actually able to access it and send payments. And I can see the balance. So, nothing happened on my side.

Julie: Let me see.

Rona: Maybe you can attack more of the services with the memory attack.

Julie: Absolutely.

Rona: And we’ll see if it affects anything.

Julie: So, what I’m going to go ahead and do, let’s see. Let’s go ahead and… now, one of the things with attacks is that sometimes you don’t want to attack a lot of things at the same time, because you want to know what’s causing the problem. But for the purpose of having lots of fun on this webinar today, I’m going to go ahead and… let’s see. I’m going to go for the balance reader again, the front end, the ledger writer, the load generator, and the transaction history. We’ll just have a little bit of fun here. Let’s see what happens.

Rona: Let’s try them all. Why not?

Julie: Let’s do it. Let’s do a memory attack. Let’s do 300 seconds, and let’s go for 100% of memory. I would predict that this will break things pretty poorly. But let’s see what happens. It is pending. And, again, Kubernetes is pretty resilient. So, let’s see what’s going on here. So, we’re in a pending state right now.

Rona: Okay. So, tell me when it’s running.

Julie: I will. And we are now running.

Rona: Okay, let’s see if anything interesting happens here, hopefully.

Julie: I just think when I ran it earlier, you just fixed everything.

Rona: Yeah. Maybe I did.

Julie: You might have. Or again, we can also determine that the snow has cured everything here.

Rona: Yeah.

Julie: It is always a lot of fun, though, what we can learn about our systems through running attacks and through troubleshooting tools. Right?

Rona: Right.

Julie: We can learn a lot about how our systems are working. Since all of these are in the process of running and I did run them for 300 seconds, do you want to go ahead and check the Bank of Anthos? Let’s see if anything’s going on. We actually have 4 minutes left, too.

Rona: Alright.

Julie: So, this is [unclear] fully resolved before our end time here.

Rona: Yeah, I can do a deposit. So, I’m guessing that the application is okay. And I didn’t get any alerts, unfortunately. So, yeah, I think everything is good for some reason.

Julie: Well, let’s see. As we’re running out in the last couple minutes, is there anything too that you wanted to let folks know?

Rona: I actually want it to work.

Julie: I would like it to work as well. But remember, the way that that container is configured, it will be killed if it runs out of memory. Right? And so, that can stop the attack. And thanks, engine. And I hope I just said your name wrong. I probably did. We’ve certainly enjoyed having some fun. Oh, we know you’ve got a very resilient system.

Rona: Yeah, we do.

Julie: Thank you. I also love the look and feel of Gremlin. I mean, what a great mascot, a little gremlin. Rona, maybe we should play around a little bit more and hang out again after the holidays.

Rona: Right, I think so too. We should.

Julie: Maybe we can give people the opportunity to play with both of the tools. Let’s chat about it.

Rona: It’s actually a lot of fun.

Julie: The attack still has 1 more minute to go.

Rona: Okay. Yeah, because the app is working fine. And also, Komodor is saying that everything is okay. I think it’s the snow.

Julie: I do. I do too. Or I think we were just magical earlier today. Either way, in theory, what we could imagine…

Rona: I think so too.

Julie: What we can imagine is that we have a very amazing app with the Bank of Anthos that just nothing can break, aside from shutting everything down. So, congratulations to the engineers that built this amazing app that you just can’t break. It does look like the first one, the ledger writer, is showing successful, transaction history is showing successful, and balance reader is showing successful. So, maybe we can play again, and you can join us next time, when we really break things in unique and amazing ways.

Rona: Yeah. It will be good. I’m expecting that. Without the snow this time.

Julie: Without… yeah, no snow.

Rona: No snow, okay? Oh, yeah, something happened.

Julie: Oh, there it is. It went right at the top of the hour. Is everybody seeing that? Oh, I hope not too many people jumped off. Yay.

Rona: Yay. Let’s see. Let’s see. Great. Yeah. But it all goes back to… okay. But we see that we got an actual out-of-memory killed issue here. And we know that, in Kubernetes, an out-of-memory killed issue can be caused by an application issue, or by some underlying infrastructure issue, such as the node having an issue. And actually, maybe the scheduler just decided to evict my application. But it’s not actually related to my application; it’s just related to the node. So, we can just click here on the view analysis button and see what Komodor tells us.

So, we can see here that we do have an out-of-memory issue. We can see which container failed. And we can see the memory details for that container. And what Komodor does for us is run automated checks in order to figure out what the issue was and what the root cause of this out-of-memory issue is. So, we’re running some smart checks, the ones that, if you got an out-of-memory killed issue, your senior DevOps would tell you to check. So, we’re checking if there were some spec changes. We’re checking if the node is overcommitted. We’re checking the quality of service. We’re checking if maybe additional services were impacted. And we’re checking other pods on this service: maybe it’s just this pod that has an issue, or maybe some other pods on the service have an issue.

We can learn a lot from these checks. And for each check that we’re running, we explain why we’re running this check and what you can learn from it. So, if you look at the More Info button, we do say what we’re doing in this check and why we’re recommending, for example, to roll back to the previous configuration. So, this actually does tell me a lot about what’s going on in my system and what I should do about it. Yeah. So, in this case, I see that my node is overcommitted. So, I would just try to add more nodes to my cluster, or maybe increase the memory resources that I have on my nodes. And this could solve the issue for me. And if not, I have 2 more options here to pay attention to.

As for the other checks, we have 3 checks here that were flagged as possibly causing the issue. And for the other 2, we’re saying it’s all good: you don’t have any additional services impacted, and you don’t have any other pods failing in the service. Yeah, and I think we’re running out of time.
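
The automated checks Rona describes correspond roughly to these manual kubectl checks (the pod, node, and namespace names are placeholders, and kubectl top requires metrics-server):

```sh
# Confirm the container was OOM-killed
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Is the node overcommitted on memory?
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# Current memory usage per pod
kubectl top pod -n <namespace>
```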

Julie: One thing that I want to say is once you’ve done those things, then rerun the attack to make sure now that all your remediations work. But yes, we did run out of time.

Rona: Yeah.

Julie: This is a lot of fun. And I’m glad we got some things to break. I did run this second memory attack for a little bit longer. We attacked a few more things. Gosh, Rona, thanks for the time today. It’s always fun.

Rona: Thank you. Thank you for creating chaos in my application.

Julie: Any time. And yeah.

Rona: It was fun. Thanks so much.

Julie: Excellent. Well, I think I am actually the one with the host controls, so I can end this for everyone. There’s one experiment we didn’t perform, as per the agenda. Which one did we not perform? Okay, I’ll tell you.

Udi: So, we did go a bit off-script. This was an unusual demo, and we did get a lot of snow, as we mentioned.

Rona: Yeah.

Julie: I think we had a little bit too much fun, maybe round 2. And actually, Udi, I’m just going to pass the host controls back over to you, so that you can do whatever you would like with this. So, you are now our host again. And I just want to thank you, Komodor team, for having me today.

Rona: Thank you, Julie.

Udi: Thank you.

Rona: It was a pleasure.

Udi: Yeah. And what I would like to do with my host controls now is just thank you, Julie and Rona, for doing this amazing event today. And for everybody who participated and stayed around longer than advertised, thank you so much for being here and engaging with us. And we’ll see you at our next event. Have a good day or night, wherever you are in the world.

Rona: Thank you, Udi.

Udi: Bye-bye.

Julie: Bye.

Rona: Bye.