Video: Team Komodor Does Klustered with David Flannagan (AKA Rawkode)

Udi Hofesh, Head of Kommunity

13 min read January 7th, 2024

An elite DevOps team from Komodor takes on the Klustered challenge; can they fix a maliciously broken Kubernetes cluster using only the Komodor platform? Let’s find out!

Watch Komodor’s Co-Founding CTO, Itiel Shwartz, and two engineers, Guy Menahem and Nir Shtein, leverage the Continuous Kubernetes Reliability Platform that they’ve built to showcase how fast, effortless, and even fun troubleshooting can be!

Below is an auto-generated transcript of the video:

0:00 so starting from left to right starting with you Guy could you please say hello and introduce yourself and share a little bit more if you wish yeah hey

0:07 everyone I’m Guy I’m a solutions architect at Komodor I’m here for the last two years

0:12 very excited to join Klustered cool everyone my name is

0:19 Nir Shtein I’m a software engineer I came a week after Guy and

0:25 it and hey everyone my name is Itiel Shwartz I’m the CTO of Komodor watch and

0:30 Klustered happy to be here all right thank you all very much so this is a

0:35 special edition of Klustered and I have broken the cluster personally myself with a handful of breaks to hopefully

0:42 reveal and show us the power of Komodor as a tool for understanding and debugging problems within your

0:47 cluster so with that being said I will now let Guy share his screen and we will begin the debugging process best of luck

0:54 Team Komodor thank you let’s let’s start cool there you go so we got our

1:00 cluster with Komodor installed so I will consider this cluster

1:07 fixed if we can port forward and visit the Drupal website that is your

1:13 mission okay sounds easy okay easy

1:18 right you know the reason is it’s scheduling yeah

1:24 so we can go into like the service in Komodor and see the timeline right like how did the service change over

1:30 time and you can see that it’s currently broken and you can try and understand why is it broken so why is it broken I

1:37 see there is a deploy like in 15 25 open

1:43 so do you mean like oh to zoom in okay there you go I think it’s a great story

1:49 of what happened yeah and we can see there was a change in GitHub yeah there was a like an awesome

1:57 update update was it an awesome update we will

2:03 not and here David removed some environment variables I think they’re

2:08 crucial and changed the claim name even worse cool so let’s take a look about

2:14 the problem let’s see if they are like connected to one another oh so we definitely can see that the volume the

2:21 PVC not found it’s definitely the problem with the PVC let’s try to do the roll back right yeah let’s click roll back

2:29 you don’t have permission to do it because it’s your cluster
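
The "PVC not found" symptom described above can be sketched as a small check. This is a minimal illustration, not how Komodor detects it: it assumes the workload and the list of existing claims have already been loaded as plain dictionaries (for example from JSON output of the cluster API), and the names `drupal-dta` and `drupal-data` are hypothetical, not taken from the video.

```python
def missing_pvc_claims(workload, existing_pvcs):
    """Return claim names referenced by the pod template that do not exist.

    `workload` is a Deployment-shaped dict; `existing_pvcs` is a set of
    PersistentVolumeClaim names present in the same namespace.
    """
    pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for volume in pod_spec.get("volumes", []):
        claim = volume.get("persistentVolumeClaim")
        if claim is None:
            continue  # not a PVC-backed volume (emptyDir, configMap, ...)
        name = claim.get("claimName")
        if name not in existing_pvcs:
            missing.append(name)
    return missing


# Hypothetical example data: a typo in claimName leaves the pod unschedulable.
deployment = {
    "spec": {"template": {"spec": {"volumes": [
        {"name": "site-data", "persistentVolumeClaim": {"claimName": "drupal-dta"}},
        {"name": "tmp", "emptyDir": {}},
    ]}}}
}

print(missing_pvc_claims(deployment, {"drupal-data"}))
```

A non-empty result means at least one volume points at a claim that does not exist, which matches the kind of change the timeline diff surfaced.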

David it’s also a GitOps cluster so the roll back wouldn’t actually work it would be overwritten a few minutes later

2:41 however the first you’ve discovered uh essentially the second problem there is

2:48 something that’s more current with this deployment but you yes we will have to fix this too so let’s you mentioned it

2:55 right at the start somebody said scheduling right let’s let’s look at the pods Guy

3:02 the pods yeah let’s look

3:08 at pending other than the previous thing let SCH not ready yeah let’s go back we

3:15 have information in here by the way like on the first scheduling of the event would be much easier to see from

3:22 here ah I think that one node is not available

3:28 no maybe let’s check the pods list first yes let’s try check on the resources

3:35 node and see that there is a node that is not ready and schedule disabled maybe

3:41 let’s try to uncordon it it’s cordoned so yeah I don’t think we will

3:47 have it maybe let’s go to the terminal try to fix it from the terminal you add yourself as a yeah as a

3:56 user try to add my my person

4:01 do you want you to I okay so we’re doing like a switch just because of security

4:08 permissions and because David created the cluster and the account basically uh then we need to give like to add another

4:16 team member to the Komodor account so David if you can invite Nir it can be

4:23 great yes so let’s go to the nodes let’s take

4:28 the action and

4:37 uncordon and also let’s do the roll back everything at the same time let’s

4:44 do so now Nir is like rolling back the service oh maybe the first deploy was

4:51 not in Komodor that’s why I didn’t say

4:58 that try yeah no

5:04 fory yeah we can also take the changes from the GitHub and we can also take the

5:10 changes from Komodor right and yeah from GitHub I think the service was deleted and then reinitiated

5:17 right like it’s generation one it basically means that like David played with the service apparently then he

5:25 deleted the deployment then he recreated the deployment with the fail yeah with the failed configuration and that is the

5:32 reason that we can’t roll back because for us it’s a new kind of service that

5:39 it has a different unique ID and this is the first generation of the new Pro

5:45 deployment so we need to roll out this workload yeah is the nodes okay now

5:53 let’s check it yeah they are okay do you like check it on your screen or no no I’m asking I I can no it’s still not

6:01 ready ready why is that and container network is not

6:11 [Music]

6:16 ready it looks like more like a like a k3s issue maybe yeah we we need

6:24 the CNI plugin to be ready can we check maybe on the service as what is

6:30 configure do the network plug in maybe let’s try to take a look in the

6:46 YAML the network unavailable okay so what’s the reason CNI is not

6:58 initialized it’s k3s right no these are bare metal

7:04 kubeadm clusters bare metal

7:11 here

7:16 um maybe those are the things this is a 48 core 64 gig ram bare

7:24 metal machine okay okay so you can have some fun with it right
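
The two node states seen so far (a cordoned node, then a node that is not ready because the network plugin is not initialized) can both be read from standard Node fields: `spec.unschedulable` and the `Ready` entry in `status.conditions`. The sketch below assumes the node list has already been loaded as a dictionary shaped like a Kubernetes NodeList; the node names and the condition message are illustrative, not copied from the video.

```python
def node_problems(node_list):
    """Map node name -> list of human-readable problems found on that node."""
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        issues = []
        # A cordoned node has spec.unschedulable set to true.
        if node.get("spec", {}).get("unschedulable"):
            issues.append("cordoned")
        # A healthy node reports the Ready condition with status "True".
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                issues.append("NotReady: " + cond.get("message", "no message"))
        if issues:
            problems[name] = issues
    return problems


# Hypothetical NodeList: one cordoned node, one waiting for its CNI plugin.
nodes = {"items": [
    {"metadata": {"name": "node-a"}, "spec": {"unschedulable": True},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"}, "spec": {},
     "status": {"conditions": [{"type": "Ready", "status": "False",
                                "message": "cni plugin not initialized"}]}},
    {"metadata": {"name": "node-c"}, "spec": {},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
]}

for node_name, issues in node_problems(nodes).items():
    print(node_name, issues)
```

Uncordoning clears the first kind of problem only; the second one stays until the CNI plugin itself is healthy, which is why the node remained not ready after the uncordon.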

7:32 okay so let’s recap where we are right now using Komodor we explored the broken service we identified two bugs

7:38 one is that my awesome update in git which you were able to visualize and see right away uh potentially broke the PVC

7:45 claim name which we’re going to come back to I would assume I also highlighted that the cluster couldn’t

7:52 schedule our pod and you went to the node dashboard and identified that the node

7:57 was cordoned and you were able to uncordon it directly from Komodor moving us past the scheduling problem however

8:05 we now have the node being not ready because of a potential issue with the CNI networking

8:12 plug-in yeah yeah like we can see that there are like I don’t know

8:18 like four different plugins that are installed CSI plugins that are installed

8:24 and we’re looking for CNI not CSI sorry sorry sorry

8:29 maybe should I describe maybe the node what sorry

8:35 describe it looks like that we have the Cilium operator installed yeah uh in in this cluster yeah it might

8:44 be with the operator yeah there is maybe the CRDs of Cilium like there is the operator maybe

8:52 the Helm yeah it’s like using Helm oh this fail

8:59 fail deploy in here fail deploy yeah yeah so we can see there is Agent true

9:06 agent not ready as well minimum replica

9:11 unavailable yeah but it’s just just the operator itself

9:19 on the deploy let’s take a

9:24 look deployment version one

9:30 there is a spec affinity of like label match label Cilium

9:36 operator it’s the pod template that is unmatch the deployment do you think like

9:42 the relevant part in here maybe oh it’s it’s funny it’s running

9:48 and ready but it’s like the node is not ready it’s

9:55 always fun watching people fix a broken cluster

10:02 no maybe like look at the Helm dashboard no like in the Helm dashboard we can see

10:08 like the current Cilium like we can see quite a lot on this like what annotation does this the

10:15 cler not no I think the not found yeah I just found it

10:24 exactly maybe let’s check if there is there is the cluster role and cluster

10:30 role binding in the cluster do do you mean like those like resources which are

10:36 not exist mm I think we need to create something I’m not sure maybe let’s check

10:41 the log of the which is running no I think it’s like one the annotation right like he doesn’t find the annotation on

10:47 the node this why doesn’t inst it on it’s running on the

10:54 no so this may be a little bit harder to debug because I think I found a bug in

10:59 Komodor but try comparing the values from the release three to release

11:04 two okay obum yeah you it

11:13 okay so there are changes but they don’t actually show up here yeah maybe met

11:20 [Music] changes we have only the three version

11:27 we don’t have the second no we do have it we do of the operator

11:34 ah in the hand does only show changes doesn’t show anything no do two

11:41 and then compare with revision two it’s two

11:47 compared with revision three

11:53 no yeah I don’t know why it’s not showing the change for me show the change great then

12:00 manifest then compare with version two here when you do here’s the changes you

12:05 deleted the service account and all of those do Guy I will

12:14 do need to do the don’t have permission to I just

12:19 perform but well maybe it’s a permission thing

12:25 yeah I think the watcher doesn’t have a permission maybe for that mm possible yeah let’s see if also here it

12:32 doesn’t have secret FS let’s do also roll back to the we can’t

12:39 we can’t we can’t our own agent we need the access to the to the cluster yeah so

12:47 we will use it so do all to our agent and then we’ll do to to the CNI okay

12:53 sorry we share my screen I’ll stop here yes so we found out that we are missing permission inside Komodor and

13:00 it was installed without the possibility of like a roll

13:15 back
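
The manifest comparison between two Helm revisions described above (where a service account and related resources turned out to be deleted) can be sketched as a set difference over resource identities. This is a minimal illustration under stated assumptions: the rendered manifests of each revision are assumed to be already parsed into lists of dictionaries (for instance from the output of `helm get manifest` for each revision), and every resource name below is hypothetical.

```python
def resource_keys(manifests):
    """Identify each rendered resource by (kind, namespace, name)."""
    return {
        (m["kind"], m["metadata"].get("namespace", ""), m["metadata"]["name"])
        for m in manifests
    }


def removed_resources(old_manifests, new_manifests):
    """Resources present in the old revision but absent from the new one."""
    return sorted(resource_keys(old_manifests) - resource_keys(new_manifests))


# Hypothetical rendered manifests for two revisions of the same release.
revision_2 = [
    {"kind": "ServiceAccount",
     "metadata": {"name": "example-operator", "namespace": "example-ns"}},
    {"kind": "ClusterRole", "metadata": {"name": "example-operator"}},
    {"kind": "Deployment",
     "metadata": {"name": "example-operator", "namespace": "example-ns"}},
]
revision_3 = [
    {"kind": "Deployment",
     "metadata": {"name": "example-operator", "namespace": "example-ns"}},
]

for kind, namespace, name in removed_resources(revision_2, revision_3):
    print("removed:", kind, namespace or "(cluster-scoped)", name)
```

Comparing identities rather than full text keeps the result short: it answers "what disappeared between revisions" before anyone reads a line-by-line diff.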

13:27 okay that’s

13:33 it I to

13:39 just okay you okay yeah that’s that’s

13:44 it

13:55 okay cool cool cool okay now let’s go back and check if the node is ready now

14:02 yeah he is is ready okay and now let’s

14:07 check out our so before we continue the upgrade to the Komodor I did in

14:13 Komodor because it turned on dashboard but I see that it removed the secret access which is probably why the values

14:19 didn’t show yeah reason okay I just wanted to make sure I understood what

14:25 happened there okay cool and so now the node is ready let’s go back to services

14:32 only thing remaining is the version for the okay so we have a working node and

14:39 you fix the deploy nice work what we yeah now we need to roll it back so what

14:44 we we can’t roll it back because what so we need to edit like let’s edit it yeah I

14:49 think that I need to show because remember this is a GitOps pipeline so you might want me just to push a fix if

14:54 you can tell me what you want that fix to be so be reverse yeah let’s just check the

15:02 latest oh I don’t know how to fix it I mean I I just did awesome updates you don’t need to tell me how to fix

15:09 it so please your bad code yeah so yeah

15:18 let’s just git checkout to the like this revision if nothing else change in

15:24 between then that’s probably the easiest solution you check out with the ref for the

15:30 change are you doing the I have pushed an update to

15:36 git I’m sharing I’m sharing I’m sharing do you have like a pipeline that

15:42 know how to like it automatically deployed yes Flux CD is running in the cluster it will detect this change and

15:48 it will push it out we can speed up the process and I will do so now just so it’s a bit quicker

15:53 yeah so what we can see is that see in near there like the the PVC

16:01 change and we got some environment variables which can be missing and what David

16:07 changed yeah it’s only the P so maybe maybe we still miss those yeah so let’s

16:15 maybe start from the let’s wait for the rollout to happen

16:21 yeah we should see it in Komodor once the happen you can take look on the walk SPS

16:27 to see this still yeah yeah but it’s the previous one yeah it wasn’t

16:34 any so we looking for the new

16:41 one what so I that push the update however our GitOps pipeline is broken due to

16:47 the fourth break in the cluster so good luck so there’s another break maybe I’ll

16:55 go right like a let yeah let’s check Argo is there Argo

17:02 Flux sorry source controller notification all of them look

17:09 healthy what do we check sorry

17:14 yeah but maybe it’s misconfigured or something like that seems like Flux is

17:20 working fine let’s check maybe logs of one of the workflows the controller or some other service the source controller

17:27 like the log message look

17:33 good maybe is it updated by source controller

17:40 or I think there is still problem with the like one of the pods are unhealthy

17:46 in the source control yeah the Cilium operator is

17:51 pending scheduled because it didn’t match pod affinity rules if you go to the

17:58 walk on the

18:03 white click on the the operator okay it’s just because when you did the roll back I set the replicas to one because

18:11 we were a single node cluster so you can ignore that pending pod no take the first no he saying like

18:19 it’s it’s not the affinity no no it’s like you said like in the logs of the source controller yeah yeah it was

18:26 there lo there’s like message s artifa foric let me go back and then garbage

18:33 collected one artifact why did garbage collected it and then a lot of changes but why did the garbage

18:39 collected one artifact maybe it’s related to that I don’t know

18:46 yeah change like this is the change this is

18:52 what you mean right again yeah and then like one

18:57 afterward remove typo in PVC name yeah this is the

19:05 commit like d [Music]

19:10 question yeah but what does it mean let’s see if we got any warnings in

19:15 here or you can do like maybe

19:24 like so what happen is one point in the it it find out that there there was a

19:30 change mm but for some reason the garbage

19:36 collected it we need to change something in Flux

19:44 yeah let’s check the configuration maybe it’s something about this configuration

19:50 CH yeah this by way in the kustomize controller it always failed the Flux CD

19:57 name is changed from system to and what is the name in the

20:07 log saw that yeah yeah okay so your rollback for Cilium actually fixed this

20:13 problem but there’s a 10 minute sync time on the kustomization so I’ve just encouraged it to run again

20:21 so so we don’t need to do anything as long as this kustomization runs no it’s

20:27 still failing it’s is in networking and cluster is not working yeah I don’t know if your roll back for Cilium fixed the

20:34 problem I think the roll back of Cilium didn’t no like there if you look at the logs of the kustomize controller there are

20:40 really bad logs there and it says that it failed on like HTTP failed call in webhook let me just show

20:50 that everyone can see yeah you consideration fail after second has the

20:55 CNPG service who is it name

21:07 rout the CNPG thing is I think the

21:20 network what is this service the CNPG yeah there is like one thing here I’m

21:27 looking at logs of the CNPG is it a pod it’s there is a pod but

21:35 like the latest message is like periodic TLS certificate

21:41 maintenance which I don’t really

21:51 know e on this series what was the in it doesn’t likeed like with the

21:58 relevant service basically yeah so let me give you context on that Cilium break right because you did a roll back but

22:04 you didn’t really identify what the problem was and what changed and uh I don’t want us to debug

22:11 something that you can’t have visibility into right now because of that secret values thing so in the Cilium Helm chart

22:18 what I did was disable the agent which is definitely rolled back because we can see the agent is now deployed next to the operator however I also disabled the

22:26 eBPF kube-proxy replacement and you may notice there’s no kube-proxy in this cluster so in the interest of not

22:34 debugging something that we’re not entirely sure if it’s been fixed or not I’m going to redeploy Cilium right now and assume the roll back hopefully fixed it

22:41 properly and if we still have an issue then I’m debugging with you because I’m not really sure what the problem will be

22:46 after that let let’s no maybe

22:55 worse it’s not that okay so my my update for Cilium has

23:02 triggered a redeploy of Cilium so the config map definitely changed so we may

23:08 be moving in to a better

23:19 situation yeah maybe delete the latest Cilium operator oh who can delete

23:25 things delete the Cilium operator

23:32 okay the previous one yeah

23:38 the the operator wait a sec the one that is

23:43 pending no not the one that is the other one what will happen yeah so go to the C operator to

23:51 the are you sure yeah I’m going to delete the

23:56 old Cilium oh that’s a bold move I like it yeah

24:02 yeah we’re not playing around here you know so now the the new version is

24:09 running and should or we won’t have anyone there

24:14 running right it seem like it’s face SCH hey that

24:20 worked did it work yeah yeah well we had no doubts

24:26 about

24:32 seems like the new of the Cilium doesn’t I think that’s okay because he

24:37 has like two replica but now like it’s a new one that is running great so

24:44 now I can scale it to one

24:49 keep issues you know I’m scaling the the C one no no I

24:56 think the no I think now it’s okay now let’s read the the logs of the Flux thingy

25:01 there the kustomize one I think right let’s the source I think the C oh the Drupal

25:08 is just syncing it is yeah you see it’sing wall yeah let see

25:15 it when in doubt delete Cilium operators fixes

25:21 everything oh now it’s healthy look on the Yeahs and I do like

25:27 a say okay let me share my screen and I’ll

25:33 test the website for you right moment

25:38 yeah and you understand like all you get is like Drupal working that’s

25:44 like that’s like the best scenario Drupal is running we have a

25:49 problem with our database configuration but maybe we don’t need it so in the interest of testing we can go port

25:56 forward not

26:03 do also

26:09 have okay so it’s almost working let’s see if we can actually open it in a

26:22 browser don’t be too happy

26:30 you try to save it now he’s going to try to use the

26:35 database so this shouldn’t actually be needed but the init script is unable to run for the same reason that this

26:42 command will fail oh no our Drupal instance is unable

26:49 to communicate with the Postgres database back over to you and this is the last

26:54 break because maybe the enir right it’s going to time out it cannot post

27:00 Drupal cannot speak to Postgres there we go temp failure DNS

27:05 resolution yeah back back to you last break there we go so it cannot resolve

27:12 the database and okay so let’s check the

27:19 events of the

27:24 everything elction network policy maybe you did a lot of network policy

27:31 changes indeed why did you do it the event you can see policy changes

27:42 and less network policy [Music] change

27:48 was I

27:54 scraping so we saw that there are a lot of network policy changes and it look like someone changed

28:01 the Untitled policy yeah there was a policy that prevent us for executing

28:09 request the cluster there is a policy type of igress so let’s try to take an

28:15 action and I mean what I love about Komodor here right is this the event log as a gold mine of information and you can see this

28:22 network policy was created in the last 24 hours it’s obviously well intended but you know mistakes are easy to make

28:28 in Kubernetes very easy
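
The transcript does not show the contents of the network policy that was removed, so the following is only a sketch of how a well-intended policy can block traffic. It assumes a NetworkPolicy already loaded as a dictionary and applies the standard Kubernetes defaulting rule for `policyTypes`; the example policy is hypothetical, and the check ignores pod selectors and peers for brevity.

```python
def restricted_directions(policy):
    """Return the traffic directions ("Ingress", "Egress") a policy restricts."""
    spec = policy.get("spec", {})
    types = spec.get("policyTypes")
    if types is None:
        # Kubernetes default: Ingress is always set, Egress only when the
        # policy has egress rules.
        types = ["Ingress"] + (["Egress"] if spec.get("egress") else [])
    return list(types)


def denies_all(policy, direction):
    """True if the policy restricts `direction` and lists no allow rules for it."""
    rules = policy.get("spec", {}).get(direction.lower()) or []
    return direction in restricted_directions(policy) and len(rules) == 0


# Hypothetical policy: selects every pod in its namespace and allows no ingress.
policy = {
    "kind": "NetworkPolicy",
    "metadata": {"name": "example-default-deny"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
}

print(restricted_directions(policy))
print(denies_all(policy, "Ingress"), denies_all(policy, "Egress"))
```

A policy like this is easy to create with good intentions; if it selects pods that other workloads depend on, such as a database or a DNS service, requests to them start timing out, which is consistent with the symptom seen here.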

28:34 then all right if you can stop sharing your screen I will give the application another spin I think we should be

28:40 sitting pretty now ’cause I still have my port forward running if we remove the install script

28:48 yeah we’re holding the view and if we make

28:56 sure okay it completed 16 seconds ago the

29:03 database is now running oh I shouldn’t have to do this

29:10 but we run through it anyway that’s

29:16 it woo well done you fix all the breaks on the cluster and Drupal is now working

29:23 as [Music] intended

29:29 so you know a small recap and then I’ll e get back up day right but that’s was a whole lot of fun for me right um I

29:36 actually found it really difficult to break the developer the consumer API of

29:42 Kubernetes in a way that Komodor couldn’t show right up front what the problem was with the Git integration the

29:49 diffs the Helm charts the node information even revealing all the labels and annotations everything was

29:55 just there in front of me and I think that’s just a superpower for people that have to operate Kubernetes so I’ll thank you

30:00 all for your work it made it harder to break but I hope you enjoyed each of the breaks that were presented to you and uh

30:06 yeah any final remarks from anyone no it was super

30:13 fun
