Team Komodor Does Klustered with David Flanagan (AKA Rawkode)

An elite DevOps team from Komodor takes on the Klustered challenge; can they fix a maliciously broken Kubernetes cluster using only the Komodor platform? Let’s find out!

Watch Komodor’s Co-Founding CTO, Itiel Shwartz, and two engineers, Guy Menahem and Nir Shtein, use the Continuous Kubernetes Reliability Platform that they’ve built to showcase how fast, effortless, and even fun troubleshooting can be!

Below is an auto-generated transcript of the video:

0:00 so starting from left to right starting with you Guy could you please say hello and introduce yourself and share a little bit more if you wish yeah hey

0:07 everyone I’m Guy I’m a solution architect at Komodor I’m here for the last two years

0:12 very excited to join Klustered cool everyone my name is

0:19 Nir Shtein I’m a software engineer I came a week after Guy and

0:25 it and hey everyone my name is Itiel Shwartz I’m the CTO of Komodor watch and

0:30 Klustered happy to be here all right thank you all very much so this is a

0:35 special edition of Klustered and I have broken the cluster personally myself with a handful of breaks to hopefully

0:42 reveal and show us the power of Komodor as a tool for understanding and debugging problems within your

0:47 cluster so with that being said I will now let Guy share his screen and we will begin the debugging process best of luck

0:54 team Komodor thank you let’s let’s start cool there you go so we got our

1:00 cluster with Komodor installed so I will consider this cluster

1:07 fixed if we can port forward and visit the Drupal website that is your

1:13 mission okay sounds easy okay easy

1:18 right you know the reason is it’s scheduling yeah

1:24 so we can go into like the service in Komodor and see the timeline right like how did the service change over

1:30 time and you can see that it’s currently broken and you can try and understand why is it broken so why is it broken I

1:37 see there is a deploy like in 15 25 open

1:43 so do you mean like oh to zoom in okay there you go I think it’s a great story

1:49 of what happened yeah and we can see there was a change in GitHub yeah there was a like an awesome

1:57 update update was it an awesome update we will

2:03 not and here David removed some environment variables I think they’re

2:08 crucial and changed the claim even worse cool so let’s take a look at

2:14 the problem let’s see if they are like connected to one another oh so we definitely can see that the volume the

2:21 PVC not found it’s definitely the problem with the PVC let’s try to do the roll back right yeah let’s click roll back

2:29 you don’t have permission to do it because it’s your cluster

2:35 David it’s also a GitOps cluster so the roll back wouldn’t actually work it would be overwritten a few minutes later

2:41 however the first you’ve discovered uh essentially the second problem there is

2:48 something that’s more current with this deployment but you yes we will have to fix this too so let’s you mentioned the

2:55 right at the start somebody said scheduling right let’s let’s look at the pods Guy

3:02 the pods yeah let’s look

3:08 atending other than the previous thing let SCH not ready yeah let’s go back we

3:15 have information in here by the way like on the first scheduling of the event would be much easier to see from

3:22 here ah I think that one node is not available

3:28 no maybe let’s check the pods list first yes let’s try check on the resources

3:35 node and see that there is a node that is not ready and schedule disabled maybe

3:41 let’s try to uncordon it it’s cordoned so yeah I don’t think we will

3:47 have it maybe let’s go to the terminal try to fix it from the terminal you add yourself as a yeah as a

3:56 user try to add my my person

4:01 do you want you to I okay so we’re doing like a switch just because of security

4:08 permissions and because David created the cluster and the account basically uh then we need to give like to add another

4:16 team member to the Komodor account so David if you can invite Nir it can be

4:23 great yes so let’s go to the nodes let’s take

4:28 the action and

4:37 uncordon and also let’s do the roll back everything at the same time let’s

4:44 do so now Nir is like rolling back the service oh maybe the first deploy was

4:51 not in Komodor that’s why I didn’t say

4:58 that try yeah no

5:04 fory yeah we can also take the changes from the GitHub and we can also take the

5:10 changes from Komodor right and yeah from GitHub I think the service was deleted and then reinitiated

5:17 right like it’s generation one it basically means that like David played with the service apparently then he

5:25 deleted the deployment then he recreated the deployment with the fil yeah with the failed configuration and that is the

5:32 reason that we can’t roll back because for us it’s a new kind of service that

5:39 it has a different unique ID and this is the first generation of the new

5:45 deployment so we need to roll out this workload yeah is the nodes okay now

5:53 let’s check it yeah they are okay do you like check it on your screen or no no I’m asking I I can no it’s still not

6:01 ready ready why is that and container network is not

6:11 [Music]

6:16 ready it looks like more like a like a k3s issue maybe yeah we we need

6:24 the CNI plugin to be ready can we check maybe on the service as what is

6:30 configure do the network plugin maybe let’s try to take a look in the

6:46 YAML the network unavailable okay so what’s the reason CNI is not

6:58 initialized it’s k3s right no these are bare metal

7:04 kubeadm clusters bare metal

7:11 here

7:16 um maybe those are the things this is a 48 core 64 gig ram bare

7:24 metal machine okay okay so you can have some fun with it right
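
Editor's sketch: the two node symptoms the team has hit so far (a cordoned node, then a node that is not ready because the network plugin is not initialized) are both visible on the Kubernetes Node object itself. The snippet below is a minimal illustration, not Komodor's implementation. It assumes a node shaped like the JSON the Kubernetes API returns, where `spec.unschedulable` and `status.conditions` are real API fields; the function name `node_problems`, the node name, and the sample message are hypothetical.

```python
from typing import Any


def node_problems(node: dict[str, Any]) -> list[str]:
    """List the reasons a node (as a Kubernetes-style dict) cannot take new pods."""
    problems: list[str] = []

    # spec.unschedulable is true while a node is cordoned.
    if node.get("spec", {}).get("unschedulable", False):
        problems.append("cordoned (SchedulingDisabled)")

    for cond in node.get("status", {}).get("conditions", []):
        ctype = cond.get("type")
        status = cond.get("status")
        message = cond.get("message", "")
        # A healthy node reports Ready == "True".
        if ctype == "Ready" and status != "True":
            problems.append(f"NotReady: {message}")
        # NetworkUnavailable == "True" means node networking is not set up.
        elif ctype == "NetworkUnavailable" and status == "True":
            problems.append(f"NetworkUnavailable: {message}")

    return problems


# Hypothetical sample data, loosely modelled on the symptoms in the video.
broken_node = {
    "metadata": {"name": "node-1"},
    "spec": {"unschedulable": True},
    "status": {
        "conditions": [
            {
                "type": "Ready",
                "status": "False",
                "message": "cni plugin not initialized",
            }
        ]
    },
}

healthy_node = {
    "metadata": {"name": "node-2"},
    "spec": {},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}

if __name__ == "__main__":
    for n in (broken_node, healthy_node):
        print(n["metadata"]["name"], node_problems(n))
```

Uncordoning clears only the first entry; the `NotReady` entry stays until the CNI plugin is working again, which matches what the team sees next.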

7:32 okay so let’s recap where we are right now using Komodor we explored the broken service we identified two bugs

7:38 one is that my awesome update in git which you were able to visualize and see right away uh potentially broke the PVC

7:45 claim name which we’re going to come back to I would assume I also highlighted that the cluster couldn’t

7:52 schedule our pod and you went to the node dashboard and identified that the node

7:57 was cordoned and you were able to uncordon it directly from Komodor moving us past the scheduling problem however

8:05 we now have the node being not ready because of a potential issue with the CNI networking

8:12 plug-in yeah yeah like we can see that there are like I don’t know

8:18 like four different plugins that are installed CSI plugins that are installed

8:24 and C we looking for CNI not CSI sorry sorry sorry

8:29 maybe should I describe maybe the node what sorry

8:35 describe it looks like that we have the Cilium operator installed yeah uh in in this cluster yeah it might

8:44 be with the operator yeah there is maybe the CRDs of Cilium like there is the operator maybe

8:52 the Helm yeah it’s like using Helm oh this fail

8:59 fail deploy in here fail deploy yeah yeah so we can see there is Agent true

9:06 agent not ready as well minimum replica

9:11 unavailable yeah but it’s just just the operator itself

9:19 on the deploy let’s take a

9:24 look deployment version one

9:30 there is a spec affinity of like label match label Cilium

9:36 operator it’s the pod template that is UN unmatch the deployment do you think like

9:42 the relevant part in here maybe oh it’s it’s funny it’s running

9:48 and ready but it’s like the node is not ready it’s

9:55 always fun watching people fix a broken cluster

10:02 no maybe like look at the Helm dashboard no like in the Helm dashboard we can see

10:08 like the current is Cilium like we can see quite a lot on this like what annotation does this the

10:15 cler not no I think the not found yeah I just found it

10:24 exactly maybe let’s check if there is there is the cluster role and cluster

10:30 role binding in the cluster do do you mean like those like resources which are

10:36 not exist M I think we need to create something I’m not sure maybe let’s check

10:41 the log of the which is running no I think it’s like one the annotation right like he doesn’t find the annotation on

10:47 the node this why doesn’t inst it on it’s running on the

10:54 no so this may be a little bit harder to debug because I think I found a bug in

10:59 Komodor but try comparing the values from the release three to release

11:04 two okay obum yeah you it

11:13 okay so there are changes but they don’t actually show up here yeah maybe met

11:20 [Music] changes we have only the three version

11:27 we don’t have the second no we do have it we do of the operator

11:34 ah in the hand does only show changes doesn’t show anything no do two

11:41 and then compare with revision two it’s two

11:47 compared with revision three

11:53 no yeah I don’t know why it’s not showing the change for me show the change great then

12:00 manifest then compare with version two here when you do here’s the changes you

12:05 deleted the service account and all of those do Guy I will

12:14 do need to do the don’t have permission to I just

12:19 perform but well maybe it’s a permission thing

12:25 yeah I think the watcher doesn’t have a permission maybe for that m possible yeah let’s see if also here it

12:32 doesn’t have secret FS let’s do also roll back to the we can’t

12:39 we can’t we can’t our own agent we need the access to the to the cluster yeah so

12:47 we will use it so do all to our agent and then we’ll do to to the CNI okay

12:53 sorry we sh my screen I’ll stop here yes so we found out that we are missing permission inside Komodor and

13:00 it was installed without the possibility of like a roll

13:15 back

13:27 okay that’s

13:33 it I to

13:39 just okay you okay yeah that’s that’s

13:44 it

13:55 okay cool cool cool okay now let’s go back and check if the node is ready now

14:02 yeah he is is ready okay and now let’s

14:07 check out our so before we continue the upgrade to the Komodor I did in

14:13 Komodor because it turned on dashboard but I see that it removed the secret access which is probably why the values

14:19 didn’t show yeah reason okay I just wanted to make sure I understood what

14:25 happened there okay cool and so now the node is ready let’s go back to Services

14:32 only thing remaining is the version for the okay so we have a working node and

14:39 you fix the deploy nice work what we yeah now we need to roll it back so what

14:44 we we can’t roll it back because what so we need to edit like let’s edit it yeah I

14:49 think that I need to show because remember this is a GitOps pipeline so you might want me just to push a fix if

14:54 you can tell me what you want that fix to be so be reverse yeah let’s just check the

15:02 latest oh I don’t know how to fix it I mean I I just did awesome updates you don’t need to tell me how to fix

15:09 it so please your bad code yeah so yeah

15:18 let’s just git checkout to the like this revision if nothing else change in

15:24 between then that’s probably the easiest solution you check out with the ref for the

15:30 change are you doing the I have pushed an update to

15:36 git I’m sharing I’m sharing I’m sharing do you have like a pipeline that

15:42 know how to like it automatically deployed yes Flux CD is running in the cluster it will detect this change and

15:48 it will push it out we can speed up the process and I will do so now just so it’s a bit quicker

15:53 yeah so what we can see is that see in near there like the the PVC

16:01 change and we got some environment variables which can be missing and what David

16:07 changed yeah it’s only the PVC so maybe maybe we still miss those yeah so let’s

16:15 maybe start from the let’s wait for the rollout to happen

16:21 yeah we should see it in Komodor once the happen you can take look on the walk SPS

16:27 to see this still yeah yeah but it’s the previous one yeah it wasn’t

16:34 any so we looking for the new

16:41 one what so I that push the update however our GitOps pipeline is broken due to

16:47 the fourth break in the cluster so good luck so there’s another break maybe I’ll

16:55 go right like a let yeah let's check Argo is there Argo

17:02 Flux sorry source controller notification all of them look

17:09 healthy what do we check sorry

17:14 yeah but maybe it’s misconfigured or something like that seems like Flux is

17:20 working fine let’s check maybe logs of one of the workflows the controller or some other service the source controller

17:27 like the log message look

17:33 good maybe is it updated by source controller

17:40 or I think there is still problem with the like one of the pods are unhealthy

17:46 in the source controller yeah the Cilium operator is

17:51 pending scheduled because it didn’t match pod affinity rules if you go to the

17:58 walk on the

18:03 white click on the the operator okay it’s just because when you did the roll back I set the replicas to one because

18:11 we were a single node cluster so you can ignore that pending pod no take the first no he saying like

18:19 it’s it’s not the affinity no no it’s like you said like in the logs of the source controller yeah yeah it was

18:26 there lo there’s like message s artifa foric let me go back and then garbage

18:33 collected one artifact why did garbage collected it and then a lot of changes but why did the garbage

18:39 collected one artifact maybe it’s related to that I don’t know

18:46 yeah change like this is the change this is

18:52 what you mean right again yeah and then like one

18:57 afterward remove typo in PVC name yeah this is the

19:05 commit like d [Music]

19:10 question yeah but what does it mean let’s see if we got any warnings in

19:15 here or you can do like maybe

19:24 like so what happen is one point in the it it find out that there there was a

19:30 change M but for some reason the garbage

19:36 collected it we need to change something in Flux

19:44 yeah let’s check the configuration maybe it’s something about this configuration

19:50 CH yeah this by way in the kustomize controller it always failed the Flux CD

19:57 name is changed from system to and what is the name in the

20:07 log saw that yeah yeah okay so your rollback for Cilium actually fixed this

20:13 problem but there’s a 10 minute sync time on the kustomization so I’ve just encouraged it to run again

20:21 so so we don’t need to do anything as long as this kustomization runs no it’s

20:27 still failing it’s is in networking and cluster is not working yeah I don’t know if your roll back for Cilium fixed the

20:34 problem I think the roll back of Cilium didn’t no like there if you look at the logs of the kustomize controller there are

20:40 really bad logs there and it says that it failed on like HTTP failed calling webhook let me just show

20:50 that everyone can see yeah you consideration fail after second has the

20:55 cnpg service who is it name

21:07 rout the cnpg thing is I think the

21:20 network what is this service the cnpg yeah there is like one thing here I’m

21:27 looking at logs of the cnpg is it a pod it’s there is a pod but

21:35 like the latest message is like periodic TLS certificate

21:41 maintenance which I don’t really

21:51 know e on this series what was the in it doesn’t likeed like with the

21:58 relevant service basically yeah so let me give you context on that Cilium break right because you did a roll back but

22:04 you didn’t really identify what the problem was and what changed and uh I don’t want us to debug

22:11 something that you can’t have visibility into right now because of that secret values thing so in the Cilium Helm chart

22:18 what I did was disable the agent which is definitely rolled back because we can see the agent is now deployed next to the operator however I also disabled the

22:26 eBPF kube-proxy replacement and you may notice there’s no kube-proxy in this cluster so in the interest of not

22:34 debugging something that we’re not entirely sure if it’s been fixed or not I’m going to redeploy Cilium right now and assume the roll back hopefully fixed it

22:41 properly and if we still have an issue then I’m debugging with you because I’m not really sure what the problem will be

22:46 after that let let’s no maybe
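
Editor's sketch: comparing the values of two Helm releases, as David asked the team to do, comes down to a recursive diff of two nested mappings. The snippet below is a minimal illustration of that idea and not how Komodor or Helm compute their diffs. The key names in `release_2` and `release_3` are hypothetical stand-ins for the two settings David says he changed (the agent and the kube-proxy replacement); the real chart values are not shown in the video.

```python
from typing import Any


def diff_values(old: dict[str, Any], new: dict[str, Any], prefix: str = "") -> list[str]:
    """Return one line per key that was removed (-), added (+) or changed (~)."""
    changes: list[str] = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in new:
            changes.append(f"- {path}: {old[key]!r}")
        elif key not in old:
            changes.append(f"+ {path}: {new[key]!r}")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Recurse into nested sections so the full dotted path is reported.
            changes.extend(diff_values(old[key], new[key], path))
        elif old[key] != new[key]:
            changes.append(f"~ {path}: {old[key]!r} -> {new[key]!r}")
    return changes


# Hypothetical values for the two releases; key names are illustrative only.
release_2 = {
    "agent": True,
    "kubeProxyReplacement": True,
    "operator": {"replicas": 2},
}
release_3 = {
    "agent": False,
    "kubeProxyReplacement": False,
    "operator": {"replicas": 2},
}

if __name__ == "__main__":
    for line in diff_values(release_2, release_3):
        print(line)
```

A diff like this only helps if the tool can read the release values in the first place; in the video the values were hidden because the Komodor agent had lost access to secrets.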

22:55 worse it’s not that okay so my my update for Cilium has

23:02 triggered a redeploy of Cilium so the config map definitely changed so we may

23:08 be moving in to a better

23:19 situation yeah maybe delete the latest Cilium operator oh who can delete

23:25 things delete the Cilium operator

23:32 okay the previous one yeah

23:38 the the operator wait a sec the one that is

23:43 pending no not the one that is the other one what will happen yeah so go to the Cilium operator to

23:51 the are you sure yeah I’m going to delete the

23:56 old Cilium oh that’s a bold move I like it yeah

24:02 yeah we’re not playing around here you know so now the the new version is

24:09 running and should or we won’t have anyone there

24:14 running right it seem like it’s face SCH hey that

24:20 worked did it work yeah yeah well we had no doubts

24:26 about

24:32 seems like the new of the Cilium doesn’t I think that’s okay because he

24:37 has like two replica but now like it’s a new one that is running great so

24:44 now I can scale it to one

24:49 keep issues you know I’m scaling the the Cilium one no no I

24:56 think the no I think now it’s okay now let read the the logs of the Flux thingy

25:01 there the kustomize one I think right let’s the source I think the C oh the Drupal

25:08 is just syncing it is yeah you see it’sing wall yeah let see

25:15 it when in doubt delete Cilium operators fixes

25:21 everything oh now it’s healthy look on the Yeahs and I do like

25:27 a say okay let me share my screen and I’ll

25:33 test the website for you right moment

25:38 yeah and you understand like all you get is like Drupal working that’s

25:44 like that’s like the best scenario Drupal is running we have a

25:49 problem with our database configuration but maybe we don’t need it so in the interest of testing we can go port

25:56 forward not

26:03 do also

26:09 have okay so it’s almost working let’s see if we can actually open it in a

26:22 browser don’t be too happy

26:30 you try to save it now he’s going to try to use the

26:35 database so this shouldn’t actually be needed but the init script is unable to run for the same reason that this

26:42 command will fail oh no our Drupal instance is unable

26:49 to communicate with the Postgres database back over to you and this is the last

26:54 break because maybe the enir right it’s going to time out it cannot post

27:00 dle cannot speak to Postgres there we go temp failure DNS

27:05 resolution yeah back back to you last break there we go so it cannot resolve

27:12 the database and okay so let’s check the

27:19 events of the

27:24 everything elction network policy maybe you did a lot of network policy

27:31 changes indeed why did you do it the event you can see policy changes

27:42 and less network policy [Music] change

27:48 was I

27:54 scraping so we saw that there are a lot of network policy changes and it look like someone changed

28:01 the Untitled policy yeah there was a policy that prevent us for executing

28:09 request the cluster there is a policy type of igress so let’s try to take on

28:15 action and I mean what I love about Komodor here right is this the event log as a gold mine of information and you can see this

28:22 network policy was created in the last 24 hours it’s obviously well intended but you know mistakes are easy to make

28:28 in kubernetes very easy
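
Editor's sketch: the transcript does not show what the offending NetworkPolicy contained, so the following is only an illustration of one common way a well-intended policy causes DNS failures like the one above: it restricts egress without allowing port 53. The function checks a policy given as a dict in the shape of a Kubernetes NetworkPolicy manifest (`spec.policyTypes`, `spec.egress`, `ports` are real fields). The function name and the sample policies are hypothetical, and the check deliberately ignores destination selectors (`to`), so it is a rough first filter rather than a full policy evaluator.

```python
from typing import Any


def egress_allows_dns(policy: dict[str, Any]) -> bool:
    """Return True if the policy leaves UDP port 53 egress open for its pods."""
    spec = policy.get("spec", {})

    # If policyTypes is omitted, Egress applies only when egress rules exist.
    policy_types = spec.get("policyTypes")
    if policy_types is None:
        policy_types = ["Ingress"] + (["Egress"] if spec.get("egress") else [])
    if "Egress" not in policy_types:
        return True  # the policy does not restrict egress at all

    for rule in spec.get("egress") or []:
        ports = rule.get("ports")
        if not ports:
            return True  # a rule without ports allows every port
        for port in ports:
            protocol = port.get("protocol", "TCP")  # TCP is the API default
            if protocol == "UDP" and port.get("port") in (None, 53):
                return True
    return False


# Hypothetical policies for illustration.
deny_all_egress = {
    "kind": "NetworkPolicy",
    "spec": {"podSelector": {}, "policyTypes": ["Egress"]},
}
egress_with_dns = {
    "kind": "NetworkPolicy",
    "spec": {
        "podSelector": {},
        "policyTypes": ["Egress"],
        "egress": [{"ports": [{"protocol": "UDP", "port": 53}]}],
    },
}
ingress_only = {
    "kind": "NetworkPolicy",
    "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
}

if __name__ == "__main__":
    print(egress_allows_dns(deny_all_egress))
    print(egress_allows_dns(egress_with_dns))
    print(egress_allows_dns(ingress_only))
```

An ingress policy applied to the DNS pods themselves can produce the same symptom from the other side, which this sketch does not cover.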

28:34 then all right if you can stop sharing your screen I will give the application another spin I think we should be

28:40 sitting pretty now because I still have my port forward running if we remove the install script

28:48 yeah we’re holding the view and if we make

28:56 sure okay it completed 16 seconds ago the

29:03 database is now running oh I shouldn’t have to do this

29:10 but we run through it anyway that’s

29:16 it woo well done you fixed all the breaks on the cluster and Drupal is now working

29:23 as [Music] intended

29:29 so you know a small recap and then I’ll let you get back to your day right but that was a whole lot of fun for me right um I

29:36 actually found it really difficult to break the developer the consumer API of

29:42 kubernetes in a way that Komodor couldn’t show right up front what the problem was with the git integration the

29:49 diffs the Helm charts the node information even revealing all the labels and annotations everything was

29:55 just there in front of me and I think that’s just a superpower for people that have to operate kubernetes so I’ll thank you

30:00 all for your work it made it harder to break but I hope you enjoyed each of the breaks that were presented to you and uh

30:06 yeah any final remarks from anyone no it was super

30:13 fun
