An elite DevOps team from Komodor takes on the Klustered challenge; can they fix a maliciously broken Kubernetes cluster using only the Komodor platform? Let’s find out!
Watch Komodor’s co-founder and CTO, Itiel Shwartz, and two engineers, Guy Menahem and Nir Shtein, leverage the Continuous Kubernetes Reliability Platform they’ve built to showcase how fast, effortless, and even fun troubleshooting can be!
Below is an auto-generated transcript of the video:
0:00 so starting from left to right starting with you Guy could you please say hello and introduce yourself and share a little bit more if you wish yeah hey
0:07 everyone I’m Guy I’m a solution architect at Komodor I’m here for the last two years
0:12 very excited to join Klustered cool everyone my name is
0:19 Nir Shtein I’m a software engineer I came a week after Guy and
0:25 it and hey everyone my name is Shwartz I’m the CTO of Komodor watch and
0:30 Klustered happy to be here all right thank you all very much so this is a
0:35 special edition of Klustered and I have broken the cluster personally myself with a handful of breaks to hopefully
0:42 reveal and show us the power of Komodor as a tool for understanding and debugging problems within your
0:47 cluster so with that being said I will now let Guy share his screen and we will begin the debugging process best of luck
0:54 team Komodor thank you let’s let’s start cool there you go so we got our
1:00 cluster with a Komodor install so I will consider this cluster
1:07 fixed if we can port forward and visit the Drupal website that is your
1:13 mission okay sounds easy okay easy
1:18 right you know the reason is it’s scheduling yeah
1:24 so we can go into like the service in Komodor and see the timeline right like how did the service change over
1:30 time and you can see that it’s currently broken and you can try and understand why is it broken so why is it broken I
1:37 see there is a deploy like in 15 25 open
1:43 so do you mean like oh to zoom in okay there you go I think it’s a great story
1:49 of what happened yeah and we can see there was a change in GitHub yeah there was a like an awesome
1:57 update update was it an awesome update we will
2:03 not and here David remove some environment variables I think they’re
2:08 crucial and change the CL even worse cool so let’s take a look about
2:14 the problem let’s see if they are like connected to one another oh so we definitely can see that the volume the
2:21 PVC not found it’s definitely the problem with the p let’s try to do the roll back right yeah let’s click roll back
2:29 you don’t have permission to do it because it’s your cluster
2:35 David it’s also a GitOps cluster so the roll back wouldn’t actually work it would be overwritten a few minutes later
2:41 however the first you’ve discovered uh essentially the second problem there is
2:48 something that’s more current with this deployment but you yes we will have to fix this too so let’s you mentioned the
2:55 rate at the start somebody said scheduling right let’s let’s look at the pods Guy
3:02 the pods yeah let’s look
3:08 at pending other than the previous thing let SCH not ready yeah let’s go back we
3:15 have information in here by the way like on the first scheduling of the event would be much easier to see from
3:22 here ah I think that one node is not available
3:28 no maybe let’s check the pods list first yes let’s try check on the resources
3:35 node and see that there is a node that is not ready and scheduling disabled maybe
3:41 let’s try to uncordon it it’s cordoned so yeah I don’t think we will
3:47 have it maybe let’s go to the terminal try to fix it from the terminal you add yourself as a yeah as a
3:56 user try to add my my person
4:01 do you want you to I okay so we’re doing like a switch just because of security
4:08 permissions and because David created the cluster and the account basically H then we need to give like to add another
4:16 team member to the Komodor account so David if you can invite Nir it can be
4:23 great yes so let’s go to the nodes let’s take
4:28 the action and
4:37 uncordon and also let’s do the roll back everything at the same time let’s
4:44 do so now Nir is like rolling back the service oh maybe the first deploy was
4:51 not in Komodor that’s why I didn’t say
4:58 that try yeah no
5:04 fory yeah we can also take the changes from the GitHub and we can also take the
5:10 changes from Komodor right and yeah from GitHub I think the service was deleted and then reinitiated
5:17 right like it’s generation one it basically means that like David played with the service apparently then he
5:25 deleted the deployment then he recreated the deployment with the fil yeah with the failed configuration and that is the
5:32 reason that we can’t roll back because for us it’s a new kind of service that
5:39 it has a different unique ID and this is the first generation of the new Pro
5:45 deployment so we need to roll out this workload yeah is the nodes okay now
5:53 let’s check it yeah they are okay do you like check it on your screen or no no I’m asking I I can no it’s still not
6:01 ready ready why is that and container network is not
6:11 [Music]
6:16 ready it looks like more like a like a k3s issue maybe yeah we we need
6:24 the CNI plugin to be ready can we check maybe on the service as what is
6:30 configure do the network plugin maybe let’s try to take a look in the
6:46 YAML the network unavailable okay so what’s the reason CNI is not
6:58 initialized it’s k3s right no these are bare metal
7:04 kubeadm clusters bare metal
7:11 here
7:16 um maybe those are the things this is a 48 core 64 gig ram bare
7:24 metal machine okay okay so you can have some fun with it right
7:32 okay so let’s recap where we are right now using Komodor we explored the broken service we identified two bugs
7:38 one is that my awesome update in Git which you were able to visualize and see right away uh potentially broke the PVC
7:45 claim name which we’re going to come back to I would assume I also highlighted that the cluster couldn’t
7:52 schedule our pod and you went to the node dashboard and identified that the node
7:57 was cordoned and you were able to uncordon it directly from Komodor moving us past the scheduling problem however
8:05 we now have the node being not ready because of a potential issue with the CNI networking
8:12 plug-in yeah yeah like we can see that there are like I don’t know
8:18 like four different plugins that are installed CSI plugins that are install
8:24 and C we’re looking for CNI not CSI sorry sorry sorry
8:29 maybe should I describe maybe the node what sorry
8:35 describe it looks like that we have the Cilium operator installed yeah uh in in this cluster yeah it might
8:44 be with the operator yeah there is maybe the CRDs of Cilium like there is the operator maybe
8:52 the Helm yeah it’s like using Helm oh this fail
8:59 fail deploy in here fail deploy yeah yeah so we can see there is Agent true
9:06 agent not ready as well minimum replica
9:11 unavailable yeah but it’s just just the operator itself
9:19 on the deploy let take a
9:24 look deployment version one
9:30 there is a spec Affinity of like label match label cilium
9:36 operator it’s the pod template that is UN unmatch the deployment do you think like
9:42 the relevant part in here maybe oh it’s it’s funny it’s running
9:48 and ready but it’s like the node is not ready it’s
9:55 always fun watching people fix a broken cluster
10:02 no maybe like look at the Helm dashboard no like in the Helm dashboard we can see
10:08 like the current isum like we can see quite a lot on this like what annotation does this the
10:15 cler not no I think the not found yeah I just found it
10:24 exactly maybe let’s check if there is there is the cluster role and cluster
10:30 role binding in the cluster do do you mean like those like resources which are
10:36 not exist M I think we need to create something I’m not sure maybe let’s check
10:41 the log of the which is running no I think it’s like one The annotation right like he doesn’t find The annotation on
10:47 the Node this why doesn’t inst it on it’s running on the
10:54 no so this may be a little bit harder to debug because I think I found a bug in
10:59 Komodor but try comparing the values from the release three to release
11:04 two okay obum yeah you it
11:13 okay so there are changes but they don’t actually show up here yeah maybe met
11:20 [Music] changes we have only the three version
11:27 we don’t have the second no we do have it we do of the operator
11:34 ah in the hand does only show changes doesn’t show anything no do two
11:41 and then compare with revision two it’s two
11:47 compared with revision three
11:53 no yeah I don’t know why it’s not showing the change for me show the change great then
12:00 manifest then compare with version two here when you do here’s the changes you
12:05 deleted the service account and all of those do Guy I will
12:14 do need to do the don’t have permission to I just
12:19 perform but well maybe it’s a permission thing
12:25 yeah I think the Watcher doesn’t have a permission maybe for that m possible yeah let’s see if also here it
12:32 doesn’t have secret FS let’s do also roll back to the we can’t
12:39 we can’t we can’t our own agent we need the access to the to the class yeah so
12:47 we will use it so do all to our agent and then we’ll do to to the seni okay
12:53 sorry we sh my screen I’ll stop here yes so we found out that we are missing permission inside Komodor and
13:00 it was installed without the possibility of like a roll
13:15 back
13:27 okay that’s
13:33 it I to
13:39 just okay you okay yeah that’s that’s
13:44 it
13:55 okay cool cool cool okay now let’s go back and check if the node is ready now
14:02 yeah he is is ready okay and now let’s
14:07 check out our so before we continue the upgrade to the Komodor I did in
14:13 Komodor because it turned on dashboard but I see that it moved the secret access which is probably why the values
14:19 didn’t show yeah reason okay I just wanted to make sure I understood what
14:25 happened there okay cool and so now the node is ready let’s go back to Services
14:32 only thing remaining is the version for the okay so we have a working node and
14:39 you fix the deploy nice work what we yeah now we need to roll it back so what
14:44 we we can’t it back because what so we need to edit like let’s edit it yeah I
14:49 think that I need to show because remember this is a GitOps pipeline so you might want me just to push a fix if
14:54 you can tell me what you want that fix to be so be reverse yeah let’s just Che the
15:02 latest oh I don’t know how to fix it I mean I I just did awesome updates you don’t need to tell me how to fix
15:09 it so please your bad code yeah so yeah
15:18 let’s just git checkout to the like this revision if nothing else change in
15:24 between then that’s probably the easiest solution you check out with the ref for the
15:30 change are you doing the I have pushed an update to
15:36 Git I’m sharing I’m sharing I’m sharing do you have like a pipeline that
15:42 know how to like it automatically deployed yes Flux CD is running in the cluster it will detect this change and
15:48 it will push it out we can speed up the process and I will do so now just so it’s a bit quicker
15:53 yeah so what we can see is that see in near there like the the PVC
16:01 change and we got some environment variables which can be missing and what David
16:07 changed yeah it’s only the P so maybe maybe we still miss those yeah so let’s
16:15 maybe start from the let’s wait for the rollout to happen
16:21 yeah we should see it in Komodor once the happen you can take look on the walk SPS
16:27 to see this still yeah yeah but it’s the previous one yeah it wasn’t
16:34 any so we looking for the new
16:41 one what so I did push the update however our GitOps pipeline is broken due to
16:47 the fourth break in the cluster so good luck so there’s another break maybe I’ll
16:55 go right like a let yeah let's check Argo is there Argo
17:02 Flux sorry Source control notification all of them look
17:09 healthy what do we check sorry
17:14 yeah but maybe it’s misconfigured or something like that seems like Flux is
17:20 working fine let’s check maybe logs of one of the workflows the controller or some other service The Source controller
17:27 like the log message look
17:33 good maybe is it updated by Source controller
17:40 or I think there is still problem with the like one of the pods are unhealthy
17:46 in the source controller yeah the Cilium operator is
17:51 pending scheduled because it didn’t match pod affinity rules if you go to the
17:58 walk on the
18:03 white click on the the operator okay it’s just because when you did the roll back I set the replicas to one because
18:11 we were a single node cluster so you can ignore that pending pod no take the first no he saying like
18:19 it’s it’s not the affinity no no it’s like you said like in the logs of the source control yeah yeah it was
18:26 there lo there’s like message s artifa foric let me go back and then garbage
18:33 collected one artifact why did Garbage collected it and then a lot of changes but why did the garbage
18:39 collected one artifact maybe it’s related to that I don’t know
18:46 yeah Chang like this is the change this is
18:52 what you mean right again yeah and then like one
18:57 afterward remove typo in PVC name yeah this is the
19:05 commit like d [Music]
19:10 question yeah but what does it mean let’s see if we got any warnings in
19:15 here or you can do like maybe
19:24 like so what happen is one point in the it it find out that there there was a
19:30 change M but for some reason the garbage
19:36 collected it we need to change something in FL
19:44 yeah let’s check the configuration maybe it’s something about this configuration
19:50 CH yeah this by way in the kustomize controller it always failed the Flux CD
19:57 name is changed from system to and what is the name in The
20:07 Log saw that yeah yeah okay so your rollback for Cilium actually fixed this
20:13 problem but there’s a 10 minute sync time on the kustomization so I’ve just encouraged it to run again
20:21 so so we don’t need to do anything as long as this kustomization runs no it’s
20:27 still failing it’s is in networking and cluster is not working yeah I don’t know if your rollback for Cilium fixed the
20:34 problem I think the rollback of Cilium didn’t no like there if you look at the logs of the kustomize controller there are
20:40 really bad logs there and it says that it failed on like HTTP failed call in web let me just show
20:50 that everyone can see yeah you consideration fail after second has the
20:55 cnpg service who is it name
21:07 rout the cnpg thing is I think the
21:20 network what is this service the cnpg yeah there is like one thing here I’m
21:27 looking at logs of the cnpg is it a p it’s there is a pod but
21:35 like the latest message is like periodic TLS certificate
21:41 maintenance which I don’t really
21:51 know e on this series what was the in it doesn’t likeed like with the
21:58 relevant service basically yeah so let me give you context on that Cilium break right because you did a rollback but
22:04 you didn’t really identify what the problem was and what changed and uh I don’t want us to debug
22:11 something that you can’t have visibility into right now because of that secret values thing so in the Cilium Helm chart
22:18 what I did was disable the agent which is definitely rolled back because we can see the agent is now deployed next to the operator however I also disabled the
22:26 eBPF kube-proxy replacement and you may notice there’s no kube-proxy in this cluster so in the interest of not
22:34 debugging something that we’re not entirely sure if it’s been fixed or not I’m going to redeploy Cilium right now and assume the rollback hopefully fixed it
22:41 properly and if we still have an issue then I’m debugging with you because I’m not really sure what the problem will be
22:46 after that let let’s no maybe
22:55 worse it’s not that okay so my my update for Cilium has
23:02 triggered a redeploy of Cilium so the config map definitely changed so we may
23:08 be moving in to a better
23:19 situation yeah maybe delete the latest Cilium operator oh who can delete
23:25 things delete the Cilium operator
23:32 okay the previous one yeah
23:38 the the operator wait a sec the one that is
23:43 pending no not the one that is the other one what will happen yeah so go to the Cilium operator to
23:51 the are you sure yeah I’m going to delete the
23:56 old Cilium oh that’s a bold move I like it yeah
24:02 yeah we’re not playing around here you know so now the the new version is
24:09 running and should or we won’t have anyone there
24:14 running right it seem like it’s face SCH hey that
24:20 worked did it work yeah yeah well we had no doubts
24:26 about
24:32 seems like the new of the Cilium doesn’t I think that’s okay because he
24:37 has like two replica but now like it’s a new one that is running great so
24:44 now I can scale it to one
24:49 keep issues you know I’m scaling the the c one no no I
24:56 think the no I think now it’s okay now let read the the logs of the Flux thingy
25:01 there the kustomize one I think right let’s The Source I think the C oh the Dr
25:08 is just syncing it is yeah you see it’sing wall yeah let see
25:15 it when in doubt delete Cilium operators fixes
25:21 everything oh now it’s healthy look on the Yeahs and I do like
25:27 a say okay let me share my screen and I’ll
25:33 test the website for you right moment
25:38 yeah and you understand like all you get is like Drupal working that’s
25:44 like that’s like the best scenario Drupal is running we have a
25:49 problem with our database configuration but maybe we don’t need it so in the interest of testing we can go port
25:56 forward not
26:03 do also
26:09 have okay so it’s almost working let’s see if we can actually open it in a
26:22 browser don’t be too happy
26:30 you try to save it now he’s going to try to use the
26:35 database so this shouldn’t actually be needed but the net script is unable to run for the same reason that this
26:42 command will fail oh no our Drupal instance is unable
26:49 to communicate with the Postgres database back over to you and this is the last
26:54 break because maybe the enir right it’s going to time out it cannot post
27:00 dle cannot speak to Postgres there we go temp failure DNS
27:05 resolution yeah back back to you last break there we go so it cannot resolve
27:12 the database and okay so let’s check the
27:19 events of the
27:24 everything elction Network policy maybe you did a lot of network policy
27:31 changes indeed why did you do it the event you can see policy changes
27:42 and less Network policy [Music] change
27:48 was I
27:54 scraping so we saw that there are a lot of network policy changes and it look like someone changed
28:01 the Untitled policy yeah there was a policy that prevent us for executing
28:09 request the cluster there is a policy type of igress so let’s try to take on
28:15 action and I mean what I love about Komodor here right is the event log is a gold mine of information and you can see this
28:22 network policy was created in the last 24 hours it’s obviously well intended but you know mistakes are easy to make
28:28 in Kubernetes very easy
28:34 then all right if you can stop sharing your screen I will give the application another spin I think we should be
28:40 sitting pretty now because I still have my port forward running if we remove the install script
28:48 yeah we’re holding the view and if we make
28:56 sure okay it completed 16 seconds ago the
29:03 database is now running oh I shouldn’t have to do this
29:10 but we run through it anyway that’s
29:16 it woo well done you fixed all the breaks on the cluster and Drupal is now working
29:23 as [Music] intended
29:29 so you know a small recap and then I’ll e get back up day right but that’s was a whole lot of fun for me right um I
29:36 actually found it really difficult to break the developer the consumer API of
29:42 Kubernetes in a way that Komodor couldn’t show right up front what the problem was with the Git integration the
29:49 diffs the Helm charts the node information even revealing all the labels and annotations everything was
29:55 just there in front of me and I think that’s just superow for people that have to operate Kubernetes so I’ll thank you
30:00 all for your work it made it harder to break but I hope you enjoyed each of the breaks that were presented to you and uh
30:06 yeah any final remarks from anyone no it was super
30:13 fun
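The scheduling break recapped in the transcript (a cordoned node that was also not ready until the CNI plugin came up) can be illustrated with a small Python sketch. This is a simplified model, not the Kubernetes scheduler or the Komodor API: the `Node` class, `schedulable_nodes`, and `uncordon` are hypothetical names introduced only for illustration, and the real scheduler handles not-ready nodes through taints rather than a single flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a Kubernetes node."""
    name: str
    unschedulable: bool = False  # True after the node has been cordoned
    ready: bool = True           # False e.g. while the CNI plugin is not initialized


def schedulable_nodes(nodes):
    """Return the names of nodes that can accept new pods in this model."""
    return [n.name for n in nodes if not n.unschedulable and n.ready]


def uncordon(node):
    """Clear the unschedulable flag, mirroring what an uncordon action does."""
    node.unschedulable = False


# Replay the order of events from the session on a single-node cluster.
node = Node("node-1", unschedulable=True, ready=False)
print(schedulable_nodes([node]))  # cordoned and not ready

uncordon(node)
print(schedulable_nodes([node]))  # uncordoned, but the network is still down

node.ready = True                 # the CNI fix brings the node to Ready
print(schedulable_nodes([node]))
```

The sketch only shows why uncordoning alone did not unblock the pod in the session: both conditions had to clear before the node could take new workloads.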