You can also view the full presentation deck here.
Moderator (Julio): Good morning, good afternoon or good evening, depending on where you are in the world. My name is Julio Godinez, and welcome to today’s DevOps.com webinar, “Why DevOps Tools Do Not Speak Developer Language (And How To Overcome This)” brought to you by Komodor. We have a great webinar for you today. But before we get started, I need to go through some housekeeping announcements. Today’s event is being recorded. So, if you miss any part of the webinar, you will be able to watch it again. We will be sending out a link to access the webinar on demand, or you can visit devops.com/webinar, and it will be available for you as well. We are taking questions from the audience throughout the presentation. Use your webinar interface to submit questions into the Q&A section, and we will try to get to as many as possible at the end. Finally, stick around until the end because we are doing a drawing for four $25 Amazon gift cards, so stay tuned to see if you’re a winner. Joining me today is Baruch Sadogursky, Head of DevOps Advocacy at JFrog, and Itiel Schwartz, CTO and Co-founder of Komodor. I’m now going to put myself on mute, turn off my camera, and let you begin.
Baruch: Alright. Hello, and welcome, everybody. It’s great to be here, and it’s great to be here with Itiel. We’ve already been introduced. I’m the Head of DevOps Advocacy with JFrog and my job is helping developers be more productive with DevOps, which is exactly the topic of today’s webinar. And here with me is Itiel, CTO and Co-founder of Komodor. And Itiel, the mic is yours. And you’re on mute. Still on mute. Oh, don’t you love those webinars? Nope. Best start ever. Okay, while Itiel is restarting, hopefully that will help. Oh.
Itiel: Yeah.
Baruch: Hey. Here we go.
Itiel: Did it work?
Baruch: Yes. But now we don’t see you. We only hear you. But oh… yes.
Itiel: Okay. Is everything working now? Can you hear me and see me correctly?
Baruch: Yes.
Baruch: It’s amazing.
Itiel: All of the preparation before the webinar, and this is what we get. So, hello, everyone. Sorry for the technical glitch. My name is Itiel. I am the CTO and Co-founder of Komodor. We are building the first Kubernetes-native troubleshooting platform. The goal of Komodor is to help Dev and DevOps teams troubleshoot issues in Kubernetes more easily, without the hassle and complexity that current solutions involve. So, this is a little bit about Komodor. Maybe we’ll go and do a deeper dive throughout the presentation. So, I think we’ll get started, right?
Baruch: Yeah.
Itiel: So, I think I don’t need to tell anyone here that today’s world looks very, very different from previous years. Deployment frequencies are just going up. We see GitOps taking off in almost every organization and we see how the KPI of the mean time to deploy is just getting smaller and smaller. So, in this new and exciting world, we see that Kubernetes has become the de facto container orchestrator. But more than that, it’s not only the main container orchestrator, it has become the default platform for deploying applications in the cloud. Even with a small rise of Lambda and serverless, we see how, for most organizations, most of the heavy workload is done in Kubernetes. Or if it’s not in Kubernetes yet, it is going to be in the future. Baruch, should I continue from here or do you want me to…?
Baruch: Yeah, no, no. I mean, so far, you are on point. This is one of the diagrams from the State of DevOps report. And obviously, this curve is what most of us experience. When we start automation, we get this little drop in the curve. Then we increase the test automation, we do more stuff with the machine, things are falling through the cracks just because it’s not something that we are used to. But then we kind of get better with that, and eventually, our productivity and the quality of our software surpass what we could ever have imagined doing manually. So, yes, the challenges are there because of the complexity, and because we do things differently now.
Itiel: Yeah, completely. So, I’m going to start the discussion part with the first question out of a couple that we prepared, and feel free to ask questions yourself. Who should own the troubleshooting process in the age of Kubernetes? Let’s say we are a cloud native startup or company, everything is in Kubernetes, we are fully CI/CD, we are an elite performer in the DevOps space – and now the question arises, who should own troubleshooting? In the middle of the night, when a PagerDuty alert goes off – who should do the on-call process? Is it the developer? Is it the DevOps? Do I really need DevOps? Or maybe the SRE or platform observability team? Or some other name for a guy or girl who mainly handles the infrastructure-related tasks and expertise? So…
Baruch: And given that DevOps… yeah, considering that DevOps is not a person, it’s rather a bunch of practices, and that’s the middle arrow that does not exist. DevOps is not a person. So, we end up with a binary choice. It’s either the developers or the infrastructure folks. I think that the answer here is very clear, it is the infrastructure folks, mostly for 2 reasons. The first is they are the ones on call. This is kind of part of their job description. They’re there for this exact scenario, when things go wrong at night. And so, it makes sense that they are the first to start the troubleshooting process and to look at it. And second, frankly, between you and me, most of the time, it’s somewhere there in the infrastructure that something happened and broke things. So, given that this is where most of the time the problem is, I think it makes sense.
Itiel: Interesting. I’m not sure that I completely buy the fact that it’s mostly the infrastructure. I will say that, most of the time, the issue is a change that happened somewhere across the system. Someone, somewhere, deployed something. Now, all of a sudden, Facebook is down, or we have a lot of issues in our system. And in the GitOps world, or in a world where most deploys are not done by Ops, it’s not really related to Ops. Like, all I did is commit to master, and now all of a sudden production is deployed.
And I really believe in a “You deploy it, you own it, until proven otherwise” mindset. So, I think the main reason that the Ops people were responsible for troubleshooting in the old days was that they were the ones moving the code from the Dev environment into the production side. But now, if the developer is responsible, why should I, as an SRE, wake up and resolve this? If you wrote bad code, it doesn’t make any sense. “How can I help you? It’s your code, you should fix it, no?”
Baruch: So, I guess we are here at the essence of DevOps right now. We talk about this collaboration, existing or non-existing, between the 2 parts of the house: the Ops, who are the de facto on-call team and observe and monitor the systems, go like, “Well, we have no idea what to do here. Looks like the code is broken. And we have no clue what to do.”
Itiel: Yeah, yeah. I think we see more and more of this, mainly as we see the ratio between the Dev and operations people. Operations can be SREs or infrastructure, it really doesn’t matter. However, at the end of the day, in most organizations, we see it’s like 10 to 1, and finding a good operations person who knows Kubernetes and knows what is really going on is super hard. So, I think at least some quite large chunks of the troubleshooting process should move from the SRE to the Dev team. Like, a “you build it, you own it, you troubleshoot it” kind of mentality.
And I think that, moving to a more solid infrastructure can help developers move much faster. It is unavoidable that some part of the process will be routed to the dev, which will probably need to be on-call. We do see this already in a lot of modern organizations. Okay, so let’s move to question 2. Baruch, do you want to say something before we move ahead?
Baruch: Yeah, no. I think we highlighted here the problem that the infrastructure folks are still the ones in charge but most of the time, or some of the time, they don’t have the right answers.
Itiel: Yep. Yeah, I can tell you, by the way, as an anecdote – Komodor is a Kubernetes troubleshooting platform and we sell to a lot of customers. We talk with SREs, we talk to the developers, we ask them like, “What are your pains? How can Komodor help you?” and we’re happy to help companies overcome those challenges. One of the biggest requests, however, comes from the Dev team. We keep on hearing the SREs telling us, “If all you do is allow the developer to solve 20% of the issues without talking with me, it would be amazing. This is all you need to do, and I will sleep better, I will eat better, I will look better.” This is everything they really need to make their lives easier. And I’m happy to say that we do provide this value, but it’s so apparent that they don’t really want to troubleshoot as they used to. Yeah. Okay, so now let’s go to question number 2, “What is blocking organizations from shifting left?” So, Baruch, any answer?
Baruch: Yeah, yeah, of course. So, you can see here the answers, the obvious ones. The one that is pretty trivial to solve is the lack of Kubernetes knowledge: you go and you learn Kubernetes. Lack of permissions and access – that’s again a cultural thing. It’s about trusting your workforce. Basically, it’s about trusting the engineers inside your organization. That alone results in a whole new discussion about how security should be done in the age of DevOps, and whether it should change from what we used to do. But that’s a discussion for a different time.
And the 2 bottom ones, I think, are more interesting in regards to our discussion. The lack of cultural mindset – this is exactly what we spoke about on a previous slide. This is exactly the part where the people on call, the infrastructure folks, are in charge of whatever happens, and they don’t even know how to start and shift left to the developers. They don’t even know how to go ahead and kind of pass the baton to the other side of the house.
And even when this is solved on the mindset level, when we say, “Okay, developers should participate,” and the other side understands that as well, “Yes, we want to help. Wake us up when something breaks down at night, or we will sit with you in the war room where the problem happened,” then the next question should be what we as developers can do. We sit there, we have our IDE open, we have our code, now what do we do? How can we participate in the process? And how can we contribute, when we are ready to take this responsibility?
Itiel: Yeah – some developers (I don’t want to say everyone) – they don’t really respect the first level. They don’t want that ability to troubleshoot. Because when you need to troubleshoot when you’re on call, then it means that you’re on call. It means that they will call you during holidays. You will be called in the middle of the night. I’ve done on call for most of my career. It’s not fun. No one loves to drop everything he’s doing and basically try and save the day. It’s not a fun activity. And I think this is, again, one of the first issues.
I think the lack of permission and access is the most interesting one. I think that we will talk about the cultural mindset and lack of the developer-friendly tools on the next slides. But the lack of permission is super interesting, because I’m talking with another organization and asking them, “What is your troubleshooting process? Like, the system breaks down – does that happen to you?” “Yeah, sure. It happens a lot.” “Okay, great. And who handles it?” “The DevOps or the SRE.” “Okay. Why can’t the developer troubleshoot the issue himself?” And then they are like, “Yeah, but they can’t access the logs in production.” “What?” “No, like, you can see all of the resources.” And I said, “What? You want him to deploy, you trust him in a fully CI/CD manner, he can change and basically crash everything that is currently happening in production on one hand, but on the other hand, you’re telling me that you can’t let him read the logs because it’s too sensitive and it’s not a possibility?”
I think it’s madness in a lot of organizations – because on the one hand, they trust the developers so much, but when it comes down to, “Can he view the logs? Can he revert? Can he view the AWS console?” the answer is no. If an issue happens and they’re wondering “How do I fix an issue?” they need to run a revert script. And when I asked them, “Who has the permission to run the script?” it’s only the DevOps. So, the developer can deploy a new version, as if this is not an issue. But once you need to run a script that changes production, then only 3 people in the organization are allowed to do it. And they’re like, “Yeah, yeah, it is the case. It is the case.”
I think the lack of permission and access is the easiest thing to handle and overcome. But even with that in mind, it’s not really happening. I do see a lot of organizations that give all of those permissions. But sometimes, the DevOps guard them so, so tightly and close to the chest, that it is quite absurd. I was just talking the other day with an SRE, and he’s like, “Yeah, I’m the only one who can restart pods in production. Developers can deploy code, they can change things, they can change feature flags that might crash everything. But yeah, like restarting a pod or like deleting a pod, no – I’m the only one who can do it. Yeah, it’s really problematic. And I wish somehow I could trust them.” And I think this is an old mentality of, “Who owns the proper troubleshooting process and how do we troubleshoot?” In my opinion, it comes down to the essence of both of them.
Baruch: Yeah. And that’s a discussion about security access. And it’s also a cultural thing. But it’s different. And it’s there because of the old ways that security is being done, from an auditing perspective, from a lock-everything-down perspective.
Itiel: Yep, yep. Exactly. Let’s move to the next one.
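To make the permissions point above concrete, here is a minimal sketch of what granting developers read access to pods and logs, plus the ability to delete (and so restart) a pod, might look like as a namespaced Kubernetes RBAC Role. The role name, the namespace and the helper function are hypothetical and not from the webinar; the sketch only assumes the standard `rbac.authorization.k8s.io/v1` Role schema.

```python
import json


def build_dev_troubleshooting_role(namespace: str, name: str = "dev-troubleshooter") -> dict:
    """Return a Kubernetes RBAC Role manifest as a plain dict.

    The role name and namespace are hypothetical examples.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                # Read-only access to pods and their logs.
                "apiGroups": [""],
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            },
            {
                # Deleting a pod that is managed by a controller (for example
                # a Deployment) causes a replacement pod to be created.
                "apiGroups": [""],
                "resources": ["pods"],
                "verbs": ["delete"],
            },
        ],
    }


if __name__ == "__main__":
    # Print the manifest as JSON, which kubectl also accepts.
    print(json.dumps(build_dev_troubleshooting_role("payments"), indent=2))
```

A Role like this only takes effect once it is bound to users or groups with a RoleBinding, which is left out here to keep the sketch short.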
Baruch: Yeah. Go ahead Itiel. Yeah. That happened again, can’t hear you. Yep. Nope. Alright. So we get DevOps in terms of culture. We know that it’s all the way, we shifted left, for example, deployment. We shifted left deployment, and we know that the developers now, once they commit, it goes all the way to production by itself. And this is great. We do not commit the code anymore and go home, like we have here in this meme that, “Worked fine in Dev, and now it’s the Ops problem.” And actually, when we do the deployment, when we create stuff, we take ownership as developers. It goes all the way to production.
But when it comes to incident response, suddenly, our gap is there again. In the incident response cycle, we suddenly have those silos again. Suddenly, the Ops own the process, and they have the access, and they have everything. And on the other side, the developers are, in the worst case, not there, and in the best case they are there, but really powerless or helpless to participate in the process.
So, we made very good progress in DevOps when it comes to deploying new code. But it feels like, in a lot of places, we’re still there in the old ways when it comes to incident response. And the question is why?
Itiel: Yeah. Can you hear me now, Baruch?
Itiel: I can and I can also answer the why part. We talked about it in earlier slides, but it’s a combination of 2 things. If everyone who is deploying to production is the DevOps, it makes sense that he’s going to have the mindset, when it comes to incident response, that he’s responsible for the change. He is responsible to check the change. He should troubleshoot the issue. It makes total sense. On the other hand, in organizations where those types of tasks were taken from the Ops and moved to the Dev, I think the technical part is maybe easier.
I think the main issue with the culture gap is it takes time for people to change. It’s as simple as that. They’re used to being the heroes. A lot of times, operations used to be the cowboys that storm in and save the day. And for some of them, it’s harder to let go of the power of like being the heroes and saving the day.
I think, when you talk with the SRE or DevOps, you see 2 different kinds of personas. I’m going to be very stereotypical, but there are the ones who are more old-fashioned, “I should own everything. I should authorize everything. I should troubleshoot,” and so on. And you see the new people, they’re like, “Yeah, I don’t want to do it. I don’t want to wake up in the middle of the night. I’m getting the same salary. I will sleep like a baby. And it’s great.”
And we see 2 really big differences in the personas – the ones who are owning all of the troubleshooting processes and the guys that are telling us, “I don’t know, like, everything is down. And the developers, I gave them so many tools and so much help, and they still failed to do it.” And then the SRE/DevOps think, “Yeah, like, I wake up in the middle of the night, but the devs, they have no idea. Can they understand Kubernetes? Do they even know what an Ingress is? How can I trust them with the keys to troubleshoot when production is down?”
And I think it’s very much the state of mind. When you look at new companies or new people that are coming and joining the field and are becoming an SRE or DevOps, you see people with more of a Dev mentality of, “Let’s automate everything. Let’s move it from us.” And I think it’s a cultural thing that, as time progresses, we shouldn’t see in 10 years from now. I hope so at least, but this is how I see it.
Baruch: Yep.
Itiel: Okay. Okay. So, now, we talked a lot about the cultural issues and what the current state is and who should own it. But let’s say we have an issue, which is in production, and the developer for some reason is the one responsible for the troubleshooting part. So, now there is the question, maybe it’s not true, well, I’ll be happy to hear your thoughts, “Why do current DevOps tools not speak developer language?” Or maybe they do. So, Baruch, what’s your take on this question?
Baruch: Yeah. No, they definitely don’t. And I already mentioned it when we spoke about it in the intro. Even when we solve the cultural problems that we spoke about in the previous slides, at the end of the day, the developers feel powerless when it comes to the existing tools that they have in order to respond to incidents.
The answer as to why this is the case lies in the history and the culture that we spoke about previously. Incident management is seen as an infrastructure related topic, therefore the tools naturally follow the culture, meaning, they become all very infrastructure-oriented. However, when the culture changes, which is what we are seeing now, there is this gap.
Itiel: Yep. Yeah, the way I see it, when an SRE is troubleshooting an issue, he wants to understand everything that is happening in the system. He doesn’t really care about “how is my service acting in Kubernetes?” It’s such a small part of everything he is responsible for, meaning, he wants to get the macro-level view all of the time. He wants a big dashboard telling him everything is okay – and if not, he wants the ability to understand the high-level view of what is happening.
When you look at tools such as Datadog or New Relic, which are great tools (we’re using them internally, by the way), they have a very macro-level view in mind. They are like, “Yes, see everything. Understand everything. Build your own dashboard.” They have already built a great system that allows you to get a very, very good high-level overview of what is currently happening in your system, and enables you to monitor it.
So, this is like, on the one hand, what those tools are made for and the result. When you think about a developer that is troubleshooting an issue, he knows a very, very small portion of the organization. He’s not responsible, for example, for Kafka stability, the network layer or the firewalls. Like, he just doesn’t really care. He knows only one particular service he is working on. Maybe he knows a couple of other services that this specific service is talking with. He may know the load balancer, but that’s pretty much it.
So, this developer, when he troubleshoots an issue, he doesn’t want all of this great power and great responsibility that the current DevOps have. He doesn’t want the understanding of what is currently happening across everything. He just wants to understand, first of all, what is happening in his small piece of heaven, basically – and if there are issues on his side, he wants to understand them so he can troubleshoot. In summary, he wants a micro-level view of the world, and maybe like a small glimpse of what is outside of his/her domain.
And I think that most of the SRE-oriented tools are really missing the point here. Because our developer, when he goes into Datadog, he has so many different options, capabilities, filters, dashboards and so on, that it makes it really hard for him or her to understand what is happening, because it’s not the mentality that he/she has.
I will say that, on the other hand, if they look at a tool such as, I don’t know, maybe like Sentry, or Bugsnag, or even Jaeger in the tracing world, which are Dev-first tools, they will feel like, “Oh, this is my application. Those are the environments. It has those exceptions that are currently happening.” And when you go to Datadog (or Prometheus, it doesn’t really matter), it’s like, “What is happening here? Why do I need so many dashboards? Why do I need so many graphs? Why is this graph going up? Why is this going down? Does it make sense?”
And I think the transformation of making those tools more suitable for developers didn’t really even begin – it is still so early on. Here at Komodor, I will say that we have the dev-first mentality. We built both the Dev and the SRE perspective into our solution because we understand that both of them need to troubleshoot the issue. I will say that, for most modern tools, we see how they usually pick one side because they think, “He’s going to be the one who’s going to use it,” while basically not giving an easy way in for new users. Sorry.
Baruch: Yep. Yeah, absolutely. And Ana here comments that all of the tools, New Relic and AppDynamics and what not, eventually can give you the info that you need. Obviously this is true. The real question is, how much effort and how much knowledge do you need in order to get the information that you need from those tools, considering that you are in the middle of an incident, and obviously, you are under stress, and you want to use the tools that give you as much meaningful information as needed? And you know what? You can do everything with everything, and you can look at the machine code and understand what went wrong, but the real question is, how much of an effort do you need to put there, under stress, in the middle of a fire, in order to get the right response?
Itiel: Yeah. I will say, Ana, I agree that New Relic and AppDynamics have all of the data. Everything the Dev needs exists in those tools. And I think they do have everything. But I think it’s hard to get started and to understand what is happening.
Baruch: Exactly.
Itiel: If you want a metric, it’s there. If you want a view, it’s there. You can customize it and do everything. But it’s a power user tool for most people, and not a simple dev-first tool, unlike maybe the other tools that I mentioned.
Baruch: Absolutely.
Itiel: But I agree that those tools are great.
Baruch: Yeah, they are great. The real question is, who is their target audience? And if the target audience that we speak about changes, how easy is it for the new target audience to access it? And Ana mentions here that both Ops and Devs need to learn them, and this is 100% correct. But the mindsets of those groups are very different. Their specialization is different. What they know and excel in is different. And it’s very hard to create a tool that successfully caters for both audiences with exactly the same level of experience. And this is fine. It’s not a problem. It’s just something that we need to be aware of.
The whole DevOps thing is not about making Dev and Ops exactly the same. It’s about building a collaboration between 2 inherently different groups that need to be different for the right reasons. So, when we look at the tools that they use, the tools are different. And yes, obviously, a developer can master any complexity of the Ops tool, and the other way around. Because, at the end of the day, both groups are software engineers, and most of the time talented and knowledgeable software engineers. But when everything is on fire, is it the right place and the right time to bend, to get out of the comfort zone and learn the tools of the other mindset?
Itiel: Yeah. I think we have a total agreement here about the different specialties, and the need or the difficulties to master them. Okay, so how can we bridge the gap between the Dev and the Ops team? So, now we are going to move a little bit to best practices. And, Baruch, do you want to start?
Baruch: Yeah, absolutely, of course. So, the best practices are, first of all, I kind of a little bit disagree with you here about troubleshooting independently. I think the right term would be troubleshooting together. This goes back to our kind of original disagreement, if you wish, about whose ownership it is because I still think that the infrastructure people kind of lead the charge, and then the developers play a very important role, but they work together with the infrastructure people. I really don’t see a scenario when the developer is on call and infrastructure people are not. That sounds to me just wrong, as wrong as the other way around. It’s all about doing it together. It’s all about not independently, but together, one with another. Yes, probably using different tools, as we just spoke, but it’s a common effort. It’s a joint effort that needs to be done together.
Itiel: I think I do see the world a little bit differently here.
Itiel: Mainly because I think that I see the stress or the amount of things the SREs are doing. And I think it can’t really scale going forward. I believe if the developers won’t be able to take a very, very big chunk out of the troubleshooting world independently, something will break. Perhaps more developers will be SREs. I’m not sure how it’s going… maybe they will write fewer bugs. But I’m not sure what the world is going to look like, mainly because the numbers don’t really add up.
I think what you said about working together – I think it’s like a pyramid, a test pyramid, or like any kind of pyramid. I believe most issues should be handled by the developer. I think the infrastructure should be quite reliable, it should work most of the time and the developers should answer and fix a lot of the common issues.
Baruch: I don’t think we are in disagreement there. Of course, we both agree how important it is for the developers to be a part. The question is, again, independently? My kind of question for this slide is, is independently really what is going to happen, or are we talking about developers participating on a call and being on par with the Ops people in solving everything that has to do with the code changes? But independently, I think that we’re going all the way to the other extreme. That was kind of my point.
Itiel: Yeah. Like how I see it, going forward, you have 5 things – for example 5 Dev teams that are maybe all on on-call duty. Each one represents his/her division and you have one SRE on the call. Most issues are not really escalated to him, instead the Devs are basically solving them. Only when an issue is big or super complex enough that you need an SRE involved, then yeah, sure, like he should handle it. Another option is if the database is down, this is one of the tricky parts. If the database is currently down, don’t wake up all of the 5 Devs on call. Like, what are they going to do, watch SREs solve the issue and do a DB migration rollover, something like that?
So, the interesting part is, first of all, the ability to triage the issue, to understand who is the relevant persona that should handle the situation here. And yeah, I do believe that going forward, once the infra is more reliable, most issues hopefully will be solved independently. And regarding the other issues, it should be a joint effort. Yeah – I see that the Facebook root cause here in the chat was an infra failure.
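The triage step described here, deciding which on-call persona should handle an issue, could be sketched roughly as follows. This is only an illustration under stated assumptions: the ownership map, the team and component names, and the rule that only critical Dev-owned issues escalate to the SRE are all hypothetical, not something the speakers specify.

```python
from dataclasses import dataclass

# Hypothetical ownership map: which on-call rotation owns which component.
OWNERS = {
    "checkout-service": "team-checkout",
    "search-service": "team-search",
    "postgres": "sre",
    "ingress": "sre",
}


@dataclass
class Alert:
    component: str
    severity: str  # "low", "high" or "critical"


def route(alert: Alert) -> list[str]:
    """Return the on-call rotations to page for an alert."""
    # Components nobody has claimed default to the SRE rotation.
    owner = OWNERS.get(alert.component, "sre")
    pages = [owner]
    # A Dev-owned issue is escalated to the SRE only when it is critical.
    if owner != "sre" and alert.severity == "critical":
        pages.append("sre")
    return pages


if __name__ == "__main__":
    # A database outage pages only the SRE, not the Dev teams.
    print(route(Alert("postgres", "critical")))
    # A routine service issue stays with the owning Dev team.
    print(route(Alert("checkout-service", "high")))
```

The design choice worth noting is that infrastructure components never page the Dev rotations, which matches the "don't wake up all of the 5 Devs when the database is down" example.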
Baruch: It was, it was.
Itiel: I think at the end of the day, a developer, most of the time, crashes only a small part of the system. And the cool thing about being an SRE is, when you do things, you do them with a bang. And basically, it’s not only a small widget in Facebook which is currently not working. Here, you broke half of the internet, and now no one can go and watch Netflix, Twitter, Facebook, or Instagram because everything’s crashing.
Itiel: So, yeah, yeah. It’s a different blast radius that the infra team has, and the developer team. Let’s now discuss the last question – what are the recommended tools to empower Dev teams to troubleshoot independently, or maybe not independently? But, Baruch, what do you think are the tools that are currently missing, or are best in helping bridge the gap between the developers and Ops teams?
Baruch: Yeah. So, what I would like to see, what I would love to see is a tool that brings developers into the context of what went wrong, because the symptoms of a problem are usually Ops related. We can say, “Well, this part of the system is down or misbehaving.” We can say like, “We see in the logs, or we see in our monitoring system, the transaction takes too long, or this times out, or this is not accessible.” And those are Ops or infrastructure signals that we get. How can we take those signals and see them in the code, where the developers live, in order to understand what went wrong? This is the challenge. Those are the tools that I would love to see emerging. When we are in this together in incident management and we see an Ops-level input, we know how it translates to Dev-level places in the code when things go wrong.
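One very simple way to connect an Ops-level signal to Dev-level context, in the spirit of what Baruch describes, is to look up which code changes were deployed to the affected service shortly before the alert fired. The sketch below is an assumption-laden illustration: the `Deploy` record, the one-hour window and the sample data are hypothetical, and it does not describe how any particular product works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    commit: str
    author: str
    deployed_at: datetime


def suspect_deploys(deploys, service, alert_time, window=timedelta(hours=1)):
    """Return deploys to `service` within `window` before the alert, newest first."""
    candidates = [
        d for d in deploys
        if d.service == service and alert_time - window <= d.deployed_at <= alert_time
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


if __name__ == "__main__":
    alert_time = datetime(2021, 10, 5, 12, 0)
    history = [
        Deploy("checkout", "a1b2c3", "dana", alert_time - timedelta(minutes=10)),
        Deploy("checkout", "d4e5f6", "lee", alert_time - timedelta(hours=3)),
        Deploy("search", "0a9b8c", "kim", alert_time - timedelta(minutes=5)),
    ]
    for d in suspect_deploys(history, "checkout", alert_time):
        print(d.commit, d.author, d.deployed_at.isoformat())
```

A recent deploy is only a candidate cause, not proof, so a result like this is a starting point for the developer rather than an answer.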
Itiel: Yeah, yeah, I agree. I think that I can say that here at Komodor, this is what we focus on too, the ability to empower both the Dev and SRE and to give each of them the right amount of context to troubleshoot the issue and to understand what is currently happening in a Kubernetes cluster. I think that logging tools, such as Kibana, are so customizable. They are the common ground for basically everyone. You see how the developers have their own dashboards and their own logs and so on – and you see the SRE teams looking at their own dashboard. If you have the NOC team, maybe they have their own dashboards and visualizations as well.
I really believe that this is a good tool because it’s so customizable, and it’s quite easy. If you go into Kibana, at least to get started, you have one very simple bar, search for everything you want, and you will get the answer you want. So, I think those tools are doing a great job in bridging the gap and allowing both the SRE and the Dev the same ability to view the world.
And other than them, I will say that tools such as APM are also right in the middle of the SRE and the Dev. This is because, in an APM tool, meaning application performance monitoring solutions like Jaeger, Zipkin, and some parts of Datadog, the core of the APM is the application, which is a developer’s responsibility. I think the rise of APM is basically the industry trying to speak in “both languages” for the two target audiences. I think we don’t see a huge adoption yet because it is new and it is hard, and distributed tracing is complex to understand.
Baruch: And not only that, I think most of the time it still doesn’t go all the way.
Baruch: It doesn’t really go into developers’ tools, and that kind of highlights where the problem is. Think about how advanced the tools for the infrastructure part of the house are. You can know up to a line of configuration, what went wrong. When we compare it to the Devs part of the house, we’re very, very far away from that. Yes, APM takes you closer. Yes, they can show you performance metrics on some part of the code that was committed, sometime, somewhere, by someone. But it’s very, very far away from giving the developers right there and then, right in the middle of the incident, the answer, “Hey, this code that you wrote broke everything. Go ahead and fix that,” and it will fix the problem.
Itiel: Yeah, I think this kind of understandability is the exact thing that we strive for here at Komodor. But it is hard, because everything is so intertwined, from the DevOps configuration, the chart that just changed, and the firewall rule, to the developer and the application that was set up with the new IAM role. I don’t know. We see these worlds are really intertwined and you need to get the context and understand it in order to know what is happening. You need a tool that understands those different corners of your application, both infra and the app level, and somehow makes sense out of it.
I think that, here at Komodor, we are doing exactly that but it is very tricky. There are a lot of dark corners in troubleshooting and in finding the root cause. There’s a reason it isn’t solved yet.
Itiel: Sorry. Ok, let’s move to Q&A. Or Baruch, do you want to add anything?
Baruch: No, I think we’re pretty much done and have 10 minutes for Q&A. The only thing that I want to mention is that you folks really need to look at what Komodor does. I think Komodor tries to answer exactly that. And the part that we spoke about, the tool, the missing tool for developers to be as productive during incident management as the Ops are now, I think you might find very well that what Komodor does is the answer for this big gap in our tool chain for incident response.
Itiel: Thank you. Well, thank you. It’s funny, you said that we try to answer the question here. I think that when we speak with developers and SREs, one of the things that becomes very clear to us is that developers don’t really know what questions to ask once they have an issue. It’s funny you said it, but basically, I mean, I have an issue. What should I ask now? And where do I ask it? Those are the two main things that, a lot of the time, separate the newbies from the experts. The experts know what questions to ask, and they know what tools will give them the information. We tried to crystallize it so even someone less experienced will have the questions and answers in place.
Okay. So, let’s move to the Q&A. I see we have a couple of questions from Victoria. So, the first question, “Is your solution helping provide better visibility and context to troubleshoot Kubernetes issues?” So, yeah, we talked a little bit about Komodor. I will say that Komodor is a great tool, but I’m really biased. The other tools that were mentioned here, such as Datadog – we don’t replace them. We are good friends with them. Also, I think if you don’t have Jaeger for example, or any other observability tool, you should go and set it up right away if you’re using Kubernetes. If you don’t, it will be really hard to understand what is happening. Yeah, Baruch, do you want to take the second question?
Baruch: Yeah. So, “We’re about to start migrating to Kubernetes. Do you have any tips to ensure a smooth process? What should we consider?” Yeah, well, we spoke about some of that today. Kubernetes is extremely complex. I think one of the most popular resources for learning Kubernetes is “Kubernetes, The Hard Way” by Kelsey Hightower. Yes, the cloud takes a lot of this complexity away. When you take Kubernetes as a service from whatever cloud provider you prefer, Amazon, Google, Microsoft, they make your life much easier, because they hide a lot of this complexity. But when you’re talking about taking it to production and being ready for those incidents, you actually have no choice but to learn it inside out.
And that, again, brings us to the question, “Who should learn it? Is it an Ops concern? It’s a deployment platform. I, as the developer, don’t care. Let the DevOps people, whoever they are, learn this crap. I’m just going to write code.” Or should we be involved in knowing how Kubernetes works, how it should be troubleshooted in the case of an incident? And obviously the right answer is yes, the developers should know how to troubleshoot Kubernetes in production, because it’s their code that runs there. And it’s definitely their concern as well.
So, I would say, make sure that you know what’s going on, and get ready with the tools that will help you, not when everything is great and you just deployed your HelloWorld to Kubernetes and you’re on top of the world, but when actually things go down, and you should know what to do in both the Ops and the Dev part of the DevOps house.
Itiel: Exactly. And the last question is, “You spoke about lack of access. Do you have any tips for best practices and who you should grant access to, and to what exactly?” So, I’m a believer that, at least from the view side, you should give it to anyone who is on-call. I don’t see a way where you want the developers to handle production issues, but they can’t access production. If you are a developer that is currently working in a company that expects you to solve real production issues without giving you the context you need, you should say, “It’s not going to happen. What are you doing? I’m handcuffed here, and you should give me the access.”
I believe in building trust and trusting your developers, and having the right security mechanism in place. Let’s not allow anyone to demolish production services, like a database. But regarding viewing what is currently happening in Kubernetes, I think you should give at least some kind of minimal access – or have some kind of system that allows the developer or the people in charge to view everything. It can be a combination of Prometheus and Kibana, or any other combination, it really doesn’t matter. The bottom line is, you can’t put a blindfold on those areas that are required to troubleshoot.
So, for anyone who needs to handle production issues, having at least view access is a must in order to take action. I think if you don’t trust the people, you should write a very large set of scripts in Jenkins, ArgoCD or wherever else you run your scripts, and allow them to trigger those scripts manually via audited environments. You should create an audited, safe way, maybe with source control, to allow them to take actions to solve the issue.
Baruch: Yeah. And I agree. Again, we’re coming back to helping the developers do the right thing without making it easy to destroy things – and tools help. When we speak about granting access, the right question is, granting access through what? And if we have the right tools that mitigate the risks of people doing the wrong things, while providing the options for people to do the right thing, I think this is the key for doing access control the right way.
Itiel: Yeah. I think we’re pretty much done and quite on schedule. And so, Baruch, I really enjoyed talking with you about this. Really big shout out to you and to JFrog. Anything you would like to add before we finish up?
Baruch: Yeah. So, Itiel, thank you for hosting me. I think that was great. And as I already mentioned, folks, you are more than invited to take a look at what Komodor does. I think it should definitely solve a lot of those questions that we discussed together, as long as they’re relevant to you. And they should. They should be relevant to almost all of us. And while tools come third in this people-process-tools equation for DevOps, they are still very, very important, because they not only support the culture, they also enable the right culture and move you toward the right culture of decision. So, definitely take a look at Komodor.
Itiel: Yeah. Okay. Thank you very much, everyone.
Baruch: Thank you.
Julio: Excellent, excellent. Okay, just a reminder that today’s webinar has been recorded. So, if you missed any or all of the webinar, you will be able to watch it again. We will be sending an email with a link to access the webinar on demand. It will also be available on devops.com. Just look in the On-demand section in the Webinars page, and it will be there.
And now to announce our 4 Amazon gift card winners. Our winners today are Antoinette S. Congratulations. Douglas F., Ana S., and finally, Steven O. So, congratulations to all of our winners. We’ll be reaching out to you via email with instructions for claiming your Amazon gift card. So, please check your inbox. And if you don’t see it there, just check your spam folder. Thanks again to Baruch and Itiel for an excellent webinar. And thanks again to the audience for joining us and for your engagement. This is Julio Godinez, signing off until next time. Be well. Thank you.
Itiel: Thank you.