Reliability-Driven Fleet Management with Komodor

Itiel Shwartz, CTO & co-founder, Komodor

Udi: Hi everyone, we’re going to start the workshop in a minute; I’m just going to let everyone join. In the meantime, some quick housekeeping. I’m Udi from Komodor, and with me is Nikki. Today we have the pleasure of hosting our CTO, Itiel Shwartz, who will tell us all about Kubernetes fleet management. If you have any questions during the session, feel free to drop them in the chat below. We’ll leave time for Q&A at the end, and the deck and recording of this session will be shared with everyone here, so don’t worry about that.

Udi: I think we are ready to get started. Itiel, welcome to the live workshop. So, tell us what we are going to talk about today. What is fleet management, and why is it even important?

Itiel: Thank you. Should I start sharing?

Udi: As you wish, you’re the captain today.

Itiel: Okay, let’s do it. Maybe some of you registered wanting to see Arthur here, who leads our SE team at Komodor, but sadly he is sick, so I’m here as a replacement. Don’t be alarmed if you were expecting Arthur. Just a bit about me: I’m the CTO and co-founder of Komodor, with a lot of background working with Kubernetes in highly complex, low-latency environments. Overall, I really like the domain and the space of the Kubernetes ecosystem.

What is a Kubernetes fleet, why should you care, and who should care? Kubernetes started with the compelling concept of letting developers take their container and deploy it easily. This created a new mentality of cattle versus pets, treating applications as cattle. At the same time, it created a new entity, the cluster, responsible for managing and running all those pods. When we talk about a fleet of clusters, we usually mean anything above 30 or 40 clusters. In the current ecosystem, managing dozens, hundreds, or thousands of clusters has become a significant problem. Today we’ll go over the challenges and possible mitigations, and do a quick demo of how Komodor can help with fleet management. The goal is to talk about the problem and how to resolve it, with or without Komodor.

Udi: So, we see that the trajectory is towards more and more clusters, right?

Itiel: Yes, the trajectory is indeed going up. As Kubernetes becomes more popular and more enterprises move their stack to Kubernetes, what was once rare, like having hundreds of clusters, is now common for large enterprises. We are starting to see cases with 500 or 600 clusters. The CNCF and other tools are trying to tackle this fleet management problem. Kubernetes wasn’t very popular five or six years ago and wasn’t production-ready for many people. But as big companies move into the domain, supporting infrastructure and scaling problems emerge. Kubernetes has reached a certain maturity in managing applications on a single cluster, leading to new problems as more enterprises adopt it.

When we talk about fleet management, we usually refer to platform engineers or platform teams. These are the people responsible for providing the supporting infrastructure to the rest of the organization. Organizations often reflect their HR map in their technology setup, with different business units having their own requirements, regulations, and developers. The centralized platform team is usually responsible for managing the fleet of clusters. This poses two main challenges: the technical aspect of managing numerous clusters and the human factor of ensuring the rest of the team is empowered and self-sufficient. The goal is to achieve a state where everything is reliable and easy, but the reality is more complex.

Managing a fleet from a technical aspect involves cluster life cycle, access management, cost, resource utilization, governance and standardization, reliability and resiliency, and cross-cluster visibility. Most of these problems also exist for a single cluster, but they scale up with the number of clusters. The complexity of managing clusters, ensuring cost efficiency, governance, and access management becomes huge with numerous clusters. Many people revert to old tools like Ansible, Chef, or Puppet to manage a fleet of clusters, which feels contrary to Kubernetes’ promise of eliminating the need for such tools.
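To make the "scales with the number of clusters" point concrete, here is a minimal, self-contained sketch of one fleet-wide check: flagging clusters whose Kubernetes minor version has drifted too far behind. The cluster names, versions, and skew policy are all hypothetical, and this is not how Komodor implements it:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    minor: int  # Kubernetes minor version, e.g. 27 for v1.27

def version_skew_report(clusters, latest_minor, max_skew=3):
    """Flag clusters trailing the newest supported minor release by more
    than `max_skew` versions (an arbitrary, illustrative policy)."""
    stale = []
    for c in clusters:
        skew = latest_minor - c.minor
        if skew > max_skew:
            stale.append((c.name, f"1.{c.minor}", skew))
    return stale

# Hypothetical fleet inventory; a real one would come from cloud APIs.
fleet = [
    Cluster("prod-us-east", 27),
    Cluster("prod-eu-west", 23),  # several minors behind
    Cluster("staging", 29),
]
```

The same pattern, one check mapped over every cluster in an inventory, generalizes to cost, access, and governance audits; the hard part at fleet scale is keeping the inventory itself accurate.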

Fleet management is also about dealing with clusters that are not the same. Different clusters are used by different personas with different expertise levels and expectations, leading to chaos. Platform teams struggle to find the right abstraction and balance of power to give developers. It’s a challenge to trickle this down through the organization while ensuring everyone is at least not unhappy.

Managing hundreds of clusters requires some control plane for the fleet. This involves using infrastructure as code (like Terraform, Crossplane, or Cluster API), GitOps for standardization, monitoring and observability tools, and internal tools or systems for fleet management. The goal is to make developers self-sufficient, similar to how AWS users don’t need constant support. Platform teams aim to empower developers to solve issues independently, achieving a shift-left approach.
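The GitOps standardization idea, one reviewed baseline plus small per-cluster overlays (similar in spirit to Kustomize overlays), can be sketched in a few lines. All configuration keys and cluster names here are invented for illustration:

```python
def render_cluster_config(base: dict, overrides: dict) -> dict:
    """Merge a fleet-wide baseline with per-cluster overrides; overrides
    win on conflicts. A shallow merge keeps the sketch simple."""
    return {**base, **overrides}

# Hypothetical baseline the platform team enforces for every cluster.
base = {"log_level": "info", "network_policy": "default-deny", "replicas": 2}

# Per-cluster deviations, reviewed in Git like any other change.
per_cluster = {
    "prod-us-east": {"replicas": 5},
    "dev-sandbox": {"network_policy": "allow-all", "log_level": "debug"},
}

rendered = {name: render_cluster_config(base, ov)
            for name, ov in per_cluster.items()}
```

Because every deviation from the baseline is an explicit, reviewable diff, configuration drift across hundreds of clusters stays visible instead of accumulating silently.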

It is important to remember that hiring good people for a platform team in a fleet management situation is hard and very costly. You can’t scale your team the same way you scale your cluster count, so you must empower your developers to achieve a truly efficient and healthy state of Kubernetes fleet management.

At Komodor, we also have quite a lot of fleet management capabilities. What you see here are different clusters running in different regions around the world. I think now it’s time for the demo, but before that, I want to open it up for questions. If you have any questions, please ask them now.

Itiel: I’m not sure if they can ask questions.

Udi: Yes, they can. They can write a question directly in the chat below.

Udi: Let’s give it a minute or two. People are shy.

Udi: I actually have a question while people are scratching their heads and thinking. You mentioned the human factor, and you’re also the host of the Kubernetes for Humans podcast. How important do you think is developer empowerment? Because one solution is to automate everything or just have a huge platform team that consolidates everything, and developers only need to write code. Are there other ways to do it? Do you think that none of it is complete without also onboarding more engineers into the Kubernetes world and offloading tasks to them?

Itiel: I think there are only two ways to reach a really scalable solution: either you spend a lot of money on a large DevOps and platform team, or you do this shift left, and it’s very hard to achieve both at once. What we see is that a lot of the time, customers reach a certain point of scale and then start going backwards because developers are blocked and unable to act alone without help from the platform team. You see teams that moved at a certain velocity slow down because developers were simply not empowered. Kubernetes adoption can turn out to be the opposite of the expectation.

Udi: Sadly, that’s what we see. Companies do eventually find ways to do it by building relevant tools and processes, but it’s costly and time-consuming. Do you see common patterns among larger customers for solving these issues or different approaches?

Itiel: The main thing is to set good expectations about what developers should know and do. Ensure they have the right infrastructure and tooling, and invest time in training and workshops. Over time, this starts to pay off. It requires tuning and cultural change to make it happen.

Udi: One final question before we move on to the demo: How do I know if I’ve reached the point of becoming a fleet? What’s the tipping point?

Itiel: It’s a question of toil and escalations. How much time are you spending on manual work or fixing your clusters compared to before? If this time is increasing linearly or exponentially, you’re in the wrong place and should look for tools to help or change your operations. It’s about how much time you’re spending and supporting different teams in your company.

Udi: Can you automate everything and eliminate the need to empower developers, maybe using AI?

Itiel: AI does play a significant role in helping developers improve their work and become more self-sufficient. At Komodor, we’re investing in using AI to streamline these processes. However, like autonomous cars, everyone wants it, but no one trusts it entirely yet. We’re not there yet with AI, and we won’t be for the next decade. We should use AI to empower developers, more like an Iron Man suit than a replacement.

Udi: If no one else has questions, we can move on to the demo.

Itiel: Let’s do the demo. What you’ll see here are Komodor’s cluster capabilities, which are in beta for some customers and will go GA next week. This feature shows all different clusters in one single pane of glass, across Azure, AWS, Google, or on-prem. Komodor focuses on two main aspects: critical issues and scores over time. Our goal is to make managing Kubernetes clusters easier, ensuring they become more reliable and cost-optimized. We offer users the ability to understand and fix issues before they escalate, combining static analysis with real-world results. For example, we can guide you through upgrading processes or managing Noisy Neighbor issues.

Violations in Komodor indicate problems that, if unattended, will escalate. For example, an end-of-life cluster in AWS could cost an extra $20k to $30k per year if not upgraded. Noisy Neighbor issues, where one service affects others on the same node, are another example. Komodor offers dozens of reliability checks, policies integration, and cost management tools to optimize your clusters. Lastly, our RBAC management allows centralized configuration and auditing for all clusters.
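As a toy illustration of the Noisy Neighbor idea, a first-pass detector could flag any pod consuming an outsized share of its node’s CPU. The pod names, numbers, and 50% threshold below are hypothetical; this sketches the concept, not Komodor’s actual heuristic:

```python
def noisy_neighbors(pod_cpu_millicores, node_capacity_millicores,
                    share_threshold=0.5):
    """Flag pods using more than `share_threshold` of the node's CPU,
    a crude stand-in for real noisy-neighbor detection."""
    return [pod for pod, used in pod_cpu_millicores.items()
            if used / node_capacity_millicores > share_threshold]

# Hypothetical pods co-located on a 4-core (4000m) node.
node_pods = {"checkout": 2600, "payments": 400, "metrics-agent": 150}

print(noisy_neighbors(node_pods, node_capacity_millicores=4000))  # ['checkout']
```

A production check would look at sustained usage versus requests and limits rather than a single snapshot, but the shape is the same: per-node metrics in, a short list of suspects out.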

Customers using Komodor typically see a 60% to 80% reduction in toil and escalations, leading to faster development and higher organizational velocity. Any questions?

Nikki (reading Bruno’s question): Do you also take add-on versions into account for Kubernetes minor upgrades, like kube-proxy, VPC CNI, CoreDNS, etc.?

Itiel: Great question, Bruno. We’re adding capabilities to understand the compatibility between add-ons and upgrades. If you’re an existing customer, talk to us to join the beta testing for these features.
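For flavor, a compatibility check like the one Bruno is asking about boils down to a lookup table keyed by the target Kubernetes minor version. The matrix entries below are made up for illustration; real minimum versions come from each add-on’s release notes, and real code would parse semver instead of comparing strings:

```python
# Hypothetical matrix: add-on -> {k8s minor: minimum compatible version}.
COMPAT = {
    "coredns":    {28: "1.10.1", 29: "1.11.1"},
    "kube-proxy": {28: "1.28.0", 29: "1.29.0"},
}

def incompatible_addons(k8s_minor, installed):
    """List installed add-ons below the minimum version for the target
    minor. Naive string comparison; adequate for this illustration only."""
    bad = []
    for addon, version in installed.items():
        required = COMPAT.get(addon, {}).get(k8s_minor)
        if required is not None and version < required:
            bad.append((addon, version, required))
    return bad
```

Run before an upgrade, a check like this turns "the CNI broke after the minor bump" from a production surprise into a pre-flight warning.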

Udi: We remember a case where a customer’s CNI breaking change went unnoticed until Komodor helped identify the issue. The new features will cover more of the Kubernetes ecosystem, including Helm chart analysis.

Itiel: Absolutely. Managing a fleet of clusters and supporting platform engineers is crucial. If you’re facing these issues, reach out for a free trial or contact me directly. Bruno, I hope the workshop was insightful and helpful.

Itiel: Thank you, Bruno. We appreciate the questions and engagement. Feel free to reach out for more information, even directly to me.

Udi: Thanks, everyone. We’ll share the recording and deck with all registrants. Stay tuned for our next event at the end of July, and wish Arthur a speedy recovery. Thanks, Itiel, for stepping in on short notice. See you next time!

Itiel: Thank you, everyone. Have a good day!

Please note that the text may have slight differences or mistranscription from the audio recording.