Hidden Signals in K8s Clusters: A Data-Driven Approach to Reliability

Presented by Andrei Pokhilko, Komodor’s Open Source Dev Lead

Please note that the text may contain slight differences or mistranscriptions from the audio recording.

Nikki: Welcome, everyone! We’re excited to have you join us for this deep dive into Kubernetes reliability insights with Andrei from Komodor. Andrei brings a wealth of knowledge and practical experience to the table. Today, he’ll walk us through some of the coolest insights they’ve developed at Komodor to enhance Kubernetes cluster reliability. Andrei, take it away!

Andrei: Thank you, Nikki. Hello, everyone. It’s great to be here. Today, I’ll be sharing with you some fascinating insights that we’ve developed at Komodor, focusing on how we leverage Kubernetes data to improve reliability.

First off, when we examined very active clusters that heavily used spot instances, we noticed a consistent correlation between reliability issues, such as backoffs and pods not spinning up in time, and the termination of those spot instances. We realized that by analyzing pod termination statistics, we could identify which workloads were most affected by those terminations. Although this particular insight didn’t make it into the final product, it isn’t completely gone: it’s currently disabled while we look for ways to make the calculations less expensive. We know that some of our customers would love this feature, because it shows directly how spot instance usage affects their reliability.
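To make the idea concrete, here is a minimal sketch of that kind of analysis using the official Kubernetes Python client. The spot-capacity node label and the owner-based grouping are assumptions for illustration; this is not Komodor’s actual pipeline.

```python
# Sketch: correlate pod terminations with spot capacity, per workload.
# Assumes the EKS-style label "eks.amazonaws.com/capacityType=SPOT";
# other clouds and provisioners use different labels.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Nodes running on spot capacity.
spot_nodes = {
    n.metadata.name
    for n in v1.list_node().items
    if (n.metadata.labels or {}).get("eks.amazonaws.com/capacityType") == "SPOT"
}

# Count container terminations for pods scheduled onto spot nodes,
# grouped by the pod's owner (ReplicaSet, Job, etc.).
terminations = Counter()
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name not in spot_nodes:
        continue
    owner = (pod.metadata.owner_references[0].name
             if pod.metadata.owner_references else pod.metadata.name)
    for cs in pod.status.container_statuses or []:
        if cs.last_state and cs.last_state.terminated:
            terminations[(pod.metadata.namespace, owner)] += 1

# Workloads most exposed to spot interruptions come out on top.
for (ns, owner), count in terminations.most_common(10):
    print(f"{ns}/{owner}: {count} terminations on spot nodes")
```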

Another significant insight we developed relates to group failures: the analysis of historical patterns of failures that occur together. For instance, if workloads repeatedly fail in quick succession over prolonged periods, we can infer that some services act as de facto dependencies of one another. Using the DBSCAN algorithm over a timeline of failure events, we could detect that certain services were interconnected in ways we hadn’t anticipated. This insight was particularly exciting for me to work on, as it involved a double-pass method: first analyzing current patterns, then querying historical patterns and aggregating the data. However, despite how fascinating these findings were, we didn’t include this insight in the final product. The primary reason was the difficulty of making actionable suggestions based on this information. At Komodor, we strive not only to identify problems but also to provide clear recommendations for mitigating them. This insight, while cool, didn’t quite meet that standard.
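As a rough illustration of the timeline-clustering step, the sketch below groups failure events that occur close together in time with scikit-learn’s DBSCAN and counts which services repeatedly land in the same group. The event format, the eps threshold, and the sample data are assumptions, not the production implementation.

```python
# Cluster failure events by time with DBSCAN, then count co-occurring services.
import numpy as np
from collections import Counter
from itertools import combinations
from sklearn.cluster import DBSCAN

# (unix_timestamp, service_name) failure events, e.g. from stored pod issues.
events = [
    (1000, "checkout"), (1010, "payments"), (1025, "checkout"),
    (5000, "search"),   (5030, "indexer"),
    (9000, "checkout"), (9015, "payments"),
]

timestamps = np.array([[t] for t, _ in events], dtype=float)
# eps: max gap in seconds for failures to count as "together".
labels = DBSCAN(eps=60, min_samples=2).fit_predict(timestamps)

# Count how often each pair of services fails within the same time cluster.
pair_counts = Counter()
for label in set(labels) - {-1}:               # -1 marks noise/outliers
    services = {svc for (_, svc), l in zip(events, labels) if l == label}
    for pair in combinations(sorted(services), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.most_common():
    print(f"{a} and {b} failed together in {count} time windows")
```

Pairs that keep showing up across many windows are candidates for the de facto dependencies Andrei describes.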

Moving on to how we developed these insights from concept to product: we started with experimental setups. We had data sources, Python scripts implementing the algorithms, and plain CLI outputs. The key here was the ability to quickly apply and validate ideas without being held back by production constraints. That freedom to experiment was crucial to the innovation process.

In the second phase, we moved into an internal review. We had to complicate the setup a bit—introducing schedulers to run insights on a daily basis, storing results in a PostgreSQL database, and presenting these through a low-code AppSmith UI. This phase was particularly enjoyable because it involved optimizing SQL queries and adding database indexes, which drastically improved performance. We also began dogfooding these insights in our own production clusters, asking engineers to find relevant insights and assess their value.
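As an illustration of that phase, here is a minimal sketch of storing daily insight results in PostgreSQL and adding the kind of index that speeds up dashboard queries. The table and column names are hypothetical, not Komodor’s schema.

```python
# Minimal storage layer for daily insight results, queried by a low-code UI.
import psycopg2

conn = psycopg2.connect("dbname=insights user=insights")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS insight_results (
            id          BIGSERIAL PRIMARY KEY,
            insight     TEXT        NOT NULL,
            cluster     TEXT        NOT NULL,
            workload    TEXT        NOT NULL,
            severity    TEXT        NOT NULL,
            detected_at TIMESTAMPTZ NOT NULL DEFAULT now(),
            details     JSONB       NOT NULL
        )
    """)
    # Most UI views filter by cluster and a recent time range, so an index
    # on (cluster, detected_at) is the kind of change that makes slow
    # dashboard queries fast.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_results_cluster_time
        ON insight_results (cluster, detected_at DESC)
    """)
conn.close()
```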

Finally, when we transitioned to the production environment, we developed a multi-threaded runner for these insights. We considered third-party solutions like Argo Workflows and Apache Airflow, but opted for a custom solution to meet our specific needs. It was simple yet functional, supporting retries and ad-hoc analysis, which was essential for our operations.
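A custom runner along those lines can stay quite small. The sketch below uses Python’s standard thread pool with a simple retry wrapper; the insight functions, backoff policy, and worker count are illustrative assumptions rather than Komodor’s implementation.

```python
# A small multi-threaded insight runner with retries.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_retries(insight_fn, cluster, attempts=3, backoff=5):
    """Run one insight against one cluster, retrying on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return insight_fn(cluster)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)   # simple linear backoff

def run_all(insights, clusters, max_workers=8):
    """Fan out (insight, cluster) pairs across a thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_with_retries, fn, cluster): (name, cluster)
            for name, fn in insights.items()
            for cluster in clusters
        }
        for future in as_completed(futures):
            name, cluster = futures[future]
            try:
                results[(name, cluster)] = future.result()
            except Exception as exc:
                results[(name, cluster)] = f"failed: {exc}"
    return results

# Ad-hoc analysis is just a direct call to run_with_retries for a single
# insight and cluster, outside the daily schedule.
```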

To summarize, developing these reliability insights for Kubernetes was resource-intensive but incredibly rewarding. The basic data streams Kubernetes exposes hold a lot of untapped analytics potential, and with the right tools and approaches, you can unlock valuable insights that significantly improve reliability. In our own production environment at Komodor, these insights have already helped clean up numerous issues, leading to more reliable clusters overall.

Now, let’s take a quick look at the Komodor platform. Here, we divide Kubernetes operations into three main areas: reliability, cost, and policies. Today, we’re focusing on reliability. For example, you can see a cluster’s reliability score, which is calculated using a specific formula. The platform also alerts you to end-of-life clusters, deprecated APIs, and noisy neighbors—services that disrupt others and need stricter limits.
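To give a flavor of one of those checks, here is a small sketch that flags manifests using API versions removed in newer Kubernetes releases. The removal table covers a few well-known cases only and is not the platform’s actual rule set.

```python
# Flag manifests whose apiVersion/kind was removed in a later Kubernetes release.
import yaml

REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "1.26",
}

def deprecated_api_warnings(manifest_path):
    """Return warnings for resources using removed API versions."""
    warnings = []
    with open(manifest_path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc:
                continue
            key = (doc.get("apiVersion"), doc.get("kind"))
            if key in REMOVED_APIS:
                warnings.append(
                    f"{doc['kind']} '{doc['metadata']['name']}' uses "
                    f"{doc['apiVersion']}, removed in Kubernetes {REMOVED_APIS[key]}"
                )
    return warnings
```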

In addition to reliability, the Komodor platform offers comprehensive cost optimization features, including right-sizing recommendations. These help you optimize resource usage by analyzing metrics over time and suggesting adjustments to CPU and memory allocations.
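Conceptually, a right-sizing recommendation boils down to deriving requests from historical usage. The sketch below uses usage percentiles with some headroom; the specific percentiles and multiplier are assumptions, not Komodor’s model.

```python
# Derive CPU/memory request suggestions from historical usage samples.
import numpy as np

def recommend_requests(cpu_millicores, memory_mib):
    """Suggest requests from usage samples collected over time."""
    cpu_samples = np.asarray(cpu_millicores, dtype=float)
    mem_samples = np.asarray(memory_mib, dtype=float)
    return {
        # CPU is compressible: size requests near typical load (p90).
        "cpu_request_m": int(np.percentile(cpu_samples, 90)),
        # Memory is not: size near peak (p99) plus headroom to avoid OOMKills.
        "memory_request_mi": int(np.percentile(mem_samples, 99) * 1.15),
    }

# Example: a service with low CPU but high memory usage gets its
# allocations rebalanced accordingly.
print(recommend_requests(cpu_millicores=[40, 55, 60, 80],
                         memory_mib=[900, 950, 1000, 1100]))
```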

Finally, I’d like to touch on the future trends in Kubernetes. As organizations scale, we’re seeing a shift towards fleet management, where managing hundreds or thousands of clusters will become the norm. This is why Komodor is inherently multi-cluster, designed to address these emerging needs. Some of our customers are already planning for thousands of clusters, and we’re developing solutions to meet those demands.

With that, I’ll open the floor to any questions you may have. Feel free to ask anything related to Kubernetes, reliability insights, or Komodor’s platform.

Nikki: Thanks, Andrei. That was incredibly insightful. We’ve got a few questions lined up already. The first one is, what are the best practices for leveraging the vast amounts of data generated by Kubernetes clusters to gain actionable insights?

Andrei: Great question. The key is to have the right tools that can process and analyze the data effectively. Kubernetes itself is quite basic; it primarily speaks in YAML. To derive actionable insights, you need tools that can interpret these low-level details into something more meaningful and actionable. At Komodor, we’ve built sophisticated algorithms and analysis techniques to do just that, transforming raw data into valuable insights.

Nikki: Another question we have is about the dashboard you showed. Is it open source, and can we try it out ourselves?

Andrei: The Helm Dashboard by Komodor is open source, and you can find it easily by searching for it online. However, the specific reliability insights I’ve shown today are part of the main Komodor platform, which is not open source. You can try it out by signing up for a trial of the Komodor platform, which should give you access to these features.

Nikki: And what about future trends? How should organizations prepare for advancements in Kubernetes cluster management and reliability?

Andrei: The trend is definitely moving towards AI and automation, but the biggest shift will be in scale. Fleet management—managing thousands of clusters—will become a standard practice. Organizations should prepare by adopting multi-cluster management tools and strategies now, as this will become increasingly important in the near future.

Nikki: Another interesting question is about container resource tuning. Does the Komodor platform provide recommendations for complex cases, like when there’s a big difference in resource requests and limits?

Andrei: Yes, we have specific insights for under-provisioned workloads and other scenarios where resource allocations might be risky. Komodor provides detailed recommendations for right-sizing your containers based on historical data. For example, if a service uses a lot of memory but minimal CPU, we’ll suggest adjustments to optimize that balance. We also offer a right-sizing advisor that helps you manage costs by ensuring that your services are neither over- nor under-provisioned.

Nikki: And finally, someone asked about AI-driven insights for SRE activities. Is Komodor working on something like that?

Andrei: Yes, we are! While I can’t share too many details just yet, I can say that we’re developing AI-driven solutions to help democratize Kubernetes and make it more accessible to everyone, regardless of their expertise level. We’re applying language models to semi-deterministic areas like reliability, and the results are promising. Stay tuned for an announcement soon.

Nikki: That sounds exciting! Thank you, Andrei, for all this valuable information. And thank you to everyone who joined us today. You’ll be receiving a copy of this webinar and Andrei’s slide deck. We’ll be taking a break next month but will return in September with more exciting content. Have a great night, and we’ll see you soon!