Crossing K8s Monitoring and Observability Gaps With Change Intelligence

Itiel Shwartz, CTO & co-founder

7 min read May 27th, 2024

Recently we had the privilege of being named a Gartner Cool Vendor in the Monitoring and Observability category. The funny thing is, while this is definitely the closest Gartner category for our solution, we aren’t really used to thinking about Komodor as a monitoring and observability tool. If anything, Komodor was born out of the realization that monitoring & observability solutions have an increasingly critical gap: they do not correlate changes in the system with the impact those changes have on it.

Focusing on the changes makes the difference between seeing the current situation and understanding the chain of events that preceded it – shifting the center of attention from the outcome to the cause.

We like to think of this as ‘Change Intelligence’. Compared to existing monitoring & observability solutions, change intelligence is all about providing broader context for the existing telemetry and the copious amounts of information that systems produce today.

Now that I’ve provided a bit of context regarding what change intelligence is, let’s dive into the challenges I believe it will solve across these three domains:

Change Tracking
Situational Awareness
Alerting

This is part of a series of articles about Kubernetes Troubleshooting.

1: Change Tracking

The Gap: Collecting & Aggregating Relevant Changes

Currently, no monitoring or observability tool provides a simple, system-wide view of historical and present changes. It would require a lot of manual work and extrapolation to extract specific data points about your overall system and its components. Without a record of the changes, you’d most likely need to hunt down this data through your CI/CD tools and aggregate it manually into spreadsheets. It goes without saying that this becomes Sisyphean over time, and provides little correlation between the information being collected and its historical impact.

When talking to early adopters of Komodor, one of the first tangible benefits was the simplicity of aggregating all of the historical system information and data on all of the different services and components into a single, unified place. This enables our users to immediately understand what has been changed, and when the change occurred across the board, until today.
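To make the idea of a single, unified place a bit more concrete, here is a minimal sketch that merges change records from two sources into one chronological timeline. The ChangeEvent type, the source names and the sample records are all hypothetical, chosen only for illustration, and do not represent Komodor's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change. All fields are illustrative assumptions."""
    timestamp: datetime
    source: str    # where the change was observed, e.g. "kubernetes" or "github"
    service: str   # the service the change applies to
    summary: str   # short human-readable description


def build_timeline(*sources: Iterable[ChangeEvent]) -> List[ChangeEvent]:
    """Merge events from any number of sources into one list, oldest first."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Hypothetical sample data standing in for real change feeds.
k8s_events = [
    ChangeEvent(datetime(2024, 5, 1, 10, 30), "kubernetes", "checkout", "image tag changed"),
    ChangeEvent(datetime(2024, 5, 1, 9, 0), "kubernetes", "cart", "replicas scaled up"),
]
git_events = [
    ChangeEvent(datetime(2024, 5, 1, 10, 15), "github", "checkout", "pull request merged"),
]

timeline = build_timeline(k8s_events, git_events)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.service, event.summary)
```

Even in this toy form, the merged list answers "what changed, and when" without opening each tool separately, which is the part that is painful to maintain by hand in spreadsheets.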

The Gap: Correlating the Data to Overcome the Knowledge Gap

Even if you manage to track down your system info from your existing monitoring and CI/CD tools into an offline format, the data would still be static and eventually grow stale if it is not continuously maintained. What’s more, there certainly is no correlation or analysis on the relevance of the specific components and how they impact one another – this would need to be manually deduced as well.

With Komodor, on the other hand, the system data is collected and aggregated into a single place, and is then parsed and analyzed – even across disparate tools like Kubernetes and GitHub – so you can understand how a certain change impacted the rest of the system as a whole.
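One simple form of this correlation is asking "what changed on this service shortly before it broke?". The sketch below assumes a hypothetical Change record and a fixed look-back window; it is a simplified illustration of the idea, not a description of how Komodor performs its analysis.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    """Hypothetical change record used only for this example."""
    timestamp: datetime
    service: str
    summary: str


def changes_before(
    changes: List[Change],
    service: str,
    incident_time: datetime,
    window: timedelta = timedelta(hours=1),
) -> List[Change]:
    """Return changes to `service` made within `window` before the incident, newest first."""
    start = incident_time - window
    hits = [
        change
        for change in changes
        if change.service == service and start <= change.timestamp <= incident_time
    ]
    return sorted(hits, key=lambda change: change.timestamp, reverse=True)


# Hypothetical sample data.
changes = [
    Change(datetime(2024, 5, 1, 8, 0), "checkout", "config map updated"),
    Change(datetime(2024, 5, 1, 10, 15), "checkout", "pull request merged"),
    Change(datetime(2024, 5, 1, 10, 30), "checkout", "new image deployed"),
    Change(datetime(2024, 5, 1, 10, 45), "cart", "replicas scaled up"),
]

suspects = changes_before(changes, "checkout", datetime(2024, 5, 1, 11, 0))
for change in suspects:
    print(change.timestamp.isoformat(), change.summary)
```

Filtering by the same service and a time window is a deliberately naive heuristic. A real system would also need to account for dependencies between services, which is exactly the part that is hard to deduce manually.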

Together, these two capabilities alleviate an acute pain in operations engineering: a lack of knowledge about, and transparency into, what happened in the system and what is continuing to happen as a result. Even if a single power engineer on the team has a magical script that aggregates this information, it’s still (for the most part) siloed, and neither transparent nor scalable for the entire engineering organization.

2: Situational Awareness

The Gap: Macro-Level System Information

Macro-level system visibility is hard to attain for multiple reasons. First, it often requires multiple steps to gain the cluster-level information you are looking for. At other times, you may not even know what you are looking for to begin with, and this becomes increasingly difficult when attempting to track this down in real time.

There is currently no tool purpose-built with this in mind. So your options would be:

kubectl/K9s: You could run kubectl get or follow commands, or possibly use a tool like K9s. The downsides? You’ll only get a deeper view into a specific cluster’s health, and only for the clusters you are authorized to access. You would not be able to get a high-level view of all of your clusters and their health as a whole.
Monitoring Solutions: The alternative option would be to gain this kind of insight via your monitoring tool. But let’s be frank here – filtering the relevant data often becomes very complicated in a complex operation with many clusters and namespaces, especially with tools that are not designed with this use case in mind.

Either way, neither option is ideal for rapid, real-time remediation.
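To illustrate what a macro-level view means here, the sketch below rolls per-service health from several clusters up into a one-line status per cluster. The cluster names, service names and the input shape are hypothetical; in practice this data would have to be gathered from each cluster separately, which is the step the options above make cumbersome.

```python
from typing import Dict


def summarize_clusters(services_by_cluster: Dict[str, Dict[str, bool]]) -> Dict[str, str]:
    """Map each cluster name to a short status line.

    The input maps cluster -> service -> is_healthy (an assumed shape for this example).
    """
    summary = {}
    for cluster, services in services_by_cluster.items():
        unhealthy = [name for name, healthy in services.items() if not healthy]
        if unhealthy:
            summary[cluster] = f"{len(unhealthy)} of {len(services)} services unhealthy"
        else:
            summary[cluster] = "healthy"
    return summary


# Hypothetical sample data for two clusters.
fleet = {
    "prod-eu": {"checkout": True, "cart": True},
    "prod-us": {"checkout": False, "cart": True, "search": False},
}

summary = summarize_clusters(fleet)
for cluster, status in summary.items():
    print(cluster, "->", status)
```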

The Gap: Correlating Change to Cluster Health

Today’s monitoring and observability tools are mostly focused on the application level. Infrastructure-level information, on the other hand, is extremely hard to extract, be it cluster-wide or system-wide. What you’re most likely to come away with is the current health of a specific cluster, node or pod, and this makes correlating information across clusters nearly impossible.

However, with change intelligence solutions like Komodor, you can gain an immediate high-level understanding of your services across multiple clusters. So, if you come across an unhealthy service, the next logical step would be for you to understand (i) what caused the previously healthy service or cluster to become unhealthy, as well as (ii) how it is impacting your systems right now. This includes knowing what other alerts are currently being triggered, what new services were deployed, and receiving situational awareness of what is happening in your specific cluster.

This real-time information is provided in Komodor through its timeline view. By viewing actions over time, you get cluster-wide information, from historical through real-time, concentrated into a single view. This makes it possible to draw conclusions much more rapidly about what has impacted the health of your services, and why.

3: Alerting

The Gap: The Alerting Learning Curve

The alerting complexity starts with the most basic of questions: What do I even need to alert upon? What are the interesting things to keep track of? What are the relevant thresholds or best practices for doing so, without having to reinvent the wheel? These are questions and best practices that you have to learn and research on your own when it comes to configuring your monitoring and observability tools.

This difficulty is compounded by the dependency on third-party tools and services, such as Helm charts and databases, because you’re already in a place where you’re not quite certain how to monitor your own system, let alone an external one.

But let’s say you’ve managed to identify what you need to monitor and alert upon – setting up the configuration in your monitoring and observability systems is HARD. Even when you know what you want to monitor, it’s often difficult to actually achieve the results you’re looking for. You need to learn a lot about a specific monitoring dialect, which is trickier than it may seem. Even at Komodor, 95% of the alerts in our systems are configured by one expert on the team.

Since learning the correct syntax on your own is often quite difficult, you may just resort to copying an alert from another service or even from Stack Overflow. As an example, I remember trying to edit an alert from max to min in Prometheus, and the whole system just crashed. And it didn’t end there – not understanding the dialect also made it far more difficult to troubleshoot.
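Part of what makes such a small edit risky is that swapping one aggregation for another changes what the alert means. The sketch below is plain Python rather than any monitoring dialect, and the pod names, latency values and threshold are made up. It only illustrates how the same data and the same threshold can give opposite answers depending on whether you aggregate with max or min.

```python
from typing import Callable, Dict, Iterable


def should_fire(
    values_by_pod: Dict[str, float],
    aggregate: Callable[[Iterable[float]], float],
    threshold: float,
) -> bool:
    """Fire when the aggregated value across all pods exceeds the threshold."""
    return aggregate(values_by_pod.values()) > threshold


# Hypothetical latency readings in milliseconds, one per pod.
latency_ms = {"pod-a": 120.0, "pod-b": 480.0, "pod-c": 150.0}
threshold_ms = 300.0

# max: fires if ANY pod is above the threshold.
fires_with_max = should_fire(latency_ms, max, threshold_ms)
# min: fires only if EVERY pod is above the threshold.
fires_with_min = should_fire(latency_ms, min, threshold_ms)

print("max:", fires_with_max)
print("min:", fires_with_min)
```

With these sample values, the max version catches the single slow pod while the min version stays silent, so a one-word change quietly turns "alert me if anything is slow" into "alert me only if everything is slow".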

On the other hand, it’s worth noting that APM solutions and tools like Jaeger simplify this, as it is their specific scope, and Komodor learned from this and embedded similar simplicity into its platform. However, as mentioned, regular monitoring tools are built for power users, and a regular user will often find themselves lost in this domain. This knowledge of syntax and dialect is not unified across platforms either, so even if you learn one system, it will not carry over to another monitoring system.

This is why Komodor wanted configuring alerts to be as simple as clicking a button, without requiring low-level familiarity with all the moving parts and pieces just to get the basic alerts you critically need.

The Gap: Kubernetes-Specific Design

Last (but definitely not least) is the most significant challenge that is the foundation of all the other issues with alerting in today’s tools: these solutions were simply not designed for Kubernetes. They are general purpose tools that ALSO work with K8s, but what you’ll get are basic, generic building blocks that are not customized for K8s and are based on generic metric databases that are not aligned with Kubernetes.

These generic metrics and building blocks also contain Kubernetes-specific blind spots, and will fail to alert you in certain scenarios because they lack the K8s-relevant building blocks. Case in point: most monitoring tools are focused on app-centric alerts – in Kubernetes these would be Pods, DaemonSets and such. Non-app-centric elements, such as PVCs, load balancers and ingresses, among others, are generally missed.

Therefore, configuring an alert in a typical monitoring tool that is infrastructure-centric rather than app-centric is extremely difficult, and I would venture – nearly impossible. Gaining visibility into your Kubernetes cluster communicating with your database in AWS, and receiving alerts about it, just isn’t possible with generic building blocks of this nature.

What will more likely occur is that you begin to receive alerts on the application layer that something is wrong, but you won’t know the underlying issue, because you don’t have any kind of alert mechanism for these infrastructure components.
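As a small example of an infrastructure-centric check, the sketch below flags PersistentVolumeClaims that are not in the Bound phase. Pending, Bound and Lost are the phases Kubernetes reports for a PVC, but the flattened dictionary shape and the sample records here are simplifications I am assuming for illustration, not the real API object or Komodor's implementation.

```python
from typing import Dict, List


def unbound_pvcs(pvcs: List[Dict[str, str]]) -> List[str]:
    """Return "namespace/name" for every PVC whose phase is not Bound."""
    return [
        f"{pvc['namespace']}/{pvc['name']}"
        for pvc in pvcs
        if pvc["phase"] != "Bound"
    ]


# Hypothetical, flattened PVC records.
pvcs = [
    {"namespace": "shop", "name": "orders-db-data", "phase": "Bound"},
    {"namespace": "shop", "name": "search-index", "phase": "Pending"},
    {"namespace": "logs", "name": "archive", "phase": "Lost"},
]

for pvc in unbound_pvcs(pvcs):
    print("PVC needs attention:", pvc)
```

A check like this points at the storage problem directly, instead of leaving you to infer it from application errors that appear only after a pod fails to start.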

By creating simple, human configurable, Kubernetes-optimized alerting out of the box, Komodor is looking to prevent single points of failure and minimize the knowledge gap, reduce friction and complexity, as well as provide holistic alerting from the infrastructure through the application layer, particularly for Kubernetes systems.

Change Intelligence for Rapid Detection, Investigation and Resolution

Change intelligence isn’t Kubernetes-specific by any means. However, K8s made changes and deployments to production environments so much easier that these challenges became acute in K8s environments.

It’s no secret that with the power of K8s came a lot of operational challenges, many of which are knowledge gaps that impact all levels of operations, from detection to investigation.

These gaps are partly in knowledge and partly technical gaps in existing tools that haven’t completely risen to the Kubernetes challenge yet. While more knowledge and best practices for monitoring K8s are being made available all the time, these gaps and missing alerts are still acutely felt by Kubernetes users today. Komodor chose to focus on and excel in this domain because Kubernetes is gaining momentum and is well positioned to become the de facto operating system for modern microservices operations.
