
Do We Still Need to “Observe”? The Future of AI & O11y

AI has had a massive impact on every part of our lives, but above all on how easily we consume large data sets. The observability world is built on collecting enormous amounts of data and consuming it through dashboards built on monitoring tools. Most o11y tasks, like writing complex queries, creating dashboards & defining alerts, have been done in much the same way for the last decade, & AI models are well-positioned to disrupt this modus operandi.

In this talk, we’ll discuss how the latest & greatest AI capabilities are going to change the observability world as we know it. We’ll dive into what not writing PromQL anymore looks like (woot!), how automated correlation and causality detection will lead to better and faster root cause analysis, and the joys of not creating dashboards and not defining alerts, all with auto-integrated collection and world-class queries in seconds. Sound too good to be true?! Well, the future is here, and it’s observably glorious.

The following is an AI-generated transcript of the speaking session:

Hey everyone, how are you today? Okay, so we’re here to talk about something new. I don’t know if you’ve ever heard about it, but we’re going to discuss AI. To make this more engaging, we’ve invited our friends, Floppy and friends, to join us on this adventure. So, let’s get started. Nice to meet you all, I’m Niv Yungelson. I’m an AWS Community Hero and AWS User Group Leader. Up until three and a half months ago, I was the Head of DevOps at Melio, and nowadays, I’m a Cloud Consultant. I have the nerdiest tattoo – a sudo tattoo on my finger. Hello everyone, yeah, clap your hands, that’s perfect.

Hello, everyone. I am Guy, a Solution Architect at Komodor. I’m leading the Platformers Community for platform engineering and also a CNCF Ambassador. I’m really excited to be here with you and share our thoughts about AI and observability. So, what are we going to do today? We are going to do a quick intro about AI, making sure that everyone in the room knows what AI is and how we can use it. I guess that you have some experience with AI up to today. We’re going to discuss how it matches with observability and how AI is actually going to change the way we observe and run observability, with real-life examples, by the way.

Cool, so let’s start with a quick introduction about AI. I guess that some of you, or maybe all of you, have tried ChatGPT or one of the other chatbots. Essentially, what we are using are large language models, or LLMs. They are models, programs that receive text input. You prompt them to do something and, hopefully, they can do it; based on the data they were pre-trained on, they provide you with a text output. But they do some complex things in between. They try to predict the best result for you. If you ask something, they evaluate your question based on the data they have been pre-trained on and then give you what should be a good result.
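To make the prompt-in, text-out loop concrete, here is a minimal sketch in Python. It assumes the OpenAI Python SDK (v1) with an API key in the environment; the model name is just an example, and any other provider’s chat API would work the same way.

```python
# Minimal prompt-in, text-out loop. Assumes the OpenAI Python SDK v1 is
# installed and OPENAI_API_KEY is set; swap the model/provider for your own.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; use whatever your provider offers
    messages=[
        {"role": "user", "content": "Summarize: the checkout service returned 5xx errors for 12 minutes."},
    ],
)

# The model predicts the most likely continuation based on its pre-training.
print(response.choices[0].message.content)
```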

The main use cases for this are usually summarization, prediction, and content generation. If you search for something, you get a summary drawn from sources across the internet; you can also ask a model to predict something or to generate content, like a poem or even a book. These large language models can do all of that, and they’re widely available. OpenAI’s models are the most common today, but there are many others; Mistral is one of the most common for on-prem environments. You can use whatever you want, but there is one problem with them: they are too general for observability. They are trained not for observability but for general tasks like summarization and generation.

So, what do you think we can do to solve that? Yes, so basically, we have a possible solution: we can do fine-tuning. What we are going to do is take a general model that can solve a problem, for example an LLM, and then train it further with our data, with our organization’s data. Then the answers we receive will be more custom-made for our needs. However, we still face a few problems. The main issue is that an LLM cannot query live data on its own; it can’t create a PromQL query for you and bring the results back. So the interesting pattern, and it’s used a lot, especially in new applications built on AI, is connecting the LLM to your own data. This way, each company can define its own use case, keep the model general, and connect the LLM to its own database.
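As a rough sketch of what “connect the LLM to your own data” could look like, here is a hypothetical flow that asks a model to translate a plain-English question into PromQL and then runs the result against Prometheus’s HTTP API. The /api/v1/query endpoint is the real Prometheus API; the model name, prompt, and Prometheus URL are assumptions you would adapt.

```python
# Sketch: natural language -> PromQL -> live answer from Prometheus.
# Assumes the OpenAI Python SDK v1, a reachable Prometheus at PROM_URL,
# and that you review the generated query before trusting it.
import requests
from openai import OpenAI

PROM_URL = "http://prometheus:9090"  # assumption: adjust for your environment
client = OpenAI()

def question_to_promql(question: str) -> str:
    # Ask the model for a single PromQL expression, nothing else.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system", "content": "Reply with one PromQL expression only, no prose."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

def run_promql(expr: str) -> list:
    # Prometheus HTTP API: instant query against live data.
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    r.raise_for_status()
    return r.json()["data"]["result"]

if __name__ == "__main__":
    expr = question_to_promql("What is the p99 latency of the checkout service over the last 5 minutes?")
    print("Generated PromQL:", expr)
    print(run_promql(expr))
```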

So, how do you think the observability world will look if we add a bit of AI to it? Oh, wow. Which part of the observability world? Because there are so many steps. Now that we’ve gone through all the basics and everyone is on the same page, knowing what those buzzwords are, we’re going to take you through each step of observability. Together, we will try to imagine what this world will look like when we add the magic of AI to it. Maybe it will be a year from today, in five years, or a decade. My guess is that it’s going to happen sooner rather than later because all of us here are builders, and the timeline is really up to us.

So, the first step is continuous preparation. I bet all of you have done it: deciding which logs to collect from which server, which metrics are needed, and what the retention period for everything is. On top of that, we need to build dashboards and set up alarms. It takes a lot of time. Every time we add a new service to production, we need to go back and decide all these things all over again for every service. But maybe, when we add the magic of AI to the continuous preparation step, the model will create the dashboards for us. Actually, it’s not something so futuristic, because even now we can create YAML using AI, and it makes things so much easier. It can create complex queries for us. We don’t need so much knowledge or experience to create a good dashboard or to prepare good observability. And the fun part? You know how the DevOps or operations person is always a bottleneck? Well, no more, because with AI, even a junior developer or a product manager, anyone, can create queries and dashboards by themselves. That’s amazing.
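For example, here is a hedged sketch of “the model creates the dashboard for us”: prompting a model for a Grafana dashboard draft for a hypothetical new service. The prompt, model name, and file path are assumptions, and the output is a starting point to review, not something to import blindly.

```python
# Sketch: ask a model to draft a Grafana dashboard for a new service.
# The prompt, model name, and output path are assumptions; treat the result
# as a starting point to review, not something to apply blindly.
import json
from openai import OpenAI

client = OpenAI()

service = "payments-api"  # hypothetical new service
prompt = (
    f"Generate Grafana dashboard JSON for the service '{service}' with panels for "
    "request rate, error rate, and p95 latency based on Prometheus metrics. "
    "Reply with JSON only."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)

dashboard = json.loads(resp.choices[0].message.content)  # fails loudly if the reply isn't valid JSON
with open(f"{service}-dashboard.json", "w") as f:
    json.dump(dashboard, f, indent=2)
print(f"Wrote draft dashboard for {service}; review it before importing into Grafana.")
```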

Like, how many of you in the last week went to one of the dashboards and did your weekly, daily, or monthly review? I see a few hands. Once in a while, we go to the dashboard and try to assess if all the data we are looking at really looks good, if the data we are reviewing on the dashboard is actually relevant, and if we have any concerns that we need to solve. All of these questions are something we ask ourselves once in a while, and it’s nice, but it takes a lot of resources. Imagine how many dashboards there are; walking through them can be really painful. What we can do is have an AI assistant that will go through our dashboards and refine them for us. It will check the results and the data and gather insights for you, so you don’t have to do it on your own. For example, if something is going to explode, you would get it flagged, and that’s amazing.

Moreover, we do basic calculations on our own, but AI tools are packed with advanced algorithms, giving them much more advanced functionality than what we have today. There’s also a little thing about ownership. We know that we still need to be the owners of our dashboards and review them once in a while, but we can significantly reduce how frequently we do that, making AI our own observability assistant.
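A tiny sketch of such an observability assistant might look like this: pull the current value of a few key dashboard queries from Prometheus and let a model flag anything concerning. The queries, Prometheus URL, and model name are assumptions.

```python
# Sketch of a "dashboard review assistant": fetch the current value of a few
# key queries and ask a model to flag anything that looks concerning.
# The queries, model name, and Prometheus URL are assumptions.
import requests
from openai import OpenAI

PROM_URL = "http://prometheus:9090"  # assumption
QUERIES = {  # hypothetical panels from a dashboard
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "p95_latency": 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

client = OpenAI()

snapshot = {}
for name, expr in QUERIES.items():
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    r.raise_for_status()
    snapshot[name] = r.json()["data"]["result"]

review = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": "You review SRE dashboards. Flag anything concerning in this data, "
                   "or say 'all clear':\n" + str(snapshot),
    }],
)
print(review.choices[0].message.content)
```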

What do you think about alerting? Nowadays, with alerting, we face many challenges. One of the biggest ones for me is the fact that we need to configure each alert separately, even if we do it by code. We still need to define what the threshold for each service is, what is critical for production, and what is not. Another big challenge is that we probably all have this SLA channel with critical production alerts that alert you at night, but probably at some point, you had some false positives. They weren’t really false positives; they were positives, but as the system changes and scales, those thresholds stay the same while the criticality changes. Then, you just get alert fatigue because you have so many alerts in this channel that used to be just for the most critical ones. It’s like the boy who cried wolf; you don’t know if you really need to wake up in the middle of the night to take care of it or not. So, it does create a big challenge.

With the magic of AI, we can create alerts using LLMs for a start and get smarter thresholds. Nowadays, thresholds are static, but AI can look across the entire system, make correlations that are harder for us as human beings to make, and create dynamic thresholds that change with the state of the entire system or with the day, which is really cool. We can also get complexities and dependencies out of the box; we don’t need to map them ourselves, which saves the time of configuring alerts one by one.
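To show the idea behind dynamic thresholds, here is a deliberately simple stand-in: alert when a value drifts well outside its own recent behaviour, using a rolling mean plus a few standard deviations. The window and multiplier are assumptions to tune; an AI-driven system would go far beyond this, but the contrast with a single static number is the point.

```python
# Sketch of a dynamic threshold: instead of one static number, alert when a
# value drifts well outside its own recent behaviour. A rolling mean + k*stddev
# is a deliberately simple stand-in for the smarter, AI-driven thresholds
# described above; the window size and k are assumptions to tune.
from statistics import mean, stdev

def dynamic_threshold(history: list[float], k: float = 3.0) -> float:
    # The threshold adapts to whatever "normal" has looked like recently.
    return mean(history) + k * stdev(history)

def should_alert(history: list[float], current: float) -> bool:
    return current > dynamic_threshold(history)

# Example: request latency (ms) over the last hour, sampled per minute.
last_hour = [120, 118, 125, 130, 122, 119, 121, 127, 124, 123] * 6
print(should_alert(last_hour, 128))  # within recent normal range -> False
print(should_alert(last_hour, 190))  # well outside it -> True
```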

And how about investigation? Do you like investigating downtimes? I don’t like investigating at all, and I don’t think anyone does. You get an alert in the middle of the night, and then you need to start investigating. You need to find the right dashboard and figure out the right metrics, maybe do some correlations on your own. That’s bad. Sometimes you don’t even need to investigate an incident; sometimes you just have a question about your infrastructure or software. That’s where we can leverage LLMs and use them to investigate. We have connected them to our database, and now they can create the dashboard for us. When we have an incident, you don’t need to define everything in advance. You can say, “This is my incident; these are the relevant components. Please bring me the most relevant dashboards.” Where today you run PromQL queries one at a time, you would get a full dashboard built just for that incident, and that can be a game-changer when we go into investigation.
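A small sketch of “bring me the most relevant dashboards”: describe the incident, hand a model the dashboard inventory, and let it rank them. The inventory, incident text, and model name below are made up; in practice the list could come from your Grafana instance.

```python
# Sketch: "this is my incident, bring me the most relevant dashboards".
# The dashboard list, model name, and incident text are all assumptions.
from openai import OpenAI

client = OpenAI()

dashboards = [  # hypothetical dashboard inventory
    "checkout-service-overview",
    "payments-api-latency",
    "kubernetes-cluster-capacity",
    "postgres-primary-health",
    "ingress-nginx-errors",
]

incident = "Users report failed payments; 5xx spike on payments-api since 02:10 UTC."

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": "Given this incident, list the most relevant dashboards from the "
                   f"inventory, most relevant first.\nIncident: {incident}\nInventory: {dashboards}",
    }],
)
print(resp.choices[0].message.content)
```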

Something we see a lot is the way we extract logs. Sending the logs to our system is the easy part, but filtering them can be really hard, especially for people who are not the team’s observability experts. They don’t really know how to filter events and get the essence out of the logs. We can use AI to dig into the logs, find all the relevant data, and summarize them for us. This can be a real game-changer. One of the main things we talked about is that there’s the current state of the environment, and we want to keep that in mind when we investigate. One question we ask ourselves is whether we should solve the problem immediately or keep investigating to find the root cause at a deeper level. Will the first action we take actually fix it? These questions come up in every part of our incident management process.
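Here is a minimal sketch of AI-assisted log summarization: pre-filter the noisy lines and ask a model to group and summarize the problems. The local file path, the crude ERROR/WARN filter, and the model name are all assumptions; a real pipeline would read from your log backend.

```python
# Sketch: summarize a noisy log file with a model. The file path, model name,
# and the crude "keep only warnings/errors" pre-filter are assumptions.
from openai import OpenAI

client = OpenAI()

with open("app.log") as f:  # hypothetical local log file
    lines = f.readlines()

# Cheap pre-filter so we send the model the interesting part, not everything.
interesting = [line for line in lines if "ERROR" in line or "WARN" in line][-200:]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": "Summarize the main problems in these log lines and group "
                   "repeated errors together:\n" + "".join(interesting),
    }],
)
print(resp.choices[0].message.content)
```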

What LLMs allow us to do is essentially take a snapshot of the entire environment and gather all of the information, even while our servers are being recycled, and give us answers. That can be truly a game-changer in how we query what we want, whenever we want, in as simple a way as we want. The whole investigation world is going to turn upside down with LLMs.
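One way to sketch that snapshot idea, assuming kubectl is configured for the right cluster (and metrics-server is installed for `kubectl top`): capture pods, events, and node usage into a single file you can later feed to a model during an investigation.

```python
# Sketch: take a point-in-time snapshot of a Kubernetes environment so there's
# something to reason about even after pods have been recycled. Assumes kubectl
# is configured for the right cluster; `kubectl top` also needs metrics-server.
import datetime
import subprocess

COMMANDS = {
    "pods": ["kubectl", "get", "pods", "-A", "-o", "wide"],
    "events": ["kubectl", "get", "events", "-A", "--sort-by=.lastTimestamp"],
    "nodes": ["kubectl", "top", "nodes"],
}

snapshot_file = f"snapshot-{datetime.datetime.utcnow():%Y%m%dT%H%M%SZ}.txt"
with open(snapshot_file, "w") as out:
    for name, cmd in COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        out.write(f"=== {name} ===\n{result.stdout or result.stderr}\n")

print(f"Wrote {snapshot_file}; feed it to your LLM of choice when investigating.")
```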

What do you think about predictions? That’s my favorite part. Who here does on-call? Yes, thank you. And if you don’t, then maybe you are a CTO or something, because everybody I know does on-call duty. The fact that we’re still doing on-call means we’re not so good at making predictions by ourselves, so there’s a lot of room for improvement. With AI, we can basically make ourselves a sort of Megazord combination with a data engineer. If you are a data engineer, then you can make yourself a Megazord with an operations engineer. Then you have all the skills needed to make correlations across the entire system, make predictions more precise, and, of course, sleep better at night.
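As a toy example of a prediction, here is a naive linear fit over made-up disk-usage samples that estimates when the disk fills up. Real AI-driven prediction would use far richer models and signals; this only illustrates the shape of the question.

```python
# Sketch of a simple prediction: fit a straight line to recent disk usage and
# estimate when it crosses full. A linear fit is a deliberately naive stand-in
# for the richer models the talk imagines; the sample data is made up.
import numpy as np

hours = np.arange(24)                                             # last 24 hourly samples
disk_used_pct = 60 + 0.8 * hours + np.random.normal(0, 0.5, 24)   # fake data, trending up

slope, intercept = np.polyfit(hours, disk_used_pct, 1)
if slope > 0:
    hours_until_full = (100 - disk_used_pct[-1]) / slope
    print(f"At the current rate (~{slope:.2f}%/h), the disk is full in ~{hours_until_full:.0f} hours.")
else:
    print("Usage is flat or shrinking; nothing to predict here.")
```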

And after the downtime, as if it’s not bad enough, you need to write the postmortem. Exactly, no one likes to write postmortems. You need to gather the information: go through the metrics, the observability tools, the incident itself, maybe Slack or whatever chat you’re using, and collect all these pieces of information. It takes a very long time. After we gather all of that, we need to analyze. We need to think about which decision made a year ago had the impact we see today and what we need to do in the future to improve. So we need to analyze and be the big thinkers of what a good postmortem should be.

That’s interesting, because so many postmortems are written all over the world, based on the same kinds of data, and yet every company and team produces a completely different postmortem. What we will be able to do, first of all, is stop gathering the information ourselves; something else will gather it for us, which will be very impactful for data gathering. You will have everything in a single place, and from that point you would get automatic analysis. Something will take the most intelligent brains in the world, bring them together in one place with your data, and write the best postmortem ever written for that case. Maybe it’s already pre-trained on similar postmortems and how they were resolved, and that will be very impactful on how you benefit from postmortems.
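A sketch of an auto-drafted postmortem: collect whatever incident data you already have (the alerts, chat excerpts, and model name below are placeholders) and ask a model to fill a blameless template from it.

```python
# Sketch: draft a postmortem from whatever incident data you already collected.
# The timeline, alert payloads, and chat excerpts below are placeholders for
# data you'd pull from your alerting and chat tools; the model name is assumed.
from openai import OpenAI

client = OpenAI()

incident_data = {
    "alerts": ["02:10 UTC payments-api 5xx rate > 5%", "02:14 UTC p99 latency > 3s"],
    "chat_excerpts": ["02:18 on-call: rolling back release 2024-05-01-3"],
    "resolution": "Rollback completed 02:31 UTC, error rate back to baseline.",
}

prompt = (
    "Write a blameless postmortem with sections: Summary, Timeline, Impact, "
    "Root Cause (if known), Action Items. Use only the data provided.\n"
    f"{incident_data}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```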

And the last thing is, when we have a postmortem, it doesn’t end with writing the postmortem, right? It’s just words in a text document. We need to follow up on the action items and deliver them. Then, when we have action items, we actually go back to the start, back to continuous preparation, and so on and so on. Exactly. We want, first of all, to implement the action items, which is really hard, and second of all, to follow up on them. I would guess that all of you (feel free to raise your hand if that’s not the case) have implemented all of your action items from postmortems. Some people are laughing; I can see you understand what I mean. It’s really hard to follow up and make sure that we close all the action items.

So, will AI replace us? Well, that’s a tough question, because, yeah, maybe we’ll be out of work soon. No, but seriously, something that we have and AI doesn’t is ownership. We need to take ownership and be responsible for everything we make. So even when we add all of these things we’ve suggested to every step to make our lives easier, AI cannot take accountability. As long as you are the owner and you take accountability for everything that happens in each step, wherever you work, I think we’re still needed. I think we’ll still have a role, and deploying infrastructure as code won’t become obsolete either.

And that’s it for today. Thank you everyone for being with us today. If you have any questions, feel free to jump in and ask us here. Thank you.
