Episode #49 25:09 2025-09-24

#049 – The AI Translator: Using LLMs & MCP for K8s Operations & Self-Healing Infra with Alexei Ledenev (doit)

Alexei Ledenev
Platform Team, DoiT

Episode Overview

In this episode of Kubernetes for Humans, host Itiel Shwartz kicks off a mini-series on MLOps, LLMs and GenAI on Kubernetes with Alexei Ledenev, a platform engineer at DoiT and the author of open source tools like Pumba, k8s-mcp-server and aws-mcp-server. Alexei traces his container journey from CoreOS Fleet through Docker Swarm to Kubernetes, then digs into why he stopped chatting with AI assistants and started giving them real tools instead. The conversation covers how the Model Context Protocol turns kubectl into something an LLM can actually drive, why hallucinations drop sharply once the model can call tools and read errors, and where this is all heading: platforms that are self-aware enough to detect, predict and heal their own failures while operators set guardrails instead of running commands.

In this episode we discuss:

  • From CoreOS Fleet and Docker Swarm to Kubernetes — what early orchestration looked like and why Kubernetes won
  • Why even experienced DevOps engineers find Kubernetes overwhelming, and where AI assistants actually help
  • How the Model Context Protocol (MCP) turns kubectl and cloud CLIs into tools an LLM can drive directly
  • Real-world MCP use cases: troubleshooting a misconfigured ClickHouse operator and spinning up ad hoc test environments from a prompt
  • Reducing hallucinations with the right tools and context, and where self-healing, AI-aware Kubernetes platforms are headed

Key Takeaways

1. Kubernetes' real complexity is that infrastructure-level problems (networking, security, configuration) all reappear at the orchestration layer, which is exactly where LLMs can offload cognitive load.
2. MCP is the missing piece that lets AI assistants stop giving generic advice and start executing kubectl, reading logs, and exploring a live cluster like a senior engineer would.
3. Hallucinations drop dramatically when the model has tools: it tries a command, sees the error, asks the MCP server for help, and corrects itself on the next call.
4. Pairing a Kubernetes MCP server with a documentation MCP like Context7, plus a tool-capable model like Claude, is the practical recipe for reliable AI-driven troubleshooting today.
5. The next generation of Kubernetes-like platforms will blur the line between application and operations: the platform self-heals and self-scales while engineers provide guardrails, not runbooks.

Itiel Shwartz: Hello everyone and welcome to another episode of the Kubernetes for Humans podcast. My name is Itiel Shwartz and today we have on the show Alexei. Alexei, can you please introduce yourself?

Alexei Ledenev: Absolutely, and thanks for having me. It’s a great honor to be on this podcast today.

Itiel Shwartz: So tell us a bit about yourself. What do you do? Where do you work? When did you start working with Kubernetes? And I’ll say that this episode is going to be one of a couple that we’re going to do now around MLOps, LLMs and GenAI in Kubernetes. You are the first guest that I’m interviewing on that topic, so if you can, introduce yourself a bit more.

Alexei Ledenev: Okay, sure. I’ve been in software development for over two decades now, diving into areas like backend development, cloud architecture, DevOps and lately even some AI. It has been a pretty exciting ride. My journey really started with a passion for problem solving and building things, which is what I’m always doing, and over the years I have always gravitated towards problems that are a bit messy and unsolved, especially in complex distributed systems. As for Kubernetes, the journey actually started from Docker containers. Initially I was fascinated by the concept of containers, so I started to develop with containers and to learn how things work under the hood. That’s why I built, for example, my first open-source project, called Pumba. It’s chaos testing for containers, quite a popular project. And when we started to use containers at scale, initially I worked with Fleet, if you remember Fleet from CoreOS, then moved...

Itiel Shwartz: Share a bit more about that, and also share where you are currently working and where you worked before. Give a bit more background.

Alexei Ledenev: Okay, okay, I’ll give it. Currently I’m working at DoiT, on the platform team. Before that I also led a team of senior cloud architects at DoiT, consulting and helping DoiT customers work with cloud services and Kubernetes across multiple cloud providers like AWS, Google and Azure. Before that I was a solutions architect at AWS, and before AWS I worked at a Kubernetes startup called Codefresh, where we were building a CI/CD platform for Kubernetes.

Itiel Shwartz: reckon

Alexei Ledenev: And the company still exists, but it was acquired by, I believe, Octopus Deploy.

Itiel Shwartz: Well, now the main product of Codefresh is more like managed Argo, right?

Alexei Ledenev: Yes, exactly, and they are core maintainers of Argo.

Itiel Shwartz: mhm

Alexei Ledenev: And I also worked for many years in enterprise companies like HP Enterprise, Mercury Interactive and some others. So it was a long journey, as I said.

Itiel Shwartz: Yeah. Go.

Itiel Shwartz: No, no, no. We talked about Kubernetes and you mentioned trying Fleet. And DoiT, where you’re working, is one of the biggest service providers in Israel in general, right, for everything cloud-related. So now share a bit more about the Kubernetes journey. I don’t remember Fleet, I’ll be honest here, so if you can, refresh everyone’s memory on that.

Alexei Ledenev: Initially it was CoreOS itself. I liked the concept of CoreOS: a tiny operating system with an interesting approach to upgrades. Then they released something called Fleet for container orchestration. It was based on systemd plus a daemon that helped you orchestrate containers running through systemd. It was very low level. In the end they stopped the project, and we switched to Docker Swarm, another orchestration platform.

Itiel Shwartz: So what you’re saying is that Fleet came before Kubernetes was that popular. Back then the popular ones were Docker Swarm and Mesosphere, right? I don’t remember Fleet, I’ll be honest, so it’s even prior to that. And then you moved to another container orchestrator that also didn’t end up that successful, right?

Alexei Ledenev: Yes, and in the end I got to Kubernetes and I liked it a lot. It was kind of amazing.

Alexei Ledenev: But again, once you go deeper you understand it’s also a bit complicated. It’s not that simple. It’s another layer of architecture on top of your existing infrastructure architecture, and the problems you already solved at the infrastructure level, like virtualization, compute resources, networking and security, you need to solve again at the container orchestration level. Suddenly you need to handle not only the way you run your code but also how you orchestrate your fleet of microservices. You need to deal with networking, security, deployment, configuration, updates: a lot of headache. But what I actually like about Kubernetes is the community.

Itiel Shwartz: I agree.

Alexei Ledenev: The community is very good and very wide, with a real open source spirit. Having this community around really helped me get deep into Kubernetes. You see a lot of solutions being developed around Kubernetes: companies like yours and many others develop services and solutions, both open source and commercial. You can always find a few solutions attacking the same problem in different ways. So it’s a very rich ecosystem, with a lot of people around who are very familiar with Kubernetes.

Itiel Shwartz: So you started working with Kubernetes and you are seeing a lot of companies. I know that you have a couple of very popular open source projects, and quite a lot around MCP and LLMs in general. So why LLMs? You come from a very deep infrastructure background, right? Maybe share a bit about your journey towards learning LLMs to begin with, and also a bit about LLMs and Kubernetes: the good, the bad, and your experience with that.

Alexei Ledenev: Sure, sure. At DoiT I spent a lot of time helping teams adopt Kubernetes, and one pattern I kept seeing is that even experienced engineers, DevOps engineers and developers found Kubernetes a bit overwhelming. It’s very rich in terms of the number of objects, relations, CRDs, configurations and tools. It’s hard to know everything.

Itiel Shwartz: There is so much to remember: kubectl commands, YAML syntax, how to troubleshoot failed configurations and services. You need to be a real guru here.

Alexei Ledenev: That’s true. And when something goes wrong, it’s easy to get lost. At the same time, over the last year or two there has been a rise of AI assistants. It started with basic, fun chat with AI like ChatGPT, but over the last year these assistants have become really helpful: they help you write better code, they can help with troubleshooting, and they make it easier to translate natural language into scripts, code and the commands you need to run. But I started to wonder why I need to chat with my assistant at all. Why can’t it actually execute the commands and explore my environment, instead of giving me generic advice: if something goes wrong, this is the generic way to find out, here are a number of commands you can run, give me the output and I will tell you. So it’s a kind of ping-pong game with the AI chat. Initially I started working on dynamic code generation and execution, not for Kubernetes but for AWS.

Alexei Ledenev: Then Anthropic released their Model Context Protocol, MCP. For people who are less familiar with MCP (I believe it’s hard to be unfamiliar with it right now, it’s everywhere), people sometimes describe it as a USB hub for AI integration. It makes it easy to plug in different tools that follow the same standard for communication, and suddenly your AI-powered assistant, chat or whatever application has access to these tools and can use them to get more information or to execute commands instead of you. Then I realized it was the missing piece. I don’t need to develop dynamic code; I can use tools that were initially developed for humans, for DevOps engineers. So I said, okay, let’s try to give these tools to the AI assistant, and in case it is not aware of the exact syntax of a tool, I can also provide the detailed list of parameters and give some examples of how to use the tool, so it can learn while it’s running. That’s how the idea was born: take kubectl and some other helpful tools and allow the AI assistant to access and execute them.
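
A minimal Python sketch of the idea described here: a tool descriptor that carries a parameter list and usage examples, plus a handler that runs kubectl on the assistant's behalf. This is not the k8s-mcp-server code and does not use an MCP SDK; the names `Tool`, `KUBECTL_TOOL` and `run_kubectl` are hypothetical, and the injectable `runner` is an assumption made so the handler can be exercised without a cluster.

```python
import shlex
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    """Hypothetical descriptor: what the assistant reads before calling a tool."""
    name: str
    description: str
    parameters: Dict[str, str]
    examples: List[str] = field(default_factory=list)


KUBECTL_TOOL = Tool(
    name="run_kubectl",
    description="Run a kubectl command against the current cluster context.",
    parameters={"command": "Full command line, starting with 'kubectl'."},
    examples=[
        "kubectl get pods -n default",
        "kubectl describe pod my-pod -n default",
    ],
)


def _default_runner(argv: List[str]) -> str:
    """Execute argv without a shell and return stdout followed by stderr."""
    done = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    return done.stdout + done.stderr


def run_kubectl(command: str,
                runner: Callable[[List[str]], str] = _default_runner) -> str:
    """Validate the command, then hand it to the runner.

    On invalid input the error message includes the examples, so the
    caller gets a hint about the expected syntax.
    """
    argv = shlex.split(command)
    if not argv or argv[0] != "kubectl":
        hint = "; ".join(KUBECTL_TOOL.examples)
        return f"error: command must start with 'kubectl'. Examples: {hint}"
    return runner(argv)
```

Splitting with `shlex` and running without a shell keeps the assistant from chaining arbitrary shell commands through this one tool, which is one possible design choice among several.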

Itiel Shwartz: So let’s take one step back. You’re saying you wanted to have your own chat, one that knows your environment. Unlike asking Google or ChatGPT how to solve some pod problem, you want someone that knows your system, knows your clusters, knows everything, and has much more context. And with Anthropic’s release of MCP and its popularity, you wanted to do something similar for kubectl, right? So the goal is to enrich my LLM to be, I want to say smarter, but really more context-aware at the end of the day. Is that correct?

Alexei Ledenev: Exactly. If it can’t access tools, the model remains kind of static: it knows only what was known at the time it was trained. Then it needs additional tools, like web search, if you would like to get more current information. Context is king in LLMs. If you would like a better answer, you need to provide more context to the model. Of course, you can go and extract the data yourself, but you can also give the model access to your environment, your information, your documentation. And with MCP you even give it tools to execute commands and figure out how to run them. I imagine it like having a really smart friend who knows Kubernetes inside out: you can just talk in human language and ask this friend to help troubleshoot things, deploy something, configure something, and tools like Kube-MCP will do it for you instead of you doing it.

Itiel Shwartz: Okay, so it’s a kind of translator between AI and the Kubernetes environment. And what’s the feedback on the street? I know it was released as an open source project, and you also have one around AWS, right? So how big is this ecosystem? How many people really care about it? Is it only a buzzword? Maybe share a bit about the actual usage of these kinds of LLMs in production or in real-life scenarios.

Alexei Ledenev: I personally use it, and lately I used it to troubleshoot a ClickHouse installation on a Kubernetes cluster.

Itiel Shwartz: Oh cool.

Alexei Ledenev: It was a misconfiguration. You install ClickHouse through a CR with the ClickHouse Operator, and it has its own syntax for describing configuration that you need to learn. In the end it was a small misconfiguration that would have taken me time to figure out: a spelling mistake in the namespace where ClickHouse should install itself. For some reason it wasn’t reported as an error.

Alexei Ledenev: So it was not visible from inspecting logs and events. Everything looked perfect, but the ClickHouse cluster was not running. You need to explore a lot and run multiple commands to go through configuration, documentation, logs, events, permissions, everything, until you find the real problem. Of course you can do it yourself, but AI can do it for you, and much faster. I have also used it for troubleshooting failing containers, where you go and try to find out why a container is failing.
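
The kind of check that would catch a misspelled namespace can be sketched in a few lines of Python. This is an illustration only, not what the assistant or the ClickHouse Operator actually runs; `check_namespace` is a hypothetical helper, and the list of existing namespaces is assumed to have been fetched separately (for example from `kubectl get namespaces`).

```python
from difflib import get_close_matches
from typing import List, Optional


def check_namespace(referenced: str, existing: List[str]) -> Optional[str]:
    """Return a hint if `referenced` is not an existing namespace, else None."""
    if referenced in existing:
        return None
    # Look for a single near-match, which usually indicates a typo.
    close = get_close_matches(referenced, existing, n=1, cutoff=0.8)
    if close:
        return f"namespace '{referenced}' not found; did you mean '{close[0]}'?"
    return f"namespace '{referenced}' not found"


# Usage with made-up namespace names:
namespaces = ["default", "kube-system", "clickhouse"]
print(check_namespace("clikhouse", namespaces))
```

A silent mismatch like this produces no error in logs or events, which is why comparing the configured value against the live cluster state is the step that surfaces it.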

Alexei Ledenev: Nothing special. As for whether people are using it: I’m not monitoring the usage, but people ask me questions through LinkedIn and GitHub, or contact me directly, asking how to configure it and how to connect it to their cluster. So mainly troubleshooting scenarios, I would say, since there are a lot of commands you need to run to figure out what’s going wrong. Some also use it for creating ad hoc environments, and I have used it for that too. Of course it’s possible to create Helm charts that describe, say, my WordPress and NGINX or whatever running in this namespace. But if it’s a testing environment and you don’t need the Helm chart for the long run, you can just say, okay, please create this for me, and describe whatever you want, for example an autoscaling group of WordPress servers exposed through NGINX on a load balancer, and you will get it within a few minutes.
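
To make the ad hoc environment example concrete, here is a hedged Python sketch of the kind of objects such a prompt could end up creating: a Deployment, a LoadBalancer Service and a HorizontalPodAutoscaler. The function names, the image, the CPU request and the 70% utilization target are all illustrative assumptions, and the NGINX front end mentioned in the conversation is left out to keep the sketch short.

```python
import json
from typing import Dict, List


def adhoc_environment(name: str, namespace: str, image: str,
                      min_replicas: int = 2, max_replicas: int = 5) -> List[Dict]:
    """Build Deployment, Service and HPA objects as plain dictionaries."""
    labels = {"app": name}

    def meta() -> Dict:
        # Fresh metadata per object so the objects do not share state.
        return {"name": name, "namespace": namespace, "labels": dict(labels)}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": meta(),
        "spec": {
            "replicas": min_replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [{
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": 80}],
                    # A CPU request is needed for utilization-based autoscaling.
                    "resources": {"requests": {"cpu": "100m"}},
                }]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": meta(),
        "spec": {
            "type": "LoadBalancer",
            "selector": dict(labels),
            "ports": [{"port": 80, "targetPort": 80}],
        },
    }
    hpa = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": meta(),
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            }],
        },
    }
    return [deployment, service, hpa]


def as_list_manifest(objects: List[Dict]) -> str:
    """Wrap the objects in a single List document, serialized as JSON."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": objects}, indent=2)
```

Calling `as_list_manifest(adhoc_environment("wordpress", "test-env", "wordpress:latest"))` yields one JSON document that could be written to a file and applied with kubectl, assuming the target namespace already exists.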

Itiel Shwartz: So, as someone who also has our own agent that helps you troubleshoot things, I completely understand the need. How well does it work? One of the things that was super hard for us, even after working on it quite a lot, was that there were a lot of hallucinations. Do you see that happening? What’s your take on how to take it to the next level, or what are the main challenges in implementing or using something like that?

Alexei Ledenev: Of course there are hallucinations, but it comes down to providing the right context and the right tools to the AI, and of course relying on models that are able to understand instructions. For example, I used Claude a lot for this, but I also tried other models, including local models. For a local model you need to take an instruction-tuned model that is able to work with tools. Once the model has tools and understands that it needs to call a tool to execute an action or to get more context about the problem, it will hallucinate much less. I also use Context7. I can recommend this MCP for up-to-date documentation about many tools and many open-source libraries. You can always ask your AI to first validate against documentation using an additional MCP, and this way you reduce the number of hallucinations. And again, it’s AI: it tries to run a command, and sometimes you see it run the command with wrong parameters and see the error.

Alexei Ledenev: Then it asks the Kube-MCP server, please give me help on how to run this command, and it gets back the full list of parameters with explanations or some examples. The next time, it runs the command in the proper way. And sometimes you don’t have permissions, for example. Security is another topic, but when permissions are missing it tries a different way to get the same information, maybe through another kind of API. That’s the way: you rely on the AI to fix itself and to use the tools to figure out how to get the information and how to execute the command.
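
The try, read the error, ask for help, retry loop described above can be sketched as follows. Everything here is a stand-in: `execute`, `get_help` and `revise` are hypothetical callables (in a real setup the revision step would be the model itself), and the fake tool output is simulated rather than captured from a real cluster.

```python
from typing import Callable, List, Tuple

ToolResult = Tuple[bool, str]  # (succeeded, output or error text)


def run_with_self_correction(
    command: str,
    execute: Callable[[str], ToolResult],
    get_help: Callable[[str], str],
    revise: Callable[[str, str, str], str],
    max_attempts: int = 3,
) -> Tuple[str, List[str]]:
    """Run a command; on failure, fetch help and let `revise` propose a fix."""
    history: List[str] = []
    for _ in range(max_attempts):
        history.append(command)
        ok, output = execute(command)
        if ok:
            return output, history
        help_text = get_help(command)
        command = revise(command, output, help_text)
    raise RuntimeError(f"gave up after {max_attempts} attempts: {history}")


# Simulated tool, help text and "model" used only to exercise the loop.
def fake_execute(cmd: str) -> ToolResult:
    if "--namespace-name" in cmd:
        return False, "error: unknown flag: --namespace-name"
    return True, "pod-a Running"


def fake_help(cmd: str) -> str:
    return "Usage: kubectl get pods [-n NAMESPACE]"


def fake_revise(cmd: str, error: str, help_text: str) -> str:
    # A real model would read the error and help text; this stub hardcodes the fix.
    return cmd.replace("--namespace-name", "-n")


output, history = run_with_self_correction(
    "kubectl get pods --namespace-name demo", fake_execute, fake_help, fake_revise
)
```

The attempt cap matters: without it, a model that keeps proposing the same broken command would loop forever against the cluster.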

Itiel Shwartz: Okay, maybe as a last question I’ll ask you a bit about the future. Where are we going as an industry? In a year, two years, three years’ time, where is this LLM for troubleshooting going? What do you think are going to be the next breakthroughs? Share a bit about how you see the space.

Alexei Ledenev: Actually, it’s a great time. Many people are scared of AI, and maybe we should be, I don’t know. Maybe in a few years there will be no place for software engineers. But currently I believe it’s a very powerful tool that multiplies your knowledge and skills and allows you to operate in a very efficient way.

Alexei Ledenev: I also see a combination of AI and infrastructure coming together, where probably the next-gen Kubernetes, or a Kubernetes-like platform, will not have the kind of separation we have today between application and operation, where you care about deploying your application, somebody else cares about monitoring, logging and operating, and there is no interaction between the two. Probably the system should be more self-aware and platform-aware. The platform should be able to self-heal: to detect failures, to know how to fix itself and how to fix the application. And probably the developer or DevOps engineer will provide guardrails or direction: this is what you can do in order to fix things, without breaking the application. It should scale properly, not based on CPU or some other metric, but by understanding what happens behind the scenes and what should be changed right now: whether we need to add more resources, whether we need to save cost, or even predict a failure based on some patterns. I believe we are going in this direction, getting more brains inside a very powerful API-based infrastructure.
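
One way to picture "guardrails, not commands" is a small policy object that reviews each remediation the platform proposes before it is applied. This Python sketch is purely illustrative of that idea and does not describe any existing platform; `Action`, `Guardrails` and the specific rules are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Action:
    """A remediation proposed by the platform (hypothetical shape)."""
    kind: str        # e.g. "scale", "restart", "delete"
    namespace: str
    replicas: int = 0


@dataclass(frozen=True)
class Guardrails:
    """Limits set by the operator; the platform acts freely within them."""
    allowed_kinds: FrozenSet[str]
    protected_namespaces: FrozenSet[str]
    max_replicas: int

    def review(self, action: Action) -> str:
        if action.namespace in self.protected_namespaces:
            return "deny: protected namespace"
        if action.kind not in self.allowed_kinds:
            return "deny: action kind not allowed"
        if action.kind == "scale" and action.replicas > self.max_replicas:
            return "deny: exceeds replica limit"
        return "allow"


rails = Guardrails(
    allowed_kinds=frozenset({"scale", "restart"}),
    protected_namespaces=frozenset({"kube-system"}),
    max_replicas=10,
)
decision = rails.review(Action(kind="scale", namespace="shop", replicas=6))
```

The operator edits the limits rather than issuing the individual commands, which matches the division of labor described in the conversation.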

Itiel Shwartz: Yeah, I’ll be honest, that’s what we also see, and I hope it will happen as well. Any last words, Alexei, before we finish?

Alexei Ledenev: Probably. I need to think about it.

Alexei Ledenev: Maybe some advice for people who are starting with Kubernetes. With all the hype, Kubernetes is a very popular open source project and it seems to be everywhere, but actually it’s not: there are many teams that are just starting their Kubernetes journey, and they find Kubernetes a bit overwhelming, a bit too complex. My advice would be: be patient and persistent. Kubernetes might have a steep learning curve, but it’s a very powerful platform. Once you adopt it, invest time in learning the platform, knowing how it operates and how it works. Even with AI tools, don’t rely exclusively on the AI. You should use AI as a helper that accelerates your work and helps you troubleshoot, but you need to understand how Kubernetes operates under the hood. You need to know this technology before you adopt it.

Itiel Shwartz: Okay, I agree. I couldn’t agree more, Alex. Thank you very much, it was a pleasure having you.

Alexei Ledenev: Okay, it was a pleasure. My pleasure.

[Music] Kubernetes for Humans.

This is an AI generated transcript of the conversation

About the Guest

Alexei Ledenev
Platform Team, DoiT
Alexei Ledenev is a software engineer on the Platform Team at DoiT with over two decades of experience across backend development, cloud architecture, DevOps, and lately AI. His containers journey began with Docker and CoreOS Fleet — one of the earliest container orchestrators, predating the rise of Kubernetes — before he moved through Kubernetes-native startups (Codefresh, since acquired by Octopus Deploy) and a stint as a Solutions Architect at AWS. He is the author of several well-known open source projects, including Pumba (chaos testing for Docker and Kubernetes) and the more recent k8s-mcp-server and aws-mcp-server, which let AI assistants drive kubectl and the AWS CLI through the Model Context Protocol.