K8sGPT is a tool that uses large language models (LLMs), including those from OpenAI, Azure, Cohere, Amazon Bedrock, Amazon SageMaker, and Google Vertex AI, to improve the management and automation of Kubernetes clusters. It also integrates with open-source LLMs such as Meta's Llama for on-premises use.
Kubernetes is an open-source platform for automating the deployment, scaling, and operation of application containers. K8sGPT integrates with Kubernetes to provide intelligent insights, automate routine tasks, and improve operational efficiency.
K8sGPT uses LLMs to analyze logs, monitor performance metrics, and predict potential issues before they escalate. This helps in maintaining the health and performance of Kubernetes clusters, reducing downtime, and ensuring optimal resource utilization.
K8sGPT is open source under the Apache 2.0 license. It has over 5K GitHub stars, over 80 contributors, and has been accepted as a Sandbox project by the Cloud Native Computing Foundation (CNCF).
You can get K8sGPT from the official GitHub repo.
The tool offers the following features:
- Continuous scanning of Kubernetes clusters to detect anomalies and potential issues
- AI-generated, plain-language explanations of detected problems
- Anonymization of sensitive identifiers, such as pod names, before prompts are sent to an AI provider
- Support for multiple hosted AI backends as well as locally hosted models
K8sGPT operates similarly to an experienced Site Reliability Engineer (SRE), providing continuous monitoring and analysis of Kubernetes clusters to detect anomalies and potential issues. It starts with a data collection process where it selectively gathers information from the clusters. It ensures that only relevant data is used, maintaining privacy and security by anonymizing collected data and filtering out unnecessary information.
Once the data is collected, K8sGPT uses the LLM of your choice to interpret and analyze the information, much like an SRE would. For example, if a pod isn’t running, K8sGPT checks the event stream to identify possible causes, such as a missing service account in a replica set. This allows it to generate precise problem explanations using generative AI models, sometimes uncovering issues that even seasoned SREs might overlook.
K8sGPT supports integrations with OpenAI, Azure, Cohere, Amazon Bedrock, Amazon SageMaker, Google Gemini, and Vertex AI. It anonymizes pod names before sending prompts to these providers, helping keep cluster data secure. It also supports connections to local models, for organizations that prefer not to send their data externally.
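As a rough sketch of that anonymization idea (a hypothetical illustration only, not K8sGPT's actual implementation), a sensitive pod name can be swapped for a stable placeholder before a prompt leaves the machine:

```shell
# Hypothetical sketch of prompt anonymization -- not K8sGPT's real code.
# Replace a sensitive pod name with a stable placeholder before the text
# is sent to an external AI provider. (cksum is used here only for
# illustration; it is not a cryptographic hash.)
pod_name="payments-api-6f7d9c-abc12"   # assumed example name
token=$(printf '%s' "$pod_name" | cksum | cut -d' ' -f1)
prompt="Pod pod-${token} is failing its readiness probe"
echo "$prompt"
```

The provider then reasons over `pod-<token>` instead of the real name, and the tool can map the placeholder back to the original identifier locally.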
Itiel Shwartz
Co-Founder & CTO
Based on my experience, here are a few ways to make better use of K8sGPT in your organization:
Beyond default metrics, set up custom metrics relevant to your applications to allow K8sGPT to provide more precise insights and recommendations.
Use K8sGPT to automate security scans of your containers and Kubernetes configurations, ensuring compliance with best practices and identifying vulnerabilities.
Combine K8sGPT with tools like OPA (Open Policy Agent) to automate the enforcement of policies across your Kubernetes environments, ensuring consistency and security.
Configure K8sGPT to alert on detected anomalies in real time, enabling quicker response times to potential issues.
For large-scale Kubernetes environments, consider fine-tuning custom LLMs tailored to your specific Kubernetes workloads and integrate them with K8sGPT for specialized insights.
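For the local-model route mentioned above, K8sGPT's localai backend can point at a self-hosted, OpenAI-compatible endpoint. The model name and base URL below are placeholders; adjust them to whatever you actually serve:

```shell
# Register a locally hosted model (LocalAI / OpenAI-compatible server).
# "llama3" and the URL are placeholder values for your own deployment.
k8sgpt auth add --backend localai --model llama3 --baseurl http://localhost:8080/v1
```

This keeps all prompt data inside your own network.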
This tutorial is adapted from the official K8sGPT documentation.
To install K8sGPT on your Linux or Mac machine, you will use Homebrew, a popular package manager. Follow these steps to ensure a smooth installation process:
1. Install Homebrew if you don't have it already (see https://brew.sh/; for Linux, see https://docs.brew.sh/Homebrew-on-Linux).
2. Add the K8sGPT tap:
brew tap k8sgpt-ai/k8sgpt
3. Install the K8sGPT CLI:
brew install k8sgpt
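Assuming the tap and install completed, you can sanity-check that the binary is on your PATH:

```shell
# Print the installed K8sGPT version to confirm the installation.
k8sgpt version
```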
To try out K8sGPT, you need a Kubernetes cluster. You can set up a local cluster using tools like KinD (Kubernetes in Docker) or Minikube. Below are the steps for setting up a KinD cluster, which is useful for local testing:
brew install kind
kind create cluster --name k8sgpt-demo
To leverage the AI capabilities of K8sGPT, you need to authenticate with an AI provider, such as OpenAI. Follow these steps to authenticate with OpenAI:
1. Generate an OpenAI API key (this command opens the OpenAI key-generation page in your browser):
k8sgpt generate
2. Register the key with the OpenAI backend:
k8sgpt auth add --backend openai --model gpt-4-turbo
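To confirm the backend was registered, K8sGPT can list its configured auth providers:

```shell
# List configured AI backends; the active/default provider is indicated.
k8sgpt auth list
```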
K8sGPT provides a variety of commands to interact with and analyze your Kubernetes cluster. You can view all available commands and their usage by running k8sgpt --help. This will display a list of commands along with a brief description of each.
k8sgpt --help
Ensure you are connected to the correct Kubernetes cluster before analyzing it with K8sGPT. For this example, use the KinD cluster you set up earlier:
1. Check the current Kubernetes context and ensure you are connected to the KinD cluster:
kubectl config current-context
kubectl get nodes
This will display the current context and the nodes in your cluster.
2. To demonstrate K8sGPT’s capabilities, create a pod with an intentional error. Create a new YAML file named bad-pod.yml with the following contents:
bad-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: bad-pod
  namespace: default
spec:
  containers:
  - name: broken-pod
    image: nginx:1.a.b.c
    ports:
    - containerPort: 80
      protocol: TCP
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
In this configuration, the image tag 1.a.b.c is invalid, so Kubernetes cannot pull the container image and the pod never starts; as a result, the readiness probe never succeeds.
3. Apply this configuration by running kubectl apply -f bad-pod.yml
kubectl apply -f bad-pod.yml
4. Use k8sgpt analyze to analyze the cluster and identify issues. This command will scan the cluster and list any detected problems. For the broken pod example, it will highlight the error related to the incorrect container image.
k8sgpt analyze
5. To explore additional flags and options for the analyze command, use k8sgpt analyze -h.
k8sgpt analyze -h
6. For a detailed explanation of the issues, run k8sgpt analyze --explain.
k8sgpt analyze --explain
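On larger clusters you may want to narrow the scan. The analyze command accepts filter and namespace flags, so the broken pod above can be targeted directly (the namespace value assumes the default namespace used in this tutorial):

```shell
# Restrict analysis to Pod resources in the default namespace,
# and ask the configured AI backend to explain what it finds.
k8sgpt analyze --explain --filter=Pod --namespace=default
```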
Komodor is the Continuous Kubernetes Reliability Platform, designed to democratize K8s expertise across the organization and enable engineering teams to leverage its full value.
Komodor’s platform empowers developers to confidently monitor and troubleshoot their workloads while allowing cluster operators to enforce standardization and optimize performance.
By leveraging Komodor, companies of all sizes significantly improve reliability, productivity, and velocity. Or, to put it simply – Komodor helps you spend less time and resources on managing Kubernetes, and more time on innovating at scale.
If you are interested in checking out Komodor, use this link to sign up for a Free Trial.