Speakers Deck available for download
(Transcription below)
“Drift Happens” explores why configuration drift remains a persistent challenge—even in GitOps-enabled Kubernetes environments. The session dives into real-world cases, operational risks, and how Komodor enables visual detection, automated remediation, and root cause insights across complex, multi-cluster setups.
Kubernetes drift happens when the actual state of your clusters no longer matches the desired state. Despite GitOps, drift often slips in through manual changes, failed or partial deployments, and configuration inconsistencies across clusters and environments.
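Not part of the webinar itself, but as a rough sketch of what "actual vs. desired" means in practice: the snippet below reads one Deployment's live spec and compares a few fields against the values you would expect from Git. The kubeconfig context, namespace, service name, and expected values are all hypothetical, and the Python kubernetes client is used purely as an illustration of the kind of check kubectl diff performs.

```python
# Minimal live-vs-desired check for one Deployment (illustrative values only).
from kubernetes import client, config

# Values we expect based on what is committed in Git (hypothetical).
DESIRED = {
    "image": "registry.example.com/checkout:1.4.2",
    "memory_limit": "512Mi",
    "replicas": 3,
}

def read_live(context, namespace="prod", name="checkout"):
    """Read the fields we care about from the live cluster."""
    api = client.AppsV1Api(config.new_client_from_config(context=context))
    dep = api.read_namespaced_deployment(name, namespace)
    container = dep.spec.template.spec.containers[0]
    return {
        "image": container.image,
        "memory_limit": (container.resources.limits or {}).get("memory"),
        "replicas": dep.spec.replicas,
    }

live = read_live("prod-us")
drift = {k: (v, live[k]) for k, v in DESIRED.items() if live[k] != v}
print("in sync" if not drift else f"drift detected: {drift}")
```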
Ilan Adler: Let’s get started. Good morning, good afternoon, good evening to everyone wherever you are in the world. I see we’re pretty spread out. Welcome to the Komodor webinar. Today we’re going to focus on “Drift Happens: Kubernetes Drift Scenarios and How to Overcome Them.”
I’ll quickly run through the agenda. We’ll start with some housekeeping and introductions, then dive into why drift happens, its impact in actual environments, best practices and strategies, a short live demo, and then wrap with a Q&A.
Just a little housekeeping before we start. Yes, the webinar is recorded — as you probably heard — and we’ll share the slides with all participants. There’s a Q&A button you can use if you’d like to ask questions. And the session should run around 40 to 45 minutes total. Alright, time to meet our speakers. I’m Ilan Adler, Product Marketing Manager at Komodor. Joining me today and leading the webinar is Chen Kubani, Product Manager at Komodor.
Chen Kubani: Hey!
Ilan Adler: Hey Chen. Chen leads the troubleshooting team and has held several product leadership roles in the past, including Head of Cloud Customer Journey and Big Data/ML Cloud Service PM roles at large public organizations. She’s usually spotted at the office with her dog, Lichie — though no promises on a guest appearance today! Chen is also the product lead for our Drift Detection and Management feature. Welcome Chen — over to you.
Chen Kubani: Thanks, Ilan. And thank you all for joining. Let’s start with a quick poll: How does your team primarily detect potential configuration drift in Kubernetes today?
Ilan Adler: Just curious to see what the responses are.
Chen Kubani: We’ll give it another 15 seconds or so. You can see the poll options and choose the one that fits your setup. Then we’ll review and talk about how different people are managing drift. Alright, I’m going to close the poll and share the results. 60% of the 10 people who answered are using built-in features of GitOps tools like Argo CD and Flux.
The rest either don’t have a consistent process or are managing drift manually — probably with kubectl diff or reactive investigation.
Ilan Adler: Interesting to see how many are using GitOps. I think Chen’s going to dig into that shortly. Back to you.
Chen Kubani: Yeah, I’m actually happy to see these results. A lot of people assume that if they’re using GitOps, they’re covered. But the fact that you’re here today shows that it’s still a pain point — even with GitOps in place. When we talk about drift, there are several concerns. First, if you’re not using GitOps, it’s really hard to track who changed what across clusters.
But even with GitOps, drift can occur between clusters — especially across different regions or environments. A change meant for dev or staging might unintentionally reach production. Or maybe the CD process failed and some deployments didn’t complete — that’s also drift.
So why does drift happen? It often comes down to scale. If you have a small number of clusters and services, you might not see it. But once your organization grows — or you're dealing with compliance requirements across different customer environments — things get more complex, and drift starts creeping in. Manual changes are another cause. If a developer needs to fix something quickly, that can break consistency — even if you're using GitOps. Then there are deployment issues. In large, complex environments, you might have noisy neighbors, failed deploys, or incomplete rollouts. All of these can cause drift.

Let's look at a few examples. First, we had two production environments — one in the EU and one in the US. Everything was running smoothly in the EU, but one specific service was broken in the US. After troubleshooting, we found a misconfiguration: inconsistent memory limits during deployment. You expect consistency between production environments — but these things happen. And when only one environment is broken, it can take a long time to find and compare configurations.

In another example, imagine hundreds of services across multiple clusters. One failed deployment leaves a cluster with an outdated image. Now a service is running an older version, and it's hard to catch unless you're specifically checking for it.

In a third case, a developer accidentally pushed a config meant for staging into production. It might be a misconfigured liveness or readiness probe, or CPU/memory requests that are too low or too high. That kind of misalignment can wreak havoc in production.
Ilan Adler: Just to add, all these examples are real cases we’ve heard directly from our customers. They’re not hypotheticals. The timelines shown are based on how long our customers said it took to identify and resolve these issues. Drift happens — and it happens a lot.
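As a hedged illustration of the first example (the EU/US mismatch), a comparison like the one below is roughly what the manual investigation amounts to: read the same Deployment from each production context and report where the specs disagree. The context names, the service, and the fields checked are assumptions, not anything specific to the customer cases described above.

```python
# Compare one Deployment across several cluster contexts and report mismatches.
from kubernetes import client, config

CONTEXTS = ["prod-eu", "prod-us"]          # hypothetical kubeconfig contexts
NAMESPACE, DEPLOYMENT = "payments", "api"  # hypothetical service

def spec_snapshot(context):
    """Capture the handful of attributes that commonly drift."""
    api = client.AppsV1Api(config.new_client_from_config(context=context))
    dep = api.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    c = dep.spec.template.spec.containers[0]
    probe = c.readiness_probe
    return {
        "image": c.image,
        "memory_limit": (c.resources.limits or {}).get("memory"),
        "cpu_request": (c.resources.requests or {}).get("cpu"),
        "readiness_path": probe.http_get.path if probe and probe.http_get else None,
    }

snapshots = {ctx: spec_snapshot(ctx) for ctx in CONTEXTS}
baseline_ctx = CONTEXTS[0]
baseline = snapshots[baseline_ctx]
for ctx, snap in snapshots.items():
    diffs = {k: (baseline[k], v) for k, v in snap.items() if v != baseline[k]}
    if diffs:
        print(f"{ctx} drifts from {baseline_ctx}: {diffs}")
```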
Chen Kubani: Definitely. So what's the impact of drift? First, you'll see performance and stability issues. Degraded services reduce performance and increase failure rates and downtime. Second — and we hear this a lot — troubleshooting time skyrockets, especially when you're dealing with large environments. We have customers with 200 clusters, where each cluster represents a store. It's extremely hard to keep track and make sure everything is aligned after a new version rolls out. Security is another risk. If you're not aligned, you might have outdated configurations or services exposed to known vulnerabilities. And then there's cost. Misaligned CPU and memory settings don't show up as bugs — but they do inflate your cloud bill. You may not even realize configurations are out of sync across environments.

Now, let's talk recommendations and techniques. Drift happens — but you can detect and respond to it. It requires some investment, mostly from platform or DevOps teams, but it's often worth it. Start by setting guardrails: policies, automation, YAML baselines — anything that helps enforce best practices. You can also prevent manual changes. Use RBAC to lock down environments where you expect alignment. Moving to GitOps helps too. It's hard to implement but having Git as the source of truth makes a big difference. And even if you're not using GitOps, you can connect drift events back to the changes that caused them. We've seen customers restrict developer actions to avoid these issues entirely.

Automation is another key. Set up monitors and alerts to catch misconfigurations early — ideally in lower environments, before they impact production. And integrate drift detection into your troubleshooting process. Teach your team to consider drift as a root cause. Add it to your internal docs or runbooks. Ask: is this happening in other clusters or environments? That awareness alone can make a big difference.
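For the "monitors and alerts" recommendation, here is a minimal sketch of the glue a platform team might run themselves (or replace with a purpose-built tool): rerun a drift check on a schedule and post any violations to a chat or alerting webhook. The interval, the webhook URL, and the check function are placeholders; in practice the check would be something like the cross-cluster comparison sketched earlier.

```python
# Toy drift monitor: rerun a check periodically and alert on violations.
import json
import time
import urllib.request

ALERT_WEBHOOK = "https://hooks.example.com/drift"  # hypothetical endpoint
CHECK_INTERVAL_SECONDS = 300

def check_drift():
    """Placeholder: return a list of violation dicts, e.g. from the
    cross-cluster comparison shown earlier."""
    return []

def send_alert(violations):
    """Post the violations to an alerting webhook as JSON."""
    body = json.dumps({"text": f"Drift detected: {violations}"}).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

while True:
    violations = check_drift()
    if violations:
        send_alert(violations)
    time.sleep(CHECK_INTERVAL_SECONDS)
```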
Ilan Adler: That’s great, Chen. Now let’s take a look at how Komodor helps detect and manage drift.
Chen Kubani: Yes — let's talk about how we're solving this at Komodor. All the scenarios I shared are based on real customer pain. That's why we built this feature: not just to detect drift, but to make it easy to understand.

We start by letting users define a baseline — a service or package that acts as the North Star. Once that's set, you can compare it across clusters and environments. Komodor then highlights differences in a visual, intuitive way. You can quickly see config mismatches — things like image versions or CPU settings — and trace them back to specific deploys. You can even investigate the root cause. For example, Komodor might show a failed deployment where a container image couldn't be pulled due to a registry error. You can also define exactly which packages and resource attributes to monitor. We support continuous drift checks — even in multi-cloud setups — and flag violations in real time.
Ilan Adler: While Chen sets up the demo, just a reminder — feel free to use the Q&A tab if you have questions.
Chen Kubani: Here in Komodor, you can see full visibility for each cluster. In this example, we see drift violations in the production cluster. We can drill down and view specific packages. Even across multi-cloud, you can choose which clusters and packages to monitor. For instance, if we check the Komodor agent package, we compare its presence and configuration across all clusters. Here we see one deployment has an older image and different CPU settings. When we analyze it, Komodor tells us why: a failed deploy due to a container registry issue. This closes the loop — not just what drifted, but why, and how to fix it. You can define your baseline, targets, packages, and the specific attributes to track. And you’ll get alerts and dashboards highlighting any violations. That’s the demo!
Ilan Adler: Thanks Chen — great demo. That’s a quick overview of Komodor’s drift management features. Let’s open it up for questions. We demoed this a lot at KubeCon London recently. One question that kept coming up: “We intentionally have differences between staging and production. How does Komodor handle that?”
Chen Kubani: Great question. I’ll start with the general best practices and then explain how we handle it in Komodor. The key is to define separate baselines and alerts for different environments. If your staging setup uses lower memory, define that as the baseline for staging. For production, set different expectations. Komodor lets you do this easily — based on naming patterns, regions, or custom tags. You can define alerts specific to each environment and track only the attributes that matter in that context.
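Outside of any specific tool, the idea of separate baselines per environment can be expressed as a mapping from a cluster-name pattern to that environment's own expected values, so staging is judged against staging expectations and production against production ones. The patterns and values below are purely illustrative.

```python
# Per-environment baselines: match clusters by name pattern and compare each
# one only against the baseline for its own environment (values are made up).
import fnmatch
from typing import Optional

BASELINES = {
    "staging-*": {"memory_limit": "256Mi", "replicas": 1},
    "prod-*":    {"memory_limit": "512Mi", "replicas": 3},
}

def baseline_for(cluster: str) -> Optional[dict]:
    """Return the expected values for the environment this cluster belongs to."""
    for pattern, expected in BASELINES.items():
        if fnmatch.fnmatch(cluster, pattern):
            return expected
    return None

def violations(cluster: str, live: dict) -> dict:
    """Fields where the live state differs from this environment's baseline."""
    expected = baseline_for(cluster) or {}
    return {k: (v, live.get(k)) for k, v in expected.items() if live.get(k) != v}

print(violations("prod-us", {"memory_limit": "256Mi", "replicas": 3}))
# -> {'memory_limit': ('512Mi', '256Mi')}
```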
Ilan Adler: That leads into another question we get: what about hybrid environments — like when customers have a mix of on-prem and cloud clusters across regions?
Chen Kubani: That’s a great one. Kubernetes resources work similarly across environments, but hybrid setups introduce more inconsistency — which increases drift risk.
Drift detection helps unify monitoring across those diverse platforms and ensures configuration consistency — whether you’re running on-prem or in the cloud. But again, you’ll need to define what “consistent” means for your setup and invest some time in configuration.
In Komodor, it’s easy. Without it, you’d need custom scripts and tooling.
Ilan Adler: One last one — and final call for any remaining questions. Chen, you mentioned earlier that GitOps isn’t always enough. Do we have other examples of where GitOps falls short?
Chen Kubani: Yes — this comes up a lot. Platform teams usually understand GitOps won't catch everything. Failed deploys, CI/CD errors, and multi-cluster applications don't always surface in Git. Everything can look fine in Git — but still be broken in production. And unless there's awareness — unless developers and platform teams know to check for drift — it's easy to miss. That's why it's so important to bake this into your troubleshooting and monitoring processes.
Ilan Adler: Exactly. And like we said, this usually becomes a real issue at scale. When you’re managing just a few clusters, it’s manageable. But once you scale up — drift becomes a real problem. Alright, I think we’re out of questions. If anyone has more, feel free to email us. You’ll get the recording and slides shortly. Thanks again to Chen for walking us through this. Wishing everyone a good night or great rest of your day. Cheers!
Chen Kubani: Thank you all! Reach out any time — we’re here to answer your questions or follow up. Bye!