Drift Detection in Kubernetes

When the increasingly popular practice of configuration as code (CaC) is applied to infrastructure, it’s known as infrastructure as code (IaC). Today, IaC is quickly becoming entrenched in development processes, especially in conjunction with Terraform and Kubernetes. Yet, although IaC (and CaC) bring immense value, they can also lead to a major problem: configuration drift. In this article, we will take a closer look at this issue and explore different methods of keeping systems in their intended state.

What Is Configuration Drift and How Does It Affect IaC? 

IaC involves keeping all infrastructure as code in a version control system like GitLab or GitHub. Any changes to the infrastructure are made through code and then run through a CI/CD pipeline that automatically tests the code, analyzes it for security issues, and applies it.

In a normal scenario, waiting for the pipeline to run is fine. However, during a production outage or a similar emergency, teams might need to make changes manually, and those changes never make it back into the code.

The resulting gap between the actual state of the system and the state defined in your code is configuration drift.

Configuration Drift in Kubernetes

In the context of Kubernetes, the YAML configurations are what can suffer from drift: Deployment and Service manifests (deployment.yaml, service.yaml), Helm charts and their values.yaml files, and any other Kubernetes objects.
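
For example, consider a minimal Deployment manifest like the hedged sketch below (the names and image are hypothetical). If someone scales the deployment directly with kubectl during a traffic spike, the live object no longer matches this file, and the cluster has drifted from the repository. Running kubectl diff -f deployment.yaml would surface that difference.

```yaml
# deployment.yaml - a hypothetical manifest kept in version control
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service        # hypothetical name
spec:
  replicas: 2                   # drifts if someone runs: kubectl scale deployment checkout-service --replicas=5
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2   # hypothetical image
          ports:
            - containerPort: 8080
```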

If Terraform was used to create the Kubernetes cluster, then the Terraform code is also what might suffer from drift. This includes cluster-level configuration such as the kube-proxy and CNI deployment YAML, as well as the HCL (HashiCorp Configuration Language) that Terraform uses to configure Kubernetes clusters both in the cloud and on-premises.

Regardless of whether you use Terraform with Kubernetes for cluster configuration or just Kubernetes YAML (via Helm or on its own), the principles of IaC state that all files must remain in a version control system and that changes should be made there. Once those changes are complete, an automated or manual pipeline applies them to the Kubernetes cluster. Adhering to this pipeline for all use cases prevents configuration drift. This is the ideal workflow, but unfortunately, there will be scenarios that necessitate making changes in Kubernetes clusters directly, creating drift.

Why Is It Important to Track and Eliminate Configuration Drift? 

Being out of sync can lead to the following consequences:

  • In the next deployment, manual changes can be overwritten, undoing the fixes they contained. Ultimately, this can leave your system in an unstable state.
  • Similarly, manual security fixes can be overwritten, opening systems up to a breach.
  • The pipeline deploying the changes can break down due to incompatibilities between manual changes and the configuration code, making it difficult to deploy any further changes. If this is not fixed immediately, the situation can spiral into larger configuration drift as various stakeholders start making changes manually.

Reasons for Kubernetes Configuration Drift

In addition to a lack of communication between teams, here are some of the major reasons why changes end up in a Kubernetes cluster without being reflected in the configuration code:

1. Hotfixes in Kubernetes Objects

There can be scenarios that require developers to send a hotfix to a certain Kubernetes object (like a Service or Deployment): for instance, a quick fix to allow some network calls, whitelist IP addresses, or scale a Deployment up or down. If these hotfixes are not committed back to version control, they create drift.
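
As a hedged illustration, here is what an IP-whitelisting hotfix applied directly to a live Service might look like (the names and CIDR are made up). Unless the same change is committed back, the repository copy of service.yaml no longer describes what is actually running.

```yaml
# Live Service after a manual hotfix (hypothetical names and CIDR)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
spec:
  type: LoadBalancer
  selector:
    app: checkout-service
  ports:
    - port: 443
      targetPort: 8080
  loadBalancerSourceRanges:     # added ad hoc with kubectl edit to whitelist a partner IP range
    - 203.0.113.0/24
```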

2. Manual Changes for Outages, Features, or Security 

During outages, waiting for pipelines to execute can delay the fix. It is common to address pressing issues by making direct changes to the cluster, but these production fixes can be a major source of drift if they are not synced back.

Beyond outages, the pressure to deliver a faster proof of concept (POC) can also push teams to make manual changes that are not committed back to version control. Urgent security actions like blocking flagged IPs, networking-level changes such as Ingress (the Kubernetes object that routes external traffic into the cluster), and service-to-service communication policies can also contribute to drift.
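
For instance, an on-call engineer might apply a NetworkPolicy like the sketch below directly with kubectl to cut off a flagged address range (the namespace, labels, and CIDRs are hypothetical). If it never lands in the repository, it becomes invisible drift.

```yaml
# Hypothetical NetworkPolicy applied directly during an incident
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-flagged-range
  namespace: payments            # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: checkout-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 198.51.100.0/24  # flagged range (example CIDR)
      ports:
        - protocol: TCP
          port: 8080
```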

3. Operators Making Changes to YAML 

At times, operators running in clusters make changes to certain objects that are not committed back to version control. Because operators work on a reconciliation model, they continuously enforce their own desired state, and version control can go out of sync with what is actually running.

The Challenges That Kubernetes Configuration Drift Poses to Fleet-Scale Environments

For large-scale users, it’s not unusual to have hundreds or thousands of clusters. (For example, Chick-fil-A runs around 2,800 Kubernetes clusters at the edge, one for each restaurant.) Tackling drift at such a large scale is an enormous undertaking. When running a large number of clusters, make sure the factors described below are standardized across them:

  • Kubernetes Cluster Versions: Ensuring all clusters run at least a minimum specified Kubernetes version is an important step toward maintaining a stable environment.
  • Node Groups, Components, and Addons: Nodes are the VMs where workloads run and must be compatible with the Kubernetes version. A node also runs multiple components and addons to be fully functional, such as kube-proxy, CNI plugins, CSI plugins, the container runtime (CRI), logging agents, and the kubelet; inconsistencies in their versions can also lead to drift.
  • Policies and Guardrails: Network policies and other guardrails are applied to clusters via policy agents like Kyverno and OPA. It’s essential to make sure they’re all in sync with your configuration (see the sketch after this list).
  • Access Control and User Management: Who can access your cluster, and which components of it, should be controlled from a centralized place. Make sure there’s no drift in defined access; otherwise, audits can surface major problems.
  • Resource Consumption Limits, Requests, and Cost: It takes a lot of configuration to control resource consumption, the amount of resources allocated, and autoscaling. Configuration drift can undo that work and impact an organization’s bottom line.
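
As a minimal sketch of the policy and resource-limit points above, a Kyverno ClusterPolicy along these lines can require every container to declare CPU and memory requests and limits. If the policy running in a cluster diverges from the one in Git, the guardrail itself has drifted. The policy name and message are illustrative, and the field values assume a recent Kyverno release.

```yaml
# Illustrative Kyverno guardrail kept in version control
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits      # illustrative name
spec:
  validationFailureAction: Enforce   # reject non-compliant Pods
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
```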

How to Avoid Configuration Drift

A major step in drift detection and remediation is making sure that anyone who is making a change is communicating that change back to the responsible team. Here are a few other strategies that can help with detection/remediation:

  • Having pipelines to make any changes in a Kubernetes cluster is a must. All changes should follow this pipeline, and the only exceptions should be production changes for issues that are impacting users. After production changes are made, they should be synced back.
  • GitOps tools in your pipeline arsenal can regularly sync changes and make sure anything that isn’t written in code is reverted. The knowledge that ad-hoc changes will be reverted encourages teams to commit any manual changes back to version control.
  • Removing access from users who don’t need it and limiting the number of users who can make changes during outages goes a long way toward eliminating configuration drift. 

Useful Tools

Read on for descriptions of GitOps tools that can detect and remediate drift.

Firefly

Firefly keeps track of your resources and compares their actual state against what’s defined in the configuration. It can be easily integrated with communication channels to inform users immediately in the case of a drift. Firefly has a great UI that clearly shows which resources are impacted and can even point to the line of code where the change occurred. Furthermore, it’s capable of creating merge requests to automate the process of fixing the configuration drift.

ArgoCD

ArgoCD is one of the best GitOps tools. It can detect drift in deployments and sync changes automatically. Its out-of-sync status will tell you if there are any differences between the configuration and the actual deployment, and it points out the exact differences.
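
Below is a minimal sketch of an ArgoCD Application that both detects and automatically remediates drift: with selfHeal enabled, changes made directly to the cluster are reverted to what’s in Git, and prune removes objects that are no longer defined there. The repository URL, paths, and namespaces are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service           # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-config.git   # placeholder repository
    targetRevision: main
    path: apps/checkout-service
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes made directly to the cluster
```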

A note of caution: ArgoCD can be tricky to update, especially if deployed in a hub-and-spoke model, and it can be a potential single point of failure.

FluxCD

Flux has fewer features and less extensibility than ArgoCD, but it’s a strong GitOps tool nonetheless. Like ArgoCD, FluxCD shows which resources are out of sync.
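
Here is a hedged sketch of the equivalent setup in Flux: a GitRepository source plus a Kustomization that reconciles the cluster against Git on an interval, pruning anything no longer defined there. The URL and paths are placeholders, and the API versions assume a recent Flux v2 release.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-config                 # illustrative name
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/k8s-config.git   # placeholder repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: checkout-service
  namespace: flux-system
spec:
  interval: 5m                     # how often to reconcile against Git
  sourceRef:
    kind: GitRepository
    name: k8s-config
  path: ./apps/checkout-service
  prune: true                      # remove objects that are no longer in Git
```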

Keep in mind that FluxCD can be limiting in more complex deployment setups. It also has a less mature user interface compared to ArgoCD.

Version Control

GitHub, GitLab, or any other version control system is the backbone of any GitOps pipeline. No matter which system you choose, you can build pipelines on top of it to lint, verify, and deploy configuration to a Kubernetes cluster, either directly or through GitOps tools.
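
As one hedged example, a minimal GitHub Actions workflow along the lines below could apply the manifests on every push to main, keeping the cluster aligned with version control. The secret name, manifest path, and the assumption that kubectl is available on the runner are all illustrative.

```yaml
# .github/workflows/deploy.yaml - minimal sketch, assuming a kubeconfig stored as a repository secret
name: deploy-manifests
on:
  push:
    branches: [main]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" > ~/.kube/config   # assumed secret holding a kubeconfig
      - name: Show drift before applying
        run: kubectl diff -f manifests/ || true    # kubectl diff exits non-zero when differences exist
      - name: Apply manifests
        run: kubectl apply -f manifests/
```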

Conclusion

Configuration drift is an extremely difficult problem to solve, especially at scale. That said, there are many tools to detect and remediate configuration drift, like ArgoCD and FluxCD. Still, if you are running hundreds of clusters, these may not be very reliable, and you might have to deploy multiple instances. That’s why the best solution for configuration drift is two-fold: First, put a process in place for committing changes back to version control and ensure teams always follow it. Second, leverage a solution that doesn’t leave any gaps.

Komodor is a powerful Kubernetes management platform. With Komodor, novices and established cluster operators alike can track all changes to clusters, including manual changes that might otherwise be overlooked. Komodor is also optimized to help with many other aspects of Kubernetes, like visibility, debugging, security, and cost optimization. Simplify your Kubernetes troubleshooting! Sign up for your free trial of Komodor.