Kubernetes Taints and Tolerations: A Practical Guide

What Are Kubernetes Taints and Tolerations? 

Taints are a Kubernetes feature that lets nodes (physical or virtual machines) in a cluster repel a set of pods. In other words, they keep pods off nodes where they do not belong. A node is marked with a taint, and the scheduler then refuses to place pods on that node unless they tolerate the taint.

Tolerations complement taints. A toleration is a pod property that allows (but does not require) the pod to be scheduled onto a node with a matching taint. Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes.

Taints and tolerations are especially useful in multi-tenant environments, where you don’t want one tenant’s pods to consume node resources reserved for another tenant. They can also reserve nodes with special hardware for specific tasks, so that other pods don’t occupy them.

This is part of a series of articles about Kubernetes management.

Use Cases for Taints and Tolerations 

Dedicated Nodes

In a Kubernetes cluster, there might be nodes that are dedicated to specific tasks. For example, you might have a set of nodes that are dedicated to running database pods, and you don’t want other types of pods to use these nodes.

By applying taints to these dedicated nodes, you can ensure that only pods with the corresponding tolerations can be scheduled onto them. This is a powerful way to manage resources in a Kubernetes cluster.
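As a minimal sketch of this pattern, assume the database nodes carry a hypothetical taint dedicated=database:NoSchedule. A database pod could then declare a matching toleration. The taint key and value, the pod name, and the image are illustrative assumptions, not values from a real cluster.

```yaml
# Sketch: a pod that is allowed onto nodes tainted dedicated=database:NoSchedule.
# The taint key/value, pod name, and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: db-pod
spec:
  containers:
  - name: db
    image: redis:7
  tolerations:
  - key: "dedicated"        # must match the taint's key
    operator: "Equal"       # key, value, and effect must all match
    value: "database"
    effect: "NoSchedule"
```

Pods without this toleration are kept off the tainted nodes. The toleration itself only permits scheduling there; it does not steer the pod toward those nodes.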

Nodes with Special Hardware

Another common use case for taints is nodes with special hardware. For example, you might have nodes with GPUs that are needed for machine learning tasks.

In such cases, you can apply taints to these nodes and ensure that only pods with the appropriate tolerations can be scheduled onto them. This prevents other pods from using the special hardware, ensuring that it is reserved for the tasks that need it.
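The sketch below shows what such a pod might look like. It assumes the GPU nodes are tainted with a key of nvidia.com/gpu and the NoSchedule effect, and that a device plugin on those nodes exposes the nvidia.com/gpu resource. Both are assumptions about the cluster, and the pod name and image are placeholders.

```yaml
# Sketch: a GPU workload that tolerates a hypothetical nvidia.com/gpu taint.
# Assumes a device plugin advertises the nvidia.com/gpu resource on the GPU nodes.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
  - name: trainer
    image: registry.example.com/ml-trainer:1.0   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1    # request one GPU from the device plugin
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"       # tolerate the taint regardless of its value
    effect: "NoSchedule"
```

Using the Exists operator here means the pod tolerates the taint whatever value the cluster operator assigned to it.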

Taint-Based Evictions

Taint-based evictions remove running pods from a node under certain conditions. This is particularly useful when you need to clear a node for maintenance, upgrade hardware, or rebalance the workload. When a taint with the NoExecute effect is applied to a node, pods that do not tolerate the taint are evicted immediately. Pods that tolerate it and specify tolerationSeconds stay on the node for that many seconds and are then evicted, while pods that tolerate it without tolerationSeconds stay indefinitely. This lets you give selected pods a window to finish their work before they are moved.
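To illustrate the delayed case, the following sketch assumes an operator applies a hypothetical maintenance=true:NoExecute taint before servicing a node. A toleration with the tolerationSeconds field lets the pod remain for a bounded time after the taint appears. The taint, the pod name, and the 300-second window are illustrative choices.

```yaml
# Sketch: stay on a node for up to 5 minutes after a NoExecute taint is added.
# The maintenance=true taint and the 300-second window are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker
spec:
  containers:
  - name: worker
    image: nginx:latest
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 300   # only meaningful with the NoExecute effect
```

If the taint is removed before the 300 seconds elapse, the pod is not evicted.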


Kubernetes Taints and Tolerations vs. Pod Anti-Affinity vs. Node Affinity 

Taints and tolerations are powerful tools for managing resources in a Kubernetes cluster, but they are not the only ones available. Kubernetes also supports pod anti-affinity and node affinity, which provide additional ways to control how pods are scheduled onto nodes.

Pod anti-affinity allows you to prevent certain pods from being scheduled onto the same node. This can be useful in scenarios where you want to ensure high availability of your applications.

Node affinity allows you to specify that certain pods should be scheduled onto certain nodes. This can be useful when you have nodes with special hardware or when you want to ensure that pods are scheduled onto nodes in a specific geographic region.

Note that taints and tolerations do not guarantee that a pod will land on a particular node. They only determine which pods a tainted node will accept, and a pod with a toleration can still be scheduled onto any untainted node. Node affinity, by contrast, can require that pods run only on specific nodes. To dedicate nodes exclusively to a workload, you typically combine the two.
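A combined setup might look like the sketch below. It assumes the target nodes are both tainted with dedicated=database:NoSchedule and labeled workload=database. These names, along with the pod name and image, are hypothetical.

```yaml
# Sketch: toleration (allowed onto the tainted nodes) plus
# required node affinity (not allowed anywhere else).
# The taint dedicated=database and the label workload=database are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: db-pod-pinned
spec:
  containers:
  - name: db
    image: redis:7
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "workload"
            operator: "In"
            values:
            - "database"
```

The taint keeps other pods off the nodes, and the affinity rule keeps this pod off every other node.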

How to Use Kubernetes Taints and Tolerations 

Taints are applied to nodes and tolerations are applied to pods. When a pod with a toleration for a certain taint is scheduled, it can run on a node with that taint.

To assign a taint to a node, you can use the kubectl taint command. Here’s an example:

kubectl taint nodes node1 key=value:NoSchedule

In this example, node1 is the name of the node we’re tainting, key=value is the taint’s key and value, and NoSchedule is the effect, which means that no new pods will be scheduled onto this node unless they tolerate the taint. Pods already running on the node are not affected.

Note: The key and value are arbitrary strings that you choose, much like labels or tags, for example: environment=dev

To assign a toleration to a pod, add the tolerations section to your pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: kubenode
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"

In this pod spec, the pod kubenode will tolerate the taint key=value:NoSchedule.
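The Equal operator requires the value to match as well. If the value does not matter, a toleration can use the Exists operator instead and leave the value out. The following variant is a sketch that reuses the same placeholder key; the pod name is hypothetical.

```yaml
# Sketch: tolerate any taint with key "key" and effect NoSchedule, whatever its value.
apiVersion: v1
kind: Pod
metadata:
  name: kubenode-exists
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
  - key: "key"
    operator: "Exists"   # no value field is set when using Exists
    effect: "NoSchedule"
```

Leaving out the effect as well would make the toleration match every effect for that key, so it is usually safer to state the effect explicitly.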

Best Practices for Kubernetes Taints and Tolerations 

Now that we understand how to use Kubernetes taints and tolerations, let’s look at some best practices that will help you manage cluster resources more effectively.

Label Nodes Clearly and Descriptively

Labels are key/value pairs (also known as tags by cloud providers) that can be attached to Kubernetes objects, including nodes and pods. They are used to organize objects and to select subsets of them. Labels pair well with taints and tolerations: when nodes are labeled clearly and descriptively, it is easier to see the purpose of each node and why it carries certain taints.

This practice also makes it easier to apply taints and tolerations. For example, if you have a group of nodes that are dedicated to running a specific type of workload, you can label these nodes and then taint all nodes with that label in one command by passing a label selector (the -l flag) to kubectl taint.

Document Taints and Tolerations

Keeping a record of your taints and tolerations is critical for maintaining a healthy and efficient Kubernetes cluster. As your cluster grows, it can become difficult to remember why certain taints and tolerations were applied.

Documentation helps to prevent misunderstandings and mistakes and makes it easier for new team members to understand your cluster configuration. It’s also useful for troubleshooting. If a pod is not being scheduled as expected, your documentation could help you determine if a missing or incorrect toleration is the cause.

Avoid Tainting All Nodes

It can be tempting to apply taints to all nodes in your cluster to control where pods are scheduled. However, this can lead to problems. If every node is tainted, any pod without a matching toleration cannot be scheduled anywhere and remains in the Pending state indefinitely.

Instead, try to use taints sparingly and strategically. Remember that the purpose of taints is not to prevent pods from being scheduled, but to ensure that the right pods are scheduled on the right nodes.
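One way to keep taints from blocking pods outright is the softer PreferNoSchedule effect, which asks the scheduler to avoid a node when possible but does not forbid placement. As a sketch, a node tainted this way would show an entry like the following under spec.taints in its manifest. The node name, key, and value are hypothetical, and the excerpt omits the rest of the Node object.

```yaml
# Excerpt of a Node object carrying a soft taint.
# The scheduler tries to avoid this node for pods that lack a matching
# toleration, but may still use it if no other node fits.
apiVersion: v1
kind: Node
metadata:
  name: node1
spec:
  taints:
  - key: "environment"
    value: "dev"
    effect: "PreferNoSchedule"
```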

Separate Critical Workloads

Taints and tolerations can be used to reserve certain nodes for critical workloads. For example, you might set aside a group of nodes for these workloads. By applying a taint to these nodes, you prevent pods without a matching toleration from being scheduled on them.

Then, you can add a toleration to your critical pods that allows them to be scheduled on these nodes. This helps keep capacity on those nodes available for your critical workloads instead of being consumed by other pods.

Avoid Overlapping Tolerations

While it’s possible for a pod to tolerate multiple taints, it’s best to avoid this where you can. The more taints a pod tolerates, the more node groups it can land on, which makes it harder to predict where pods will be scheduled and can lead to inefficient use of resources.

Instead of using overlapping tolerations, try to design your taints and tolerations so that each pod tolerates only the taints of the nodes it is meant to run on. This will make it easier to manage your cluster and ensure that resources are used effectively.

Solving Kubernetes Node Errors Once and for All with Komodor

Komodor has launched a new feature, Node Termination Enrichment, targeting two main user groups: Ops teams and Developers/Data teams. For Ops teams, this feature enables understanding the impact of actions taken inside and outside Kubernetes (K8s) clusters on the availability of applications and services. This is crucial as their actions often affect the cluster without a clear link between the actions and their consequences. Developers and Data teams, on the other hand, can use this feature to quickly identify infrastructure-related failures in their applications and data engineering jobs, allowing them to determine whether to address issues themselves or escalate them. This feature brings significant value in identifying and addressing application failures due to infrastructure changes or issues.

From a technical standpoint, the Node Termination Enrichment feature enhances Komodor’s capabilities in several ways. It extends coverage beyond K8s clusters into the infrastructure layer, enriches troubleshooting of availability issues by correlating them with node terminations, and expands value in cases like orphaned pod terminations. Key features include identification of cloud providers (AWS, GKE, Azure) with enriched metadata (OS, region, zone), determination of termination reasons (such as spot interruption or autoscaling events), and analysis of the impact on services, jobs, and pods.

