Managing TLS certificates in Kubernetes is no small feat, and the complexity only grows when you're dealing with multiple clusters. Ensuring secure communication, automating certificate renewals, and integrating with external Certificate Authorities (CAs) are just a few of the challenges Kubernetes administrators, DevOps engineers, and security professionals face. Enter cert-manager: a powerful tool that simplifies certificate management in Kubernetes environments, including multi-cluster setups.

What is cert-manager and How Does it Work?

cert-manager is an open-source Kubernetes add-on that automates the lifecycle of TLS certificates in Kubernetes clusters. It provides a consistent and reliable way to manage certificates, from initial issuance through renewal, ensuring secure communication across your Kubernetes workloads.

At its core, cert-manager integrates with external CAs such as Let's Encrypt, Venafi, and HashiCorp Vault, as well as internal PKIs, to automate the management of TLS certificates. With cert-manager, you declaratively define resources such as Issuer, ClusterIssuer, and Certificate to control how and where certificates are requested and used. This eliminates manual processes such as creating and uploading certificates or replacing expired ones, which are error-prone and do not scale in complex Kubernetes environments. By automating certificate management, cert-manager lets services within a Kubernetes cluster maintain secure communication without downtime or human intervention.

cert-manager also integrates with Kubernetes-native APIs such as Ingress and Gateway API, enabling secure access to external-facing applications and internal microservices with minimal configuration.

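As a rough illustration of the Ingress integration, the sketch below annotates an Ingress so that cert-manager requests a certificate for it. It assumes a ClusterIssuer named letsencrypt already exists in the cluster and that an NGINX ingress controller is installed; the host, Service name, and Secret name are hypothetical placeholders.

```yaml
# Hypothetical Ingress: host, Service, and Secret names are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: default
  annotations:
    # Asks cert-manager to create a Certificate using this ClusterIssuer.
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      # cert-manager stores the issued key pair in this Secret.
      secretName: app-example-com-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web   # assumed Service name
                port:
                  number: 80
```

With this in place, no separate Certificate manifest is needed for the host; cert-manager derives one from the tls section of the Ingress.
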
Why Multi-Cluster Certificate Management Matters

In single-cluster environments, certificate management can already be challenging because of the number of services that require secure communication. In multi-cluster Kubernetes environments, the complexity grows sharply, especially when clusters are distributed across regions, managed by different teams, or run diverse workloads.

Multi-cluster Kubernetes architectures are becoming the norm, whether you're scaling workloads across regions or separating environments for compliance, and certificate management is critical for secure, encrypted communication between services. cert-manager bridges this gap, enabling you to streamline certificate issuance, renewal, and rotation across clusters.

Well-run certificate management in a multi-cluster setup provides:

- Secure communication across clusters
- Consistency and standardization
- Automated certificate renewals
- Scaling with dynamic workloads
- Inter-cluster trust
- Compliance and governance

By addressing these challenges, cert-manager acts as the backbone of a scalable and secure multi-cluster architecture, ensuring robust TLS management that evolves with your Kubernetes ecosystem.

Show Me the Code: Setting Up cert-manager for Multi-Cluster Environments

Prerequisites

- Kubernetes 1.24+ installed on all clusters
- cert-manager version 1.13+
- Admin access to the clusters and their kubeconfigs for configuration

Step-by-Step Guide

Install cert-manager in Each Cluster

Use Helm or kubectl to deploy cert-manager in all participating clusters.

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

Configure ClusterIssuer

Define a ClusterIssuer resource in each cluster for unified certificate issuance.
For example:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: your-email@example.com
        privateKeySecretRef:
          name: letsencrypt-private-key
        solvers:
          - http01:
              ingress:
                class: nginx

Establish Federation

Use GitOps tools like Argo CD or Flux to synchronize cert-manager configurations across clusters, ensuring consistency.

Automation and Scalability

Managing certificates manually is impractical, especially in dynamic, multi-cluster environments, so automation is essential. GitOps workflows automate cert-manager configuration by version-controlling resources such as ClusterIssuer and Certificate. Tools like Argo CD can automatically apply updates across clusters, ensuring consistency and reducing manual intervention.

cert-manager renews certificates automatically before they expire, preventing service disruptions. The renewal window is set with the duration and renewBefore fields on a Certificate, or with the equivalent cert-manager.io/duration and cert-manager.io/renew-before annotations on an Ingress, which makes the process predictable and reliable. For dynamic workloads, wildcard certificates play a crucial role by securing services that scale automatically, reducing the complexity of managing individual certificates.

cert-manager Security Considerations

In multi-cluster setups, security cannot be an afterthought; it needs to be designed in from the start. cert-manager integrates with popular and trusted CAs such as Let's Encrypt and Venafi, enabling secure certificate issuance. Implementing mutual TLS (mTLS) across clusters provides mutually authenticated, encrypted communication between them, safeguarding sensitive data. Offboarding often matters as much as onboarding from a security perspective, which is why the ability to revoke a certificate immediately at the issuing CA is just as critical for preventing misuse after a compromise.

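For the mTLS point above, one possible arrangement is an internal CA issuer plus a workload certificate that is valid for both server and client authentication. This is a sketch under stated assumptions rather than a recommended design: the issuer name internal-ca, the Secret names, the namespace, and the service DNS name are all hypothetical, and the CA key pair is assumed to already exist as a Secret in the cert-manager namespace.

```yaml
# Sketch only: every name below is a hypothetical placeholder.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    # Secret holding the CA's tls.crt and tls.key; for a ClusterIssuer it is
    # read from the cert-manager namespace by default.
    secretName: internal-ca-key-pair
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: orders-mtls
  namespace: production
spec:
  secretName: orders-mtls-tls
  duration: 2160h    # 90 days
  renewBefore: 360h  # start renewal 15 days before expiry
  dnsNames:
    - orders.production.svc.cluster.local
  usages:
    - digital signature
    - key encipherment
    - server auth
    - client auth
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
```

If each cluster issues from the same CA (or from intermediates under one root), workloads in different clusters can validate each other's certificates against a shared trust anchor. The duration and renewBefore values are illustrative and should follow your own rotation policy.
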
Furthermore, compliance can be enforced using tools like Open Policy Agent (OPA) or Kyverno. These tools let organizations define and apply policies that govern certificate configurations, so that they align with internal security requirements and external regulatory frameworks such as GDPR, HIPAA, or PCI DSS. For example, policies can mandate the use of specific CAs, enforce minimum key lengths, and require that certificates are rotated within defined timeframes to reduce exposure to vulnerabilities.

OPA and Kyverno both offer declarative, Kubernetes-native approaches to compliance enforcement. With OPA, administrators write Rego policies that inspect and validate certificate-related resources like ClusterIssuer and Certificate during admission control, which prevents misconfigured or non-compliant resources from being deployed in the cluster. Kyverno expresses policies as Kubernetes custom resources, so administrators can define certificate requirements in YAML without learning a separate policy language.

The Chaos of Mismanaged Certificates - A Nightmare Unfolds

Imagine this: a global e-commerce platform prepares for its annual holiday sale, expecting millions of users to flood its site. The infrastructure spans multiple Kubernetes clusters across regions to handle the scale and ensure availability. But just as the sale goes live, customers begin encountering "unsecure connection" warnings in their browsers. Within minutes, traffic plummets, customers abandon their carts, and social media erupts with complaints. The root cause? Expired TLS certificates on critical services.

This nightmare unfolds when certificate management isn't automated, monitored, or consistently configured. In this case, a single certificate renewal failure cascades into a multi-cluster catastrophe. The culprit?
The renewal process stalled due to an unnoticed misconfiguration in the ClusterIssuer, compounded by a DNS propagation delay that prevented cert-manager's solver from verifying domain ownership in time. Worse, there were no proactive alerts in place, leaving the operations team scrambling to identify the issue long after customers had already encountered problems.

The fallout from this kind of incident doesn't stop at customer dissatisfaction. The lack of secure connections could expose sensitive data in transit, leading to compliance violations and regulatory fines. Internal teams are left firefighting, manually reconfiguring certificates across clusters while contending with inconsistent configurations that create additional risks. The incident highlights a hard truth: in multi-cluster Kubernetes environments, manual and mismanaged certificate handling is a recipe for disaster, even though it is straightforward to automate with cert-manager.

Troubleshooting and Monitoring

Scenarios like this are why visibility is vital for effective certificate management, particularly in multi-cluster environments. cert-manager exposes Prometheus metrics that track certificate health and lifecycle events. By scraping these metrics and visualizing them in Grafana dashboards, administrators gain a centralized view of certificate status across clusters. Alerting mechanisms, such as Prometheus Alertmanager, can be configured to notify teams of pending or failed certificate renewals, allowing for proactive issue resolution. Common problems, such as DNS challenge failures or solver misconfigurations, can be diagnosed from cert-manager's controller logs and the events on its resources, so that potential disruptions are minimized.

From Theory to Practice: A Real-World Use Case

A global SaaS provider leveraged cert-manager to manage certificates for applications deployed across five regions.
By integrating cert-manager with Argo CD, they automated renewals and ensured zero downtime during certificate rotations. Lessons included the importance of wildcard certificates and proactive monitoring to avoid edge case failures.

For the SaaS provider's dynamically scaled applications, wildcard certificates simplified the process of securing multiple subdomains without the overhead of managing individual certificates. This approach not only reduced configuration complexity but also enhanced scalability as new services were deployed across regions.

Prometheus metrics were integrated with their centralized monitoring dashboards in Grafana. Alerts were set up to notify administrators of certificate expirations, renewal failures, or configuration issues. This allowed the team to address potential edge case failures, such as DNS propagation delays or solver misconfigurations, before they could impact production services.

Configuration Examples:

ClusterIssuer for Let's Encrypt Production (DNS-01):

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: admin@example.com
        privateKeySecretRef:
          name: letsencrypt-prod
        solvers:
          - dns01:
              cloudDNS:
                project: my-gcp-project
                serviceAccountSecretRef:
                  name: clouddns-service-account
                  key: key.json

Certificate Resource for Wildcard Domain:

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: wildcard-cert
      namespace: production
    spec:
      secretName: wildcard-cert-tls
      dnsNames:
        - '*.example.com'
      issuerRef:
        name: letsencrypt-prod
        kind: ClusterIssuer

Challenges Faced:

DNS Propagation Delays: During implementation, the team encountered DNS propagation delays that affected DNS-01 challenge verification. To cope with these delays, cert-manager lets you tune how it checks for DNS propagation before it asks the CA to validate the challenge.
Note that the dns01 solver in a ClusterIssuer has no timeout field of its own. The propagation check is tuned through flags on the cert-manager controller instead, such as --dns01-check-retry-period (the interval between propagation checks) and --dns01-recursive-nameservers (the resolvers used for those checks). The ClusterIssuer itself stays as follows:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: admin@example.com
        privateKeySecretRef:
          name: letsencrypt-prod
        solvers:
          - dns01:
              cloudDNS:
                project: my-gcp-project
                serviceAccountSecretRef:
                  name: clouddns-service-account
                  key: key.json
              # There is no per-solver timeout field; propagation checks are
              # tuned with controller flags such as --dns01-check-retry-period.

Solver Misconfigurations: Initial misconfigurations in the ACME DNS-01 solver led to failed certificate requests. Enabling detailed logging and reviewing the events cert-manager records on its resources can pinpoint such issues. Below is an example of a corrected solver configuration for the ACME DNS-01 challenge:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: admin@example.com
        privateKeySecretRef:
          name: letsencrypt-prod
        solvers:
          - dns01:
              cloudDNS:
                project: correct-gcp-project-name   # Corrected project name
                serviceAccountSecretRef:
                  name: correct-service-account     # Corrected service account
                  key: correct-key.json             # Corrected key file

You can debug issues with the following commands:

    # Check the logs of the cert-manager controller for errors
    kubectl logs -n cert-manager deploy/cert-manager -f

    # Describe the Certificate resource (in its own namespace) to inspect its events
    kubectl describe certificate wildcard-cert -n production

Rate Limits: The team was mindful of Let's Encrypt's rate limits, especially when managing a large number of certificates across multiple clusters. To avoid hitting those limits, it's essential to test against Let's Encrypt's staging environment before deploying to production.
Here's an example ClusterIssuer configuration for the Let's Encrypt staging server:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-staging
    spec:
      acme:
        server: https://acme-staging-v02.api.letsencrypt.org/directory
        email: admin@example.com
        privateKeySecretRef:
          name: letsencrypt-staging
        solvers:
          - dns01:
              cloudDNS:
                project: staging-gcp-project
                serviceAccountSecretRef:
                  name: staging-service-account
                  key: staging-key.json

By using this configuration in the staging environment, you can safely test certificate issuance and renewal without affecting production or hitting rate limits. Once testing is complete, you can switch to the production issuer.

Lessons Learned:

- Proactive Monitoring: Implementing robust monitoring and alerting mechanisms was crucial in identifying and resolving issues promptly, preventing potential downtime.
- Configuration Management: Maintaining configurations as code within a GitOps workflow ensured consistency across clusters and facilitated easier troubleshooting and audits.
- Understanding Dependencies: Recognizing the dependencies between cert-manager, DNS providers, and ACME challenges was essential in configuring appropriate timeouts and retries to handle propagation delays effectively.

This GitOps-driven approach allowed the team to define and version-control certificate configurations across clusters, ensuring that updates were synchronized and consistent. This eliminated the need for manual interventions, significantly reducing the risk of human error and operational downtime.

Good Add-On Hygiene Simplifies the Kubernetes Complexity

We've historically spoken about the myriad add-ons in the Kubernetes ecosystem that provide great power but need to be properly configured, managed, and maintained to reduce long-term complexity, and cert-manager is no exception. cert-manager transforms the complexity of multi-cluster certificate management into a streamlined, scalable process when done right.
By automating renewals, integrating with GitOps workflows, and ensuring security across clusters, it empowers teams to maintain robust and reliable Kubernetes environments and simplifies management and maintenance in Day 2 operations. As the ecosystem evolves, tools like cert-manager will remain pivotal in navigating the challenges of secure, multi-cluster operation, alongside Komodor, which helps identify and pinpoint issues within your Kubernetes clusters to ensure the long-term health of your Kubernetes operations.

For more insights into Kubernetes management, continue checking out Komodor's blog, which covers many of the tools in the Kubernetes ecosystem.