Help with CertManager on K8s

Carlos Juan Gómez Peñalver
4 min read · Feb 5, 2021

Image: MV Rena, https://en.wikipedia.org/wiki/MV_Rena

A comprehensive guide to understanding how CertManager works, along with the patterns I use to perform upgrades and debug problems

Like any other Kubernetes operator, CertManager uses a control loop to reconcile the current state of its custom resources with the desired state.
CertManager components:
- Controller: Ensures the current state matches the desired state (eventual consistency)
- CA injector: Injects CA certificates into the resources that need them, such as webhook configurations
- Webhook: Acts as a validating and mutating admission controller, and converts resources from old CRD versions to the latest version (auto migration)

Note: All the examples, patterns and flows described here assume AWS EKS and Kube2Iam, alongside ExternalDNS, with the following configuration (Prometheus and resources settings are not included, to reduce the amount of code 😅)

Install the cert-manager Helm chart

create a values file

cat > cert-managers-props.yaml <<PROPS
global:
  leaderElection:
    namespace: cert-manager
  priorityClassName: system-cluster-critical
  rbac:
    create: true
ingressShim:
  defaultIssuerGroup: cert-manager.io
  defaultIssuerKind: ClusterIssuer
  defaultIssuerName: letsencrypt-prod
installCRDs: true
podAnnotations:
  iam.amazonaws.com/role: arn:aws:iam::<MyAccountID>:role/<MyCertManagerRole>
replicaCount: 2
webhook:
  replicaCount: 2
cainjector:
  replicaCount: 2
PROPS

install the chart

helm upgrade --install \
cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.1.0 \
-f cert-managers-props.yaml

When you install CertManager for the first time, you have to wait a couple of minutes (sometimes more, sometimes less) before you can create the ClusterIssuer that we defined as the IngressShim default

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: <MyTeam@MyOrg.com>
    privateKeySecretRef:
      name: letsencrypt-prod
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        route53:
          region: eu-west-1
      selector:
        dnsZones: [<My dns zones>]
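
Rather than guessing how long to wait before applying the ClusterIssuer, a minimal sketch like the one below could block until the cert-manager Deployments have rolled out. It assumes the Helm release is named cert-manager in the cert-manager namespace, which is what produces the three Deployment names used here; the function name and the clusterissuer.yaml file in the usage comment are hypothetical.

```shell
# Wait until the controller, webhook and cainjector Deployments are rolled out.
# Assumption: release name "cert-manager", so the Deployments are named
# cert-manager, cert-manager-webhook and cert-manager-cainjector.
wait_for_cert_manager() {
  local ns="${1:-cert-manager}" deploy
  for deploy in cert-manager cert-manager-webhook cert-manager-cainjector; do
    # rollout status exits non-zero if the timeout is reached first
    kubectl --namespace "$ns" rollout status "deployment/$deploy" --timeout=300s || return 1
  done
}

# Usage (clusterissuer.yaml is a hypothetical file holding the manifest above):
#   wait_for_cert_manager && kubectl apply -f clusterissuer.yaml
```

A finished rollout does not strictly guarantee that the webhook already answers requests, so the apply may still need a retry in rare cases.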

Install ExternalDNS Helm chart

create the values file

cat > external-dns-props.yaml <<PROPS
domainFilters: [<My dns zones>]
provider: aws
podAnnotations:
  iam.amazonaws.com/role: arn:aws:iam::<MyAccountID>:role/<MyExternalDnsRole>
priorityClassName: system-cluster-critical
aws:
  region: "eu-west-1"
  # Stop using the custom AWS A-Records, which are not RFC compliant
  preferCNAME: true
policy: sync # watch out... this will delete records
# Required when creating CNAMEs
txtPrefix: k8s-
txtOwnerId: <ThisExternalDnsDeploymentId>
PROPS

install the chart

helm upgrade --install \
external-dns bitnami/external-dns \
--namespace external-dns \
--version v4.6.0 \
-f external-dns-props.yaml

So… what happens when we deploy a new Ingress with these annotations?

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # add an annotation indicating the issuer to use. Created above
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Or using the default ingress shim integration
    kubernetes.io/tls-acme: "true"
    # what Ingress controller will manage this ingress resource
    # This annotation will become deprecated in the near future 0_0
    # https://kubernetes.io/docs/concepts/services-networking/ingress/#deprecated-annotation
    kubernetes.io/ingress.class: nginx
  # names must be lowercase
  name: myingress
  namespace: myingress
spec:
  rules:
  - host: example.com
    http:
      paths:
      - backend:
          # networking.k8s.io/v1 replaces serviceName/servicePort with service.name/service.port
          service:
            name: myservice
            port:
              number: 80
        path: /
        pathType: Prefix # required in networking.k8s.io/v1
  tls: # < placing a host in the TLS config will indicate a certificate should be created
  - hosts:
    - example.com
    secretName: myingress-cert # < cert-manager will store the created certificate in this secret.

cert-manager then creates a chain of resources:

Certificate → CertificateRequest → Order → Challenge

Because letsencrypt-prod is configured with a DNS01 solver, cert-manager creates a TXT record to prove that you own the domain. Once the Challenge has succeeded, the Certificate becomes valid and the TLS certificate for your domain example.com is stored in the secret named in the Ingress.

In parallel, external-dns creates a CNAME record pointing to your Ingress controller's load balancer, so traffic reaches the controller's pods, which read the Certificate secret and serve the TLS certificate for requests to that domain.
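
To watch that chain progress, a small helper along these lines can list every link of it in one namespace. This is only a sketch: the function name is hypothetical, and the fully qualified resource names are used so that kubectl does not confuse them with similarly named resources from other operators.

```shell
# List the Certificate -> CertificateRequest -> Order -> Challenge chain
# for one namespace. Orders and Challenges live in the acme.cert-manager.io group.
show_issuance_chain() {
  local ns="$1"
  kubectl get \
    certificates.cert-manager.io,certificaterequests.cert-manager.io,orders.acme.cert-manager.io,challenges.acme.cert-manager.io \
    --namespace "$ns" --output wide
}

# Usage, with the namespace of your Ingress:
#   show_issuance_chain <namespace>
```

Challenges are removed once they complete, so an empty Challenge list next to a ready Certificate is expected.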

WOOHOOO 🚀

Debugging

NOTE: Don’t panic if you have multiple Ingresses with the same hostname, in the same or different namespaces, even if they are managed by different Ingress controllers. It’s OK 😃 cert-manager will copy the same TLS certificate into all the secrets

Basic checklist:
- Do you have a Certificate resource? Is it valid?
- Check the cert-manager controller pod logs
- external-dns might cause the AWS API to rate-limit access to Route53, which affects cert-manager too. This is a very extreme case, but it could happen depending on the setup. Use metrics and alerts to tweak both deployments 📈
- Check GitHub issues
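
The first two items of the checklist could be scripted roughly as follows. It is a sketch with hypothetical function names, and it assumes the controller pods carry the label app=cert-manager (which I believe the Helm chart sets; adjust the selector if your pods are labelled differently).

```shell
# Print the status of the Certificate's Ready condition
# ("True", "False" or "Unknown"; nothing if the condition is not set yet).
certificate_ready() {
  local ns="$1" name="$2"
  kubectl get certificate "$name" --namespace "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Show recent log lines from all controller replicas.
# Assumption: controller pods are labelled app=cert-manager.
controller_logs() {
  kubectl logs --namespace cert-manager --selector app=cert-manager --tail="${1:-200}"
}

# Usage:
#   certificate_ready <namespace> <certificate-name>
#   controller_logs 500
```

With two replicas only the leader does the work, which is why the sketch selects by label instead of picking a single pod.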

Certificate is not created
Checklist:
- Is cert-manager running? Are the pods healthy? 😅
- Is the Issuer or ClusterIssuer annotation missing from the Ingress?
- Does the ClusterIssuer exist? Or, for an Issuer, does it exist in the same namespace as the Ingress?
- Does your ingress have a tls block?
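
The checklist above maps onto a handful of kubectl calls. The helper below is a hypothetical sketch; it defaults to the letsencrypt-prod ClusterIssuer used in this post.

```shell
# Run the "certificate is not created" checks for one Ingress.
check_issuance_prereqs() {
  local ns="$1" ingress="$2" issuer="${3:-letsencrypt-prod}"
  # 1. Is cert-manager running and healthy?
  kubectl get pods --namespace cert-manager
  # 2. Does the ClusterIssuer exist and is it ready?
  kubectl get clusterissuer "$issuer"
  # 3. Does the Ingress carry the annotations and a tls block?
  kubectl get ingress "$ingress" --namespace "$ns" \
    -o jsonpath='{.metadata.annotations}{"\n"}{.spec.tls}{"\n"}'
}

# Usage:
#   check_issuance_prereqs <namespace> <ingress-name>
```

An empty second line in the last output means the Ingress has no tls block, so no Certificate will be requested.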

Certificate never becomes valid
Checklist:
- There is a Challenge that does not complete
  - Check IAM access to Route53, with CloudTrail if possible
- There is an Order that does not complete
  - You might have hit a rate limit on Let's Encrypt; check the logs
- Sometimes cert-manager gets into a deadlock where the certificate sits on an internal queue waiting to be processed. To be fair I’ve only seen this a handful of times and never on the 1.0+ versions… but it’s worth mentioning
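
When an Order or a Challenge is stuck, its status usually says why. As a rough sketch (the function name is hypothetical), the state and reason fields can be printed like this:

```shell
# Print name, state and reason for every Order and Challenge in a namespace.
acme_states() {
  local ns="$1" kind
  for kind in orders.acme.cert-manager.io challenges.acme.cert-manager.io; do
    kubectl get "$kind" --namespace "$ns" \
      --output custom-columns=NAME:.metadata.name,STATE:.status.state,REASON:.status.reason
  done
}

# Usage:
#   acme_states <namespace>
```

A Route53 permission problem or a Let's Encrypt rate limit would normally be expected to surface in the REASON column, with more detail in the controller logs.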

Not all the Certificates are created through Ingress annotations
Check your Ingress controller configuration. For example, Kong uses a secret that can be generated as explained here or by using a Certificate resource

Upgrades and migrations

Ensure your cert-manager deployment is not running with the --enable-certificate-owner-ref flag (it is disabled by default). With that flag enabled, each Certificate generates a secret with an ownerReference pointing back to the Certificate, which means that if the Certificate is deleted the secret will be deleted too. If you lose the secrets you will lose your domain certificates!
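
One way to verify this before an upgrade is to look at the controller's arguments. The sketch below assumes the controller Deployment is named cert-manager in the cert-manager namespace and that the controller is the first container in the pod; the function name is hypothetical.

```shell
# Succeeds (exit 0) if the controller runs with --enable-certificate-owner-ref
# or --enable-certificate-owner-ref=true, fails otherwise.
owner_ref_enabled() {
  kubectl get deployment cert-manager --namespace cert-manager \
    -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' |
    grep -Eqx -- '--enable-certificate-owner-ref(=true)?'
}

# Usage:
#   if owner_ref_enabled; then echo "owner refs are ON: deleting Certificates deletes their secrets"; fi
```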

1. Back up your Certificate, Issuer and ClusterIssuer resources
One option could be:

kubectl get certificates --all-namespaces -o yaml > cert.backup.yaml
kubectl get issuer --all-namespaces -o yaml > issuer.backup.yaml
kubectl get clusterissuer -o yaml > clusterissuer.backup.yaml

2. Delete cert-manager deployment completely

3. Deploy the new cert-manager version

4. Restore the backup resources

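kubectl apply -f clusterissuer.backup.yaml -f issuer.backup.yaml -f cert.backup.yaml

If the API server rejects the restored objects because they still carry their old resourceVersion or uid, remove those fields from the backup files and apply again.

cert-manager will take care of converting the restored resources to the newest API versions thanks to its webhook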

These are the most common cases I face. I hope this can help others

Carlos Juan Gómez Peñalver

SRE engineer who loves Kubernetes, automation, open source and SRE practices. Giving back to OSS as much as possible :D (Opinions expressed are solely my own)