Help with CertManager on K8s
A comprehensive guide to how it works, and the patterns I use to perform upgrades and debug problems
CertManager works like any other Kubernetes Operator: it uses a control loop to reconcile its custom resources with the desired state.
CertManager components:
- Controller: Ensures the current state is the desired state (eventual consistency)
- CA injector: Helps to configure CA certificates
- Webhook: Works as a validating and mutating admission controller, and converts old CRD versions into the latest versions (auto migration)
Note: All the examples, patterns and flows described here assume that we are using AWS EKS and Kube2Iam, alongside ExternalDNS, with the following configuration (Prometheus and resources are not included, to reduce the amount of code 😅)
Install the cert-manager Helm chart
Create a values file:
cat > cert-managers-props.yaml <<PROPS
global:
  leaderElection:
    namespace: cert-manager
  priorityClassName: system-cluster-critical
  rbac:
    create: true
ingressShim:
  defaultIssuerGroup: cert-manager.io
  defaultIssuerKind: ClusterIssuer
  defaultIssuerName: letsencrypt-prod
installCRDs: true
podAnnotations:
  iam.amazonaws.com/role: arn:aws:iam::<MyAccountID>:role/<MyCertManagerRole>
replicaCount: 2
webhook:
  replicaCount: 2
cainjector:
  replicaCount: 2
PROPS
Install the chart:
helm upgrade --install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.1.0 \
  -f cert-managers-props.yaml
When you install CertManager for the first time you have to wait a couple of minutes (sometimes more, sometimes less) before you can create the ClusterIssuer we have defined as the IngressShim default:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: <MyTeam@MyOrg.com>
    privateKeySecretRef:
      name: letsencrypt-prod
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        route53:
          region: eu-west-1
      selector:
        dnsZones: [<My dns zones>]
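Instead of guessing how long to wait before applying the ClusterIssuer above, you can ask Kubernetes whether the deployments have rolled out. This is a minimal sketch: the function name wait_for_cert_manager is made up for illustration, and the deployment names assume the Helm release is called cert-manager as in the install command above.

```shell
# Hypothetical helper: block until the three cert-manager deployments are rolled out.
# Assumption: deployment names follow the "cert-manager" Helm release name.
wait_for_cert_manager() {
  namespace="${1:-cert-manager}"
  for deployment in cert-manager cert-manager-cainjector cert-manager-webhook; do
    # rollout status exits non-zero if the deployment is not ready within the timeout
    kubectl rollout status "deployment/${deployment}" \
      --namespace "${namespace}" --timeout=300s || return 1
  done
}
```

Calling wait_for_cert_manager before applying the ClusterIssuer avoids the webhook rejecting the resource while it is still starting.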
Install the ExternalDNS Helm chart
Create the values file:
cat > external-dns-props.yaml <<PROPS
domainFilters: [<My dns zones>]
provider: aws
podAnnotations:
  iam.amazonaws.com/role: arn:aws:iam::<MyAccountID>:role/<MyExternalDnsRole>
priorityClassName: system-cluster-critical
aws:
  region: "eu-west-1"
  # Stop using the custom AWS A records, which are not RFC compliant
  preferCNAME: true
policy: sync # watch out... this will delete records
# Required when creating CNAMEs
txtPrefix: k8s-
txtOwnerId: <ThisExternalDnsDeploymentId>
PROPS
Install the chart:
helm upgrade --install \
  external-dns bitnami/external-dns \
  --namespace external-dns \
  --version v4.6.0 \
  -f external-dns-props.yaml
So… what happens when we deploy a new Ingress with these annotations?
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # Add an annotation indicating the issuer to use (created above)
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Or use the default ingress shim integration
    kubernetes.io/tls-acme: "true"
    # Which Ingress controller will manage this Ingress resource.
    # This annotation is deprecated in favour of spec.ingressClassName 0_0
    # https://kubernetes.io/docs/concepts/services-networking/ingress/#deprecated-annotation
    kubernetes.io/ingress.class: nginx
  # Kubernetes resource names must be lowercase
  name: myingress
  namespace: myingress
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        # pathType and the nested service backend are required by networking.k8s.io/v1
        pathType: Prefix
        backend:
          service:
            name: myservice
            port:
              number: 80
  tls: # < placing a host in the TLS config indicates that a certificate should be created
  - hosts:
    - example.com
    secretName: myingress-cert # < cert-manager will store the created certificate in this secret
cert-manager creates a chain of resources: Certificate → CertificateRequest → Order → Challenge
Once the Challenge has succeeded, the Certificate will become valid and the TLS certificate for your domain example.com will be populated into the secret. To get there, cert-manager creates a TXT record to validate that you own the domain; this is due to the DNS01 challenge configured on letsencrypt-prod.
In parallel, external-dns will create a CNAME record pointing to your Ingress controller's load balancer, so the traffic reaches that controller's PODs, which read the Certificate secret and serve the TLS certificate for requests to that domain.
WOOHOOO 🚀
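To watch the Certificate → CertificateRequest → Order → Challenge chain while it progresses, all four resource kinds can be listed in one call. A minimal sketch: the function name show_cert_chain is made up for illustration, and the namespace in the usage note is the one from the Ingress example above.

```shell
# Hypothetical helper: list every resource in the issuance chain for one namespace.
show_cert_chain() {
  namespace="$1"
  kubectl get certificates,certificaterequests,orders,challenges \
    --namespace "${namespace}"
}
```

For the Ingress above that would be show_cert_chain myingress.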
Debugging
NOTE: Don’t panic if you have multiple ingresses with the same hostname, in the same or different namespaces, even if they are managed by different Ingress controllers. It’s ok 😃 cert-manager will copy the same TLS certificate into all the secrets
Basic checklist:
- Do you have a Certificate resource? Is it valid?
- Check the cert-manager controller PODs logs
- external-dns might cause the AWS API to rate-limit access to Route53, which also affects cert-manager. This is a very extreme case, but it could happen depending on the setup. Use metrics and alerts to tweak both deployments 📈
- Check GitHub issues
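The first two items of the checklist could be scripted roughly as follows. This is a sketch under assumptions: the function names are made up, the controller deployment is assumed to be called cert-manager in the cert-manager namespace (as installed above), and the Certificate name has to be supplied by you.

```shell
# Hypothetical helper: print the status of the Ready condition of a Certificate
# ("True" when the certificate is valid).
cert_ready() {
  name="$1"
  namespace="$2"
  kubectl get certificate "${name}" --namespace "${namespace}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Hypothetical helper: tail the controller logs (last 100 lines by default).
controller_logs() {
  kubectl logs deployment/cert-manager --namespace cert-manager --tail="${1:-100}"
}
```

For example, cert_ready myingress-cert myingress checks a Certificate named myingress-cert, and controller_logs 500 shows more log history.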
Certificate is not created
Checklist:
- Is cert-manager running? Are the PODs healthy? 😅
- Are the Issuer or ClusterIssuer annotations missing in the Ingress?
- Does the ClusterIssuer exist? Or the Issuer in the same namespace?
- Does your Ingress have a tls block?
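The checklist above maps to a handful of kubectl calls. A minimal sketch, where the function name check_ingress_prereqs is made up and the default issuer name is the letsencrypt-prod ClusterIssuer from this article:

```shell
# Hypothetical helper: run the "Certificate is not created" checks in order.
check_ingress_prereqs() {
  ingress="$1"
  namespace="$2"
  issuer="${3:-letsencrypt-prod}"
  # 1. Is cert-manager running and healthy?
  kubectl get pods --namespace cert-manager
  # 2. Does the ClusterIssuer exist?
  kubectl get clusterissuer "${issuer}"
  # 3. Does the Ingress carry the annotations and a tls block?
  kubectl get ingress "${ingress}" --namespace "${namespace}" \
    -o jsonpath='{.metadata.annotations}{"\n"}{.spec.tls}{"\n"}'
}
```

An empty second line in the last output means the Ingress has no tls block, so no Certificate will be requested for it.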
Certificate never becomes valid
Checklist:
- There is a Challenge that does not complete: check IAM access to Route53, if possible with CloudTrail
- There is an Order that does not complete: you might have hit a rate limit on LetsEncrypt, check the logs
- Sometimes cert-manager gets into a deadlock where the certificate sits on an internal queue waiting to be processed. To be fair, I’ve only seen this a handful of times and never on the 1.0+ versions… but it’s worth mentioning
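To find the stuck Challenge or Order, something along these lines may help. It is a sketch, not a definitive recipe: the function names are made up, and the column paths assume the Challenge status exposes state and reason fields, as the acme.cert-manager.io/v1 API does to my knowledge.

```shell
# Hypothetical helper: list all Challenges with their state and failure reason.
list_challenges() {
  kubectl get challenges --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATE:.status.state,REASON:.status.reason'
}

# Hypothetical helper: show the events and status of the Orders in a namespace.
describe_orders() {
  kubectl describe orders --namespace "$1"
}
```

A REASON mentioning access denied points at the IAM role, while rate-limit errors from LetsEncrypt usually show up in the Order description.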
Not all the Certificates are created through Ingress annotations
Check your Ingress controller configuration. E.g. Kong uses a secret that could be generated as explained here or using a Certificate resource
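When there is no Ingress annotation to trigger it, the Certificate can be declared directly. A minimal sketch in the same heredoc style as the values files above; the file name, resource name, namespace and secret name are placeholders I made up, and only the issuer matches the ClusterIssuer defined earlier.

```shell
# Hypothetical example: every name except the issuer is a placeholder.
cat > my-certificate.yaml <<'CERT'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  namespace: myingress
spec:
  # Secret that cert-manager will create and keep renewed
  secretName: example-com-tls
  dnsNames:
    - example.com
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: letsencrypt-prod
CERT
```

Apply it with kubectl apply -f my-certificate.yaml and point the controller configuration at the resulting secret.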
Upgrades and migrations
Ensure your cert-manager deployment is not running with --enable-certificate-owner-ref (this flag is disabled by default). When it is enabled, a Certificate generates a secret with an ownerReference to the original Certificate, which means that if the Certificate is deleted the secret will be deleted too. If you lose the secrets you will lose your domain certificates!
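One way to check the flag before touching anything is to read the arguments of the running controller. A minimal sketch, assuming the controller is the first container of a deployment called cert-manager in the cert-manager namespace; the function name owner_ref_enabled is made up.

```shell
# Hypothetical helper: succeed (exit 0) only if the owner-ref flag is switched on.
# Assumption: the controller is the first container of deployment/cert-manager.
owner_ref_enabled() {
  args="$(kubectl get deployment cert-manager --namespace cert-manager \
    -o jsonpath='{.spec.template.spec.containers[0].args}')"
  case "${args}" in
    *--enable-certificate-owner-ref=false*) return 1 ;;
    *--enable-certificate-owner-ref*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example: if owner_ref_enabled; then echo "secrets will be deleted with their Certificates"; fi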
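1. Back up your Certificate resources
An option could be the following (note the -o yaml, without it kubectl only writes a summary table that cannot be restored):
kubectl get certificates --all-namespaces -o yaml > cert.backup.yaml
kubectl get issuer --all-namespaces -o yaml > issuer.backup.yaml
kubectl get clusterissuer -o yaml > clusterissuer.backup.yaml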
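2. Delete the cert-manager deployment completely
3. Deploy the new cert-manager version
4. Restore the backup resources, issuers first (each file needs its own -f, a shell glob after a single -f does not work):
kubectl apply -f clusterissuer.backup.yaml -f issuer.backup.yaml -f cert.backup.yaml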
cert-manager will take care of upgrading the CRDs to the newest versions thanks to the MutatingWebhookConfiguration
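Since losing the secrets means losing the certificates, it may also be worth saving the TLS secrets before step 2. This is an optional extra and not part of the original steps: the function name and the default file name are made up for illustration.

```shell
# Hypothetical helper: dump every TLS secret in the cluster to a YAML file.
backup_tls_secrets() {
  output="${1:-tls-secrets.backup.yaml}"
  kubectl get secrets --all-namespaces \
    --field-selector type=kubernetes.io/tls -o yaml > "${output}"
}
```

Keep the resulting file somewhere safe, it contains private keys.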
These are the most common cases I face. I hope this helps others