Runtime Kubernetes policies in production with Kyverno

DV Engineering
DoubleVerify Engineering
Nov 8, 2023

Written By: Arthur Hemon

In recent years, the number of Kubernetes clusters running within DoubleVerify grew from a handful to several hundred. Keeping every Kubernetes workload aligned and compliant quickly became extremely demanding.

Our DevOps team faced two main challenges. First, deployment bugs caused the deletion of important workloads. Second, enforcing TLS versions on HTTPS load balancers across the entire company was difficult.

As the DevOps team, we want to improve the developer experience on Kubernetes: developers should focus only on what matters to them, with everything else abstracted away.

We currently use GitLab-CI for our job automations, allowing any developer to write their own automation pipelines and customize them to their needs. But since pipelines can be defined anywhere, it is complex to enforce policies. In addition, the TLS policy is an infrastructure security feature we prefer abstracting from application developers’ concerns. Hence, we decided to use runtime Kubernetes policies that cannot be bypassed and are automatically applied.

We first evaluated Open Policy Agent (Gatekeeper), but its DSL (Domain Specific Language) was hard to use and would have required extensive training for the people maintaining the policies. We ultimately chose Kyverno: its policy engine is purpose-built for Kubernetes, and its policies are much easier to write.
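To give a sense of how concise Kyverno policies are, here is a minimal illustrative sketch (not one of our production policies; the policy name, label key, and message are made up) that requires an "owner" label on Deployments. The equivalent Gatekeeper setup needs a ConstraintTemplate written in Rego plus a separate Constraint object:

```yaml
---
# Illustrative sketch only - requires every Deployment to carry an "owner" label.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: enforce
  rules:
    - name: check-owner-label
      match:
        resources:
          kinds:
            - Deployment
      validate:
        message: "Deployments must carry an 'owner' label."
        pattern:
          metadata:
            labels:
              owner: "?*"  # any non-empty value
```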

Prevent deletion of critical workloads via deployment bugs

In the past, complex pipelines or automations deleted critical or system resources by mistake, especially in our dev environment, where there are fewer restrictions.

We deploy the following policy to block any DELETE request on a Namespace labeled “critical.” Our first policy versions had several issues, however:

  • Specific service accounts still needed to be able to delete these resources.
  • Each individual resource had to be labeled, which was very repetitive.

We added an “exclude” block to ignore specific service accounts, and a rule matching on “namespaceSelector,” so that workloads are protected based on their Namespace’s label rather than each being labeled individually.
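With the namespaceSelector approach, only the Namespace itself needs the label. For example (the namespace name is illustrative; the label key is the one used in the policy):

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod  # illustrative name
  labels:
    app.kubernetes.io/priority: "critical"
```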

This is what our final policy looks like:

---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: deny-delete-protected
spec:
  validationFailureAction: enforce
  background: false
  rules:
    - name: deny-delete-protected-namespace
      match:
        resources:
          kinds:
            - Namespace
          selector:
            matchLabels:
              app.kubernetes.io/priority: "critical"
      exclude:
        subjects:
          - kind: ServiceAccount
            name: my-account
            namespace: my-namespace
      validate:
        message: "Deleting {{request.oldObject.kind}}/{{request.oldObject.metadata.name}} is not allowed as it is a critical namespace"
        deny:
          conditions:
            - key: "{{request.operation}}"
              operator: In
              value:
                - DELETE
    - name: deny-delete-protected-workload
      match:
        resources:
          kinds:
            - Service
            - Ingress
            - StatefulSet
            - Deployment
            - … # etc…
          namespaceSelector:
            matchExpressions:
              - key: app.kubernetes.io/priority
                operator: In
                values:
                  - critical
      exclude:
        subjects:
          - kind: ServiceAccount
            name: my-service-account
            namespace: my-namespace
      validate:
        message: "Deleting {{request.oldObject.kind}}/{{request.oldObject.metadata.name}} is not allowed in a critical namespace"
        deny:
          conditions:
            - key: "{{request.operation}}"
              operator: In
              value:
                - DELETE

Enforce TLS version on HTTPS load balancers

At DV, we use a list of restricted ciphers managed by our security team. The security team deploys a TLS policy in every GCP project, which may differ from the default TLS ciphers used by the GCP Cloud Load Balancer, and we need to ensure that all GCP ingresses use that list.

In Kubernetes, to use a specific TLS policy, we need to generate a FrontendConfig object pointing to the GCP policy and then annotate the Kubernetes ingress with the name of that FrontendConfig object.
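For reference, the pair of objects looks roughly like this (the GCP SSL policy name is illustrative; the real one is managed by our security team):

```yaml
---
# GKE FrontendConfig referencing a pre-created GCP SSL policy.
apiVersion: networking.gke.io/v1beta1
kind: FrontendConfig
metadata:
  name: new-tls-config
spec:
  sslPolicy: restricted-ciphers-policy  # illustrative GCP SSL policy name
---
# Ingress annotated to use that FrontendConfig.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app  # illustrative
  annotations:
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/v1beta1.FrontendConfig: "new-tls-config"
spec:
  defaultBackend:
    service:
      name: my-app  # illustrative
      port:
        number: 443
```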

The policy only matches “gce” (GCP) ingresses and ignores other types like Nginx.

We also had to handle the case where the ingress is deleted and its FrontendConfig must be removed with it. We used the generate rule’s synchronize feature for that, and deliberately ignored DELETE API calls to avoid re-creating the object and to prevent race conditions.

Finally, we had to make the policy idempotent, ignoring ingress updates when the ingress was already annotated.

This is the currently used policy:

---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: ingress-tls-frontend-config
spec:
  failurePolicy: Ignore
  background: false
  rules:
    - name: generate-ingress-frontend-config
      match:
        resources:
          kinds:
            - Ingress
          annotations:
            kubernetes.io/ingress.class: "gce"
      preconditions:
        - key: "{{ request.operation || 'BACKGROUND' }}"
          operator: NotEquals
          value: "DELETE"
      generate:
        kind: FrontendConfig
        apiVersion: networking.gke.io/v1beta1
        name: new-tls-config
        namespace: "{{request.object.metadata.namespace}}"
        synchronize: true
        clone:
          namespace: default
          name: new-tls-config
    - name: patch-ingress-with-frontend-config
      match:
        resources:
          kinds:
            - Ingress
          annotations:
            kubernetes.io/ingress.class: "gce"
      preconditions:
        all:
          - key: "{{ request.object.metadata.annotations.\"networking.gke.io/v1beta1.FrontendConfig\" || '' }}"
            operator: NotEquals
            value: "new-tls-config"
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              +(networking.gke.io/v1beta1.FrontendConfig): new-tls-config

Wrapping it up

Kyverno has an extremely powerful yet concise policy engine, and it has handled all of our use cases so far. We run a number of other policies, such as enforcing internal load balancers in GKE, automating node tolerations and affinity, and copying secrets across namespaces (but those will be for another post).

There are some main drawbacks, though:

  • Kubernetes webhooks need to be set up carefully; if improperly configured, they can block an entire cluster!
  • Since this is not a fully-fledged programming language, some limitations will exist.
  • This is a custom DSL that requires training and lots of documentation reading.
  • Kyverno is a very active project with a lot of new features and as a result, new side effects. You need to be very careful when upgrading it or choosing a version.
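The first drawback is worth dwelling on. Kyverno manages its own webhook configurations, but the underlying Kubernetes mechanism is the same for any admission webhook; this illustrative fragment (names and rules are made up) shows the field that decides whether an unreachable webhook blocks API requests:

```yaml
---
# Illustrative sketch of a Kubernetes admission webhook configuration.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-webhook-cfg  # illustrative
webhooks:
  - name: validate.example.com
    failurePolicy: Ignore  # "Fail" would reject matching requests whenever the webhook is down
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: example-webhook   # illustrative
        namespace: example-ns   # illustrative
        path: /validate
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
```

With failurePolicy set to Fail on a broadly-matching webhook, a crashed webhook backend can reject every matching API request in the cluster, which is exactly the failure mode described above.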

What’s next for the DV DevOps team?

We will continue improving our current setup, notably Kyverno’s stability: Kyverno webhooks are very sensitive in Kubernetes, and we need them to be absolutely stable. Because we have so many Kubernetes clusters, we may consider running Kyverno outside the clusters in a centralized location (one central Kyverno per region). This would improve resource usage and reduce noise from other workloads or upgrades.
