Kubernetes Policy Enforcement with Open Policy Agent

Programmatically enforcing best practices

Indu Subbaraj
Bluecore Engineering
6 min read · Jun 25, 2021


For the past year, Bluecore’s engineering team has been building our Kubernetes ecosystem as we migrate our core services from Google App Engine, Google’s serverless offering, to Google Kubernetes Engine. One of our guiding principles as we’ve made critical infrastructure decisions has been streamlining the Kubernetes developer experience. We want it to be extraordinarily simple for an engineer to spin up a new service, deploy it to our clusters, and monitor it with our observability tools. This not only improves developer velocity but also ensures our services adhere to vetted infrastructure patterns, such as using Workload Identity for service account authentication or utilizing internal tracing libraries. In addition to creating and promoting adoption of these development happy paths, we also wanted a programmatic approach to enforcing best practices. This led us to the world of policies and Open Policy Agent (OPA), an open-source policy engine that we’ve integrated into our platform to enforce policies in our Kubernetes software development life cycle (SDLC).

What Is a Policy?

Policies are rules that govern software. Policies can be defined at any layer of a technical stack and exist because they allow us to build compliant, technically sound, and scalable systems. Example policies include mandating labels on Kubernetes resources, provisioning infrastructure only through Terraform, or requiring distinct hostnames on Ingress routes. The best way to enforce policies in a scalable manner is to define them as code.

What Is Open Policy Agent (OPA)?

The most prevalent open-source tool we found for implementing policy as code was OPA. Put simply, it is a policy engine that separates policy decisions from application code, allowing it to integrate at many points of a pipeline. OPA uses Rego, a declarative language, to define its policies. The following diagram illustrates how a service (which can be anything from the Kubernetes API server to a CI/CD pipeline to a database) can query OPA to evaluate arbitrary input JSON against policies.

Source: https://www.openpolicyagent.org/docs/latest/

An example workflow may be as follows:

  1. The Kubernetes API server receives a request to create a Kubernetes deployment resource in the “default” namespace.
  2. The API server sends a request to the OPA server to verify that the deployment adheres to all relevant policies (the request contains the deployment as JSON).
  3. The server runs the deployment JSON against the policies in its store and finds that the JSON violates a policy mandating that resources must have a namespace other than “default.”
  4. The OPA server returns a “deny” response to the API server.
  5. The API server rejects the request and the Kubernetes deployment config is not applied.
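To make the exchange concrete, the query and decision in the workflow above might look roughly like the following. This is a hedged sketch against OPA’s Data API; the policy path (`kubernetes/admission`), field names, and response shape are illustrative, not Bluecore’s actual schema. The caller POSTs the resource as JSON under an `input` key:

```json
{
  "input": {
    "kind": "Deployment",
    "metadata": {
      "name": "my-app",
      "namespace": "default"
    }
  }
}
```

and OPA evaluates its policies against that input and returns a decision under a `result` key, for example:

```json
{
  "result": {
    "allow": false,
    "reason": "resources must not run in the default namespace"
  }
}
```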

Check out OPA documentation for more details on how the tool works.

Integrating OPA Into Our Kubernetes SDLC

Our initial use case for OPA was enforcing policies around our Kubernetes clusters. We wanted to be able to enforce these policies at three distinct stages of the software development life cycle: during development (before a Kubernetes resource is created), during deployment (in real time, as the resource is being created), and during maintenance (auditing resources that already exist). Below, we describe how we set up our infrastructure to achieve this.

Create centralized policies repo

We first created a central repository to contain all our policies. These policies are written in Rego, OPA’s query language, and are all unit tested. The repo’s CI runs these unit tests on every commit.

A typical workflow of adding a Rego rule would be as follows:

  1. Create a new folder under policy/kubernetes/ and name it using the hyphenated format of my-policy-name.
  2. Define your policy in src.rego. The package should be your policy name in the format of my_policy_name.
  3. Test your policy in src_test.rego.
Example Rego policy
Example Rego unit test
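For illustration, a minimal version of such a policy and its unit test might look like the following. The folder, package name, input shape, and messages are hypothetical, not our production code:

```rego
# policy/kubernetes/deny-default-namespace/src.rego
package deny_default_namespace

violation[{"msg": msg}] {
    input.metadata.namespace == "default"
    msg := sprintf("%s/%s must not run in the default namespace", [input.kind, input.metadata.name])
}
```

```rego
# policy/kubernetes/deny-default-namespace/src_test.rego
package deny_default_namespace

test_default_namespace_denied {
    results := violation with input as {
        "kind": "Deployment",
        "metadata": {"name": "my-app", "namespace": "default"}
    }
    count(results) == 1
}

test_other_namespace_allowed {
    results := violation with input as {
        "kind": "Deployment",
        "metadata": {"name": "my-app", "namespace": "production"}
    }
    count(results) == 0
}
```

Tests like these run under `opa test` (which conftest wraps as `conftest verify`), so every commit to the repo exercises each policy against both passing and failing inputs.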

We use conftest, an open-source tool that lets us test Kubernetes YAML against Rego policies, to push the policies to Google Container Registry as a step in our CI. From there, we can pull the policies when needed for evaluation.
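The push-and-pull cycle boils down to a pair of conftest commands along these lines; the registry path and tag are illustrative:

```
# CI step: push the policy directory to GCR as an OCI artifact
conftest push gcr.io/my-project/policies:latest

# Consumer side: pull the policies wherever they are needed for evaluation
conftest pull gcr.io/my-project/policies:latest
```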

Integrate OPA with CI pipeline

At Bluecore, Kubernetes resources are defined in central manifest repositories and deployed to our Kubernetes clusters via Argo CD. We use conftest to validate all Kubernetes manifests against our policies in a policy validation CI step.

These steps run on commits to every branch, so changes that don’t pass our policy checks are surfaced early in the SDLC, before they are deployed to our clusters.
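At its core, this CI step is a single conftest invocation against the manifest directory; the paths shown here are illustrative:

```
# Validate every Kubernetes manifest against the pulled Rego policies;
# a non-zero exit code fails the CI step and blocks the change
conftest test manifests/ --policy policy/
```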

Integrate OPA with Kubernetes

While the majority of our Kubernetes resources are validated with conftest before being deployed via Argo CD, resources may still be applied directly to our Kubernetes clusters. We want to ensure our policies are applied to resources in this scenario as well. To achieve this, we use OPA Gatekeeper, an actively maintained open-source project that lets us enforce policies in Kubernetes clusters in real time via an admission controller webhook.

Source: (left) https://www.openpolicyagent.org/docs/latest/ (right) drawn by author

Gatekeeper uses OPA under the hood to enforce policies, and the diagram above illustrates how Gatekeeper maps onto the OPA paradigm of policy decoupling.

Gatekeeper consumes Rego policies as Constraint and ConstraintTemplate Kubernetes objects. A ConstraintTemplate declares a new constraint type and contains the Rego to evaluate the policy. A Constraint is an instance of a ConstraintTemplate. Here is an example of the same Rego policy we saw above but defined within Kubernetes Constraint and ConstraintTemplate resources.
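A hedged sketch of what that pairing looks like for the default-namespace policy follows. The resource names, kind, and match scope are illustrative; note that in Gatekeeper the resource under review is reached via `input.review.object` rather than `input` directly:

```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: denydefaultnamespace
spec:
  crd:
    spec:
      names:
        kind: DenyDefaultNamespace   # the Constraint kind this template declares
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package deny_default_namespace

        violation[{"msg": msg}] {
          input.review.object.metadata.namespace == "default"
          msg := "resources must not run in the default namespace"
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: DenyDefaultNamespace           # an instance of the template above
metadata:
  name: deny-default-namespace
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
```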

Once we had installed Gatekeeper in our clusters, we needed a way of generating and deploying the Kubernetes ConstraintTemplate and Constraint objects whenever policies were added, deleted, or updated.

To reuse the Rego policies we already had defined and avoid maintaining two copies of the same policy, we used konstraint, an open-source tool that generates Kubernetes Constraints and ConstraintTemplates from .rego files. We wrote a custom Go script that, on every merge to the main branch in the policies repo, runs konstraint create policy/ --output constraints/ --dryrun, outputs all the generated Constraint and ConstraintTemplate YAML to a single file, and commits the file to our Kubernetes manifests monorepo. Argo CD then auto-deploys the updated Constraints and ConstraintTemplates to our clusters for Gatekeeper to use. This pattern allows us to rely on the raw Rego files as the single source of truth for policies.

We also use konstraint to auto-generate documentation for our policies.

Audit existing resources with OPA Gatekeeper

Many of our existing Kubernetes resources did not adhere to the new policies we were adding. We wanted a way to easily surface these violations so our engineers could fix them.

Gatekeeper has an audit feature that runs at a regular, configurable interval and reports which resources violate existing policies. The results are stored in the Constraint resources themselves. We added a custom Argo CD health check to mark the resource as degraded if any violations of the Constraint were found. However, this still wasn’t the most user-friendly way of surfacing violations, so we explored options for reporting violations in a UI.
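Argo CD health checks of this kind are Lua snippets registered in the argocd-cm ConfigMap. A minimal sketch, assuming the health check keys off the `totalViolations` field that Gatekeeper writes into each Constraint’s status (the exact customization key depends on your Argo CD version and Constraint kinds):

```yaml
# argocd-cm ConfigMap (fragment): mark any Gatekeeper Constraint
# with reported audit violations as Degraded
data:
  resource.customizations.health.constraints.gatekeeper.sh_DenyDefaultNamespace: |
    hs = {}
    hs.status = "Healthy"
    hs.message = "no policy violations"
    if obj.status ~= nil and obj.status.totalViolations ~= nil and obj.status.totalViolations > 0 then
      hs.status = "Degraded"
      hs.message = obj.status.totalViolations .. " policy violation(s) found"
    end
    return hs
```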

Violations reported in Kubernetes Constraint resource

Since Bluecore runs on Google Cloud Platform, we first tried using gatekeeper-securitycenter to surface these violations in the Google Cloud console, but its limitations around search and filtering were less than ideal. We then deployed gatekeeper-policy-manager, which provides a much cleaner UI for viewing the audit violations reported by Gatekeeper.

Gatekeeper-policy-manager UI

We designed this Kubernetes policy enforcement pipeline to make it simple to add new policies as we evolve and establish new engineering standards. It has already proven useful in ensuring resources are properly labeled, deployed to the right clusters, and bounded by memory and CPU limits. We will be adding policy support via OPA for other parts of our platform in the future as well; for example, Forseti is a tool built on top of OPA that allows policy enforcement of Google Cloud resources.

Interested in helping scale our platform as we continue growing? We’re hiring at Bluecore!
