Organisation-Wide User Access Management with Open Policy Agent (OPA)

Ashish Tripathi
DigIO Australia
Jul 30, 2020

Today, every company, from small startup to major enterprise, shares one common concern: security and compliance. Among many other things, they need to protect their customers' data, systems, and assets from unauthorised access. They need the right access controls in place across their systems and services, especially firms dealing with critical data.

However, placing those access controls throughout your infrastructure, services, and systems comes at a cost. Depending on volume, load, and setup, authentication and authorisation can add overhead to your systems, degrading service performance and impacting day-to-day operations and customer service. Given that security is paramount, organisations sometimes accept that trade-off.

Result? An UNHAPPY CUSTOMER.

But, is there a better way? Let’s explore.

Common LDAP/AD Setup in Industry

Most enterprises run some form of LDAP/AD (Active Directory) to manage the identities, roles, and groups of their staff.

In most cases, staff members are granted privileges or entitlements to access only certain resources and execute certain tasks. Entitlements are managed through the AD server, where administrators add, update, or remove privileges. Every request or action a staff member makes must then be authenticated and authorised against their AD data (and the entitlements associated with it) before it can be executed.

For administrators, AD is the one-stop-shop to manage entitlements of all the users across the organisation.

Limitations With The Existing Setup

Given the nature of Active Directory, especially when deployed on-prem, the whole setup has certain limitations:

  • They are usually deployed as standalone services
  • The AD becomes a single point of access for authentication and authorisation of every request, from every user, for every resource within the organisation
  • They can’t be deployed or included as part of clusters/services to provide localised authentication/authorisation
  • There is no way to test any changes made in AD by administrators against the historical data to understand its impact before the changes are published globally within the organisation
  • AD only provides the data (groups, roles, entitlements, etc.). It doesn’t provide decision-making using that data, i.e. each service has to code that “decision-making” logic itself, based on data fetched from AD, to enforce the policy as shown below. This means one change in a policy/rule may require code changes in all the services, which could be a nightmare for any organisation, especially big enterprises
All the services executing authentication and authorisation checks
  • Even if the authorisation function is offloaded from all the services to a single “Authorisation Service” (as shown below), every request from every service still has to go through that one Authorisation Service, which in turn communicates with the standalone AD server to authenticate the user. And any policy/rule change still requires a change in that Authorisation Service
Authentication and authorization checks are offloaded to a single service
  • It does not provide endpoint security or endpoint access-control
  • For larger enterprises that have a huge number of internal resources/assets and a very large staff headcount, the load on standalone AD servers compounds as the number of resources and/or users increases
  • Lack of out-of-the-box monitoring, reporting, or auditing capabilities. Organisations have to implement, run, and manage those capabilities themselves if needed
  • AD as a shared central resource becomes the bottleneck and a single point of failure for all services, as shown below. If the AD server goes down or becomes unreachable for any reason, internal operations grind to a halt: users can no longer be authenticated or authorised for any action, so they cannot access any resource or asset
Single-point-of-access for policy checks

Centralised vs Decentralised Approach

Organisations aim for the following attributes in an authorisation layer:

  • Latency — calls to the Authorisation Service/AD and the checks they perform must be fast
  • Consistency — any modification to either data or policy must be reflected consistently across all distributed services
  • Enforcement — policy must be enforced based on the data and rules
  • Verification — data or policy modifications must be verified before being published
  • Availability — uptime must be high enough that all clusters/services can serve authorisation
  • Flexibility — both policy and data must be easy to change

Organisations usually take one of two approaches: centralised or decentralised (i.e. distributed data).

Centralised Approach — the AD/Authorisation Service is a standalone provider serving all entitlement queries.

This approach solves the problem of verification, flexibility, and consistency. However, it doesn’t solve the problems related to latency, availability, and enforcement.

Decentralised Approach — authentication, authorisation, decision-making, and policy enforcement are distributed across services and clusters. AD still serves as the central location to manage user data, and any changes made there are published to all distributed instances of the Authorisation Service.

This approach (where data and policy are distributed within each cluster) solves latency, enforcement, and availability. However, the problems of verification, flexibility, and consistency persist.

Are you feeling like ⇩ yet?

So, is there another way to get the best of both worlds?

Enter Open Policy Agent aka OPA (pronounced “oh-pa”).

What is OPA?

Open Policy Agent (OPA) is an open-source, general-purpose policy engine created by Styra and adopted by the CNCF. With OPA, policy-as-code validations are written in Rego, a declarative query language. These validations evaluate data in the context of your organisation’s security and compliance policies. For example, you could write a Rego policy that runs as a pre-deployment check in Kubernetes clusters to flag resources that would violate industry compliance standards (say, a Terraform configuration declaring an unencrypted Amazon Web Services EBS volume). OPA allows policies to be specified declaratively and made context-aware, updated at any time without recompiling or redeploying, and enforced automatically.
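To make the policy-as-code idea concrete, here is a minimal sketch of what such a Rego rule might look like. The `input` and `data` shapes (user groups and a per-group entitlements document) are hypothetical, purely for illustration:

```rego
package httpapi.authz

# Deny by default; a request is allowed only when a rule below matches.
default allow = false

# Allow the request when one of the user's groups holds an
# entitlement for the requested resource and action.
allow {
    group := input.user.groups[_]
    entitlement := data.entitlements[group][_]
    entitlement.resource == input.resource
    entitlement.action == input.action
}
```

A decision like this can be queried over OPA’s REST API or evaluated locally with `opa eval`, without redeploying the calling service.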

OPA allows security, risk, and compliance teams to adopt a DevOps-style methodology and express their desired policy outcomes as code. In the context of security, this gives us a practical means to realise concepts such as DevSecOps.

It promotes externalising authorisation from the application container into a sidecar, which lets security be handled in a generic, centralised way rather than expecting each application developer to write that crucial part themselves. OPA works exceptionally well as a sidecar for endpoint security and can be scheduled with each service, sharing the same execution context, host, and identity. It can also be deployed just like any other service, as a daemon. In either case, running it as a sidecar container or at the host level is recommended, as it improves performance and availability by reducing travel over the network.

Some common use cases of OPA are:

  • Policy to check deploying resources to Kubernetes clusters (e.g. using Conftest)
  • Kubernetes Admission Controller (e.g. via OPA Gatekeeper project)
  • Enforcing access control across services in a service mesh
  • Fine-grained security controls as code for accessing application resources
  • Policy-driven CI/CD Pipelines
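For the service-mesh and application-resource use cases above, a service typically asks its local OPA instance for a decision by POSTing an `input` document to OPA’s Data API (e.g. `POST /v1/data/httpapi/authz/allow`). The sketch below only builds that request body; the policy path and the fields inside `input` are illustrative assumptions, not a fixed schema:

```python
import json

def build_opa_request(user, groups, action, resource):
    """Build the JSON body for a query to OPA's Data API.

    OPA expects the query context wrapped in a top-level "input" key;
    the fields inside it are whatever the Rego policy reads.
    """
    return json.dumps({
        "input": {
            "user": {"name": user, "groups": groups},
            "action": action,
            "resource": resource,
        }
    })

# Example: ask whether alice (a hypothetical payments engineer)
# may read a hypothetical "ledger" resource.
body = build_opa_request("alice", ["payments-eng"], "read", "ledger")
print(body)
```

Because the OPA instance runs as a sidecar, this request stays on localhost, which is what keeps the decision latency low.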

Today, OPA is used by giant players within the tech industry. For example, Netflix uses OPA to control access to its internal API resources. Chef uses it to provide IAM capabilities in their end-user products. In addition, many other companies like Cloudflare, Pinterest, and others use OPA to enforce policies on their platforms (like Kubernetes clusters).

What is Styra/DAS (Declarative Authorisation Service)?

When companies put OPA into production, they need a user interface: a way to author, distribute, and monitor policies, and to do impact analysis. Styra’s Declarative Authorisation Service (DAS) lets teams define, enforce, and validate security policies (OPA itself only makes decisions based on policy; enforcement is left to the integrating service). In short, DAS operationalises OPA for enterprise deployment.

Key capabilities of DAS are:

  • Simplifying policy authoring (with built-in policy library)
  • Policy validation prior to enforcement (unit tests, policy compliance, historical decisions replay)
  • Policy distribution and enforcement
  • Continuous monitoring
  • Policy visualisation (Dashboards)

How Can DAS and OPA Help Achieve The Desired State?

Circling back to our problem, how can we use OPA and DAS to get the best of the centralised and decentralised approaches?

What is the desired state?

Desired state: Data remains centralised for easy management and administering through AD, but decision-making and enforcement are decentralised, as shown below:

AD data management will be centralised to ensure strong consistency, high flexibility, and verification but policy enforcement will be done through OPA instances distributed across each cluster ensuring low latency, high availability, and strong policy enforcement. Any data changes in AD will be published to distributed OPA instances dynamically.

Desired architecture

Desired system architecture

Steps to Achieve The Desired Architecture:

  • Generate OPA data bundles from AD data: A simple microservice (say, a Sync Service) polls the LDAP/AD server at a predefined interval to fetch incremental changes, translates the AD data into a JSON format consumable by DAS and OPA, and pushes that JSON, in a defined directory structure, under a root directory in a GitHub repository
  • Generate OPA policy bundles: Policies are the rules that drive decision-making and are written in Rego. For our approach, policies can be authored in two ways: either write them in your editor and push them, in a defined directory structure, under the same root directory of the GitHub repository that holds the data bundles; or write them directly through the DAS UI, which provides additional features to support Rego and to unit test the policies, as shown below:
Manage Rego policies through DAS UI
  • Sign the bundles in the repository: To ensure the integrity of the bundle and policy data, you can digitally sign the bundles. The endpoint services, i.e. the OPA instances themselves, can then authenticate the bundle data using these signatures. The signature can be a simple JWT with claims as JWS objects
  • Mount the GitHub Repository in DAS: Once the DAS is set up, you can mount the GitHub repository — which stores the final OPA bundles (data and policy) — within it. Once DAS has GitHub repository mounted, it will continuously poll the repository for any changes and will download the latest bundles as soon as they are available
  • Setup Distributed OPA Clients/Servers: Once you set up DAS, it generates the boot configuration for the OPA instances. Using this boot configuration, OPA servers can be set up in a distributed fashion, i.e. as a sidecar in a pod (or a container in a service) within a cluster, to serve entitlements locally for that service/cluster. This configures OPA to download bundles directly from DAS; if a signature (e.g. JWT) file exists alongside the bundles, OPA first verifies the signature and the file contents to authenticate the data before consuming it to serve queries
OPA as a Sidecar in Kubernetes clusters
  • Publishing Data/Bundles from AD -> Sync Service -> GitHub -> DAS -> OPA: Any data changes in the AD server will be pulled down by the Sync Service, and the output bundles will be signed and pushed to the GitHub repository. Once the OPA instances are up and running using the boot configuration generated by DAS, all three systems (GitHub, DAS, and OPA) are hooked up to sync in near real-time: any change in GitHub is downloaded by DAS, and then by OPA from DAS.
  • Serving Entitlement Queries: Once OPA has the bundles, it is ready to serve entitlement queries based on the request and the bundles in memory. OPA generates decision logs for every query it executes; these contain all relevant data, such as the request, the response, the bundles used to execute the query, the bundle version, and additional metadata.
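The translation step of the Sync Service described above can be sketched as follows. The AD attribute names (`cn`, `memberOf`) are standard LDAP attributes, but the output document shape is an assumption chosen for illustration:

```python
import json

def ad_to_bundle_data(ad_entries):
    """Translate raw AD/LDAP entries into an OPA-consumable data document.

    ad_entries: list of dicts such as
        {"cn": "alice", "memberOf": ["CN=payments-eng,OU=Groups,DC=example,DC=com"]}
    Returns a {"users": {name: {"groups": [...]}}} mapping, ready to be
    written out as data.json inside a bundle directory.
    """
    users = {}
    for entry in ad_entries:
        # Take the group's common name from each distinguished name (DN).
        groups = [dn.split(",")[0].split("=", 1)[1] for dn in entry["memberOf"]]
        users[entry["cn"]] = {"groups": sorted(groups)}
    return {"users": users}

entries = [{"cn": "alice",
            "memberOf": ["CN=payments-eng,OU=Groups,DC=example,DC=com"]}]
print(json.dumps(ad_to_bundle_data(entries), indent=2))
```

In practice the Sync Service would commit this JSON to the defined directory in the GitHub repository rather than printing it.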
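The bundle-signing step can be sketched with the standard library alone. Real OPA bundle signatures are JWTs whose claims list each file’s path and digest; this HMAC-based token merely mimics that shape for illustration and is not a drop-in replacement for a proper JWS library:

```python
import base64
import hashlib
import hmac
import json

def sign_bundle(files, secret):
    """Produce a compact JWS-like token over a bundle's file digests.

    files: mapping of path -> bytes content.
    """
    def b64(data):
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    header = b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = b64(json.dumps({
        "files": [
            {"name": name, "hash": hashlib.sha256(content).hexdigest()}
            for name, content in sorted(files.items())
        ]
    }).encode())
    signing_input = f"{header}.{claims}".encode()
    sig = b64(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{claims}.{sig}"

def verify_bundle(token, files, secret):
    """Re-sign the files and compare tokens in constant time."""
    return hmac.compare_digest(token, sign_bundle(files, secret))

token = sign_bundle({"data.json": b'{"users": {}}'}, b"shared-secret")
assert verify_bundle(token, {"data.json": b'{"users": {}}'}, b"shared-secret")
```

Any tampering with a file’s contents changes its digest, so verification on the OPA side fails before the data is consumed.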
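Finally, in plain terms, the entitlement decision OPA computes from its in-memory bundles amounts to a lookup like the following sketch (the bundle data shapes here are hypothetical, chosen only to show the mechanics):

```python
def allow(bundle_data, user, action, resource):
    """Allow when any of the user's groups holds a matching
    (resource, action) entitlement in the bundle data."""
    groups = bundle_data["users"].get(user, {}).get("groups", [])
    for group in groups:
        for ent in bundle_data["entitlements"].get(group, []):
            if ent["resource"] == resource and ent["action"] == action:
                return True
    return False

bundle = {
    "users": {"alice": {"groups": ["payments-eng"]}},
    "entitlements": {"payments-eng": [{"resource": "ledger", "action": "read"}]},
}
print(allow(bundle, "alice", "read", "ledger"))   # True
print(allow(bundle, "alice", "write", "ledger"))  # False
```

Because everything the lookup needs is already in memory, no network hop to AD is required at decision time.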

Performance

Given the way OPA is built and the fact that it keeps bundles in memory, actual OPA query lookups are fast, with sub-millisecond execution times. And since OPA can be deployed as a sidecar, network overhead can be reduced as well, yielding extremely low-latency responses.

One of the main concerns with using OPA is its memory footprint. Because organisations may be installing an OPA agent alongside every microservice, this could easily mean over 1,000 instances for some large enterprises. The sheer amount of organisational data can result in a huge bundle size, and therefore high memory usage for each OPA instance. However, taking an iterative approach to compressing bundles and adopting the right data mapping can help organisations reduce memory consumption significantly.
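One such data-mapping change is sketched below: storing entitlements once per group, rather than expanded onto every member, shrinks the serialised document roughly in proportion to group sizes. The data shapes are hypothetical and only illustrate the size trade-off:

```python
import json

def per_user_expansion(users, group_entitlements):
    """Naive mapping: copy every group's entitlements onto every member."""
    return {
        user: [e for g in groups for e in group_entitlements.get(g, [])]
        for user, groups in users.items()
    }

def compact_mapping(users, group_entitlements):
    """Compact mapping: entitlements stored once per group; users keep
    only group names, resolved at query time."""
    return {"users": users, "entitlements": group_entitlements}

# Toy data: 1,000 users in one group that has 10 entitlements.
users = {f"user{i}": ["payments-eng"] for i in range(1000)}
ents = {"payments-eng": [{"resource": f"res{i}", "action": "read"}
                         for i in range(10)]}

naive = len(json.dumps(per_user_expansion(users, ents)))
compact = len(json.dumps(compact_mapping(users, ents)))
print(naive, compact)  # the compact form is far smaller in this toy case
```

The compact form trades a slightly deeper lookup at query time for a much smaller in-memory bundle, which is usually the right trade for large user bases.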

Auditing

OPA also has the ability to push its decision logs to DAS (configured as part of the boot configuration generated by DAS), as shown below in the Styra UI. DAS stores the decision logs and can also push them to the cloud for archiving.

View decision logs through DAS UI

Decision logs, and OPA’s/DAS’s ability to archive them, provide a perfect way to audit policy or AD data changes (which eventually translate into the bundles).

Monitoring

Styra/DAS has many built-in features to support monitoring through its UI. A few of them are:

  • Dashboards for visualising data, e.g. decision logs
  • A way to author a policy and test its impact before it is published
  • Continuous monitoring of policies to ensure enforcement
  • Impact analysis
  • Health checks of distributed OPA instances
Monitor compliance violations through DAS UI

Validation

DAS solves one critical problem: validating the impact of policy or AD data changes against real production history before they are deployed. Its Historical Decision Replay feature exposes an API to replay any draft changes (to policy or AD data) against historical decision logs from production, measuring the impact those changes would have had if published directly. Replaying changes against historical decision data before publishing prevents accidental system or human error when modifying the policy or AD data.

As shown in the system architecture diagram above, the Sync Service can be integrated with DAS to perform the Decision Replay for any changes it receives from AD, and can then be configured to approve/publish the changes only if the measured impact is below a desired threshold. A similar approach can be taken to replay manually authored policy changes before they are approved or published.
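The impact check the Sync Service would run can be sketched as follows: re-evaluate historical inputs under the draft policy and approve only if the fraction of changed decisions stays below a threshold. The decision-log shape, the draft-policy callable, and the threshold value are all illustrative assumptions:

```python
def replay_impact(decision_log, draft_policy):
    """Fraction of historical decisions that would change under a draft policy.

    decision_log: list of {"input": ..., "result": ...} records.
    draft_policy: callable evaluating an input to a decision.
    """
    if not decision_log:
        return 0.0
    changed = sum(
        1 for rec in decision_log if draft_policy(rec["input"]) != rec["result"]
    )
    return changed / len(decision_log)

def approve_if_safe(decision_log, draft_policy, threshold=0.01):
    """Gate publishing on the measured impact staying under the threshold."""
    return replay_impact(decision_log, draft_policy) <= threshold

log = [
    {"input": {"group": "payments-eng", "action": "read"}, "result": True},
    {"input": {"group": "interns", "action": "write"}, "result": False},
]
draft = lambda inp: inp["group"] == "payments-eng"  # draft rule under test
print(approve_if_safe(log, draft, threshold=0.5))   # True
```

In the real setup the replay itself is performed by DAS against archived production logs; this sketch only shows the gating logic around the measured percentage.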

Summary

Policy as code is an effective way to uniformly define, maintain, and enforce security and compliance standards in the cloud. Policy as code introduces programming practices such as version control and modular design and applies them to the governance of cloud resources. This has the benefit of promoting consistency and enabling automation in policy, which in turn reduces time and money spent manually remediating compliance violations.

Open Policy Agent (OPA) offers a powerful way to implement this strategy. It’s a great example of a tool that implements security policy as code. OPA provides a uniform framework and language for declaring, enforcing, and controlling policies for various components of a cloud-native solution. OPA can be integrated into software services to decouple software from policy, avoiding the pitfalls of hard-coding policy into the software. With policy separated from code, stakeholders can more easily maintain it, and automated enforcement becomes much more viable.

Complementing that, DAS eases the adoption of OPA for enterprises by adding a management tool, or control plane, to OPA for Kubernetes with a prebuilt policy library, impact analysis of the policies, and logging capabilities. It gives DevOps teams powerful visibility and control over their cloud-native environments. From policy authoring through continuous monitoring and auditing, Styra helps secure your Kubernetes environments through compliance guardrails.

DAS and OPA together provide a powerful way to centrally manage, author, validate, monitor, and audit all access-control use cases (be it a Kubernetes admission controller, pre-deploy checks in a Kubernetes cluster, or user access management). In addition, they provide highly available, distributed, consistent, low-latency decision-making via OPA. This ensures all security and compliance guidelines are enforced across the enterprise without adding performance overhead that impacts daily operations and services.

Result? A HAPPY CUSTOMER.
