AuthZ: Carta’s highly scalable permissions system

How we built an authorization system based on Google Zanzibar

Aaron Tainter
Building Carta
12 min read · Jun 11, 2021

Permissions, also known as authorization, is the process of granting access to resources in your system. For any team, it’s crucial to get permissions right. At Carta, where we are working with financial data all day, it’s the most important thing.

But we had a problem. Instead of maintaining one legacy system, we were maintaining five. The permissions could conflict — and they were impossible to extend. Our business needs were growing, and we had several new products in the funnel.

It was obvious we had to build a new authorization system to create leverage for engineering and product. We knew we needed it to be three things:

  1. Scalable
  2. Fast
  3. Generic enough for any new product needs

Sounds simple enough, but in reality it’s not that easy. In my career, I’ve seen permission systems that are too simple. They lack the features to support fine-grained access on single resources. I’ve also seen them too complex. One small change might unravel a whole policy of attribute-based permissions.

In this article, we’ll look at how my team — Identity and Access Management — took a creative approach to avoid those pitfalls by rebuilding Carta’s permissions system based on Google Zanzibar.

Experimenting with tokens

Our work began in mid-2019 when we started decomposing our monolithic application. Engineers were developing decomposed services and needed a way to authorize incoming requests. Permission data was still coupled to the legacy permission system in the monolith.

Our first experiment was a distributed, token-based access control system. We passed user permission data between services with an encoded JWT token. Distributed tokens made access checks fast, but there was a huge cost to build them. Token build times had a very long tail and caused performance issues for power users with large permission sets.

A snapshot of token build times

Some tokens reached almost 1 MB (!!) in size. Almost every request payload had an embedded token, and response times suffered as large tokens passed between services.
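To make the size problem concrete, here is a rough sketch that builds an HS256-signed JWT with the standard library only and embeds every permission in the payload. The `perms` claim, the permission string format, and the secret are made up for illustration; this is not Carta's token format.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_token(user_id: str, permissions: list, secret: bytes) -> str:
    """Build an HS256-signed JWT that carries the full permission list."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": user_id, "perms": permissions}).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


def token_size(num_permissions: int) -> int:
    """Return the token length in characters for a hypothetical permission set."""
    # Hypothetical permission strings; real ones would look different.
    perms = [f"corporation:{i}:view" for i in range(num_permissions)]
    return len(build_token("user:bob", perms, b"demo-secret"))
```

Comparing `token_size(10)` with `token_size(20000)` shows the token growing roughly linearly with the permission set, which is the cost a power user pays on every request that carries the token.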

Additionally, the authorization token was not extendable. Service owners had to build their own system for new permission sets. We decided to scrap the authorization token and investigate other options.

Navigating complex permissions

During our discovery process, we evaluated both open source and vendor products. Many of the projects we evaluated were not granular enough to handle our use cases, so we ruled them out. Then we turned our focus to avoiding vendor lock-in.

We identified a promising project named Open Policy Agent (OPA). OPA is a framework that provides a single interface for authorization checks.

OPA serves as a system of record for authorization policies. Domain services push code-defined permission policies to OPA. Network traffic passes through OPA and policies are used to either authorize or deny incoming requests. OPA is built for distributed systems, so it scales well across a cluster.

OPA has some downsides. It requires a large amount of self-managed infrastructure. Permission checks can also be slow because, with OPA, the consumer controls the underlying data source: you store policy code in OPA, but policies query domain-level data for access checks. This worked fine for simple permissions, but complex ones caused major issues for us.

Complex permissions often queried several data models and added hundreds of milliseconds to response times. That wasn’t going to work for us.

Role explosion

Other projects failed to meet the requirements for our data because they operated on role-based access control (RBAC). RBAC grants access to users via permissive roles. Roles break down with large data sets due to a phenomenon called “role explosion.”

Role Explosion

Role explosion is caused when users have many roles for several different system entities. (For example, if a user has four roles for 10 different entities, the user has 40 managed roles.) This number has no upper bound and can become millions of managed roles if there are thousands of entities in the system.
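The arithmetic behind that example is a simple product. A small sketch, where the function name and the 1,000-user figure are illustrative assumptions and the count is a worst case in which every user holds every role on every entity they can reach:

```python
def managed_roles(users: int, roles_per_entity: int, entities_per_user: int) -> int:
    """Worst-case number of role assignments the system has to manage."""
    return users * roles_per_entity * entities_per_user


# One user with four roles on 10 entities, as in the example above.
print(managed_roles(1, 4, 10))        # 40

# A hypothetical 1,000 users, each with four roles on 1,000 entities.
print(managed_roles(1000, 4, 1000))   # 4000000
```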

At Carta, user permissions are complex because they’re scoped to portfolio assets. A user has access to their individual portfolio for a set of shares they own. They also may access securities through a fund that is invested in the company. Or the fund might even invest in another fund which owns shares in the company. On top of all that, the company might have access to your portfolio so they can administer securities.

Users have a permission for each portfolio they can access. With millions of portfolios and securities on the platform, our permissions system has to manage a large number of individual roles.

Enter: Zanzibar

In mid-2019, Google released a white paper titled “Zanzibar: Google’s Consistent, Global Authorization System.”

Google uses Zanzibar to service millions of authorization requests per second. Several high-traffic products, including YouTube and Drive, use Zanzibar for authorization. Zanzibar is powerful because it is scalable, flexible, and fast.

One of Zanzibar’s core features is a uniform language that is used to define permissions. Zanzibar consumers use the uniform language to build Access Control Lists (ACLs). ACLs are like Unix file permissions. They give users access to individual resources in the system.

With Zanzibar, services compose abstractions for user permission groups. User permission groups can compose each other. They also grant access to low-level resource ACLs.

Unlike OPA, Zanzibar is the source of truth for the data. Consumers simply add permissions without thinking about how to implement fast lookups.

Zanzibar met all our requirements. It was scalable, fast, and generic.

But we had a problem: There weren’t any open source implementations available. We decided to build our own system, adding several of our own modifications to make it an even better fit for Carta.

Our implementation

At this point, I’ll start referring to our next-gen authorization system as AuthZ. AuthZ stands for authorization.

AuthZ configures permissions by adding and removing RelationTuples. A RelationTuple is a tuple consisting of an actor, relation, and object.

  • Actor: The entity performing the action (e.g. User:Bob). An actor can also be a grouping of entities (e.g. Group:7 Members). It can even be a non-user (e.g. Service:Notification-Service).
  • Relation: The action performed on an object (e.g. view, edit, add_card, etc.).
  • Object: The entity the actor is acting upon (e.g. Corporation:7).

A RelationTuple is represented as two nodes in a permissions graph

Each part of the RelationTuple is freeform, so we don’t limit the type of permissions that you can set with AuthZ. For example, “read” and “issue_certificates” are both valid permissions.
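As a data structure, a RelationTuple is small. A minimal sketch, assuming plain strings for all three parts; the class and field names are illustrative, not AuthZ's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationTuple:
    """An (actor, relation, object) triple. All three parts are freeform strings."""

    actor: str      # e.g. "User:Bob" or "Service:Notification-Service"
    relation: str   # e.g. "view", "edit", "issue_certificates"
    obj: str        # e.g. "Corporation:7"

    def __str__(self) -> str:
        return f"{self.actor} --{self.relation}--> {self.obj}"


# frozen=True makes tuples hashable, so a set can hold them without duplicates.
tuples = {
    RelationTuple("User:Bob", "view", "Corporation:7"),
    RelationTuple("Service:Notification-Service", "read", "Corporation:7"),
    RelationTuple("User:Bob", "view", "Corporation:7"),  # duplicate, stored once
}
```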

Consumers combine RelationTuples using a domain-specific language. RelationTuples create direct access between actors and objects. RelationTuples automatically imply indirect access between actors and objects. I’ll explain later how this makes AuthZ so powerful.

  • Direct Relation: A row in the database that has been explicitly set by a client
  • Indirect Relation: A connection that has no row of its own in the database, but that the graph implies through a path of other relations between the actor and the object.

Direct relations vs. indirect relations

RelationTuples construct a permission graph. Consumers query the permission graph to determine if an actor has access to an object. An actor has access to an object if there is either a direct or indirect relationship between the actor and the object. (A good mnemonic for this process: “Actors act on Objects.”)
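A check against such a graph can be pictured as a breadth-first search. The sketch below assumes one simple composition rule that is not taken from AuthZ: an actor inherits whatever the groups it reaches through a `member` relation can do. The real rules are defined in the domain-specific language mentioned earlier.

```python
from collections import defaultdict, deque


def build_graph(tuples):
    """Map each actor to its outgoing (relation, object) edges."""
    graph = defaultdict(list)
    for actor, relation, obj in tuples:
        graph[actor].append((relation, obj))
    return graph


def check(graph, actor, relation, obj):
    """Return True if actor has `relation` on obj, directly or through groups."""
    seen = {actor}
    queue = deque([actor])
    while queue:
        node = queue.popleft()
        for edge_relation, target in graph.get(node, []):
            if edge_relation == relation and target == obj:
                return True
            # Assumed rule: follow "member" edges to inherit a group's access.
            if edge_relation == "member" and target not in seen:
                seen.add(target)
                queue.append(target)
    return False


graph = build_graph([
    ("User:Bob", "member", "Group:7"),      # direct relation
    ("Group:7", "view", "Corporation:7"),   # direct relation
])
```

Here `check(graph, "User:Bob", "view", "Corporation:7")` succeeds even though no row links Bob to the corporation: the access is indirect, through the group.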

Traversing the graph is expensive, so AuthZ maintains a custom secondary index. The secondary index responds to access checks on the permissions graph in sub-10ms response times.
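The general trade-off is to pay for the traversal at write time so that a check becomes a single lookup. The sketch below is not AuthZ's index. It is a deliberately naive stand-in that recomputes every implied (actor, relation, object) triple on each write, again assuming that access is inherited through a hypothetical `member` relation.

```python
from collections import defaultdict


class ReachabilityIndex:
    """Naive precomputed index: slow writes, constant-time checks."""

    def __init__(self):
        self._edges = defaultdict(set)  # actor -> {(relation, object)}
        self._index = set()             # {(actor, relation, object)}, indirect included

    def add(self, actor, relation, obj):
        self._edges[actor].add((relation, obj))
        self._rebuild()  # full rebuild on every write; fine only for a sketch

    def _reachable_groups(self, actor):
        """Return the actor plus every group reachable through member edges."""
        seen, stack = {actor}, [actor]
        while stack:
            node = stack.pop()
            for relation, target in self._edges.get(node, ()):
                if relation == "member" and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def _rebuild(self):
        index = set()
        for actor in list(self._edges):
            for node in self._reachable_groups(actor):
                for relation, target in self._edges.get(node, ()):
                    index.add((actor, relation, target))
        self._index = index

    def check(self, actor, relation, obj):
        return (actor, relation, obj) in self._index
```

The write path does all the graph work, which is why adding an edge costs more than it would without an index.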

The architecture of this system is out of the scope of this article. We’ll post about it later, so follow Building Carta for updates.

Building an MVP

At Carta, we start every new project by building a minimum viable product (MVP). Internal services are no exception. MVPs help your team collect valuable early feedback. Early feedback prevents your team from spending time building the wrong features.

Our goals were:

  • Validate latency metrics
  • Collect usability feedback from consumers
  • Identify risks and roadblocks

Our team started by aggregating several requirements from our consumers. At that time our authorization token was still used by most domain services to query legacy permissions. Most teams wanted a way to build new permissions.

We started implementing a basic version of the index algorithm from Zanzibar. Our first goal was to test this system on new permissions.

Our initial tests

Speed was one of the primary requirements for AuthZ. We built a prototype of the Zanzibar index to verify that we could achieve a similar level of performance.

Once we built the prototype, we began testing. We conducted several tests with our version of the Zanzibar index. We compared AuthZ’s custom index to a couple of alternatives for a baseline metric.

A snapshot from one of our early performance tests

The lookup time using the Zanzibar custom index is very fast. Checks on dense graphs aren’t much slower than checks on sparse graphs and the index is an order of magnitude faster than an alternative optimized caching strategy.

There is an initial cost for building the custom index as seen by the higher average time to add an edge.

Eventually, this became a non-issue since AuthZ builds the index with an asynchronous pipeline. Consumers don’t have to wait for the index updates to propagate through the system. To reduce load, index updates are also distributed among several workers. As the number of consumers increases, we can scale up workers to process index events faster.
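The shape of that pipeline can be sketched with a queue and a pool of worker threads. Everything here is an assumption for illustration: the event format, the in-memory set standing in for the index, and the use of threads in place of whatever workers AuthZ actually runs.

```python
import queue
import threading


def start_workers(events, index, lock, num_workers):
    """Start worker threads that apply queued index updates in the background."""

    def worker():
        while True:
            event = events.get()
            try:
                if event is None:  # sentinel: stop this worker
                    return
                operation, relation_tuple = event
                with lock:
                    if operation == "add":
                        index.add(relation_tuple)
                    elif operation == "remove":
                        index.discard(relation_tuple)
            finally:
                events.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    return threads


def stop_workers(events, threads):
    """Send one sentinel per worker and wait for all of them to exit."""
    for _ in threads:
        events.put(None)
    for thread in threads:
        thread.join()
```

A consumer calls `events.put(("add", some_tuple))` and returns immediately; scaling up means raising `num_workers`. Note that with several workers an add and a remove of the same tuple may be applied out of order, so a real pipeline needs some ordering or versioning scheme that this sketch leaves out.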

Consumer roadblocks

At this point, our initial product was simple. There were only three endpoints.

  • Add: Push a RelationTuple into the index
  • Remove: Remove a RelationTuple from the index
  • Check: Query the index for a permission

AuthZ’s Check endpoint performance impressed our consumers, but they felt like the API lacked features. Most use cases added several RelationTuples. Consumers called AuthZ tens of times during an update. Each call would add an individual RelationTuple.

There were also some concerns around scalability. At this point the index was not built asynchronously. Thus, updates were slow. Since permissions were added individually, a full update could take several seconds.

Finally, one of the biggest concerns was lack of visibility into the system. There was no way to inspect permissions once they were pushed into AuthZ. It was impossible to migrate an existing permission model without this feature.

Our consumers were reluctant to use AuthZ in a production environment.

This was hard feedback to hear. But this is why Carta works iteratively. They weren’t saying “No.” They were saying “Not right now.”

We knew that if we added a couple of features (to our MVP, which was already deployed), we could get the adoption we wanted.

Make improvements

The initial feedback was disappointing, but valuable. A few comments stood out to us:

  • Permissions are easy to add
  • There is a single source of truth
  • Checks are fast

This feedback was a positive indicator for our initial requirements. The MVP addressed most of the problems that we set out to solve. The scaling issues weren’t a big deal. We could solve those by implementing more features from the Zanzibar document.

The MVP identified something we did not predict: Our consumers wanted tooling. They preferred software that helped them use the system over new features in the system itself. We quickly prioritized tooling due to the demand coming from our consumers.

Our consumers also had issues adding RelationTuples. They didn’t want to make 50 separate requests to bootstrap permissions. Rather, they wanted to add multiple permissions in a single call. This customer request became the Template feature.

Visualizer

During our initial tests, we built some tools to debug issues with the index. One of the tools was a simple graph visualizer. Given a set of RelationTuples, it was able to generate a JPEG image of the graph.

A prototype graphing tool we used for early testing

We thought about creating an endpoint to serve this image to our consumers, but quickly decided it was a bad idea. We decided to build query APIs that let consumers view portions of the graph. These APIs acted as our backend. We then created Concord, a web app that acted as our presentation layer to visualize nodes in the graph.

Our visualizer tool in Concord

This feature alone addressed most of the concerns around discoverability: consumers could now find and inspect permissions that had already been added.

We’ll share more information about Concord in a future blog post.

Namespaces

During our test sessions, we noticed that some services overwrote other services’ permissions. This happened when two services applied updates to the same permission type.

For instance, one service might add a permission for a bank account. Another might add a permission for a user account. Both might try to write a permission with the actor type: “Account.” One service could overwrite another service’s change.

An example of two services writing to the same permission type

We introduced namespaces to separate domain-level permissions. Namespaces prevent permission overwrites while still allowing cross-domain queries. Services write to one or more namespaces where they have write access. If a service attempts to write to a namespace it does not have access to, AuthZ rejects the update and throws an error.

Permissions can connect across namespaces

Carta uses shared namespaces for a few use cases. For instance, user permissions are in one namespace; document permissions are in another. AuthZ grants the document service access to the Identity public namespace, so the document service can give a user access to a document permission.

This is a major advantage of having a global permission store.
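The write check that namespaces add can be sketched in a few lines. The class, the exception, and the service and namespace names are hypothetical; the only behavior taken from the description above is that a write to a namespace without write access is rejected with an error.

```python
class NamespaceAccessError(Exception):
    """Raised when a service writes to a namespace it has no access to."""


class NamespacedStore:
    def __init__(self, write_access):
        # write_access maps a service name to the set of namespaces it may write to.
        self._write_access = write_access
        self._tuples = set()

    def add(self, service, namespace, actor, relation, obj):
        if namespace not in self._write_access.get(service, set()):
            raise NamespaceAccessError(
                f"{service} cannot write to namespace {namespace!r}"
            )
        # The namespace is part of the key, so the same object name written
        # by two services in two namespaces never collides.
        self._tuples.add((namespace, actor, relation, obj))

    def has(self, namespace, actor, relation, obj):
        return (namespace, actor, relation, obj) in self._tuples


store = NamespacedStore({
    "bank-service": {"banking"},
    "identity-service": {"identity"},
})
store.add("bank-service", "banking", "User:Bob", "view", "Account:1")
store.add("identity-service", "identity", "User:Bob", "view", "Account:1")
```

Both services wrote a permission on `Account:1`, and each one lives in its own namespace. A call such as `store.add("bank-service", "identity", ...)` raises `NamespaceAccessError`.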

Templates

We solved the bulk update problem (add multiple permissions) by introducing RelationTuple templates. Templates allow consumers to predefine a set of changes that are applied in AuthZ. Templates make updates explicit and repeatable.

Consumers construct templates using the Mustache templating language and pass templates to AuthZ with a set of data when a change occurs. AuthZ injects data into the template and executes a bulk update event.
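To illustrate the flow, the sketch below renders a template into a batch of RelationTuples. It implements only a tiny hand-rolled subset of Mustache (variables and list sections) so that it has no dependencies. The template text, the one-tuple-per-line format, and the relation names are hypothetical and do not reflect AuthZ's actual template schema.

```python
import re

SECTION = re.compile(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)
VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render(template, data):
    """Render {{name}} variables and {{#items}}...{{/items}} list sections."""

    def expand_section(match):
        name, body = match.group(1), match.group(2)
        # Render the body once per item, with the item's keys added to the context.
        return "".join(render(body, {**data, **item}) for item in data.get(name, []))

    expanded = SECTION.sub(expand_section, template)
    return VARIABLE.sub(lambda m: str(data[m.group(1)]), expanded)


def apply_template(template, data):
    """Turn a rendered template into (actor, relation, object) tuples, one per line."""
    lines = render(template, data).splitlines()
    return [tuple(line.split()) for line in lines if line.strip()]


# Hypothetical template: grant each listed service write access to a new namespace.
NEW_NAMESPACE_TEMPLATE = (
    "{{#services}}\n"
    "Service:{{service}} write Namespace:{{namespace_id}}\n"
    "{{/services}}\n"
)

bulk_update = apply_template(
    NEW_NAMESPACE_TEMPLATE,
    {
        "namespace_id": "documents",
        "services": [{"service": "document-service"}, {"service": "identity-service"}],
    },
)
```

One call with one data payload produces the whole batch, which is what replaces the tens of individual add requests.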

We encourage services to use separate templates for each use case.

Here’s an excerpt from a template we use for AuthZ meta permissions:

AuthZ uses this template to add permissions for a new namespace. AuthZ injects data for the change event: in this case, the namespace ID and any services that have access to the namespace. The template applies a bulk update to the permissions graph. Future AuthZ calls will include the RelationTuples applied by the template.

Flatten

In some use cases, services display a list of resources that a user has access to (e.g., a search page).

Traversing the graph with our query API returns the correct data, but it’s expensive. Deep graphs can take seconds to retrieve a result set, as opposed to tens of milliseconds when using the custom index.

We determined we could repurpose the custom index to retrieve a flat list of resources for an actor. Since consumers effectively flatten the graph, we called this API “Flatten.”

The Flatten API “flattens” the graph

The index uses special filters to reduce the results returned to the consumer. Consumers use the output of Flatten to query their database for resources to show to the user.
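A minimal sketch of the idea, assuming the index is available as a set of precomputed (actor, relation, object) triples that already includes indirect access. The function signature and the two filters are illustrative, not the real Flatten API.

```python
def flatten(index, actor, relation=None, object_type=None):
    """Return a flat, sorted list of objects the actor can access.

    index is a set of (actor, relation, object) triples. The optional filters
    narrow the result by relation and by object type prefix such as "Fund".
    """
    results = set()
    for index_actor, index_relation, obj in index:
        if index_actor != actor:
            continue
        if relation is not None and index_relation != relation:
            continue
        if object_type is not None and not obj.startswith(object_type + ":"):
            continue
        results.add(obj)
    return sorted(results)


index = {
    ("User:Bob", "view", "Corporation:7"),
    ("User:Bob", "edit", "Corporation:7"),
    ("User:Bob", "view", "Fund:3"),
    ("User:Eve", "view", "Corporation:9"),
}

# The consumer would then use these IDs to query its own database.
visible = flatten(index, "User:Bob", relation="view")
```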

List lookups were much faster with the index, but slowed down when consumers used filters.

We built a trigram index on top of the custom index to support faster lookups with filters. All these optimizations reduced lookup times to 50ms for the 95th percentile.
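For readers unfamiliar with the technique, a trigram index maps every three-character substring to the values that contain it, so a substring filter only has to examine values that share all of the query's trigrams. The sketch below shows the general technique on plain strings. How AuthZ layers it on the custom index is not described here, and the class is an illustrative assumption.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of lowercase three-character substrings of text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # trigram -> values containing it
        self._values = set()

    def add(self, value):
        self._values.add(value)
        for gram in trigrams(value):
            self._postings[gram].add(value)

    def search(self, query):
        """Return values containing query as a case-insensitive substring."""
        grams = trigrams(query)
        if grams:
            # Only values that contain every trigram of the query can match.
            candidates = set.intersection(
                *(self._postings.get(gram, set()) for gram in grams)
            )
        else:
            # Queries shorter than three characters fall back to a full scan.
            candidates = self._values
        needle = query.lower()
        # Sharing trigrams is necessary but not sufficient, so verify each candidate.
        return sorted(value for value in candidates if needle in value.lower())


names = TrigramIndex()
for name in ["Corporation:7", "Corporation:12", "Fund:3"]:
    names.add(name)
```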

Version 2.0.0

A constant “build, measure, learn” feedback loop enabled our team to deliver immediate value. New features drove adoption across the company. Teams gravitated towards AuthZ because it was performant and easy to use.

But it wasn’t enough—yet. People wanted to use the new system to query old permissions. To drive further adoption, we implemented a legacy permissions proxy.

The proxy enables teams to call AuthZ for legacy permissions instead of using the JWT token. Legacy permission checks are slower than AuthZ calls. But the proxy encouraged a single interface. It is much easier to migrate customers on a single interface than it is on two disparate systems.

Currently, AuthZ serves seven different applications in our production environment, covering about 130 new, unique permission types. We have an average load of about 15 requests per second. Our metrics are growing quickly.

We’re working to replace the legacy permissions proxy with native AuthZ calls. Native AuthZ calls will make checks an order of magnitude faster on average than the proxy. The 95th percentile is more than two orders of magnitude faster.

AuthZ has made it easier for teams to manage permissions. Developers can build new products without having to build underlying authorization infrastructure.

The takeaway

When building something new — be it an internal service or external product — don’t be afraid to experiment. Our team had several failures until we built something that our consumers wanted to use. Ultimately, early release was key for us.

I challenge you to use this process in your own work. Is there a broken system at your company? Can you launch a simple experiment to help build something useful?

We’ll be covering the architecture of AuthZ in part two. Leave a comment if you have questions about AuthZ or our development process. And if you’d like to help us build the next version of Carta, we’re hiring.

Aaron Tainter is a software engineer (Carta, xEbay, xPaypal) and creator of the YouTube channel “Hacker Shack”: https://www.youtube.com/c/hackershack