Elephants standing performantly between some trees. Photo by Jeff Lemond on Unsplash

Designing the most performant Row Level Security schema in Postgres

In an effort to write less application logic in my services and interfaces, (and out of pure laziness), I am persistently looking for performant strategies to bake access control directly into my schemas. Row Level Security has been around in Postgres for a while now, but we have recently gotten some upgrades to the optimizer which make it worth a deeper investigation.

But why?

If you’re not familiar with Row Level Security, the main hypothesis is that we should be able to prevent access to specific rows of data based on a policy. That means our application logic only has to worry about SELECT * FROM my_table and RLS will handle the WHERE user_id = my_user_id part automagically.

To put it another way: our queries should only contain the clauses requested by our interfaces and not the filters and conditions demanded by access control in a multi-tenant data store. Take this example:

SELECT id, value
FROM items
WHERE created_at > now() - interval '1 week'
AND items.owner_id = $1

In the above example the WHERE clause is imposing conditions required for access control, (items.owner_id), and requested by an interface, (created_at). If our interface grows more complex so could the WHERE clause and, all of a sudden, separating concerns becomes much harder. Mistakes will be made, and we will keep writing the same annoying access control logic into our queries.

Enter Row Level Security, which provides a space for us to define the conditions demanded by access control separately from the conditions requested by the interface. All of a sudden, our queries look like this:

SELECT id, value
FROM items
WHERE created_at > now() - interval '1 week'

And, all of a sudden, then can grow as complex as needed without impacting the access control side of the problem.

Requirements

Since the goal is to produce something usable for real applications, let’s talk requirements:

  1. Items are inherently private
  2. Items can be shared with individuals
  3. Items can be shared with groups
  4. Items can be made public
  5. Read and write permissions can be handled separately

It should also go without saying that we want our policies to evaluate quickly for large datasets; otherwise, our user experience will suck and no one will like us. I’ll use a couple of three letter initialisms throughout this post:

I will also use EXPLAIN ANALYZE a lot, which I suggest reading up on if you’re not familiar with it.

Produce these results locally!

Everything to reproduce these tests is available on GitHub, but I’ll just copy the gists on here to keep things simple. Each of the schemas is generated with a dataset of 100 users who can write to 1000 items and read from a subset of the rest.

In general, you can test any of these schemas the following way:

BEGIN;
SET LOCAL jwt.claims.roles = ‘c94a743e-8ffd-42dc-a237-b1613e111f53,be95031e-76e0–401a-808c-5c7acde292de’;
-- this represents an example query from an interface
SELECT count(*) FROM items;
COMMIT;

Note that we are using jwt.claims.roles instead of an existing local variable, such as current_user, because those behave unexpectedly when querying views and we want this RLS strategy to work for everything, (tables and views). You can also just SELECT * FROM user_item_stats(), which grabs a random user ID as the role and executes the previously mentioned transaction.

In a real application, you would swap out the SELECT statement for whatever interesting query your user wants to run.

RLS Policy, ACL stored in a column

This is likely to be the most performant approach to RLS since we won’t need any SELECTs in our policy; permissions are stored directly on the items. The biggest drawback about storing an ACL in a column is that it will get very wide for items shared with many users/groups, and that may impact performance.

Item Table

We’ll use the following schema to define our table:

There are two key things which we’ve implemented in the table definition:

  1. Read and write permissions are handled differently. We could add an owners column if we needed that as well.
  2. We are using a GIN index for the ACL columns to ensure we maintain good performance on the arrays of permissions.

Item Policy

This policy is pretty straightforward: we are just using && to check if the claim roles overlap with the ACL. Note that we aren’t doing anything special for public items here; for this schema you’ll have to create a group to represent public instead of a special column as we will do for the next schema. Surprisingly, the planner uses a sequential scan at the deepest node just for adding an OR items.public = TRUE condition to the USING clause. That causes a huge performance hit:

ACL in column, 100 users, public attribute. Results here: https://explain.depesz.com/s/ivDo

Yuck, hopefully performance is better when we avoid the public logic and handle it like just another group.

Performance

As for all of these schemas, I’m testing it using a simple query: EXPLAIN ANALYZE SELECT count(*) FROM items. Running on my local machine this completes in under 1ms.

Look at all those index scans. Nice.

If we don’t need group-level access we can drop all the sketchy array-splitting logic and just set jwt.claims.roles to a single ID; regardless, this policy’s performance will not change. If we do want groups, we just need to make sure that jwt.claims.roles is generated securely in some sort of authorization service. Obviously, a compromised token generator may allow users to inject harmful logic in the local variable.

One of the best things about this implementation is also the worst: the IDs stored in the ACLs are not constrained as foreign keys. This means we can’t add ON DELETE constraints and the ACL column values may become orphaned. On the other hand, they can reference IDs from multiple tables, like groups and users, which adds a bit of flexibility. Double nice.

As a final test, let’s see what happens when we bump it up to 1000 users and a total of 1 million items:

I won’t make any claims about how performance will change as we add more users, but it safe to say this schema is very fast.

RLS Policy, ACL stored in a table

Using rows-in-tables instead of arrays-in-columns feels so much more…normal. The only question: does doing it the “right” way ruin performance? Probably…

Item and Permissions Table

For this schema we can take full advantage of a normalized database and foreign key all of the things. To take it one step further, this schema uses an enumerable value for the role of each permission. We could use that ENUM to add more permission types, (like owner), without impacting performance.

The big problem with REFERENCES is that we can only constrain the permissions table to one other table. Due to this we are forced to store users and groups in one table, which could get confusing. It probably already is confusing, so let me know and I’ll add a comment explaining myself.

Note that we’re also adding some indexes to this permissions table which should help the planner avoid sequential scans.

Object Policy

Here’s a massive policy:

Uh oh, we’ve got a SELECT in our policy. Those of us who need groups also have an ANY, it’s not looking good…save us, magical query planner!

Performance

Performance is certainly slower at a whopping 22ms for only 100 users, but it is still pretty fast:

The trade-off is that we get to explicitly denote an item as public, and can even add additional permissions roles like owner or admin which may be important for some use cases. Heck, we could even go as far as a number-based access permission, but who would ever need anything that granular?

Let’s check out what happens when we expand to 1000 users:

We’re still pushing pretty good performance considering how much logic is baked into that policy, but nowhere near as good as the column-based schema.

Security-barrier View, ACL stored in a table

Moving on, let’s use the same schema as the previous approach but tell the planner what’s up by using a VIEW that JOINs more performantly. Similar to the column-based ACL schema, we’ll have to use a group to implement the public feature.

Object View

No policy needed, we can just revoke access to the entire items table and voilà: the user won’t know the difference. Until they try to INSERT something into the view, then they’ll know the difference for sure. Also the name might give it away.

Performance

Surprisingly, the ANY isn’t the slowest node in this query, the JOIN forced a sequential scan for item_id = items.id. Maybe I’m getting tired or maybe it’s impossible, either way I’m disappointed by the performance of this schema.

Another huge drawback: this view is not automatically updatable, which means users will still need to access the underlying table. They can do that either through stored procedures or the same WITH CHECK-type policy that we used in the table implementation. While this may have been a more performant approach pre-pg@10, it definitely isn’t holding up well now.

That said, if you don’t need groups, dropping the ANY clause does yield significant improvements:

Next Steps

Both of the RLS schemas performed well and may be valuable tools for different use cases. The table-based approach avoids modifying existing tables, while the column-based approach if blazingly fast. In my mind, the only thing to do next is trying building something with both of them and see what pain points emerge.

If you have any optimizations for these schemas drop them in the comments, it would be great to push them to their limits. Additionally, if you’ve found alternative strategies which offer great performance via sane schemas I’d love to hear about them.