How Privacy Killed RBAC
The world is always changing, as are our laws and views. In the late 19th century, shortly after the invention of the instantaneous photograph, it was rather jarring to see your picture in the newspaper. “Instantaneous photographs and newspaper enterprise have invaded the sacred precincts of private and domestic life,” wrote Samuel Warren and Louis Brandeis (later a U.S. Supreme Court Justice), which led them to propose a legal right to “be let alone”. Today we generally accept being observed, but rarely accept being identified. This, combined with the realization of how our personal data is being collected and used, has led to regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), among many others. In fact, the recent COVID-19 crisis has driven people to think about how to enforce privacy controls for goals like contact tracing to enable rapid and preemptive quarantine.
This is a short story about how those pressures from the real world can have a grave impact on our world of technology. In this case, how it is turning access control models on their head. You probably have run into this yourself when administering policies in Apache Ranger, Table ACLs in Databricks, AWS IAM roles, or Snowflake Roles — and wondered — why is this so hard? Let’s explain why…
First, a quick technology introduction.
Role-Based Access Control (RBAC) is a model for managing access to objects, most typically tables, columns, and cells (you may also hear it called a Table Access Control List, or Table ACL). As an administrator managing access through RBAC, you implicitly predetermine what users will have access to by adding them to a role. Then, you explicitly determine the privileges associated with each role. Simplifying a bit: you are deciding who belongs in an arbitrary “thing”, in this case the role, then you must decide what that “thing” has access to. Almost all modern (and legacy) databases approach access control through the RBAC model, primarily because it’s a simple abstraction that avoids manually granting access to individual users one at a time. It’s a shortcut, if you will. In fact, open source access control frameworks such as Apache Ranger and Sentry also follow the RBAC model.
The key words for RBAC are implicit and predetermine.
In contrast is something called Attribute-Based Access Control (ABAC). The key difference with ABAC is that complex, and completely separate, rules can evaluate many different attributes. As described by the NIST Guide to Attribute Based Access Control (ABAC) Definition and Considerations: At access-request-time, “the ABAC engine can make an access control decision based on the assigned attributes of the requester, the assigned attributes of the object, environment conditions, and a set of policies that are specified in terms of those attributes and conditions. Under this arrangement policies can be created and managed without direct reference to potentially numerous users and objects, and users and objects can be provisioned without reference to policy.”
Simplifying ABAC: you define your users, define your objects, and define your rules, all independently, and let those rules make access control decisions at request-time.
The key words for ABAC are assign and decision.
Let’s break this down in a matrix:
To understand this chart better, let’s start with a simple (real) example from one of our customers:
By default, everyone can see all rows > 90 days old.
Insiders can see all rows > 90 days old + fresh data (< 90 days old), but only for their region.
For argument’s sake, let’s say there are four regions: North, South, East, and West. With RBAC, you would need the following roles (I’ll explain why momentarily):
- North
- South
- East
- West
- North Insiders
- South Insiders
- East Insiders
- West Insiders
- ..and actually many more, which I’ll also explain momentarily
For ABAC, you’d need to assign attributes to your users (remember, these are being assigned to the users based on who they actually are, not a role with implicit access to data):
- Insiders and/or
- Region: maybe North, and/or South, and/or East, and/or West
The big difference is what we spoke about before: “implicit and predetermine” for RBAC versus “assign and decision” for ABAC. With RBAC, you need to explicitly predetermine the access for each possible role combination. Let’s do this with Apache Ranger to demonstrate.
Five policies…for now. There are a few issues we haven’t addressed, though.
You’re probably wondering why we need roles called “role_north_insider” and “role_south_insider”, etc, instead of having individual roles for each: “role_north”, “role_south”, “role_east”, “role_west” and lastly, “role_insider”. This is because the policy assumes an “OR”, meaning, as long as you meet one of the role conditions in the policy, it will apply to you. For example, considering the final Ranger rule in the screenshot, which applies to people with role_north OR role_south OR role_east OR role_west, it will only need one match for the policy to be triggered against them — this is fine, because as long as they don’t have the Insider role, it’s the same policy no matter what region. However, if they do have Insider, as we know, the policy is different depending on the region, which requires us to create a predetermined unique role for each combination to map to the appropriate policy.
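To make the “OR” problem concrete, here is a minimal Python sketch of OR-style role matching. It is a simplified model, not Ranger’s actual evaluation engine; the `Policy` class, the `triggered` function, and the policy name `fresh_rows_north` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    roles: frozenset  # the policy applies if the user holds ANY of these roles


def triggered(user_roles, policies):
    """Return the names of policies matched by OR-ing over the user's roles."""
    held = set(user_roles)
    return [p.name for p in policies if p.roles & held]


# Attempt 1: separate region roles plus one generic insider role.
generic = [Policy("fresh_rows_north", frozenset({"role_north", "role_insider"}))]

# A North user who is NOT an insider still triggers the fresh-data policy,
# and so does an insider who only belongs to the South.
print(triggered({"role_north"}, generic))
print(triggered({"role_south", "role_insider"}, generic))

# Attempt 2: one predetermined role per region/insider combination.
combined = [Policy("fresh_rows_north", frozenset({"role_north_insider"}))]
print(triggered({"role_north"}, combined))          # no match
print(triggered({"role_north_insider"}, combined))  # match
```

Under this model, the only way to express “North AND Insider” is to bake the conjunction into a single role name, which is exactly what forces the combined roles.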
So while OR’ing the roles saves us time on the non-insiders scenario (the final policy in the Ranger screenshot), it leads to “role explosion” because of the remainder of the policy. What if I had someone who needed access to both the North and South regions and was also an insider? You guessed it, we would need to create another role for that: “role_north_south_insiders”. If you wanted to cover every possible combination of roles for this very simple policy, you would need to predetermine 19 different roles and their associated policies!
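One way to arrive at that figure of 19 is to count one role per region plus one insider role for every non-empty combination of regions (4 + 15). The short Python sketch below enumerates them under that assumption; the `rbac_roles` helper and the exact role-naming scheme are hypothetical.

```python
from itertools import combinations


def rbac_roles(regions):
    """Enumerate one role per region, plus one insider role for every
    non-empty combination of regions (an assumed counting scheme)."""
    roles = [f"role_{region}" for region in regions]
    for size in range(1, len(regions) + 1):
        for combo in combinations(regions, size):
            roles.append("role_" + "_".join(combo) + "_insiders")
    return roles


roles = rbac_roles(["north", "south", "east", "west"])
print(len(roles))  # 19
print("role_north_south_insiders" in roles)  # True
```

Passing a fifth region to the same function yields 36 roles, since the insider combinations roughly double with every region added.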
Now someone asks you: “Hey, Nancy started last week, we need to get her access to our data. Nancy is an Insider and in North and South.” Do you add her to “role_insiders”, “role_north”, “role_north_insiders”, or “role_north_south_insiders”? That may seem like a simple question, but it’s only simple because you know which underlying access policy is tied to each of those roles. If you didn’t know that, how would you know which roles to add Nancy to? Remember, there could be hundreds of policies to sift through, many of them not created by you. Beyond role explosion, that’s the other problem: it’s impossible to implicitly understand what access you are giving Nancy when you add her to a role. What’s worse, this means your data users have no idea which roles to request. If Nancy were asking for access, what would she say? We’ve seen custom tools built whose sole purpose is to help users figure out which roles they should request in order to get access to data, and we haven’t even gotten to the hard stuff yet!
With ABAC, you would objectively assign the attributes to the user (their Region, if they are an Insider) and separately would have built a single rule like this:
- Show the row of data If { Insider AND (> 90 days old OR Region IN(user’s Region)) } Else { > 90 days old }
There’s no need for 19 different roles, and if a user had both north and south regions, they would see data for both regions under 90 days (because both regions would be in the IN statement).
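As a rough illustration, that single rule can be written as one small Python function evaluated per row at request time. This is a minimal sketch, not any product’s policy syntax; the `User` and `Row` classes, the `can_see_row` function, and the field names are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class User:
    insider: bool = False
    regions: frozenset = frozenset()  # attributes describe WHO the user is


@dataclass(frozen=True)
class Row:
    region: str
    created: date


def can_see_row(user, row, today):
    """If Insider: (> 90 days old OR region in user's regions); else > 90 days old."""
    older_than_90 = (today - row.created) > timedelta(days=90)
    if user.insider:
        return older_than_90 or row.region in user.regions
    return older_than_90


today = date(2020, 6, 1)
nancy = User(insider=True, regions=frozenset({"north", "south"}))
print(can_see_row(nancy, Row("north", date(2020, 5, 20)), today))  # fresh, her region
print(can_see_row(nancy, Row("east", date(2020, 5, 20)), today))   # fresh, not her region
print(can_see_row(nancy, Row("east", date(2020, 1, 1)), today))    # older than 90 days
```

Nancy’s two regions are just two values in one attribute; no extra role or policy is needed to cover the combination.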
As you can see, with ABAC you’ve simply assigned the user the attributes they have. Each one is just that, an attribute, plain and simple. There is no under-the-covers access decision implicit in that assignment, and we are not conflating WHO the user is with WHAT they have access to, as we do with a role. This is because the rule is kept separate from the assignment and makes the access decision at query time.
Also consider when the organization decides to add a new region — with ABAC you do not need to change the policy at all, you simply start assigning users to that region. With RBAC you’d need to predetermine a whole set of new roles and policies every time this happens. And worse, if you forget to do this, you will leak data.
Immuta, an ABAC engine for policy authoring, decision, and enforcement for any database, would represent that policy like below. Notice it’s using logical metadata (@columnsTagged) to determine where to place the policy, not literal column and table names, and variables (@authorizations) to grab the user’s Region at query time. This provides a level of scalability well beyond Ranger/RBAC.
Now let’s talk about our current world of privacy and how it’s exacerbating this problem.
That rule above was more of a business rule, but if you start to think about CCPA, for example, and privacy by design, you need to start considering much more granular controls. You will need to factor in access decisions on direct identifiers in your data, such as names and credit card numbers, but you must also consider indirect identifiers, such as date of birth and zip code, to truly anonymize or pseudonymize the data.
The masking techniques you will need to use across those different types will and should vary based on the level of utility you need vs the level of privacy you want to maintain — which means you will need many different lenses into a single table. Protections that lend some level of utility from a column while also preserving privacy, critical for analytical use cases, are commonly termed Privacy Enhancing Technologies (PETs) and can be very complex to implement. For example, we can further expand our original policy in our new privacy world:
By default, everyone can see all rows > 90 days old.
Insiders can see all rows > 90 days old + fresh data (< 90 days old), but only for their region.
Additionally (this is the new part), all direct identifiers are masked using hashing unless you are doing human resources or billing work. All indirect identifiers are open to insiders, but are masked using k-anonymization for non-Insiders.
This is not a stretch in our world of CCPA, GDPR, and the like; in fact, it will be the norm. There’s no way you can do this with RBAC. It just doesn’t scale, and this is a pretty simple example. I’m sure you’ve seen this yourself when attempting to manage policies in Apache Ranger, IAM Roles on AWS, Snowflake Roles that manage both your table access and warehouse access, or Databricks Table ACLs. Your head starts spinning. This is what we call “Role Explosion” and/or “IAM hell”.
With ABAC it remains simple, because we can assign those attributes (objectively, not implicitly) to the users and build decision logic separately, and in fact, the users’ “environments” can change based on what they are doing (human resources or billing) on-the-fly and have that reflected in their access.
With hierarchical RBAC (all roles you possess are rolled up together) “environment switching” is impossible because you can’t switch in and out of roles (not without involving an administrator, at least). One could ask the obvious question: couldn’t you have “human resources” as a role? Sure, but what if you as an individual have the “human resources” AND “billing” roles — how do I, as an organization, know which you are acting under and how to audit?
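By contrast, treating purpose as something a user acts under one at a time makes the audit question answerable. The Python sketch below is a hypothetical model of that idea, not a description of any product’s API; the `Session` class, its method names, and the table names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    user: str
    allowed_purposes: frozenset           # purposes the user MAY act under
    active_purpose: Optional[str] = None  # the one purpose in effect right now
    audit_log: list = field(default_factory=list)

    def act_under(self, purpose):
        """Switch purpose on the fly, without involving an administrator."""
        if purpose not in self.allowed_purposes:
            raise PermissionError(f"{self.user} may not act under {purpose!r}")
        self.active_purpose = purpose

    def query(self, table):
        """Record who touched which table, and under which purpose."""
        if self.active_purpose is None:
            raise PermissionError("no active purpose selected")
        self.audit_log.append((self.user, self.active_purpose, table))
        return self.active_purpose


session = Session("nancy", frozenset({"human resources", "billing"}))
session.act_under("billing")
session.query("invoices")
session.act_under("human resources")
session.query("employees")
print(session.audit_log)
```

Even though the user holds both purposes, every audit entry names exactly one, which is the information a stacked set of roles cannot provide.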
It is critical to both control and audit under what purpose you are processing your data subjects’ data. For example, under CCPA (and every privacy regulation on the planet has a condition similar to this): ““Business purpose” means the use of personal information for the business’s or a service provider’s operational purposes, or other notified purposes, provided that the use of personal information shall be reasonably necessary and proportionate to achieve the operational purpose for which the personal information was collected or processed or for another operational purpose that is compatible with the context in which the personal information was collected”.
In other words, privacy is no longer about controlling how your data is collected — that isn’t going to stop — it’s about protecting how your data is being used. Regardless, even if we ignore the privacy audit requirement, this policy is already too complex for RBAC, not to mention Ranger doesn’t have any concept of PETs such as k-anonymization as a column-level control.
With flat RBAC (you can only act under one role at a time) you could certainly have separate “human resources” and “billing” roles, and act under them individually — but then you won’t see the data you need because you don’t also have “Insider”. This scenario is even worse, because now you must consider all combinations of policies as a flat role representation, as we showed with our Ranger example earlier — can you hear the role explosion boom!?
Without further ado, here’s the ABAC policy representation:
- Show the row of data If { Insider AND (> 90 days old OR Region IN(user’s Region)) } Else { > 90 days old }
- Mask direct identifiers using hashing unless acting under purpose human resources OR billing
- Mask indirect identifiers using k-anonymization for everyone except Insiders
Voila! Also notice that it closely resembles the plain English version of this policy written above.
If I’m an Insider in the North region doing human resources work, I’d see all rows older than 90 days plus the fresh North region data, with direct and indirect identifiers unmasked.
If I’m not an Insider doing marketing, I only see rows older than 90 days, direct identifiers are hashed, and indirect identifiers are k-anonymized — at query time!
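The two masking rules can be sketched in Python as follows. This is an illustration under loud assumptions: the column names `name` and `zip`, the `apply_masks` function, and the dict-based user are hypothetical, and the `k_anonymize` helper is a deliberately simplified single-column stand-in (suppressing values shared by fewer than k rows), not a full k-anonymization implementation. The row-level rule is left out to keep the sketch short.

```python
import hashlib
from collections import Counter


def hash_value(value):
    """Mask a direct identifier with a SHA-256 digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def k_anonymize(values, k):
    """Simplified stand-in: suppress any value shared by fewer than k rows."""
    counts = Counter(values)
    return [v if counts[v] >= k else None for v in values]


def apply_masks(rows, user, purpose, k=2):
    """rows: dicts with 'name' (direct identifier) and 'zip' (indirect)."""
    unmask_direct = purpose in {"human resources", "billing"}
    zips = [row["zip"] for row in rows]
    if not user["insider"]:
        zips = k_anonymize(zips, k)
    return [
        {**row,
         "name": row["name"] if unmask_direct else hash_value(row["name"]),
         "zip": z}
        for row, z in zip(rows, zips)
    ]


rows = [
    {"name": "Ann", "zip": "20001"},
    {"name": "Bob", "zip": "20001"},
    {"name": "Cy", "zip": "99999"},
]
marketing_view = apply_masks(rows, {"insider": False}, "marketing")
hr_insider_view = apply_masks(rows, {"insider": True}, "human resources")
print([r["zip"] for r in marketing_view])  # ['20001', '20001', None]
print(hr_insider_view == rows)             # True
```

The same table yields a different lens per request, decided from the user’s attributes and the purpose in effect, with no role created for either case.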
This is what that rule would look like in Immuta, a plain English, understandable policy.
Picture for a moment you’ve done the following in Immuta:
- Assigned appropriate attributes to your users (remember, you are defining them, not defining what they have access to). With Immuta, those attributes can be pulled from any and/or many systems across your organization (imagine pulling training courses completed from Workday, for example).
- Created projects that represent the different purposes users would be working under, for example, human resources and billing. Projects can have access controls as well — based on ABAC and signed acknowledgements. (note this step is not required)
- Created the above rule the scalable “ABAC way” to make policy decisions at query time. Remember, to further add scalability, you can create these rules at the logical level rather than the physical table-by-table, column-by-column level, using metadata.
Now picture your users executing Spark jobs in their Databricks notebooks and having all that enforcement happening on the fly without ever having to manage a role or table ACL. Or having your users acting under the PUBLIC Snowflake role running queries and having all enforcement happen dynamically — they only use Snowflake roles to control warehouse access. It’s a beautiful thing.
Let’s go back to our chart.
In our new world of privacy with many highly complex lenses into your data, any advantage you had with RBAC is now gone, and in fact, the “No”s are exacerbated.
Flexibility: There is no such thing as a small or medium-sized organization anymore, because it’s not about how many employees you have; it’s about how many policies you have. Your ethical and legal privacy obligations now make policies so complex that even the smallest companies have role explosion.
Simplicity: It is not “easy to start” anymore because there’s nothing simple about your policies, which lands you in a role explosion situation out of the gates.
Simple rules: See Flexibility.
Customizing permissions: Again, there is so much complexity that you must separate the WHO (the user attributes) from the WHAT (what they have access to). Without that separation, it’s impossible to implicitly understand what you are giving someone access to when adding them to roles. You quite literally cannot keep creating roles!
You may not have realized it until now, but Privacy is what killed RBAC.
Immuta saw this coming 5 years ago and built a platform to allow the enforcement of advanced privacy enhancing technologies and access controls using the ABAC model on any database or compute you choose, such as Databricks, Snowflake, EMR, Hive, Athena/Presto/Starburst/Synapse, Redshift — you name it. You no longer have to have your head (or roles) explode. Get your access control model under control and welcome to the new world order of ABAC — brought to you by Privacy.
RIP RBAC.