Solving a Set Cover Problem in Cloud IAM on GCP

Paul Durivage
Google Cloud - Community
7 min readMar 25, 2020

Recently, a fellow Strategic Cloud Engineer had an interesting question from a Google Cloud customer that sparked our collective interest. They had asked how they could get every possible permission on the Google Cloud Platform with the least amount of work. GCP has hundreds of roles and thousands of permissions. In order to answer this question, we’d have to look at every single one. This needed to be solved with code.

Please note that this is primarily a thought exercise. The obvious concern with granting every possible permission is that it violates best practices like separation of duty, which is to grant access by the principle of least privilege required to accomplish your user’s or team’s job. The scope of access is enormous and great care should be taken to prevent misuse or abuse. I highly recommend reading this blog post for more information about using IAM securely.

First, let’s level-set on some basic Google Cloud Platform (GCP) IAM concepts. Cloud IAM provides the framework by which access is granted to Google Cloud resources. In the simplest terms, an IAM policy grants a member access to perform specific actions on a cloud resource.

Working with resources in GCP, such as listing or creating Google Compute Engine instances, requires specific permissions: compute.instances.list or compute instances.create, respectively. Permissions generally correspond to a GCP API method.

Roles are bundles of permissions which are granted to members of an IAM policy. A member may be a user, service account, group, or domain. Google provides many predefined roles of various types–sometimes associated with a job function that a member might perform–like a Database Administrator, a data scientist who requires BigQuery read access, or a Security Engineer who oversees security controls across the platform. No single role contains all possible permissions, which number in the thousands and will continue to grow.

IAM policies are applied to various resources: organizations, folders, projects, instances, etc. A policy binds members to a role. Access is ultimately granted when a policy is set on a resource.

To solve the problem of granting every permission with the fewest possible roles, you can use Python 3 and prototype in Jupyter, as it’s a great way to iterate on an algorithm.

First, we need to get a comprehensive list of the available IAM roles and the permissions contained therein. In the code below, I use the Python client to the Google API to list and describe roles until the results are exhausted.

Enumerating GCP IAM roles by querying the API.

As roles and their permissions change over time, you may optionally dump the dictionary as a JSON file as an easy, human-readable cache for the results. You may potentially save hundreds of API calls as you iterate on your algorithm or you may choose to use the data for later analysis.

Let’s glance at the first few lines of our results. The role (i.e. roles/accessapproval.approver) is the key, and the value is the raw describe API call result. The array value of “includedPermissions” has exactly what we’re looking for.

GCP IAM role results as JSON, truncated.

At first glance the results may not appear to be very large–just 751K of duplicated data. There is a many-to-many relationship in the data, so in order to work through the results you need to perform some look-ups, like identifying which permissions are in a specific role, or discovering which roles contain a given permission. You’ll also want to identify what unique permissions are available so that it’s possible to produce a set of roles which provide them all.

Iterate over the data once and create a few dictionaries for each of these look-ups. While you’re iterating over the data, count and store how many permissions each role contains–it will be useful later. Keep track of the unique permissions by creating and populating a set. Not only is that a useful data structure, but you’ll need to know all the possible values of permissions so that you know when a satisfactory answer has been found.

Processing role data by mapping roles to their permissions and vice versa.

Sort the roles by the number of permissions they contain. This is useful in the next step and later in the selection algorithm.

Roles in GCP are often subsets of another role, meaning that the permissions they contain could be wholly contained in another, more comprehensive role. Eliminating subset roles before running the selection algorithm will increase the quality of the choices it makes.

Finding subset roles

The problem being solved is called a Set Cover Problem. There is a set comprising all possible unique permissions (called the universe) and a collection of sets of permissions (roles) whose union equals the universe. You just have to find which roles when added together equal all possible permissions.

The algorithm I used to find the set is called a greedy algorithm. The goal is to find a role containing the largest number of permissions first. We’ll remove the permissions this role satisfies from our set, then work on the remainder. In each iteration we look at one permission in the remaining set. We look to the set of roles which contain this permission, then sort this list by how many permissions each role contains. We always select the role which contains the largest number of permissions. This way, we have the biggest opportunity to remove permissions from the remaining set on each iteration. We keep track of each role we select, and when no more permissions remain, we know we have solved for every one.

Grab the role with the most permissions first.

That role ends up being owner with (currently) 2,578 permissions.

Jupyter screnshot showing the role with most permissions is “owner” with 2578 unique permissions.
Working with the results in Jupyter.

Create a set of the roles selected and initialize it with the owner role identified as containing the most permissions.

Next, calculate what permissions have yet to be satisfied by the owner role and the universe of unique permissions. Below, using the minus operator on the sets calculates the difference; the result is set in the remaining variable.

While there are still permissions that remain to be satisfied by our selected roles, the algorithm continues to try and find satisfying roles. Because it’s greedy, it always uses the role with the most permissions to satisfy a required permission. After each iteration, it recalculates which permissions remain and loops until none are left.

Selecting roles which satisfy the necessary remaining permissions.

The algorithm selected the following 16 roles, sorted alphabetically:

roles/axt.admin
roles/billing.admin
roles/billing.creator
roles/compute.xpnAdmin
roles/container.hostServiceAgentUser
roles/datacatalog.categoryFineGrainedReader
roles/datafusion.serviceAgent
roles/iam.serviceAccountTokenCreator
roles/iap.httpsResourceAccessor
roles/orgpolicy.policyAdmin
roles/owner
roles/remotebuildexecution.actionCacheWriter
roles/resourcemanager.folderAdmin
roles/resourcemanager.organizationAdmin
roles/resourcemanager.projectCreator
roles/serverless.serviceAgent

You may validate that the selected roles contain all permissions in the universe:

Screenshot from Jupyter notebook where the code/algorithm was prototyped confirming the results provide all permisisons
Confirming the selected roles contain all possible permissions.

You now have a set of roles which grant every unique permission in GCP. But is this the smallest possible set of roles?

No. Well, probably not.

The algorithm doesn’t select for the smallest possible set. And worse yet, the Set Cover Problem is NP-hard, meaning that you have to calculate all the possible role combinations in order to find out–or at least up to 15 roles, as you now know there is a solution with as few as 16 roles combined.

Using combinatorics to calculate how many combinations are required to potentially find a better answer–without eliminating any roles–gives us 1.187552 x 10²⁸ combinations. This is a huge number! By eliminating roles which are subsets from 485 to 39, the number of combinations is reduced to approximately 54 billion in order to find solutions of up to 15 roles–which is at least within the realm of possibility. Checking all the combinations will take a while, so I wrote an application to do just that.

To find an answer once and for all, I wrote a command line application which tries every possible combination of up to 15 choices of our 39, non-subset roles using a brute force approach. The app stops at 15 because we have a solution with 16 roles from the greedy algorithm. Any solution–if there even is one–would need to contain 15 or fewer.

The command line application operates in two steps. First, it produces every possible combination of roles and writes the combinations to disk. This step is dubbed the producer phase. Next, in the worker phase, a process is started for every CPU core on the machine, each evaluating a slice of the total combinations until none remain.

To provide the most powerful CPU resources to each process, I selected the compute-optimized instance type c2-standard-60, providing 60 vCPUs and, therefore, 60 workers to analyze combinations, fully saturating the compute resources of the instance.

A terminal window running htop, displaying full CPU utilization of the compute instance as it brute-forces an answer
Screenshot of htop displaying full CPU utilization of the compute instance as it brute-forces an answer.

Running the producer resulted in over 2 terabytes of data in just under 9 hours of total run time. Sixty workers crunched the combination data for over 28 hours, finding 8 solutions of 15 roles each. Here is one example:

roles/axt.admin
roles/billing.admin
roles/billing.creator
roles/compute.xpnAdmin
roles/container.hostServiceAgentUser
roles/datacatalog.categoryFineGrainedReader
roles/datafusion.serviceAgent
roles/iam.securityAdmin
roles/iam.serviceAccountTokenCreator
roles/iap.httpsResourceAccessor
roles/orgpolicy.policyAdmin
roles/owner
roles/remotebuildexecution.actionCacheWriter
roles/resourcemanager.folderAdmin
roles/resourcemanager.projectCreator

Despite not being technically correct, the greedy algorithm produces a good enough solution quickly (in hundreds of milliseconds), which I find to be a reasonable trade off when compared to an absolutely correct answer computed over a day and a half.

The code and data in this blog post is open source.

If you choose to use this algorithm to grant yourself new, all-powerful, all-encompassing IAM permissions, may you use them very carefully. This level of access is not appropriate for most organizations under most circumstances. Understand the risk and implement IAM conscientiously.

--

--

Paul Durivage
Google Cloud - Community

Sometimes I sit in silence. Sometimes I solve problems with 1s and 0s.