Our data engineering pipelines load data, transform it, and prepared it. All the prepared data is stored in one dataset in BigQuery hosted by one GCP project. Nothing fancy here!
We have 2 groups of users of this prepared data : data scientists, and data analysts. Each group belongs to a different business unit using their own GCP project, different from ours.
Although we are excited 🤩 😍 to provide this prepared data for these groups, we have to pay attention to 2 points :
- How to share the data with read only access
- How to make sure and be sure that no group will pay for the usage of the other group.
Sharing BigQuery data with read only access.
Although all the prepared data are stored in one dataset in the host project, we will not share all of it with anyone who asks for it. The needs of data analysts, for example, are different of those of data scientists. In fact data analysts are interested in categories and numbers ; so few columns of a table would be enough. Data scientists tend to request as much data as possible.
One solution to provide different “views” of the data, tailored only for a group needs, is … yeah you guessed it 👍, to create BigQuery Views from the source tables and share them with the proper group.
But careful ! We cannot share access to tables or views directly. As well stated by BigQuery Documentation “We share access to BigQuery tables and views using project- level IAM roles or dataset-level access controls. Currently, access controls are not directly applicable to tables or views.” And in order to be consistent with the principle of least privilege, we will use dataset-level access controls rather than project-level one.
Let me set the stage for an example, the data to share is hosted in a GCP project bq-data-host. For this example to be more thorough, this data will be scattered across, not 1 dataset but 2 : dataset_source_1 and dataset_source_2. The first will have a table austin_bikeshare_stations and the second will host a table named austin_bikeshare_trips. For a given requirement, we need to share with, data analysts group, some and not all the columns of the result of a join between the previous 2 tables : bq-data-host.dataset_source_2.austin_bikeshare_trips and bq-data-host.dataset_source_1.austin_bikeshare_stations. To implement this requirement we will :
- create a specific dataset to share with the data analysts group. I created one called : dataset_to_share_views.
- create a view called trips_with_stations_status_view based on the join query as explained previously
The screenshot below capture example I just explained.
The data analysts are members of another GCP project called bq-reader-diff-billing-account (please bear with me about the project name). We want to share with them access to bq-data-host.dataset_to_share_views dataset and its views. Since they are not members of bq-data-host GCP project, they cannot see it ; AND WE WANT IT TO STAY THIS WAY! And in no case shall they be able to see or query bq-data-host datasets that we did not share with them explicitly
It’s all about permissions
Google Cloud documentation is very well done. I will not hesitate to copy paste, of course with modification for our solution, from Google BigQuery documentation, more specifically from these two sections :
To share the dataset dataset_to_share_views with the data analysts :
- 1- Select dataset_to_share_views dataset from Resources, then click SHARE DATASET near the right side of the window as shown below
- 2- In the panel, in the Dataset permissions tab, click Add members.
- 3- In the Add members panel, type the email addresses of the data analysts or their group you want to add into the New members text box. In the screenshot below I added a member through his email mben******@****.com. He is a data analyst and lets call him Mehdi 😄
- 4- For Select a role, select BigQuery and choose the pre-defined IAM role BigQuery Data Viewer for the new members. For more information on the permissions assigned to each predefined BigQuery role, see the Roles section of the access control page.
- 5- Click Save and then click Done.
Pinning the host project
After granting Mehdi the role detailed previously, when he logs in his GCP console, his default project bq-reader-diff-billing-account will be selected automatically. When he navigates to BigQuery web UI he will not see bq-data-host project in Resources and of course he will not be able to see bq-data-host.dataset_to_share_views dataset that we just shared.
Even when he tries to find bq-data-host project in the drop-down on the top bar of the GCP console or by using its Search projects and folders textbox, he won’t be able to find it.
A solution could have been to give Mehdi the Browser or the Viewer role. This will give him the possibility to get and list bq-data-host project. But as we said before, we do not want, Mehdi, the data analyst to know any thing about bq-data-host apart from the datasets created in this project and shared with him.
The solution will be to pin the project manually : First Mehdi has to enter manually the following URL : https://console.cloud.google.com/bigquery?project=bq-data-host&page=project. bq-data-host project will show in Resources and no GCP project will be selected by default in the top bar. Then Mehdi has to click PIN PROJECT near the right side of the window as shown below :
Now, any time Mehdi accesses his BigQuery web UI, his default GCP project will be selected in the top bar and he will see bq-data-host project present in the Resources tree and can start querying it.
As you can see in the screenshot above Mehdi has his default project bq-reader-diff-billing-account selected in the top bar, he can now see the bq-data-host project on the left and he even can write a query on the view bq-data-host.dataset_to_share_views.trips_with_stations_status_view.
When he does, 💥 😮 he encounters the validation error below :
The query validator points out an error message:
Access Denied: BigQuery BigQuery: Permission denied for table: bq-data-host:dataset_source_1.austin_bikeshare_stations
Remember that the view queried by Mehdi is a join between bq-data-host:dataset_source_1.austin_bikeshare_stations and bq-data-host:dataset_source_2.austin_bikeshare_trips which are 2 tables hosted by 2 different datasets than the dataset hosting the view. The error message is clear : Mehdi is missing some specific permissions to be granted to him on dataset_source_1 dataset_source_2.
To identify that or those required permissions, let’s start by granting Mehdi BigQuery Metadata Viewer, the predefined BigQuery role, on dataset_source_1 and dataset_source_2 . As you can read in BigQuery access control documentation, this role has the least set of permissions for BigQuery with the permissions below () :
Now, Mehdi can see dataset_source_1 and dataset_source_2 in the Resources tree and their tables but he still cannot run his query on the view bq_data_host.dataset_to_share_views.trips_with_stations_status as you can see in the screenshot below
So let’s be more permissive and give Mehdi a little broader permissions by granting him BigQuery Data Viewer predefined role. It has 3 more permissions compared to the previous role as you can see below in the full list of permissions:
As you can see the screenshot below Mehdi can now query the view bq_data_host.dataset_to_share_views.trips_with_stations_status
But remember he has now BigQuery Data Viewer BigQuery predefined role on dataset_source_1 and dataset_source_2 which means that he can query directly the tables hosted by those datasets shared with him as you can see in the screenshot below :
But, as we stated at the beginning of this post : Mehdi, as data analyst, SHOULD NEVER BE AWARE OF HE SOURCE DATASETS NOR BE ABLE TO QUERY ITS TABLES: we are only sharing with him specific views through a specific dataset for his specific use case.
So we need to find the strict minimum set of permissions, that can fulfil our requirements without giving much.
Cloud IAM custom roles
Custom roles enable you to enforce the principle of least privilege, ensuring that the user and service accounts have only the permissions essential to performing their intended functions. For more details dont hesitate to read the documentation.
The idea is to identify the strict minimum set of permissions needed to make the view query run without making available its source datasets
Let’s think about it : BigQuery Data Viewer role has the permissions needed to make this happen ; BigQuery Metadata Viewer did not. So the required permissions should be among the 3 extra permissions belonging to the difference between BigQuery Data Viewer role BigQuery Metadata Viewer role and here they are :
In our case we are not dealing with models and we are not exporting data from tables, so bigquery.tables.getData permission looks like a good candidate.
So let’s start by creating a custom role, I called it CustomViewsAccessor 😜, and assigned to it this only permission. We create the role on bq-data-host GCP project because it’s the only project where we will be using this role. Otherwise I would have created it on the organisation level 😉 (please read the documentation for more details).
Now lets grant our newly created custom role to Mehdi on both source datasets : dataset_source_1 and dataset_source_2 as showed below
And now when we query our view bq-data-host.dataset_to_share_views.trips_with_stations_status_view, outside its hosting project, it runs without any problem. We are lucky because one permission and one try and things started working, in other cases I would have tried many permissions. More importantly, only the dataset that has been intended to be shared with Mehdi is appearing and there is no trace for the source datasets in the Resources tree on the left as you can see below.
As with my previous article “Capping costs on GCP”, the solution changed while I was writing the blog post 😅. At the beginning the solution was to grant BigQuery Data Viewer and BigQuery Job User roles on the source datasets, but as you saw in the post the solution has become more straightforward, I mean with smaller list of permissions (one permission to be precise) within custom role.
Now that you know how to “share views” with different groups with different needs, I bet you are asking who will pay, or to whom will you send the bill, when for example data analysts query data hosted by a project other than theirs. As a teaser : the answer is not as straightforward as you imagine 😄 So I will explain it in a following post with details and proofs 😉