Sandboxing our client reports
--
How we use BigQuery’s table-valued functions to prevent data leakage
About Teads
Teads, the Global Media Platform, is a cloud-native advertising platform that helps Publishers monetize their editorial content with premium Advertiser demand.
Teads Ad Manager (TAM)
TAM allows advertisers to create ad campaigns. Using its reporting module, advertisers can create reports to monitor their campaigns.
Events centralization
A typical use case is a viewer who browses a publisher's site and sees an ad.
It generates metrics that we centralize in our data warehouse (which is hosted on Google Cloud Platform BigQuery).
Rollup generation
The events are centralized in a log on BigQuery. To ease querying them, they are then pre-aggregated in rollups.
For example, the following logs:
╭--------┬----------------╮
│ user │ ad │
├--------┼----------------┤
│ Alice │ red sneakers │
│ Bob │ smartphone │
│ Alice │ red sneakers │
│ Alice │ blue sneakers │
╰--------┴----------------╯
will be aggregated in rollups like this:
╭--------┬----------------┬--------╮
│ user │ ad │ views │
├--------┼----------------┼--------┤
│ Alice │ red sneakers │ 2 │
│ Bob │ smartphone │ 1 │
│ Alice │ blue sneakers │ 1 │
╰--------┴----------------┴--------╯
By aggregating like this we divide the number of daily lines by ~ 10.
Still, after this compression, we have billions of lines in one day.
Reporting service
Our reporting service queries rollups so that advertisers can get information about their campaign, for example, how many times their ads were viewed, using Reporting service.
For more info on reporting, see:
Reporting also joins this fact data with referential data about users and permissions to know what data the user can access.
The reporting service returns CSV / Excel reports to clients based on the rollup data in BigQuery.
Here is an example of what Alice’s report might look like :
╭---------------┬----------╮
│ ad │ views │
├---------------┼----------┤
│ red sneakers │ 42 042 │
├---------------┼----------┤
│ blue sneakers │ 43 043 │
╰---------------┴----------╯
Risk mitigation
We want to protect from cross-user data leakage.
In the example below, even in case of a bug on the reporting side, Alice should never be able to view Bob’s data in her reports.
Note that we have existing controls in place to avoid such data leakage (Unit/Integration Tests etc), this is about how we add another layer of security, which will:
- guarantee that a user access only data she is allowed to
- reflect permissions changes for the user in a matter of minutes
The obvious approach: creating Silos
Instead of having one big rollup table, each tenant could get its own table.
This approach has clear benefits in terms of isolation and propagating user identities in Bigquery is not mandatory.
However, its drawbacks lead us to exclude it because it would:
- double the storage costs as we basically fan out the full rollup table in per-tenant table with all the same content in it
- create thousands of tables that would need to be managed and refreshed
- introduce extra latency in the pipeline since each table would be built upon the existing rollup table. Moreover, cross-seat can be implemented fully on reporting service side or, preferably, using a joint table which would introduce further latencies
A cost-effective approach: Row-level security
As per documentation: https://cloud.google.com/bigquery/docs/row-level-security-intro:
«Row-level security lets you filter data and enables access to specific rows in a table based on qualifying user conditions.»
This is not possible because we’d need IAM groups / users in the order of 10 thousand, but a table can only have up to 100 row access policies.
Rollup data is read-only, but permission changes often, so we would have to update the roles in real-time.
Scaling row-level security: an overview
To implement the row level-security at our scale, we used:
- a dynamic permission mapping table
- façade SQL functions to access to the rollup through the permission mapping
- authorizations to forbid direct access to the rollup
Creating a permission mapping table
When Alice requests her data, we check in our database that she is authorized. Then we create a token for her in a permission mapping table. This token has a validity range, with a start timestamp and an end timestamp.
Table-valued Functions
In BigQuery:
A table function, also called a table-valued function (TVF), is a user-defined function that returns a table. You can use a table function anywhere that you can use a table. Table functions behave similarly to views, but a table function can take parameters.
https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions
Declaring the table-valued function
On BigQuery, we create a table-valued function that takes the token as a parameter and joins the data table (rollup) with the permission_mapping table.
Rollup_TVF(token)
:
select r.*
from
rollup as r
inner join permission_mapping as pm
on r.user = pm.user
and token = pm.token
and pm.start_ts >= current_timestamp()
and current_timestamp() <= pm.end_ts
For example, with :
Rollup
------
╭--------┬----------------┬--------╮
│ user │ ad │ views │
├--------┼----------------┼--------┤
│ Alice │ red sneakers │ 42 042 │
│ Bob │ smartphone │ 10 001 │
│ Alice │ blue sneakers │ 43 043 │
╰--------┴----------------┴--------╯
Permission Mapping
------------------
╭--------┬----------------┬----------┬--------╮
│ user │ token │ start_ts │ end_ts │
├--------┼----------------┼----------┼--------┤
│ Alice │ alices-token-8 │ 40 │ 60 │
│ Bob │ bobs-token-4 │ 10 │ 30 │
╰--------┴----------------┴----------┴--------╯
At timestamp 50
, running TVF(token = “alices-token-8”)
we will get:
Rollup_TVF(token = "alices-token-8")
------------------------------------
╭--------┬----------------┬--------╮
│ user │ ad │ views │
├--------┼----------------┼--------┤
│ Alice │ red sneakers │ 42 042 │
│ Alice │ blue sneakers │ 43 043 │
╰--------┴----------------┴--------╯
Because of the end_ts
, starting at timestamp=61
, the reporting service will get an empty answer.
Reporting service can then query this “parameterized view” with the columns/rows it actually needs, then do the data post-processing it needs to present the results Alice wants.
The TVF is set up via Terraform, so even in the case code has a regression, access is still protected.
TVF access
To limit what service-reporting can access, we use authorized functions:
Authorized functions let you share query results with particular users or groups without giving those users or groups access to the underlying tables
https://cloud.google.com/bigquery/docs/authorized-functions
The TVF has access to the rollup table and permission mapping table.
Reporting only has access to the TVF, but has no right to directly access rollups. That way we ensure reporting can only query relevant data for the current user.
Putting it all together
Here is the whole process when Alice asks for her report:
- reporting service asks for a token
- api-domains service checks the user and creates a token in the permission mapping table
- api-domains service returns the token to reporting service
- reporting service queries the rollup data it needs via the TVF
Performance measurement
While the solution described above aims at improving security, we got extra benefits regarding speed and cost of processing.
The TVF is pre-filtering with a clustered field (the user id) and we don’t need to join on many tables to do the authorization checks.
This reduces significantly the total number of bytes our queries scan.
For example, on the test report below we divide the total bytes processed by BigQuery by more than two:
Conclusion
Due to its success, we will continue to expand further this pattern to all reporting features of our products.
Thanks to Rémy Saissy, Quentin Fernandez, Laura Diment, Matthias Kunter, Alban Perrillat Merceroz for their help in working on this feature and writing this article.
Thanks to Lucas Pamato for making the illustrations.