Data Governance Using Aggregation Policies in Snowflake

Several years ago I spent some time with a Canadian financial institution working with housing price data. This data was provided to us through surveys of housing sales, and it was considered pretty confidential. We wanted to make the aggregate results of these surveys available to the general public, but faced a conundrum: in certain situations doing so might allow a savvy reader to infer the price of a particular house sale. This was due to the fact that in certain geographic areas the number of sales in a particular period of time might be very low, and someone living in this area might know, simply through observation, which houses had sold, and be able to reverse engineer the price of a particular house from our data.

Our policy was that if the number of sales in the lowest level of the geographical area was below a certain number, we could not provide an aggregate of the sales (in this case, an average) in that area. This was felt to be an adequate safeguard against inadvertently revealing the details of any particular house sale. But it was a little difficult to implement, because we needed to be sure that overall the aggregation at higher levels was correct, but was represented as null if the query was done at lower levels.

As part of its robust data governance features Snowflake provides an simple, out-of-the-box solution called Aggregation Policies which will help you solve many of these kinds of problems. The use case is: you want some users to see the full data set, but other users need to be restricted to higher levels of aggregation, and all users need to be confident that their aggregates are correct.

Note: Aggregation Policies are available in Enterprise Edition of Snowflake and above.

Conceptually, Aggregation Policies are pretty simple: they are just a security object that ensures a couple of things:

  1. That when the underlying object is queried, it must be done through an aggregation of some kind.
  2. That the aggregation is made up of a certain minimum number of rows.

(Additionally, these rules can then be extended to apply to entities within a table i.e. rows with common attributes attributes — more on this below).

Let’s start with a simple set of data representing sales within a geographic hierarchy:

In Canada, surveys are often broken down according to a hierarchy established by Statistics Canada, a department of the federal government. In this case I have used completely fictional data, including the Census Tracts and Dissemination Area codes. In reality a Census Tract is an area that contains about 2500 people, and a Dissemination Area is a sub-division of a Census Tract and contains about 400–700 people.

Our rule will state that if we are attempting to aggregate by 3 or fewer rows, we will not see the results for the rows that would be aggregated at this level. First, lets create the rule:

CREATE AGGREGATION POLICY housing_aggregation
AS () RETURNS AGGREGATION_CONSTRAINT ->
CASE WHEN CURRENT_ROLE() = 'AGGREGATION_RESTRICTED_DATA_ANALYST'
THEN AGGREGATION_CONSTRAINT(MIN_GROUP_SIZE => 3)
ELSE NO_AGGREGATION_CONSTRAINT()
END

This is a conditional aggregation policy. What it is saying is: if this user is in the role AGGREGATION_RESTRICTED_DATA_ANALYST, then constrain their results as follows:

  1. They must use aggregation when querying the data
  2. The results of the aggregation must use at least 3 rows

Otherwise (i.e. if the user is not in this role) no aggregation policy is applied.

So how does this actually behave? First, I change to a role called AGGREGATION_RESTRICTED_DATA_ANALYST, and just query the data:

USE ROLE AGGREGATION_RESTRICTED_DATA_ANALYST;

SELECT * FROM HOUSING_SURVEY_DATA;

Result:

Ok, lets try finding the average sale price, first by Census Tract:

SELECT CENSUS_METROPOLITAN_AREA
,CENSUS_TRACT
,AVG(SALE_PRICE)
FROM HOUSING_SURVEY_DATA
GROUP BY CENSUS_METROPOLITAN_AREA
,CENSUS_TRACT

We have no problem with this query, because each Census Tract has at least 3 rows. But what if I try to aggregate at the Dissemination Area?

SELECT CENSUS_METROPOLITAN_AREA
,CENSUS_TRACT
,DISSEMINATION_AREA
,AVG(SALE_PRICE)
FROM HOUSING_SURVEY_DATA
GROUP BY CENSUS_METROPOLITAN_AREA
,CENSUS_TRACT
,DISSEMINATION_AREA

Notice I get the average price for Dissemination Area 1001–01, 1002–01 and 1002–03, but then I get a general average for all the Dissemination Areas that have 3 or fewer rows, with nulls in the other columns.

Now lets change the rules a bit to explore the concept of adding an entity to the aggregation restriction.

An entity is just all the rows that are identified by a certain value in a column. In our example above, we can consider Dissemination Area as an entity.

So to change our rules a bit, let’s say we don’t want to just restrict the aggregation based on the number of rows, but we want to force the aggregation, at whatever level it is done, to include a certain minimum number of discrete entities. For example, we want everything that is aggregated to include at least 3 different Dissemination Areas.

Let’s look at our raw data again:

Notice that Census Tract 1001 has 4 different Dissemination Areas, spread over 6 rows. Census Tract 1002 has 2 different Dissemination Areas, also spread over 6 rows. According to our new logic, we want to be able to aggregate at the level of Census Tract 1001, because it contains both

  • 3 or more rows
  • and 3 or more different Dissemination Areas

…but we don’t want to be able to aggregate Census Tract 1002, because while it also contains 3 or more rows, these rows only break down into 2 different Dissemination Areas.

This is easily accomplished by adding a Entity qualifier to the ALTER TABLE statement that assigns the Aggregation Policy:

ALTER TABLE HOUSING_SURVEY_DATA SET AGGREGATION POLICY housing_aggregation
ENTITY KEY (DISSEMINATION_AREA);

We simply add an entry that identifies the Entity Key for this table, in this case the Dissemination Area.

Now when we aggregate at the Census Tract level, we get this:

We no longer see Census Tract 1002, even though it meets our first criteria of having more than 3 rows, because it does not have enough Dissemination Areas to be aggregated.

Aggregation Policies are another great tool in the increasingly versatile Snowflake toolkit for data governance.

--

--