Enterprise Data Warehouse
Lessons learnt manipulating BigQuery export schemas in Snowflake
Balancing schema flexibility and ease of use within your data warehouse
Google Analytics dominates* the web analytics market. Similarly, if you’re a multi-channel business, you’re likely to be using Google’s Firebase SDK to capture event data on mobile devices.
That aside, BigQuery itself does not have such an influence over the data warehousing market.
Estimates online suggest it controls somewhere between 3% and 12.5% of the market, with other major players such as Snowflake, Redshift, Apache Hive and SAP Business Warehouse picking up a piece of the pie.
Hold up, I’m new to this, what’s the relationship between Google Analytics, Firebase and BigQuery?
- Google Analytics is a means for capturing and analysing user-interaction data on the web
- Firebase is a Google product that provides a means for capturing event data and user properties from mobile apps
- BigQuery is Google’s fully-managed, serverless data warehouse. Both Google Analytics and Firebase provide tools to configure exports into BigQuery, so you can analyse your data using BigQuery SQL
So where are we going with this? Well, it’s not uncommon for a business to want to land their Google Analytics or Firebase data in a warehouse outside Google Cloud Platform. At DAZN, we utilize Snowflake and want to land all of our first and third-party datasets in this location. Ultimately, having all of your data stored in one place will help the data community make better decisions.
If this sounds familiar, at some point you will have to consider the destination schema for migrating that data outside GCP. If you’re familiar with the export schemas, perhaps the biggest question is how to handle those nested and repeated fields. Should you align with the default BigQuery export schema, leaving it to your users to navigate the complex data structures? Or could you take a different approach, to help improve consumer productivity?
In this article, I’m going to explore some of the mistakes we’ve made in the past, and where you can perhaps find a good equilibrium that balances schema flexibility and ease of use for downstream analysts and data-scientists.
Consider Firebase, what is the structure of a raw event in BigQuery?
- Each row corresponds to a single event captured by the Analytics SDK (e.g. a screen-view event, or a user interaction)
- Each record consists of a dynamic set of event parameters and user properties, represented as a complex data type (array of structs)
- Event params / User properties may differ depending on the event type, and are likely to change or update frequently as new analytics events are added to the product
- The advantage of these complex data types, is the inherent flexibility that they bring. Specifically, the schema remains consistent, regardless of any new parameters that are included in events
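To make this concrete, a single event_params array in the raw export looks something like the following (the keys and values here are invented for illustration, but the key/typed-value struct shape is the one the Firebase export uses):

```json
[
  { "key": "ga_session_id",        "value": { "int_value": 123456789 } },
  { "key": "firebase_screen",      "value": { "string_value": "Home" } },
  { "key": "engagement_time_msec", "value": { "int_value": 4500 } }
]
```

Note that each element carries one typed field (string_value, int_value, float_value or double_value) and the rest sit empty, which is something we’ll come back to later.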
It sounds like Google have it sorted, what seems to be the problem?
Although the schema is great for capturing and storing data, it requires a little more wrangling when it comes to querying the data. Common analytics use cases will require you to filter events on the dynamic set of key/value pairs. For example, specific user properties or event parameters.
In BigQuery, you can use the UNNEST function to convert an array into a set of rows (a process commonly known as flattening). This explodes the array vertically, producing a row for each element in the array.
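As a sketch (the table name is illustrative, not a real project), flattening event_params to find a particular parameter looks like this:

```sql
-- Explode the event_params array: one output row per array element
SELECT
    event_name,
    params.key,
    params.value.string_value
FROM `my-project.analytics_123.events_20210908`,
UNNEST(event_params) AS params
WHERE params.key = 'firebase_screen'
```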
Similarly, if you are a Snowflake customer, you can achieve the same result via the FLATTEN function. Once flattened, it becomes simpler to filter on a specific key within the event (for example, ga_session_id or firebase_screen).
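The Snowflake equivalent, assuming the raw export has been landed with event_params in a VARIANT column (the table and column names below are ours, for illustration):

```sql
-- Explode the VARIANT array with LATERAL FLATTEN: f.value is one array element
SELECT
    t."event_name",
    f.value:key::STRING                 AS key,
    f.value:value:string_value::STRING  AS string_value
FROM raw.firebase_events t,
LATERAL FLATTEN(input => t."event_params") f
WHERE f.value:key::STRING = 'firebase_screen'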
This is all good, however, things start to look a little crazy when you want to filter on multiple event criteria…
SELECT
    event_name,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE KEY = 'ga_session_id') AS ga_session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE KEY = 'firebase_screen_class') AS firebase_screen_class
FROM analytics.firebase_events  -- table name is illustrative
WHERE event_timestamp = 1631068354343000
  AND event_name = 'screen_view'
There is nothing fundamentally wrong with keeping the data in its raw format. In fact, it probably makes sense if you are only going to land the data in your lake. As demonstrated, we can still extract the same insight, but the query syntax does feel overly complicated.
More importantly, if you’re going to land this data in your warehouse, you need to consider the business implications of keeping the schema in its default format:
- Who will be using this data? For example, can I expect a Product person with intermediate SQL skills to query this? What is the level of competency across my data/analytics organisation?
- Which tools will process the data? Are my visualisation tools (e.g. Looker, Tableau) able to extract and process values from nested data structures?
- What is the impact on compute? For example, if my data volume grows very large, will the number of analysts regularly flattening large time periods have a significant $ cost implication for my org?
- How can I govern the attributes being loaded? To support data discovery, we should be capturing metadata for all nested fields. But, how do we display that information to users, if 50 event parameters are being passed into a single column in our destination schema?
It’s these specific questions that lead us to our first experiment with modifying the default schema structure…
Going nuclear, a fully flattened approach…
The TL;DR version of this is that it’s not a good idea. It certainly has some benefits but, as we found out, it is likely to become a nuisance for your data engineering team to manage operationally.
Before we get into the reasons why, let’s explore what this approach actually looks like…
The principle is that all properties within repeated structures are exploded horizontally. Additionally, in the case of GA [Universal Analytics], which is aggregated to a session-level, you may also want to explode your sessions vertically, so that you can conduct hit (or event) level analysis more easily.
In the case of Google Analytics [UA*], this means you end up with a table containing n columns, where n represents the total number of indexes you have across Session Custom Dimensions, Hit Custom Dimensions and Hit Custom Metrics. The resultant dataset is likely to be very wide.
Similarly, if you take the same approach with Firebase data, n becomes the total number of distinct keys across both your user properties and event params.
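To make the horizontal explosion concrete, the ingestion DML ends up as one expression per known parameter, something like the sketch below (parameter and table names are examples):

```sql
SELECT
    event_name,
    (SELECT value.int_value    FROM UNNEST(event_params) WHERE key = 'ga_session_id')         AS ga_session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'firebase_screen')       AS firebase_screen,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'firebase_screen_class') AS firebase_screen_class
    -- ...and so on: one column per parameter, edited every time tracking changes
FROM `my-project.analytics_123.events_*`
```

Every new analytics event means another line (or several) in this statement, plus a matching column in the destination DDL.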
There are a few benefits of this approach:
- Governance: Each time a new dynamic property is introduced, it requires an update to the destination schema. This process ensures metadata is captured within your data catalogue upfront, and makes it easy to audit when that attribute was introduced.
- Computation: Flattening is completed only once by the data engineering team responsible for the ingestion of data into the warehouse.
- Ease of use: Downstream consumers needn’t worry about the underlying, complex data structure, and can focus on extracting insight from a columnar table that will be generally performant.
But, there are some serious downsides, which became apparent after only a short while:
- Operational overhead: If you’re a fast-paced org, new analytics events will be added to the Product weekly. Each time this happens, it could require an update to both your schema, and the DML that maps the source attribute to destination. Unless your data engineering team have few other priorities, this is generally unmanageable at scale.
- Table bloat: Within just 12 months, our tables consisted of hundreds of columns. As a consumer, this isn’t a great user experience; new columns are appended to the end of the table, which erodes any logical column ordering.
We quickly became a bottleneck for the business, unable to keep up with the rate of tracking implementation (which spans many different engineering teams working on different areas of the product).
With that in mind, we went back to the drawing board…
To the rescue… OBJECT_AGG() & downstream data products
Many businesses will be loading user behavioural data into a raw layer, with a view to building downstream data products that aggregate data from multiple sources. Tools like dbt (data build tool) are becoming increasingly popular for performing the ‘T’ in ELT (Extract, Load, Transform), once your raw data has landed.
With that in mind, the most important lesson we had learnt by this point is that raw data coming into the warehouse must be fresh.
What we mean by fresh is that when a new dynamic property is added to an event in GA/Firebase, there should be zero lead time before that attribute is available in your raw layer. That means, regardless of whether you’re able to address any updates to transformation pipelines immediately, you’re not reducing visibility within the raw data. Perhaps Google had it figured out all along… time to revert to a schema that aligns with the source!
However, instead of proposing a purely ELT approach, there is perhaps one minor transformation you can make when loading the data into your warehouse. It will improve the ease of querying in the raw layer, which will benefit ad-hoc analysis, and it won’t impact the schema’s flexibility.
Here is how Snowflake describes the function:
Returns one OBJECT per group. For each (key, value) input pair, where key must be a VARCHAR and value must be a VARIANT, the resulting OBJECT contains a key:value field.
Through the combination of this elegant function and LATERAL FLATTEN, we’re able to manipulate those nested and repeated fields into a new structure that can be easily traversed using dot notation. The best part is that consumers needn’t worry about having to flatten the object themselves.
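A minimal sketch of that load-time transformation (the table and column names are ours, and a real pipeline would group by a unique event identifier rather than timestamp and name):

```sql
-- Collapse the key/value array into a single OBJECT per event.
-- COALESCE picks whichever typed value field is populated in each struct.
SELECT
    t."event_timestamp",
    t."event_name",
    OBJECT_AGG(
        f.value:key::STRING,
        COALESCE(f.value:value:string_value,
                 f.value:value:int_value,
                 f.value:value:float_value,
                 f.value:value:double_value)
    ) AS event_params
FROM raw.firebase_events t,
LATERAL FLATTEN(input => t."event_params") f
GROUP BY 1, 2
```

The flattening happens exactly once, at load time, and the output object is what downstream consumers query with dot notation.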
Awesome! At this stage, I hope this nugget of information is enough to have made the read worthwhile. As you can see, we’ve got rid of the outer array, making it super simple to filter on any of the dynamic key/value pairs. Additionally, we’ve removed the redundant data types that appear within the BigQuery structs.
The result is a cleansed yet flexible object, whereby multiple keys can be queried with very basic syntax. Let’s revisit that example from earlier, and assume we’re using Snowflake to query our output…
SELECT "event_params":ga_session_id::INT AS ga_session_id
FROM analytics.firebase_events  -- table name is illustrative
WHERE "event_timestamp" = 1631068354343000
  AND "event_name" = 'screen_view'
Now that looks a little easier to follow! Plus, our destination schema in Snowflake will remain consistent, regardless of any new dynamic properties added to analytics events. Most importantly, we have lowered the barrier to querying the raw data for the entire data community.
To summarize, how has OBJECT_AGG() helped us?
- It solves the schema flexibility issue
- It solves the computation issue
- It solves the ease of use issue
- It does NOT solve the governance issue!
Regarding that last point, there is no denying that the fully flattened approach had the edge with regard to governance. However, the three issues we have solved probably outweigh this loss.
Tackling the governance aspect is something for another post. For now, I think I’ve rambled enough!
There is another great post written here, which details flattening the BigQuery export schema into multiple, relational tables. If you found this post interesting, I encourage you to check this out too!