Exploring Compound Value Dimensions with Snowflake and Looker

Brendan Frick
Published in GumGum Tech Blog · 8 min read · Apr 29, 2022


An article by Ying Liu and Brendan Frick

Indiana Jones with the compound value Golden Idol

Background

Dimensions and Explores

When modeling data in Looker, Tableau, and similar BI tools, we think of our data in terms of measures and dimensions. Measures are quantitative fields that can be aggregated (views, clicks, number of videos fully completed, etc.). Dimensions are qualitative fields that can be used to categorize groups (timestamp, page URL, country, etc.).

Single value dimensions

For most of our dimensions, the features have a one-to-many relationship that makes data modeling easy. One-to-many means each row in our table has exactly one value for that dimension, but that value can appear many times in the table. We can easily represent one-to-many features as single value dimensions.

Example of a one-to-many relationship, using GumGum data showing the number of impressions for a video URL

This relationship is intuitive and efficient to model in both SQL and BI tools. If a user wants to show a measure (# of impressions) by a dimension (video url), we only need to expose these as default objects, and the user will be able to natively select the set they want to use.

-- get the number of impressions (measure) per video_url (dimension)
SELECT video_url, SUM(num_impressions) -- aggregate measure
FROM my_table
GROUP BY video_url -- group by dimension

Compound value dimensions

However, there are some features that are much more complex to model because they have a many-to-many relationship.

In a many-to-many relationship, each row could have many values for a field, and that value can occur in many rows. There are many approaches to modeling this type of data, but one we have become fond of is using compound values (ARRAY, JSON, and other semi-structured data types) within the row.

Example data for a many-to-many relationship using compound values

This relationship is much more difficult to explore and report on because the many-to-many representation violates a lot of identity principles, and more advanced SQL is required to transform the data.

SELECT ?, SUM(num_impressions)
FROM my_table
GROUP BY ?

The Task: Make Compound Value Dimension Reporting Easier

We have been using compound values in Snowflake for several years and have had success efficiently modeling our data with them. However, exploring and reporting on these dimensions was still clunky, costly, and confusing.

In our previous implementation, we persisted an exploded version of the data for each representation. That meant replicating each row n times (where n is the number of values in the compound dimension), storing it in a table, and building a reporting layer in Looker for that representation.

Example data to demonstrate an exploded compound value dimension
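For illustration, here is a minimal sketch of that previous pattern, assuming a source table shaped like the video_contextual_data demo table created later in this post (video_keyword_exploded is a hypothetical table name):

-- Persist one exploded table per representation, e.g. one row per
-- (video_url, keyword) pair, materialized during ETL.
create or replace table video_keyword_exploded as
select
    v.video_url,
    kw.value::varchar as keyword,
    v.counts
from
    video_contextual_data v,
    lateral flatten(input => v.keywords) kw;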

Some of the major pain points of this model include:

  • Data Engineering has more tables to manage
  • There is more data to process and store
  • Stakeholders lose flexibility in exploring data
  • Stakeholders need to decide they want a data representation before they get to see it

We tasked ourselves with developing a dynamic, generalizable, intuitive, and efficient Looker/Snowflake approach that could represent the values in compound value dimensions in reporting and data exploration. We also wanted this solution to leverage the power of semi-structured data types in Snowflake.

Reporting Implementation

In this demonstration, we will walk the full path: modeling the data in Snowflake, querying the compound value dimensions as single value properties, formalizing the approach in a Looker explore, and answering some common reporting questions in Looker.

Staging semi-structured data in Snowflake

For this demo, we create a table containing some fake data with semi-structured data types such as variants and arrays.

create or replace table video_contextual_data as
select
    column1 as video_url,
    parse_json(column2) as sentiments,
    split(column3, '|') as iab_v2_categories,
    split(column4, '|') as keywords,
    column5 as counts
from values
    ('https://fakedata1.mp4',
     '{"POSITIVE":0.11,"NEGATIVE":0.07,"NEUTRAL":0.82}',
     '386|388|379|380|383',
     'aa|cc|bb|gg|dd',
     300),
    ('https://fakedata2.mp4',
     '{"POSITIVE":0.9,"NEGATIVE":0.05,"NEUTRAL":0.05}',
     '48|42',
     'bagel|way|morning|life',
     100),
    ('https://fakedata3.mp4',
     '{"POSITIVE":0.5,"NEGATIVE":0.2,"NEUTRAL":0.3}',
     '640|641|642',
     'milk|soymilk',
     50),
    ('https://fakedata4.mp4',
     '{"POSITIVE":0.3,"NEGATIVE":0.6,"NEUTRAL":0.1}',
     '198|168',
     'hotpot|dumpling|rice',
     88),
    ('https://fakedata5.mp4',
     '{"POSITIVE":0.1,"NEGATIVE":0.05,"NEUTRAL":0.85}',
     '453|459|460',
     'baozi|yyds|burger',
     299);

When we query video_contextual_data, we see a couple of ARRAY columns and a JSON-like column.

select * from VIDEO_CONTEXTUAL_DATA;
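For reference, the first row comes back roughly like this (display formatting is approximate; Snowflake renders the VARIANT and ARRAY values as JSON text):

-- VIDEO_URL:          https://fakedata1.mp4
-- SENTIMENTS:         { "NEGATIVE": 0.07, "NEUTRAL": 0.82, "POSITIVE": 0.11 }
-- IAB_V2_CATEGORIES:  [ "386", "388", "379", "380", "383" ]
-- KEYWORDS:           [ "aa", "cc", "bb", "gg", "dd" ]
-- COUNTS:             300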

Querying semi-structured data in Snowflake

In order to effectively explore the data, we need to utilize LATERAL FLATTEN statements. This allows us to explode the data out of ARRAY types and into rows.

For example, in the query below we flatten the iab_v2_categories and keywords columns into new rows, while keeping total_counts and positive_score in-line.

select
    iabv2.value::varchar as iab_v2_category,
    keywords.value::varchar as keyword,
    sentiments:POSITIVE::float as positive_score,
    sum(counts) as total_counts
from
    VIDEO_CONTEXTUAL_DATA v,
    lateral flatten(input => v.iab_v2_categories) iabv2,
    lateral flatten(input => v.keywords) keywords
where
    positive_score > 0.5
group by
    1, 2, 3
order by 1;
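With the demo data above, only fakedata2.mp4 clears the positive_score filter, so the result fans out to 2 categories x 4 keywords = 8 rows, each carrying total_counts = 100. A quick way to sanity-check the fan-out per video:

-- Count how many rows each video explodes into (categories x keywords).
select
    v.video_url,
    count(*) as exploded_rows
from
    VIDEO_CONTEXTUAL_DATA v,
    lateral flatten(input => v.iab_v2_categories) iabv2,
    lateral flatten(input => v.keywords) keywords
group by 1
order by 1;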

Applying LATERAL FLATTEN on arrays or variants is pretty straightforward for Snowflake users. But for stakeholders who just want to focus on business insights, there are a lot of complexities involved.

With some simple LookML magic, stakeholders can explode those columns on the fly with a single click, without ever seeing the complexity beneath the surface.

Dynamic Lateral Flatten Implementation in Looker

We will be referencing a few LookML concepts in this section, including explores, views, and native derived SQL.

The key principle of this implementation is treating the LATERAL FLATTEN statement as another view which can be lazily joined back to the original view containing unexploded columns.

In order to perform the LATERAL FLATTEN dynamically, we use native derived SQL to define this explore. We do this so we can take advantage of the lazy join feature, meaning the LATERAL FLATTEN only happens when it is required, i.e. when the compound value dimension is actually referenced.

When stakeholders choose an explodable compound value dimension, the LookML intelligently explodes this value into separate rows. This allows Looker to model the individual values in the compound value dimension as single value dimensions for reporting purposes. Because this happens prior to any aggregation in the SQL engine, measures can be aggregated as usual.

video_context.view.lkml

view: video_contextual_data {
  sql_table_name: VIDEO_CONTEXTUAL_DATA ;;

  dimension: video_url {
    type: string
    sql: ${TABLE}."VIDEO_URL" ;;
  }

  dimension: sentiments {
    type: string
    sql: ${TABLE}."SENTIMENTS" ;;
  }

  # get from variant column sentiments
  dimension: positive {
    type: number
    sql: ${TABLE}.sentiments:POSITIVE::float ;;
  }

  # get from variant column sentiments
  dimension: negative {
    type: number
    sql: ${TABLE}.sentiments:NEGATIVE::float ;;
  }

  # get from variant column sentiments
  dimension: neutral {
    type: number
    sql: ${TABLE}.sentiments:NEUTRAL::float ;;
  }

  # original format of iab_v2_categories
  dimension: iab_v2_categories {
    type: string
    sql: ${TABLE}."IAB_V2_CATEGORIES" ;;
  }

  # original format of keywords
  dimension: keywords {
    type: string
    sql: ${TABLE}."KEYWORDS" ;;
  }

  measure: total_counts {
    type: sum
    sql: ${TABLE}."COUNTS" ;;
  }
}

# placeholder view for the lateral flatten sql that explodes the iab_v2_categories column
view: iab_v2_categories {
  dimension: iab_v2_category {
    type: string
    # explicit cast to string, otherwise the value will be double quoted
    sql: ${TABLE}.value::string ;;
  }
}

# placeholder view for the lateral flatten sql that explodes the keywords column
view: keywords {
  dimension: keyword {
    type: string
    # explicit cast to string, otherwise the value will be double quoted
    sql: ${TABLE}.value::string ;;
  }
}

explore: video_contextual_data {
  join: keywords {
    view_label: "Video Contextual Data"
    sql: ,lateral flatten(input => video_contextual_data.keywords) keywords ;;
    relationship: one_to_one
  }

  join: iab_v2_categories {
    view_label: "Video Contextual Data"
    sql: ,lateral flatten(input => video_contextual_data.iab_v2_categories) iab_v2_categories ;;
    relationship: one_to_one
  }
}

As you can see, in this explore we can explode a compound value column into single value columns that can then be used in reporting.
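To make the lazy join behavior concrete, here is a rough sketch of the kind of SQL Looker generates from this explore (the actual generated SQL will differ in aliasing and formatting):

-- 1) Only video_url and total_counts are selected: no flatten is emitted.
select
    video_contextual_data.video_url,
    sum(video_contextual_data.counts) as total_counts
from VIDEO_CONTEXTUAL_DATA as video_contextual_data
group by 1;

-- 2) The keyword dimension is also selected: the join's LATERAL FLATTEN is appended.
select
    video_contextual_data.video_url,
    keywords.value::string as keyword,
    sum(video_contextual_data.counts) as total_counts
from VIDEO_CONTEXTUAL_DATA as video_contextual_data
    ,lateral flatten(input => video_contextual_data.keywords) keywords
group by 1, 2;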

Using the Explore

With this explore, users can freely peruse exploded versions of the compound value dimensions, without having to pay any special attention to the fact that the original dataset is stored as a compound value dimension. Looker users can easily ask their questions without knowing any of the complexities underlying the data.

“Show me the total impression counts per video category.” For this request, the explore explodes only the categories and sums the counts by category.

“We need to see the total video count, broken out by content category and keyword.” For this request, the explore explodes both categories and keywords and sums the counts by both fields.

“What’s the total video count for ads where ‘soymilk’ is in the keywords and positive sentiment is greater than 0.03, broken out by content category?” In this case, we are filtering on the unexploded keywords and then exploding the content categories.
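For intuition, here is a hedged sketch of the kind of query this last request boils down to (not the exact SQL Looker emits; the ARRAY_CONTAINS filter is just one way to express the unexploded keyword filter):

select
    iabv2.value::varchar as iab_v2_category,
    sum(v.counts) as total_counts
from
    VIDEO_CONTEXTUAL_DATA v,
    lateral flatten(input => v.iab_v2_categories) iabv2
where
    array_contains('soymilk'::variant, v.keywords)
    and v.sentiments:POSITIVE::float > 0.03
group by 1;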

Conclusion

Overall, we are excited about the end result of our effort. It is a simple solution to a common and complicated data modeling problem.

Results

We can finally fully leverage compound value modeling. In the modern big-data stack, this is a powerful solution for reporting on complex data relationships in real time.

Our data product is more robust and flexible. We used to think of lateral flatten as a “cleaning transformation” that would be performed during ETL and curation. By pushing this transformation to the reporting layer, the user has dynamic control of this process. We do not have to rebuild and rerun processing pipelines to add new data and answer new user questions.

This simplifies our data ecosystem. We have stripped away a lot of the overhead and complexity, and are left with a straightforward approach to exploring data. In the described case, we would previously have used eight tables to represent all the combinations of dimensions; now we use only one.

Next Steps

Find the limits of this solution. While we are happy with the performance improvements, delaying the transformation until reporting time obviously has its limits. If users query this data frequently, we will likely want a hybrid approach where some of the data is made available already exploded.

Design how to UX-gate users to avoid common pitfalls of compound value dimensions. One common issue with compound value dimensions is double counting. While the measure counts are fundamentally correct, interpreting the many-to-many relationship can be confusing for users. The simplest example: summing the measure column over an exploded representation returns more of that measure than was actually recorded. We are exploring UI elements (warnings, labels, column names, etc.) and alternative measure formats (normalized percentages, fractional displays, etc.) to determine what works best.
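To make the pitfall concrete, here is a quick illustration against the demo data from earlier (both totals can be verified by hand):

-- True total across the base table: 300 + 100 + 50 + 88 + 299 = 837
select sum(counts) as true_total
from VIDEO_CONTEXTUAL_DATA;

-- The same measure summed over keyword-exploded rows counts each row once
-- per keyword, so the result (3161) is far larger than the 837 actually recorded.
select sum(v.counts) as inflated_total
from
    VIDEO_CONTEXTUAL_DATA v,
    lateral flatten(input => v.keywords) kw;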

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
