Flexible Calculations in a One-to-Many SQL Relationship

Image of the word SQL being held up by a crane

Background

At GumGum, we process hundreds of terabytes of data each day. Storing every source table is a luxury. In practice, we have to create aggregated tables from source tables to reduce data volume. Those aggregated tables have a relatively long retention period, and they should be robust enough to serve a wide variety of reporting needs. During this process, we faced the challenge of providing flexible calculations after joining two source tables of different granularity.

A Generic Solution

WITH c AS (
    SELECT a.*
        ,b.*
        ,ROW_NUMBER() OVER (
            PARTITION BY <join key>
            ORDER BY <criteria you want to solve by>
        ) AS row_number
    FROM a
    LEFT JOIN b ON a.a_id = b.b_id
)
SELECT c.*
    ,CASE
        WHEN COALESCE(row_number, 1) = 1
            THEN <calculated_column>
        ELSE 0
    END AS dupes_excluded_calculated_column
    ,<calculated_column> AS dupes_included_calculated_column
FROM c
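To make the pattern concrete, here is a minimal sketch that runs it row by row on two tiny tables. Everything in it is an assumption made for illustration: the columns amount and label, the sample rows, the alias rn in place of row_number, and the use of SQLite through Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). It is not the production setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (a_id INTEGER, amount INTEGER);  -- hypothetical "one" side
CREATE TABLE b (b_id INTEGER, label TEXT);      -- hypothetical "many" side
INSERT INTO a VALUES (1, 10), (2, 4);
INSERT INTO b VALUES (1, 'x'), (1, 'y');        -- a_id 2 has no match in b
""")

rows = conn.execute("""
WITH c AS (
    SELECT a.a_id, a.amount, b.label,
           -- number the joined rows within each join key
           ROW_NUMBER() OVER (PARTITION BY a.a_id ORDER BY b.label) AS rn
    FROM a
    LEFT JOIN b ON a.a_id = b.b_id
)
SELECT a_id, label, rn,
       -- only the first row per join key keeps the value
       CASE WHEN COALESCE(rn, 1) = 1 THEN amount ELSE 0 END AS dupes_excluded_amount,
       amount AS dupes_included_amount
FROM c
ORDER BY a_id, rn
""").fetchall()

for row in rows:
    print(row)
```

In this sample, a_id 1 appears twice after the join, and only its first row keeps amount in dupes_excluded_amount, so summing that column counts each row of a exactly once. The unmatched a_id 2 still appears once and keeps its amount.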

Production Application

inventory
ad_events

The goal is to create an aggregated table that includes inventory_count from inventory and view from ad_events. The join key is inventory_id, which appears in both tables. However, since inventory_id is not unique in the ad_events table, the joined table inventory_ad_events repeats the inventory data once per matching ad_events row.
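The duplication can be reproduced with a small sketch. The sample rows below are hypothetical (they are not GumGum data), and SQLite via Python's sqlite3 module is assumed only as a convenient way to run the SQL. The column view is double-quoted because VIEW is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (inventory_id INTEGER, page_url TEXT, inventory_count INTEGER);
CREATE TABLE ad_events (inventory_id INTEGER, advertiser_id INTEGER, "view" INTEGER);
-- hypothetical sample rows
INSERT INTO inventory VALUES (1, 'a.com/home', 100), (2, 'a.com/home', 50), (3, 'b.com/news', 70);
INSERT INTO ad_events VALUES (1, 10, 5), (1, 20, 3), (2, 10, 2);
""")

# Total inventory_count straight from the source table.
true_total = conn.execute(
    "SELECT SUM(inventory_count) FROM inventory"
).fetchone()[0]

# Same sum after the one-to-many join: inventory 1 matches two
# ad_events rows, so its inventory_count is added twice.
joined_total = conn.execute("""
SELECT SUM(i.inventory_count)
FROM inventory i
LEFT JOIN ad_events e ON i.inventory_id = e.inventory_id
""").fetchone()[0]

print(true_total, joined_total)
```

With these rows, the source total is 220 while the joined total is 320. The difference is the extra 100 contributed by the second ad_events row of inventory 1.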

inventory_ad_events

The current inventory_count only makes sense when you group by the dimension advertiser_id. You will get an inflated sum of inventory_count when you group by a dimension such as page_url or inventory_id, because advertiser_id sits on the “many” side of the join and is therefore at a finer granularity than page_url and inventory_id. Following the generic solution, we treat advertiser_id as the “criteria you want to solve by” and inventory_id as the “join key”, which gives each advertiser_id a row_number within its inventory_id partition. When row_number is 1, we keep inventory_count as is; otherwise, we overwrite the inflated inventory_count with 0. An inventory with zero advertisement events still yields a single row from the LEFT JOIN, with a NULL advertiser_id; that row receives row_number 1, so the COALESCE in the query acts only as a defensive guard. Below is the detailed select statement to derive a new table, new_inventory_with_ad_events:

WITH c AS (
    SELECT i.inventory_id
        ,i.page_url
        ,i.inventory_count
        ,e.advertiser_id
        ,e.VIEW
        -- number the joined rows within each inventory_id
        ,ROW_NUMBER() OVER (
            PARTITION BY i.inventory_id
            ORDER BY e.advertiser_id
        ) AS row_number
    FROM inventory i
    LEFT JOIN ad_events e ON i.inventory_id = e.inventory_id
)
SELECT inventory_id
    ,page_url
    ,advertiser_id
    ,SUM(VIEW) AS VIEW
    -- only the first row of each inventory_id keeps inventory_count
    ,SUM(CASE
        WHEN COALESCE(row_number, 1) = 1
            THEN inventory_count
        ELSE 0
    END) AS dupes_excluded_inventory_count
    ,SUM(inventory_count) AS dupes_included_inventory_count
FROM c
GROUP BY inventory_id, page_url, advertiser_id

new_inventory_with_ad_events

Now we can answer different business questions from one aggregated table. For example:

What is the inventory_count and advertisement view for each advertiser per page_url?

SELECT page_url
    ,advertiser_id
    ,SUM(dupes_included_inventory_count) AS inventory_count
    ,SUM(VIEW) AS VIEW
FROM new_inventory_with_ad_events
GROUP BY page_url
    ,advertiser_id

Query Result

What is the total inventory_count and advertisement view per page_url?

SELECT page_url
    ,SUM(dupes_excluded_inventory_count) AS inventory_count
    ,SUM(COALESCE(VIEW, 0)) AS VIEW
FROM new_inventory_with_ad_events
GROUP BY page_url

Query Result
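For readers who want to try the whole flow, the sketch below builds new_inventory_with_ad_events from hypothetical sample rows and then runs both business questions against it. The data, the alias rn, and the choice of SQLite through Python's sqlite3 module (3.25 or later for window functions) are assumptions for illustration. The SQL follows the statements in this post, with view double-quoted because it is a keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (inventory_id INTEGER, page_url TEXT, inventory_count INTEGER);
CREATE TABLE ad_events (inventory_id INTEGER, advertiser_id INTEGER, "view" INTEGER);
-- hypothetical sample rows; inventory 3 has no ad events
INSERT INTO inventory VALUES (1, 'a.com/home', 100), (2, 'a.com/home', 50), (3, 'b.com/news', 70);
INSERT INTO ad_events VALUES (1, 10, 5), (1, 20, 3), (2, 10, 2);

CREATE TABLE new_inventory_with_ad_events AS
WITH c AS (
    SELECT i.inventory_id, i.page_url, i.inventory_count, e.advertiser_id, e."view",
           ROW_NUMBER() OVER (PARTITION BY i.inventory_id ORDER BY e.advertiser_id) AS rn
    FROM inventory i
    LEFT JOIN ad_events e ON i.inventory_id = e.inventory_id
)
SELECT inventory_id, page_url, advertiser_id,
       SUM("view") AS "view",
       SUM(CASE WHEN COALESCE(rn, 1) = 1 THEN inventory_count ELSE 0 END)
           AS dupes_excluded_inventory_count,
       SUM(inventory_count) AS dupes_included_inventory_count
FROM c
GROUP BY inventory_id, page_url, advertiser_id;
""")

# Question 1: inventory_count and view for each advertiser per page_url.
per_advertiser = conn.execute("""
SELECT page_url, advertiser_id,
       SUM(dupes_included_inventory_count) AS inventory_count,
       SUM("view") AS "view"
FROM new_inventory_with_ad_events
GROUP BY page_url, advertiser_id
ORDER BY page_url, advertiser_id
""").fetchall()

# Question 2: total inventory_count and view per page_url.
per_page = conn.execute("""
SELECT page_url,
       SUM(dupes_excluded_inventory_count) AS inventory_count,
       SUM(COALESCE("view", 0)) AS "view"
FROM new_inventory_with_ad_events
GROUP BY page_url
ORDER BY page_url
""").fetchall()

print(per_advertiser)
print(per_page)
```

In this sample, the per-page totals come out as 150 for a.com/home and 70 for b.com/news, which matches the source inventory table. The per-advertiser query attributes the full 100 of inventory 1 to both advertisers, which is the intended reading at that granularity.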

Summary

SQL Joke: A SQL query walks into a bar… In one corner there are two tables. The query walks up and asks, “Mind if I join you?”
