Real World Use Case for MAX_BY in DBSQL — Gold Layer Views

Databricks SQL SME

Published in

DBSQL SME Engineering

6 min readMay 9, 2024

My favorite use case for MAX_BY in SQL — creating gold level views of data with column-level control

Creating a unified view from disparate data sources with MAX_BY

Author: Cody Austin Davis

Introduction

This is a short and sweet article describing one of the most common real-world use cases for using the MAX_BY/MIN_BY aggregate function. This function is not seen as often in the wild, mostly because people don’t know it exists. Not only is it a great lesser-known function, but once you know about it, you’ll begin to see many ways to use it across your data. Today, we will cover one that I see most often — creating unified views for data across data sources.

What is the use case?

Companies often have multiple data sources of data. CRMs, app data, public data sources, and even data received from different customers / stakeholders in a supply chain. When this happens, its common for there to be data that comes in from multiple data sources that actually represents a single entity. For the example, lets say that entity is a “customer”. If you run a retail website, you may obtain customer data from multiple places, and furthermore, you may get different data points for that customer across those data sources. Orders, customer profiles, returns, reviews, etc. etc. In analytics, most of the valuable insights only come if you can stitch all these data sources together into a complete and unified view for a given customer. This is where MAX_BY comes in and makes things easy!

Lets dive into an example. We will simulate the following system where we may get customer information from multiple data sources that all go into a “customers” table. Then all those customer records from all data sources will be stitched together into a “unified view” of each “master customer” that represents the most recent definition of that customer across all sources.

Here is an example output of the customer table, assuming each record is from a separate data source:

Usually this requires a sort of fuzzy matching Entity Resolution (blog coming soon on this topic) system for this type of use case, but for the sake of brevity, we will assume that we are already able to “stitch” these disparate records together to identify entities that are the same and link them with a master_customer_id. This can also apply to other scenarios where we just want to create a unified view for an event across entities as well (customers, orders, returns, etc). We will review 3 ways to create the “Unified View”.

1 — Pick the most recent customer record

This is the most obvious idea. We simply make the “master customer” record be whichever record has the most recent update. This is easy and makes sense, but it has downside of not being able to effectively utilize the data from all data sources at the same time. For example, if the Shipment Data Source is missing phone number but the Order Data Source has phone number, but Shipment Data Source is the most recent record, our view will show a NULL phone number for our master view. Our picture is not complete and not robust. Our SQL example of this view and output looks like this:

-- Pick most recent master id record with window function + QUALIFY
SELECT
*
FROM customers
QUALIFY ROW_NUMBER() 
  OVER (PARTITION BY master_customer_id ORDER BY update_timestamp DESC) = 1;

Output of the above query for a single master customer Id record

There is a significant limitation with this method. In our small example, we are missing some data. Look at SSN. If another data source has this attribute but is not the most recent record, that data will not be used. Lets fix this problem.

2 — Group By the “master” entity and pick a random record for each column

This is the most common implementation I see. It is great for some use cases and allows us to utilize data from all customer data sources for a single output record, so if 1 data source is missing an attribute, but another has that attribute (i.e. Shipment Data Source is missing phone number but the Order Data Source has phone number), you can “fill in the blanks” across data sources to ensure you have a more complete picture. However, this is not a very robust implementation, as you still are not picking the most recent record update for each column. You are just using a naive aggregate function like MAX/MIN to just pick a non-null value for each column. This implementation looks like this:

SELECT
  master_customer_id,
  MAX(customer_name) AS customer_name,
  MAX(state) AS state,
  MAX(city) AS city,
  MAX(postcode) AS postcode,
  MAX(street) AS street,
  MAX(`number`) AS number,
  MAX(ship_to_address) AS ship_to_address,
  MAX(ssn) AS ssn,
  MAX(phone_number) AS phone_number
FROM raw_customers
GROUP BY master_customer_id;

This is better! We have a more complex dataset, but there is still one problem: We have no control over which record this query chooses outside of alphabetical order. Our unified view of a customer could contain old/less relevant data.

3 — Use MAX_BY to customize which record is picked with control and purpose

This is the best way to create a view of a customer (or any other record) intentionally. We will use the MAX_BY function with optional FILTER clause to pick the most recent NOT NULL record for each column. This way, the view can decide to use the most recent not null record by column (instead of by row), regardless of data source all in one record. Here is how we do that:

-- Add filtering rules to ensure bad data doesnt make it into the "final" cleaned up master view
SELECT
  master_customer_id,
  MAX_BY(customer_name, update_timestamp) FILTER (WHERE customer_name IS NOT NULL) AS customer_name,
  MAX_BY(state, update_timestamp) FILTER (WHERE state IS NOT NULL) AS state,
  MAX_BY(city, update_timestamp) FILTER (WHERE city IS NOT NULL) AS city,
  MAX_BY(postcode, update_timestamp) FILTER (WHERE postcode IS NOT NULL) AS postcode,
  MAX_BY(street, update_timestamp) FILTER (WHERE street IS NOT NULL) AS street,
  MAX_BY(`number`, update_timestamp) FILTER (WHERE `number` IS NOT NULL) AS number,
  MAX_BY(ship_to_address, update_timestamp) FILTER (WHERE ship_to_address IS NOT NULL) AS ship_to_address,
  
-- Check formatting for example
  MAX_BY(ssn, update_timestamp) 
      FILTER (WHERE ssn IS NOT NULL 
              AND length(regexp_replace(ssn, '\\D','')) = 9 ) AS ssn,
  MAX_BY(phone_number, update_timestamp) 
      FILTER (WHERE phone_number IS NOT NULL 
                AND length(regexp_replace(phone_number, '\\D','')) = 10) AS phone_number
FROM raw_customers
GROUP BY master_customer_id;

This code is much more robust and allows us to not only get the most recent record value by column (instead of only by a full row), it also allows us to add optional FILTER clauses to control the quality of the selected record. In the example above, we select the most recent record for each column as long as the column value is not null. For more important columns, we can also add more rules that check quality before it selects the record. In the example above, we only select the most recent phone and social number that are not null AND have valid values according to the context of the data itself. This is the most robust method to create a single view of an entity.

Thats it! MAX_BY is not commonly seen, but hopefully this helps you think of more effective and creative ways to build out your data model.

The full code of the notebook example can be found here to try out yourself.