Using Postgres Aggregate Functions

As the leader in performance driven merchandising we deal with lots of data. The fun part is helping customers unlock insights from their data. This is a real world example where we solved a customer need using our workhorse database server Postgres.

The ask was simple: “We have two sets of UPC codes used for tracking purposes. The products referenced by those codes are exactly the same but the UPC codes are different so we can track them in separate campaigns. We want the option to view presence percentage about each UPC code individually but also be able to combine the two UPC codes so the presence percentage is dependent on either one being available during a store visit. Most importantly, this should work for existing products without the dual UPC behavior.”

To unlock this insight, we use Postgres aggregate functions. We utilize an ETL flow to manipulate our data so we have two columns product_id and mapped_product_id. Product Id always refers to the original UPC and mapped product id will be present when one product should be treated like another product. Our simplified table looks like this:

shelf_conditions(is_present, product_id, mapped_product_id, visit_id)

is_present is a boolean column and indicates if the product was found or not and is required

product_id is the product id and is required

mapped_product_id is optional and only present for those instances where we want to combine the product_id. When the customer asks us to combine data they want the result rolled up as this mapped_product_id

visit_id refers to the day and store we visited when we collected the information and is required

Our query to unlock the combined product presence for a visit becomes:

SELECT coalesce(mapped_product_id, product_id) as product_id, bool_or(is_present) as is_present, visit_id 
GROUP BY 1, visit_id

Let’s break down the query:

coalesce allows us to select the first non null value starting with mapped_product_id and then product_id. mapped_product_id is not null when there is dual UPC behavior.

bool_or is an aggregate function and returns true when at least one of the input values is true

GROUP BY 1, visit_id indicates that we want to group by the results of the first column and the visit_id. We can’t use resulting alias product_id yet in the group by since that is a result of coalesce which is not available at the time the group by is processed. We could alternatively have written group by coalesce(mapped_product_id, product_id), visit_id instead of 1, visit_id.

This query can handle the combined UPC case but also the default case where there is no combined UPC since the output is product_id, is_present, visit_id. We then can use this output in our existing flows and queries thanks to the facility of Postgres aggregate functions.

Like what you read? Give David Lee a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.