John Mastro
Feb 20 · 3 min read

Creating pivot tables is a relatively common need, as seen by its popularity in software like Microsoft Excel and Google Sheets. We built a small tool to create pivoted database tables. It can be used as a standalone tool or integrated into ETL pipelines.

Here’s an example of what a simple pivot table transformation might look like:

Background

At Ro, our business and operations are growing quickly in terms of both scale and complexity. We launched at the end of October 2017 with a single brand (Roman) treating a single condition (erectile dysfunction), but in the last few weeks we’ve added a new brand (Zero, for smoking cessation) and several conditions under Roman (hair loss, cold sores, genital herpes, and premature ejaculation), with more on the way.

For the data team, this means our data warehouse needs to support business units and analysts who want to look at data through a variety of lenses. They may want to focus on the overall company, on a particular medical condition, a particular brand, or some cross-cutting category (all prescription medicines vs OTC products). Pivot tables are one important tool for making it but easy and efficient for them to ask and answer the questions they’re interested in.

For context, our data warehouse is in Amazon Redshift, and our solution is currently Redshift-specific (though it would be straightforward to adapt it to most other databases).

The usual way

At the database level, the most common way to write pivot queries is probably to hand-write chains of CASE statements. This is unsatisfactory because it’s labor-intensive, error-prone, and needs to be updated by hand every time a new column is added or a new case becomes possible.

Alternatively, pivoting may be done outside the database, using a spreadsheet program or a business intelligence tool like Looker. This doesn’t help if it’s important to have the pivoted data in the database, whether for performance reasons or because it’s used further in your data pipeline.

An alternative

We built a simple way to generate pivot tables (or just pivot queries), which we’ve posted on GitHub here. The README discusses how to use it, including some examples.

The implementation strategy is to take the list of pivot columns, find every existing combination of values in those columns, and then generate appropriate pivoted columns. An aggregation function and default value can be specified individually for each pivoted column, or the script will attempt to guess reasonable defaults based on the column’s name (i.e. SUM will be used for a column whose name starts with `total_`).

The result is that we can declaratively specify pivot tables and have them automatically regenerated. The changes necessary to support new columns are either non-existent or minimal and localized.

Other alternatives

Spreadsheets and business intelligence tools can create pivot tables, but that’s not always a replacement for having an in-database pivot table.

Some databases include related functionality (like PostgreSQL’s crosstab table function or Microsoft SQL Server’s PIVOT feature).

The popular tool dbt also includes a macro for creating pivot tables in the dbt-utils package.

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science

John Mastro

Written by

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade