Automatically create a pivot table in Redshift
Creating pivot tables is a relatively common need, as seen by its popularity in software like Microsoft Excel and Google Sheets. We built a small tool to create pivoted database tables. It can be used as a standalone tool or integrated into ETL pipelines.
Here’s an example of what a simple pivot table transformation might look like:
At Ro, our business and operations are growing quickly in terms of both scale and complexity. We launched at the end of October 2017 with a single brand (Roman) treating a single condition (erectile dysfunction), but in the last few weeks we’ve added a new brand (Zero, for smoking cessation) and several conditions under Roman (hair loss, cold sores, genital herpes, and premature ejaculation), with more on the way.
For the data team, this means our data warehouse needs to support business units and analysts who want to look at data through a variety of lenses. They may want to focus on the overall company, on a particular medical condition, a particular brand, or some cross-cutting category (all prescription medicines vs OTC products). Pivot tables are one important tool for making it but easy and efficient for them to ask and answer the questions they’re interested in.
For context, our data warehouse is in Amazon Redshift, and our solution is currently Redshift-specific (though it would be straightforward to adapt it to most other databases).
The usual way
At the database level, the most common way to write pivot queries is probably to hand-write chains of CASE statements. This is unsatisfactory because it’s labor-intensive, error-prone, and needs to be updated by hand every time a new column is added or a new case becomes possible.
Alternatively, pivoting may be done outside the database, using a spreadsheet program or a business intelligence tool like Looker. This doesn’t help if it’s important to have the pivoted data in the database, whether for performance reasons or because it’s used further in your data pipeline.
We built a simple way to generate pivot tables (or just pivot queries), which we’ve posted on GitHub here. The README discusses how to use it, including some examples.
The implementation strategy is to take the list of pivot columns, find every existing combination of values in those columns, and then generate appropriate pivoted columns. An aggregation function and default value can be specified individually for each pivoted column, or the script will attempt to guess reasonable defaults based on the column’s name (i.e. SUM will be used for a column whose name starts with `total_`).
The result is that we can declaratively specify pivot tables and have them automatically regenerated. The changes necessary to support new columns are either non-existent or minimal and localized.
Spreadsheets and business intelligence tools can create pivot tables, but that’s not always a replacement for having an in-database pivot table.
Some databases include related functionality (like PostgreSQL’s crosstab table function or Microsoft SQL Server’s PIVOT feature).