When building a data warehouse or data mart, you often want to define derived tables in terms of SELECT statements, preferably without writing explicit DDL (i.e., CREATE TABLE) and without giving up the ability to use primary keys, foreign keys, or important database-specific features like Amazon Redshift’s DISTSTYLE, DISTKEY, and SORTKEY. We built a small tool at Ro that makes this easy without pulling in a larger data-pipeline-management tool.
At Ro, our business and operations are growing quickly in terms of both scale and complexity. We launched at the end of October 2017 with a single brand (Roman) treating a single condition (erectile dysfunction), but in the last few weeks we’ve added a new brand (Zero, for smoking cessation) and several conditions under Roman (hair loss, cold sores, genital herpes, and premature ejaculation), with more on the way.
For the data team, this means we need to be able to react quickly to changes in the underlying products and business. For instance, if half a dozen columns are added to a table in the production database, we want to minimize the amount of time and work it takes to add those columns to appropriate derived tables, while also minimizing the chances that the additions will break anything.
For context, our data warehouse is in Amazon Redshift, and our solution is currently Redshift-specific (though it would be straightforward to adapt it to most other databases).
The usual way
If you’re defining your derived tables with standard CREATE TABLE and SELECT statements, then you’ll need to add the new columns in both places. If you skip the CREATE TABLE and use SELECT INTO (or a moral equivalent like CREATE TABLE AS) instead, you’re down to one place to change, but now you have a different problem — you can’t define primary or foreign keys, or take advantage of Redshift’s DISTSTYLE, DISTKEY, and SORTKEY features.
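To make the trade-off concrete, here is a simplified sketch (table and column names are hypothetical):

```sql
-- Option 1: explicit DDL. Keys and distribution are available,
-- but every new column must be added in two places.
CREATE TABLE analytics.orders_derived (
    order_id   BIGINT PRIMARY KEY,
    user_id    BIGINT REFERENCES analytics.users (user_id),
    created_at TIMESTAMP
) DISTSTYLE KEY DISTKEY (user_id) SORTKEY (created_at);

INSERT INTO analytics.orders_derived
SELECT order_id, user_id, created_at
FROM production.orders;

-- Option 2: SELECT INTO. Only one place to change when columns
-- are added, but no way to declare keys, DISTKEY, or SORTKEY.
SELECT order_id, user_id, created_at
INTO analytics.orders_derived
FROM production.orders;
```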
We built a simple way to get the advantages of SELECT INTO without losing the ability to specify keys, which we’ve posted on GitHub here.
The README discusses how to use it, plus a couple of notable limitations.
The implementation strategy is to first run a variant of the query with SELECT INTO and LIMIT 10, creating a temporary table with a small sample of rows. PostgreSQL and Redshift select appropriate types for each column, which we can then access via pg_table_def[¹]. We use this to automatically generate an appropriate CREATE TABLE statement, execute it, and then INSERT INTO the final table.
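Conceptually, the generated statements look something like the following sketch (names are hypothetical; the tool produces the equivalent statements automatically):

```sql
-- Step 1: run the query with SELECT INTO and LIMIT 10 so that
-- Redshift infers an appropriate type for each column.
SELECT order_id, user_id, created_at
INTO TEMP TABLE orders_sample
FROM production.orders
LIMIT 10;

-- Step 2: read the inferred types back from pg_table_def.
-- (Temporary tables are visible in pg_table_def for the
-- current session, given an appropriate search_path.)
SELECT "column", type
FROM pg_table_def
WHERE tablename = 'orders_sample';

-- Step 3: generate and execute a CREATE TABLE statement using
-- those types (now with room for key and distribution clauses),
-- then load the full query result into the final table.
INSERT INTO analytics.orders_derived
SELECT order_id, user_id, created_at
FROM production.orders;
```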
The fact that we’re generating a CREATE TABLE statement means we have somewhere to put key definitions, so you can optionally specify a primary key, foreign key, diststyle, distkey, and sortkey in a YAML configuration file. The README gives an example of what this looks like in practice.
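As a rough illustration of the idea — this is a hypothetical sketch, not the tool’s actual format, which is documented in the README:

```yaml
# Hypothetical YAML configuration for one derived table.
# The real schema is defined in the project's README.
orders_derived:
  primary_key: order_id
  foreign_keys:
    - column: user_id
      references: users (user_id)
  diststyle: key
  distkey: user_id
  sortkey: created_at
```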
So now we can have our cake and eat it too — the simplicity of SELECT INTO plus the power to specify keys that we would otherwise lose.
In PostgreSQL, you might use materialized views for this. They can be recreated and refreshed as needed and they support indexes (including unique indexes) but not foreign keys. Unfortunately, they’re not supported in Redshift.
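For reference, the PostgreSQL approach looks like this (again, not available in Redshift):

```sql
-- PostgreSQL only: a derived table that can be refreshed in place.
CREATE MATERIALIZED VIEW orders_derived AS
SELECT order_id, user_id, created_at
FROM orders;

-- Unique indexes are supported (foreign keys are not), and one is
-- required for REFRESH ... CONCURRENTLY below.
CREATE UNIQUE INDEX ON orders_derived (order_id);

-- Refresh without dropping and recreating the view.
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_derived;
```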
You could also use a tool like dbt to accomplish this. However, something like dbt does a lot more than this one thing[²], and there can be value in keeping things simple.
¹: Or information_schema.columns, which is also available in PostgreSQL, but somewhat less convenient to use.
²: It has a templating language, macros, a scheduler, a dependency resolver…