Switching to a better data warehouse naming convention
Dataform is still a relatively young company, and so doesn’t yet have a dedicated “data” team. That said, most people within the company have an interest in data and are proficient in SQL — unsurprisingly, given the product we build. Because of these two things, ownership of our analytics code (managed in Dataform) is distributed across the team.
Our goal is for everyone to feel comfortable both contributing to the data modelling code, and using the data. But last week we realised we weren’t meeting this goal.
Lewis, our CTO, mentioned in a conversation we were having that he often found it difficult to know where new additions to our data model belonged. More than once, this has led to him building something from scratch, which we later realised was already covered somewhere else.
Josie, who leads marketing at Dataform, experienced the problem from the other side. As a growth marketer, she’s keen to track the impact of all her efforts. But finding the right tables to use, and being confident that she was looking in the right place, was not easy.
As the person who spends the most time in our project, I was somewhat blind to these issues. The majority of the code was either written, or reviewed, by me, so I usually know where things belong, and how to get what I want. But on hearing the challenges Lewis and Josie were having, we decided it was time for a change.
Identifying the issue: names
As a first step in addressing the issue, the three of us got together to try and understand the problem a bit more deeply.
For our CTO, the issue was with development: he wanted to make some changes to our code, but wasn’t quite sure where those changes should go. He’d recently been conducting an analysis to understand how many schedule runs each of our customers does each day, but it wasn’t clear where that metric should fit into our existing tables.
For our marketing manager, the issue was more in consuming the data: of the hundreds of tables, which was the “right” one to use for an analysis? She’s very focused on user acquisition, but there are at least ten tables that look like they might be relevant: user_stats, user_attribution_stats, users, daily_user_stats, …
After a brief discussion, we hypothesised that most of our problems could be solved if we structured our data warehouse more clearly and had a clearer naming convention for tables. Our analytics codebase is broken into modular, self-contained chunks, and each chunk is a dataset (table or view). If the name of each dataset fully described its contents, it would make both development and consumption much simpler.
Setting a naming convention
Our hypothesis was that if we named our datasets more clearly, it would be easier both for developers to contribute to our data model and for users to consume our data.
Properties of a good name
Ideally, you’d be able to see the name of a dataset and have a very clear idea of what data it contains and what its purpose is. For us, a good name communicates three things:
- Domain: We analyse data from many different sources, and so it’s important first to understand the domain of the data. We use Stripe, Intercom, Segment and more, and each contains information about users. The name should tell us which flavour of users the data describes: Stripe users, Pipedrive users, etc.
- Grain: One of the most important things to know about a dataset is what an individual row (or record) represents; this is often called the grain of the dataset.
- Transformation stage: The data modelling logic we run in Dataform takes raw data, cleans it up, and finally presents it in a way that is optimised for analytics. The name should clearly indicate whether a dataset is intended to be used for analytics or whether it is a stepping stone along the way.
Structure of a data warehouse object (i.e. table) name
Most data warehouses follow a similar pattern for the names they allow you to give to a dataset:
database_name.schema_name.table_name
The next step was to decide where in this structure to include the domain, grain and stage information. After some discussion, we came up with a plan:
- database_name: should differentiate between environments. For us, this means we have `dataform_corp` (our production environment) and `dataform_corp_dev` (our development environment, more on environments here).
- schema_name: differentiates between domains, e.g. `stripe`, `intercom`, `segment`.
- table_name: describes the grain of the dataset.
So, we might have a table called `dataform_corp.segment.users`. This is a table in our production environment that has one row per Segment user.
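To make that concrete, here’s a minimal sketch of a query against that table. The column names are hypothetical and the exact quoting of the fully qualified name depends on your warehouse, but the name alone tells you the environment, the domain and the grain:

```sql
-- dataform_corp  -> production environment
-- segment        -> domain (data that arrived via Segment)
-- users          -> grain (one row per user)
-- Column names below are illustrative only.
select
  user_id,
  first_seen_at
from dataform_corp.segment.users
where first_seen_at >= '2019-01-01'
```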
Treating transformation stages as domains
The naming convention described above still misses one piece: the “transformation stage”. Looking at the existing datasets in our warehouse, we realised that they fit into one of three buckets:
- Group 1: Tables that are created as building blocks along the way to building something else. They are essentially a by-product of the data modelling process, and (in general) shouldn’t be used elsewhere.
- Group 2: Tables that are not intended to be used for analytics, but are a core piece of the data model and are used by many downstream tables.
- Group 3: Tables that are outputs expressly designed for consumption in analytics tasks. These are generally at the end of the dependency graph.
After considering a few approaches, we decided that using the schema to clearly differentiate between the three groups made the most sense:
- `ephemeral`: A schema for all tables in group 1. These can essentially be ignored by everything apart from the Dataform pipeline.
- `core`: A schema for all tables in group 2. These will form the backbone of our data modelling, but are generally not intended for consumption by BI tools, analytics etc.
- `reporting`: A schema for all tables in group 3. These are the primary outputs of the data modelling process. Anyone wanting to do analytics on our data will probably find what they’re looking for in this schema.
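In Dataform, the schema a dataset lands in is set in the `config` block of its SQLX file, so adopting this convention is mostly a matter of setting that field consistently. As a rough sketch, a reporting table might look something like this (the file path, columns and referenced table are hypothetical):

```sql
-- definitions/reporting/daily_signups.sqlx (hypothetical)
config {
  type: "table",
  schema: "reporting"
}

select
  date(signed_up_at) as signup_date,
  count(distinct user_id) as new_users
from ${ref("users")}
group by 1
```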
One important part of this structure is that datasets in the `core` schema should never depend on datasets in the `reporting` schema. This simple rule enforces structure on our data model, and that structure ultimately helps to minimise the length of our dependency graph and reduce the likelihood of getting tripped up by circular dependencies.
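Dataform doesn’t enforce this rule for us; it’s a convention we apply during code review. The pattern we’d flag is a model in the `core` schema referencing a `reporting` dataset, roughly like this (all names hypothetical):

```sql
-- definitions/core/users.sqlx (hypothetical)
config {
  type: "table",
  schema: "core"
}

select * from ${ref("segment_users_cleaned")}  -- ephemeral dependency: fine
-- select * from ${ref("daily_signups")}       -- reporting dependency: breaks the rule
```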
Rolling out the change
Now that we’d agreed on a naming convention, it was time for me to refactor our data modelling code. Because we use Dataform, all of our data modelling code is in one place — in fact it’s all in one GitHub repository — so I only have to make updates in one place.
The first step of course was to create a new branch. Because I was making such a big change, it was important that I could isolate my changes from our production data modelling code, so that people could continue using our data for analytics whilst the work was in progress.
Given the `ephemeral` schema is the one that most users of our data won’t see, I decided this was a good choice for the default schema. If someone creates a new model, they’ll need to actively assign a different schema if they think it should be `core`, `reporting` etc.
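For illustration, the project-wide default schema is set in `dataform.json`. A minimal sketch, with the warehouse connection settings omitted:

```json
{
  "defaultSchema": "ephemeral"
}
```

Any model that doesn’t set a schema in its `config` block will then be created in `ephemeral` by default.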
I then worked my way down the set of files and folders in our project, swapping out filenames (which are used as the table names in Dataform) and schema names (set in the `config {}` block) for their replacements in the new convention. Each time you do this, there’s a good chance that you’ll break a link in your dependency graph. Fortunately, Dataform immediately alerts you to the issue and tells you which file the problem is in, so you can go and correct it. More often than not, I needed to update a `${ref("table_name")}` to use the new name.
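For example, if a file previously called `user_attribution_stats.sqlx` gets renamed to better describe its grain, every downstream reference has to follow (both names are purely illustrative):

```sql
-- before
select * from ${ref("user_attribution_stats")}

-- after: the ref points at the renamed file/table
select * from ${ref("user_acquisitions")}
```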
After roughly two hours, I’d been through all of the files in our project (over 100), renamed everything to meet the new naming convention, and resolved all compilation errors.
The final remaining step was to kick off a deployment of our entire graph, again in a development environment so that our production data wasn’t affected, to ensure there were no failures.
With that, I created a PR and sent it to one of my colleagues for review.
Summary
It’s early days for this new structure, but so far we’re happy with the results. It’s much clearer what each dataset contains, which makes things much easier for users of the data, and also a lot easier for developers to understand how to contribute.
I don’t imagine that this naming convention will work for everyone. If your warehouse is simpler than ours, this may all seem a little over the top. On the other hand, if you’re a significantly larger and more complex organization than Dataform, you may need more layers of transformation stages, or may need an `ephemeral` schema for each domain.
In fact, I wouldn’t be surprised if we need to go through this process again in a year’s time. But I’m OK with that! Because all of our code is version controlled and managed in one place, and because Dataform automatically checks for consistency across the entire dependency graph, making the update was painless.
Have you been through a similar process at your organization? Where did you end up? Have you got suggestions for a better naming convention? If so, I’d love to hear from you! Come and join our community of data practitioners in Slack and share your thoughts and ideas.
Originally published at https://dataform.co.