Data Modelling for Startups (Part III)

Paddy Alton
Apolitical Engineering
11 min read · May 15, 2023

This is the final article in a three-part series.

Introduction — data in practice

Over the last five years at Apolitical, the global platform for government, I and my team had the chance to do something exciting: to build up the company’s data infrastructure and systems from scratch. In this trilogy of articles I’m taking the time to reflect on that experience.

In the first article of the series I discussed data modelling theory and why data modelling is worthwhile.

In the second article I focused on infrastructure. In particular, I discussed how modern tools allow small teams to quickly set up a chain of infrastructure to extract data from sources, load it into a data warehouse, transform it into data models, and serve it via a business intelligence tool.

Putting it together

I now want to bring together the first article (which discussed data modelling in theory) with the second, which covered the prerequisites for doing data modelling as a startup.

First, a warning:

Don’t try to do everything at once

If you try to make everything perfect you’ll be at it forever — without anyone seeing any benefit.

Worse (and this is especially relevant to startups): the data in your source systems is liable to change. If you yourself don’t have full control over the source data, you’re eventually going to need some processes to govern these changes. But there is no point trying to design that framework and getting everyone to abide by it up-front: you need to give people a source of value before you put in place processes to protect that source of value.

So: talk to senior business leaders (or, if you are one, have yourself a good think). You need to identify the areas where the business wants to improve.

Once you’ve figured that out, here’s what you do …

A bold advance

The point of data modelling is to morph data into a form that

  • matches business intuition about how things are
  • can be readily consumed by BI tools and/or data analysts
  • acts as a single source-of-truth

In this section I’m going to assume you have set up your data warehouse and identified a BI tool you could use immediately on data in that warehouse.

Setting a course

We want to find a business process that the company lacks insight into, where that insight would help the company to make important decisions.

(not ‘understand the landscape better’; not ‘give us some insight into how things are going’; make decisions.)

Planning a data model

Once you’ve identified such a process, you can model all the data related to it. First, you need to have a chat with whoever is the (closest thing to an) owner of that process. This is important because you need to understand how that person sees the business. It’s their intuition your data model needs to match.

Relating this back to the first article in the series, perhaps you choose to model the company’s sales data, and your stakeholder affirms that they need to understand sales broken down by time period and attributes of customers and products:

[Diagram: a central ‘Sales’ fact table with arrows pointing in from three surrounding dimension tables: customers, products, and dates.]
N.B. you may not need a ‘dates’ table if your BI tool can handle that for you. Most will.

You may have to guide them through this process. One way to do that is to ask them for a list of open questions that are on their mind, and then talk to them about how the answers would help them to make business decisions (for example, where to spend money on advertising).

So, we have a target data model — we need a fact table to capture the business process (the completion of sales) and we need dimension tables to slice-and-dice that data.
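To make ‘slice-and-dice’ concrete: once the model exists, your stakeholder’s questions should reduce to simple queries against it. Here is a minimal sketch of such a query (the table and column names, like fct_sales and dim_products, are illustrative rather than prescribed):

    -- monthly revenue broken down by product category
    -- (fct_sales, dim_products, and their columns are hypothetical names)
    SELECT
      DATE_TRUNC(sales.sold_on, MONTH) AS sales_month,
      products.category AS product_category,
      SUM(sales.amount) AS total_revenue
    FROM fct_sales AS sales
    INNER JOIN dim_products AS products
      ON sales.product_id = products.product_id
    GROUP BY sales_month, product_category
    ORDER BY sales_month, product_category

If a stakeholder question can’t be phrased as a query of roughly this shape against the planned fact and dimension tables, that’s a sign the model needs rethinking.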

What are you working with?

Next, you need to inventory the data assets in your source systems that relate to your data model:

  • where are data about customers and products stored? A database? Different databases? A CRM system?
  • where are data about sales stored? A payment provider? A database you control?

(if you don’t have the assets you need, you need to have a data engineering conversation before anything else, and you may need to get on with some other data modelling project while that proceeds)

Centralising

The point of figuring out what data is available and where it resides is that you need to figure out how you will get it into your data warehouse.

For example, with BigQuery it’s actually possible to bring in data from Cloud SQL and Google Analytics 4 without any third-party tools or bespoke code.
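For instance, once a Cloud SQL connection has been registered in BigQuery you can pull data across with a federated query. A minimal sketch (the connection ID, table, and columns are placeholders):

    -- federated query against a Cloud SQL instance, run from BigQuery
    -- 'my-project.europe-west2.my-connection' is a placeholder connection ID
    SELECT *
    FROM EXTERNAL_QUERY(
      'my-project.europe-west2.my-connection',
      'SELECT id, email, created_at FROM customers'
    )

Google Analytics 4, similarly, can be linked to BigQuery so that event data is exported into the warehouse on a schedule, with no code involved.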

For other sources — especially things like payment providers and CRMs — you may not be able to create a direct connection and will need to set something else up. This is usually the point to try out some managed pipeline tools.

(N.B. you can/should pick a managed pipeline tool based on whether they support the sources you actually have! You may need to have a few more conversations with stakeholders to figure out what sources you might need to accommodate later.)

Whatever you do, the key thing is to get as much relevant data into the warehouse as you can without spending obscene amounts of time on it.

Modelling

Remember, the point of landing data in your warehouse in close-to-raw form and doing the transformation in situ with SQL is that you can always quite easily change how the transformation is done. So don’t sweat this bit too much.

You can do the first version of your first model with SQL views or scheduled queries (if your warehouse supports them and if it’s easier than setting up DBT). Or you can set up DBT (alternatives are available) and take a little longer initially, in exchange for less messing around later.
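To give a flavour of how simple that first version can be, here is a sketch of a ‘sales’ model defined as a plain view over a raw landed table (every name below is invented for illustration):

    -- a first-pass model: a view in the warehouse over a raw source table
    -- (dataset, table, and column names are hypothetical)
    CREATE OR REPLACE VIEW analytics.sales AS
    SELECT
      id AS sale_id,
      customer_id,
      product_id,
      CAST(amount AS NUMERIC) AS amount,
      DATE(created_at) AS sold_on
    FROM raw_payments.charges
    WHERE status = 'succeeded'

A scheduled query that materialises the same SELECT into a table does the job just as well; DBT mostly gives this SQL a tidier, version-controlled home once you have more than a handful of models.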

It’s best to create models via a few separate steps. DBT best practices suggest (for good reason) that you consider creating data models in three stages (a sketch of the first stage follows this list):

  1. base/staging models apply light cleaning and tidying to your incoming source tables (e.g. renaming columns, removing invalid rows and irrelevant columns, changing data types by parsing dates etc.)
  2. intermediate models apply key transformations (combining related base tables into fewer, less-normalised tables, pivoting base tables with key/value columns, etc.)
  3. fact and dimension tables are built by combining intermediate models into a form that matches the business understanding of the data
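As a sketch of the first stage, a staging model might look something like this (the source, table, and column names are invented; only the shape matters):

    -- models/staging/payments/stg_payments__sales.sql (hypothetical path and names)
    -- light cleaning only: rename, recast, and drop what we don't need
    SELECT
      id AS sale_id,
      customer AS customer_id,
      product AS product_id,
      CAST(amount_cents AS NUMERIC) / 100 AS amount,
      SAFE_CAST(created AS TIMESTAMP) AS sold_at
    FROM {{ source('payments', 'charges') }}
    WHERE status = 'succeeded'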

Here is an updated diagram of the full system, reflecting this multi-stage transformation:

[Diagram: data flows from source systems, via pipelines, into the warehouse; there it is transformed into data models in stages (raw to base to intermediate to models) by a data transformation tool, before being read out of the warehouse into a business intelligence tool.]

If you do it this way, you will tend to find that you spend less time copy/pasting SQL, because you can reuse the base and intermediate models.

Remember, dimension tables need a unique ID. If there isn’t an existing unique ID, a surrogate key should be created. This is done by hashing (combinations of) immutable attributes that together uniquely identify the entity represented by the table.

(for this purpose, I found the generate_surrogate_key macro provided by the dbt-utils package invaluable)
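In practice that looks something like this sketch of a customers dimension (the model and column names are made up; the macro call is the point):

    -- building a surrogate key for a customers dimension with dbt-utils
    -- (model and column names are hypothetical)
    SELECT
      {{ dbt_utils.generate_surrogate_key(['source_system', 'source_customer_id']) }} AS customer_key,
      source_system,
      source_customer_id,
      email,
      country
    FROM {{ ref('int_customers') }}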

An example

It may help to recap what we’ve covered so far with an example. Let us imagine we have three source systems holding relevant data:

  1. SQL database (holds customer and product data)
  2. CRM system (holds customer data)
  3. Payment provider (holds customer, product, and sales data)

We then set up off-the-shelf data pipelines and land tables from these three systems in our warehouse. We call these ‘source tables’.

Next we define DBT models:

  • the base models simply capture the source tables, with irrelevant data dropped, consistent naming conventions applied, and some parsing of data implemented; models are grouped by source system
  • the intermediate models combine data from different sources and denormalise the SQL schema as needed, executing joins to create tables related to customers, products, and sales
  • finally, three dimension tables (customers, products, dates) and one fact table (sales) are assembled

These DBT models are then materialised on schedule in the warehouse.
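To round off the example, the fact table ends up as little more than a join across the intermediate models, swapping natural keys for the dimensions’ surrogate keys. A hypothetical sketch:

    -- models/marts/fct_sales.sql (hypothetical path and names)
    SELECT
      sales.sale_id,
      customers.customer_key,
      products.product_key,
      sales.sold_at,
      sales.amount
    FROM {{ ref('int_sales') }} AS sales
    LEFT JOIN {{ ref('dim_customers') }} AS customers
      ON sales.customer_id = customers.source_customer_id
    LEFT JOIN {{ ref('dim_products') }} AS products
      ON sales.product_id = products.source_product_id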

Analysis

The above setup ensures that

  • reasonably fresh data will always be available
  • data will be in a form that matches the business understanding of the data
  • there is a single source-of-truth for the data

At this point you want to load the data into your data visualisation tool and bring your stakeholder back into the room — they are the person best-placed to interrogate and take action based on the data, after all.

The best outcome at this stage is that

  1. you are able to quickly answer many of their immediate questions
  2. you can pick a few charts that will make a good dashboard (much like a car dashboard, a good business dashboard is one whose contents are readily actionable: think ‘refill engine oil’, not ‘inspiring quotes on a stock photo’)
  3. you leave with a big pile of new questions that your model can’t answer (yet)

That last one is important. I entitled this section ‘a bold advance’ because we leapt forward, delivering a specific piece of value in one narrow ‘vertical’.

[Diagram: the systems stacked in layers, with business intelligence at the top, the warehouse and transformation layer in the middle, and pipelines/source systems at the bottom. Arrows show information flowing up into the BI layer; several source systems remain untapped, and there is room for plenty more data models to be built.]
Building out a vertical. There’s plenty of room to grow sideways.

The thing to do next is to work with your stakeholder to iterate your models, to grow them out horizontally — i.e. add new pipelines for other source tables, expand the models, add new facts and dimensions, create new charts and reports.

The future

In the short term, you now have business-ready definitions of customers, products, and sales. It seems likely you can reuse those later. We often refer to a collection of related data models as a ‘data mart’ (figuratively: a market stall for a certain kind of data). This is what you build via that process of horizontal iteration.

You should be sure to spend at least some time with your first stakeholder, iterating and growing their models — you want to ensure they are happy with what you’ve delivered!

That said, you need to follow the 80/20 principle: 20% of the effort needed to make it perfect will get your stakeholder 80% of the available value. Eventually you need to move on to other areas where there is high-urgency, high-impact work to be done.

In the medium term, you can start talking to other stakeholders — you now have a case-study for what you can do for them, and the infrastructure in place to do it faster. You can follow the same prioritisation process as before.

And then the trouble will start.

What will go wrong?

If you’ve followed the plan so far, you’ve put yourself in a good position. Multiple areas of the business are now being served high-quality, consistent data that helps your colleagues to be more effective at their jobs. So much so that they now rely on that data and on the reports and dashboards you’ve built.

So what happens when it breaks?

There are a bunch of reasons why this can happen. Broadly speaking, they fall into three categories:

  • change to the source data
  • infrastructure failure
  • data modelling bug/failure to understand the source data

Different sorts of problem require different sorts of tactics.

Change to the source data

A data pipeline is only as good as the source. There are lots of potential problems lurking:

  • expired access credentials
  • dataset moves/is renamed
  • dataset schema changes (columns deleted/renamed, data type changes)
  • previously unencountered values added to dataset (that your pipeline — or BI tool! — can’t handle)

If your pipeline is a ‘pure’ pipeline that doesn’t do any nonessential transformation, you at least decouple potential data modelling problems from the above.

How can you fix these problems? Usually prevention is better than a cure, but you’ll be called on to do both.

Prevention: talk to the people in charge of the source systems! You may be able to protect the source system from unexpected changes through a combination of access controls and processes, but you need to make sure the stewards of those systems understand why those controls and processes are important. Try to get them to involve you at the planning stage when there are changes to be made to the source.

You may even be able to get tests put into the system that run at the point-of-deployment and that ensure certain guarantees will be met (essentially, unit tests for data production).
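What such tests look like depends entirely on the source system, but as a hypothetical flavour: a check in the source team’s CI might run an assertion query like the one below and block the deployment if it returns any rows (the tables are invented):

    -- hypothetical deployment-time check in the source system's CI
    -- every charge must reference a customer that actually exists
    SELECT charges.id
    FROM charges
    LEFT JOIN customers
      ON charges.customer_id = customers.id
    WHERE customers.id IS NULL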

Cure: you want to avoid only finding out about a pipeline problem a week later, when someone realises their report is broken.

You should consider adding monitoring and alerting to your pipelines so that you get a warning as soon as it breaks. Many off-the-shelf tools come with such capabilities built in.

For a low-lift approach: DBT itself ships with data tests and source freshness tests. So if you’re using it (or a tool with similar capabilities) you can do the following, as sketched after this list:

  • check that your source datasets (as landed in your warehouse) are as fresh as you expect (this can catch pipeline problems)
  • check that the data matches your expectations (e.g. no null values, only certain categorical values in a column, only unique values in a column)
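Freshness expectations are declared against your source definitions and checked with the dbt source freshness command. Expectations about values can be expressed with the built-in generic tests (unique, not_null, accepted_values) or, for anything more bespoke, as a singular test: a SQL file that fails if it returns any rows. A hypothetical example:

    -- tests/assert_no_negative_sale_amounts.sql (hypothetical singular test)
    -- DBT treats any returned rows as failures
    SELECT
      sale_id,
      amount
    FROM {{ ref('fct_sales') }}
    WHERE amount < 0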

Infrastructure failure

There may be nothing wrong with the source data, but the infrastructure may just let you down.

I’ve tried to head this off by suggesting you avoid managing too much of it yourself (although vendors can also let you down …) — at least at first. Monitoring and alerting is the solution here, too.

Data modelling bug

Or alternatively, ‘failure to understand the source data’.

This occurs when

  • your pipelines deliver all the data to your warehouse on time without experiencing any failures
  • the automated checks you’ve put in place all pass, suggesting everything should be fine
  • but an angry stakeholder still comes to you with a broken report!

This can be related to the issue of encountering new, unhandled data values, if you didn’t exclude dangerous values via your tests.

Alternatively, maybe that source table doesn’t quite represent what you think it does, and maybe that only becomes clear under certain circumstances (is there really a 1–1 relationship between customers and delivery addresses?)

Prevention: take a good long look at that source system, chat to whoever owns it, and make sure you understand how it’s being used/how it will be used in future. In other words: it pays to put in effort when you’re building the models.

Cure: tweak the data model (tools like DBT make this easy) to reflect your new understanding. If you’re lucky, the multiple layers of transformation will allow you to keep the structure of your final data models the same, with the tweaks made to the base and intermediate models. If so, you won’t have to update your reports at all.
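For instance, if it turns out that customers can have several delivery addresses after all, you might adjust an intermediate model to keep only the most recent address per customer, leaving the dimension table and every report downstream of it untouched. A hypothetical sketch (model and column names invented):

    -- models/intermediate/int_customer_addresses.sql (hypothetical)
    -- keep one row per customer: the most recently updated delivery address
    SELECT
      customer_id,
      delivery_address
    FROM (
      SELECT
        customer_id,
        delivery_address,
        ROW_NUMBER() OVER (
          PARTITION BY customer_id
          ORDER BY updated_at DESC
        ) AS address_rank
      FROM {{ ref('stg_crm__customer_addresses') }}
    )
    WHERE address_rank = 1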

The future (redux)

To recap:

  • short term: iterate, build out a data mart
  • medium term: engage new stakeholders and start building marts for them; start implementing data management tooling and processes

What about the long term?

If all goes well, you should be able to keep building out new marts while providing reasonable guarantees about data availability.

Naturally, success brings new problems. As the organisation scales and the data models grow in number, you may encounter discoverability problems: how do people know what datasets are available, what the lineage of your data models is, or how they ought to query those models? And so on.

The way to mitigate this is simple: documentation. As you build your data systems and models, you should document them so that others can get up to speed quickly.

As well as centralised documentation, the tools you choose may well have some built-in ability to document aspects of the data. Some examples:

  • DBT comes with a built-in lineage view and a documentation UI
  • BigQuery now has a lineage view revealing where data came from
  • many BI tools allow you to add some level of documentation
  • GCP has a separate tool, Dataplex, which interoperates with BigQuery and lets you define data dictionaries and associate their entries with BigQuery tables

It’s important to try to keep on top of this as you go. At first, you’ll only need minimal documentation, but as your data models become more complex, the need will grow.

Conclusion

So there it is — a large number of lessons learnt (some painfully) distilled into three chunky articles.

We’ve covered data modelling theory and why it’s worth doing, the modern infrastructure that lets a small team put it into practice, and a pragmatic way to combine the two: delivering value one vertical at a time, protecting that value when things break, and documenting it all as you grow.

This series has by no means covered everything — it was tough to decide what to leave out — and nor should it try. In the end there’s no substitute for practical experience. However, I hope that these articles will point you in the right direction. Good luck on your own data journey!
