Data Modelling for Startups (Part II)

Paddy Alton
Apolitical Engineering
9 min read · May 15, 2023

This is the second article in a three-part series.

Introduction — data infrastructure

Over the last five years at Apolitical, the global platform for government, my team and I have had the chance to do something exciting: to build up the company’s data infrastructure and systems from scratch. I’m taking the time, in this trilogy of articles, to reflect on that experience.

In the first article I discussed

  • what data modelling is
  • why it’s worth doing
  • some of the thornier issues you’ll encounter in the real world

In short: I covered some essential information for those new to data modelling.

In this second part, I want to move on to talk about infrastructure. When you’re starting from scratch, what do you need to build in order to create data models as quickly as possible and start deriving benefit from them?

In the third and final article I will draw these two strands together, offering examples and practical advice.

Why are we here?

The goal of your efforts is to take the data the company collects, and use it to inform business decision-making so that the company can achieve its objectives.

If you come to this with nothing already in place, you might feel as though you’ve been given a plan like this:

  1. data in source databases
  2. ???
  3. business insights!

Let’s talk about how we fill out that bit in the middle.

Let me give you some advice…

Here’s the most important thing to remember: you need to deliver value continuously, from the beginning.

It’s no good planning a big project to set up the perfect system before you start helping the business. The company needs to see value in weeks, not months.

Fortunately, modern tools give us a way to achieve this.

The modern data stack

You’re going to hear a lot about ‘the modern data stack’, and for good reason — new approaches allow small teams to do so much more than would have been possible in the past.

But what is the modern data stack? Put simply, it is a collection of cloud-based tools that, working together, bridge that gap in your plan!

There are loads of options for the different parts of your stack, and some of those parts are optional. Here is what I think you need for a ‘minimum viable stack’:

[Diagram: data flows from source systems, via pipelines, into the data warehouse; a data transformation tool turns the raw data into data models inside the warehouse; a business intelligence tool reads those models out of the warehouse.]

This is the order in which I suggest you set them up:

  1. a data warehouse (this is the heart of the system)
  2. a pipeline tool (a way of loading data from your source systems into your warehouse)
  3. a business intelligence/data visualisation tool
  4. a data transformation tool (convert raw data into data models)

Let’s unpack that.

First, a note: I won’t tell you what the best options to use are, because I don’t know! Things move fast. Also, your organisation may have already made some technical choices that dictate what you do. But I will tell you what we use.

Data Warehouse

At Apolitical, we use Google BigQuery as our data warehouse. Most cloud providers have some sort of data warehouse solution: AWS has Redshift, Azure has Synapse, and GCP has BigQuery. Since we run on GCP, BigQuery was the natural choice.

Often, people will consider a third-party solution separate from what their cloud provider offers (e.g. Snowflake, Databricks) to avoid vendor lock-in and gain extra functionality; this tends to increase costs. In our case, BigQuery suits our needs.

The data warehouse is the first thing you set up, because you need a central location to store all your data. The idea is this: by combining data from different source systems, you make it much easier to analyse. Combining data from, say, a backend database and a CRM tool becomes a matter of a SQL join between two tables in the warehouse, rather than a complex script that pulls directly from both sources.

A data warehouse should be optimised for crunching analytical queries. If you see the terms ‘columnar database’ or ‘OLAP database’ thrown about, that is all they mean: a database optimised for processing whole columns of data at once (e.g. computing aggregates).

(That’s the opposite tradeoff to a typical transactional, or ‘OLTP’, database, which is optimised for reading and writing individual rows.)
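To make that concrete, here’s the sort of analytical query the warehouse turns into a routine job: joining a table replicated from your backend database to one replicated from your CRM, then aggregating by month. (A minimal sketch: the dataset, table, and column names are hypothetical, and the SQL is written in BigQuery’s dialect.)

```sql
-- Hypothetical example: combine backend data with CRM data in the warehouse.
-- `backend.users` and `crm.contacts` stand in for tables your pipelines have loaded.
SELECT
  DATE_TRUNC(DATE(u.created_at), MONTH) AS signup_month,
  c.industry,
  COUNT(*) AS new_users
FROM backend.users AS u
LEFT JOIN crm.contacts AS c
  ON c.email = u.email
GROUP BY signup_month, c.industry
ORDER BY signup_month, c.industry;
```

Scanning a couple of columns across every row like this is exactly the workload a columnar warehouse is built for.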

You could self-host your data warehouse, but I’d advise against it. This isn’t the place to cut corners. You’ll create a maintenance nightmare because

  • all your other tools have to be compatible with your warehouse
  • you have to worry about things like storage capacity
  • it’s likely to be less robust than a fully-managed solution

In any case, last time I looked, BigQuery had a free tier allowing up to 10GB of storage and 1TB of query processing per month, which may be enough for your purposes.

Pipeline Tool

Once you’ve got somewhere to put your data, you need to get it in there.

Naturally, it’s not enough to do it once! You need a tool that will pull data out of your source systems and put it into your warehouse either regularly or continuously.

Continuous data ingestion is ideal, but often (not always) more complicated than loading batches of data into the warehouse on a regular schedule. Have a quick think about your business requirements: often it’s enough for data to be at most a day old, in which case a daily load job is sufficient.

There are a few approaches that work well for startups (in my order of preference):

  • built-in connections between source systems and warehouses
  • an off-the-shelf solution (fully managed or self-hosted)
  • ‘serverless’ code set up to listen to a webhook
  • simple scripts running in a cloud scheduler

I’ve used all of these approaches at various times, but most recently we adopted Airbyte, which falls into the second category. I would recommend something like Airbyte (or its competitors, such as Fivetran) if

  • your data isn’t that ‘big’ (gigabytes, not terabytes)
  • your data is split across a lot of different sources
  • there isn’t a built-in connection between the source system and your warehouse

Naturally if you are paying for pipelines, your provider will have their own margin to consider. An advantage of an open-source system like Airbyte is that you can self-host (which is what we do). You’re still paying for compute (and for engineering time to set up and maintain the self-hosted version), but it can make economic sense to move to self-hosting later. The tradeoff of self-hosting is that you can’t expect any customer support.

An aside about data transformation

In a traditional data stack, your pipeline tool and transformation tool might be the same thing. And you might be tempted to take that approach, rather than set up two different systems.

This is especially true if you ended up writing your own pipeline code. Here is how it will happen: your data won’t be ready for use in its raw form, so you may think “I have this script that extracts data from source systems and loads it into the warehouse, and the script does a bit of tweaking for compatibility reasons … so why don’t I just tweak it a bit more?”

This will work! You absolutely can transform data inside your pipelines, executing joins, applying filtering, and so on to get it into a usable form. That’s ETL (extract, transform, load).

However, I can tell you right now that eventually you’ll have to change how those transformations work. It’s a fact of life. And when you do make that change, you may find it harder than you’d like to apply the transformation to all your historic data. For example, what if you load a batch of data every day? All the old batches will have had the old logic applied. You’d need to reload all of them with the new transformation logic — a backfill.

That’s why the ‘modern’ approach (ELT: extract, load, transform) is to land the data in a close-to-raw form and transform it inside the data warehouse. You don’t necessarily need a special tool to do that, though! A warehouse such as BigQuery allows you to

  • define views, which are essentially saved SQL queries masquerading as tables
  • create saved queries that run on a schedule (and which can do things like create/overwrite tables)

This means you have built-in ways to use SQL to transform raw data into useful models. I’m going to go out on a limb and suggest you may even want to do it that way, at least for a few weeks or months.
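For example, either mechanism can turn the raw signups data from the earlier sketch into a reusable model. (Again, the names are hypothetical and the DDL is BigQuery-flavoured; a scheduled query is configured through BigQuery’s scheduling feature, with a statement like the second one below as its body.)

```sql
-- Option 1: a view. The SQL runs afresh every time someone queries the view.
CREATE OR REPLACE VIEW analytics.monthly_signups AS
SELECT
  DATE_TRUNC(DATE(created_at), MONTH) AS signup_month,
  COUNT(*) AS new_users
FROM backend.users
GROUP BY signup_month;

-- Option 2: a saved query run on a schedule, rebuilding a physical table each time.
-- Re-running it after changing the logic doubles as a painless backfill.
CREATE OR REPLACE TABLE analytics.monthly_signups_table AS
SELECT
  DATE_TRUNC(DATE(created_at), MONTH) AS signup_month,
  COUNT(*) AS new_users
FROM backend.users
GROUP BY signup_month;
```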

But eventually you will run into the same problem that everyone does: the colossal tower of views collapses into spaghetti. More on that in a moment, when I talk about transformation tools …

Business intelligence tool

My suggestion here is that you start out with something free, if you can.

In the early days, we used Colab notebooks to create analyses and visualisations. We still use this for our data science work: shareable Python code that runs the same for everyone, that can include documentation, and that can be version-controlled? It’s a good option.

However, you will find yourself copying a lot of code between notebooks, you need people who can write Python, and many less-technical colleagues are not going to enjoy being sent a sharing link for a notebook.

For a long time we used Looker Studio (née Google Data Studio), because

  • it’s a lot friendlier for less-technical colleagues
  • it integrates out of the box with BigQuery, so getting started was fast

For example, you can write custom SQL to pull data into the report, then (when you’re happy with the report) put that SQL into a BigQuery view and connect to that instead. The view can then be reused.

These days, we use ThoughtSpot — a self-serve business intelligence tool. Self-serve tools such as ThoughtSpot (or Lightdash if you want an open-source, self-hostable alternative) are certainly an interesting approach for data-hungry startups … but make sure your data are well-modelled (see the other articles in this series!) before you even think about adopting one.

Data transformation tool

So, why do you need this? Let’s say you took my advice so far. You

  • set up a fully-managed data warehouse
  • connected some of your sources directly and implemented managed pipelines for the others
  • hooked up a data visualisation tool that suits your needs
  • started building SQL views/saved queries to turn your raw data into a form that works for your purposes

The problem is, that last step is ongoing. You get new requests for data all the time. Soon there’s a tonne of views: SQL is duplicated everywhere, everything’s interdependent and hard to change, and if you make a bad change it’s hard to fix.

But this should be a solved problem! All of those issues apply to regular software, and we have tools and processes that address them. This is where data transformation tools come in.

At Apolitical, we use DBT to

  • define our data models using (templated) SQL captured in version control
  • materialise all our data models as tables on a regular schedule

(We could have used BigQuery views instead of tables to materialise our models. Because views execute the SQL in their definition at the point you query them, this would improve data freshness but impact query latency; for us, the latter is more important.)
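To give a flavour of what a DBT model looks like, here is a minimal sketch (the file path, model name, and the staging models it references are hypothetical; the {{ config }} and {{ ref }} templating is standard DBT):

```sql
-- models/marts/monthly_signups.sql: a hypothetical DBT model
-- Materialised as a table, rebuilt whenever the scheduled DBT run executes.
{{ config(materialized='table') }}

SELECT
    DATE_TRUNC(DATE(u.created_at), MONTH) AS signup_month,
    c.industry,
    COUNT(*) AS new_users
FROM {{ ref('stg_users') }} AS u
LEFT JOIN {{ ref('stg_crm_contacts') }} AS c
    ON c.email = u.email
GROUP BY signup_month, c.industry
```

Because each model points at its upstream dependencies with ref(), DBT can infer the dependency graph, build everything in the right order, and tell you what is affected when you change a model: exactly the antidote to the tower-of-views problem.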

DBT has an advantage for startups in that it’s another of those tools where the fully-managed platform can be replaced by a self-hosted version (in which case you pay instead in compute, engineering time, and the lack of customer support). Increasingly, you will find that BI tools are designed to integrate with DBT as well as your warehouse, which can reduce friction.

I recommend that if you do adopt DBT, you take some time to read their guide to structuring a DBT project. Remember, we’re trying to impose order on chaos here! You may also want to read their blog on data modelling with DBT, which pairs well with the first article in this series.

How does that stack up?

There are a bunch of choices to be made here, mostly around cost vs. speed.

If you’re willing to pay for the fully-managed versions of all these services, you could have the whole stack up and running (in a minimal way) in less than a week!

You can still move pretty fast, while spending less money, if you have the technical skills to self-host the various parts of the stack — but bear in mind that

  • the stack is modular
  • the stack is only useful when you have all the parts in place
  • your time is also money

So, if you can afford it, it’s better to get things up and running quickly and then switch out the most expensive parts as you go along. If you’re worried about costs, you may wish to focus on open-source options that you could move to self-hosted if need be.

Conclusion

In the first article in this series, I explained why data modelling is important and covered the theory behind it. All the theory in the world is useless if you don’t have a way to put it into practice. In this article I’ve focused on how you can get into a position to run an effective business intelligence operation as quickly as possible, without compromising on the basic approach.

In the next article, I will draw these two threads together, showing you how you can deliver value quickly and continuously using the modern data stack and data modelling.
