The SVT Data Platform Journey

Stefanie D'Angelo
Published in The SVT Tech Blog
5 min read · Jun 1, 2022

Background
In the spring of 2020, SVT stopped using Adobe Analytics and replaced it with Helix, a new data platform built in-house.
Helix was originally built for two main reasons:

  • SVT wanted to work in a more data-driven way across the whole organisation, and Adobe was not flexible enough.
  • Adobe's pricing model did not scale well with the increasing number of requests sent to track different aspects of SVT's products.
Helix is built and maintained by the SVT DataOps team.

Early versions of Helix were primarily focused on collecting data in a reliable way, and the system was quite robust. To transform the collected data into business entities we used SQL, executed directly by Airflow, a tool for managing and scheduling workflows. The biggest challenge was (and still is) how to make the data easily available to everyone who wants to draw conclusions from it.

We started out by creating aggregates for the most common analysis needs from the raw event data sent by clients. There were a lot of demands from different stakeholders, and since we wanted to help them we implemented new aggregates at a very fast pace, overlooking structure, reusable models and data quality.

The development cycle was slow because of Airflow and the difficulty of testing the code locally. The aggregates were hard for users to understand, and the DataOps team received more and more questions about the data, which took a lot of our time to answer.

We now needed to:

  • Create good documentation in order to free up development time.
  • Shorten the development cycle.
  • Build reusable data models and add data quality tests.

dbt to the rescue

We started to look at different tools to address these problems and soon found dbt. dbt (data build tool) is a command-line tool based on SQL that enables data analysts and engineers to transform data in their warehouses more effectively. Since we were using Google BigQuery, we also considered Dataform, a tool quite similar to dbt that was acquired by Google, but it proved too immature for us to use at the time.

With dbt you are encouraged to document your SQL models in YAML files. Given these YAML files, dbt can automatically generate good documentation. The generated documentation serves as a data dictionary and visualizes data lineage as a graph of dependencies between models and the sources the data originates from. As the documentation is kept close to the code, it is easier for developers to keep it up to date.
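As an illustration, a model documentation file might look something like the sketch below. The model and column names are hypothetical, not taken from our project, but the YAML structure is the one dbt expects.

```yaml
# models/staging/schema.yml: hypothetical example of documenting a staging model
version: 2

models:
  - name: stg_video_starts          # hypothetical model name
    description: "One row per started video playback, cleansed and renamed from the raw client events."
    columns:
      - name: video_id
        description: "Identifier of the started video."
      - name: started_at
        description: "UTC timestamp when playback started."
```

Running dbt docs generate builds the documentation site from these files, and dbt docs serve lets you browse the data dictionary and the lineage graph locally.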

The development cycle was shortened because:

  • dbt is much more lightweight and easier to run locally than Airflow.
  • Since dbt is based on plain SQL and automatically adds the logic for materializing results as tables or views, we could develop data models in less time than with Airflow, its BigQuery operators and raw SQL (see the sketch after this list).
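As a minimal sketch (with hypothetical model and column names), a dbt model is just a SQL SELECT statement, and a one-line config tells dbt how to materialize the result:

```sql
-- models/stg_page_views.sql (hypothetical example)
-- dbt wraps this SELECT in the DDL needed to create or replace the table in BigQuery.
{{ config(materialized='table') }}

select
    user_id,
    page_url,
    timestamp_millis(event_timestamp) as viewed_at
from {{ source('helix', 'raw_page_views') }}  -- assumes a source definition with these names
```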

We also found big benefits from:

  • Adding dbt data quality tests and sending alerts in case of test failures.
  • Having one main dbt project that keeps all SQL files in one place.
  • dbt's Jinja templating and macro capabilities for reusing SQL code snippets (see the sketch after this list).
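A macro can be as small as a reusable snippet for a common expression. The example below is a hypothetical sketch rather than one of our actual macros:

```sql
-- macros/event_date.sql (hypothetical): extract a UTC date from an epoch-milliseconds column
{% macro event_date(ts_millis_column) %}
    date(timestamp_millis({{ ts_millis_column }}), 'UTC')
{% endmacro %}
```

A model can then use {{ event_date('event_timestamp') }} instead of repeating the expression in every SQL file.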

Time for refactoring

After using dbt for a year or so, and after migrating more data models into it, the project had grown considerably in size and we started to notice an increasing maintenance burden. It was hard to recognize patterns and similarities between different data pipelines.

To make it easier to add new features and maintain the code, we decided it was time to define a more formal structure for all our data transformation pipelines. Following the “Separation of Concerns” principle, we wanted to divide each pipeline into clear, intermediate steps.

By materializing some key steps as partitioned and clustered tables, we would be able to sample data during, for example, troubleshooting or data exploration, making it both faster and cheaper. Cheaper, because what Google bills you is heavily influenced by the total number of bytes BigQuery processes to answer your SQL queries. A sketch of such a configuration follows below.
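With dbt's BigQuery adapter this is mostly a matter of configuration. The sketch below uses hypothetical model and column names; partition_by and cluster_by are the options the adapter provides:

```sql
-- Hypothetical model materialized as a partitioned and clustered BigQuery table.
{{ config(
    materialized='table',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['product', 'event_name']
) }}

select * from {{ ref('stg_events') }}  -- hypothetical staging model
```

A query that filters on event_date then only scans the matching partitions, which is exactly what keeps the number of processed bytes, and thereby the bill, down.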

Design decisions

The maintainers of dbt have written an article on how they structure their dbt projects. Inspired by it, we decided to divide our code into three layers with separate responsibilities.

  • Source — the source data, untouched: "raw" data in the exact same format used by the system or client it originates from.
  • Staging — here we cleanse, rename, cast and enrich the data.
  • Modelling — here we model the data into business-related concepts.
DataOps SVT dbt project structure

The modelling layer should only reference staging models. This makes the code easier to reuse and thereby less error-prone. We also decided on a naming standard: every staging model is named stg_<model> and every source model src_<source>. This will hopefully make it easier to understand what to expect of the data when navigating it in BigQuery. A sketch of what the layering can look like follows below.
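To make the layering and naming concrete, here is a hypothetical sketch of a staging model and a modelling model. The names are made up but follow the naming standard:

```sql
-- models/staging/stg_video_starts.sql (hypothetical): cleanse, rename and cast the source data
select
    cast(videoId as string)          as video_id,
    timestamp_millis(eventTimestamp) as started_at,
    lower(platform)                  as platform
from {{ ref('src_video_events') }}   -- hypothetical source model
```

```sql
-- models/modelling/video_consumption.sql (hypothetical): only references staging models
select
    video_id,
    platform,
    count(*) as number_of_starts
from {{ ref('stg_video_starts') }}
group by video_id, platform
```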

Output

We have now refactored our dbt models so that they follow the defined structure. Benefits we gained along the way:

  • The code now follows our SQL style guide.
  • Every model has an associated documentation file that is up to date.
  • A lot of good discussions, which led to new learnings and knowledge sharing.
  • It is easier to write tests when the models have clearly defined responsibilities.
  • Our data lineage became much cleaner, making it easier to see dependencies between models and where the data comes from.
Data lineage for one of our main data models.

What we want to do next

  • Scale the dbt usage at SVT by letting other teams/departments create their own dbt projects that are scheduled by Helix.
  • Improve monitoring and alerting of data quality.
