DBT: A new way to transform data and build pipelines at The Telegraph

Stefano Solimito
Aug 23, 2019 · 10 min read

The Telegraph is a 164-year-old company where data has always played a central role. With the advent of the cloud and the need for a platform able to process huge quantities of data, in 2015 we started to build our big data platform. We decided to use Google Cloud and, since delivering our first PoC, we have kept improving the platform over the years to better support the business.

The challenge

During the last 4 years, I have had multiple discussions on how to handle data transformation or, more broadly, Extract, Transform and Load (ETL) processes. The number of tools to choose from on the market is overwhelming, and committing to the wrong technology can have a negative impact on your ability to effectively support different business units and drive decisions based on reliable figures.

At The Telegraph, the data lake has been built on top of Cloud Storage and BigQuery and, according to Google, the natural choice to perform ETL or ELT should be Dataflow (Apache Beam). For most companies, this might be true. But when you go beyond the general use cases covered in the “getting started” guides and start dealing with real-world challenges, what is supposed to be an easy choice might not be so easy.

In our case, adopting Apache Beam has proven not to be the easiest solution for the following reasons:

  • The Java SDK is much better supported than the Python SDK, and most of the members of our team already have extensive Python expertise but are not proper Java developers. Also, our data scientists work only in Python, so adopting Beam would mean supporting codebases in multiple languages, making it hard to rotate engineers across different projects.

As a side note, we did adopt Apache Beam at a later stage, but only for real-time data pipelines. In that specific case, having a windowing strategy and being able to perform operations on a stream of records was paramount to the success of certain projects.

The second product you might be tempted to use on Google Cloud Platform (GCP) is Dataproc. If you already have a Spark or Hadoop cluster in place and want to migrate to GCP, it makes perfect sense to consider this option. But in our case, we only had a small Hadoop cluster, and rewriting the logic of the pipelines running there was not a problem.

The third product that we considered and even used for a while is Talend (free version). If your company wants to fully commit to Talend and buy its enterprise version then it is a great choice, but if you don’t have a strong enough case and you decide to adopt the free version, you might face some of the following challenges:

  • It is hard to apply version control to your data pipelines.

For the reasons above, we considered building our own Python ETL library as a wrapper around the functionality provided by Google and AWS, in order to make our life easier when interacting with the main services of both clouds. Even this approach proved to be far from perfect. Due to the effort required to design and develop our own library, and all the maintenance needed to keep it updated and add new features, we started looking for something that could integrate well with this approach and reduce the scope of the library.

In June 2019, we started to test DBT for the transformation part with the idea of continuing to perform Extraction and Load using the Python library and relying on Apache Beam for real-time data processing.

What is DBT?

DBT (Data Build Tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses simply by writing select statements.

DBT performs the T (Transform) of ETL but it doesn’t offer support for Extraction and Load operations. It allows companies to write transformations as queries and orchestrate them in a more efficient way. There are currently more than 280 companies running DBT in production and The Telegraph is among them.

DBT’s only function is to take code, compile it to SQL, and then run it against your database.

Multiple databases are supported, including:

  • Postgres
  • BigQuery

DBT can be easily installed using pip (the Python package installer) and comes with both a CLI and a UI. The DBT application is written in Python and is open source, which potentially allows any customisation you might need.

The CLI offers a set of functionalities to execute your data pipelines: run tests, compile, generate documentation, etc.
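For reference, installing DBT and running the most common CLI commands looks roughly like this (the pip package name may vary with the DBT version and the adapter you use):

    # Install DBT with pip
    pip install dbt

    # Execute the models in the project
    dbt run

    # Run schema and data tests
    dbt test

    # Compile models to raw SQL without running them
    dbt compile

    # Generate and serve the documentation UI
    dbt docs generate
    dbt docs serve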

The UI doesn’t offer the possibility to change your data pipeline and it is used mostly for documentation purposes.

The DBT UI highlights the data lineage of a given table. This helps to quickly understand which data sources are involved in a certain transformation and how the data flows from source to target. This type of visualisation can facilitate discussion with less technical people who are not interested in the detailed implementation of the process, but want an overall view.

Initialising a project in DBT is very simple; running “dbt init” in the CLI automatically creates the project structure for you. This ensures that all engineers work from the same template, thereby enforcing a common standard.
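As a rough sketch, the generated layout looks along these lines (the exact folders vary between DBT versions; “my_project” is a placeholder name):

    dbt init my_project

    my_project/
    ├── dbt_project.yml    # project configuration
    ├── models/            # SQL models (and their schema.yml files)
    ├── macros/            # reusable Jinja macros
    ├── tests/             # custom data tests
    └── analysis/          # ad-hoc analytical queries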

DBT also offers maximum flexibility: if, for some reason, the generated project structure doesn’t fit your needs, it can be customised by editing the project configuration file (dbt_project.yml) to rearrange folders as you prefer.
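A minimal dbt_project.yml might look like the sketch below; the project and profile names are placeholders, and only a handful of the available settings are shown:

    name: 'telegraph_dw'            # placeholder project name
    version: '1.0'
    profile: 'bigquery_profile'     # connection profile defined in profiles.yml

    models:
      telegraph_dw:
        staging:
          materialized: view        # models under models/staging become views
        marts:
          materialized: table       # models under models/marts become tables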

One of the most important concepts in DBT is the model. Every model is a select statement that has to be orchestrated with the other models to transform the data in the desired way. Every model is written using the query language of your favourite data warehouse (DW). It can also be enriched using Jinja2, allowing you to:

  • Write tidier parameterised queries.

Below is an example of a model using BigQuery Standard SQL syntax.
In this specific case, Jinja is used to inject into the output a set of technical rows that are not present in the source. Jinja allows iteration over a matrix of values and reshapes each row so that it can be included in the main table, keeping the query concise.
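A minimal sketch of what such a model can look like is shown below; the model, source and column names are illustrative assumptions rather than the actual Telegraph pipeline:

    {{ config(materialized='table') }}

    {# Technical rows that do not exist in the source data #}
    {% set technical_rows = [
        ('-1', 'UNKNOWN', 'Unknown interaction'),
        ('-2', 'N/A',     'Not applicable')
    ] %}

    select
        sk_interaction,
        bk_source_driver,
        interaction_description
    from {{ source('raw', 'interactions') }}

    {% for row in technical_rows %}
    union all
    select '{{ row[0] }}', '{{ row[1] }}', '{{ row[2] }}'
    {% endfor %}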

The output of each model can be stored in different ways, depending on the desired behaviour:

  • Materialise a table — full refresh
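The materialisation is set per model, either in dbt_project.yml or with a config block at the top of the model itself. A small sketch using an incremental materialisation, with placeholder table and column names:

    {{ config(materialized='incremental', unique_key='sk_interaction') }}

    select *
    from {{ source('raw', 'interactions') }}
    {% if is_incremental() %}
      -- on incremental runs, only process rows newer than what is already loaded
      where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}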

Every DBT model can be complemented with a schema definition. This means that the documentation lives in the same repository as your codebase, making it easier to understand every step of what has been developed. Also, since the documentation is always in front of the engineers who are working on the pipelines and since it is directly generated from the codebase, minimal maintenance is required to keep it up to date.

In data-driven companies, ensuring the quality of the data delivered to the business is always of paramount importance. DBT helps to serve high-quality data, allowing you to write different types of tests to check your data.

Simple tests can be defined using YAML syntax, placing the test file in the same folder as your models.

For this specific example, the defined tests are:

  • sk_interaction, bk_source_driver are both assuming unique values and are never null
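A schema.yml expressing this might look as follows; the model name and columns come from the examples in this article, while the descriptions are placeholders:

    version: 2

    models:
      - name: fact_interaction
        description: "Fact table of customer interactions"     # placeholder description
        columns:
          - name: sk_interaction
            description: "Surrogate key of the interaction"    # placeholder description
            tests:
              - unique
              - not_null
          - name: bk_source_driver
            description: "Business key of the source driver"   # placeholder description
            tests:
              - unique
              - not_null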

More advanced testing can be implemented using SQL syntax.

The query below ensures that the “bk_source_driver” field from model “fact_interaction” doesn’t have more than 5% of the values set as NULL.
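One way to express this check, assuming BigQuery Standard SQL and an illustrative file name, is a select statement that returns rows only when the threshold is breached, so that DBT reports a failure:

    -- tests/assert_bk_source_driver_mostly_not_null.sql
    -- Fails when more than 5% of bk_source_driver values are NULL
    with stats as (
        select
            countif(bk_source_driver is null) as null_count,
            count(*) as total_count
        from {{ ref('fact_interaction') }}
    )

    select *
    from stats
    where null_count / total_count > 0.05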

Models in DBT rely on the output of other models or on data sources. Data sources can also be defined using YAML syntax and are reusable, documentable entities that are accessible in DBT models.

In the example below, you can see how it is possible to define a source on top of daily sharded BigQuery tables. It is also possible to use variables to dynamically select the desired shard. In this specific case, the “execution_date” variable is passed as input to DBT and defines which shards are used during the transformation process.
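A sketch of what such a source definition can look like is below; project, dataset and table names are placeholders, and it assumes “execution_date” is passed to DBT (for example via --vars) in the YYYYMMDD format used by BigQuery shard suffixes:

    version: 2

    sources:
      - name: web_analytics                # placeholder source name
        database: my-gcp-project           # placeholder GCP project
        schema: raw_events                 # placeholder BigQuery dataset
        tables:
          - name: page_views
            # resolve the daily shard from the execution_date variable
            identifier: "page_views_{{ var('execution_date') }}"

A model can then refer to it with {{ source('web_analytics', 'page_views') }}, and the shard is resolved when the project is compiled.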

DBT also offers the possibility to write your own functions (macros). These can be used to simplify models, but also to create more powerful queries, adding expressive power to your SQL without sacrificing readability.

The macro in the example below is used to union multiple daily shards of the same table, depending on the “execution_date” variable passed and the number of past shards we want to take into consideration.
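A sketch of such a macro, with a hypothetical name and parameters, is shown below; it assumes the shards follow the usual BigQuery “<prefix>YYYYMMDD” convention and uses the datetime module that DBT exposes to Jinja:

    {# macros/union_daily_shards.sql -- hypothetical macro #}
    {% macro union_daily_shards(project, dataset, table_prefix, execution_date, num_shards) %}
        {% set base_date = modules.datetime.datetime.strptime(execution_date, "%Y%m%d") %}
        {% for i in range(num_shards) %}
            {% set shard = (base_date - modules.datetime.timedelta(days=i)).strftime("%Y%m%d") %}
            select * from `{{ project }}.{{ dataset }}.{{ table_prefix }}{{ shard }}`
            {% if not loop.last %} union all {% endif %}
        {% endfor %}
    {% endmacro %}

A model could then call {{ union_daily_shards('my-gcp-project', 'raw_events', 'page_views_', var('execution_date'), 3) }} to union the shard for the execution date and the two previous days.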

Conclusions

The Telegraph’s data engineering team has been testing DBT (Core version) for the past two months and it has proved to be a great tool for all of the projects that required data transformation. As a summary of our experience, here is a list of the tool’s pros and cons.

Pros:

  • It is open source and open to customisation.

Cons:

  • SQL based; it might offer less readability compared with tools that have an interactive UI.

If you are interested in practical tips to get the best out of DBT have a look at this series of articles:
Part 1: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-building-tool-dbt-part-1-8cfa21ef97c5
Part 2: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-2-a3581c76723c
Part 3: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-3-38cefad40e59

Stefano Solimito is a Principal Data Engineer at The Telegraph. You can follow him on LinkedIn.
