DBT: A new way to transform data and build pipelines at The Telegraph
The Telegraph is a 164-old-company where data has always had a central role. With the advent of the cloud and the need to build a platform able to process a huge quantity of data in 2015, we started to build our big data platform. We decided to use Google Cloud and, since delivering our first PoC, over the years we have kept improving our platform to better support the business.
During the last 4 years, I had multiple discussions on how to handle data transformation or, more extensively, (ETL) Extract Transform and Load processes. The number of tools that you can choose on the market is overwhelming and committing to the wrong technology could have a negative impact on your capabilities to effectively support different business units and drive decisions that are based on reliable figures.
At The Telegraph, the datalake has been built on the top of Cloud Storage and BigQuery and, according to Google, the natural choice to perform ETL or ELT should be Dataflow (Apache Beam). For most companies, this might be true. But when you go outside the general use-cases exposed in the “getting started” guides and you start to relate with real-world challenges, what is supposed to be an easy choice might be not so easy.
In our case, adopting Apache Beam has proven not to be the easiest solution for the following reasons:
- Java SDK is much more supported than Python SDK and most of the members of our team have already large expertise in Python, but they are not proper Java developers. Also, our data scientists only work in Python and it would mean having codebases in multiple languages to support, making it hard to rotate engineers across different projects.
- Most of our batch data pipelines are structured to be just a series of queries to run against BigQuery. Considering this, Apache Beam doesn’t add much value on the ETL process.
- Dataflow connects really well with Google products, but in 2015 the number of connectors was limited and we needed to interact with AWS, on-premise servers etc.
- Our analysts and data scientists tend to speak in SQL and it’s much easier to collaborate with them if, in engineering, we don’t need to translate the SQL logic that they are producing in Java or Python.
As a side note, we adopted Apache Beam in a second phase, but only for real-time data pipelines. In fact, on this specific case having a windowing strategy and being able to perform operations on a stream of records was paramount for the success of certain projects.
The second product you might be tempted to use if you are using Google Cloud Platform (GCP) might be Dataproc. If you already have a Spark cluster or Hadoop in place and you want to migrate on GCP, it would make perfect sense to consider this option. But in our case, we only had a small Hadoop cluster and to rewrite the logic of the pipelines that were running there was not a problem.
The third product that we considered and even used for a while is Talend (free version). If your company wants to fully commit to Talend and buy its enterprise version then it is a great choice, but if you don’t have a strong enough case and you decide to adopt the free version, you might face some of the following challenges:
- It is hard to apply version control to your data pipelines.
- You have to come up with your own CI/CD and the testability of your artifacts is limited.
- You have to rely on components provided by the community that might become outdated and eventually not supported. Or you might decide to develop your own components committing resources on building and keeping them up to date.
- You have to find Talend experts who are able to drive the development, enforcing best practices to produce high-quality pipelines.
- If the decision to adopt the tool to perform ETL is not shared with the entire company you could end with a scattered set of technologies that you have to maintain in a few years that are performing similar tasks.
For the reasons above we considered building our own Python ETL library as a wrapper of functionalities provided by Google and AWS in order to make our life easier when interacting with the main services of both clouds. Even this approach has proved to be far from perfect. Due to the effort required to design and develop our own library and all the maintenance required to keep it updated and include new features, we started to look for something that could integrate well with this approach and reduce the scope of the library.
In June 2019, we started to test DBT for the transformation part with the idea of continuing to perform Extraction and Load using the Python library and relying on Apache Beam for real-time data processing.
What is DBT?
DBT (Data Building Tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses simply by writing select statements.
DBT performs the T (Transform) of ETL but it doesn’t offer support for Extraction and Load operations. It allows companies to write transformations as queries and orchestrate them in a more efficient way. There are currently more than 280 companies running DBT in production and The Telegraph is among them.
DBT’s only function is to take code, compile it to SQL, and then run against your database.
Multiple databases are supported, including:
DBT can be easily installed using pip (the Python package installers) and it comes with both CLI and a UI. DBT application is written in Python and is opensource, which can potentially allow any customization that you might need.
The CLI offers a set of functionalities to execute your data pipelines: run tests, compile, generate documentation, etc.
The UI doesn’t offer the possibility to change your data pipeline and it is used mostly for documentation purposes.
In the image below you can see how data lineage of a certain table is highlighted in DBT UI. This helps to quickly understand which data sources are involved in a certain transformation and the flow of the data from source to target. This type of visualisation can facilitate discussion with less technical people who are not interested in the detailed implementation of the process, but want an overall view.
Initialising a project in DBT is very simple; running “dbt init” in the CLI automatically creates the project structure for you. This will ensure that all engineers will work with the same template and thereby enforces a common standard.
DBT also offers the maximum flexibility and if, for some reason, the project structure produced doesn’t fit your needs, it can be customised by editing the project configuration file (dbt_project.yml) to rearrange folders as you prefer.
One of the most important concepts in DBT is the concept of model. Every model is a select statement that has to be orchestrated with the other models to transform the data in the desired way. Every model is written using the query language of your favourite data warehouse (DW). It can also be enriched using Jinja2, allowing you to:
- Write tidier parameterised queries.
- Encapsulate any functionality that you might want to recycle in macros that can be used as functions in your queries.
- Hide the complexity of your transformation to allow the reader to focus on the logic itself.
Below an example of a model using BigQuery Standard SQL syntax.
In this specific case, Jinja is used to inject in the output a set of technical rows that are not present in the source. Jinja allows iteration on the matrix of values and reshapes each row in a way that can be included in the main table, enabling you to keep your query more concise.
The output of each model can be stored in different ways, depending on the desired behavior:
- Materialise a table — full refresh
- Append to a table — incrementally build your output
- Ephemeral — the output will not be stored in your DW but it can be used as a data source from other models.
Every DBT model can be complemented with a schema definition. This means that the documentation lives in the same repository as your codebase, making it easier to understand every step of what has been developed. Also, since the documentation is always in front of the engineers who are working on the pipelines and since it is directly generated from the codebase, minimal maintenance is required to keep it up to date.
In data-driven companies, ensuring the quality of the data delivered to the business is always of paramount importance. DBT helps to serve high-quality data, allowing you to write different typologies of tests to check your data.
Simple tests can be defined using YAML syntax, placing the test file in the same folder as your models.
For this specific example, the defined tests are:
- sk_interaction, bk_source_driver are both assuming unique values and are never null
- count_interactions is never null
- fk_interaction_detail is not null, assumes unique values and all the foreign keys fk_interaction_detail are able to join with the surrogate keys sk_interaction_detail in the dimension interaction detail. This is called referential integrity test and helps to ensure that your star schema has been built in the proper way.
- fk_interaction_text has similar test criteria.
- Performance_band can assume only the defined array of values.
More advanced testing can be implemented using SQL syntax.
The query below ensures that the “bk_source_driver” field from model “fact_interaction” doesn’t have more than 5% of the values set as NULL.
Models on DBT rely on the output of other models or on data sources. Data sources can be also defined using YAML syntax and are reusable and documentable entities that are accessible in DBT models.
In the example below, you can see how it is possible to define a source on top of a daily sharded BigQuery tables. It is also possible to use variables to dynamically select the desired shard. In this specific case, the “execution_date” variable is passed in input to DBT and defines which shards are used during the transformation process.
DBT offers also the possibility to write your own functions (Macros) these can be used to simplify models but also create more powerful queries adding more expressive power to your SQL without sacrificing readability.
The macro in the example below is used to unite multiple daily shards of the same table depending on the “execution_date” variable passed and the number of past shards we want to take into consideration.
The Telegraph’s data engineering team has tested DBT (Core version) for the past two months and it’s proved to be a great tool for all of the projects that required data transformation. As a summary of our experience, here is a list of the tool’s pros and cons.
- It is Opensource and open to customization.
- It is easy to apply version control
- The documentation lives with your DBT project and it is automatically generated from your codebase.
- It doesn’t require any specific skills on the jobs market. If your engineers are familiar with SQL and have a basic knowledge of Python, that’s enough to approach DBT.
- The template of each project is automatically generated running DBT init. This enforces a standard for all of our data pipelines.
- All of the computational work is pushed towards your DW. This allows you to attain high performance when using a technology similar to BigQuery or Snowflake.
- Because of the point above, orchestrating a DBT pipeline requires minimal resources.
- It allows you to test your data (schema tests, referential integrity tests, custom tests) and ensures data quality.
- It makes it easier to debug complex chains of queries. They can be split into multiple models and macros that can be tested separately.
- It’s well documented and the learning curve is not very steep.
- SQL based; it might offer less readability compared with tools that have an interactive UI.
- Lack of debugging functionalities is a problem, especially when you write complex macros.
- Sometimes you will find yourself overriding DBT standard behavior, rewriting macros that are used behind the scenes. This requires an understanding of the source code.
- The UI is for documentation-only purposes. It helps you to visualise the transformation process, but it’s up to your data engineers to keep the DBT project tidy and understandable. Having an interactive UI that allows you to visually see the flow of the pipeline and amend the queries can be helpful, especially when it comes to complex data pipelines.
- Documentation generation for BigQuery is time-consuming due to a poor implementation that scans all of the shards inside a dataset.
- It covers only the T of ETL, so you will need other tools to perform Extraction and Load.
Stefano Solimito is a Principal Data Engineer at The Telegraph. You can follow him on LinkedIn.