How To Build a Modern Data Pipeline

Patrik Braborec
GoodData Developers
13 min read · Aug 15, 2022

In my previous article, How to Automate Data Analytics Using CI/CD, I used Dagger to build a pipeline that is agnostic to CI/CD vendors, with GoodData as the analytics tool. I found the topic so interesting that I dived deeper into it and read several related articles.

What inspired me most was an article from dbt about the role of the analytics engineer (AE). The article describes how the AE provides clean, transformed data ready for analysis and applies software engineering best practices to the analytics code. It sounds excellent, but let’s break it down: why do you need to apply software engineering best practices to analytics in the first place? As we will see, more and more of analytics can be done with code, and if we do not want to end up in a complete mess, we need to apply software engineering best practices!

In this article, I will describe the most significant problems an AE needs to deal with, and I will show you possible solutions to them. Although I will present a possible solution, there is still plenty of room for improvement and discussion — your feedback is encouraged. Let’s dive into it!

Typical Problems To Solve for Analytics Engineers

Engineers build the best tools when they solve real problems, and as I mentioned, there are some exciting problems to solve here. I will try my best to describe how to solve them in this article, and I hope you will find it helpful.

Problem 1: Changing and Testing Models

Let’s imagine that you use a third-party API to fetch data and store it in the database. A situation arises where you are forced to switch to another API provider, which will most likely have a different data structure. Changing the API provider may be fairly trivial, but you cannot be sure whether the change will negatively affect the data used by the analytics. For example, your data model contains a column age that must be both an integer and a positive number, and you have built a metric in analytics on that column to compute the average age. If the new API provider is buggy and returns age as negative numbers, you won’t notice the problem until people start reporting weird results displayed in your analytics. You do not want to be liable for errors made downstream of your faulty analytics. If we apply software engineering best practices and test our data and data models properly, we can ensure such problems won’t slip under the radar.

Problem 2: Testing Analytics

Those of you with a solid background in software engineering know that every code change that goes to production should be properly tested. You do not want critical bugs in production and angry customers. Nowadays, there is an established best practice in software engineering: continuous integration and continuous delivery (CI/CD). Every code change is properly tested on its way to production, and if managed correctly, production should never break. The same principle applies to analytics. You do not want to deploy blatantly broken analytics and frustrate your user base. Therefore, every analytics change should be properly tested.

Problem 3: Deploying Analytics

Even if you test your data and analytics, you still have to continuously deliver new analytics to customers or internal users. You probably do not want to build new metrics and visualizations directly in the production environment; ideally, once the analytics is ready, you deploy everything to the production environment at once. Also, you usually do not want to deploy it manually but have a CI/CD pipeline (for example) do it for you — again, applying software engineering best practices.

The Solution to Analytics Engineers’ Problems

TL;DR: The solution is contained within a single pipeline that my colleague and I have built, and which you can check out in our public repository on GitLab.

First, we needed data, so we wrote a custom script that crawls the GitHub REST API (for example, you can analyze GitHub data and build a simple dashboard for your team) and loads the data into a PostgreSQL database. PostgreSQL is not essential to the solution — you can use any other database, for example, Snowflake. For data transformation and testing, we used dbt, and for the analytics, we used GoodData. As mentioned, the solution is a single pipeline, and it made perfect sense to use GitLab CI/CD.

Before we dive into the individual pipeline jobs, let me elaborate on the idea of the pipeline workflow. The whole pipeline uses the concept of pre-merge and post-merge phases. If someone creates a merge request, the pre-merge phase validates that everything works correctly. There is also a mandatory code review (again, aligned with engineering best practices). The reviewer can additionally validate the result in the staging environment in the UI, and once the review process is complete, the reviewer can complete the merge request — the new version is delivered to the production environment only if all these checks pass.

Extract and Load

We crawl and load organization-level entities (users) for each organization, and for each repository in each organization, we do the same with repository-level entities (commits, pulls). It should be easy to extend this with additional entities in the future. We collect all pages from the paginated API, and we collect only rows that are newer (by their last-updated timestamp) than a default timestamp — this is preparation for incremental loads.

Transform

If you want to transform data inside the database, dbt is a perfect tool for the task — it covers the T in ETL/ELT pipelines. You can easily create models that you later connect to analytics, and dbt helps you create, test, and document everything while applying software engineering best practices.

Analytics

GoodData lets you create consistent metrics and dashboards and access them with code through the GoodData Python SDK. This is a huge benefit when deploying analytics in a CI/CD pipeline — analytics as code — because you can manage all of the analytics programmatically.

Configuration of the Pipeline

GitLab is the DevOps platform we used; it lets you run a fast and free pipeline (for demo purposes). Using one YAML file, you can easily configure the whole pipeline.

Modern Data Pipeline — Implementation

The whole pipeline is a series of Python and SQL scripts (plus one YAML file for the pipeline definition), which is great because you can version these scripts in git. Having a single version of the end-to-end data pipeline means that all phases of the pipeline are consistent, and that you can revert to any (consistent) version. The whole pipeline has four stages. To avoid overloading you with unnecessary details, I will describe only the most important things, but if you are interested in the details, check out the repository with the source code.

Before we dive into the implementation: if you want to implement the whole pipeline with GoodData analytics, please register for a GoodData trial account, where we will build the analytics.

Extract and Load

The first stage of the pipeline extracts data from the GitHub REST API and stores it in the database in a dedicated schema, cicd_input_stage.

There are two crucial methods worth talking about. We wrote a new class, Extract, for extracting data; it contains some settings and helper methods, but the most important one is the following, where the magic happens — the method fetches all endpoints configured in the config.yaml file and returns the result to the main method, which then calls the method that writes the data to JSON.
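The exact code lives in the repository; below is a minimal, hedged sketch of what such a fetch method can look like. The class, method, and config.yaml keys (Extract, fetch_all, endpoints, GITHUB_TOKEN) are illustrative assumptions, not the exact implementation:

```python
import json
import os

import requests
import yaml


class Extract:
    """Hedged sketch: fetches every endpoint listed in config.yaml from the GitHub REST API."""

    def __init__(self, config_path: str = "config.yaml"):
        # Assumption: config.yaml maps entity names to endpoint templates,
        # e.g. users: "orgs/{org}/members"
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"token {os.environ['GITHUB_TOKEN']}"

    def fetch_all(self, organization: str) -> dict:
        """Fetch all configured endpoints for one organization, following pagination."""
        result = {}
        for name, path_template in self.config["endpoints"].items():
            url = f"https://api.github.com/{path_template.format(org=organization)}"
            rows = []
            while url:
                response = self.session.get(url, params={"per_page": 100})
                response.raise_for_status()
                rows.extend(response.json())
                # The GitHub API exposes the next page in the Link header.
                url = response.links.get("next", {}).get("url")
            result[name] = rows
        return result

    @staticmethod
    def write_json(data: dict, path: str) -> None:
        """Write the fetched data to a JSON file for the Load step."""
        with open(path, "w") as f:
            json.dump(data, f)
```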

With the data in JSON, we can quite simply load everything into the database. We wrote two new classes (Load and Postgres) that help us achieve this. For simplicity, I will show just the method that does the important job — loading the data into the database:
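Again as a hedged sketch rather than the exact method from the repository, assuming psycopg2 and a single JSON column per table (the method and column names are assumptions):

```python
import json

import psycopg2


class Postgres:
    """Hedged sketch of the database helper; the real class lives in the repository."""

    def __init__(self, dsn: str):
        self.connection = psycopg2.connect(dsn)

    def load_table(self, schema: str, table: str, rows: list) -> None:
        """Append rows into <schema>.<table>, one JSON document per row."""
        # The connection context manager commits the transaction on success.
        with self.connection, self.connection.cursor() as cursor:
            cursor.executemany(
                f"INSERT INTO {schema}.{table} (data) VALUES (%s)",
                [(json.dumps(row),) for row in rows],
            )
```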

According to the implementation, we recreate the tables only once at the beginning, and then we load each table for each organization. The structure of cicd_input_stage is as follows:
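In SQL terms, the input stage boils down to something like this (the table and column names are assumptions based on the entities described above):

```sql
-- cicd_input_stage: one table per crawled entity, each with a single JSON column
CREATE SCHEMA IF NOT EXISTS cicd_input_stage;

CREATE TABLE cicd_input_stage.users   (data jsonb);
CREATE TABLE cicd_input_stage.commits (data jsonb);
CREATE TABLE cicd_input_stage.pulls   (data jsonb);
```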

As you can see, each table contains only a single JSON column. This is not very convenient for analytics, as we need to be able to see more details about each item. Let’s transform the data in the next stage!

Transform

dbt is a perfect fit for our problem (JSON columns) in cicd_input_stage. The result of this stage will be cicd_output_stage, where the data model is ready for analytics (the full schema is at the end of this section). In dbt, you write models as SQL files; every model is a select statement. For example, the model that transforms the data from the users table in cicd_input_stage looks as follows:
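The actual model is in the repository; here is a hedged sketch of what such a model can look like on PostgreSQL, using dbt’s source() macro (the extracted JSON properties and source names are assumptions):

```sql
-- models/users.sql (illustrative sketch)
with raw_users as (

    select data
    from {{ source('cicd_input_stage', 'users') }}

)

select
    (data ->> 'id')::int as user_id,  -- INT
    data ->> 'login'     as login,    -- TEXT
    data ->> 'url'       as url       -- TEXT
from raw_users
```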

You can see that the simple script extracts the properties we are interested in from the JSON, but the final table (or view — it depends on how you configure your dbt project) will contain columns with types such as INT and TEXT, which is much better for our analytics purposes. You can write a bunch of these models that serve your analytics purposes, and you will have everything versioned in git. Another benefit is that everyone who knows dbt will immediately understand your project — one of the biggest advantages of dbt is that it brings order to an otherwise chaotic environment, including a way to generate documentation for your dbt project and render it as a website. Let’s also mention that you can easily start testing your models. You define a schema file, a YAML file that contains “rules” the schema must follow. For example, user_id must always be the primary key:
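With dbt’s built-in tests, “primary key” translates to unique plus not_null; a minimal schema file could look like this (a sketch — the real file in the repository may differ):

```yaml
version: 2

models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique      # a primary key must be unique...
          - not_null    # ...and must never be null
```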

You can see that the test is simple. If you are interested in more complicated tests, I encourage you to check the dbt documentation. The result of this stage is the schema cicd_output_stage which looks as follows:

Analytics — Staging

I will not be describing how to create analytics in GoodData here. If you want to learn more about how to work with GoodData, I encourage you to check the documentation.

Think of the staging analytics as a test environment where you can prepare all your analytics and review it with stakeholders. Once the analytics is ready, you can deploy it to the production environment (as described in the following section).

The result of the previous stage was the schema cicd_output_stage in the database. We need to connect GoodData to that schema, create a semantic model (a Logical Data Model in GoodData terms), and create visualizations and dashboards. You can do that in the UI and then manage it using the aforementioned GoodData Python SDK. The analytics folder in the repository contains examples of how to manage GoodData using code, which is also part of the pipeline. In this section, we will describe just the most exciting part — the testing of the analytics. The script below tests the visualizations (also known as insights) that have been created in GoodData. If you have visualizations in GoodData, you can run the following script as part of the pipeline:
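The exact test script lives in the repository’s analytics folder; the sketch below only illustrates the idea, assuming the gooddata-sdk package and its insight/table services (method names may differ between SDK versions, and the environment variable names are assumptions):

```python
import os
import sys

from gooddata_sdk import GoodDataSdk

# Assumed variable names; in the pipeline they are provided as GitLab CI/CD variables.
HOST = os.environ["GOODDATA_HOST"]
TOKEN = os.environ["GOODDATA_TOKEN"]
WORKSPACE_ID = os.environ["GOODDATA_WORKSPACE_STAGING"]


def test_insights() -> None:
    """Try to execute (compute) every insight in the staging workspace."""
    sdk = GoodDataSdk.create(HOST, TOKEN)
    errors = []
    for insight in sdk.insights.get_insights(WORKSPACE_ID):
        try:
            # Executing the insight as a table proves it can be computed.
            sdk.tables.for_insight(WORKSPACE_ID, insight)
        except Exception as error:  # report every failing insight
            errors.append(f"{insight.title}: {error}")
    if errors:
        sys.exit("Some insights could not be computed:\n" + "\n".join(errors))


if __name__ == "__main__":
    test_insights()
```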

The script tests whether it is possible to execute (or compute) all your visualizations. It helps verify whether you can deploy changes to the production environment, because if you know that all visualizations are executable, the analytics is more likely to be correct. With the GoodData Python SDK, you can write a bunch of these tests — for example, whether dashboards contain all required visualizations, and so on. The script also reads environment variables that are loaded from GitLab CI/CD variables in the pipeline (you can define them locally as well, if you want). Now, with the staging analytics ready, we just need to deploy everything to the production environment. Before we continue, let’s take a look at what the analytics looks like in GoodData:

Analytics — Production

Now that the staging analytics is ready, we need to deploy everything to the production environment. Thanks to the GoodData Python SDK, this takes just a few lines of code:
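Again, the real script is in the repository; a minimal sketch using the SDK’s declarative analytics model services could look like this (workspace IDs and variable names are assumptions, and method names may differ between SDK versions):

```python
import os

from gooddata_sdk import GoodDataSdk

# Assumed environment variables, provided as GitLab CI/CD variables in the pipeline.
HOST = os.environ["GOODDATA_HOST"]
TOKEN = os.environ["GOODDATA_TOKEN"]
STAGING_WORKSPACE_ID = os.environ["GOODDATA_WORKSPACE_STAGING"]
PRODUCTION_WORKSPACE_ID = os.environ["GOODDATA_WORKSPACE_PROD"]


def deploy_analytics() -> None:
    """Copy the analytics model (metrics, visualizations, dashboards) from staging to production."""
    sdk = GoodDataSdk.create(HOST, TOKEN)
    # 1. Read the whole declarative analytics model from the staging workspace.
    analytics_model = sdk.catalog_workspace_content.get_declarative_analytics_model(
        STAGING_WORKSPACE_ID
    )
    # 2. Put it into the production workspace.
    sdk.catalog_workspace_content.put_declarative_analytics_model(
        PRODUCTION_WORKSPACE_ID, analytics_model
    )


if __name__ == "__main__":
    deploy_analytics()
```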

The script does two things — first, it reads the whole analytics model (visualizations, metrics, dashboards, etc.) from the staging workspace, and second, it puts everything into the production environment. Once the analytics is in the production environment, it can be consumed by all stakeholders, and thanks to all the previous steps, you can be reasonably certain that everything is correct.

Configuration of the Pipeline

The last step is configuring the whole pipeline in GitLab CI/CD. The pipeline has four stages:

Three of them (extract_load, transform, and analytics_staging) run before the merge — for example, if you create a new merge request in the GitLab repository. The last one (analytics_prod) runs after the merge (code review included). Let’s take a look at the first stage, extract_load:
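The full .gitlab-ci.yml is in the repository; a hedged sketch of the extract_load job, consistent with the rules described below, might look like this (the image, script names, and folder layout are assumptions):

```yaml
extract_load:
  stage: extract_load
  image: python:3.10
  script:
    # Install dependencies, then run the two Extract and Load scripts.
    - pip install -r extract_load/requirements.txt
    - python extract_load/extract.py
    - python extract_load/load.py
  rules:
    # Run in merge requests targeting main, but only when the extract_load folder changed.
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main"'
      changes:
        - extract_load/**/*
    # Also run in scheduled pipelines (e.g. a periodic data refresh).
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
```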

The most important part is the script section, which installs all dependencies and then runs the two scripts described in the Extract and Load section. You can also see the rules — the stage runs in merge requests if the target branch is main, or as a scheduled stage thanks to GitLab scheduled pipelines. Also, if we do not make any changes in the extract_load folder, the stage does not run.

For simplicity, I will not describe the configuration of the whole pipeline — you can find it in our repository, along with more information in README.md. Let’s take a look at one more stage (for the staging analytics described in Analytics — Staging):
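As another hedged sketch (script names and folders are assumptions), an analytics_staging job consistent with the description below could look like this:

```yaml
analytics_staging:
  stage: analytics_staging
  image: python:3.10
  script:
    # Register the data source, deploy the analytics, then test it.
    - pip install -r analytics/requirements.txt
    - python analytics/register_data_source.py
    - python analytics/deploy_analytics.py
    - python analytics/test_insights.py
  rules:
    # Run only in merge requests targeting main that touch the analytics folder.
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main"'
      changes:
        - analytics/**/*
```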

The stage only runs scripts, but the scripts do the interesting work: it starts with the registration of the data source and ends with the tests of the analytics. Also, notice that thanks to the rules, the stage runs only if we change something in the analytics folder, where all the scripts live.

Modern Data Pipeline — Demonstration

The pipeline is now ready, and we can demonstrate how it works! I will create a simple merge request with a trivial change in every folder so that the whole pipeline runs. If you want to see the details, the merge request is available here — you can also check the details and logs of the individual jobs. Once I create the merge request, the pipeline starts running:

After a while, the stages extract_load, transform, and analytics_staging have passed:

Once we are ready to deploy everything to production, we can merge the merge request, and the last stage, analytics_prod, will run:

The result is fresh, tested analytics in the production environment:

Also, since we mentioned the scheduled pipeline before, you can see how it looks in GitLab (only the stages extract_load and transform run there):

The detail of the pipeline looks as follows:

Final thoughts

Let me elaborate on a few more problems and opportunities we discovered during the design and implementation of the pipeline that we have not yet covered. We believe that every analytics engineer will come across the same problems and opportunities sooner or later.

Pipeline Versus UI

Imagine that you deploy everything to the production environment and everyone is happy. Your stakeholders have started using the analytics, and some have made changes (for example, a new dashboard). Then you discover a bug and need to rerun the whole pipeline. The current solution will not preserve those changes in the production analytics — it will replace them with the staging analytics. One possible solution is the so-called workspace hierarchy in GoodData. Workspace hierarchies let the pipeline deliver analytics into a parent workspace while users create their own analytics in child workspaces (the parent analytics is read-only) — for more details, check the documentation. Let us know if you are interested in this solution!

Incremental Loading of Data

Every time we run the extract_load stage, we fetch everything from the GitHub REST API. This is unnecessary — we only need the new data, not the old data we already have. One solution is, for example, to keep a table in the database that stores the state of your data (the last time you fetched it), or you can use Airbyte, which already supports so-called Incremental Sync. You would also need to change the dbt models to transform data incrementally. Let us know if you are interested in this solution!

End-to-end traceability

We plan to utilize the GoodData Python SDK to return the dependencies among all analytics objects (metrics, visualizations, dashboards, etc.). We plan to ingest these dependencies into dbt exposures and generate an end-to-end dependency graph, starting from the input stage (in this article, that would be cicd_input_stage) and ending with dashboards. Thanks to this traceability, you would be able to tell, for example, whether a change to an analytics object will break something along the way before you make the change. This will further help you maintain the analytics.

Separate Staging and Production Database

Currently, everything runs on one database (we can say production database), which can be dangerous. It is good practice to have a staging database where you run most of the jobs that are part of the pipeline.

Thank you for reading this article. I hope you found it helpful! Check out the repository if you want to see the scripts and the solution. Please follow us to receive more articles like this about analytics problems, ideas, and solutions. If you have any questions, do not hesitate to contact us! 🚀
