How do we automate data delivery using classical CI/CD tools?

Ivan
4 min readJul 24, 2022

My last stories regarding how we organized the data stack in Scentbird caused several questions. One of the most popular questions is about all that moving parts of our stack (let’s call the whole batch process as meta-pipeline) in CI/CD process. Let’s take a closer look!

First of all, let’s split the meta-pipeline in parts. Our meta-pipeline is classical ELT process. It means actions are done as Extract->Load->Transform. Extract and Load (EL) processes are done in one pipeline and Transform (T) is done in another.

EL is done at our side by Argo + Singer (docker template for using both tools could be found here). Argo performs orchestration and Singer do the main job: dump all the data to AWS S3 and import it into Snowflake database after. All the magic happens thanks to the Argo’s feature — dependency graph. It means that any job could depends on result of one or several other jobs. All you need is specify list of jobs and if any of them depends on another. It could be done via dependsOn instruction:

- name: job1
templateRef:
name: workflow
template: job-template
depends: job-depends-1 && job-depends-2

Argo will build the proper workflow on its own taking into account dependencies. All possible jobs will be started in parallel automatically. You can take a look at our DAG built by argo.

That is it! I would only highlight two aspects because they are important:

  1. The whole workflow starts with Cron (we run the pipelines every night).
  2. All jobs coming the only one in the end (it call dbt-run in our case).

Before giving you an idea of what is dbt-run job is, I want to cover the second part of our meta-pipeline: the T process.

All the transformation is done by DBT tool. As DBT organises data-models as a code we try to fit our CI/CD process on data transformation in Gitlab pipeline. After several iterations we come up with such steps in our Gitlab dbt-pipeline:

  1. Run data models on stage schema DWH on any code branch.
  2. Run DBT tests with stage schema on any code branch.
  3. Run data models with production schema on master branch.
  4. Generate DBT documentation and publish it online on master branch.
  5. Run DBT tests with production schema on master branch.
  6. Clone production data to stage schema (master to master).

How that steps are running? We also shaped some rules:

  1. Every commit with change in DBT data model triggers start of step 1 and step 2. So your code will be validated on test schema in DWH(it is equal to production one).
  2. Every morning once EL-process is finished steps 3–5 are triggered to deliver fresh data-model+fresh data to production schema.
  3. Once step 5 is completed step 6 is triggered to make a new fresh clone to stage schema. It helps us to run the development process and check new code with the latest data possible.

Now we are ready to melt EL and T processes. T-process is actually triggered by Argo job also, once EL-process is finished. That job is a really simple python script using GitLab API. The only thing it is doing — finding the latest successful dbt-pipeline in Gitlab and triggers steps 3–6. The whole schema of data ci/cd process:

In a conclusion, I would say that approach is working quite well. However, there are some limitations:

  1. It would fit you in case you could clone production schema into stage schema in minutes, not hours. We have a positive experience with Snowflake (it has a clone feature that allows you to get a copy without actually create a copy). The same trick does not work with Redshift: you have physically copy the data into another cluster and scale it accordingly. It took us 2–3 hours and involves some manual operations to adjust the process. You can fix it via separate schema in the same cluster, however, it will bring additional CPU load to prodution cluster, so could be dangerous.
  2. Your DBT run does not take hours also. You should be able to run it on every commit.
  3. You have some rules of the way your data model amended. Any changes happens directly in production data could lead to failed run, because you have no chances to test it in stage. It looks quite essential in terms of software development, but it becomes tricky in terms of data.

I would also say neither data models development nor transformation of fresh data portion does not perfectly fit into Gitlab CI/CD (if you have seen the perfect one — just let us know). Gitlab was created for the classical software development cycle. Building CI/CD for data pipelines in GitLab (or any other similar tool) delivers hidden issues you have to deal with.

--

--