CI/CD for data projects: Why manual deployments are not good enough

Pierre Borckmans
Published in datamindedbe
Sep 23, 2022 · 7 min read

Data engineering is slowly but surely adopting many of the best practices from software engineering. Two of these are fast iteration cycles and continuous integration/continuous deployment (CI/CD).

In order to efficiently and confidently deliver any kind of data project, I find it indispensable to have the following:

  • Being able to do fast development iterations: while I do enjoy getting a coffee with my colleagues, I’d rather not have to wait 10 minutes every time I make a change to my data processing code to see the results.
  • Being able to work with realistic data, without having to (unlawfully?) copy data onto your local machine.
  • Being able to eliminate the “it works on my machine” syndrome. The gap between running your data pipeline locally versus in production is way too often a blocker.
  • Being able to ship to production frequently and with peace of mind that everything is working as expected. This is what CI/CD is all about.

If you want to know more about the typical use of CI/CD in data projects, from ideation to production, I highly recommend joining our webinar on CI/CD for data projects on 12th October at 12.30 CEST.

You should watch this webinar if you are building data projects of any kind (ETL, Spark, dbt, ML, etc.) and want to:

🛠️ Reliably build, test and deploy your projects
⚡ Have a flexible yet robust way to standardize your CI/CD pipelines for data projects
🍡 Know about the best practices in the CI/CD space for data projects

At Data Minded, we are building Conveyor which, among other things, aims to make those challenges easier to tackle. In this article, we will see how Conveyor achieves that. Furthermore, we will see how Conveyor brings consistency to your CI/CD pipelines across all your data projects, even though they are all unique.

Conveyor takes your data project from ideation to production and brings consistency across all your data projects

Fast development iterations

We all know software development is an iterative process. First-time-right is rarely a thing, especially when it comes to data processing. A very common problem that we see data engineers struggling with is that they need to write their data pipeline logic without having convenient access to the data itself.

They are then torn between downloading some data on their machine, which has all kinds of security and legal implications, or accepting the slow iteration cycle: build, deploy, schedule, wait for execution, and check the logs/output. This typically involves interacting with various tools and quickly becomes frustrating, which explains why downloading data locally can be so tempting.

Conveyor solves this conundrum with its conveyor run CLI command. This single command triggers a local Docker build and validates your project code. It then runs the container on your Conveyor cluster while streaming the resulting logs to your terminal, as if your code were running locally. This is definitely one of our users’ favourite Conveyor features.

It achieves a few things:

  • It solves the data access problem: since the data pipeline runs on the cluster, it leverages the same data access mechanisms as the production pipelines would.
  • No more “it works on my machine”, since it actually already “works on your cluster”. No more surprises when you finally decide to deploy to production.
  • conveyor run validates your code, your Airflow DAG and your Docker build, giving you the confidence that you won’t break anything before opening your pull request.
Conveyor run: one simple command for a fast iteration cycle
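
To make this concrete, here is a minimal sketch of what the inner development loop looks like. The file name is hypothetical, and conveyor run may take additional flags (such as a target environment) depending on your project setup:

# A sketch of the fast iteration loop (file name is hypothetical):
$EDITOR src/transform_sales.py   # 1. change your pipeline code locally
conveyor run                     # 2. build, validate and run it on the cluster
# 3. read the logs streamed back to your terminal, adjust, and repeat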

CI/CD: Ship your changes to production with peace of mind

Once you have fast iteration cycles during development, having continuous integration and deployment pipelines is the natural next step. You feel confident that your changes will work because you just manually tested them on your cluster, from the comfort of your laptop, with conveyor run. However, you still want unit tests, integration tests, and other validations (e.g. formatting, linting, …) to be enforced before you merge your code into your git repository and deploy the changes to production. Also, you most likely have different environments to go through before landing in production (e.g. development, staging, …).

Let’s be clear here, Conveyor itself is not a CI/CD system, but it does make it very easy to build consistent CI/CD pipelines, thanks to its CLI commands.

Data projects come in many different flavours, ranging from machine learning applications to SQL-based ETL pipelines (dbt, anyone?) and complex data processing jobs (Spark, …). They rely on a wide variety of languages, frameworks, and tools, which typically makes it hard to keep CI/CD pipelines homogeneous across the teams working on those different projects.

But do not despair, all data projects also have much in common. Each project needs to be built, deployed, orchestrated, executed, and monitored. Conveyor acknowledges this and captures this workflow with a few simple and consistent CLI commands.

By leveraging containers to build and ship your code in a repeatable way, orchestrating your data jobs with Airflow, and executing them on Kubernetes, Conveyor abstracts away the things that make each data project unique. This lets you build CI/CD pipelines across teams and projects that always follow the same recipe:

Conveyor standardises your data projects’ CI/CD pipelines

conveyor build

The first step in your CI/CD pipeline is naturally to build your project with

conveyor build

This command first validates your project DAG(s) locally, in order to catch orchestration issues early and avoid invalidating your Airflow environments.

It also generates a unique build id linked to the git hash of your code, making every future deployment traceable to a specific commit, which you can always revert to if needed.

Unit testing and other validations

Your CI pipeline now makes sure your project can be built reliably. A good practice at this stage is to add unit tests and various linting and validation steps (code formatting, company or team conventions, …).
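
As a sketch, the build-and-validate stage of such a CI pipeline for a Python-based project could boil down to a handful of commands. The test and lint tools below (black, flake8, pytest) and the tests/unit folder are illustrative choices, not something Conveyor prescribes:

# Build-and-validate stage (illustrative tooling for a Python project):
conveyor build        # build the container image and validate the DAG(s)
black --check .       # code formatting
flake8 .              # linting
pytest tests/unit     # unit tests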

conveyor deploy

Once your project is built and validated, it is time to deploy it to an environment. Typically, you would have your CI/CD pipeline deploy your project to some kind of staging environment first, which can easily be achieved with

conveyor deploy --env stg

A few seconds later, your new code is deployed and ready to run in your stg environment.

Integration tests

Now that your new code is deployed on your stg environment, it might be a good time to run some integration tests. If you do not have access to proper data in this environment, you can also postpone this step to a later stage, once your project is deployed in your prod environment.
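
What these integration tests look like is entirely up to your project; Conveyor does not impose anything here. As a purely illustrative sketch, you could point your own test suite at the freshly deployed stg environment, for example through an environment variable that your tests read:

# Purely illustrative: run your own integration tests against stg.
# TARGET_ENV is a convention of this hypothetical test suite, not a Conveyor feature.
TARGET_ENV=stg pytest tests/integration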

conveyor promote

Once your new code has landed in stg and has passed the various integration tests, it is time to finally ship it to production. With Conveyor, it is as easy as this:

conveyor promote --from stg --to prod

Note that, ideally, no one should touch your production environment by hand. Human interaction stops at the staging level, and production then becomes a perfect mirror of it.

Conveyor ships with role-based access control (RBAC), which makes it very easy to prevent anyone except your CD pipeline from promoting your project to your production environment.
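
Putting the steps above together, the CI/CD pipeline for every project, whatever its flavour, reduces to the same short sequence. The sketch below uses only the commands shown in this article; how you wire it into your CI system (GitHub Actions, CircleCI, …) is up to you, and the pytest lines are purely illustrative:

# The same recipe for every data project:
conveyor build                             # build the image, validate the DAG(s)
pytest tests/unit                          # unit tests, linting, … (your choice)
conveyor deploy --env stg                  # deploy to the staging environment
TARGET_ENV=stg pytest tests/integration    # integration tests (illustrative)
conveyor promote --from stg --to prod      # ship to production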

But, could I not build all of this myself?

In the last few years, CI/CD has become a commodity. The vast majority of software companies use some kind of integration pipeline. A decade ago, many organisations would build a homemade solution for this, but nowadays there is no shortage of SaaS offerings for CI/CD (CircleCI, GitHub Actions, …). It is probably fair to say that almost no one would consider building their own CI/CD infrastructure today.

At Conveyor, we think the same should be true of data runtime and orchestration infrastructure. And yet, we see companies building their own everywhere, often creating frustration and losing sight of their goals along the way. We strongly believe that data organisations should focus on delivering their business objectives instead of building their own data platform. (For more details, you can read our “Why I won’t build my next data platform myself” blog post, or check the webinar here.)

Conveyor aims to be this runtime and orchestration infrastructure, nothing less and nothing more. You keep the freedom to choose which technologies fit best for any given data product, and yet you get a unified model for building, deploying, running, and monitoring all your projects.

Want to know more about Conveyor?

If you want to know more about Conveyor, and why we built it, you can read our introduction blog post here.

Take a look at our website or our YouTube channel for more content.

If you want to try it out, use the following link. From there you can get started for free in your own AWS account with a few easy steps.

Pierre Borckmans
I am a software engineer, focusing mostly on data platforms.