CI/CD in Dataform with CircleCI part 2 — managing changes in the data model

Daniel Hommrich
Published in Predictly on Tech
Nov 22, 2021

In my previous article I sketched a compact CI/CD solution for Dataform using CircleCI; to keep it short, I had to leave out some details and obstacles that usually surface along the way. Today, I'd like to explore that a little further by looking at a real deal breaker: how to manage changes in the data model.

This has always been a delicate question. In most cases, there are more or less advanced SQL scripts that are supposed to do the job; some teams have discovered the power of Liquibase (which unfortunately doesn't support BigQuery at this point), and some are still scratching their heads.

One of the core objectives of CI/CD is to remove the human factor from releasing and deploying. There shouldn't be any obscure script that needs to be run and can be forgotten, or that is run against the wrong database or schema. We have all seen it happen. Instead, this should happen in a controlled, reproducible, automated fashion.

Looking at a typical DWH implemented in Dataform: which objects and potential changes do we have? Which are handled automatically and which need some attention? To get a clearer picture I put it all in a table:

As we can see, problems exist with both incremental and partitioned tables. There is a simple solution, though: the "Run with full refresh" option. This means that Dataform will force a complete recreation of all tables and views in the run, building them from scratch, and that actually solves the problems seen above completely.
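On the command line, the same option is just a flag on the run command; we'll come back to it when we set up the CircleCI job below:

```bash
# rebuild all tables and views from scratch instead of running incrementally
dataform run --full-refresh
```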

In order to use the full refresh option in CircleCI, we need to get the Dataform command line interface (CLI) running inside the container in which the commands of our workflow's job are executed. For that, we need to install it there, get some configuration in place and finally execute the run. Here's what this might look like:
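Take the following as a rough sketch rather than a finished recipe: I'm assuming the contents of the credentials file live in a CircleCI secret called DATAFORM_CREDENTIALS, that GOOGLE_PROJECT_ID holds the target project, and the job name, Docker image and hardcoded settings values are placeholders you'll want to adapt to your own setup.

```yaml
version: 2.1
jobs:
  dataform-full-refresh:
    docker:
      - image: cimg/node:16.13
    steps:
      - checkout
      - run:
          name: Run Dataform with full refresh
          command: |
            [[ "$(git log -1 --pretty=%B)" == *"[full refresh]"* ]] || exit 0
            # point dataform.json at the project of the current environment
            sed -i "s/\[your_project_id\]/${GOOGLE_PROJECT_ID}/g" dataform.json
            sudo npm install -g @dataform/cli
            echo "${DATAFORM_CREDENTIALS}" > .df-credentials.json
            mkdir -p ~/.dataform && echo '{
              "allowAnonymousAnalytics": false,
              "anonymousUserId": "circleci"
            }' > ~/.dataform/settings.json
            dataform install
            dataform run --full-refresh
```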

Let’s go through some of the most interesting — or least obvious — places in that job.

On line 13, [your_project_id] in the configuration file dataform.json is replaced with the value of a variable called GOOGLE_PROJECT_ID. This is relevant if your environments live in different projects. For example, your test environment might be on elegant-elephant-12345 while production is on precious-panda-98765. Depending on where you deploy, you'd have the current project ID in the variable, making the job more flexible. After that, we actually install the Dataform CLI using npm.
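For reference, this is roughly what that placeholder looks like in dataform.json; the schema names here are just examples:

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "[your_project_id]",
  "defaultSchema": "dataform",
  "assertionSchema": "dataform_assertions"
}
```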

Now we’ve come to the fiddly part, the one I detest: getting the configuration done. I can point you in the right direction here and introduce you to some basics, but I’m afraid you’ll have to get the details right yourself for your own setup. Trial and error…

First off is the credentials file required by Dataform. Normally, you’d run the init-creds command, but as that requires user input, the easiest way is to do this once manually and just copy the contents of the resulting .df-credentials.json file. Store that in a secret, which you then write back into a file, just as seen on line 15.
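The one-time, local part could look something like this (the warehouse argument and the secret name are just examples):

```bash
# on your own machine, once: answer the interactive prompts
dataform init-creds bigquery
# then copy the file's contents into a CircleCI secret, e.g. DATAFORM_CREDENTIALS
cat .df-credentials.json
```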

Then there’s the question of some settings to make life with the CLI easier. You’ll want to switch off anonymous analytics for these runs, to keep the CLI from prompting a question about it, and also set an anonymous user ID. Both can be hardcoded and written to a settings file, as seen on lines 16–19.

After that you’re set! Two more simple commands will install the project’s dependencies and run the whole project with the full refresh option. So, time for a short recap on what we’ve done so far:

  • we found a solution to apply changes in the data model to BigQuery automatically: running the project with the full refresh option
  • we then created a job in CircleCI that sets up an environment to run a Dataform project, including all configuration
  • and, in addition, the developer decides whether this job should run at all: on line 11, the job only proceeds if the commit message contains the string [full refresh] (depending on the size of your project or tables, you might only want to run a full refresh when it’s really necessary)

There are some limitations of this solution to keep in mind. First of all, if you remove a table from the Dataform project, it will not be dropped but will remain in your dataset. You would have to drop it manually if you want to save the storage cost. Likewise, Dataform will not accept changes to the partitioning column, nor changes to its data type. Here, a manual drop of the partitioned table followed by a dataform run does the trick as well, but careful planning in advance is advised.
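That manual step can be as small as this; the project, dataset and table names are made up:

```bash
# drop the old partitioned table, then let Dataform rebuild it from scratch
bq rm -f -t elegant-elephant-12345:analytics.events_partitioned
dataform run --full-refresh
```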

Another thing to keep in mind is the size of the data. Recreating tables of a few gigabytes every now and then will work out fine, but if you for some reason have enormous amounts of data combined with a highly volatile data model, this approach won’t work well for you. Then again, off the top of my head I don’t see any solution that fits those two conflicting requirements.

Finally, there’s the obvious but also biggest prerequisite for all of the above: a source that is persisted and thus permits repeated complete loads of the data (call it a Data Lake or similar). If you load your data incrementally and drop the source afterwards, the described solution will leave you with empty tables! Instead, we work with raw data tables that keep a complete history of the incoming data. That enables the approach above and can, in addition, serve as the source for other data products that we don’t even see coming today.

Despite the mentioned restrictions, I find this solution really handy, as it elegantly solves a lot of problems that every data engineer has dealt with, possibly several times. I hope you found some useful information in this article. If so, give it a like and watch out for more to come!
