Why do we use GoCD for running our data pipelines

Azhaguselvan SP
Jan 17, 2020


Data Engineering is a hot topic in the industry these days, and clear patterns and best practices are emerging around how to architect data pipelines. In this article, I will share how we made our data pipelines more reliable and dev-friendly using GoCD. About me: I am Azhagu and I work as a Principal Engineer for the Platform Engineering team at EyeEm in Berlin.

Who are we?

EyeEm, as some of you may know, is one of the largest photography communities in the world. Around 25 million photographers have shared more than 130 million photos on our platform, through our website and mobile apps.

We also have a marketplace where photographers can sell their photos, and we want to become a platform where a million people can live off their creative work.

It’s quite a goal, but we are confidently getting there.

Background

In most companies out there, data pipelines do not get the engineering attention they deserve. It almost always starts with a grumpy engineer writing a bunch of ad-hoc SQL or shell scripts, dumping all the production data into a data warehouse and leaving the data stakeholders to scrape whatever meaning they can out of it. In the beginning, things were no different here at EyeEm. We have come a long way from a read-only MySQL slave dedicated to data requests to using Sqoop to import parts of our production databases into Hive and Redshift. At a very high level, this is what our data architecture looks like:

A high-level overview of our data Architecture

We have a lot of input sources. Since EyeEm favors a microservices-based architecture in general, we have a zoo of services:

  • Each service has its own database.
  • Services emit change events via Flume.

They all end up in Hive at first, from where we process and store them into data warehouse schema tables (DWH tables henceforth). As mentioned, we use Sqoop for copying data directly from databases, Flume as our event transport layer, and Spark jobs to process the raw data into Redshift tables. We also use Spark jobs to generate certain business reports, to pre-process data for Tableau dashboards, and for other ad-hoc batch processing (e.g. generating sitemaps for EyeEm). Our Spark jobs are written in Scala, and we use Amazon EMR to run long-running Spark clusters. We also have some smaller-scale Python/Pandas based data jobs that were coordinated from our Jenkins server. For data pipelines and scheduling, we installed Oozie and Hue in our Spark clusters and used them to define our data pipelines.

Challenges

The setup described above had a lot of problems:

  • To start with, managing any software in EMR is hard and not intuitive.
  • We can only install software when we bring up a new cluster; only a very limited set of configuration options is available to us afterwards.
  • Even if we use Terraform or Ansible to automate this, EMR support via the AWS SDK is so poor that for most updates you have to delete and recreate the whole cluster!

Oozie is an excellent tool for defining data pipelines, but it relies on XML (ugh)! We are big fans of version-controlling our data pipeline definitions, but that is hard to do with Oozie on top of EMR. The EMR interfaces (including the API) make it hard to automate anything. We ended up using Hue to configure our Oozie pipelines, but the Hue UI is not at all pleasant to work with. We could not configure individual user accounts, so a group of us had to share a single Hue account, frequently stepping on each other's toes while configuring and reconfiguring pipelines. There was zero information about who changed what and when.

On top of all this, we also had pipeline design issues. The upstream/downstream dependencies were not well configured, so if something failed in between, the whole pipeline needed one person full-time to babysit it to completion, and jobs could run out of order. All of these factors were severely limiting our ability to offer decent data reliability to our analysts and business teams. We wanted to move to a much cleaner data pipeline scheduler/executor that could offer the following:

  1. We need to be able to version control our pipelines, so we get all the goodies that come with version control.
  2. We need to be able to clearly define upstream/downstream dependencies in our pipelines. This helps us track progress, easily re-trigger a job when it fails, and not worry about jobs running out of order in general.
  3. We need it to be reliable and configurable (e.g. Slack notifications).
  4. We need to be able to run non-Spark, non-Hadoop jobs as well, so we don't have to run data pipelines in two different places.

Moving forward:

The first choice we considered was Apache Airflow. Apache Airflow is an excellent tool for the job: it was built from the ground up for running complex data pipelines, it is highly configurable, and it can run any type of job. You write Airflow DAGs in pure Python code, with the dependencies embedded in the definition itself. I even wrote a blog post about how to integrate Airflow with EMR here. But it comes with a price: Airflow with its distributed executors is a beast to set up, involving Celery, RabbitMQ, and other moving parts. Having to maintain and update one more distributed system didn't look very wise at that point.

GoCD x Kubernetes

So we decided to go with the second choice: GoCD. GoCD had been coming up now and then at EyeEm as our choice to replace Jenkins, and it comes with a lot of goodies:

  • I had already used GoCD to run data pipelines at scale before, and the recent Kubernetes adoption at EyeEm plus the availability of Helm charts for GoCD meant that we wouldn't have to spend a lot of time setting everything up.
  • Also, a GoCD server is a single-package installation with far fewer moving parts compared to Airflow. The agents themselves can be autoscaled on top of Kubernetes, and each of them can have a different elastic profile assigned (i.e. a different Kubernetes pod spec), which means we can run any kind of workload on them.


Granted, GoCD is primarily a CI/CD tool. But the most appealing aspects to us are:

  • It treats pipelines as a first-class citizen, and dependency graphs can be formed very precisely.
  • It is also highly configurable, with plugins and Docker containers as agents. Pipelines can be defined in JSON/XML/YAML and stored in a git repository, and GoCD will poll the repository to update itself (how cool!).
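
To make this concrete, here is a minimal sketch of what two chained pipelines could look like with the GoCD YAML config plugin. The pipeline names are borrowed from the daily pipeline described later in this post; the repository URL and script names are illustrative placeholders, not our actual configuration:

    format_version: 10
    pipelines:
      import-photos:                       # hypothetical pipeline name
        group: data-pipelines
        materials:
          scripts:
            git: https://git.example.com/data-pipelines.git   # placeholder repo URL
            branch: master
        stages:
          - import:
              jobs:
                import:
                  tasks:
                    - exec:
                        command: ./jobs/import_photos.sh      # placeholder script
      generate-photo-dimensions:
        group: data-pipelines
        materials:
          upstream:
            # dependency material: this pipeline runs after the upstream stage passes
            pipeline: import-photos
            stage: import
        stages:
          - generate:
              jobs:
                generate:
                  tasks:
                    - exec:
                        command: ./jobs/generate_photo_dimensions.sh   # placeholder script

Because the second pipeline declares the first one as a dependency material, GoCD builds the dependency graph for us and only runs it once the upstream stage has passed.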

Here is a small section of our dependency graph visualized in the GoCD UI:

Part of our daily pipelines visualized

It is immensely clear what job feeds into what, and it is easy to monitor the progress on any given day. It is also easier to plug new jobs into this workflow. That said, this migration wasn't without its challenges. There were many, but I would like to go into detail about a couple of important fixes and quirks we had to make:

  • GoCD is a CI/CD tool, so it expects the input (a "material" in GoCD speak) to be different for every pipeline run. I'll explain this in detail below.
  • How do we make the GoCD agent submit Spark jobs to EMR?

GoCD ‘material’ issue:

A material for a pipeline is a cause for that pipeline to run. We shouldn't confuse this with the trigger (which decides when to run the pipeline). A cause can be anything: a git repository (any VCS repository for that matter), another pipeline, a package published to a repository such as S3, or similar. GoCD considers a pipeline run to be unique only when it is triggered by a changed material. If you run the same pipeline twice with the same material (e.g. the same git SHA), it considers the second run just a re-run.

This in itself is not problematic, but when you start to chain pipelines as we do, the downstream pipeline does not trigger when the upstream pipeline is run again with the same changeset. As far as the downstream pipeline is concerned, the upstream pipeline hasn't changed, since it ran with the same changeset, so there is no need to run again. Let's look at an example and say the structure of our pipelines is the following:

Daily Trigger → Import Photos → Generate Photo Dimensions → Categorize Photos → Daily Run Complete

The Daily Trigger is supplied a fake material that never changes (a dummy git repo), plus it has a timer trigger that runs it every day at midnight. On Day 1 the pipelines run fine and everyone is happy. On Day 2, Import Photos will never run. This is because, even though Daily Trigger runs on a timer, Import Photos assumes it is a re-run, since the material of Daily Trigger has no changes! The same happens when you want to re-run a pipeline because you fixed something: the dependent downstream pipelines will never run without a changed material. We fixed this by making Daily Trigger itself commit today's timestamp into a file in the dummy git repository and push it to the remote. This way, we made sure tomorrow's run would be triggered as expected.
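
Roughly, the Daily Trigger pipeline ended up looking something like the sketch below, again in the YAML config plugin format. The cron spec, dummy repo URL and file name are placeholders, and the commit step assumes git credentials and push access are already set up on the agent:

    format_version: 10
    pipelines:
      daily-trigger:
        group: data-pipelines
        timer:
          spec: "0 0 0 * * ?"          # every day at midnight (Quartz cron)
          only_on_changes: no          # run on the timer even if the material is unchanged
        materials:
          dummy:
            git: https://git.example.com/daily-trigger-dummy.git   # placeholder dummy repo
        stages:
          - bump:
              jobs:
                bump-material:
                  tasks:
                    # commit today's timestamp so that tomorrow's run (and the
                    # downstream pipelines) see a changed material
                    - exec:
                        command: /bin/bash
                        arguments:
                          - -c
                          - |
                            date > last_run.txt
                            git add last_run.txt
                            git commit -m "daily run $(date +%F)"
                            git push origin master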

We still couldn’t make GoCD re-trigger downstream pipelines automatically when upstream is re-run using the same material (like when we want to re-run within the same day). This could be solved partially by:

  • not having individual pipelines, but rather putting everything together in one massive pipeline with different stages (defeating the purpose of having clearly defined separate pipelines and foregoing the fan-out features), or
  • using a cruder approach where one pipeline triggers the next via a script calling the GoCD API, rather than using the built-in dependencies, which would defeat our goal of having dependencies expressed in code and version-controlled (see the sketch after this list).
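
For reference, the cruder approach from the second bullet would look roughly like the fragment below: a final stage of the upstream pipeline calling GoCD's own pipeline scheduling API instead of relying on a dependency material. The server URL, credentials and downstream pipeline name are placeholders:

    # sketch of a stage that force-triggers a downstream pipeline via the GoCD API
    stages:
      - trigger-downstream:
          jobs:
            trigger:
              tasks:
                - exec:
                    command: /bin/bash
                    arguments:
                      - -c
                      - |
                        curl --fail -X POST \
                          -u "$GOCD_USER:$GOCD_PASSWORD" \
                          -H "Accept: application/vnd.go.cd.v1+json" \
                          -H "X-GoCD-Confirm: true" \
                          "https://gocd.example.com/go/api/pipelines/generate-photo-dimensions/schedule"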

So we decided to live with that problem for now and keep re-triggering the downstream pipelines manually whenever we re-run anything within a day.

Submitting Spark jobs on EMR from GoCD:

If you have read this far, you already know I am not a fan of AWS EMR. Yet, we didn't want to be bothered with running our own Spark cluster. I have already written in detail about why submitting jobs to EMR from outside is annoyingly hard and how we can overcome it here.

That guide is very manual, which is fine when you are setting things up once. But at some point we had to update the EMR and Spark versions, and then we had to do a lot of plumbing to make everything work again.

  • We wanted to avoid this manual work every time we update the EMR libraries. Now that we can use Docker containers for GoCD agents, we wanted to automate this process and maybe even version our agent images to match the EMR cluster they run against.
  • We also wanted to have a development cluster and let developers get easy access to the cluster without having to SSH into it or use any of the EMR interfaces.

With help from the amazing people on the internet, we were able to put together a basic shell script, which creates a Docker image with the GoCD agent image as the base, installs the required Spark libraries, and copies the EMR-specific ones from the cluster. Once the image is built, it can be deployed to a GoCD server and you can invoke spark-submit commands from your pipelines. It can also be used to run Spark jobs against the cluster from your laptop.
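
Putting it together, a pipeline job that submits a Spark job from one of these agents might look roughly like this. The elastic profile id, jar location and main class are placeholders, and the sketch assumes the agent image carries the Spark binaries plus the EMR cluster's Hadoop/YARN configuration so that spark-submit can reach the cluster:

    format_version: 10
    pipelines:
      categorize-photos:                  # hypothetical pipeline name
        group: data-pipelines
        materials:
          upstream:
            pipeline: generate-photo-dimensions
            stage: generate
        stages:
          - run:
              jobs:
                spark-submit:
                  # run on an agent built from the EMR-matched Docker image
                  elastic_profile_id: spark-emr-agent       # placeholder profile id
                  tasks:
                    - exec:
                        command: spark-submit
                        arguments:
                          - --master
                          - yarn
                          - --deploy-mode
                          - cluster
                          - --class
                          - com.example.CategorizePhotos     # placeholder main class
                          - s3://example-bucket/jobs/categorize-photos.jar   # placeholder jar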

Where is the code?

GoCD Agent build script

We know this script is not entirely re-usable, but it gives anyone who wants to interact with Spark on EMR a base to work with, without having to use the badly designed EMR Steps API (why do you even need an extra layer when spark-submit is already a good interface?).

I know GoCD is a very unconventional choice for running data pipelines, but as I have explained above, it can still work, and in the end it works very well for us. We were able to make our developers' lives much easier and also increase the reliability of our data pipelines. We would very much like to know what your choices are for running complex data pipelines and what you think about our setup. Please get in touch in the comments. Thanks for reading!
