Automating data pipelines with Jenkins

Jenkins is your friend

One of the cool things about being a Data Scientist at Geoblink is that we get to work on all stages of the data science workflow and touch a very diverse stack of technologies. As part of our daily tasks we gather data from a range of sources, clean it and load it into our database; run and validate machine learning models; and work closely with our DevOps/Infrastructure team to maintain our databases.

As in other start-ups, rapid growth makes it more and more important to automate routine (and frankly boring) tasks, which eat into precious development time not only for our core developers but also for us data scientists.

While automation tools have long been used by software development teams, the increasing complexity of data science cycles has made clear the need for workflow management tools that automate these processes. It is no surprise, then, that Spotify and Airbnb have both built (and, even better, open-sourced!) internal tools with that aim: Luigi and Airflow, respectively.

As part of our effort to iterate faster and deliver client requests promptly, over the last couple of weeks I’ve spent some time working with the great automation tool we use, Jenkins, and in this post I’d like to give you a taste of how we use it in Geoblink’s Data team.

Jenkins is a well-known open-source automation server that supports a wide range of features for continuous integration and continuous delivery. There are many reasons why we love Jenkins at Geoblink:

  • Advanced job scheduling
  • Support for Source Control Management
  • Quality logging
  • Dependency graphs
  • Slack notifications
  • Fully customizable with plugins
  • Thriving open-source community

And the list goes on and on, but the impact is always the same: a drastic increase in team productivity and a minimization of human errors.

In Geoblink’s Data team we have created Jenkins jobs to handle tasks like:

  • Database backups and restores
  • Deploying new map styles to different environments
  • Scheduled API polling and data transformation, followed by database updates
  • Running automated end-to-end tests of the web application
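To make the first of these concrete, here is a minimal sketch of the kind of script a nightly backup job might run. The host, database name and output directory are hypothetical; a real job would read them from Jenkins build parameters (e.g. environment variables):

```python
from datetime import datetime, timezone


def backup_command(db_host, db_name, out_dir, when=None):
    """Build a pg_dump argument list writing to a timestamped dump file."""
    when = when or datetime.now(timezone.utc)
    outfile = f"{out_dir}/{db_name}-{when:%Y%m%d-%H%M%S}.dump"
    return ["pg_dump", "--host", db_host, "--format", "custom",
            "--file", outfile, db_name]


if __name__ == "__main__":
    cmd = backup_command("db.internal", "geodata", "/backups")
    print(" ".join(cmd))
    # A Jenkins job would then execute it, e.g.:
    # subprocess.run(cmd, check=True)
```

Keeping the command construction in a pure function makes the script easy to unit-test before wiring it into a Jenkins job.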

Here is an example of a Jenkinsfile, which defines a simple Jenkins pipeline with two stages: in the first stage, Jenkins launches a job that polls an API and formats the fetched data; in the second stage, it launches a job that loads that data into the database and performs the corresponding updates. The pipeline accepts parameters at runtime; in this case the parameter `DB_HOST` specifies the host where the database server runs. In turn, each of the build stages can be handled by scripts written in different languages, such as Bash or Python.

pipeline {
    agent any
    parameters {
        string(name: 'DB_HOST',
               defaultValue: 'host',
               description: 'Database host where update is performed')
    }
    stages {
        stage('Get Last File') {
            steps {
                echo 'Stage 1: Getting last CSV file with new data...'
                build job: 'data.get_last_file'
            }
        }
        stage('Update Database') {
            steps {
                echo 'Stage 2: Updating database...'
                build job: 'data.update_database', parameters: [string(name: 'DB_HOST', value: params.DB_HOST)]
            }
        }
    }
}

As you can see, the simple Groovy syntax makes it very clear what is going on and is easy to learn. Pipelines are a powerful feature that was recently added to Jenkins. They improve on traditional chained Jenkins jobs in many ways; one of their main advantages is that they let users script the job’s behaviour entirely, instead of relying on the Jenkins UI or XML files. Moreover, they provide very rich logging and visualization of the pipeline’s progress and output, as we see in the following screenshot.
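Each job the pipeline launches is in turn just a script. As a sketch of what a job like 'data.get_last_file' could look like (the API endpoint and field names below are made up for illustration, not Geoblink’s actual ones):

```python
import csv
import json
from urllib.request import urlopen

API_URL = "https://api.example.com/latest"  # hypothetical endpoint
FIELDS = ["id", "name", "value"]            # hypothetical schema


def to_rows(records, fields=FIELDS):
    """Keep only the fields we load into the database, in a fixed order."""
    return [[rec.get(f, "") for f in fields] for rec in records]


def fetch_and_write(out_path="latest.csv"):
    """Poll the API and dump the formatted records as CSV."""
    with urlopen(API_URL) as resp:
        records = json.load(resp)
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(FIELDS)
        writer.writerows(to_rows(records))


if __name__ == "__main__":
    sample = [{"id": 1, "name": "Store A", "value": 7}, {"id": 2}]
    print(to_rows(sample))
```

Separating the pure reformatting step (`to_rows`) from the I/O keeps the transformation testable outside Jenkins.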

Finally, one of the key messages I’d like to convey is: don’t be afraid of tinkering with Jenkins! Its many advanced features might seem overwhelming, but the basics are easy to grasp and you’ll be up and running in no time.
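A good first experiment is triggering a parameterized job like the one above from a script rather than from the UI, via Jenkins’s standard buildWithParameters endpoint. A minimal sketch, with placeholder host, job name and credentials:

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_trigger_request(base_url, job, params, user, api_token):
    """Build the POST request Jenkins expects to queue a parameterized build."""
    url = f"{base_url}/job/{job}/buildWithParameters?{urlencode(params)}"
    auth = base64.b64encode(f"{user}:{api_token}".encode()).decode()
    return Request(url, method="POST",
                   headers={"Authorization": f"Basic {auth}"})


if __name__ == "__main__":
    req = build_trigger_request("https://jenkins.example.com",
                                "data.update_pipeline",
                                {"DB_HOST": "db.staging.internal"},
                                "ci-user", "api-token-here")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would actually queue the build.
```

Jenkins authenticates such requests with a user’s API token (generated from the user’s configuration page), which is why the sketch uses HTTP basic auth.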

While we might explore the wonders of Luigi or Airflow in the near future, for the time being Jenkins lets us automate jobs easily while using a technology that is already well known to our Core & Infrastructure teams.

By Jordi Giner