A simple tool set for data flows

Igor Chtivelband
Billie Engineering Crew
Oct 18, 2017
© Illustration: Caterina Carraro/Billie

ETLs and the danger of Maslow’s hammer

Since Billie is a truly data-driven company, we have to deal a lot with ETL (Extract, Transform, Load) tasks. As the name implies, these tasks arise when data needs to be taken from source A, transformed into the desired format, and finally loaded into destination B.

Our main data sources are Facebook, Google AdWords, Salesforce and, naturally, our production database. Transformations usually include date format changes, integration of lookup keys, removal of duplicates, and so on. The Data Warehouse is the typical data flow destination, since it serves as the single source of truth for data analysis.

So let’s assume you have a table T1 in database D1 which you want to be the data source for table T2 in database D2. If databases D1 and D2 are deployed on different machines, you have no choice but to build a data flow.

Basic data flow

Many developers in this situation say to themselves: “I know the Python/Ruby/PHP/Java/Klingon programming language, so how complicated can it be to write a short script that connects to database D1, reads table T1, connects to database D2 and finally pushes the data to table T2?” Of course, it is not rocket science, and such a script can be written in any modern programming language, but this is a perfect example of the “law of the instrument”.
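To make the temptation concrete, here is a minimal sketch of such a script, written in shell against two hypothetical PostgreSQL databases (the hosts, table names and the “transformation” are all made up for illustration):

    #!/usr/bin/env bash
    # naive_etl.sh: the tempting one-off "data flow" script.
    # The CSV handling here is naive (quoted commas would break it).
    set -euo pipefail

    # Extract t1 from D1, "transform" it on the fly (reformat the date in
    # column 2, drop duplicates), and load the result into t2 in D2.
    psql "postgresql://etl@d1-host/d1" -c "COPY t1 TO STDOUT WITH CSV" \
      | awk -F',' 'BEGIN { OFS="," } { gsub(/\//, "-", $2); print }' \
      | sort -u \
      | psql "postgresql://etl@d2-host/d2" -c "COPY t2 FROM STDIN WITH CSV"

And it works, right up until you need logging, retries, incremental loads or a second data source; at that point the hammer starts to show.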

The concept known as the law of the instrument, otherwise known as the law of the hammer, Maslow’s hammer (or gavel), or the golden hammer, is a cognitive bias that involves an over-reliance on a familiar tool. As Abraham Maslow said in 1966, “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” (“Law of the instrument”, Wikipedia, https://en.wikipedia.org/wiki/Law_of_the_instrument)

Talend Open Studio for Data Integration

Instead of trying to reinvent the wheel, it makes sense to use a dedicated instrument: an ETL tool. At Billie we use Talend Open Studio for Data Integration. It allows you to build a data flow mostly via drag-and-drop and to compile it into a .jar file plus a shell script, which can be run on any *nix OS (and actually on Windows as well) with a JVM.

There are many tutorials on Talend Open Studio available on the Internet, so I will not describe step by step how to create a simple data flow (a “job” in Talend’s terminology).

Building a simple Talend Data Integration job (step 1)

Once your job is ready and tested, build it as a “Standalone Job”. There are other options as well, such as an “Axis WebService (WAR)” file, which can later be deployed on a servlet container such as Apache Tomcat, but let us focus on the simple scenario.

Building a simple Talend Data Integration job (step 2)

Once you have extracted the generated ZIP file, you will see that it contains a folder with the .jar file, which encapsulates the compiled data flow elements (yes, they are compiled into Java code), and a shell script that executes the .jar file. Naturally, you can execute the .jar without the shell script, but personally I find the generated shell script quite convenient.

Example of compiled Standalone Job
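For example, assuming a job named SyncCustomers was built (the folder and script names below follow the pattern Talend generates, while the context name and parameter are invented), running it from the console could look like this:

    # Unpack the build artifact once...
    unzip SyncCustomers_0.1.zip -d /opt/etl

    # ...and run the job via the generated shell script, picking a
    # context and overriding context parameters as needed.
    cd /opt/etl/SyncCustomers
    ./SyncCustomers_run.sh \
      --context=Production \
      --context_param db_host=d2-host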

As I have already mentioned, you can manually copy the extracted folder to a remote machine and execute the job from the console, or schedule its execution using CRON. There are a few problems with this approach:

1) With “manual” deployment, there is no versioning and no out-of-the-box rollback option

2) CRON scheduling lets you execute a single job periodically, but orchestrating multiple jobs quickly becomes tricky (see the crontab sketch after this list)

3) Like any other manual operation, it is error-prone
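To make the second point concrete, here is what such a crontab tends to look like (job names, paths and times are invented):

    # Run SyncCustomers every night at 02:00: so far so good.
    0 2 * * * /opt/etl/SyncCustomers/SyncCustomers_run.sh >> /var/log/etl/sync.log 2>&1

    # "Run LoadWarehouse only after SyncCustomers has finished" becomes a
    # guess: we hope two hours is enough and that the runs never overlap.
    0 4 * * * /opt/etl/LoadWarehouse/LoadWarehouse_run.sh >> /var/log/etl/load.log 2>&1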

That is why at Billie we use GitHub for versioning and Jenkins for deploying and invoking the compiled Talend jobs.

GitHub

Once a Talend job is compiled, developers push it to GitHub. Naturally, there are different branches for different environments. Once a new version of a compiled Talend job has been pushed to GitHub, it can either be deployed to an ETL server automatically via a webhook, or pulled from GitHub by triggering a build of the Jenkins deployment project.
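In practice the workflow might look roughly like this, assuming a repository of compiled jobs with one branch per environment (the repository, branch and job names are invented):

    # Commit the freshly built job to the branch of the target environment;
    # from there a webhook or a Jenkins deployment build picks it up.
    cd etl-jobs
    git checkout staging
    cp -r ~/talend/builds/SyncCustomers_0.2/. jobs/SyncCustomers/
    git add jobs/SyncCustomers
    git commit -m "SyncCustomers 0.2: deduplicate leads before load"
    git push origin staging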

Jenkins

Jenkins is an extremely popular continuous integration and delivery application.
Jenkins helps us achieve two data-flow-related goals at Billie:
- It provides a reliable deployment process
- It makes it possible to build sophisticated data flow orchestrations (e.g. run job A after job B, don’t run job A and job B simultaneously, run job B only if job A failed, etc.); a minimal sketch of the deployment side follows below
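As a rough sketch, the shell build step of such a Jenkins deployment job can be as small as this (paths and names are again invented); the orchestration rules themselves live in the configuration of the Jenkins jobs, not in the script:

    # Jenkins "Execute shell" build step: deploy and invoke one Talend job.
    set -euo pipefail

    cd /opt/etl/etl-jobs
    git pull origin production
    ./jobs/SyncCustomers/SyncCustomers_run.sh --context=Production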

Summary

At Billie, we develop data flows using Talend Open Studio for Data Integration, store them in GitHub, and finally deploy and run them using Jenkins. This simple yet powerful toolset allows us to efficiently build and maintain numerous data pipelines.
