
Building our Tower of Babel: Django + Singer ETL + Airflow for all audiences

Jperozo
Sep 1 · 6 min read

Author’s comment: this text was written for all audiences to share our experience. If you’re a developer, you might be interested in reading Django + Singer ETL + Airflow: blood, sweat and Docker.
However, some passages are marked with a [dev-interest] tag. They explain technical details; if those are not of interest to you, you can skip them without affecting your reading.

According to a Bible passage, all human beings once spoke a single language and tried to build a tower, known as the Tower of Babel, to reach heaven. God, seeing it as an act of rebellion, decided to divide them by creating different languages so that they could no longer understand each other and would abandon their arrogant intentions. Although many people manage to master more than one language, the web world has its own version of this story.

The Tower of Babel would have stood in ancient Babylon, in what is now Iraq, south of Baghdad

Let’s imagine for a second that you want to develop a new app for any purpose you want (to be a millionaire, to simplify a process or just for leisure). It is quite likely that you will decide to connect to systems that you like or that you use in your day-to-day life, right?

To achieve this, we usually communicate with them through their APIs (for the purposes of this text, think of an API as the language each system speaks), and these are as diverse as the tools themselves.

In any scenario, the usual thing is to create the connections with those platforms and write the code corresponding to the situation. But what happens when the requirement is something broader and more abstract than that? Well, things get complicated and require more drastic solutions.

The goal of the project we faced was an AI able to periodically extract, analyze and classify large amounts of information from multiple external sources and, at the end of the whole process, show the result in a web system.

Scene where Ultron scans different media and interprets them. No, we didn’t create a supervillain but it’s the same analogy. Image owned by Marvel Studios

Following our usual roadmap when faced with a new project involving the manipulation of a database, we used Django as a framework because of the amazing power it has for this purpose.

Django is a globally known framework for web development. If you work in the field, chances are you’ve heard its name before (even if Python isn’t your weapon of choice). If, on the other hand, you’re not a technology person, it might be interesting for you to know that the NASA site and part of Instagram are developed on this platform.

For extracting data from the different sources, our initial plan was to do as usual: connect to each of these services one by one, in the way we were used to.

But there was a problem. The number of services to integrate was far too large for the development time we had.

It was then that we started looking for alternatives, trying services that are exclusively dedicated to these connections. But this came with another classic problem: we didn’t have the budget they required. It was here that we found a third option, Singer ETL.

Singer ETL presents itself as a standard for extracting, transforming and loading data from any source of information, always using the same format. It was the common language we were looking for to build our tower of Babel.

One standard to rule them all

[dev-interest]

This “One Ring” of the APIs is divided into two elements:

Tap

It is the element that connects to the API using a configuration file (whose format may vary between taps) and packs the information into three types of messages: Schema (which defines the fields of each stream), Record (each individual record of a stream) and State (a bookmark of how far the extraction got, so that a later run can pick up from there), with a structure similar to this one:

{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}
{"type": "SCHEMA", "stream": "locations", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}
{"type": "STATE", "value": {"users": 2, "locations": 1}}
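To make the message shapes above more concrete, here is a minimal sketch of what a tap could look like in plain Python. It is not the code of any published tap: the function name `build_messages`, the hard-coded rows and the choice of the last id as the bookmark are assumptions made only for illustration.

```python
import json
import sys


def build_messages(rows, stream="users"):
    """Return the Singer messages for one stream: SCHEMA, RECORDs, STATE."""
    messages = [{
        "type": "SCHEMA",
        "stream": stream,
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "required": ["id"],
            "properties": {"id": {"type": "integer"}},
        },
    }]
    for row in rows:
        messages.append({"type": "RECORD", "stream": stream, "record": row})
    # Assumption: the bookmark is simply the last id seen, as in the example above.
    last_id = rows[-1]["id"] if rows else None
    messages.append({"type": "STATE", "value": {stream: last_id}})
    return messages


def main():
    # Hypothetical rows; a real tap would fetch these from the API.
    rows = [{"id": 1, "name": "Chris"}, {"id": 2, "name": "Mike"}]
    for message in build_messages(rows):
        # One JSON document per line on standard output.
        sys.stdout.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    main()
```

Because every message is a single line of JSON on standard output, whatever reads it does not need to know which API the data came from.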

Target

It is the element that receives the information extracted by the tap and saves it in the required format.

As with the taps, the Singer community has published several targets, such as CSV or Google Sheets, and also offers a template so you can develop your own, which was very useful for our use case.
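A custom target can be quite small. The sketch below is not the community template nor our production target; it only assumes that messages arrive as one JSON document per line, and the helper names `collect`, `to_csv` and `main` are hypothetical.

```python
import csv
import io
import json
import sys


def collect(lines):
    """Group RECORD messages by stream and remember the last STATE value."""
    records, state = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
        # SCHEMA messages are skipped here; a real target would validate against them.
    return records, state


def to_csv(rows):
    """Render a list of record dicts as CSV text, columns in first-seen order."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def main(lines=None):
    """Write one CSV file per stream, then echo the last state."""
    records, state = collect(sys.stdin if lines is None else lines)
    for stream, rows in records.items():
        with open(f"{stream}.csv", "w", newline="") as handle:
            handle.write(to_csv(rows))
    if state is not None:
        # The caller can store this value and hand it to the tap on the next run.
        print(json.dumps(state))
```

To use it at the end of a pipe, call `main()` at the bottom of the file and feed it the output of a tap.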

[end-dev-interest]


With the problem of connecting to the different sources solved simply by installing the required components, we needed an orchestra conductor able to provide the necessary credentials for each of these services and, at the same time, determine the order in which the data is extracted (especially considering that not every user would need the same sources of information or the same frequency of access).

While trying to fill this vacancy we came across two contenders who could do the job well: Apache Airflow and Luigi.

Both are systems for building workflows, created at two great technology companies: in one corner was Apache Airflow, originally developed by the Airbnb team, while in the other was Luigi, built by Spotify.

Although both did what we needed masterfully, Airflow covered aspects that Luigi didn’t.


[dev-interest]
Airflow works on the premise of Directed Acyclic Graphs (DAGs). The particularity of these graphs is that every edge has a direction and no path starts and ends at the same vertex.


In Apache Airflow, each task corresponds to a node in this graph, which makes it quite natural to define flows.
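The idea can be tried without installing Airflow at all. The sketch below uses Python’s standard `graphlib` module (Python 3.9 or later) rather than Airflow itself, and the task names are hypothetical. It shows how a graph without cycles always yields a valid execution order, while a graph with a cycle does not.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "fetch_credentials": set(),
    "run_tap_and_target": {"fetch_credentials"},
    "classify": {"run_tap_and_target"},
    "publish_results": {"classify"},
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True when the graph cannot be ordered because of a cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


if __name__ == "__main__":
    print(execution_order(dependencies))
    # Two tasks that wait for each other can never start.
    print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

A workflow engine relies on the same property: since no task can end up waiting for itself, there is always a next task that is ready to run.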

[end-dev-interest]


Don’t worry, Luigi, we still think you’re cool. Next time, you’ll probably be the winner.

The problem with Airflow is that it doesn’t get along very well with Django. At the time of writing this article, the available documentation was quite limited, and the little we were able to collect did not cover the use case our development required.

That’s when we realized that forcing them together was not going to be the solution and we tried a completely different approach: keeping them separate, creating the necessary bridges for the configuration and execution of the task flows.

We achieved this by taking advantage of a peculiarity of Airflow: it constantly scans a previously defined directory on the file system, which holds the definitions of all the task flows to be executed, either on direct order from the user or on a schedule.

So we went for it. We generated the configuration and flow files for each of the services from which we wanted to extract information, and Airflow successfully did the job we had entrusted it with, after at least 150 failed attempts.
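[dev-interest]

A bridge of this kind can be as simple as writing text files. The sketch below is not our production code: the function `write_flow`, the file naming, the `configs` subfolder and the single-task flow are all assumptions for illustration, and the generated file assumes Airflow 2.x import paths and arguments. The tap and target names are only examples of community packages.

```python
import json
import tempfile
from pathlib import Path
from string import Template

# Text of the generated DAG file. Assumes Airflow 2.x import paths and arguments.
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="$dag_id",
    start_date=datetime(2020, 1, 1),
    schedule_interval="$schedule",
    catchup=False,
) as dag:
    BashOperator(
        task_id="extract",
        bash_command="$tap --config $config_path | $target",
    )
''')


def write_flow(dags_dir, user_id, tap, target, tap_config, schedule="@daily"):
    """Write a tap config file and a DAG file into the scanned directory."""
    dags_dir = Path(dags_dir)
    config_dir = dags_dir / "configs"
    config_dir.mkdir(parents=True, exist_ok=True)

    dag_id = f"user_{user_id}_{tap.replace('-', '_')}"
    config_path = config_dir / f"{dag_id}.json"
    # The credentials end up on disk here; a real setup should protect them.
    config_path.write_text(json.dumps(tap_config))

    dag_path = dags_dir / f"{dag_id}.py"
    dag_path.write_text(DAG_TEMPLATE.substitute(
        dag_id=dag_id,
        schedule=schedule,
        tap=tap,
        target=target,
        config_path=config_path.as_posix(),
    ))
    return dag_path


if __name__ == "__main__":
    # Demo in a temporary folder instead of a real Airflow directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_flow(tmp, 7, "tap-github", "target-csv", {"access_token": "xxx"})
        print(path.read_text())
```

On the Django side, a function like this could be called whenever a user connects a new source or changes its frequency, and the next directory scan would pick up the new flow.

[end-dev-interest]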

Once the information was obtained, it was only a matter of organizing and classifying it so that the AI could interpret it in the terms that had been required and show it to the user in question. The next part of the way was more like what we are used to dealing with on a daily basis.

We had finally built our tower of Babel.

We build fantastic digital products for startups and major brands. Let’s build something big together
