ETL as a Service at In Loco

Pedro Rossi · Inloco Tech Blog · Mar 12, 2020

ETL as a Service is a dream come true for anyone who builds pipelines inside an organization: extract, transform, and load data quickly and simply, without having to rely on other teams to do so. Achieving it requires three things: an execution engine to run the code, a scheduler to trigger the execution engine at a given date and time, and a storage engine to retrieve the code from.
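Put more concretely, the three pieces fit together like three small interfaces. The sketch below is purely illustrative Python (these protocols are not real code we run), but it captures the contract each component has to fulfill:

```python
# Illustrative only: the three roles an ETL-as-a-Service platform has to cover.
from typing import Protocol, Tuple


class StorageEngine(Protocol):
    """Holds the pipeline code and hands it to the execution engine."""
    def fetch(self, code_ref: str) -> bytes: ...


class ExecutionEngine(Protocol):
    """Runs a piece of pipeline code and reports whether it succeeded."""
    def run(self, code_ref: str) -> Tuple[bool, str]:  # (success, logs)
        ...


class Scheduler(Protocol):
    """Triggers the execution engine at a given date and time."""
    def schedule(self, cron_expression: str, code_ref: str) -> None: ...
```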

The execution engine

Whether you write your pipeline in Scala, Python, or any other language (at In Loco it is mostly Scala (Spark) and Python; here is a link to an explanation of our data architecture), that code must run somewhere, whether on your computer or in the cloud, and some tool must coordinate the execution. Coordinating means executing the code, applying retry logic, and knowing whether the execution failed or not. Plenty of tools can do this today: Luigi, Airflow, Kubernetes Jobs, AWS Lambda, Argo Workflows, and others.
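To make "coordinate" concrete, here is a rough sketch of handing one of these executions to Kubernetes as a Job, using the official Python client. The image, namespace, and command are placeholders; the point is that the Job spec already gives you retries (backoffLimit) and a success/failure status:

```python
# A sketch of handing an execution to Kubernetes; names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="example-etl"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry logic: re-run the pod up to twice on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="etl",
                        image="my-registry/etl:latest",
                        command=["python", "run_pipeline.py"],
                    )
                ],
            )
        ),
    ),
)

# The Job's status (and its logs) tell us whether the execution failed or not.
client.BatchV1Api().create_namespaced_job(namespace="pipelines", body=job)
```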

The scheduler engine

Well, the code is written and we know how to execute it, but this is an ETL: we are not going to page a developer or data scientist every hour so they can click a button to trigger a pipeline. People need to sleep. There must be some way to trigger the code based on the current date and time, to alert the person responsible if the execution fails, and to store the logs generated by the execution so it can be debugged. Some tools, like Airflow, handle both the execution and the scheduling. Others extend the execution engine with a scheduler, like Kubernetes CronJobs and AWS Data Pipelines. Others still focus solely on scheduling, like Rundeck, which we will talk about further below.
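At its core, a scheduler engine is a small loop: wait for the next tick of a cron expression, call the execution engine, keep the logs, and alarm someone if it failed. The toy sketch below (using the croniter library, with hypothetical execute/alert/store_logs callables) only illustrates the responsibility; the real value of Airflow, CronJobs, or Rundeck is doing this reliably:

```python
# A toy scheduler: trigger on a cron schedule, store logs, alarm on failure.
import time
from datetime import datetime

from croniter import croniter  # pip install croniter


def run_forever(cron_expr: str, execute, alert, store_logs):
    ticks = croniter(cron_expr, datetime.now())
    while True:
        next_run = ticks.get_next(datetime)
        time.sleep(max(0.0, (next_run - datetime.now()).total_seconds()))
        success, logs = execute()       # the execution engine does the real work
        store_logs(next_run, logs)      # keep logs so the run can be debugged
        if not success:
            alert(f"Pipeline failed on the {next_run} run")  # wake someone up
```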

The storage engine

Every execution needs somewhere to retrieve the code from, and on this topic we don't have many choices: it mostly comes down to Git and AWS S3. Git may seem like the obvious option, but if your code produces output to S3, you may want the code and the data to live together. S3 is also easier to deal with than Git, but keep in mind that you lose code versioning by choosing S3.
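For illustration, fetching the code looks roughly like this in either case (bucket, key, and repository URL are placeholders, and the S3 path assumes boto3):

```python
# The two storage options side by side; all names are placeholders.
import subprocess

import boto3


def fetch_from_s3(bucket: str, key: str, dest: str) -> None:
    # Code and data can live side by side in the same bucket,
    # but there is no history unless you turn on bucket versioning.
    boto3.client("s3").download_file(bucket, key, dest)


def fetch_from_git(repo_url: str, dest: str) -> None:
    # Full code versioning, at the cost of handling credentials and clones.
    subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], check=True)
```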

The era before ETL as a Service

Before we thought about how to make ETLs simpler and self-service for everyone, the company's pipelines were triggered through Airflow or AWS Data Pipelines, and both had an interface for launching Spark jobs on AWS EMR. Both solutions were enough for the product pipelines. Still, the Business Intelligence team had to rely on engineers who knew how to implement Spark jobs in order to schedule their pipelines. Neither solution was productive for them, since they had no autonomy to implement and run their own aggregations.

The first version of ETL as a Service

For the execution, we decided to go with Argo Workflows, since we were already testing it for some internal pipelines, the docs are impressive, and we already had experience working with it. The workflow definition had a gitsync sidecar to download the git repository as its first step. Because most of the jobs are written as Jupyter notebooks, we used Papermill to execute them, and if a job failed, the workflow would trigger an HTTP request to Slack to alert about the failure. To schedule the workflow execution, we used Kubernetes CronJobs.
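Stripped of the Argo plumbing, the execution step of that first version boiled down to something like the sketch below: run the notebook with Papermill and ping Slack if it fails. The paths, parameters, and webhook URL are placeholders:

```python
# Simplified sketch of the notebook execution step with a Slack alert on failure.
import papermill as pm
import requests
from papermill.exceptions import PapermillExecutionError

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

try:
    pm.execute_notebook(
        "etl/aggregation.ipynb",            # synced into the pod by the gitsync sidecar
        "output/aggregation-output.ipynb",  # executed copy, handy for debugging
        parameters={"execution_date": "2020-03-12"},
    )
except PapermillExecutionError:
    requests.post(SLACK_WEBHOOK_URL, json={"text": "ETL notebook failed :rotating_light:"})
    raise
```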

Inside the Jupyter notebooks, we used sparkmagic to communicate with Spark, and we implemented a magic on top of the SQL magic to also communicate with Presto.
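Our real magic sits on top of sparkmagic's SQL magic, but the idea is the same as the hypothetical %%presto cell magic below, which talks to Presto directly through the presto-python-client (host, catalog, and user are placeholders):

```python
# A hypothetical, minimal %%presto cell magic; not our production implementation.
import prestodb
from IPython.core.magic import Magics, cell_magic, magics_class


@magics_class
class PrestoMagics(Magics):
    @cell_magic
    def presto(self, line, cell):
        conn = prestodb.dbapi.connect(
            host="presto.internal", port=8080,
            user="etl", catalog="hive", schema="default",
        )
        cursor = conn.cursor()
        cursor.execute(cell)      # the cell body is the SQL query
        return cursor.fetchall()


def load_ipython_extension(ipython):
    ipython.register_magics(PrestoMagics)
```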

This approach got the job done, but it had some issues. It is not easy or intuitive for non-developers: data scientists usually don't know Kubernetes, and that is not a bad thing; they should be able to focus on the data rather than the infrastructure behind it. The initial workaround for this problem was to periodically scan the git repositories for notebooks tagged with information about when to execute them, but this still wasn't intuitive for non-developers.
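That workaround was essentially a periodic scan like the one sketched below, where the etl_schedule metadata key is a hypothetical stand-in for the tags we used:

```python
# Rough sketch of the workaround: find notebooks that declare when to run.
import glob

import nbformat


def find_scheduled_notebooks(repos_root: str):
    for path in glob.glob(f"{repos_root}/**/*.ipynb", recursive=True):
        nb = nbformat.read(path, as_version=4)
        schedule = nb.metadata.get("etl_schedule")  # e.g. "0 * * * *"
        if schedule:
            yield path, schedule
```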

The second (and current) version of ETL as a Service

After studying scheduling tools, we ended up choosing Rundeck, a tool that provides a spectacular UI for scheduling bash commands. It also offers logs, alarms, a jobs-per-project structure, parameterization of bash commands, and access control per project; it is easy to customize and has an open-source version. Described like this it sounds perfect, but it is only almost perfect.

We needed to add a simple bash script to make Rundeck talk to Argo, and also add the S3 log plugin to ship execution logs to S3.
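The glue itself is tiny. The real thing is a bash script, but a Python equivalent of what the Rundeck job step runs would look roughly like this (the workflow file name is a placeholder):

```python
# Submit an Argo workflow and block until it finishes, so Rundeck's own
# status, logs, and alarms reflect the execution result.
import subprocess

subprocess.run(
    ["argo", "submit", "workflows/notebook-etl.yaml", "--wait"],
    check=True,  # a non-zero exit code makes the Rundeck job fail
)
```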

After testing this architecture internally, we noticed that Argo Workflows' ability to define Directed Acyclic Graphs was not being used, so we stopped using Argo: Kubernetes Jobs are good enough for us, and Rundeck is already capable of defining a simple pipeline through its UI. This had one minor downside: we had to add the Slack notification plugin to enable Slack notifications (and the plugin had to be refactored to work with Slack Apps instead of Webhooks).
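The gist of that plugin change is small: instead of posting to an incoming webhook URL, the notification goes through the Slack Web API with a bot token. The sketch below uses placeholder token and channel values:

```python
# Posting a notification via a Slack App (bot token) instead of a webhook.
import requests


def notify(token: str, channel: str, text: str) -> None:
    requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {token}"},
        json={"channel": channel, "text": text},
    )
```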

While this model worked fine at first, more and more teams needed to use the platform (we were migrating away from AWS Data Pipelines and Airflow), so we decided to create a generic CLI to trigger Kubernetes Job executions through templates. This made Rundeck more generic for everyone.
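Condensed a lot, the idea behind the CLI is the one below: render a Job template with the user's parameters and submit it to the cluster. The template name, parameters, and namespace are placeholders, not our real ones:

```python
# Sketch of a template-driven CLI that submits a Kubernetes Job.
import argparse

import yaml
from jinja2 import Template
from kubernetes import client, config


def main():
    parser = argparse.ArgumentParser(description="Submit a templated Kubernetes Job")
    parser.add_argument("--template", required=True)       # e.g. notebook-job.yaml.j2
    parser.add_argument("--notebook-path", required=True)  # parameter filled into the template
    args = parser.parse_args()

    with open(args.template) as f:
        rendered = Template(f.read()).render(notebook_path=args.notebook_path)

    config.load_incluster_config()  # the CLI runs inside the cluster
    client.BatchV1Api().create_namespaced_job(
        namespace="pipelines", body=yaml.safe_load(rendered)
    )


if __name__ == "__main__":
    main()
```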

Nowadays, if anyone wants to schedule an ETL, we have two job templates on Rundeck: the notebook template and the Spark job template. The user only needs to set the notebook path or the jar location, the time and day to run it, and the alarm cases for the job. The best part of this architecture is that anyone can schedule jobs through the Rundeck UI at any time, without having to rely on other teams.

Further steps

While our current architecture is elegant and covers our use case, we can still make the execution engine more generic and extend it to provide better auditing and access control over executions. Luckily, the rundeck-plugins repository already has a Kubernetes plugin that is a good starting point for development. Another point of improvement is building more interfaces on top of Rundeck; one example is a Jupyter plugin so users can schedule notebooks from within it.

Are you interested?

If you are interested in building context-aware products using a powerful location technology that genuinely cares about users' privacy, then check out our opportunities.
