Maintaining +2,000 ETL flows daily

Marco Porracin
Published in Mercado Libre Tech
Mar 20, 2020 · 6 min read

Two years ago we started facing a great challenge while trying to find an ETL solution that was able to:

  • Schedule processes
  • Alert through different methods
  • Manage a system of dependencies between jobs
  • Be flexible enough that we could create new connectors as new technologies arrive on the market
  • Scale up on demand indefinitely
  • Provide maintainability and order to more than 2,000 daily ETL processes

Since we weren’t able to find a solution in the market that would fit our needs, we decided to build our own. We had a vision: to provide an ETL as a service to all users within Mercado Libre, even to non-technical users. That’s how GAIA was born.

Before digging into GAIA, let’s dive into some previous failed attempts…

A little over four years ago, the company was growing at a fast pace and so were its data volumes and needs. At that time everything was built around ODI (Oracle Data Integrator), which started to cause problems when it came to maintaining a mamushka-structured ETL flow. On top of this, we started having issues upgrading, probably due to our lack of expertise with the tool.

After doing some market research, we found two very good fits for our needs: Pentaho and Talend. We ended up going for the Enterprise version of the former as it seemed to have a gentler learning curve. Over the following year, Pentaho Enterprise grew into our main ETL tool as we kept moving old processes away from ODI. To improve usability and maintainability, we built a few generic flows where everything was parameterized, since most of our ETLs were simply extractions from MySQL and Oracle and imports into Teradata (our main Data Warehouse). All the modeling and transformations were done inside the warehouse rather than in Pentaho itself.

Adding a nice frontend to this, we ended up with a fairly complete tool that seemed perfect for our problems. Yet, as time passed, the need for newer, more specialized connectors and flows forced us outside the generic flows, and the lack of maintainability became more and more noticeable. The main issue, however, was that the company kept growing and the tool wasn't able to scale, which is a real problem when you expect to process your whole ETL overnight and a Black Friday doubles or even triples the volume of events.

This is why the search for a more definitive solution started again. This time we decided to go for a fully in-house solution, with great expectations, plenty of ideas, and a set of key features the service needed to deliver.

What is GAIA?

GAIA is our attempt to create an in-house solution for every ETL need in the company. Our solution connects a series of “steps” (APIs), each with its own fixed configuration, and runs them in sequence. Every “step” is a separate application with a single job and a specific configuration. This allows us, for instance, to connect a MySQL export step to a Teradata import step, a Teradata export step to a Tableau extract creation step, and so on.

How do steps communicate?

We opted for using an object storage system as the intermediary for everything*. Every export creates a file on that storage and each import takes a file from the same storage. On top of this, we use plain SQL execution steps to cast data types and apply the corresponding update or insert logic, leveraging the database engine.
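
Conceptually, the handoff between an export step and an import step looks something like the rough sketch below. The bucket, key, and function names are made up for illustration; per the footnote, S3 happens to be the storage we use.

```python
import boto3

# Rough sketch of the handoff between two steps through object storage.
# Bucket and key names are invented for illustration only.
s3 = boto3.client("s3")

def export_step(rows, bucket="etl-staging", key="runs/123/extract.csv"):
    """Export step: serialize the extracted rows and leave them on object storage."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

def import_step(bucket="etl-staging", key="runs/123/extract.csv"):
    """Import step: pick the same file up and hand its rows to the target database."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return [line.split(",") for line in obj["Body"].read().decode("utf-8").splitlines()]
```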

What’s most important is that GAIA knows basically nothing about how any of these steps work; it just stores each available step as a name, the URL it should post to, and the structure of the JSON configuration. Each “job” is then a list of steps with a given configuration that should run in sequence, calling each step’s API with its corresponding configuration in the body of the call.
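
A minimal sketch of that idea, with hypothetical step names, URLs, and configuration fields rather than GAIA’s real schema, could look like this:

```python
import requests

# Hypothetical registry: GAIA only needs a name and a URL per step.
STEP_REGISTRY = {
    "mysql_export": "https://steps.example.internal/mysql-export/run",
    "teradata_import": "https://steps.example.internal/teradata-import/run",
}

# A "job" is just an ordered list of steps, each with its own configuration.
job = [
    {"step": "mysql_export", "config": {"query": "SELECT * FROM orders", "file": "orders.csv"}},
    {"step": "teradata_import", "config": {"file": "orders.csv", "table": "DW.ORDERS"}},
]

def run_job(job):
    """Call each step's API in order, sending its configuration in the request body."""
    for item in job:
        url = STEP_REGISTRY[item["step"]]
        response = requests.post(url, json=item["config"], timeout=600)
        response.raise_for_status()  # a failing step stops the whole job
```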

Why is this so important? Because it allows people to develop their own steps with their own custom logic and configuration. We’ve built the majority of the generic steps like export, import, and execute on most DBs:

  • MySQL
  • Oracle
  • Teradata
  • SQL Server
  • Redshift
  • Hive
  • Presto

And then we built some integrations to refresh visualization tools and to pull information from the APIs of tools used by some BUs, like:

  • Tableau
  • Microstrategy
  • Facebook
  • Adwords
  • Google Analytics
  • Atlas (our in-house map tool based on kepler.gl which may be an interesting story for another time)

If that seems like a lot, we are now at over 40 connectors and keep developing new ones, such as PySpark code execution that creates, reuses, and deletes EMR clusters as a service, totally transparently to the end user.

Besides this logic, it’s paramount to consider some key features like scheduling, alerts, dependencies and of course logs.

Scheduling

We thought of using a simple crontab but anticipated it would not scale well. Instead, we built something a little more complex but effective: we store the last time the scheduler ran, and every five minutes we subtract that timestamp from the current time to obtain a time window. We retrieve every job that was supposed to run within that window and launch it, then store a new mark indicating the pass completed successfully. The whole process is triggered by a basic crontab.
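
A single scheduling pass might look like the sketch below; the job store, launcher, and persisted mark are in-memory stand-ins for the real ones.

```python
from datetime import datetime

# Hedged sketch of one scheduling pass; storage, job model, and launcher
# are stand-ins for GAIA's real ones.
last_run_mark = datetime(2020, 3, 20, 0, 0)      # persisted in real life, in memory here

def jobs_scheduled_between(start, end):
    """Stand-in: return every job whose schedule falls inside (start, end]."""
    return []  # GAIA would query its job store here

def launch(job):
    print(f"launching {job}")

def scheduler_pass():
    global last_run_mark
    now = datetime.utcnow()                      # the window is: last mark -> now
    for job in jobs_scheduled_between(last_run_mark, now):
        launch(job)
    last_run_mark = now                          # mark the pass as successful

# A plain crontab entry triggers the pass every five minutes, e.g.:
# */5 * * * * python scheduler.py
```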

This approach allows us to pause the schedule altogether for maintenance windows in the main data warehouse.

Alerts

For alerts, we simply implemented email and OpsGenie notifications on failure, and also enabled optional alerts on success. Even though this is a far more limited feature than other parts of the tool, we found it was enough for our needs.

Dependencies

Dependencies are served through a web service that reports whether a given job has finished that day or not. This web service consists of a key-value store and a queue that receives a message every time a key is updated. This way, users can react to these updates without having to poll periodically.
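
A toy version of that service, with the key-value store and queue replaced by in-memory stand-ins and an assumed key format, might look like this:

```python
import json
import queue

# Illustrative sketch of the dependencies service: a key-value store answering
# "has job X finished today?" plus a queue that gets a message on every update,
# so consumers can react without polling.
finished_jobs = {}            # stand-in for the real key-value storage
updates = queue.Queue()       # stand-in for the real message queue

def mark_finished(job_id, run_date):
    key = f"{job_id}:{run_date}"
    finished_jobs[key] = True
    updates.put(json.dumps({"job_id": job_id, "date": run_date}))  # notify subscribers

def has_finished(job_id, run_date):
    """What the web service answers when asked about a dependency."""
    return finished_jobs.get(f"{job_id}:{run_date}", False)
```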

Logs

Logs required more work, but the working principle is simple. As soon as a job is launched, a key is created for that execution and passed to each step as an extra parameter. The steps are then responsible for sending their logs as requests using that key as a parameter, which lets us differentiate and store the logs of each execution in a document search engine.
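
A rough sketch of that flow, with a hypothetical log ingestion endpoint and payload fields, could look as follows:

```python
import uuid
import requests

# Sketch of the logging flow: the launcher mints one key per execution, every
# step receives it as an extra parameter, and each step posts its log lines
# tagged with that key. Endpoint and payload fields are hypothetical.
def attach_execution_key(job):
    execution_key = str(uuid.uuid4())
    for step in job:
        step["config"]["execution_key"] = execution_key  # extra parameter for every step
    return execution_key

def send_log(execution_key, step_name, message,
             endpoint="https://logs.example.internal/ingest"):
    requests.post(endpoint, json={
        "execution_key": execution_key,  # lets the document store group lines per execution
        "step": step_name,
        "message": message,
    }, timeout=10)
```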

We also created a simple dashboard where users can check the status of their jobs and logs in real time. This information is fed from both the document search engine and the dependencies web service.

UI

Last but not least, we built an easy-to-use UI so anyone, even non-technical users, can take advantage of the tool’s capabilities. With a simple drag and drop and by filling in a few fields, you can create your own ETL in minutes, test it, and track its execution logs.

Next steps…

Even though GAIA has been our tool for maintaining and running over 2,000 ETLs into our Data Warehouse and Data Lake for a year and a half, helping hundreds of employees on a daily basis, we’ve found room for improvement: the ability to parallelize work, approval instances in workflows before going to production, better UIs and wizards so new users can adopt the tool more quickly, and much more. This is why we decided to start from scratch and build GAIA v2, which is being developed right now. We will share more about it once it is released.

If you liked this post, click on 👏 and start following us :)

*We decided to use S3 because we were already using other AWS services, but Gaia is platform-agnostic; it could be integrated with any storage system.
