Leveraging Docker Override and MinIO for local data development

Dante Pawlow
Published in jampp-engineering · Jan 7, 2021

One of the biggest problems when developing Big Data applications is figuring out whether or not your components will interact nicely with each other.

Integration testing usually requires setting up a staging environment in the cloud, possibly duplicating the production environment. This is not only expensive, but also cumbersome for developing: I’d much rather have everything set up locally, use my favorite editor and not bother my DevOps team with setting stuff up, granting access, and all that jazz.

The solution seems obvious at first: build a Docker environment that you can use to run tests locally!

This, however, presents several difficulties:

  • The complexity of the setup grows non-linearly with each component you add, since each new component usually has to be integrated with many of the components already present in the environment (and in data environments, there are normally a lot).
  • It requires extensive automation to be useful; otherwise it becomes really slow to set up each time you need it, adding friction and frustration.
  • Not every component you have in the cloud is open source or easily replicated locally (think AWS’ services). This can sometimes be mitigated, albeit to the detriment of the previous points.

It was difficult, but once we overcame these hurdles, we ended up with a development environment that is fast to set up, easy to use, and boasts the usual benefits of using Docker.

Now we can test to our heart’s content!

The environment

Since this environment reflects our production setup, yours will likely look very different from ours. However, it shouldn’t be too difficult to modify, and we encourage you to adapt it to your needs.

Storage layer

At Jampp, we use AWS S3 as our main storage system. It’s the glue that holds many parts of our infrastructure together, so any development environment we use can’t exist without it.

Unfortunately, we had no way of replicating it locally. Our previous development environment used HDFS, which not only meant constantly adjusting the code to make it run locally, but also that we weren’t really performing integration tests.

The missing piece of the puzzle came when we discovered MinIO: an open-source object store that is compatible with the S3 API. This allowed us to have a plug-and-play component without even having to replace S3’s URLs in the code!

Besides, MinIO’s lightweight nature and elegant UI make it perfect for developing locally, as a bucket’s structure can be set up with a few clicks or with a simple script.
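For instance, a minimal bootstrap script can create and seed buckets through the standard AWS CLI pointed at MinIO; the endpoint, credentials and bucket names below are placeholders for a local setup, not our actual layout.

#!/usr/bin/env bash
# Create and seed placeholder buckets in a local MinIO instance.
# Endpoint, credentials and bucket names are illustrative assumptions.
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

aws --endpoint-url http://localhost:9000 s3 mb s3://warehouse
aws --endpoint-url http://localhost:9000 s3 mb s3://raw-events

# Drop in a sample file so jobs have something to read.
echo '{"event": "install"}' > sample.json
aws --endpoint-url http://localhost:9000 s3 cp sample.json s3://raw-events/dt=2021-01-01/sample.json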

Note: even though MinIO implements the S3 API, not every version of the AWS SDK works with it. We found that the AWS SDK for Java 1.11.534 jars work well with Hadoop 2.8.5 and up.
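To give an idea of what the plug-and-play part looks like on the Spark/Hadoop side, the s3a filesystem just needs to be pointed at the MinIO endpoint; the values below are assumptions for a local setup (Trino’s Hive connector has analogous hive.s3.* properties).

# spark-defaults.conf entries pointing s3a at a local MinIO (illustrative values)
spark.hadoop.fs.s3a.endpoint                 http://minio:9000
spark.hadoop.fs.s3a.access.key               minioadmin
spark.hadoop.fs.s3a.secret.key               minioadmin
spark.hadoop.fs.s3a.path.style.access        true
spark.hadoop.fs.s3a.connection.ssl.enabled   false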

Orchestration

Apache Airflow is an amazing tool for building, orchestrating, managing and dynamically generating simple or complex workflows.

Since Airflow is the centerpiece of our operation, this development environment is built from the ground up to run everything through it.
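As a rough idea, a DAG only needs to live in the mounted dags/ folder to be picked up by the environment; the snippet below is a placeholder sketch using Airflow 2-style imports, not one of our actual DAGs.

# Minimal placeholder DAG: anything it runs executes against the services
# defined in the docker-compose environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="local_smoke_test",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'running inside the local environment'",
    )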

Data processing

Historically, all of our data processing was done with Apache Hive. Nowadays, the bulk of the processing is split between Trino (formerly PrestoSQL) and Apache Spark (run through Apache Livy). However, we keep the Hive Metastore as the main interface between engines, using it to manage all table metadata in our data warehouse.

These three services receive jobs from Airflow, fetch and store files in MinIO, and use the Hive Metastore to obtain all the metadata they need.
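You can also poke at the stack directly, without going through Airflow; the sketch below assumes the Trino container is published on localhost:8080, the Hive catalog is named hive, and the table name is a placeholder.

# Querying the local Trino directly (pip install trino); names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="dev",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# The table's data lives in MinIO and its metadata in the Hive Metastore,
# mirroring S3 + the Metastore in production.
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())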

Schema migrations

We track Trino’s and Hive’s migrations and table alterations through another one of our open-source projects: Migratron.

The docker-compose

Reusability with Docker Override

We have many projects that use Airflow, spanning many repositories. Because of this, a key feature of this environment had to be reusability without code duplication, made possible by docker-compose’s override functionality.

This functionality allows us to extend or modify only a few components of the environment in each repository, while letting all repositories share the same infrastructure (much like they share production clusters). Here’s an example of the docker-compose.override.yml file in one of our projects.
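The version below is a simplified sketch, with placeholder service names, image tags and paths rather than the project’s actual file:

# docker-compose.override.yml (illustrative placeholders)
version: "3"

services:
  airflow:
    # Swap the stock image for one that also bundles this project's dependencies.
    image: my-project/airflow:latest
    volumes:
      # Mount this repository's DAGs and configuration over the defaults.
      - ${PWD}/dags:/usr/local/airflow/dags
      - ${PWD}/config/airflow.cfg:/usr/local/airflow/airflow.cfg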

As you can see, it only overrides a few customization files with the equivalents from the project’s repository, and replaces the Airflow service’s image with one that also packages the project’s dependencies (you could even add project-specific services this way).

All you have to do to run the modified environment is stand in the repository that contains the docker-compose.override.yml file and execute this command, which will bring the environment up using both the base and the override docker-compose files:

PWD=`echo $PWD` docker-compose -f /path/to/base/docker-compose.yml -f /path/to/project/docker-compose.override.yml up -d

Volume management

Docker volumes are what allow us to call this a “development environment”: by using them, we can edit our Airflow DAGs’ code, for example, and the changes will be immediately reflected inside the running containers.

Another advantage of using volumes is that their data is persisted when we shut the containers down, allowing us to resume the same test later or to share the same MinIO (with all its stored data) between different projects.
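In compose terms, that boils down to a mix of bind mounts and named volumes; the excerpt below is an illustrative sketch, with assumed service names and paths.

# Excerpt of volume definitions (illustrative)
services:
  airflow:
    volumes:
      # Bind mount: DAG edits on the host are visible in the container right away.
      - ./dags:/usr/local/airflow/dags
  minio:
    volumes:
      # Named volume: MinIO's data survives shutting the containers down and can
      # be shared between projects that use the same base environment.
      - minio-data:/data

volumes:
  minio-data: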

Alternative running modes

I may have exaggerated when I said that you should cram a highly distributed system into a single computer: most of the time you only need to test the integration between some components, not all of them at once.

This is especially useful if your computer struggles with running the whole environment, which we found to be the case in laptops with 8GB of RAM or less. This is another instance where Docker’s volumes are great: you can perform a sort of incremental testing, turning services on or off when needed, and the data will be persisted between executions.
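In practice, that just means naming the services you want when bringing the environment up; the service names below are illustrative.

# Start only the storage and query services (plus their dependencies):
docker-compose -f /path/to/base/docker-compose.yml up -d minio hive-metastore trino

# Stop the heavier services when you no longer need them; the volumes keep their data:
docker-compose -f /path/to/base/docker-compose.yml stop spark livy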

Closing remarks

This Docker environment proved to be invaluable to us, but it took quite a while to develop it. This is why we wanted to make it accessible to anyone who may find it useful.

It currently only represents the batch-processing side of our infrastructure, but we continue to build upon it and add new components.

We encourage you to use this environment in your projects and add services or any other improvements you’d like!

Acknowledgments

We want to thank Johannes Tang Kristensen and the Big Data Europe team for their excellent Docker images, from which we drew a lot of inspiration.

Ultimately, we made extensive modifications to those images in order to integrate more services, but please check the originals out, as you will find a wealth of knowledge there.
