Use Docker containers to locally develop ETL jobs compatible with Amazon EMR

Davide Romano
Published in MDS-BD · 2 min read · Nov 6, 2022

This is a short excerpt from the original article.
Link to the official blog post: Soon available…

The Mediaset data lake stores more than 100 gigabytes of raw data every day, generated by clients and their interactions with Mediaset properties and platforms: streaming platforms, news websites, blogs, and radio stations. Several types of data are ingested: click-streams, page views, video views, video player interactions, marketing campaign results, help desk tickets, social media feedback, and many others.

Capturing and storing raw data is not the final goal: people from different teams need processed data to accomplish their daily tasks. For example, data scientists need users' media sessions to train predictive models, and business analysts need insights into users' subscriptions.

The Mediaset Business Digital team built an ETL pipeline that takes raw data, transforms it with business logic, and persists it to the data lake. The solution is based on a development environment that lets us develop ETL jobs locally on subsets of data, using an IDE that provides all the features and tools needed for efficient development, and finally deploy the tested code on a cloud service that can process data at scale. A minimal sketch of such a job is shown below the figure.

Mediaset’s data ingestion and transformation pipeline (Image by the Author)
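To make the idea concrete, here is a minimal sketch of what one of these ETL jobs might look like in PySpark. The bucket names, schema, and aggregation logic are illustrative placeholders, not Mediaset's actual code.

```python
# Hypothetical, simplified ETL job: read raw events, apply business logic,
# and persist the result to the data lake. Paths and fields are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("video-views-daily-aggregate").getOrCreate()

# Read a subset of raw click-stream data (illustrative S3 path).
raw = spark.read.json("s3://example-datalake/raw/video_views/dt=2022-11-06/")

# Business logic: keep valid video views and aggregate watch time per user.
aggregated = (
    raw.filter(F.col("event_type") == "video_view")
       .groupBy("user_id")
       .agg(
           F.sum("watch_time_seconds").alias("total_watch_time"),
           F.countDistinct("video_id").alias("distinct_videos"),
       )
)

# Persist the processed data back to the data lake in a columnar format.
aggregated.write.mode("overwrite").parquet(
    "s3://example-datalake/processed/video_views_daily/dt=2022-11-06/"
)
```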

Mediaset Business Digital uses Amazon EMR to run its data transformation workloads. The data engineering team constantly deploys ETL jobs on EMR clusters which, orchestrated and scheduled by Apache Airflow, transform and aggregate raw data.
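As an illustration of this orchestration pattern, the sketch below shows an Airflow DAG that submits a spark-submit step to an existing EMR cluster using the Amazon provider operators. The cluster ID, S3 paths, and schedule are placeholders; the actual Mediaset DAGs are not shown here.

```python
# Hypothetical Airflow DAG that submits a Spark step to a running EMR cluster
# and waits for it to complete. All identifiers and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [
    {
        "Name": "video_views_daily_aggregate",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://example-bucket/jobs/video_views_daily_aggregate.py",
            ],
        },
    }
]

with DAG(
    dag_id="emr_etl_example",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id="j-EXAMPLE123456",  # ID of a running EMR cluster (placeholder)
        steps=SPARK_STEP,
    )

    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id="j-EXAMPLE123456",
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
    )

    add_step >> wait_for_step
```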

However, developing complex ETL jobs for Amazon EMR is not a simple task.

When you transform and aggregate gigabytes of data on a daily basis, you have to pay attention to what you code and how you code it; otherwise, your jobs could fail or not be optimized well enough to run in a reasonable amount of time.

For this reason, Mediaset Business Digital developed a custom Docker image that contains all the essential libraries needed for the development of an ETL job. After you build the Docker image, you can import it into your favorite IDE as a remote interpreter and run your code as if you were running it on an EMR cluster. Thanks to the AWS Glue SDK installed in the Docker image, you can get the data directly from the AWS Glue Data Catalog instead of reading data directly from S3 buckets.
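A minimal sketch of what such a local run might look like, assuming the Docker image is configured as the IDE's remote interpreter and valid AWS credentials are available. The database and table names are hypothetical, not Mediaset's actual catalog entries.

```python
# Hypothetical snippet executed against the custom Docker image used as a remote interpreter.
# Database and table names are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Read a table registered in the AWS Glue Data Catalog
# instead of pointing at raw S3 paths directly.
raw_views = glue_context.create_dynamic_frame.from_catalog(
    database="example_raw_db",
    table_name="video_views",
)

# Convert to a Spark DataFrame and continue with the usual transformation logic.
df = raw_views.toDF()
df.printSchema()
```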

To be continued…

Stay tuned, the full article will be available soon.
