AWS Glue vs EMR

Leah Tarbuck
Published in The Startup
3 min read · Sep 2, 2020

Amazon Web Services provides two services capable of performing ETL: Glue and Elastic MapReduce (EMR). If they both do a similar job, why would you choose one over the other? This article details some fundamental differences between the two.

AWS Glue is a pay-as-you-go, serverless ETL tool that requires very little infrastructure setup. It automates much of the effort involved in writing, executing and monitoring ETL jobs. If your data is structured, you can take advantage of Crawlers, which can infer the schema, identify file formats and populate metadata in Glue’s Data Catalog. Based on your specified ETL criteria, Glue can automatically generate Python or Scala code for you, and it provides a nice UI for job monitoring and scheduling.
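To give a feel for how little setup is involved, here is a minimal sketch of the request parameters you might pass to boto3's Glue client (`create_crawler` and `create_job`). The role ARN, database name, S3 paths, Glue version and worker settings are all hypothetical placeholders, and the functions only build dictionaries, so nothing is sent to AWS.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Parameters for a crawler that scans an S3 path and fills the Data Catalog."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def build_job_request(name: str, role_arn: str, script_location: str, workers: int = 10) -> dict:
    """Parameters for a Spark ETL job; worker type and count are illustrative."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",  # assumed version, adjust to what your account offers
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


# Hypothetical names throughout; replace with your own resources.
crawler = build_crawler_request(
    "sales-crawler",
    "arn:aws:iam::123456789012:role/GlueDemoRole",
    "sales_db",
    "s3://example-bucket/raw/sales/",
)
job = build_job_request(
    "sales-etl",
    "arn:aws:iam::123456789012:role/GlueDemoRole",
    "s3://example-bucket/scripts/sales_etl.py",
)
print(crawler)
print(job)
```

With boto3 installed and credentials configured, these could be passed as `boto3.client("glue").create_crawler(**crawler)` and `boto3.client("glue").create_job(**job)`.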

In comparison, EMR is a big data platform designed to reduce the cost of processing and analysing huge amounts of data. It is a managed service where you configure your own cluster of EC2 instances. You have complete control over the configuration and can install Hadoop ecosystem components, which makes EMR an incredibly flexible and complex service. Its use cases are vast. Data scientists can use EMR to run machine learning jobs utilising the TensorFlow library, analysts can run SQL queries on Presto, engineers can utilise EMR’s integration with streaming applications such as Kinesis or Spark… the list goes on!

You could replace Glue with EMR but not vice versa: EMR has far more capabilities than its serverless counterpart.

Another thing to consider when choosing between these tools is cost. Glue is more expensive than EMR when comparing similar cluster configurations, probably because you’re paying for the serverless privilege and ease of setup.
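As a rough back-of-the-envelope sketch of that comparison: Glue bills per DPU-hour, while EMR bills the EC2 instance price plus an EMR surcharge per instance-hour. The default prices below are illustrative assumptions only (they vary by region and change over time), and the pairing of one Glue DPU with one similarly sized EC2 instance is also an assumption, so check the current pricing pages before relying on any numbers.

```python
def glue_cost(dpus: int, hours: float, price_per_dpu_hour: float = 0.44) -> float:
    """Estimated Glue cost: DPUs x hours x price per DPU-hour (assumed price)."""
    return dpus * hours * price_per_dpu_hour


def emr_cost(
    nodes: int,
    hours: float,
    ec2_price_per_hour: float = 0.192,     # assumed on-demand instance price
    emr_surcharge_per_hour: float = 0.048,  # assumed EMR fee on top of EC2
) -> float:
    """Estimated EMR cost: nodes x hours x (EC2 price + EMR surcharge)."""
    return nodes * hours * (ec2_price_per_hour + emr_surcharge_per_hour)


# Ten workers for one hour on each service, using the assumed prices above.
glue = glue_cost(dpus=10, hours=1.0)
emr = emr_cost(nodes=10, hours=1.0)
print(f"Glue: ${glue:.2f}  EMR: ${emr:.2f}")
```

This ignores storage, data transfer, cluster start-up time and any spot or reserved pricing on the EMR side, all of which can shift the comparison.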

Drop, for example, reported a shorter cold start time and an 80% reduction in cost after migrating its data lake solution from Glue to EMR.

There are currently only three Glue worker types available for configuration, providing a maximum of 32GB of executor memory. This restriction may become problematic if you’re writing complex joins in your business logic. If a join isn’t optimised for performance, executor memory can quickly be consumed and the job may fail. The same can occur if you have to unpack a very large zip/gzip file: these formats are not splittable, so all of the data will be held on one node (such are the workings of Spark!).
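The gzip problem can be illustrated without Spark at all. The sketch below uses only Python's standard library and made-up sample data: a plain text file can be read from any byte offset, but a gzip stream can only be decoded from its very beginning, which is why a single large gzip file cannot be divided between executors.

```python
import gzip
import zlib


def decodes_from(blob: bytes, offset: int) -> bool:
    """Return True if gzip decompression succeeds when started at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError
        return False


# Made-up sample data: 10,000 short CSV-like rows.
raw = b"".join(f"{i},value-{i}\n".encode() for i in range(10_000))
blob = gzip.compress(raw)

# Plain text: a reader can jump to the middle and pick up at the next newline.
first_full_row = raw[len(raw) // 2:].split(b"\n", 2)[1]
print("plain text, a row read from the midpoint:", first_full_row)

# Gzip: decoding works from byte 0 but not from the middle of the stream.
print("gzip readable from byte 0:", decodes_from(blob, 0))
print("gzip readable from midpoint:", decodes_from(blob, len(blob) // 2))
```

Splittable alternatives (for example uncompressed text, or columnar formats such as Parquet) let each executor take its own slice of the input instead.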

In contrast, EMR has a plethora of supported instance types to choose from (although you’d still want to optimise joins to improve performance 😃 and ideally avoid zip and gzip formats)!

One advantage of using AWS Glue is that it automatically sends logs to CloudWatch, which is very handy if your architecture uses multiple AWS services, as it gives you one centralised location for monitoring and alerting. EMR, on the other hand, sends logs to S3 by default, although you can install the CloudWatch agent via EMR’s bootstrap configuration.
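On the EMR side, both the S3 log location and the bootstrap step are part of the cluster definition. Here is a minimal sketch of the parameters you might pass to boto3's `run_job_flow`. The bucket name, bootstrap script path, release label, instance types and cluster name are hypothetical placeholders, and the script that actually installs the CloudWatch agent is assumed to exist already and is not shown.

```python
def build_emr_request(log_bucket: str, bootstrap_script: str) -> dict:
    """Build run_job_flow parameters with S3 logging and one bootstrap action."""
    return {
        "Name": "etl-cluster",          # hypothetical cluster name
        "ReleaseLabel": "emr-6.1.0",    # assumed release, pick a current one
        "LogUri": f"s3://{log_bucket}/emr-logs/",  # where EMR writes its logs
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "BootstrapActions": [
            {
                "Name": "install-cloudwatch-agent",
                # Runs on every node at start-up; the script is your own.
                "ScriptBootstrapAction": {"Path": bootstrap_script},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(
    "example-log-bucket",
    "s3://example-bucket/bootstrap/install_cloudwatch_agent.sh",
)
print(request["LogUri"])
```

With boto3 installed and credentials configured, this could be submitted with `boto3.client("emr").run_job_flow(**request)`. Compare it with Glue, where none of this configuration is needed to get logs into CloudWatch.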

In conclusion, if your workforce is new to AWS configuration and you only want to execute simple ETL, Glue might be a sensible option. However, if you wish to leverage Hadoop technologies and perform more complex transformations, EMR is the more viable solution.

Thank you for reading! 😊

Leah Tarbuck

Software Engineer at BlackCat Technology Solutions, likes cats, running and yoga :)