AWS EMR vs. AWS Batch vs. Ray framework

Almir Mustafic
4 min read · Jul 13, 2020


Let’s say you have a lot of data to process and now you are going through a portfolio of tools and approaches to decide what your next step is. It can get very confusing and even after picking a tool/approach and committing to it, you may be doubting yourself.

This blog post is about:

  • Giving you a quick overview of a few technologies in this ecosystem
  • Describing what these technologies are meant for

Here is a quick overview:

AWS EMR:

  • Library: the Python PySpark library is needed.
  • Coding style: needs to be adjusted to fit the PySpark programming model.
  • Very good at preparing the data before you act on it.

AWS Batch:

  • Library: NO library needed.
  • Coding style: stays the same as any batch application you would normally write.
  • Not good at preparing unstructured data, but very good at taking an already prepared file, looping through it, and performing actions for each record.

Ray framework (Python Ray library):

  • Library: the Python Ray library is needed.
  • Coding style: slight adjustment, but for the most part you write your functions and classes as usual. You just decorate the functions/classes slightly differently, and when you invoke a function you indicate whether you want it to run in a “remote/distributed” way. The same code that runs locally on your machine also runs distributed (see the short sketch after this list).
  • Good at BOTH preparing the data and processing each record of the prepared file in a distributed fashion.
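To make the decorator point concrete, here is a minimal sketch of that style. The function name and the records are made up for the example; only the ray.init/@ray.remote/.remote()/ray.get() calls are the actual Ray API.

```python
import ray

ray.init()  # starts Ray locally; on a cluster it connects to the head node instead

# A plain Python function becomes a distributed task with one decorator.
@ray.remote
def process_record(record):
    # hypothetical per-record work
    return record.upper()

records = ["a", "b", "c"]

# .remote() schedules the calls across available workers and returns futures.
futures = [process_record.remote(r) for r in records]

# ray.get() blocks until the results are ready.
results = ray.get(futures)
print(results)  # ['A', 'B', 'C']
```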

Latest documentation on Ray: https://docs.ray.io/en/latest/

Screenshots of a presentation summarizing how Ray approaches distributed programming using function and class decorators while keeping regular Python code inside the functions and classes.

Let’s now look into some use cases and see where each one of these could be useful.

Diagram 1 (below) is the typical machine learning example. You have data coming from many different sources and you are ingesting that data into a data lake. Then, as part of the machine learning process, you need distributed processing components that do the following:

  • Read a lot of files/data from the data lake and create a big sample CSV file for the machine learning algorithm to use for training.
  • Read a lot of files/data from the data lake to generate and keep an up-to-date version of all features/attributes for each customer that you may need to predict/infer for.

This use case is a good example where you could use the following:

  • AWS EMR or AWS Glue (Apache Spark as the backing engine)
  • Ray framework
Diagram 1
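As a rough sketch of what that data-preparation step might look like with PySpark on EMR or Glue: read raw files from the data lake, shape them into the columns the model needs, and write out a training sample. The bucket paths and column names below are placeholders, not from the original diagram.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prepare-training-sample").getOrCreate()

# Read raw files from the data lake (path is a placeholder).
raw = spark.read.json("s3://my-data-lake/events/")

# Shape the data into the features the model expects (columns are placeholders).
sample = (
    raw.select("customer_id", "feature_a", "feature_b")
       .dropna()
)

# Write out a sample CSV for the machine learning algorithm to train on.
sample.coalesce(1).write.mode("overwrite").csv(
    "s3://my-ml-bucket/training-sample/", header=True
)
```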

Let’s consider another example. Diagram 2 (below) shows a case where you already have some CSV/JSON files prepared and now need to traverse the records within those files and perform a list of actions for each record. One solution is a single-threaded application that traverses the records and performs the actions; however, this would be extremely slow for a large number of records. Another option is writing a multi-threaded batch application in your programming language of choice, but that limits you to a single server and takes away your ability to scale horizontally. Next, you can start thinking about splitting these files/records across multiple servers and processing them in parallel, and instead of building a custom solution for this, AWS already has one:

  • AWS Batch

For example, AWS Batch allows you to develop a standard Python application and execute it within the AWS Batch environment, which takes care of all the distribution and scaling. What you need to do in AWS Batch is (a small code sketch follows the list below):

  • Create a Compute Environment (minimum, desired, and maximum vCPU settings)
  • Create a Job Queue (associated with a compute environment)
  • Create a Job Definition (a Docker container with your application, plus vCPU and memory settings)
  • Submit Jobs (pull all of the above together in the form of a job, which also lets you choose single or multiple EC2 usage)
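Once those pieces exist, submitting a job is a small API call. Here is a minimal sketch using boto3; the queue name, job definition name, and command are placeholders for resources you would have already created.

```python
import boto3

batch = boto3.client("batch")

# Submit a job against an existing queue and job definition
# (all names below are placeholders).
response = batch.submit_job(
    jobName="process-prepared-file",
    jobQueue="my-job-queue",
    jobDefinition="my-job-definition",
    containerOverrides={
        "command": ["python", "process.py", "--input", "s3://my-bucket/prepared.csv"],
    },
)

print(response["jobId"])
```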

This approach allows you to execute the batch in a very scalable way, but you have to make sure that the downstream systems can handle the scale. For example, in the diagram below, Action A is an API call to a service that you or your sister team may own. It could also be a service owned by a 3rd party. You need to make sure that it can handle the extra traffic coming from your AWS Batch application.

Diagram 2
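One simple way to keep that downstream call rate in check inside each Batch container is to cap concurrency explicitly. This is just a sketch of the idea; the endpoint, payloads, and worker count are made up, and the real limit you pick depends on what Action A’s service can absorb.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint standing in for "Action A".
ACTION_A_URL = "https://example.com/action-a"

def call_action_a(record):
    # One API call per record.
    return requests.post(ACTION_A_URL, json=record, timeout=10)

records = [{"id": i} for i in range(100)]

# Cap concurrency per container so the downstream service is not overwhelmed;
# the total load is this cap multiplied by the number of AWS Batch containers.
with ThreadPoolExecutor(max_workers=5) as pool:
    responses = list(pool.map(call_action_a, records))
```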

I hope that this article paints the big picture of this ecosystem so that you can research more and find an optimal solution for your use cases.

Thank you for reading. Keep geeking out!

Almir Mustafic


Director of Software Engineering, Solutions Architect, Car Enthusiast. Opinions are my own. (http://AlmirMustafic.com)