Choosing the right data processing option in Amazon SageMaker

Vikesh Pandey
8 min read · Mar 13, 2023


Amazon SageMaker provides various ways to process data and make it ready to feed into a machine learning training job. But do you know all the different options, when to use which one, and when to choose one over the others?

If your answer to any of those questions is no, then this blog is for you. In this blog we will cover:

  • What different options are there for processing data in SageMaker?
  • How to decide which one to use?

NOTE: This blog focuses only on the data processing options in SageMaker and does not cover other offerings from the AWS analytics space.

What different options are there for data processing in SageMaker?

Within Amazon SageMaker, you can use any of the following for processing data:

  1. Amazon SageMaker Data Wrangler
  2. SageMaker Studio Notebooks
  3. Glue Interactive Session in SageMaker Studio notebook
  4. Amazon EMR via SageMaker Studio notebooks
  5. Amazon SageMaker Processing

How to decide which one to use?

This is an extremely generic question, and which option you use depends entirely on your requirements. To make an objective assessment, we are going to look at the following aspects of each offering:

  • Developer experience
  • ML lifecycle phase applicability
  • Supported data types and dataset size ranges
  • Scope of customization
  • Cost

Your answers to the above points will decide which option is right for you. Is there any other area you think is very important but not mentioned above? Let me know in the comments.

With that said, let's start with our analysis of each offering.

1. Amazon SageMaker Data Wrangler

TLDR: Use SageMaker Data Wrangler when:

- You want to experience a Low-Code/No-Code way of doing data preparation.

- The data transformation rules provided by Data Wrangler fit your needs and you don't need much customization.

Amazon SageMaker Data Wrangler is a feature available within Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data.

Developer Experience

This is a UI-only feature; there is no API available for Data Wrangler, and you can only access it from within SageMaker Studio. You can think of it as a Low-Code/No-Code data preparation capability, something that fits a business user persona as well as a data scientist persona. You get 300+ built-in data transformation rules, which you can apply to your dataset in real time. You also get the option to export all the data transformation steps built in Data Wrangler as code.

ML Lifecycle phase applicability

This can be used during the exploratory data analysis phase, where you use the Data Wrangler UI to transform your data. You can also use it as part of your training pipelines, because Data Wrangler exports all your data transformation steps as code.

Supported data types and dataset sizes

At the time of writing, it supports only tabular data. It can take in CSV, Parquet, JSON, JSONL and ORC formats. There is no clear limit on how much data it can process, but since it is a UI offering, it is a good idea to import a sample of the dataset into the UI, do the transformations, export the transformation flow as code, and then run it on the complete dataset as an automation job. In fact, Data Wrangler already provides all the code for automation; just edit the data source location in the exported code to point to the complete dataset instead of the sample.

Scope of customization

In case no transformation rule fits your requirement, you can write your own custom code in Python (user-defined function), PySpark, pandas, or PySpark (SQL). Data Wrangler is technically a container running on a SageMaker-managed EC2 instance, and you can change the instance type as per your requirements.
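
To make the idea concrete, here is a minimal, hypothetical sketch of what the body of a custom pandas transform could look like. Inside Data Wrangler the current dataset is made available as a DataFrame named `df`; the snippet below builds a toy `df` so it runs standalone, and the column names are made up for illustration.

```python
import pandas as pd

# Stand-in for the DataFrame that Data Wrangler hands to a custom pandas transform;
# inside the Data Wrangler UI the current dataset is already available as `df`.
df = pd.DataFrame({"price": [300_000, 450_000, None], "sqft": [1500, 2000, 1800]})

# Body of the custom transform: derive a feature and drop rows where it is undefined.
df["price_per_sqft"] = df["price"] / df["sqft"]
df = df[df["price_per_sqft"].notna()]
print(df)
```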

Cost

It depends on which instance type you choose and how long you use Data Wrangler for. Check the pricing page for more details.

Pro Tip: Always remember to shut down the Data Wrangler app from Studio when not in use. Check Shut Down Data Wrangler for more details.

2. SageMaker Studio Notebooks

TLDR: Use SageMaker Studio notebooks when:

- You want to experiment with a smaller sample of the dataset before eventually running a larger processing job.

- Dataset sizes are not very large (to avoid long-running notebook cells).

- You want the flexibility to install different packages and libraries to test things out.

Developer Experience

You can use this capability by simply launching a notebook in SageMaker Studio and choosing either a pre-built SageMaker Studio image or your own custom Studio image. You can dial the instance size (vCPU, memory) on which the notebook runs up or down, allowing you to scale as needed.

ML Lifecycle phase applicability

Though this capability is primarily meant for the experimentation phase, Studio recently added the ability to run a notebook as an automated job as well.

Supported data types and dataset sizes

Since it is just a notebook, it does not matter which data type you are using. Regarding dataset size, though, you need to choose the right instance for the notebook so it has enough memory to hold and process the data.
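
As a rough sketch of that workflow, the snippet below reads only a sample of a hypothetical CSV from S3 so it fits comfortably in the notebook instance's memory; it assumes pandas and s3fs are available in the Studio image you picked.

```python
import pandas as pd

# Hypothetical S3 location; nrows keeps only a sample in memory so the notebook
# instance (e.g. the default ml.t3.medium with ~4 GiB RAM) is not overwhelmed.
sample = pd.read_csv("s3://my-bucket/raw/transactions.csv", nrows=100_000)

print(sample.shape)
print(sample.describe())
```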

Cost

It depends on which instance type you choose to run the Studio image on; remember, your notebook runs inside the image. By default, Studio runs all CPU-bound kernels on ml.t3.medium and all GPU-bound kernels on ml.g4dn.xlarge, but you can change the instance type.

Pro Tip: Always remember to shut down the instance when not in use. You can also use the SageMaker Studio auto-shutdown extension to automate shutting down idle instances.

3. Glue Interactive Sessions in SageMaker Notebooks

TLDR: Use Glue Interactive Sessions when:

- You want to experience serverless data preparation in SageMaker.

- You want to use Spark.

Developer Experience

You can use this capability by choosing an appropriate SageMaker Studio image (SparkAnalytics 1.0, SparkAnalytics 2.0) and kernel (Glue Python [PySpark and Ray] or Glue Spark) within SageMaker Studio notebooks. Just launch a new notebook, select the right image and kernel, and start running your data processing code in the notebook. The code execution happens on a serverless Spark environment managed by AWS Glue.
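
A minimal sketch of what a Glue interactive session notebook might look like, assuming the Glue PySpark kernel; the worker settings and S3 path below are placeholders, not recommendations.

```python
# Cell 1: session configuration magics (run these before the session starts).
%glue_version 4.0
%number_of_workers 5
%idle_timeout 60

# Cell 2: standard Glue setup, then ordinary Spark code against a hypothetical S3 path.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

df = spark.read.parquet("s3://my-bucket/raw/")
df.groupBy("category").count().show()
```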

ML Lifecycle phase applicability

As of now, you can use this during the experimentation phase. You can also run the whole notebook as a job to achieve some degree of automation.

Supported data types and dataset sizes

Since this is all Spark-based processing under the hood, the data types supported by Spark are all supported here as well. It can handle tens of terabytes of data. It also supports Ray as a framework, so Ray-supported data types will work as well.

Scope of Customization

Despite being a serverless offering, you can provide additional libraries and extra .py files when submitting the processing jobs, as sketched below. Check out this code for a sample reference.
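
For instance, extra dependencies can be declared with session magics along these lines (the module version and S3 path are placeholders, and your session may need different settings):

```python
# Run before the session starts: extra PyPI modules and additional .py files for the session.
%additional_python_modules scikit-learn==1.3.2
%extra_py_files s3://my-bucket/libs/helpers.py
```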

Cost

When you use AWS Glue Interactive Sessions in SageMaker Studio notebooks, you are charged separately for resource usage on AWS Glue and on Studio notebooks. Check out AWS Glue Interactive Sessions pricing for more details.

4. Amazon EMR via SageMaker Studio

TLDR: Use EMR with SageMaker Studio when:

- You have petabytes of data to process.

- You want to leverage large-scale distributed processing on Spark.

- You are already familiar with the Hadoop framework stack.

Developer Experience

You can use this capability with SageMaker Studio. Just launch a new notebook, select the appropriate image (Data Science, SparkMagic) and kernel (PySpark, Python 3), and connect to an existing EMR cluster, either within the same account or cross-account. You can submit an EMR processing job right from the notebook, and the job runs remotely on the EMR cluster.
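
A hedged sketch of that notebook flow, assuming the SparkMagic image with the PySpark kernel and the sagemaker-studio-analytics-extension magics; the cluster ID, S3 path, and auth settings are placeholders, and the exact flags may differ in your setup.

```python
# Connect the notebook's PySpark (SparkMagic) kernel to an existing EMR cluster.
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id j-XXXXXXXXXXXXX --auth-type None

# Once connected, ordinary PySpark cells execute remotely on the EMR cluster.
df = spark.read.parquet("s3://my-bucket/raw/")
df.groupBy("category").count().show()
```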

ML Lifecycle phase applicability

You can use this during the experimentation phase as well as in automation, since SageMaker Pipelines has an EMR step as well.

Supported data types and dataset sizes

Since this is all Spark-based processing under the hood, the data types supported by Spark are all supported here as well. It can easily scale to petabytes of data.

Scope of Customization

You can customize the execution environment (your EMR cluster) as per your requirements.

Cost

You pay for the notebook usage in Studio for as long as it is running. Check SageMaker Studio pricing for details. Along with that, you also pay for the EMR processing job execution. Check EMR pricing for details.

Pro Tip: Since both EMR and Glue Interactive Sessions provide Spark-based processing, when should you choose which one?

Choose EMR over Glue Interactive Sessions when:

- You have petabytes of data to process.

- You are already familiar with the Hadoop framework stack.

- You want more control over the execution environment in terms of customization.

Choose Glue Interactive Sessions over EMR when:

- You have TBs of data to process.

- You don't want to manage the Spark infrastructure yourself.

- You have less customization to perform in your execution environment.

5. Amazon SageMaker Processing

TLDR: Use SageMaker Processing when:

- You want to prepare data by writing or using data processing code and libraries.

- You want the highest degree of customization among all the data processing options in SageMaker.

- You want to leverage Spark-based distributed processing.

- You want the processing to be fully automated.

Amazon SageMaker Processing is a generic “cluster on-demand” offering from SageMaker which you can use to run any code with any container and dependencies of your choice. It also provides two managed containers to start with: a scikit-learn container and a Spark container. For every processing job, SageMaker spins up an ephemeral cluster, runs your processing job, and then shuts down the cluster as soon as the job is finished.

Developer Experience

It is basically an API, so the way to work with it is a code-only experience, for example through the SageMaker Python SDK.
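
A minimal sketch using the SageMaker Python SDK's SKLearnProcessor; the IAM role, the preprocess.py script, and the S3 paths are placeholders you would supply.

```python
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Ephemeral cluster definition: SageMaker provisions it for the job and tears it down after.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=get_execution_role(),
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# preprocess.py is your own script; the S3 paths are placeholders.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```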

ML Lifecycle phase applicability

It is most useful once you have figured out your processing code and want to run it against your raw dataset, so the best application of this feature is when you are automating your data processing or training pipelines. You can still try using it to incrementally build your processing code in the exploration phase, but the cluster warm-up time will slow you down.

Supported data types and dataset sizes

Since it is just a cluster on-demand, it does not matter what data type you have; you can process any data type with SageMaker Processing. Regarding dataset sizes, it can comfortably handle hundreds of gigabytes of data.

Scope of Customization

Pretty much everything can be customized. You can bring your own container with your own dependencies and libraries, choose the type and number of instances you need in your processing cluster, and run it on demand or as part of automation. It provides the highest degree of customization among all the options.
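
As an illustration of the bring-your-own-container path, here is a sketch using the SDK's generic Processor class with a hypothetical image from your own ECR repository and your own entrypoint:

```python
from sagemaker import get_execution_role
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

# A fully custom image from your own ECR repository, with your own entrypoint.
processor = Processor(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-processing-image:latest",
    role=get_execution_role(),
    instance_type="ml.m5.2xlarge",
    instance_count=2,
    entrypoint=["python3", "/opt/program/process.py"],
)

processor.run(
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```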

Cost

You are charged only for the duration of the processing job. The pricing varies based on the instance type and instance count you choose for your SageMaker Processing cluster. Check out SageMaker Pricing for more details.

Pro Tip: Instead of starting a job with all the data you have, start with a smaller dataset sample and extrapolate how big an instance, and how many of them, you would need. This will save you a lot of time and effort, and of course cost :)

Conclusion

To summarize, SageMaker provides various data processing options, and what you choose will depend on what kind of developer experience you are after, which phase of the ML lifecycle you are in, what data type and dataset size you are dealing with, and how much customization you require. There is also the factor of cost, which will vary based on how much compute you use with any of these offerings and for how long you use it. In fact, you may end up using more than one of these options in the same ML lifecycle. Feel free to try them out and let me know your feedback in the comments.


Vikesh Pandey

Sr. ML Specialist Solutions Architect@AWS. Opinions are my own.