<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Benoit de Patoul on Medium]]></title>
        <description><![CDATA[Stories by Benoit de Patoul on Medium]]></description>
        <link>https://medium.com/@benoitdepatoul?source=rss-2430cef64b90------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*bo3QRe6z_Vv3UnOQjvVO-g.jpeg</url>
            <title>Stories by Benoit de Patoul on Medium</title>
            <link>https://medium.com/@benoitdepatoul?source=rss-2430cef64b90------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 17:23:31 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@benoitdepatoul/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Trigger concurrent Amazon SageMaker jobs at scale with AWS Step Functions]]></title>
            <link>https://medium.com/@benoitdepatoul/trigger-concurrent-amazon-sagemaker-jobs-at-scale-with-aws-step-functions-1438e8d081c6?source=rss-2430cef64b90------2</link>
            <guid isPermaLink="false">https://medium.com/p/1438e8d081c6</guid>
            <dc:creator><![CDATA[Benoit de Patoul]]></dc:creator>
            <pubDate>Mon, 06 Mar 2023 09:21:22 GMT</pubDate>
            <atom:updated>2023-03-06T09:21:22.014Z</atom:updated>
            <content:encoded><![CDATA[<p>An example of triggering SageMaker jobs or ML pipelines at scale, in a controlled manner, using Step Functions.</p><p>A Machine Learning (ML) lifecycle within SageMaker (SM) typically triggers what we call jobs. These jobs let the user take advantage of fully managed SageMaker resources. Sometimes SM users need to trigger different SM jobs or pipelines concurrently and at scale, anywhere from multiple simple training jobs to multiple ML pipelines. This can rapidly become challenging due to limitations and quotas such as the maximum number of jobs, API throttling limits, the logic to retry failed jobs, etc. As a best practice, it is recommended to spread the jobs over time to avoid hitting quotas and peak times. All of this needs to be done in a controlled manner.</p><p>This blog shows an example of how this can be achieved mainly using Step Functions. You can build the example in your own AWS account by deploying the provided CloudFormation template. The architecture combines the <a href="https://aws.amazon.com/blogs/compute/controlling-concurrency-in-distributed-systems-using-aws-step-functions/">blog</a> “Controlling concurrency in distributed systems using AWS Step Functions”, the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">sample project “Train a machine learning model”</a> from Step Functions, and the <a href="https://aws.amazon.com/sqs/">Amazon Simple Queue Service</a> (SQS). The solution uses the concurrency control and retry policies of that <a href="https://aws.amazon.com/blogs/compute/controlling-concurrency-in-distributed-systems-using-aws-step-functions/">blog</a>, combined with an ML lifecycle that has its own retry policies and with an SQS queue that holds the tasks yet to be run along with their parameters. How to adapt this architecture to your own use case is discussed later in the blog.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*gNmNOBpzPWnXupJPF-5oIQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@gerandeklerk?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Geran de Klerk</a> on <a href="https://unsplash.com/photos/uYkdJEYNwSM?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><h3>How does the architecture work?</h3><p>The architecture uses a distributed <a href="https://en.wikipedia.org/wiki/Semaphore_(programming)">semaphore</a> to spread and limit the number of concurrent SageMaker workloads, and an SQS queue that contains all the remaining tasks to be performed. Step Functions retrieves a task from the SQS queue via Lambda and runs the workload whenever there is a free slot within the chosen concurrency limit. A task could be anything from generating or preparing a dataset to running a whole ML lifecycle. In this blog the task is the ML lifecycle <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a>. This architecture has been tested by running 300 tasks with a concurrency of 20. All 300 tasks completed, the maximum number of concurrently running tasks was respected, and failed tasks were retried.</p>
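<p>To make the semaphore concrete: the referenced blog keeps a lock item in a DynamoDB table and only admits a new workload through a conditional update. A minimal sketch of acquiring a slot, with illustrative table, key and attribute names (not the exact ones created by the template):</p><pre>import boto3

dynamodb = boto3.client("dynamodb")

# Illustrative names, not the exact resources created by the template.
TABLE_NAME = "CC-LockTable"
LOCK_NAME = "MySemaphore"
MAX_CONCURRENCY = 20

def try_acquire_slot():
    """Atomically take one semaphore slot if fewer than MAX_CONCURRENCY are held."""
    try:
        dynamodb.update_item(
            TableName=TABLE_NAME,
            Key={"LockName": {"S": LOCK_NAME}},
            # Succeed only while there is a free slot.
            ConditionExpression="currentlockcount &lt; :limit",
            UpdateExpression="ADD currentlockcount :one",
            ExpressionAttributeValues={
                ":limit": {"N": str(MAX_CONCURRENCY)},
                ":one": {"N": "1"},
            },
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False  # no free slot; the state machine waits and retries
</pre><p>Releasing a slot is the mirror-image update that subtracts one; the state machines from the referenced blog handle both sides, including waiting and retrying when no slot is free.</p>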
<p>The architecture is made of three state machines from the <a href="https://aws.amazon.com/blogs/compute/controlling-concurrency-in-distributed-systems-using-aws-step-functions/">blog</a>. To understand each of those state machines in detail, please read the blog linked above. The ones we will focus on in this blog are the <em>CC-Test-Run100Executions</em> and <em>CC-ConcurrencyControlledStateMachine</em> state machines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dI_OwtxfJS-sMCmozfwR3A.jpeg" /></figure><p>The <em>CC-Test-Run100Executions</em> state machine is the one used to demonstrate the concurrency control. It has been adapted to retrieve the number of tasks in the SQS queue. The <em>CC-ConcurrencyControlledStateMachine</em> is where you define the maximum number of tasks that can run concurrently; in this example, that is the maximum number of ML lifecycles running at the same time.</p><p>The ML lifecycle task (<a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a>) is defined within the <em>CC-ConcurrencyControlledStateMachine</em>. It generates a dataset, trains a model, saves the model and then applies a batch transform to the data. I have added <a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html">retry policies</a> in case a failure happens in the training job, while saving the model, or in the batch transform. The <em>Generate dataset</em> step contains extra code within its Lambda function to pick a task from the SQS queue. I have also added an extra step at the end of the task that deletes the SQS message once the ML lifecycle completes successfully.</p><h3>How to use the CloudFormation template?</h3><p>You can download the CloudFormation template from this <a href="https://gist.github.com/bpatoul/c0a0849b36d3410c22d69ba291eba85a">link</a>. Log in to your AWS account, go to the CloudFormation service, create a new stack and upload the template.</p><p>You will see three parameters. <em>ParameterInstancePrefix</em> and <em>ParameterLockName</em> relate to the DynamoDB table: the first is the table name and the second is the partition key name. You do not need to change either of them. Lastly, the <em>MaxConcurrency</em> parameter is the maximum number of tasks running concurrently; in this example, the maximum number of ML lifecycles running at the same time. You can increase this parameter, but keep your quotas in mind.</p><p>Once all the resources from the template have been created, the first step is to generate dummy messages in the SQS queue called <em>SQSQueueJobs</em>. Go to Lambda and run/test the function named <em>LambdaMessagesGenerator</em>. This function generates 100 messages in the SQS queue (you can manually update the variable <em>number</em> to reduce the number of messages). Since this is just an example, the messages do not contain any parameters to be used by the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a> task; the number of messages simply lets the logic know how many tasks are planned to be run. In a real-life scenario, the SQS messages would include the parameters the task needs to succeed, such as the number of instances, the instance type, the location of the data, etc.</p>
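<p>As an illustration, a minimal version of such a generator Lambda could look like the sketch below. The queue URL and message body are assumptions for the sketch; the deployed function may wire these up differently.</p><pre>import json
import boto3

sqs = boto3.client("sqs")

# Placeholder URL; in practice this would point to the SQSQueueJobs queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/SQSQueueJobs"

def lambda_handler(event, context):
    number = 100  # lower this to generate fewer messages
    for i in range(number):
        # Dummy body; a real task message would carry the job parameters.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"task_id": i}),
        )
    return {"messages_sent": number}
</pre>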
<p>Now that we have sent messages to the SQS queue, go to Step Functions and start the execution of the <em>CC-Test-Run100Executions</em> state machine (<strong>this will incur costs!</strong>). This triggers the whole logic:</p><ol><li>Query the SQS queue to know the number of tasks to perform.</li><li>Trigger the state machine that contains the task definition, respecting the <em>MaxConcurrency</em> parameter. In the example the task is <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a>, defined within the <em>CC-ConcurrencyControlledStateMachine</em> state machine.</li><li>Each task queries the SQS queue for a message, runs the steps and, once successful, deletes the message from the queue.</li></ol><p>Once you have started the execution, you can see in the SageMaker console how many jobs are triggered. It will never be more than the number stated in the <em>MaxConcurrency</em> parameter of the CloudFormation template. This automated process continues until all the messages (100) have been deleted: the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a> task is triggered 100 times and, each time it succeeds, it deletes its message from the SQS queue, which drops the number of messages until it reaches 0.</p><h3>How to adapt this architecture to your use case?</h3><p>The architecture is an example that you can adapt to your own use case. The first thing you want to do is define the task itself, which lives within the <em>CC-ConcurrencyControlledStateMachine</em> state machine. In the example of this blog, the task is <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a>. You should include reading and deleting the SQS message within the task definition to keep it synchronized with the SQS queue. In <a href="https://docs.aws.amazon.com/step-functions/latest/dg/sample-train-model.html">“Train a machine learning model”</a>, I modified the Lambda code of the <em>Generate dataset</em> step to read the message from the SQS queue, and I added an extra Lambda function (deleteSQS) at the end to delete the message that was read. A sketch of each is shown below.</p>
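<p>In essence, the two pieces do something like the following minimal sketch. The queue URL and message fields are illustrative assumptions; the exact code ships with the CloudFormation template.</p><pre>import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/SQSQueueJobs"  # placeholder

def read_task(event, context):
    """Extra code in the Generate dataset Lambda: pick one task from the queue."""
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    message = response["Messages"][0]  # raises if the queue is already empty
    # The receipt handle is what the final step needs to delete the message,
    # so it must travel through the state machine's input/output.
    return {
        "parameters": message["Body"],
        "receipt_handle": message["ReceiptHandle"],
    }

def delete_task(event, context):
    """deleteSQS Lambda: remove the message once the ML lifecycle succeeded."""
    sqs.delete_message(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=event["receipt_handle"],
    )
    return {"deleted": True}
</pre><p>One detail to keep in mind with this pattern: the queue’s visibility timeout must be longer than the whole ML lifecycle, otherwise the message becomes visible again mid-run and could be picked up twice.</p>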
<p>The second thing you will need to adapt or create is the SQS queue. The SQS queue is there to pass the number of tasks and the task parameters for your use case. For example, say your task is a simple SM training job and you want to trigger 25 SM training jobs, each with different training data located in S3. The SQS queue would then contain 25 messages, each carrying the parameters the training job needs to succeed, such as the data location, job name, instance type, artifacts location, script location, etc.</p><p>The third thing you might want to add or modify in your state machine logic is the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html">retry policies</a>. This is not mandatory but recommended. In the case of SM it is very useful, since it can absorb the throttling you hit when triggering multiple SM jobs at the same time, as well as any other error that could happen.</p><h3>Conclusion</h3><p>We have seen in this blog an example of how to trigger concurrent Amazon SageMaker jobs at scale, and in a controlled manner, with AWS Step Functions. While it brings the advantage of control over the workflow, the downside to be aware of is that the Step Functions inline map state is limited to 40 parallel iterations at a time. In this architecture, that translates into a maximum of 40 tasks running at the same time. You can work around this by adapting the state machine to use the new <a href="https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/">Distributed Map</a>, which supports up to 10,000 parallel executions!</p><p>I have explained how to adapt this example to your own use case by:</p><ol><li>Defining the task itself within the <em>CC-ConcurrencyControlledStateMachine</em> state machine.</li><li>Adding SQS commands to collect the message with its parameters, and making sure the message is deleted once the task has completed successfully.</li><li>Adding retry policies within your task definition.</li></ol><p>While in this example Step Functions runs the ML pipeline itself, we could instead have Step Functions call an existing <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html">SageMaker Pipeline</a> and then watch the ongoing pipeline run within SM Studio. You would then have Step Functions controlling the workload concurrency and SM Pipelines running and controlling the ML workload. This is especially useful if you mainly develop and work in SM Studio. SM Pipelines also have the advantage of being better adapted for <a href="https://aws.amazon.com/blogs/machine-learning/mlops-foundation-roadmap-for-enterprises-with-amazon-sagemaker/">MLOps</a>. As a future step, I plan to update this blog and add an example of SM Pipelines in the architecture.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Use Amazon Athena in a processing job with Amazon SageMaker]]></title>
            <link>https://medium.com/@benoitdepatoul/use-amazon-athena-in-a-processing-job-with-amazon-sagemaker-d43272d69a78?source=rss-2430cef64b90------2</link>
            <guid isPermaLink="false">https://medium.com/p/d43272d69a78</guid>
            <category><![CDATA[athena]]></category>
            <category><![CDATA[processing]]></category>
            <category><![CDATA[sagemaker]]></category>
            <dc:creator><![CDATA[Benoit de Patoul]]></dc:creator>
            <pubDate>Thu, 03 Mar 2022 11:09:30 GMT</pubDate>
            <atom:updated>2022-03-07T08:31:32.170Z</atom:updated>
            <content:encoded><![CDATA[<p>A guide on how to configure an Amazon SageMaker processing job in conjunction with Amazon Athena.</p><p><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html">Amazon SageMaker Processing</a> lets you analyze data and evaluate machine learning models on Amazon SageMaker with fully managed infrastructure. You provide the location of the data in <a href="https://aws.amazon.com/s3/">Amazon Simple Storage Service</a> (S3), and the data is downloaded at the file level into the processing container.</p><p><a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html">Amazon Athena</a> is an interactive query service that makes it easy to analyze and query data in Amazon S3 using standard SQL. Thanks to the AthenaDatasetDefinition parameter, it is available as a managed capability in SageMaker Processing. It can be used to query data in a processing job, adding an extra layer between the user querying the data and S3. It brings the following advantages:</p><ul><li>filtering data before downloading it into the processing job with a SQL query</li><li><a href="https://docs.aws.amazon.com/athena/latest/ug/fine-grained-access-to-glue-resources.html">fine-grained access</a> to databases and tables in the AWS Glue Data Catalog</li><li><a href="https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html">Workgroups</a> to control query access and costs</li></ul><p>At the time of writing there are hardly any public examples of how to use a SageMaker processing job in conjunction with Athena. In this blog we will see how to configure a processing job that uses Athena to query a data source and applies a data processing script. We cover two ways to do it, using Boto3 and using the Amazon SageMaker Python SDK, with a code example for each.</p><p>The workflow is:</p><ol><li>The processing job sends the SQL query to Athena.</li><li>Athena queries the data source registered with the AWS Glue Data Catalog.</li><li>The results of the query are saved in S3.</li><li>The processing job downloads the S3 results into the deployed container to process the data.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/715/1*BcPxk61ENxU5lMavc5D82w.png" /><figcaption>Architecture and workflow</figcaption></figure><h3>Prerequisites</h3><p>To follow this guide you need:</p><ol><li>an existing data source in Amazon Athena</li><li>access to that data source with the role used by Amazon SageMaker</li><li>the SageMaker Python SDK and Boto3 installed</li><li>a file containing your data processing code (your script) saved in S3 (if you use Boto3)</li></ol><h3>Boto3</h3><h4>API call configuration</h4><p>Following the SageMaker <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html">Boto3</a> documentation, we will use the <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job">API call</a> that creates a processing job. We do not need all of the parameters of the API call to create a basic processing job. The code below shows how to configure a processing job that uses Amazon Athena to query a database with SQL, applies the data processing script to the queried data, and finally saves the results to S3. A few things are worth mentioning.</p><p>The default parameters and API requirements such as AppSpecification, ProcessingOutputConfig, ProcessingJobName, ProcessingResources and RoleArn remain the same as when an S3 location is given as data input. The ProcessingOutputConfig contains the location of the processing job results. The ProcessingJobName is the job name. The ProcessingResources designates the hardware resources. The AppSpecification contains the link to the pre-built Amazon Docker image from the Elastic Container Registry (links to the containers can be found in the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-scikit-learn-spark.html">SageMaker documentation</a>) and the location of the script.</p><p>The ProcessingInputs parameter is where we change the data source to Athena rather than the default S3. It contains two inputs: the Athena dataset definition and the data processing script. The Athena dataset definition is the parameter that defines, with a SQL query, the data that will be downloaded into the processing container. To help you fill in the Athena dataset definition for your use case, the image below shows where each of its parameters comes from.</p>
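<p>A trimmed sketch of such a call is shown below; the database, query, bucket, role and image URI are placeholders to adapt to your account and region (the full example is in the gist linked in the Example section).</p><pre>import boto3

sm = boto3.client("sagemaker")

sm.create_processing_job(
    ProcessingJobName="athena-processing-job",
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    AppSpecification={
        # Pre-built scikit-learn image; see the SageMaker documentation
        # for the registry path of your region.
        "ImageUri": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
        "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/preprocessing.py"],
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    ProcessingInputs=[
        {
            # The Athena dataset definition replaces the usual S3Input.
            "InputName": "athena_data",
            "DatasetDefinition": {
                "LocalPath": "/opt/ml/processing/input/data",
                "DataDistributionType": "FullyReplicated",
                "InputMode": "File",
                "AthenaDatasetDefinition": {
                    "Catalog": "AwsDataCatalog",
                    "Database": "my_database",                # placeholder
                    "QueryString": "SELECT * FROM my_table",  # placeholder
                    "OutputS3Uri": "s3://my-bucket/athena-results/",
                    "OutputFormat": "PARQUET",  # or TEXTFILE, ORC, AVRO, JSON
                },
            },
        },
        {
            # The data processing script, downloaded into the container.
            "InputName": "code",
            "S3Input": {
                "S3Uri": "s3://my-bucket/code/preprocessing.py",
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        },
    ],
    ProcessingOutputConfig={
        "Outputs": [
            {
                "OutputName": "output",
                "S3Output": {
                    "S3Uri": "s3://my-bucket/processing-output/",
                    "LocalPath": "/opt/ml/processing/output",
                    "S3UploadMode": "EndOfJob",
                },
            }
        ]
    },
)
</pre>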
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9ZYAM6-0oBNd3dlInWlb0Q.jpeg" /><figcaption>Athena Dataset Definition</figcaption></figure><h4>Example</h4><p>Let us first set up the context. Athena queries a data source with the configuration shown in the image above. A Python file located in S3, called ‘preprocessing.py’, contains a simple <a href="https://gist.github.com/bpatoul/0eadf40561a5059b3a11f2c057f2b0e8">data processing script</a> that separates the data into training, validation and testing datasets. The region is us-east-1 and the framework is scikit-learn. I want the results to be saved in S3. The configured API call example can be found at this <a href="https://gist.github.com/bpatoul/254331c91c4ac764b7d3bec4d6ffec5a">link</a>.</p><h3>Amazon SageMaker Python SDK</h3><h4>API call configuration</h4><p>The Amazon SageMaker Python SDK is an open source library for training and deploying ML models on Amazon SageMaker. With the SDK, you can train and deploy models using popular deep learning frameworks, algorithms provided by Amazon, or your own algorithms built into SageMaker-compatible Docker images.</p><p>The code below starts by importing the necessary libraries and <a href="https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor">defining the scikit-learn framework</a> (you could use <a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.spark.processing.PySparkProcessor">Spark</a> if you prefer). The next step is to <a href="https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.dataset_definition.inputs.AthenaDatasetDefinition">define the Athena dataset</a>, which is then used by the <a href="https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.dataset_definition.inputs.DatasetDefinition">DatasetDefinition</a>. Once the dataset is configured, you can <a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run">run the processing job</a>, with the data processing script, in the scikit-learn container you previously defined.</p>
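<p>A sketch of this flow with the SDK, using the same placeholder database, query and bucket as in the Boto3 section:</p><pre>from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.dataset_definition.inputs import (
    AthenaDatasetDefinition,
    DatasetDefinition,
)
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = get_execution_role()

# Scikit-learn processing container managed by SageMaker.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# The Athena query whose results will be downloaded into the container.
dataset = DatasetDefinition(
    local_path="/opt/ml/processing/input/data",
    data_distribution_type="FullyReplicated",
    input_mode="File",
    athena_dataset_definition=AthenaDatasetDefinition(
        catalog="AwsDataCatalog",
        database="my_database",                 # placeholder
        query_string="SELECT * FROM my_table",  # placeholder
        output_s3_uri="s3://my-bucket/athena-results/",
        output_format="PARQUET",
    ),
)

# Run the processing script against the queried data; one output per dataset.
sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(input_name="athena_data", dataset_definition=dataset)],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/output/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
    ],
)
</pre>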
<h4>Example</h4><p>We will use the same context and the same <a href="https://gist.github.com/bpatoul/0eadf40561a5059b3a11f2c057f2b0e8">script</a> as in the Boto3 example, now applied to the SageMaker SDK. There is one small addition in this example that could also be applied to the Boto3 one. There are three ProcessingOutput entries (saved in S3), one selecting each of the datasets generated by the script: the training, validation and test datasets.</p><p>The Boto3 example saves the results (train.csv, validation.csv and test.csv) to S3 including their directories within ‘/opt/ml/processing/output’: it creates three directories (train, validation and test) within the S3 location where you asked to save the results. The SageMaker SDK example saves the files (train.csv, validation.csv and test.csv) to S3 without their directories. You can find the code example at this <a href="https://gist.github.com/bpatoul/18d055d89847eeb1d0e27c2eb68c7135">link</a>.</p><h3>Conclusion</h3><p>We have learned how to run a processing job with Boto3 and with the SageMaker SDK, using Amazon Athena as a data source, a data processing script and a pre-built container from SageMaker.</p><p>When you query data with Athena via a processing job, in both cases (SDK and Boto3) the Athena query results are saved in S3 and then downloaded to the processing job. You might no longer need this data once it has been processed. You could delete the queried data manually after each processing job, but this can be time consuming. Instead, you can set up an object lifecycle rule on your S3 bucket that automatically deletes unused data according to your configuration. Please see the <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html">documentation</a> for more information and examples.</p>]]></content:encoded>
        </item>
    </channel>
</rss>