Use Amazon Athena in a processing job with Amazon SageMaker
A guide on how to configure an Amazon SageMaker processing job in conjunction with Amazon Athena.
Amazon SageMaker Processing lets you analyze data and evaluate machine learning models on Amazon SageMaker with fully managed infrastructure. It downloads the data to process at the file level into the processing container when you provide its location in Amazon Simple Storage Service (Amazon S3).
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It is available as a managed capability in SageMaker Processing through the AthenaDatasetDefinition parameter, which can be used to query data in a processing job, adding a query layer between the user and S3. This brings the following advantages:
- filtering data with a SQL query before downloading it into the processing job
- fine-grained access to databases and tables in the AWS Glue Data Catalog
- workgroups to control query access and costs
There are currently no solutions or examples on the internet showing how to use a SageMaker processing job in conjunction with Athena. In this blog we will see how to configure a processing job that uses Athena to query a data source and applies a data processing script to the results. It covers two ways to do it: with Boto3 and with the Amazon SageMaker Python SDK. A code example is available for both.
The workflow is:
- The processing job sends the SQL query to Athena.
- Athena queries the data source registered with the AWS Glue Data Catalog.
- The results of the query are saved in S3.
- The processing job downloads the query results from S3 into the deployed container to process the data.
Prerequisites
To follow and apply this guide, you need:
- an existing data source that Amazon Athena can query
- access to that data source from the role used by Amazon SageMaker
- the Amazon SageMaker Python SDK and Boto3 installed
- a file containing your data processing code (your script) saved in S3 (if you use Boto3)
Boto3
API call configuration
Following the SageMaker Boto3 documentation, we will use the create_processing_job API call to create a processing job. We do not need all of the parameters of the API call for a basic processing job. The following code shows how to configure a processing job that uses Amazon Athena to query a database with SQL, applies the file containing the data processing code to the queried data, and finally saves the results to S3. A few points are worth mentioning.
The default parameters and API requirements such as AppSpecification, ProcessingOutputConfig, ProcessingJobName, ProcessingResources and RoleArn remain the same as when an S3 location is given as data input. ProcessingOutputConfig contains the location where the processing job results are saved. ProcessingJobName is the job name. ProcessingResources designates the hardware resources. AppSpecification contains the URI of the pre-built SageMaker Docker image from Amazon Elastic Container Registry (links to the containers can be found in the SageMaker documentation) and the location of the script.
The ProcessingInputs parameter is where we replace the default S3 data source with Athena. It contains two inputs: the Athena dataset definition and the data processing script. The Athena dataset definition is the parameter that defines, with a SQL query, the data that will be downloaded into the processing container. To help you fill in the Athena dataset definition for your use case, I have added an image below showing where you can get the different parameters that define the dataset.
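The sketch below shows what such a create_processing_job request could look like. The role ARN, bucket, database, table, query and job name are placeholders to replace with your own values, and the image URI shown is the pre-built scikit-learn image for us-east-1 (check the SageMaker documentation for the URI in your region):

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_processing_job(
    ProcessingJobName="athena-processing-job-example",          # placeholder
    RoleArn="arn:aws:iam::111122223333:role/SageMakerRole",      # placeholder
    AppSpecification={
        # Pre-built scikit-learn container for us-east-1 (see the SageMaker docs).
        "ImageUri": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
        # Run the script that the second ProcessingInput downloads into the container.
        "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/preprocessing.py"],
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    ProcessingInputs=[
        {
            # First input: the Athena dataset definition.
            "InputName": "athena_dataset",
            "DatasetDefinition": {
                "AthenaDatasetDefinition": {
                    "Catalog": "AwsDataCatalog",                      # placeholder
                    "Database": "my_database",                        # placeholder
                    "QueryString": "SELECT * FROM my_table",          # placeholder
                    "OutputS3Uri": "s3://my-bucket/athena-query-results/",
                    "OutputFormat": "PARQUET",
                    "WorkGroup": "primary",
                },
                # Where the query results are downloaded inside the container.
                "LocalPath": "/opt/ml/processing/input/data",
                "DataDistributionType": "FullyReplicated",
                "InputMode": "File",
            },
        },
        {
            # Second input: the data processing script saved in S3.
            "InputName": "code",
            "S3Input": {
                "S3Uri": "s3://my-bucket/code/preprocessing.py",      # placeholder
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        },
    ],
    ProcessingOutputConfig={
        "Outputs": [
            {
                "OutputName": "output",
                "S3Output": {
                    "S3Uri": "s3://my-bucket/processing-output/",     # placeholder
                    "LocalPath": "/opt/ml/processing/output",
                    "S3UploadMode": "EndOfJob",
                },
            }
        ]
    },
)
```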
Example
Let us first set up the context. Athena queries a data source with the configuration shown in the image above. I have a Python file called ‘preprocessing.py’ located in S3, containing a simple data processing script that separates the data into training, validation and testing datasets. The region is us-east-1 and we will use the scikit-learn framework. I want the results to be saved in S3. The configured API call example can be found at this link.
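For context, here is a minimal sketch of what a script like ‘preprocessing.py’ could look like. It is not the exact script used here; it assumes the Athena results are Parquet files (requiring pyarrow), and the column-free split ratios are arbitrary:

```python
# preprocessing.py - minimal sketch of a train/validation/test split script.
# SageMaker Processing convention: inputs are downloaded under
# /opt/ml/processing/input/... and anything written under
# /opt/ml/processing/output/... is uploaded to S3 at the end of the job.
import glob
import os

import pandas as pd
from sklearn.model_selection import train_test_split

input_dir = "/opt/ml/processing/input/data"
output_dir = "/opt/ml/processing/output"

# Read every file that the Athena query produced (assumed to be Parquet).
files = glob.glob(os.path.join(input_dir, "*"))
df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)

# 70% train, 15% validation, 15% test (arbitrary ratios for this sketch).
train, rest = train_test_split(df, test_size=0.3, random_state=42)
validation, test = train_test_split(rest, test_size=0.5, random_state=42)

for name, split in [("train", train), ("validation", validation), ("test", test)]:
    os.makedirs(os.path.join(output_dir, name), exist_ok=True)
    split.to_csv(os.path.join(output_dir, name, f"{name}.csv"), index=False)
```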
Amazon SageMaker Python SDK
API call configuration
Amazon SageMaker Python SDK is an open source library for training and deploying ML models on Amazon SageMaker. With the SDK, you can train and deploy models using popular deep learning frameworks, algorithms provided by Amazon, or your own algorithms built into SageMaker-compatible Docker images.
The code below starts by importing the necessary libraries and then defines the scikit-learn framework (you could use Spark if you prefer). The next step is to define the Athena dataset, which is then used by the DatasetDefinition. Once the dataset is configured, you can run the processing job with the data processing script in the scikit-learn container that you previously defined.
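A sketch of that flow with the SageMaker Python SDK is shown below; the role, bucket, database, table and query are placeholders to replace with your own values:

```python
import sagemaker
from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition, DatasetDefinition
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Works in a SageMaker notebook; otherwise pass the ARN of a role that can use Athena.
role = sagemaker.get_execution_role()

# Scikit-learn processing container (you could use Spark instead).
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# SQL query that Athena runs against the data source in the AWS Glue Data Catalog.
athena_dataset = AthenaDatasetDefinition(
    catalog="AwsDataCatalog",                               # placeholder
    database="my_database",                                 # placeholder
    query_string="SELECT * FROM my_table",                  # placeholder
    output_s3_uri="s3://my-bucket/athena-query-results/",   # placeholder
    output_format="PARQUET",
    work_group="primary",
)

dataset_definition = DatasetDefinition(
    athena_dataset_definition=athena_dataset,
    local_path="/opt/ml/processing/input/data",  # where the query results are downloaded
    data_distribution_type="FullyReplicated",
    input_mode="File",
)

sklearn_processor.run(
    code="preprocessing.py",  # the SDK uploads this local script for you
    inputs=[ProcessingInput(input_name="athena_dataset", dataset_definition=dataset_definition)],
    outputs=[
        ProcessingOutput(
            output_name="output",
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/processing-output/",  # placeholder
        )
    ],
)
```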
Example
We will use the same context and the same script as in the Boto3 example, but now applied to the SageMaker SDK. There is one small addition in this example that could also be applied to the Boto3 one: there are three ProcessingOutput objects (saved in S3), one for each of the datasets generated by the script, namely the training, validation and test datasets.
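Those three outputs could be declared as in the sketch below (the bucket name and prefix are placeholders):

```python
from sagemaker.processing import ProcessingOutput

# One output per dataset written by the script. Each source points at a
# subdirectory, so only the files inside it (train.csv, validation.csv,
# test.csv) are uploaded, side by side under the same S3 prefix.
outputs = [
    ProcessingOutput(
        output_name="train",
        source="/opt/ml/processing/output/train",
        destination="s3://my-bucket/processing-output/",
    ),
    ProcessingOutput(
        output_name="validation",
        source="/opt/ml/processing/output/validation",
        destination="s3://my-bucket/processing-output/",
    ),
    ProcessingOutput(
        output_name="test",
        source="/opt/ml/processing/output/test",
        destination="s3://my-bucket/processing-output/",
    ),
]
```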
The Boto3 example saves the results (train.csv, validation.csv and test.csv) in S3 including the directories under ‘/opt/ml/processing/output’; it therefore creates three directories (train, validation and test) within the S3 location where you asked the results to be saved. The SageMaker SDK example saves the files (train.csv, validation.csv and test.csv) in S3 without their directories. You can find the code example at this link.
Conclusion
We have learned how to run a processing job with Boto3 and with the SageMaker SDK, using Amazon Athena as a data source, a data processing script and a pre-built SageMaker container.
In both cases (SDK and Boto3), when you query data with Athena via a processing job, the Athena query results are saved in S3 and then downloaded into the processing job. Over time, you may no longer need this data once it has been processed. You could delete the queried data manually after each processing job, but this can be time consuming. Instead, you can configure an object lifecycle rule on your S3 bucket that automatically deletes unused data according to your configuration. Please see the documentation for more information and examples.
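As an illustration, here is a minimal sketch of such a lifecycle rule configured with Boto3; the bucket name, prefix and retention period are assumptions to adapt to your setup:

```python
import boto3

s3 = boto3.client("s3")

# Expire Athena query results under the given prefix after 7 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder: the bucket holding the Athena query results
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-query-results",
                "Filter": {"Prefix": "athena-query-results/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```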