Pay as you use SageMaker Serverless inference with GPT-2
#HuggingFace #AWS #GPT-2 #SageMaker #MLOps
I’m very excited to write this article. Huge credit 👏 to the AWS team for making the SageMaker Serverless inference option generally available.
Lately, I’ve been looking into hosting Machine Learning models on serverless infrastructure, and I found that there are multiple ways to achieve that.
- Using the Serverless Framework. Two options:
 * Create a Lambda layer (which contains the dependency libraries) and attach it to a Lambda function.
 * Use a Docker container (for example, host Hugging Face BERT models or image classification models on S3 and serve them through the Serverless Framework and Lambda functions).
- Using the AWS CDK (Cloud Development Kit)
- Using AWS SAM (Serverless Application Model). Host Deep Learning models on S3, load them onto EFS (which acts like a model cache), and serve the inference requests. Two options:
 * Using the SAM Hello World template - create a Lambda function with the code and an API Gateway trigger.
 * Using the SAM Machine Learning template - create a Docker container with all the code, attach it to a Lambda function, and create an API Gateway trigger.
- Using SageMaker Serverless inference
The problem with the first three options is that we have to build, manage, and maintain all of our containers ourselves.
I found that the SageMaker (SM) Serverless inference option allows you to focus on the model-building process without having to manage the underlying infrastructure. You can choose either an SM built-in container or bring your own.
SageMaker Serverless inference Use cases
- Use this option when you don’t receive inference requests often throughout the day (such as a customer feedback service, a chatbot application, or analyzing data from documents) and your application can tolerate cold starts.
- Serverless endpoints automatically launch compute resources and scale them in and out based on the workload. You pay only for invocations, which can save a lot of cost.
Reference: AWS Documentation
Warming up the Cold Starts
- You can create a health-check service that loads the model but does not use it, and invoke that service periodically or while users are still exploring the application.
- Use a scheduled AWS CloudWatch (EventBridge) rule to keep the service warm.
This article will demonstrate how to host a pretrained transformer model (GPT-2) on a SageMaker Serverless endpoint using the SageMaker boto3 API.
NOTE: At the time of writing, only CPU instances are supported for Serverless endpoints.
Import necessary libraries and Setup permissions
NOTE: You can run this demo in SageMaker Studio, on your local machine, or on a SageMaker Notebook Instance.
If you are going to use SageMaker in a local environment (not SageMaker Studio or a Notebook Instance), you need access to an IAM role with the required SageMaker permissions. You can find more about it here.
Retrieve Model Artifacts
GPT-2 model
We will download the model artifacts for the pretrained GPT-2 model. GPT-2 is a popular text generation model developed by OpenAI. Given a text prompt, it can generate synthetic text that may follow it.
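A minimal sketch of downloading the artifacts with the transformers library (model id "gpt2" on the Hugging Face Hub); the local save directory is an assumption:

```python
def download_gpt2(save_dir="model"):
    """Download the pretrained GPT-2 weights and tokenizer from the
    Hugging Face Hub and save them locally so they can be packaged for S3.
    """
    # Imported lazily so this module loads even if transformers is not
    # installed in the current environment.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)
    return save_dir
```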
Write the Inference Script
GPT-2 model
In the next cell we’ll see our inference script for the GPT-2 model.
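A minimal sketch of what such an inference script could look like. `model_fn` and `predict_fn` are the hook names the SageMaker Hugging Face container looks for; the generation parameters are illustrative assumptions:

```python
# code/inference.py -- sketch of an inference script for the Hugging Face
# container. model_fn loads the artifacts once; predict_fn handles requests.


def model_fn(model_dir):
    """Load the tokenizer and model from the unpacked model.tar.gz."""
    # Imported lazily so the module can be loaded without transformers
    # installed locally.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    """Generate a continuation for the prompt in data["inputs"]."""
    model, tokenizer = model_and_tokenizer
    prompt = data.get("inputs", data)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # max_length and sampling are illustrative defaults, not tuned values.
    output = model.generate(input_ids, max_length=50, do_sample=True)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return [{"generated_text": text}]
```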
Package Model
For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a tar archive named “model.tar.gz” with gzip compression. Within the archive, the Hugging Face container expects all inference code files to be inside the code/ directory.
Upload GPT-2 model to S3
Create and Deploy a Serverless GPT-2 model
We are using a CPU-based Hugging Face container image to host the inference script. GPUs are not supported on Serverless endpoints; hopefully the AWS team will add GPU support soon 😄.
Next we will create a SageMaker model, an endpoint config, and an endpoint. While creating the endpoint config we have to specify “ServerlessConfig”, which contains two parameters: MemorySizeInMB and MaxConcurrency. This is the only difference for a Serverless endpoint; otherwise everything remains the same as in Real-time inference.
MemorySizeInMB: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. The memory size should be at least as large as your model size.
MaxConcurrency: The maximum number of concurrent invocations your serverless endpoint can process.
Get Predictions
GPT-2 model
Now that our Serverless endpoint is deployed, we can send it text to get predictions from our GPT-2 model. You can use the SageMaker Python SDK or the SageMaker Runtime API to invoke the endpoint.
Monitor Serverless GPT-2 model endpoint
The ModelSetupTime metric helps you track the time (cold-start time) it takes to launch new compute resources to set up a Serverless endpoint. It depends on the size of the model and the container start-up time.
The Serverless endpoint takes around 12 seconds to host the GPT-2 model with the available compute resources and around 3.9 seconds to serve the first inference request.
The Serverless GPT-2 endpoint serves subsequent inference requests within 1 second, which is great news 🙌.
The Serverless endpoint utilizes 16.14% of the memory.
Clean-up
Conclusion
We successfully deployed GPT-2 (a text generation model) to an Amazon SageMaker Serverless endpoint using the SageMaker boto3 API.
The big advantage of a Serverless endpoint is that your Data Science team can focus on the model-building process instead of spending thousands of dollars while implementing a POC or at the start of a new product. After the POC is successful, you can easily move your model to a real-time endpoint with GPUs to handle production workloads.
The complete source code for this article is available in the GitHub repo.
Thanks for reading!! Let me know if you have any questions.