Pay as you use SageMaker Serverless inference with GPT-2

#HuggingFace #AWS #GPT-2 #SageMaker #MLOps

Vinayak Shanawad
5 min readAug 1, 2022
Image from cloudops

I’m very excited to write this article. Huge credits 👏 to the AWS team for making the SageMaker Serverless Inference option generally available.

Lately, I’ve been looking into hosting Machine Learning models on serverless infrastructure and found that there are multiple ways to achieve that.

  1. Using the Serverless framework
    Two options:
    * Create a Lambda layer (which contains the dependency libraries) and attach it to a Lambda function.
    * Use a Docker container (for example, host Hugging Face BERT models or image classification models on S3 and serve them through the Serverless framework and Lambda functions).
  2. Using AWS CDK (Cloud Development Kit)
  3. Using AWS SAM (Serverless Application Model)
    Host Deep Learning models on S3, load them onto EFS (similar to caching the models), and serve the inference requests.
    Two options:
    * Using the SAM HelloWorld template - Create a Lambda function with the code and an API Gateway trigger.
    * Using the SAM Machine Learning template - Create a Docker container with all the code, attach it to a Lambda function, and create an API Gateway trigger.
  4. Using SageMaker Serverless inference
  4. Using SageMaker Serverless inference

The problem with the first three options is that you have to build, manage, and maintain all of your containers.

I found that the SageMaker (SM) Serverless Inference option allows you to focus on the model-building process without having to manage the underlying infrastructure. You can either use a SageMaker built-in container or bring your own.

SageMaker Serverless Inference Use Cases

  • Use this option for workloads that don’t receive inference requests throughout the entire day, such as customer-feedback services, chatbot applications, or document-analysis jobs, and that can tolerate cold starts.
  • Serverless endpoints automatically launch compute resources and scale them in and out based on the workload. You pay only for invocations, which can save a lot of cost.

Reference: AWS Documentation

Warming up the Cold Starts

  • You can create a health-check service that loads the model without actually using it, and invoke that service periodically or while users are still exploring the application.
  • Use an AWS CloudWatch scheduled rule to keep the Lambda-based service warm, as sketched below.
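As a minimal sketch (not from the original article), a warm-up Lambda handler could simply ping the endpoint on a schedule. The endpoint name and the payload key "inputs" are assumptions that should match your own deployment.

```python
import json
import boto3

# Hypothetical endpoint name; replace with the name of your serverless endpoint.
ENDPOINT_NAME = "gpt2-serverless-ep"

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # Send a tiny dummy prompt so the serverless container stays warm.
    runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": "ping"}),
    )
    return {"statusCode": 200, "warm": True}
```

You could trigger this handler every few minutes with a CloudWatch/EventBridge scheduled rule.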

This article will demonstrate how to host a pretrained transformer model (GPT-2) on a SageMaker Serverless endpoint using the SageMaker boto3 API.

NOTE: At the time of writing, only CPU instances are supported for Serverless endpoints.

Import necessary libraries and set up permissions

NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances.

If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.

Dev environment and permissions (Screenshot by Author)
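A rough sketch of this setup cell, assuming the sagemaker and boto3 packages are installed; the fallback IAM role name "sagemaker_execution_role" is a hypothetical placeholder.

```python
import boto3
import sagemaker

sess = sagemaker.Session()
region = sess.boto_region_name

try:
    # Works inside SageMaker Studio / Notebook Instances.
    role = sagemaker.get_execution_role()
except ValueError:
    # In a local environment, look up an IAM role that has SageMaker permissions.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

bucket = sess.default_bucket()  # default S3 bucket used for model artifacts
print(role, region, bucket)
```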

Retrieve Model Artifacts

GPT-2 model

We will download the model artifacts for the pretrained GPT-2 model. GPT-2 is a popular text-generation model developed by OpenAI. Given a text prompt, it can generate synthetic text that may follow it.

Retrieve GPT-2 model artifacts (Screenshot by Author)
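A minimal sketch of this step using the transformers library; the local directory name "model" is illustrative.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MODEL_DIR = "model"  # local directory for the downloaded artifacts

# Download the pretrained GPT-2 model and tokenizer from the Hugging Face Hub.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Save the artifacts locally so they can be packaged for SageMaker.
model.save_pretrained(MODEL_DIR)
tokenizer.save_pretrained(MODEL_DIR)
```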

Write the Inference Script

GPT-2 model

In the next cell we’ll see our inference script for the GPT-2 model.

Inference script for GPT-2 model (Screenshot by Author)
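The Hugging Face inference toolkit lets you override model_fn and predict_fn. A simplified sketch of such a script is shown below; the file name code/inference.py, the payload key "inputs", and the generation settings are assumptions, not the article's exact code.

```python
# code/inference.py
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def model_fn(model_dir):
    # Load the model and tokenizer that were packaged in model.tar.gz.
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    prompt = data.pop("inputs", "")
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generate a short continuation of the prompt.
    output_ids = model.generate(
        **inputs,
        max_length=50,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"generated_text": generated}
```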

Package Model

For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a tar archive named “model.tar.gz” with gzip compression. Within the archive, the Hugging Face container expects all inference code files to be inside the code/ directory.

Package models (Screenshot by Author)
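One way to build that archive, assuming the model artifacts sit in model/ and the script is code/inference.py (paths are illustrative):

```python
import os
import tarfile

# Expected layout inside model.tar.gz:
#   config.json, pytorch_model.bin, tokenizer files, ...
#   code/
#     inference.py

with tarfile.open("model.tar.gz", "w:gz") as tar:
    # Add the model artifacts at the root of the archive.
    for name in os.listdir("model"):
        tar.add(os.path.join("model", name), arcname=name)
    # Add the inference script under code/.
    tar.add("code/inference.py", arcname="code/inference.py")
```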

Upload GPT-2 model to S3

Upload model to S3 (Screenshot by Author)
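For example, using the SageMaker session's upload helper (the key prefix is illustrative):

```python
import sagemaker

sess = sagemaker.Session()

# Upload the packaged archive to the default SageMaker bucket.
model_data = sess.upload_data(
    path="model.tar.gz",
    bucket=sess.default_bucket(),
    key_prefix="gpt2-serverless/model",
)
print(model_data)  # s3://<bucket>/gpt2-serverless/model/model.tar.gz
```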

Create and Deploy a Serverless GPT-2 model

We are using a CPU-based Hugging Face container image to host the inference script, since GPUs are not supported in Serverless endpoints. Hopefully the AWS team will add GPU support to Serverless endpoints soon 😄.

Define the DLC (Screenshot by Author)
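One way to look up a CPU Hugging Face DLC image URI with the SageMaker SDK is sketched below. The transformers/PyTorch/Python versions are assumptions; pick a combination that is available in your region and SDK version.

```python
import sagemaker

# Retrieve a CPU-based Hugging Face (PyTorch) inference container image URI.
image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    region=sagemaker.Session().boto_region_name,
    version="4.17.0",                    # transformers version (illustrative)
    base_framework_version="pytorch1.10.2",
    py_version="py38",
    image_scope="inference",
    instance_type="ml.m5.xlarge",        # a CPU instance type resolves a CPU image
)
print(image_uri)
```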

Next, we will create a SageMaker model, an endpoint config, and an endpoint. We have to specify a “ServerlessConfig”, which contains the two parameters MemorySizeInMB and MaxConcurrency, when creating the endpoint config. This is the only difference for a Serverless endpoint; otherwise everything remains the same as for Real-time inference.

MemorySizeInMB: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. The memory size should be at least as large as your model size.

MaxConcurrency: The maximum number of concurrent invocations your serverless endpoint can process.

Create SM endpoint config and endpoint (Screenshot by Author)
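Roughly, with the boto3 SageMaker client, this step could look like the sketch below. The resource names are illustrative, and the role, image URI, and model data URL placeholders stand in for the values produced in the earlier steps.

```python
import boto3

sm_client = boto3.client("sagemaker")

role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # from the setup step
image_uri = "<huggingface-dlc-image-uri>"                            # from the DLC lookup
model_data = "s3://<bucket>/gpt2-serverless/model/model.tar.gz"      # from the S3 upload

model_name = "gpt2-serverless-model"
endpoint_config_name = "gpt2-serverless-epc"
endpoint_name = "gpt2-serverless-ep"

# 1. Create the SageMaker model from the HF container image and the S3 artifacts.
sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data},
)

# 2. Create an endpoint config with a ServerlessConfig instead of an instance type.
sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,   # at least as large as the model
                "MaxConcurrency": 1,
            },
        }
    ],
)

# 3. Create the serverless endpoint and wait until it is in service.
sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
```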

Get Predictions

GPT-2 model

Now that our Serverless endpoint is deployed, we can send it text to get predictions from our GPT-2 model. You can use the SageMaker Python SDK or the SageMaker Runtime API to invoke the endpoint.

Get predictions from GPT-2 model (Screenshot by Author)
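For example, with the SageMaker Runtime client (the endpoint name and the "inputs" payload key follow the assumed names used earlier):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {"inputs": "The best thing about Amazon SageMaker is"}

response = runtime.invoke_endpoint(
    EndpointName="gpt2-serverless-ep",   # name used when creating the endpoint
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read().decode("utf-8"))
print(result)
```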

Monitor Serverless GPT-2 model endpoint

The ModelSetupTime metric helps you track the time (the cold-start time) it takes to launch new compute resources to set up the Serverless endpoint. It depends on the size of the model and the container startup time.
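As a rough sketch, you could also pull this metric programmatically with the CloudWatch API; the namespace and dimension names below follow the standard SageMaker endpoint metrics, and the endpoint/variant names are the illustrative ones used earlier.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Fetch ModelSetupTime data points for the last hour.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelSetupTime",
    Dimensions=[
        {"Name": "EndpointName", "Value": "gpt2-serverless-ep"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average", "Maximum"],
)
print(response["Datapoints"])
```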

The Serverless endpoint takes around 12 seconds to launch compute resources and host the GPT-2 model, and around 3.9 seconds to serve the first inference request.

CloudWatch metrics: First inference request (Screenshot by Author)

The Serverless GPT-2 model endpoint serves subsequent inference requests within 1 second, which is great news 🙌.

CloudWatch metrics: Second inference request (Screenshot by Author)

The Serverless endpoint utilizes 16.14% of its memory.

Memory Utilization (Screenshot by Author)

Clean-up

Clean up (Screenshot by Author)
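Tearing down everything created above looks roughly like this (using the same illustrative resource names):

```python
import boto3

sm_client = boto3.client("sagemaker")

# Delete the endpoint, endpoint config, and model created earlier.
sm_client.delete_endpoint(EndpointName="gpt2-serverless-ep")
sm_client.delete_endpoint_config(EndpointConfigName="gpt2-serverless-epc")
sm_client.delete_model(ModelName="gpt2-serverless-model")
```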

Conclusion

We successfully deployed GPT-2 (a text-generation model) to an Amazon SageMaker Serverless endpoint using the SageMaker boto3 API.

The big advantage of a Serverless endpoint is that your Data Science team can focus on the model-building process instead of spending thousands of dollars while implementing a POC or at the start of a new product. After the POC is successful, you can easily deploy your model to a Real-time endpoint with GPUs to handle production workloads.

The complete source code for this article is available in the GitHub repo.

Thanks for reading!! Let me know if you have any questions.
