Pay as you use SageMaker Serverless inference with GPT-2

  1. Using Serverless framework
    Two options:
    * Create a Lambda layer (which contains the dependency libraries) and attach it to the Lambda function.
    * Use a Docker container (for example, host Hugging Face BERT models or image classification models on S3 and serve them through the Serverless framework and Lambda functions).
  2. Using AWS CDK (Cloud Development Kit)
  3. Using AWS SAM (Serverless Application Model)
    Host deep learning models on S3, load them onto EFS (which acts as a model cache), and serve the inference requests.
    Two options:
    * Using the SAM Hello World template: create a Lambda function with code and an API Gateway trigger.
    * Using the SAM Machine Learning template: create a Docker container with all the code, attach it to a Lambda function, and create an API Gateway trigger.
  4. Using SageMaker Serverless inference

SageMaker Serverless inference Use cases

  • Use this option for workloads that do not receive inference requests steadily throughout the day, such as a customer feedback service, a chatbot, or document analysis jobs, and that can tolerate cold starts.
  • Serverless endpoints automatically launch compute resources and scale them in and out based on the workload, so you pay only for invocations and can save a lot of cost.

Warming up Cold Starts

  • You can create a health-check service that loads the model without actually using it, and invoke that service periodically or while users are still exploring the application.
  • You can use an Amazon CloudWatch Events (EventBridge) schedule to keep the Lambda service warm.
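The scheduled warm-up described above can be sketched with boto3's EventBridge (CloudWatch Events) client. This is a rough sketch only; the rule name, target id, and ping interval are assumptions, and the actual warm-up setup is not shown in the article:

```python
def schedule_expression(minutes=5):
    """Build an EventBridge rate expression (the unit is singular for 1)."""
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"

def keep_warm(rule_name, lambda_arn, minutes=5):
    """Create a scheduled rule that periodically pings the Lambda to keep it warm."""
    import boto3  # lazy import: only needed when actually talking to AWS
    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression=schedule_expression(minutes))
    events.put_targets(Rule=rule_name, Targets=[{"Id": "warm-ping", "Arn": lambda_arn}])
```

You would also grant EventBridge permission to invoke the function (e.g. via `lambda:AddPermission`) before the pings take effect.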

Import necessary libraries and Setup permissions

NOTE: You can run this demo in SageMaker Studio, on your local machine, or on a SageMaker Notebook Instance.

Dev environment and permissions (Screenshot by Author)
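The setup shown in the screenshot can be sketched roughly as follows. The region is an assumption; inside SageMaker you would typically obtain the execution role with `sagemaker.get_execution_role()`, or pass an IAM role ARN when running locally:

```python
DEFAULT_REGION = "us-east-1"  # assumption: change to your region

def sagemaker_clients(region=DEFAULT_REGION):
    """Return the SageMaker control-plane and runtime clients."""
    import boto3  # lazy import: only needed when actually talking to AWS
    sm = boto3.client("sagemaker", region_name=region)
    smr = boto3.client("sagemaker-runtime", region_name=region)
    return sm, smr
```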

Retrieve Model Artifacts

GPT-2 model

Retrieve GPT-2 model artifacts (Screenshot by Author)
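The retrieval step in the screenshot amounts to downloading the GPT-2 model and tokenizer from the Hugging Face Hub and saving them to a local folder. A minimal sketch, assuming the `transformers` library and the `gpt2` Hub id (the local folder name is an assumption):

```python
from pathlib import Path

MODEL_ID = "gpt2"  # the 124M-parameter GPT-2 on the Hugging Face Hub

def download_artifacts(model_id=MODEL_ID, save_dir="model"):
    """Download the model and tokenizer and save the artifacts locally."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    AutoModelForCausalLM.from_pretrained(model_id).save_pretrained(out)
    AutoTokenizer.from_pretrained(model_id).save_pretrained(out)
    return out
```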

Write the Inference Script

GPT-2 model

Inference script for GPT-2 model (Screenshot by Author)
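The author's actual script is in the screenshot; as a minimal sketch, the SageMaker Hugging Face inference container looks for handler hooks such as `model_fn` and `predict_fn` in `code/inference.py`, which for GPT-2 might look like this:

```python
# code/inference.py -- a minimal sketch of the handler hooks the
# Hugging Face inference container looks for (not the author's exact script)

def model_fn(model_dir):
    """Load the GPT-2 artifacts that SageMaker unpacks from model.tar.gz."""
    from transformers import pipeline  # provided inside the Hugging Face DLC
    return pipeline("text-generation", model=model_dir, tokenizer=model_dir)

def predict_fn(data, generator):
    """Run generation; `inputs`/`parameters` follow the usual HF request schema."""
    prompt = data.pop("inputs", "")
    params = data.pop("parameters", {})
    return generator(prompt, **params)
```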

Package Model

For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a tar archive named “model.tar.gz” with gzip compression. Within the archive, the Hugging Face container expects all inference code files to be inside the code/ directory.

Package models (Screenshot by Author)
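The layout described above (artifacts at the archive root, the inference script under code/) can be sketched with the standard library; the file and folder names are assumptions:

```python
import tarfile
from pathlib import Path

def package_model(model_dir="model", script="inference.py", out="model.tar.gz"):
    """Pack model artifacts at the archive root and the script under code/."""
    with tarfile.open(out, "w:gz") as tar:
        for f in Path(model_dir).iterdir():
            tar.add(f, arcname=f.name)  # config.json, pytorch_model.bin, ...
        tar.add(script, arcname=f"code/{Path(script).name}")
    return out
```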

Upload GPT-2 model to S3

Upload model to S3 (Screenshot by Author)
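The upload step can be sketched with boto3's S3 client; the bucket name and key prefix below are assumptions:

```python
def s3_uri(bucket, key):
    """Build the S3 URI that the endpoint configuration will reference."""
    return f"s3://{bucket}/{key}"

def upload_model(bucket, key="gpt2/model.tar.gz", local_path="model.tar.gz"):
    """Upload the packaged archive to S3 and return its URI."""
    import boto3  # lazy import; needs AWS credentials at call time
    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)
```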

Create and Deploy a Serverless GPT-2 model

We are using a CPU-based Hugging Face container image to host the inference script, because GPUs are not supported on Serverless endpoints; hopefully the AWS team will add GPU support soon 😄.

Define the DLC (Screenshot by Author)
Create SM endpoint config and endpoint (Screenshot by Author)
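The two screenshots above boil down to three boto3 calls: create the model from the DLC image and the S3 archive, create an endpoint config with a `ServerlessConfig` (memory from 1024 to 6144 MB in 1 GB increments, plus a max concurrency), and create the endpoint. A sketch, with the resource-name suffixes and the memory/concurrency values as assumptions:

```python
def serverless_config(memory_mb=4096, max_concurrency=4):
    """Serverless limits: MemorySizeInMB between 1024 and 6144, in 1 GB steps."""
    return {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}

def deploy_serverless(name, image_uri, model_data_url, role_arn):
    """Create the model, a serverless endpoint config, and the endpoint."""
    import boto3  # lazy import; needs AWS credentials at call time
    sm = boto3.client("sagemaker")
    sm.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
    )
    sm.create_endpoint_config(
        EndpointConfigName=f"{name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "ServerlessConfig": serverless_config(),
        }],
    )
    sm.create_endpoint(EndpointName=f"{name}-ep",
                       EndpointConfigName=f"{name}-config")
```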

Get Predictions

GPT-2 model

Get predictions from GPT-2 model (Screenshot by Author)
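Invoking the endpoint can be sketched with the SageMaker runtime client; the endpoint name and generation parameters are assumptions, and the request body follows the usual Hugging Face `inputs`/`parameters` schema:

```python
import json

def build_payload(prompt, max_new_tokens=50):
    """Request body in the Hugging Face inputs/parameters schema."""
    return json.dumps({"inputs": prompt,
                       "parameters": {"max_new_tokens": max_new_tokens}})

def generate(endpoint_name, prompt):
    """Invoke the serverless endpoint and decode the JSON response."""
    import boto3  # lazy import; needs AWS credentials at call time
    smr = boto3.client("sagemaker-runtime")
    resp = smr.invoke_endpoint(EndpointName=endpoint_name,
                               ContentType="application/json",
                               Body=build_payload(prompt))
    return json.loads(resp["Body"].read())
```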

Monitor Serverless GPT-2 model endpoint

The ModelSetupTime metric helps you track the time (the cold start time) it takes to launch new compute resources and set up a Serverless endpoint. It depends on the size of the model and the container startup time.
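The metric can be pulled programmatically from CloudWatch; a sketch, assuming the default "AllTraffic" variant name and a one-hour look-back window:

```python
def model_setup_time(endpoint_name, variant="AllTraffic", hours=1):
    """Fetch the average ModelSetupTime for the endpoint over the last `hours`."""
    from datetime import datetime, timedelta, timezone
    import boto3  # lazy import; needs AWS credentials at call time
    now = datetime.now(timezone.utc)
    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelSetupTime",
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
```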

CloudWatch metrics: First inference request (Screenshot by Author)
CloudWatch metrics: Second inference request (Screenshot by Author)
Memory Utilization (Screenshot by Author)


Clean up (Screenshot by Author)
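Cleanup deletes the resources in roughly the reverse order of creation; the `-ep`/`-config` name suffixes below are assumptions matching the deployment sketch, not names taken from the article:

```python
def clean_up(name):
    """Delete the endpoint, its config, and the model to stop all charges."""
    import boto3  # lazy import; needs AWS credentials at call time
    sm = boto3.client("sagemaker")
    sm.delete_endpoint(EndpointName=f"{name}-ep")
    sm.delete_endpoint_config(EndpointConfigName=f"{name}-config")
    sm.delete_model(ModelName=name)
```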


We successfully deployed GPT-2 (a text generation model) to an Amazon SageMaker Serverless endpoint using the SageMaker boto3 API.



Vinayak Shanawad

Machine Learning Engineer | 3x Kaggle Expert | Learning, improving and evolving.