Seamless Integration: Deploying FastAPI ML Inference Code with SageMaker BYOC + Nginx

Imran Nazir
6 min read · Jun 26


Photo by Kevin Ku on Unsplash

With the advancement of ML models and the introduction of SageMaker Inference, the need to deploy ML inference code using our own containers in SageMaker has become evident. However, existing examples and blogs often use Flask API with Gunicorn as the WSGI HTTP server and Nginx as the reverse proxy.

In a SageMaker inference container, Nginx sits in front of Gunicorn as a reverse proxy: it buffers requests, shields the workers from slow clients, and routes SageMaker's health-check and inference paths. This combination improves the performance, scalability, and robustness of machine learning model serving in SageMaker.

Given FastAPI's growing popularity, developers may be reluctant to port their entire FastAPI inference code to Flask just to match those examples. This blog addresses that gap by demonstrating how to use SageMaker BYOC (Bring Your Own Container) to deploy FastAPI ML inference code directly.

For this purpose, we will refer to a GitHub repository that extends the Medium blog written by Ashmi Banerjee on deploying an ML model in production using FastAPI. I acknowledge and appreciate the author for her valuable insights and tutorial.

💥 Challenges with SageMaker BYOC and FastAPI

There are certain challenges when using SageMaker BYOC with FastAPI. SageMaker BYOC runs the code on Gunicorn, which is an application server that adheres to the WSGI standard. While Gunicorn can serve applications like Flask and Django, it is not directly compatible with FastAPI since FastAPI relies on the newer ASGI standard. However, Gunicorn can function as a process manager and allows users to specify the worker process class to use.

By combining Gunicorn and Uvicorn, which has a Gunicorn-compatible worker class, we can achieve compatibility. Gunicorn acts as a process manager, listening on the designated port and IP, and forwards the communication to the worker processes running the Uvicorn class. The Gunicorn-compatible Uvicorn worker class is responsible for converting the data transmitted by Gunicorn into the ASGI standard, which FastAPI can utilize.
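As a minimal sketch of this pairing (the file name and values here are assumptions, not taken from the repository), a Gunicorn config file selecting the Uvicorn worker class could look like:

```python
# gunicorn.conf.py (hypothetical config file)
# Gunicorn manages the processes; each worker runs Uvicorn's
# Gunicorn-compatible worker class, which speaks ASGI to FastAPI.
bind = "unix:/tmp/gunicorn.sock"                # socket Nginx proxies requests to
workers = 2                                     # number of worker processes
worker_class = "uvicorn.workers.UvicornWorker"  # ASGI worker for FastAPI
timeout = 3600                                  # match the Nginx proxy_read_timeout
```

Gunicorn would load it with `gunicorn -c gunicorn.conf.py app.main:app`, where `app.main:app` is the import path of the FastAPI instance.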

The Uvicorn documentation covers this Gunicorn deployment pattern in more detail.

By following this approach, we can effectively deploy FastAPI ML inference code using SageMaker BYOC in a production environment.

🚀 Start of the Porting

To run this in SageMaker, we need to port the inference code into a SageMaker-compatible layout using FastAPI and Nginx.

1) ➕ Add the below nginx.conf file in the src folder

worker_processes 1;
daemon off; # Prevent forking

pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
  # defaults
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 1024m; # Max Value

    keepalive_timeout 3600; # Max Value
    proxy_read_timeout 3600s; # Max Value

    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}
2) ➕ Add the below Nginx/Gunicorn entrypoint script in the src folder

#!/usr/bin/env python
"""Create the Nginx/Gunicorn entrypoint for SageMaker."""
# This file implements the scoring service shell. You don't necessarily need to modify it for
# various algorithms. It starts nginx and gunicorn with the correct configurations and then
# simply waits until gunicorn exits.
# The FastAPI application is specified as the app object passed to gunicorn below.
# We set the following parameters:
# Parameter            Environment Variable    Default Value
# ---------            --------------------    -------------
# number of workers    MODEL_SERVER_WORKERS    1
# timeout              MODEL_SERVER_TIMEOUT    3600 seconds

import os
import signal
import subprocess
import sys

cpu_count = 1

model_server_timeout = int(os.environ.get("MODEL_SERVER_TIMEOUT", 3600))
model_server_workers = int(os.environ.get("MODEL_SERVER_WORKERS", cpu_count))


def sigterm_handler(nginx_pid: int, gunicorn_pid: int) -> None:
    """Kill nginx and gunicorn processes on SIGTERM.

    nginx_pid : int
        process id of nginx
    gunicorn_pid : int
        process id of gunicorn
    """
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass
    sys.exit(0)


def start_server() -> None:
    """Start the Nginx/Gunicorn inference server."""
    print(f"Starting the inference server with {model_server_workers} workers.")

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(["ln", "-sf", "/dev/stdout", "/var/log/nginx/access.log"])
    subprocess.check_call(["ln", "-sf", "/dev/stderr", "/var/log/nginx/error.log"])

    nginx = subprocess.Popen(["nginx", "-c", "/opt/program/src/nginx.conf"])
    gunicorn = subprocess.Popen(
        [
            "gunicorn",
            "--timeout", str(model_server_timeout),
            "-k", "uvicorn.workers.UvicornWorker",  # ASGI worker class for FastAPI
            "-b", "unix:/tmp/gunicorn.sock",        # socket that nginx proxies to
            "-w", str(model_server_workers),
            "src.app.main:app",  # import path of the FastAPI app; adjust to your package
        ]
    )

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = {nginx.pid, gunicorn.pid}
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print("Inference server exiting")


# The main routine just invokes the start function.
if __name__ == "__main__":
    start_server()

Please note that I have converted the src and app directories into Python packages so that the FastAPI app can be imported as a module by the entrypoint.

3) 🔄 Change the Request Serving Stack
Amazon SageMaker uses two URLs in the container:

  • /ping receives GET requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
  • /invocations is the endpoint that receives client inference POST requests. The format of the request and the response is up to the algorithm. If the client supplied ContentType and Accept headers, these are passed in as well.

Therefore, change the route handlers accordingly inside the src directory:

@app.get("/ping", status_code=status.HTTP_200_OK)
def ping():
    return {"message": "ok"}


@app.post("/invocations", status_code=200)
async def predict_torch(request: Img):
    prediction = torch_run_classifier(request.img_url)
    if not prediction:
        # the exception is raised, not returned - you would get a validation
        # error otherwise.
        raise HTTPException(
            status_code=404, detail="Image could not be downloaded"
        )

    return {"status_code": 200,
            "predicted_label": prediction[0],
            "probability": prediction[1]}

4) 📦 Update the requirements.txt

# Add these alongside the existing dependencies
gunicorn
uvicorn

5) ⬆️ Update the Dockerfile

FROM ubuntu:20.04

RUN apt update -y && apt upgrade -y && \
    apt-get install -y python3-pip && \
    apt-get install -y gcc

# Needed to skip interactive pytz installation (timezone value assumed; adjust as needed)
ENV CONTAINER_TIMEZONE=Etc/UTC
RUN ln -snf /usr/share/zoneinfo/$CONTAINER_TIMEZONE /etc/localtime && echo $CONTAINER_TIMEZONE > /etc/timezone

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Symlink python3.8 to python
RUN ln -s /usr/bin/python3.8 /usr/bin/python

RUN pip install Pillow

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

WORKDIR /opt/program

COPY ./requirements.txt /opt/program/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /opt/program/requirements.txt
COPY ./src /opt/program/src/

# NOTE: the entrypoint script name (serve.py) is assumed; match your actual file name
RUN chmod +x /opt/program/src/serve.py

ENTRYPOINT ["python", "/opt/program/src/serve.py"]

Note how we use the /opt/program directory and add an ENTRYPOINT so that SageMaker can start the BYOC container.

You can find all of the above code in the accompanying repository.

📁 The final directory layout
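A rough sketch of the resulting layout (the entrypoint script name and app module layout are assumptions based on the steps above):

```
.
├── Dockerfile
├── requirements.txt
└── src/
    ├── __init__.py
    ├── nginx.conf
    ├── serve.py        # Nginx/Gunicorn entrypoint from step 2 (name assumed)
    └── app/
        ├── __init__.py
        └── main.py     # FastAPI app with /ping and /invocations (name assumed)
```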

Now you can follow the usual steps to create a SageMaker model, an endpoint configuration, and finally the endpoint itself to deploy the container.
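Once the endpoint is in service, it can be invoked through the SageMaker runtime client. A hedged sketch (the endpoint name and payload field are assumptions that mirror the Img model above):

```python
import json

# Hypothetical request payload matching the Img model's img_url field.
payload = json.dumps({"img_url": "https://example.com/sample.jpg"})

# Requires AWS credentials and a live endpoint; uncomment to run for real.
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="fastapi-byoc-endpoint",  # assumed endpoint name
#     ContentType="application/json",
#     Body=payload,
# )
# result = json.loads(response["Body"].read())
# print(result["predicted_label"], result["probability"])
```

SageMaker forwards this POST to the container's /invocations route, passing the ContentType header through to the app.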

To ensure the operational efficiency of your deployed model, it is essential to conduct thorough monitoring. AWS SageMaker offers integrated model monitoring tools within the AWS ecosystem, such as CloudWatch Logs and CloudWatch Metrics.
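As one hedged example (the endpoint and variant names are assumptions), the endpoint's Invocations metric can be pulled from CloudWatch like this:

```python
from datetime import datetime, timedelta, timezone

# Time window for the metric query: the last hour.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Requires AWS credentials; uncomment to query real metrics.
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(
#     Namespace="AWS/SageMaker",
#     MetricName="Invocations",
#     Dimensions=[
#         {"Name": "EndpointName", "Value": "fastapi-byoc-endpoint"},  # assumed name
#         {"Name": "VariantName", "Value": "AllTraffic"},
#     ],
#     StartTime=start, EndTime=end, Period=300, Statistics=["Sum"],
# )
```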

If you found the post helpful then please do Clap 👏

As an MLOps developer experienced in building ML serving tools architecture, deep learning model deployments, and CI/CD automation, my content will cover a wide range of topics related to machine learning operations. Expect discussions on best practices for deploying and scaling models, optimizing performance, ensuring reproducibility, and integrating ML into existing systems. I’ll also delve into CI/CD pipelines for efficient model updates. Follow me here for more insights into the evolving field of MLOps and stay updated with the latest advancements.

“Let me know in comments what topic you want me to cover in Sagemaker endpoints.”



Imran Nazir

I fuse my passion for cutting-edge technology and love for coding. As an avid MLOps enthusiast, I'm here to unravel the secrets of Deep Learning