Seamless Integration: Deploying FastAPI ML Inference Code with SageMaker BYOC + Nginx
--
As ML models have advanced and SageMaker Inference has matured, deploying ML inference code in our own containers on SageMaker has become a common requirement. However, existing examples and blogs typically use Flask with Gunicorn as the WSGI HTTP server and Nginx as the reverse proxy.
In a SageMaker inference environment, Nginx enhances Gunicorn by offering load balancing, caching, SSL termination, and request handling capabilities. This combination optimizes performance, scalability, and security, allowing efficient deployment and management of machine learning models for inference tasks in SageMaker.
With the growing popularity of FastAPI, many developers already have their inference code written in FastAPI and find it impractical to port it all to Flask just to match these examples. This blog addresses that gap by demonstrating how to use SageMaker BYOC (Bring Your Own Container) to deploy FastAPI ML inference code.
For this purpose, we will refer to the GitHub repository at https://github.com/ashmibanerjee/img-classifier-fastapi, which is an extension of the Medium blog written by Ashmi Banerjee. The original blog post can be found at https://medium.com/@ashmi_banerjee/4-step-tutorial-to-serve-an-ml-model-in-production-using-fastapi-ee62201b3db3. I acknowledge and appreciate the author, Ashmi Banerjee, for providing the valuable insights and tutorial on deploying an ML model in production using FastAPI.
💥 Challenges with SageMaker BYOC and FastAPI
There are certain challenges when using SageMaker BYOC with FastAPI. The typical SageMaker BYOC serving stack runs the code on Gunicorn, an application server that adheres to the WSGI standard. While Gunicorn can serve WSGI applications such as Flask and Django, it is not directly compatible with FastAPI, which relies on the newer ASGI standard. However, Gunicorn can act as a process manager and lets users specify which worker process class to use.
By combining Gunicorn and Uvicorn, which has a Gunicorn-compatible worker class, we can achieve compatibility. Gunicorn acts as a process manager, listening on the designated port and IP, and forwards the communication to the worker processes running the Uvicorn class. The Gunicorn-compatible Uvicorn worker class is responsible for converting the data transmitted by Gunicorn into the ASGI standard, which FastAPI can utilize.
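To make the mechanics concrete, here is a minimal sketch of a hypothetical gunicorn.conf.py (not part of the repository) that selects the Uvicorn worker class; the serve_app.py script shown later passes the same options on the command line instead.
# gunicorn.conf.py -- a hypothetical minimal config showing how Gunicorn is pointed
# at Uvicorn's Gunicorn-compatible worker so it can serve an ASGI app such as FastAPI.
bind = "0.0.0.0:8080"                           # address Gunicorn listens on
workers = 2                                     # number of worker processes
worker_class = "uvicorn.workers.UvicornWorker"  # translates between Gunicorn and ASGI
timeout = 60                                    # seconds before an unresponsive worker is killed
With this file in place, running "gunicorn -c gunicorn.conf.py src.app.app:app" would start the FastAPI application behind Gunicorn.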
For more background, see the FastAPI and Uvicorn documentation on running ASGI apps behind Gunicorn with Uvicorn worker processes.
By following this approach, we can effectively deploy FastAPI ML inference code using SageMaker BYOC in a production environment.
🚀 Starting the Port
To run the code above on SageMaker, we need to port the inference code into a SageMaker-compatible serving stack built from FastAPI, Gunicorn, and Nginx.
1) ➕ Add the following nginx.conf and serve_app.py files to the src folder, starting with nginx.conf:
worker_processes 1;
daemon off; # Prevent forking
pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
  # defaults
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 1024m; # Max Value
    keepalive_timeout 3600;     # Max Value
    proxy_read_timeout 3600s;   # Max Value

    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}
2) ➕ Add serve_app.py
#!/usr/bin/env python
"""Create the Nginx/Gunicorn entrypoint for SageMaker."""
# This file implements the scoring service shell. You don't necessarily need to modify it for
# different algorithms. It starts nginx and gunicorn with the correct configurations and then
# simply waits until gunicorn exits.
#
# The FastAPI application is the `app` object in src/app/app.py (module path src.app.app:app).
#
# We set the following parameters:
#
# Parameter            Environment Variable     Default Value
# ---------            --------------------     -------------
# number of workers    MODEL_SERVER_WORKERS     1
# timeout              MODEL_SERVER_TIMEOUT     3600 seconds

import os
import signal
import subprocess
import sys

from uvicorn.workers import UvicornWorker  # noqa: F401  (fail fast if the ASGI worker is missing)

model_server_timeout = int(os.environ.get("MODEL_SERVER_TIMEOUT", 3600))
model_server_workers = int(os.environ.get("MODEL_SERVER_WORKERS", 1))


def sigterm_handler(nginx_pid: int, gunicorn_pid: int) -> None:
    """Kill the nginx and gunicorn processes.

    Parameters
    ----------
    nginx_pid : int
        process id of nginx
    gunicorn_pid : int
        process id of gunicorn
    """
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass
    sys.exit(0)


def start_server() -> None:
    """Start the Nginx/Gunicorn server for the FastAPI inference app."""
    print(f"Starting the inference server with {model_server_workers} workers.")

    # Link the log streams to stdout/stderr so they are captured in the container logs.
    subprocess.check_call(["ln", "-sf", "/dev/stdout", "/var/log/nginx/access.log"])
    subprocess.check_call(["ln", "-sf", "/dev/stderr", "/var/log/nginx/error.log"])

    nginx = subprocess.Popen(["nginx", "-c", "/opt/program/src/nginx.conf"])
    gunicorn = subprocess.Popen(
        [
            "gunicorn",
            "--timeout",
            str(model_server_timeout),
            "-k",
            "uvicorn.workers.UvicornWorker",
            "-w",
            str(model_server_workers),
            "--bind",
            "unix:/tmp/gunicorn.sock",  # must match the upstream socket in nginx.conf
            "src.app.app:app",
        ]
    )

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = {nginx.pid, gunicorn.pid}
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print("Inference server exiting")


# The main routine just invokes the start function.
if __name__ == "__main__":
    start_server()
Please note that I have converted the src and app directories into Python packages (by adding __init__.py files to each), so that the module path src.app.app:app used above resolves correctly.
3) 🔄 Change the Request Serving Stack
Amazon SageMaker uses two URLs in the container:
- /ping receives GET requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
- /invocations is the endpoint that receives client inference POST requests. The format of the request and the response is up to the algorithm. If the client supplied ContentType and Accept headers, these are passed in as well.
Therefore, update app.py inside the src directory accordingly:
@app.get("/ping", status_code=status.HTTP_200_OK)
def ping():
    return {"message": "ok"}


@app.post("/invocations", status_code=200)
async def predict_torch(request: Img):
    prediction = torch_run_classifier(request.img_url)
    if not prediction:
        # The exception is raised, not returned - you will get a validation
        # error otherwise.
        raise HTTPException(
            status_code=404, detail="Image could not be downloaded"
        )
    return {
        "status_code": 200,
        "predicted_label": prediction[0],
        "probability": prediction[1],
    }
4) 📦 Update the requirements.txt
......
# Add this with other dependencies
gunicorn
uvicorn[standard]
5) ⬆️ Update the Dockerfile
FROM ubuntu:20.04

RUN apt update -y && apt upgrade -y && \
    apt-get install -y python3-pip && \
    apt-get install -y gcc

# Needed to skip the interactive tzdata configuration during installation
ENV CONTAINER_TIMEZONE=Etc/UTC
RUN ln -snf /usr/share/zoneinfo/$CONTAINER_TIMEZONE /etc/localtime && echo $CONTAINER_TIMEZONE > /etc/timezone

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Symlink python3.8 to python
RUN ln -s /usr/bin/python3.8 /usr/bin/python

RUN pip install Pillow

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

WORKDIR /opt/program

COPY ./requirements.txt /opt/program/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /opt/program/requirements.txt

COPY /src /opt/program/src/
RUN chmod +x /opt/program/src/serve_app.py

ENTRYPOINT ["python", "/opt/program/src/serve_app.py"]
Note how we use the /opt/program directory and set the ENTRYPOINT so that SageMaker knows how to start the BYOC serving process.
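Before SageMaker can use this container, the image has to be available in Amazon ECR. The snippet below is a minimal sketch of that build-and-push step, assuming Docker and AWS credentials are configured locally and it is run from the project root; the region and repository name are placeholders.
# A minimal sketch: build the Docker image and push it to Amazon ECR so SageMaker can pull it.
# Region and repository name are placeholders, not values from the blog's repository.
import base64
import subprocess

import boto3

region = "us-east-1"                  # placeholder region
repo_name = "fastapi-sagemaker-byoc"  # hypothetical ECR repository name

ecr = boto3.client("ecr", region_name=region)
account_id = boto3.client("sts").get_caller_identity()["Account"]
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"

# Create the repository if it does not exist yet.
try:
    ecr.create_repository(repositoryName=repo_name)
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass

# Authenticate the local Docker client against ECR.
token = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(token["authorizationToken"]).decode().split(":", 1)
subprocess.run(["docker", "login", "-u", user, "-p", password, token["proxyEndpoint"]], check=True)

# Build, tag and push the image defined by the Dockerfile above.
subprocess.run(["docker", "build", "-t", repo_name, "."], check=True)
subprocess.run(["docker", "tag", repo_name, image_uri], check=True)
subprocess.run(["docker", "push", image_uri], check=True)
print("Pushed:", image_uri)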
You can find the complete code in the accompanying repository.
📁 The final directory structure would look like this:
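(A sketch based on the files added above; the model/classifier modules from the original repository also live under src but are omitted here.)
.
├── Dockerfile
├── requirements.txt
└── src
    ├── __init__.py
    ├── nginx.conf
    ├── serve_app.py
    └── app
        ├── __init__.py
        └── app.py   # FastAPI app exposing /ping and /invocations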
Now you can follow the usual SageMaker steps to deploy the container: create a SageMaker model pointing at the ECR image, create an endpoint configuration, and then create the endpoint.
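As a rough guide, the flow with boto3 looks like the sketch below. The image URI, execution role ARN, instance type, and resource names are placeholders; adjust them for your account.
# A minimal sketch of creating a SageMaker model, endpoint config and endpoint with boto3,
# then invoking it. Image URI, role ARN and names are placeholders.
import json

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

image_uri = "<account-id>.dkr.ecr.us-east-1.amazonaws.com/fastapi-sagemaker-byoc:latest"
role_arn = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"

sm.create_model(
    ModelName="fastapi-byoc-model",
    PrimaryContainer={"Image": image_uri},
    ExecutionRoleArn=role_arn,
)

sm.create_endpoint_config(
    EndpointConfigName="fastapi-byoc-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "fastapi-byoc-model",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)

sm.create_endpoint(
    EndpointName="fastapi-byoc-endpoint",
    EndpointConfigName="fastapi-byoc-config",
)

# Wait until the endpoint is in service before invoking it.
sm.get_waiter("endpoint_in_service").wait(EndpointName="fastapi-byoc-endpoint")

# Invoke the deployed endpoint (the payload mirrors the Img model shown earlier).
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = runtime.invoke_endpoint(
    EndpointName="fastapi-byoc-endpoint",
    ContentType="application/json",
    Body=json.dumps({"img_url": "https://example.com/some-image.jpg"}),  # placeholder URL
)
print(response["Body"].read().decode())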
To ensure the operational efficiency of your deployed model, it is essential to conduct thorough monitoring. AWS SageMaker offers integrated model monitoring tools within the AWS ecosystem, such as CloudWatch Logs and CloudWatch Metrics.
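For example, the endpoint's latency and invocation metrics can be pulled from CloudWatch with boto3. This is a minimal sketch; the endpoint and variant names are the placeholders used in the deployment sketch above.
# A minimal sketch for reading endpoint metrics from the AWS/SageMaker CloudWatch namespace.
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

stats = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",  # also useful: "Invocations", "Invocation4XXErrors"
    Dimensions=[
        {"Name": "EndpointName", "Value": "fastapi-byoc-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])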
If you found the post helpful then please do Clap 👏
As an MLOps developer experienced in building ML serving tools architecture, deep learning model deployments, and CI/CD automation, my content will cover a wide range of topics related to machine learning operations. Expect discussions on best practices for deploying and scaling models, optimizing performance, ensuring reproducibility, and integrating ML into existing systems. I’ll also delve into CI/CD pipelines for efficient model updates. Follow me here for more insights into the evolving field of MLOps and stay updated with the latest advancements.
Let me know in the comments which SageMaker endpoint topics you would like me to cover next.