The Beginner’s Guide: Deep Learning Model Deployment (TensorFlow Serving with Docker & YOLO)

DeepLCH
Oct 13, 2021


This tutorial was tested on Ubuntu 18.04. All commands in this tutorial are run in the terminal. A GPU is not required.

Purpose: deploy a real-time machine learning model efficiently with a production-ready setup. Examples are provided for both the inference server and the client side.

Required Modules:

  1. A trained model
  2. Production model conversion script
  3. TensorFlow Serving
  4. Docker
  5. Inference query script

Benefits (over running inference with your training code):

  1. Latency. TensorFlow Serving’s backend is written in highly optimized C++ (no C++ programming is needed on your part), so inference is typically much faster than serving the model from plain Python.
  2. Scalability. Instead of loading the model into memory for every Python process (e.g. when serving multiple clients), the model is loaded once by a centralized inference server.
  3. Efficiency. An extension of the first two points: with less computation time and fewer resources used, you are less likely to run into overheating issues.
  4. Portability. The well-known advantage of Docker: the “image” you create can be deployed on any other machine with a compatible hardware and software configuration.

Since YOLO, the object detection model, is one of the most popular deep learning models, this tutorial uses it as the deployment example (specifically YOLOv4).

1. Production Model Format

Production stacks often have specific requirements for directory structure and model format.

Workflow: Download Model -> Convert Model -> Restructure Directory

  1. Let’s download the pre-trained YOLOv4 weights (yolov4.weights) here. My downloads are saved in ~/Downloads/

2. Convert the model using the save_model.py script from this GitHub repo. First, clone it:

git clone https://github.com/hunglc007/tensorflow-yolov4-tflite.git

Then go into the repo folder you have just cloned:

cd tensorflow-yolov4-tflite

Now you can convert the downloaded model with this command:

python save_model.py --weights ~/Downloads/yolov4.weights --output ./checkpoints/yolov4-416 --input_size 416 --model yolov4

Your production-ready model is then saved in the repo’s “checkpoints” folder, i.e.

~/tensorflow-yolov4-tflite/checkpoints/yolov4-416
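If you want to double-check what the exported SavedModel expects as input and returns as output (handy later when we build the inference request), TensorFlow ships a saved_model_cli tool. A quick inspection, assuming TensorFlow is installed in your Python environment:

# Print the serving signature (input/output tensor names and shapes) of the exported model
saved_model_cli show --dir ~/tensorflow-yolov4-tflite/checkpoints/yolov4-416 --tag_set serve --signature_def serving_default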

3. Restructure the directory for production serving

In the “checkpoints” folder you should see the following structure:

├── yolov4-416
│   ├── assets
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index

You will need to create a new sub-directory and move everything from the second level into it. The correct structure should be:

├── yolov4-416
│   └── 00000001   <---------- this is the new sub-directory
│       ├── assets
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index

Note that I’ve created a “00000001” sub-folder inside the “yolov4-416” folder. TensorFlow Serving treats this sub-folder as the model version, so its name must be a number; you can rename it if you wish (e.g. to the current timestamp).
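A minimal way to do this restructuring from the terminal, assuming the repo was cloned into your home directory as above:

cd ~/tensorflow-yolov4-tflite/checkpoints/yolov4-416
mkdir 00000001
# Move the SavedModel contents into the new version sub-directory
mv assets saved_model.pb variables 00000001/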

2. Docker Installation

I’m not aiming for technical correctness here, but in my opinion Docker is like Parallels Desktop in that it gives you a separate (virtualized) OS instance. The main difference is that Docker packages everything into an “image” that is highly portable: once you have configured your Docker image properly, you can put it on a USB drive and run it on any compatible machine.

To know if you have Docker installed or not, type this in the terminal:

docker

If you see Docker’s usage and command list printed out, Docker is already installed and you can skip to the next section.

Otherwise, install with the following commands:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install -y docker-ce

To verify your installation is successful:

sudo docker run hello-world

A few handy Docker commands:

exit                           # exit the container shell when you are inside a container
sudo docker ps                 # list the currently running containers
sudo docker kill <containerID> # terminate a specific container

3. TensorFlow Serving Installation

TensorFlow Serving is the component that runs highly optimized inference in a production environment.

For CPU inference:

docker pull tensorflow/serving

For GPU inference:

# Installation cleanups
sudo docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 sudo docker ps -q -a -f volume={} | xargs -r sudo docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repo
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install Nvidia's Docker version
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Restart Docker
sudo systemctl restart docker

Test whether your NVIDIA Docker runtime can access your GPU. Note that I’m using CUDA 10.0; if you have a different version, modify the command below. When in doubt about CUDA/TF/GPU version compatibility, check out my other article.

# Test nvidia-smi with the latest official CUDA image
sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi

If successful, you should see the usual nvidia-smi table listing your GPU.

NOTE: nvidia-smi may report CUDA Version 10.1 even though 10.0 is installed; this is a known quirk of nvidia-smi.

Final step for GPU inference setup:

I suggest pulling the GPU serving image whose version matches the Python TensorFlow version you used for training (rather than the latest tag recommended on the official TensorFlow site).

sudo docker pull tensorflow/serving:1.14.0-gpu
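A quick way to confirm the pull worked is to list the serving images you now have locally:

# Lists local images for the tensorflow/serving repository along with their tags
sudo docker images tensorflow/serving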

4. Deploy Production Model

Time to deploy! For CPU inference, run the command shown right after this list.

For this example:

  1. set the container name to production_yolo
  2. our model location is ~/tensorflow-yolov4-tflite/checkpoints/yolov4-416
  3. “target” is the desired location for the model inside the container; we will use /models/yolo_detection
  4. the image for this container is tensorflow/serving

sudo docker run \
-p 8501:8501 \
--name production_yolo \
--mount type=bind,source=$HOME/tensorflow-yolov4-tflite/checkpoints/yolov4-416,target=/models/yolo_detection \
-e MODEL_NAME=yolo_detection \
-t tensorflow/serving

The trailing backslashes continue the command onto the next line, so you can copy it as-is. Note that the mount source must be an absolute path, which is why $HOME is used instead of ~.
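Once the container is up, a quick sanity check is TensorFlow Serving’s model status endpoint, which should report the version directory we created earlier as AVAILABLE:

# Query the model status endpoint of the running server
curl http://localhost:8501/v1/models/yolo_detection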

Deploying with GPU

sudo docker run \
--gpus all -p 8501:8501 \
--name production_yolo \
--mount type=bind,source=$HOME/tensorflow-yolov4-tflite/checkpoints/yolov4-416,target=/models/yolo_detection \
-e MODEL_NAME=yolo_detection \
-t tensorflow/serving:1.14.0-gpu
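Whether you deployed on CPU or GPU, you can also follow the container logs to confirm that the servable loaded without errors:

# Stream the serving container's logs (Ctrl+C to stop following)
sudo docker logs -f production_yolo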

5. Accessing Production Model

Now that the model has been deployed, we need to access it from a Python script. Instead of getting predictions with “predictions = model.predict(X)”, we query the server through its REST API.

For the REST API, we will use the POST method (via Python’s requests library):

requests.post(url, data)

The production server’s URL is:

# Generic REST format: http://localhost:8501/v1/models/<MODEL_NAME>:predict
url = 'http://localhost:8501/v1/models/yolo_detection:predict'

The production server will then return a JSON formatted string.

# Imports needed for the client script
import json

import cv2
import numpy as np
import requests
import tensorflow as tf

# `frame` is a BGR image (e.g. from cv2.imread or cv2.VideoCapture); for yolov4-416, input_width = input_height = 416
# Image pre-processing before the inference query
image_data = cv2.resize(frame, (input_width, input_height)) / 255.
image_data = image_data[np.newaxis, ...].astype(np.float32)

# Inference via the REST API
json_response = requests.post(
    url,
    data=json.dumps({
        "signature_name": "serving_default",
        "instances": image_data.tolist()
    }),
    headers={"content-type": "application/json"}
)
predictions = tf.constant(json.loads(json_response.text)['predictions'])
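The predictions tensor is still the raw network output; to get usable detections you need YOLO post-processing (non-max suppression). A minimal sketch, assuming the output layout produced by the hunglc007 repo’s save_model.py (box coordinates in the first 4 channels, per-class confidences after that), similar to that repo’s detect.py; the thresholds are illustrative:

# Split the raw output into boxes and class confidences
# (layout assumed: [batch, num_boxes, 4 + num_classes])
boxes = predictions[:, :, 0:4]
pred_conf = predictions[:, :, 4:]

# Standard TensorFlow combined NMS to obtain the final detections
boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
    boxes=tf.reshape(boxes, (tf.shape(boxes)[0], -1, 1, 4)),
    scores=tf.reshape(pred_conf, (tf.shape(pred_conf)[0], -1, tf.shape(pred_conf)[-1])),
    max_output_size_per_class=50,
    max_total_size=50,
    iou_threshold=0.45,
    score_threshold=0.25
)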

— END —

Please feel free to check out my other relevant articles:

The Simple Guide: Deep Learning with RTX 3090 (CUDA, cuDNN, Tensorflow, Keras, PyTorch)

The Ultimate Guide: Ubuntu 18.04 GPU Deep Learning Installation (CUDA, cuDNN, Tensorflow, Keras, Opencv, PyTorch)

The Beginner’s Guide: Camera Calibration (Un-distortion)
