Thanks to faster compute, better storage and easy-to-use software, deep learning solutions are moving out of the proof-of-concept tunnel and into the real world. We are seeing widespread adoption of deep learning models across diverse industries, including healthcare, finance, retail, tech, logistics, food-tech and agriculture, among many others. Because deep learning models are resource hungry and often compute-heavy, we need to pause for a moment and think about model inference and serving times when these models are consumed by end-users.
Training models and running inference on static batches of data is necessary while prototyping. However, this methodology and its code artifacts don’t make the cut when we want our model to be consumed as a web service or API. We usually need a robust, low-latency model serving layer that can serve inference requests quickly and with ease. This article gives a simple yet comprehensive hands-on overview of how to leverage TensorFlow Serving to serve deep learning models for computer vision. It is fairly extensive and contains a lot of hands-on code which you can adapt for your own use. We will be covering the following topics in this article:
- What is Serving?
- TensorFlow Serving Overview
- TensorFlow Serving Architecture
- Model Serving Methodology
- Main Objective — Building an Apparel Classifier
- Training a simple CNN Model
- Fine-tuning a pre-trained ResNet-50 CNN Model
- Saving models for TensorFlow Serving
- Serving models with CPU Inference
- Serving models with Docker for GPU Inference
- Bonus: Building an Apparel Classifier API with Flask & TensorFlow Serving
Feel free to reach out to me if you have better ways of making deployments and serving simpler and easier to understand. I’m by no means an expert but I sincerely hope you find this useful. Let’s get started!
What is Serving?
Serving, or to be more specific, model serving, is a technique to use or apply a model for inference after it has been trained. Typically this involves having a server-client architecture and serving or exposing our trained models for inference.
Considering the problem we focus on in this article, image classification, we have some specific access patterns for our models. On the end-user or client side, we have an input image which needs to be tagged or classified. We need to convert this image to a specific encoded format, wrap it in a specific JSON payload with headers, and send it to a web service / API which is typically hosted on a server. The API call invokes our pre-trained model to make the prediction and returns the inference result as a JSON response from the server to the client.
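As a rough illustration of the client side of this access pattern, the following sketch Base64-encodes raw image bytes and wraps them in a JSON payload, with a matching helper for the server side. The `b64_image` field name and both helper names are assumptions made for this example, not a fixed convention.

```python
import base64
import json


def build_image_payload(image_bytes: bytes) -> str:
    """Encode raw image bytes as Base64 and wrap them in a JSON string."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    # "b64_image" is a hypothetical field name chosen for this sketch.
    return json.dumps({"b64_image": encoded})


def read_image_payload(payload: str) -> bytes:
    """Server-side counterpart: recover the original image bytes."""
    return base64.b64decode(json.loads(payload)["b64_image"])
```

Base64 is used because JSON cannot carry raw binary data directly; the server decodes the string back into bytes before any image pre-processing.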
TensorFlow Serving Overview
There are many excellent articles on TensorFlow Serving, including the official documentation, which you should definitely check out. In this section I will give a brief overview of the essentials of TensorFlow Serving and why we need it. To productionize deep learning or machine learning models, we need a robust system which helps our models serve requests with speed and consistency. TensorFlow Serving is such a framework: a flexible, high-performance serving system for machine learning models, designed specifically for production environments.
While you can always build your own serving pipeline and system, there are several benefits to using TensorFlow Serving.
- It is very easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs.
- It is not limited to TensorFlow models and can be easily extended to serve other types of models and data.
- Can be used to serve multiple models and model versions simultaneously.
- Efficient model lifecycle management.
- Can be integrated with tools like Docker and Kubernetes for more scalability.
TensorFlow Serving Architecture
In this section we will take a brief look into the essentials of the architecture behind TensorFlow Serving. For a detailed deep dive into the architecture, I recommend checking out the official documentation once again.
Servables are the central abstraction in the TensorFlow Serving architecture. Typically servables are the underlying objects that clients use to perform computations like model inference. A single Servable might include one model or even multiple models. The following figure showcases the typical life of a Servable.
Typically there are several components in this architecture. A servable stream is the sequence of versions of a servable. Loaders manage a servable’s life cycle and have APIs for loading and unloading a servable. Sources are plugin modules that find and provide servables and can maintain state that is shared across multiple servables. Managers handle the full lifecycle of servables, including loading, unloading and serving servables. Using the standard TensorFlow Serving APIs, TensorFlow Serving Core manages the lifecycle and metrics of servables.
Model Serving Methodology
Before we talk about our main objective and train our models, let’s discuss briefly the model serving methodology we would follow, assuming we have trained a few vision-based deep learning models. The key steps to be followed are displayed in the following figure.
Thus the key steps we would be focusing on for serving models include the following:
- Model Training: To serve any models, we would need to train the models first! In this article we will leverage the tf.keras API in TensorFlow which helps us train deep learning models easily.
- Exporting the models: Here we would need to export our trained models into a specific format which can be used by TF Serving. TensorFlow provides the SavedModel format as a universal format for exporting models. This creates a protobuf file in a well-defined directory hierarchy, and will also include a version number as depicted below.
TensorFlow Serving allows us to select which version of a model, or “servable” we want to use when we make inference requests. Each version will be exported to a different sub-directory under the given path as depicted in the figure above. TensorFlow provides a convenience function tf.saved_model.simple_save() which helps us in saving these models easily.
- Hosting TensorFlow Serving Model Server: Here we would be using the TensorFlow Serving framework to host our saved models. We will focus on locally based TF Serving installation for CPU inference and also show how we can use a Docker container-based TF Serving instance for GPU inference. We also leverage the Flask framework on top of TF Serving to build our own custom serving API at the end of this article.
- Making Server Requests: Once the server is up and running, we can make requests to it either via gRPC or HTTP. For both methods, we typically create a payload message with the necessary content and headers, and send it to the server. The server in turn should return a message that contains the prediction. We will use the requests module for HTTP requests.
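To make the versioned export layout and the HTTP request target more concrete, here is a minimal sketch. It assumes each model lives under a base path with one numeric sub-directory per version, and that the server exposes TF Serving's REST predict endpoint. The helper names, the model name and port 8501 are assumptions for this example.

```python
import os
from typing import Optional


def latest_version(model_base_path: str) -> int:
    """Return the highest numeric version sub-directory under a model path."""
    versions = [
        int(name)
        for name in os.listdir(model_base_path)
        if name.isdigit() and os.path.isdir(os.path.join(model_base_path, name))
    ]
    if not versions:
        raise ValueError(f"no numeric version directories in {model_base_path}")
    return max(versions)


def predict_url(host: str, port: int, model_name: str,
                version: Optional[int] = None) -> str:
    """Build a REST predict URL, optionally pinned to one model version."""
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        url += f"/versions/{version}"
    return url + ":predict"
```

Leaving out the version lets the server choose which loaded version answers the request, while pinning a version is useful when comparing two exports of the same model.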
Main Objective — Building an Apparel Classifier
We will keep things simple here with regard to the key objective. We will build a simple apparel classifier by training models on the very famous Fashion MNIST dataset, based on Zalando’s article images, which consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. The idea is to classify these images into one of the 10 apparel categories on which we will train our models.
We will build the following two deep learning CNN (Convolutional Neural Network) classifiers in this article before focusing on model serving.
- A simple CNN trained from scratch
- Fine-tuning a pre-trained ResNet-50 CNN
The intent of this article is to focus more on the deployment and serving aspects, so we won’t spend a whole lot of time talking about model architectures or training and fine-tuning. Feel free to check out a brief on CNNs if necessary towards the very end of this article. My testbed for this article was a Google Cloud Platform Deep Learning VM with an NVIDIA Tesla T4, which makes it really easy to carry out experiments on the cloud!
Loading Dependencies and Data
Before we can train our deep learning models, let’s load the necessary dependencies and our dataset.
We use TensorFlow GPU version 1.14 here, but this can also be easily extended to TensorFlow 2.0 considering we focus mostly on the tf.keras API. We can now leverage TensorFlow itself to load the Fashion-MNIST dataset.
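A minimal sketch of this loading step is given below, assuming the dataset loader bundled with `tf.keras`. The function and variable names are illustrative and not necessarily those of the original notebook.

```python
# Class names in label order (0-9) for Fashion-MNIST.
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def load_fashion_mnist():
    """Download (on first use) and return the Fashion-MNIST train/test splits."""
    # Imported lazily so the helpers in this sketch work without TensorFlow.
    import tensorflow as tf

    (train_images, train_labels), (test_images, test_labels) = \
        tf.keras.datasets.fashion_mnist.load_data()
    return (train_images, train_labels), (test_images, test_labels)


def describe(name, images):
    """Format a one-line shape and dtype summary for an image array."""
    return f"{name}.shape: {images.shape}, of {images.dtype}"
```

Calling `describe("Train_images", train_images)` and `describe("Test_images", test_images)` on the loaded arrays gives a one-line shape and dtype summary for each split.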
Train_images.shape: (60000, 28, 28), of uint8
Test_images.shape: (10000, 28, 28), of uint8
Based on what we mentioned earlier, we have 60000 training and 10000 test images of size 28x28. We will now start training our deep learning models.
Training a simple CNN Model
In this section, we will train a basic 2-layer CNN model from scratch. We do need to reshape our data before we train our model and the following code takes care of the same.
Train_images.shape: (60000, 28, 28, 1), of uint8
Test_images.shape: (10000, 28, 28, 1), of uint8
We can also take a look at what some of the images look like, as depicted in the following snapshot.
We will now build our basic 2-layer CNN model architecture.
Let’s train our model for 10 epochs and look at the performance.
Do note that we train on 90% of the training data and validate on 10% of the training data. Performance is pretty decent on the validation set. Let’s save our model and then check the performance on the test dataset.
Overall model performance on the test dataset gives us an f1-score of 91% which is pretty neat!
Fine-tuning a pre-trained ResNet-50 CNN Model
Transfer learning is seeing unprecedented success in computer vision and natural language processing, with pre-trained models often outperforming models trained from scratch. Here, we will take a ResNet-50 model which was pre-trained on the ImageNet dataset and fine-tune it on the Fashion-MNIST dataset. The ResNet-50 model is a 50-convolutional block (several layers in each block) deep learning network built on the ImageNet database. This model has over 175 layers in total and is a very deep network. ResNet stands for Residual Networks.
The ResNet-50 model which we will be using consists of 5 stages, each with a convolution block and an identity block. Each convolution block has 3 convolution layers and each identity block also has 3 convolution layers. You can find some more information about this model in one of my articles. The focus in this section is to take the pre-trained ResNet-50 model and then perform complete fine-tuning of all the layers in the network. We will add the regular dense and output layers as usual.
This model is huge and you can see the evidence based on the number of trainable parameters! Before we can train the model, we need to convert our grayscale images into images with three channels since the ResNet model was trained on color images. Besides that, the minimum input dimensions accepted by the ResNet model are 32x32, so we need to resize our images.
Train_images.shape: (60000, 32, 32, 3), of float32
Test_images.shape: (10000, 32, 32, 3), of float32
Let’s train our model for 10 epochs now similar to our previous model.
Do note that as in the previous model, we train on 90% of the training data and validate on 10% of the training data. Performance looks to be much better on the validation set. Let’s save our model and then check the performance on the test dataset.
Overall we get an f1-score of 92% which is better than our first model! Most tutorials would end here, but you could say that ours begins here, since the steps needed to enable model serving start now!
Saving models for TensorFlow Serving
We briefly discussed this in the model serving methodology. To serve models using TensorFlow Serving, we need to save them in the SavedModel format. Thanks to the very nifty tf.saved_model.simple_save(…) function, we can do this within a few lines.
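The export step could look roughly like the sketch below, assuming TensorFlow 1.x and a trained `tf.keras` model. The directory layout, the `input_image` key and the helper names are assumptions for this example rather than the article's exact code.

```python
import os


def export_path(base_dir: str, model_name: str, version: int) -> str:
    """Versioned export directory: <base_dir>/<model_name>/<version>."""
    return os.path.join(base_dir, model_name, str(version))


def export_for_serving(model, base_dir: str, model_name: str, version: int = 1) -> str:
    """Save a trained tf.keras model (TF 1.x) in the SavedModel format."""
    # Imported lazily so export_path() can be used without TensorFlow installed.
    import tensorflow as tf

    path = export_path(base_dir, model_name, version)
    tf.saved_model.simple_save(
        tf.keras.backend.get_session(),        # session holding the trained weights
        path,                                  # target directory, must not exist yet
        inputs={"input_image": model.input},   # "input_image" is a key chosen here
        outputs={t.name: t for t in model.outputs},
    )
    return path
```

Exporting a retrained model with a higher `version` number places it in a new sub-directory next to the earlier export, which is the layout TF Serving expects for multiple versions.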
Before we move on to setting up TensorFlow Serving, we can leverage TensorFlow’s SavedModel command line interface (CLI) tool, saved_model_cli, which is useful to quickly inspect the input and output specifications of our model.
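One way to run this inspection from Python is sketched below. It assumes `saved_model_cli` is on the PATH (it is installed together with TensorFlow); the helper names and the export directory passed in are hypothetical.

```python
import shutil
import subprocess


def saved_model_cli_show(export_dir: str) -> list:
    """Command line that prints every signature in a SavedModel directory."""
    return ["saved_model_cli", "show", "--dir", export_dir, "--all"]


def inspect_saved_model(export_dir: str) -> str:
    """Run the CLI if it is available and return its text output."""
    if shutil.which("saved_model_cli") is None:
        raise RuntimeError("saved_model_cli not found; it is installed with TensorFlow")
    result = subprocess.run(
        saved_model_cli_show(export_dir),
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True, check=True,
    )
    return result.stdout
```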
The above output shows details pertaining to our second model including input and output specifications.
Serving models with CPU Inference
In this section, we will show how to serve our saved models using TensorFlow Serving on our CPUs. We will do a local installation on our system; however, I recommend using the Docker-based setup for TF Serving, which is easier to use and maintain since you only have to pull the container image using the following command and don’t need to set up any configurations or dependencies.
docker pull tensorflow/serving
However, to show the different options available, we will also show how you can set up TF Serving locally.
Install TensorFlow Serving
Here we showcase the necessary steps to install TF Serving locally. Do note this only consists of the CPU version. To start with we add the source to obtain TF Serving.
!echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" \
| sudo tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg \
| sudo apt-key add -
Next, you can remove the existing version of TF Serving if present using the following command.
!sudo apt-get -y remove tensorflow-model-server
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
0 upgraded, 0 newly installed, 1 to remove and 39 not upgraded.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 124893 files and directories currently installed.)
Removing tensorflow-model-server (1.13.0) ...
Next, we install TF Serving using the following command.
!sudo apt-get update && sudo apt-get install tensorflow-model-server
Hit:1 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease
Preparing to unpack .../tensorflow-model-server_1.14.0_all.deb
Unpacking tensorflow-model-server (1.14.0) ...
Setting up tensorflow-model-server (1.14.0) ...
Start Serving Models with TensorFlow Serving
We are now ready to start serving models with TF Serving. We need to set up a model serving config file if we want to customize the number of models we serve, along with other advanced configurations which you can read about in the TensorFlow Serving documentation. Our configuration file is simple and depicted in the following snippet, which is stored in models.conf. Do note that we want to serve both our models, hence the config file makes things easier.
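As an illustration of what such a file contains, the sketch below renders a two-model configuration in the protobuf text format that TF Serving reads. The model names and base paths in the usage note are placeholders, not the ones used in this article.

```python
def render_model_config(models: dict) -> str:
    """Render a TF Serving model config in protobuf text format.

    `models` maps a model name to the base path holding its version folders.
    """
    entries = []
    for name, base_path in models.items():
        entries.append(
            "  config {\n"
            f"    name: '{name}'\n"
            f"    base_path: '{base_path}'\n"
            "    model_platform: 'tensorflow'\n"
            "  }"
        )
    return "model_config_list {\n" + "\n".join(entries) + "\n}\n"
```

For example, `render_model_config({"simple_cnn": "/models/simple_cnn", "resnet50": "/models/resnet50"})` returns text that can be written to `models.conf`, with one `config` block per model.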
You can run the following code directly in Jupyter Notebooks to start the server as a background process or you can run it from the terminal also.
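A hedged sketch of launching the server as a background process from Python follows. It assumes the `tensorflow_model_server` binary installed above is on the PATH; the REST port 8501, the log file name and the helper names are choices made for this example.

```python
import shutil
import subprocess


def model_server_command(config_file: str, rest_api_port: int = 8501) -> list:
    """Command line for serving every model listed in a config file over REST."""
    return [
        "tensorflow_model_server",
        f"--rest_api_port={rest_api_port}",
        f"--model_config_file={config_file}",
    ]


def start_model_server(config_file: str, log_file: str = "server.log"):
    """Start the model server in the background, logging to a file."""
    if shutil.which("tensorflow_model_server") is None:
        raise RuntimeError("tensorflow_model_server is not installed or not on PATH")
    log = open(log_file, "w")
    # Popen returns immediately; the caller can later stop the server
    # with terminate() on the returned process handle.
    return subprocess.Popen(
        model_server_command(config_file), stdout=log, stderr=subprocess.STDOUT
    )
```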
You can see based on the server log messages that our server has successfully loaded both the models and they are ready for serving.
Serving Model Inference Requests
Now that our models are ready for serving, let’s first start with serving model inference requests using our first model, the basic CNN.
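A request of this kind can be sketched as follows. To stay dependency-free, this example uses the standard library's `urllib.request` in place of the `requests` module mentioned earlier, and it assumes the model returns one row of class probabilities per image. The function names and the URL passed to `predict` are hypothetical.

```python
import json
import urllib.request

# Class names in label order (0-9) for Fashion-MNIST.
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def build_predict_body(images: list) -> bytes:
    """Wrap pre-processed images (nested lists: batch, H, W, C) in a predict payload."""
    payload = {"signature_name": "serving_default", "instances": images}
    return json.dumps(payload).encode("utf-8")


def decode_predictions(response_body: bytes) -> list:
    """Map each row of class probabilities to its most likely label."""
    predictions = json.loads(response_body)["predictions"]
    return [CLASS_NAMES[max(range(len(row)), key=row.__getitem__)]
            for row in predictions]


def predict(url: str, images: list) -> list:
    """POST the images to a TF Serving REST endpoint and return class labels."""
    request = urllib.request.Request(
        url, data=build_predict_body(images),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return decode_predictions(response.read())
```

NumPy arrays need to be converted with `.tolist()` before being passed in, since JSON serialization does not handle arrays directly.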
Looks like TF Serving is able to serve model inference requests properly! Do note that we have to pre-process our image and create the proper payload before sending the request to the server.
Let’s take a look at serving model inference requests for the same sample of images using our second model, the ResNet-50 CNN.
Looks like our model is serving requests and the predictions are better than our previous model!
While productionizing and serving models, an important thing to remember is that the TensorFlow runtime has components that are lazily initialized, which can cause high latency for the first request(s) sent to a model after it is loaded. This latency can be several orders of magnitude higher than that of a single inference request. Hence it is good to warm up the models by sending a few sample records as an initial request after loading the model. You can also do this at model load time, as described in the TensorFlow Serving documentation.
Here, we will follow a simple approach of sending a sample request to each of our models to warm it up after it is loaded. For this, we save some sample data in a file which we can load and use later for warming up our models as showcased in the following code.
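The warmup idea can be sketched as below: one sample request is sent to each model's predict endpoint right after loading. The sender is injectable so the logic can be exercised without a running server; the JSON sample file format, the URLs and all helper names are assumptions for this example.

```python
import json
import urllib.request


def load_samples(path: str) -> list:
    """Load previously saved sample instances from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def warmup_models(model_urls: dict, sample_instances: list, send=post_json) -> list:
    """Send one sample request per model so lazy initialization happens up front."""
    messages = []
    for label, url in model_urls.items():
        send(url, {"instances": sample_instances})
        messages.append(f"{label} warmup complete")
    return messages
```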
Model 1 warmup complete
Model 2 warmup complete
Benchmarking Model Serving Requests
Let’s take a look at how long it takes to serve a model inference request for a batch of images. For this, we will consider 10000 images from our test dataset. Do note we are sending only a single request and measuring the inference time of the whole batch. We will look at multiple requests in the next section. Let’s take a look at how our first model performs.
CPU times: user 7.64 s, sys: 612 ms, total: 8.26 s
Wall time: 9.76 s
Out: 10000
We were able to serve the request of inference for 10000 images in around 9.8 seconds which is pretty good considering the model performed the inference using a CPU. Let’s check out the performance of our second model now.
CPU times: user 21.6 s, sys: 1.38 s, total: 23 s
Wall time: 36.4 s
Out: 10000
Time to serve the request considering CPU inference was 36 seconds which is not too bad!
Serving models with GPU Inference
In this section, we will show how to serve our saved models using TensorFlow Serving on our GPU. The idea is, if you have a GPU, use it! We will leverage Docker to set up our TensorFlow Serving system.
Pulling TF Serving GPU image
Assuming you have Docker installed on your system or in the cloud, you can use the following code to pull the latest version of TF Serving for GPUs.
!docker pull tensorflow/serving:latest-gpu
latest-gpu: Pulling from tensorflow/serving
Status: Image is up to date for tensorflow/serving:latest-gpu
You can check if the image is present in your system using the following command
Start Serving Models with Docker TensorFlow Serving GPU
We are now ready to start serving models with TF Serving. We will do that by running the docker image we just downloaded. You can do it directly from Jupyter Notebooks also using the following code. In practice it is better to run it from the terminal.
You can then use the following command in Docker to check if the container is up and running.
!docker ps -all
Finally, you can check the logs in Docker to verify everything is working perfectly.
!docker logs 7d4b091ccefa | tail -n 15
This verifies the fact that TF Serving will be using the GPU on our system for inference!
We can leverage our previously implemented code to warm up our models. Here we will focus more on our complex second CNN model, so we warm up Model 2.
Model 2 warmup complete
Benchmarking Model Serving Requests
Let’s take all our 10000 test images and send a single request to check model serving time for inference using our GPU. Do note that we only focus on our 2nd model here.
CPU times: user 23.5 s, sys: 1.87 s, total: 25.3 s
Wall time: 31.3 s
Out: 10000
Total inference time is a bit over 30 seconds for 10000 images, pretty decent! Let’s now perform a real benchmarking test. Consider 10000 separate requests for inference of a single apparel image to be classified each time. How much time would it take for TF Serving to serve these 10000 requests?
100%|██████████| 10000/10000 [01:55<00:00, 86.56it/s]
CPU times: user 39.1 s, sys: 2.2 s, total: 41.3 s
Wall time: 1min 55s
Out: 10000
It takes a total of 115 seconds to serve 10000 requests, which means TF Serving serves each request in roughly 11.5 milliseconds. Pretty good!
Let’s try an interesting comparison now. We will use the regular model.predict(…) API call from tf.keras to see how long it takes to serve 10000 requests.
100%|██████████| 10000/10000 [03:04<00:00, 54.12it/s]
CPU times: user 3min 8s, sys: 17.2 s, total: 3min 25s
Wall time: 3min 4s
Out: 10000
It takes a total of 184 seconds to serve 10000 requests, which means that using the native model prediction API, we are able to serve each request in roughly 18.4 milliseconds.
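The per-request figures follow directly from the wall times reported above, as this small calculation shows. Only the function and variable names are introduced here.

```python
def per_request_ms(total_seconds: float, num_requests: int) -> float:
    """Average latency per request in milliseconds."""
    return total_seconds / num_requests * 1000.0


tf_serving_ms = per_request_ms(115, 10000)  # 1 min 55 s wall time with TF Serving
keras_ms = per_request_ms(184, 10000)       # 3 min 4 s wall time with model.predict
speedup = keras_ms / tf_serving_ms          # how many times faster TF Serving is

print(f"TF Serving: {tf_serving_ms:.1f} ms per request")
print(f"model.predict: {keras_ms:.1f} ms per request")
print(f"Speedup: {speedup:.1f}x")
```

In this benchmark, TF Serving handles single-image requests about 1.6 times faster than calling the Keras prediction API in a loop.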
This showcases the need and importance of leveraging TF Serving especially when productionizing models!
Bonus: Building an Apparel Classifier API with Flask & TensorFlow Serving
TF Serving is extremely useful and provides us with a high-performance system to serve inference requests. However, from an end-to-end perspective, you might have noticed that model serving is not just dumping some data as requests to the server. We need to access the image data, pre-process it and then send it in an appropriate format to TF Serving. Also, once we get back the response, we need to access the class probabilities, get the class with the maximum probability and then get the corresponding apparel class label.
The best way to put all these steps together is to leverage a robust framework like Flask to build a web service / API on top of TF Serving to accept images from the real-world, perform necessary pre-processing, call TF Serving, post-process the response and then send the final JSON response to the end-user. Do note that we can even dockerize and deploy the Flask API on Kubernetes or use a WSGI server like Gunicorn to scale and improve performance.
Create API with Flask
We will start by creating our own apparel API leveraging Flask. You will find the code files in my GitHub repository; however, we will also cover the code here for the sake of completeness and ease of understanding.
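The overall shape of such an API might look like the sketch below, which is not the repository code. The route paths, the `b64_image` and `apparel_type` field names, the TF Serving URL and the caller-supplied `preprocess` callable are all assumptions for this example, and `urllib.request` stands in for the `requests` module.

```python
import base64
import json
import urllib.request

# Class names in label order (0-9) for Fashion-MNIST.
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
# Hypothetical TF Serving endpoint and model name.
TF_SERVING_URL = "http://localhost:8501/v1/models/apparel_classifier:predict"


def decode_image(b64_string: str) -> bytes:
    """Turn the Base64 string sent by the client back into raw image bytes."""
    return base64.b64decode(b64_string)


def top_label(probabilities: list) -> str:
    """Return the class name with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best]


def call_tf_serving(instances: list) -> list:
    """Send pre-processed images to TF Serving and return the raw predictions."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    request = urllib.request.Request(
        TF_SERVING_URL, data=body, headers={"content-type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["predictions"]


def create_app(preprocess, predict=call_tf_serving):
    """Build the Flask app.

    `preprocess` turns raw image bytes into a nested list shaped like one
    model input (resizing and scaling are left to the caller).
    """
    # Imported lazily so the helpers above work without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/apparel_classifier/api/v1/liveness", methods=["GET", "POST"])
    def liveness():
        return "API Live!"

    @app.route("/apparel_classifier/api/v1/model_predict", methods=["POST"])
    def model_predict():
        image_bytes = decode_image(request.get_json()["b64_image"])
        probabilities = predict([preprocess(image_bytes)])[0]
        return jsonify({"apparel_type": top_label(probabilities)})

    return app
```

A WSGI server needs a module-level application object, so a file built on this sketch would end with something like `app = create_app(my_preprocess)`, where `my_preprocess` is your own image pre-processing function.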
We store this file as app.py on our server, which forms the base of our API.
Start Docker Container for TF Serving
Up next, check and restart the docker container for TF Serving if it’s not already up and running.
!docker start 7d4b091ccefa
!docker ps -all
Start our Apparel Classifier Web Service
Now, we need to start our web service. In production, it is recommended that you do NOT use the default web server provided by Flask but a production-ready WSGI server such as Gunicorn. We start our web service using the following command from the terminal.
We leverage multiple workers to serve more requests as needed. Let’s now check if our API is live using the liveness test endpoint.
(200, 'API Live!')
Serve Sample Apparel Classification with Web Service
Let’s now take a sample real-world image and try to use our web service for performing classification. The image is depicted with the following code.
So this is clearly the image of a Sneaker. Let’s leverage our API to serve the model prediction. Do remember we are encoding any input image to the Base64 format, then decoding and pre-processing it on the server side before performing model inference.
We finally get the right apparel category in the form of a JSON response. Things are working out exactly the way we want!
Benchmark our Web Service
Considering web server latency, image processing, model inference and serving, let’s check out how much time it takes to process 10000 requests now.
100%|██████████| 10000/10000 [05:26<00:00, 30.66it/s]
CPU times: user 1min, sys: 3.17 s, total: 1min 4s
Wall time: 5min 26s
Out: 10000
Inference time per image: 32.599999999999994 ms
We are able to serve each request in 32.6 ms which is not too bad! Can you improve on this even more? Feel free to let us know your thoughts!
I hope this slightly lengthy, yet comprehensive article gives you an idea about how building models and prototyping on notebooks is very different from actually productionizing models in the real-world. Always think of the complete end-to-end picture even when training your models. This helps you envision and implement your own inference system much faster once your models are trained. Hope this also gives you an idea that while TensorFlow Serving might seem complex, once you get started with it, you can train, save and serve your models pretty easily and it works well with frameworks like Docker, Flask, Kubernetes and so on. Feel free to let us know your own tips and tricks and methods to deploy deep learning models in production.
I am a Google Developer Expert in Machine Learning and I look forward to sharing more interesting aspects of machine learning and deep learning over time. Feel free to check out my Medium and LinkedIn for updates on interesting content!