Serving a Machine Learning Model in Production — Step-by-Step Guide
--
General-purpose machine learning encoder-decoder systems can be applied in many areas, such as Machine Translation, Text Summarization, Conversational Modeling, Image Captioning and more. Models created for different fields might require different preparation steps before they can be deployed to a production environment. In this article we will deploy and serve a pre-trained Sequence-to-Sequence (Seq2Seq) model with an attention mechanism, developed as a conversation modelling system.
Prerequisites
For the model we will use a TensorFlow implementation taken from the conversation-tensorflow repository. As a serving system we will use the ML Lambda open-source environment.
- Install the required packages:
$ pip install hs==0.1.3 tensorflow==1.9.0 nltk==3.3 hb-config==0.5.1 tqdm==4.24.0
- Clone the Seq2Seq model:
$ git clone https://github.com/DongjunLee/conversation-tensorflow
- Train the model on a small data sample:
$ python conversation-tensorflow/main.py --config check-tiny --mode train
- Clone ML Lambda:
$ git clone https://github.com/Hydrospheredata/hydro-serving
- Set up the local server:
$ cd hydro-serving
$ docker-compose up -d
- Create an hs cluster that will work with the local server:
$ hs cluster add --name local --server http://localhost
$ hs cluster use local
Those steps will set up the whole environment for you. To check if ML Lambda is running, go to http://localhost/ in your browser.
Basics
A typical pipeline for a conversational application consists of taking a string, encoding it into a vector of integers, passing the vector through the model, receiving another vector as the inference result, and transforming it back into a string. This transformation is performed by means of a manually built vocabulary (as in our case), which consists of [integer, string] pairs. Serving such a pipeline can be done in two ways:
- Creating a monolithic model by slightly modifying the original model, i.e. including the string transformations in the input/output tensors of the model.
- Creating a multi-staged model which consists of the processing stages and our inference model.
In this article we will continue with the monolithic model, since it is simpler to explain and deploy.
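To make the idea concrete, here is a toy illustration of that round trip in plain Python (the vocabulary and the "model" output are made up for the example):
# Toy round trip: string -> ids -> (model) -> ids -> string.
# The vocabulary and the predicted ids are made up for the example.
vocab = {0: "<pad>", 1: "hi", 2: "how", 3: "are", 4: "you", 5: "fine"}
word2idx = {word: idx for idx, word in vocab.items()}

def encode(sentence):
    return [word2idx[word] for word in sentence.split()]

def decode(ids):
    return " ".join(vocab[idx] for idx in ids)

ids = encode("how are you")   # [2, 3, 4]
predicted_ids = [5]           # stand-in for the model's inference result
print(decode(predicted_ids))  # "fine"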
All the code presented in this article is available on GitHub.
Preparing and Serving Seq2Seq
The chosen model is built upon tf.estimator, a high-level TensorFlow API that simplifies machine learning programming. We will need to modify the estimator a little, since it wasn't prepared for serving. To do that we have to define a contract (export_outputs) for exporting the model to a servable (tf.saved_model) format. Also, since we decided to create a monolithic model, we need to put the string-to-vector transformations inside the model.
Model Preparation
Let's create an export.py file that will export our model into a SavedModel.
$ cd ../conversation-tensorflow
$ touch export.py
For starters, let's define a function that will encode our vocabulary into a tf.contrib.lookup table. The vocabulary is placed in the conversation-tensorflow/data/tiny_processed_data/vocab file. The function produces the required lookup tables in both directions.
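A minimal sketch of such a helper, assuming the vocabulary file contains one token per line (the function and variable names here are illustrative, not taken from the repository):
import tensorflow as tf

def build_vocab_tables(vocab_path="data/tiny_processed_data/vocab"):
    """Read the vocabulary file and build lookup tables in both directions."""
    with open(vocab_path) as f:
        words = [line.strip() for line in f if line.strip()]
    mapping = tf.constant(words)
    # string -> id table; unknown words fall back to index 0 (assumed <pad>/<unk>)
    word2idx = tf.contrib.lookup.index_table_from_tensor(
        mapping=mapping, default_value=0)
    # id -> string table, used to decode predictions back into text
    idx2word = tf.contrib.lookup.index_to_string_table_from_tensor(
        mapping=mapping, default_value="<unk>")
    return word2idx, idx2word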
There is also an option to create the index table directly from a file, without manually loading the vocabulary and transforming it into a tensor. Since our vocabulary is relatively small, we've decided to store it as a part of the graph; otherwise you may want to consider tf.contrib.lookup.index_table_from_file.
The Seq2Seq model is represented by the Conversation class, defined in the model.py file. It is good practice not to change the original class and instead work with a class that extends it. In the extended class we will override the model_fn and _init_placeholder functions and add two supplementary functions, _preprocess_input and _postprocess_output. Let's define the structure of our class.
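A sketch of that structure is shown below; the class name ConversationServing is an assumption, and the method signatures follow standard tf.estimator conventions and may differ slightly from the repository.
from model import Conversation

class ConversationServing(Conversation):
    """Extends the original model with string pre-/post-processing for serving."""

    def model_fn(self, features, labels, mode, params):
        # overridden: builds the lookup tables, wraps the predictions and
        # adds export_outputs so the model can be exported as a SavedModel
        ...

    def _init_placeholder(self, features, labels):
        # overridden: accepts a raw tf.string tensor instead of integer ids
        ...

    def _preprocess_input(self, strings):
        # (None, 1) tf.string tensor -> (1, 60) tf.int64 tensor
        ...

    def _postprocess_output(self, predictions):
        # predicted ids -> string tokens, via the reverse vocabulary table
        ...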
model_fn will stay mostly the same; we just add the lookup tables and padding to the class.
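The prediction branch might look roughly like this (a sketch: the graph-construction call and the signature key are illustrative, and the TRAIN/EVAL branches are omitted):
def model_fn(self, features, labels, mode, params):
    self.word2idx, self.idx2word = build_vocab_tables()
    self._init_placeholder(features, labels)
    self.build_graph()  # original graph construction, unchanged

    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = self._postprocess_output(self.predictions)  # [2]
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"output": predictions},
            export_outputs={  # [1] contract used when exporting to SavedModel
                tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                    tf.estimator.export.PredictOutput({"output": predictions})
            })
    # ... TRAIN and EVAL branches stay as in the original model_fn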
The additional export_outputs parameter specifies the output type of our model for the export [1], and self._postprocess_output wraps the output tensor self.predictions [2].
Next, let's modify the incoming tensors in our model.
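A possible implementation is sketched below, assuming whitespace tokenization, a maximum sequence length of 60 and a padding id of 0 (all of these are assumptions based on the config):
def _preprocess_input(self, strings):
    # (None, 1) tf.string -> whitespace tokens as a sparse tensor
    tokens = tf.string_split(tf.reshape(strings, [-1]), delimiter=" ")
    dense = tf.sparse_tensor_to_dense(tokens, default_value="<pad>")
    # look the tokens up in the vocabulary table built earlier
    ids = self.word2idx.lookup(dense)
    # pad (with 0) or crop to the fixed length expected by the encoder
    ids = tf.pad(ids, [[0, 0], [0, 60]])[:, :60]
    return tf.reshape(ids, [1, 60])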
This function takes an incoming tensor of shape (None, 1) and type tf.string, and converts it to a tensor of shape (1, 60) and type tf.int64.
And we need to convert the outgoing tensor back to strings in the original shape.
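A sketch of the reverse transformation, relying on the id-to-string table from the vocabulary helper above:
def _postprocess_output(self, predictions):
    # predicted ids -> string tokens, flattened back to a 1-D string tensor
    tokens = self.idx2word.lookup(tf.to_int64(predictions))
    return tf.reshape(tokens, [-1])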
As a final step, let's modify the _init_placeholder function.
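A sketch of the modified function; the feature key "input" matches the serving_input_receiver_fn shown below and, like the attribute names, is an assumption:
def _init_placeholder(self, features, labels):
    self.features = features
    if isinstance(features, dict):
        # serving path: a raw string tensor comes in and is converted to ids
        self.encoder_inputs = self._preprocess_input(features["input"])
    else:
        self.encoder_inputs = features
    if labels is not None:
        self.decoder_inputs = labels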
Saving the model is performed in the following way.
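Here is a sketch of the export step in export.py; the model_dir, the feature key and the ConversationServing class name are assumptions, and the numbered comments correspond to the points below:
import tensorflow as tf

def serving_input_receiver_fn():
    # [4] the exported model accepts a single string tensor named "input"
    inputs = tf.placeholder(tf.string, shape=[None, 1], name="input")
    return tf.estimator.export.ServingInputReceiver(
        features={"input": inputs}, receiver_tensors={"input": inputs})

model = ConversationServing()
estimator = tf.estimator.Estimator(          # [3]
    model_fn=model.model_fn,
    model_dir="logs/check_tiny")             # assumed checkpoint directory

estimator.export_savedmodel(                 # [5] writes my_model/${TIMESTAMP}
    export_dir_base="my_model",
    serving_input_receiver_fn=serving_input_receiver_fn)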
Note that the crucial points here are: creating an estimator [3], preparing a serving_input_receiver_fn [4] and exporting the model [5]. The first two steps are only necessary for defining internal model parameters.
Model Serving
Now that our model is ready, we can export it to a tf.saved_model and upload it to ML Lambda.
$ python export.py --config check_tiny
This command will create a my_model/${TIMESTAMP} folder inside the current directory with the exported model.
$ cd my_model/${TIMESTAMP}
$ hs upload --name seq2seq
This will upload our exported model to ML Lambda. Now, let's create an application to infer the model. Open http://localhost/, and you'll see the seq2seq model uploaded in the Models section. Open the Applications page, click the Add New button, and fill in the following fields.
As the model, select the uploaded seq2seq model; as the runtime, select the serving-runtime-tensorflow runtime; then create the application.
That’s it, you’ve just served a sequence-to-sequence model that you can use via different interfaces such as REST or RPC API.
To test that everything works fine, click the Test button and send it a sentence.
Calling the application via the REST or RPC interfaces is described in the documentation. As an example, we can invoke the created application via gRPC.
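Below is a minimal sketch of such a call, assuming ML Lambda exposes a TensorFlow-Serving-compatible gRPC endpoint; the host, port and tensor names are assumptions, so check the documentation for the actual values.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# assumed gRPC endpoint of the local ML Lambda instance
channel = grpc.insecure_channel("localhost:9090")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "seq2seq"  # the application we created above
request.inputs["input"].CopyFrom(
    tf.make_tensor_proto(["how are you"], dtype=tf.string, shape=[1, 1]))

response = stub.Predict(request, 10.0)  # 10-second timeout
print(response)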
Result:
outputs {
  key: "output"
  value {
    dtype: DT_STRING
    tensor_shape {
      unknown_rank: true
    }
    string_val: "sweet"
    string_val: "<\\s>"
  }
}
Conclusion
In this article we discussed how a single TensorFlow model can be served in a production environment. As an example we chose the conversation-tensorflow model, written with the high-level tf.estimator API. The main issue with this model is that it cannot properly tokenize the input sentences, since TensorFlow is not built for this task. In the next article we will show how to resolve this issue with a preprocessing script, serving the preprocessing step and the model as consecutive stages of a pipeline.