Serving Machine Learning Model in Production — Step-by-Step Guide

Ilnur Garifullin
5 min read · Sep 10, 2018


General-purpose machine learning encoder-decoder systems can be applied in many areas, such as Machine Translation, Text Summarization, Conversational Modeling, Image Captioning and more. Models created for different fields might require different preparation steps before they can be deployed to a production environment. In this article we will deploy and serve a pre-trained Sequence-to-Sequence (Seq2Seq) model with an attention mechanism, developed as a conversation modeling system.


For the model we will use a TensorFlow implementation taken from the conversation-tensorflow repository. As a serving system we will use the open-source ML Lambda environment.

  • Install the dependencies:
$ pip install hs==0.1.3 tensorflow==1.9.0 nltk==3.3 hb-config==0.5.1 tqdm==4.24.0
  • Clone the Seq2Seq model:
$ git clone
  • Train the model on small sample data:
$ python conversation-tensorflow/ --config check-tiny --mode train
  • Clone ML Lambda:
$ git clone
  • Set up local server:
$ cd hydro-serving
$ docker-compose up -d
  • Create an hs cluster that will work with the local server:
$ hs cluster add --name local --server http://localhost
$ hs cluster use local

Those steps will set up the whole environment for you. To check that ML Lambda is running, open http://localhost/ in your browser.


The typical pipeline of a conversation application consists of taking a string, encoding it into a vector of integers, passing the vector through the model, receiving another vector as the inference result, and transforming it back into a string. This transformation is performed by means of a manually built vocabulary (as in our case), which consists of [integer, string] pairs. Serving such a pipeline can be done in two ways:

  1. Creating a monolithic model by slightly modifying the original model, i.e. including string transformations to input/output tensors of the model.
  2. Creating a multi-staged model which will consist of the processing stages and our inference model.

In this article we will continue with the monolithic model for the sake of simplicity.

All the code presented in this article is available on GitHub.

Preparing and Serving Seq2Seq

The chosen model is built upon tf.estimator, a high-level TensorFlow API that simplifies machine learning programming. We will need to modify the estimator a little, since it wasn't prepared for serving. To do that, we have to define a contract (export_outputs) for exporting the model to a servable (tf.saved_model) format. Also, since we decided to create a monolithic model, we need to put the string-to-vector transformations inside the model.

Model Preparation

Let's create a file that will export our model as a SavedModel.

$ cd ../conversation-tensorflow
$ touch

For starters, let's define a function that encodes our vocabulary into tf.contrib.lookup tables. The vocabulary is placed in the conversation-tensorflow/data/tiny_processed_data/vocab file. The function will produce the required tables.

There is an option to create the index table directly from a file, without manually loading the vocabulary and transforming it into a tensor. Since our vocabulary is relatively small, we decided to store it as part of the graph; otherwise, you may want to consider tf.contrib.lookup.index_table_from_file.

The Seq2Seq model is represented by the class Conversation, defined in the original repository. It is good practice not to change the original class, but to work with a class extended from it instead. In the extended class we will override the model_fn and _init_placeholder functions and add two supplementary functions, _preprocess_input and _postprocess_output. Let's define the structure of our class.
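The structure might look like the following skeleton. The class name ServableConversation, the method signatures, and the stub base class standing in for the repository's Conversation are assumptions for illustration; the real base class lives in the conversation-tensorflow repository.

```python
class Conversation:
    """Stub standing in for the original model class from the repository."""

class ServableConversation(Conversation):
    """Extended class (hypothetical name) adding what serving needs."""

    def model_fn(self, mode, features, labels, params):
        # build the lookup tables, run the original graph,
        # and attach export_outputs for tf.saved_model
        raise NotImplementedError

    def _init_placeholder(self, features, labels):
        # accept a raw string instead of a vector of token ids
        raise NotImplementedError

    def _preprocess_input(self, input_data):
        # (None, 1) tf.string -> (1, 60) tf.int64
        raise NotImplementedError

    def _postprocess_output(self, output):
        # vector of token ids -> vector of strings
        raise NotImplementedError
```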

model_fn will stay mostly the same; we just add the lookup tables and padding to the class.

The additional parameter export_outputs specifies the output type of our model for the export [1], and self._postprocess_output wraps the output tensor self.predictions [2].

Next, let’s modify incoming tensors in our model.

This function takes an incoming tensor of shape (None, 1) and type tf.string and converts it into a tensor of shape (1, 60) and type tf.int64.

We also need to transform the outgoing tensor back into strings.

As a final step, let's modify the _init_placeholder function.

Saving the model is performed in the following way:

Note that the crucial points here are: creating an estimator [3], preparing a serving_input_receiver_fn [4] and exporting the model [5]. The first two steps are only necessary for defining internal model parameters.

Model Serving

Now that our model is ready, we can export it to a tf.saved_model and upload it to ML Lambda.

$ python --config check_tiny

This command will create a my_model/${TIMESTAMP} folder with the exported model inside the current directory.

$ cd my_model/${TIMESTAMP}
$ hs upload --name seq2seq

This will upload our exported model to ML Lambda. Now, let's create an application to infer the model. Open http://localhost/, and you'll see the seq2seq model uploaded in the Models section. Open the Applications page, click the Add New button and fill in the following fields.

As the model, select the uploaded seq2seq model; as the runtime, select the serving-runtime-tensorflow runtime; then create the application.

That’s it, you’ve just served a sequence-to-sequence model that you can use via different interfaces such as REST or RPC API.

To test that everything works, click the Test button and send it a sentence.

Calling the application via the REST or RPC interfaces is described in the documentation. As an example, we can invoke the created application via gRPC.


outputs {
  key: "output"
  value {
    dtype: DT_STRING
    tensor_shape {
      unknown_rank: true
    }
    string_val: "sweet"
    string_val: "</s>"
  }
}
In this article we discussed how a single TensorFlow model can be served in a production environment. As an example we chose the conversation-tensorflow model, written with the high-level tf.estimator API. The main issue with this model is that it cannot properly tokenize the input sentences, since TensorFlow is not built for this task. In the next article we will resolve this issue with a preprocessing script, serving the preprocessing step and the model as a multi-staged pipeline.