Deploy a Servable Question Answering Model Using TensorFlow Serving

Joyce Y.
Jun 14, 2019


Keywords: tensorflow-serving, bert, QA-model


I. Fine-tune a pre-trained model

Before reading this article, it is highly recommended that you already know how to fine-tune a BERT base model on a QA dataset such as SQuAD. You will therefore have two saved models at hand before the final deployment: the pre-trained base model and your fine-tuned model.

The BERT base model is downloaded from its official GitHub repository and includes a config file (bert_config.json), a vocab file (vocab.txt), and checkpoints.

After fine-tuning, the new model updates the checkpoint files; in my case the model stopped at step 43439. In addition, a graph file (graph.pbtxt) is saved with this model. The original BERT download doesn’t include this file, but it is a must for export!

II. Export Model

Now we can write some code to export this new model. First we need to write an input function that defines what kind of data the model expects to receive.

To write the input_receiver function, we need to pay attention to the feature data types, such as tf.int64, tf.float32, or tf.string. Here, the raw input is a paragraph plus a question, so it is tf.string. The data will then be parsed into features; this process can be referenced in the official BERT SQuAD code (run_squad.py). The parsed feature_spec needs to be consistent with the SQuAD model’s feature definitions.
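
As a minimal sketch (my own reconstruction, not the article’s exact code), the serving input receiver can be written as below, assuming max_seq_length = 128 and a fixed batch size of 6 to match the exported signature shown later; the feature names follow the features written by run_squad.py:

import tensorflow as tf

MAX_SEQ_LENGTH = 128  # assumed, must match the fine-tuned model
BATCH_SIZE = 6        # assumed, matches the (6, ...) shapes in the signature

def serving_input_receiver_fn():
    # The raw input is a batch of serialized tf.Example protos (tf.string).
    serialized_examples = tf.placeholder(
        dtype=tf.string, shape=[BATCH_SIZE], name="input_example_tensor")

    # feature_spec must be consistent with the features produced by
    # convert_examples_to_features in run_squad.py.
    feature_spec = {
        "unique_ids": tf.FixedLenFeature([], tf.int64),
        "input_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "input_mask": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "segment_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    }
    features = tf.parse_example(serialized_examples, feature_spec)
    return tf.estimator.export.ServingInputReceiver(
        features, {"examples": serialized_examples})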

Then we define an estimator the same way as in model training and call the estimator export API to export the saved model. What is important here is that if we don’t have a TPU on the model server, we need to change ._export_to_tpu to False. Because the BERT model uses TPUEstimator, this attribute is True by default.

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=6,
    predict_batch_size=8)

estimator._export_to_tpu = False  ## !! important to add this

estimator.export_saved_model(
    export_dir_base=EXPORT_PATH,
    serving_input_receiver_fn=serving_input_receiver_fn)

After exporting, we obtain a new model object in EXPORT_PATH. This model object is now servable. You will see the standard file structure that TensorFlow Serving requires: a saved_model.pb file and a folder named variables.
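
For reference, the exported directory should look roughly like this (the timestamped version folder name is just an example):

EXPORT_PATH/
└── 1560000000/                  # timestamped version directory (example)
    ├── saved_model.pb
    └── variables/
        ├── variables.data-00000-of-00001
        └── variables.index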

You can check your exported model with the saved_model_cli script from the TensorFlow Python tools. Below we can confirm that the input tensor is DT_STRING and also see what the outputs will look like. Of course, we still need to parse the outputs into an answer span; that will be covered later.

python tensorflow/python/tools/saved_model_cli.py show --dir=$YOUR_EXPORTED_MODEL$ --all

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['examples'] tensor_info:
        dtype: DT_STRING
        shape: (6)
        name: input_example_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['end_logits'] tensor_info:
        dtype: DT_FLOAT
        shape: (6, 128)
        name: unstack:1
    outputs['start_logits'] tensor_info:
        dtype: DT_FLOAT
        shape: (6, 128)
        name: unstack:0
    outputs['unique_ids'] tensor_info:
        dtype: DT_INT64
        shape: (6)
        name: ParseExample/ParseExample:3
  Method name is: tensorflow/serving/predict

III. Start a Model Server

Now you can upload the exported model object to your model server and start serving. You can also use your local machine to continue with this experiment. The newest TensorFlow Serving makes serving very simple: all you need is to install Docker and pull the TensorFlow Serving Docker image. I won’t go into details here; they can be found on the official Docker site and the TensorFlow Serving GitHub.

Now we can start our Docker server with just one command line. Here “source” means your exported model directory and “target” points to the model path inside the Docker container, which is the same as “model_base_path”.

docker run -p 8500:8500 \
  --mount type=bind,source=$YOUR_EXPORT_MODEL_PATH$,target=/models/bert-qa \
  -t tensorflow/serving \
  --model_base_path=/models/bert-qa --model_name=bert-qa

IV. Start a Model Client

Because TensorFlow Serving provides a Docker image, it saves a lot of time on the model server side. However, we have to write the model client ourselves, which might take some time.

First, we need to write an input-processing function that satisfies the model’s input format, as well as an output-processing function and a result-formatting function that produce the answer span.

read_squad_data, FeatureWriter, and convert_examples_to_features are all provided by the official BERT GitHub repository.
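
As a rough sketch (my own reconstruction, not the article’s original gist), the input-processing function can be assembled from those helpers; I use read_squad_examples as it appears in run_squad.py, and the file names and parameters here are illustrative:

import tokenization
from run_squad import read_squad_examples, convert_examples_to_features, FeatureWriter

def process_inputs(input_json_path, predict_file="eval.tf_record"):
    # Tokenizer built from the vocab file shipped with the BERT base model.
    tokenizer = tokenization.FullTokenizer(
        vocab_file="vocab.txt", do_lower_case=True)

    # Read the raw {question, context} payload in SQuAD format.
    eval_examples = read_squad_examples(
        input_file=input_json_path, is_training=False)

    # Serialize the features into a TFRecord file; the client later reads this
    # file back and sends the serialized tf.Example strings to the model server.
    eval_writer = FeatureWriter(filename=predict_file, is_training=False)
    eval_features = []

    def append_feature(feature):
        eval_features.append(feature)
        eval_writer.process_feature(feature)

    convert_examples_to_features(
        examples=eval_examples,
        tokenizer=tokenizer,
        max_seq_length=128,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        output_fn=append_feature)
    eval_writer.close()

    return eval_examples, eval_features, predict_file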

RawResult and write_predictions are also provided. Let me briefly explain the output-processing step. The model server output is the set of tensors seen in the signature_def above. We extract the values from these tensors and build RawResult objects as formatted results. As we can see, the raw result is just a score (logit) for each start and end position over all passage positions. We can then get the best prediction after sorting all scores. The write_predictions function guarantees that the chosen start position comes before the end position.
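
Here is a hedged sketch of that conversion, assuming a batch size of 6 and a max sequence length of 128 to match the exported signature; raw_result is the outputs map returned by the gRPC call shown below:

import collections

RawResult = collections.namedtuple(
    "RawResult", ["unique_id", "start_logits", "end_logits"])

def process_output(raw_result, batch_size=6, max_seq_length=128):
    # Each entry in raw_result is a TensorProto whose values are stored as
    # flat lists; reshape them back into per-example rows.
    unique_ids = raw_result["unique_ids"].int64_val
    start_logits = raw_result["start_logits"].float_val
    end_logits = raw_result["end_logits"].float_val

    all_results = []
    for i in range(batch_size):
        lo, hi = i * max_seq_length, (i + 1) * max_seq_length
        all_results.append(RawResult(
            unique_id=int(unique_ids[i]),
            start_logits=list(start_logits[lo:hi]),
            end_logits=list(end_logits[lo:hi])))

    # all_results can then be passed to write_predictions together with the
    # examples and features produced by process_inputs.
    return all_results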

Then we create a channel and a stub that can reach the model server and send the request.

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

hostport = "127.0.0.1:8500"
channel = grpc.insecure_channel(hostport)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

model_request = predict_pb2.PredictRequest()
model_request.model_spec.name = 'bert-qa'

# Read back the serialized tf.Example records written during input processing.
string_record = list(tf.python_io.tf_record_iterator(path=predict_file))
model_request.inputs['examples'].CopyFrom(
    tf.contrib.util.make_tensor_proto(string_record,
                                      dtype=tf.string,
                                      shape=[batch_size]))

result_future = stub.Predict.future(model_request, 30.0)  # 30-second timeout
raw_result = result_future.result().outputs

The model name is the same as we defined when we started the Docker container on the model server. predict_file is the file that saved the input features when we defined process_inputs above. The raw result can then be passed to the output-processing function above, and we will see the final answers soon.

V. Finally, Call your Model

Now your model is ready to serve! A user can call your model client with a payload as below, and the client sends a request to the model server to do inference. The raw result is sent back over gRPC and your client processes it into the final answer.

{
  "options": {
    "n_best": true,
    "n_best_size": 3,
    "max_answer_length": 30
  },
  "data": [
    {
      "id": "001",
      "question": "Who invented LSTM?",
      "context": "Many aspects of speech recognition were taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997.[51] LSTM RNNs avoid the vanishing gradient problem and can learn \"Very Deep Learning\" tasks[2] that require memories of events that happened thousands of discrete time steps before, which is important for speech. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[52] Later it was combined with connectionist temporal classification (CTC)[53] in stacks of LSTM RNNs.[54] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search."
    }
  ]
}

You will see the final answer as below.

{
  "response": {
    "latency": 1.1053588390350342,
    "result": [
      {
        "best_prediction": "Hochreiter and Schmidhuber",
        "id": "001",
        "n_best_predictions": [
          {
            "end_logit": 5.893029689788818,
            "probability": 0.9980743356799491,
            "start_logit": 6.029672145843506,
            "text": "Hochreiter and Schmidhuber"
          },
          {
            "end_logit": -0.4450709819793701,
            "probability": 0.00176425249733271,
            "start_logit": 6.029672145843506,
            "text": "Hochreiter and Schmidhuber in 1997."
          },
          {
            "end_logit": 5.893029689788818,
            "probability": 0.00016141182271812068,
            "start_logit": -2.6999518871307373,
            "text": "Schmidhuber"
          }
        ],
        "question": "Who invented LSTM?"
      }
    ],
    "status": "SUCCESS"
  },
  "status": 200
}
