In Pursuit of Elusive Low Latency for LSTM Timeseries Model Inference — via TorchScript and ONNX

Kaustav Mandal
Published in exemplifyML.ai
8 min read · Oct 22, 2022

In pursuit of low latency for customized timeseries LSTM model inference: the journey from a Python-centric model to TorchScript, and onward to ONNX

In this tutorial, we will take the customized LSTM model used for text generation, as illustrated in a previous tutorial, and work on reducing its inference latency.

I chose to use a customized model, as it provides insight into how to navigate the challenges involved in converting to the TorchScript and ONNX formats.

We will start off by running the trained model in a Flask app and measuring the latency of a text generation inference.

Inputs Used:

  • Seed Length: 4 words
  • Text Generation length: 20 words
  • Seed Text: Movie is great.

Baseline:

The original trained model used as a baseline is referenced here.

Benchmark Tool: Apache Benchmark (ab)

For the initial test run, with a single executor and a maximum of 6 requests, we saw a lackluster P90 latency of around 657 ms.

Figure 1: Stats from Apache Benchmark Tool for 6 requests on original model (Image by Author)

Conversion of the Original Model to a TorchScript Model:

The journey from a pythonic model to a TorchScript model, and onward to an ONNX model, was an adventurous one. I predict a toupee in my foreseeable future 😄.

A TorchScript model can be executed entirely from C++, without a Python interpreter. The conversion works by mapping a small subset of Python and PyTorch operations onto their C++ implementations, which are provided by the ATen library.

In my opinion, TorchScript is a means to an end: once we have a compliant TorchScript model, we should be able to export it to the ONNX format, though the usual caveats apply.

Prior to the conversion, it is worth checking which Python and PyTorch operators are supported out of the box.

Converting a pythonic model to a TorchScript model can be approached in two ways, as listed below.

  • trace — This is the preferred way of converting a Python model to a TorchScript model; it traces the flow of execution based on sample inputs.
    Note: If the model contains data-dependent control flow, tracing will only capture the path taken by the sample input. As a workaround, one can trace the different control flow paths separately and aggregate them.
  • script — This converts a Python model to a TorchScript model while preserving control flow.
    It requires a higher level of effort though, as we are moving from a dynamically typed runtime (Python) to a statically typed runtime (C++).
    Variables and method parameters have to be typed, or their types will be inferred from the sample inputs; if for any reason a type cannot be inferred, it will default to the Tensor (float) type.

Both approaches are sketched below.
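Both approaches live in the torch.jit API. The minimal, self-contained sketch below uses a small stand-in module rather than the article's actual model, purely to show the two calls:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Small stand-in for the custom LSTM model, used only to illustrate the API."""
    def __init__(self, vocab_size: int = 100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 16)
        self.lstm = nn.LSTM(16, 32, batch_first=True)
        self.fc = nn.Linear(32, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1, :])  # next-word logits for the last position

model = TinyModel().eval()
sample = torch.randint(0, 100, (1, 4))  # e.g. a 4-word seed encoded as token ids

# trace: records the ops executed for this particular input; control flow is baked in
traced = torch.jit.trace(model, sample)

# script: compiles the Python source itself, so if/for constructs are preserved
scripted = torch.jit.script(model)

scripted.save("lstm_text_gen_ts.pt")  # the saved module can be loaded from C++ or Python
```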

Note: Whichever mode we choose, I cannot stress this enough: simplify, simplify and simplify the model prior to conversion.

Iteration 1:

The major changes in the working, TorchScript-compatible model are listed below.

Changes:

  • All variables and method signatures have explicit data types
  • The model has multiple entry points, marked with TorchScript annotations (see the sketch below)
    Method ‘forward’ — training entry point
    Method ‘predict_all_steps’ — inference entry point
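A minimal sketch of this multiple-entry-point pattern is shown below; the TextGenModel stand-in, its layer sizes and the greedy generation loop are simplifications, not the article's actual model:

```python
import torch
import torch.nn as nn

class TextGenModel(nn.Module):
    """Simplified stand-in illustrating a TorchScript model with two entry points."""
    def __init__(self, vocab_size: int = 100, embed_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # training entry point: next-word logits for the final position
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1, :])

    @torch.jit.export
    def predict_all_steps(self, x: torch.Tensor, steps: int) -> torch.Tensor:
        # inference entry point: greedily append `steps` predicted tokens to the seed
        for _ in range(steps):
            next_token = torch.argmax(self.forward(x), dim=1, keepdim=True)
            x = torch.cat([x, next_token], dim=1)
        return x

scripted = torch.jit.script(TextGenModel().eval())
generated = scripted.predict_all_steps(torch.randint(0, 100, (1, 4)), 20)
```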

The whole model can be referenced here.

After running the benchmark test, I saw a modest decrease in P90 latency: 603 ms, a drop of 54 ms.

Given the amount of effort that went into making the model TorchScript compatible, I had hoped for more impressive gains.

Figure 2: Stats from Apache Benchmark Tool for 6 requests on TorchScript model (Image by Author)

Moving Onwards to ONNX:

Although moving from a TorchScript model to an ONNX model is easier, it is not without its challenges.

I used this tutorial as a guide for converting the TorchScript model to ONNX.

Caveats: Not all PyTorch operators are supported in ONNX.

  • Limitations of the PyTorch to ONNX conversion can be referenced here.
  • The list of supported ONNX operators, along with their opset versions, is available here.
  • Supported PyTorch/TorchScript to ONNX operators can be referenced here.

Iteration 2:

The changes needed for the refactored, ONNX-compatible model are listed below.

Changes:

  • Issues with using for … range loops
  • No support for multiple entry points. (The onnx-mlir project supports them, but that is a topic for another time; it requires exporting the model as a shared object (.so).)
  • Issues with control flows
    I had to take out the control flows to make the model work with ONNX. The refactored model has only the default ‘forward’ entry point, which is used for inference.
    This has an impact on the maintainability of the model: although we can reuse the trained weights, we now have two versions of the model, one for training and one for inference.
  • No LSTMCell support — only the one-layer LSTM is supported, referenced here.
    A snippet for the LSTMCell workaround is referenced below (a sketch of the idea follows this list). It essentially involves creating a wrapper around a single-layer LSTM and using it as a single-layer, single-step LSTM.
    As one of the layers has changed, the model needs to be retrained to get the proper weights for the changed layer.
    The accuracy of the model needs to be re-validated as well.
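As a rough illustration of that wrapper (not the original snippet; the layer sizes and state shapes here are assumptions), the idea looks like this:

```python
import torch
import torch.nn as nn
from typing import Tuple

class LSTMCellWrapper(nn.Module):
    """Drop-in replacement for nn.LSTMCell built from a one-layer nn.LSTM.

    nn.LSTM maps onto the ONNX LSTM operator, so running it one step at a time
    (sequence length 1) emulates the unsupported nn.LSTMCell.
    """
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)

    def forward(
        self, x: torch.Tensor, state: Tuple[torch.Tensor, torch.Tensor]
    ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        h, c = state                                           # each: (1, batch, hidden_size)
        out, (h_n, c_n) = self.lstm(x.unsqueeze(1), (h, c))    # x: (batch, input_size)
        return out.squeeze(1), (h_n, c_n)
```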

The whole refactored, ONNX-compatible model can be referenced here.

Now that we have an ONNX-compatible model, let's test the inference timings using the TorchScript version of this model.

I was pleasantly surprised by the results.
The P90 latency was around 333 ms. Latency decreased by 324 ms compared to the original model, which is almost a 50% reduction (0.4931).

Figure 3: Stats from Apache Benchmark Tool for 6 requests on the refactored ONNX-compatible model, running the TorchScript version (Image by Author)

Steps for exporting this model to ONNX and using it on the Java runtime:

  • Conversion of the TorchScript model to the ONNX format; snippet referenced below (a sketch follows this list).
    We need to provide a valid sample input for the conversion process, as it drives the inferred shapes and data types for the model’s inputs and parameters.
  • Export of the model’s vocabulary via Protocol Buffers so that it can be loaded into other runtimes, such as the Java runtime.
    I used the C++ implementation referenced here.
    Protocol Buffers is Google’s language-agnostic format for serializing structured data.
    Details referenced here.
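A minimal sketch of the export call (not the original snippet; the file names, input/output names and opset version are assumptions):

```python
import torch

# TorchScript inference model from the previous step (assumed file name)
model = torch.jit.load("lstm_text_gen_ts.pt").eval()

# A valid sample input drives the shape/dtype inference during export,
# e.g. a batch of one 4-token seed sequence.
dummy_input = torch.randint(0, 100, (1, 4))

torch.onnx.export(
    model,
    dummy_input,
    "lstm_text_gen.onnx",
    input_names=["seed_tokens"],        # assumed names
    output_names=["generated_tokens"],
    opset_version=14,
    dynamic_axes={"seed_tokens": {0: "batch"}},
)
```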

Loading the ONNX model into another runtime, i.e. Java

For measuring latency in the Java runtime environment, I used Spring Boot with a reactive framework for serving the web requests.

The ONNX Runtime Java API can be referenced here.

Steps for setting up this custom timeseries LSTM ONNX-format model on the Java runtime:

  • Generate the Protocol Buffers Java implementation for loading the model’s vocabulary.
    Once we can parse the ‘proto’ files in Java, we load them into lookup maps.
  • Implement a word tokenizer or use an existing tokenizer library.
    As I had used the basic PyTorch ‘torchtext’ tokenizer, which is pretty much a whitespace splitter, I implemented similar functionality in Java.
  • Load the ONNX model and initialize an ONNX session during application startup.
    Additionally, warm up the model with a few trial inferences.
  • Add logic to convert the input string to a tensor, execute inference on the model, and finally convert the resulting tensor into a user-friendly string.

Stats for running the CPU flavor of the exported custom timeseries LSTM ONNX-format model in the Java runtime:

The P90 latency was observed at around 241 ms. Latency decreased by 416 ms compared to the original model, a 63% reduction (0.6331).

Figure 4: Stats from Apache Benchmark Tool for 6 requests on refactored ONNX compatible model — running on Java runtime on SpringBoot (Image by Author)

ONNX with GPU — Java Runtime

Apart from reducing model inference latency, the ONNX framework makes it easy to switch between CPU and GPU execution.

Note: There is some overhead from copying tensors from the CPU to the GPU, as the tensors are initially created on the CPU.
There is a workaround using the concept of IOBinding; however, I came across examples in Python and C++ only.

Additional information about IOBinding can be found here; see the excerpt below.

Similarly if the output is not pre-allocated on the device, ORT assumes that the output is requested on the CPU and copies it from the device as the last step of the Run() call. This obviously eats into the execution time of the graph misleading users into thinking ORT is slow when the majority of the time is spent in these copies. To address this we’ve introduced the notion of IOBinding.
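For reference, a minimal Python sketch of IOBinding, assuming the model path and the input/output names from the earlier export sketch:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("lstm_text_gen.onnx", providers=["CUDAExecutionProvider"])
binding = sess.io_binding()

# Create the input directly on the GPU so Run() does not copy it from host memory.
seed = ort.OrtValue.ortvalue_from_numpy(
    np.array([[2, 17, 5, 9]], dtype=np.int64), "cuda", 0
)
binding.bind_ortvalue_input("seed_tokens", seed)

# Ask ORT to allocate the output on the same device instead of copying it back to the CPU.
binding.bind_output("generated_tokens", "cuda", 0)

sess.run_with_iobinding(binding)
result = binding.get_outputs()[0].numpy()  # copy back to host only when actually needed
```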

The snippet for running inference on a GPU using the Java runtime is illustrated below.
We only need to change the ‘addCPU’ option to the ‘addCUDA’ option.

Stats for running the GPU flavor of the exported custom timeseries LSTM ONNX-format model in the Java runtime:

The P90 latency was observed at around 44 ms. Latency decreased by 613 ms compared to the original model, a 93% reduction (0.9330).

Visualization Tools:

Graph Visualization:

One of the additional benefits of using the ONNX format is the ability to visualize the model graph using Netron.

An illustration of the customized LSTM timeseries text generation graph, as viewed in Netron, is shown below.

Figure 5: ONNX graph illustration as viewed in Netron. Each section can be drilled down further as needed. (Image by Author)

Performance Tuning (Execution Timing Visualization):

ONNX Runtime has a feature to profile an inference session via the options.enableProfiling(profilingFilePath) switch.

This writes out a profiling JSON file with the timings of the different components, which can then be viewed in a Chrome/Chromium browser.

One can visualize the profiling session by navigating to chrome://tracing and loading the JSON file.
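The enableProfiling call above is from the Java SessionOptions API; for reference, the Python API exposes the same switch and produces the same JSON format. A minimal sketch, with an assumed model path and file prefix:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True                        # write a chrome://tracing compatible JSON file
so.profile_file_prefix = "lstm_text_gen_profile"  # assumed prefix for the output file name

sess = ort.InferenceSession("lstm_text_gen.onnx", sess_options=so)
# ... run a few inferences here ...
print(sess.end_profiling())                       # returns the path of the generated profile file
```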

See sample illustration below.

Figure 6: Visualization of ONNX profile json file in Chrome/Chromium at uri — chrome://tracing (Image by Author)

Takeaways:

  • Simplify the initial model if possible, i.e. fewer control flow and looping constructs.
  • Use standardized models, and customize a model only as a last resort. Converting a customized model can require changing some of its layers, and the model might then need to be retrained.
    Additionally, the restructured model needs to be re-tested to validate that its accuracy is within acceptable thresholds.
    Finally, customized models might not get the same level of optimization as standardized models.

Other Resources for optimizing ONNX format models:
