Optimizing TensorFlow Models for Inference
How we found a simple way to reduce the memory consumption of production models by up to 80% with no loss in predictive power!
When working with Machine Learning models in production, especially at scale, performance is king. At Tinyclues we know this by heart: we routinely push our ML stack to its limits, having deployed over 100,000 marketing campaigns for our clients.
When migrating the inference of our TensorFlow models to BigQuery ML, we hit a hard constraint: some of our biggest models required more memory than the platform could allocate.
Our solution consists of optimizing the low-level representation of our TensorFlow models, reducing model-related memory consumption by up to 80% in production. Unlike compression methods such as clustering and mixed precision, our approach returns the exact same model and thus does not sacrifice any predictive quality.
This article will showcase our thought process behind the optimizations and show you how you can apply them to any TensorFlow model, regardless of type and deployment environment.
🖥️ The source code for the experiments and benchmarks in this article is available in this Colab notebook.
Finding the Bottleneck
Let’s start with a crucial observation: The vast majority of incremental memory consumption happens while loading a model, not while running the inference.
We can show this by creating a simple dense TensorFlow model using the Keras API:
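The exact code lives in the companion notebook; the sketch below illustrates the kind of model we mean (the 128-feature input, the layer sizes, and the dense_model save path are illustrative assumptions, not the values from our benchmarks):

```python
import tensorflow as tf

# A simple fully-connected model built with the Keras Functional API.
# Layer sizes and input width are illustrative only.
inputs = tf.keras.Input(shape=(128,), name="features")
hidden = tf.keras.layers.Dense(1024)(inputs)
hidden = tf.keras.layers.Dense(1024)(hidden)
outputs = tf.keras.layers.Dense(1, name="score")(hidden)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.save("dense_model")  # SavedModel format, reloaded in the benchmarks below
```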
Using this model to run inference on a dummy dataset with 20,000 input Tensors, we clearly see that the majority of both time and memory consumption occurs during the loading phase of the model.
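As a rough sketch of that timing measurement, assuming the illustrative model above was saved under dense_model and takes 128 input features (memory was profiled separately in the notebook):

```python
import time
import numpy as np
import tensorflow as tf

# Time the loading phase.
start = time.perf_counter()
model = tf.keras.models.load_model("dense_model")
load_time = time.perf_counter() - start

# Time inference on a dummy dataset of 20,000 input rows.
dummy_inputs = np.random.rand(20_000, 128).astype("float32")
start = time.perf_counter()
model.predict(dummy_inputs, batch_size=1024)
inference_time = time.perf_counter() - start

print(f"loading: {load_time:.2f}s | inference: {inference_time:.2f}s")
```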
A Tale of Two Models
Suppose we want to add a sigmoid activation function in between the layers of our previous model. With the Functional API, Keras provides multiple ways to represent this — let’s take a deep dive into two possible implementations:
- We can define it as an argument of the Dense Layer (what we’ll call Internal Activation)
- or, as a separate Activation Layer (we’ll call it External Activation); a minimal sketch of both follows.
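Here is what the two variants could look like (again with illustrative sizes and save paths):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(128,))

# 1) Internal Activation: sigmoid passed as an argument of the Dense layers.
x = tf.keras.layers.Dense(1024, activation="sigmoid")(inputs)
x = tf.keras.layers.Dense(1024, activation="sigmoid")(x)
internal = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(x))

# 2) External Activation: separate Activation layers after plain Dense layers.
y = tf.keras.layers.Dense(1024)(inputs)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Dense(1024)(y)
y = tf.keras.layers.Activation("sigmoid")(y)
external = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(y))

internal.save("internal_activation")
external.save("external_activation")
```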
Clearly, these models are functionally equal. They implement the same logic and have the same inputs, outputs, hidden layers, and activations. Mathematically speaking, they are the same.
As such, we expect them to have the same performance and memory consumption… right?
Let’s test it! We launch two separate processes, each of which first loads TensorFlow and then one of the two models. Benchmarking their memory footprints yields some very interesting results.
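One way such a benchmark could look, assuming the two models were saved under internal_activation and external_activation and that psutil is available to read each process’s resident memory (the notebook contains the actual benchmarking code):

```python
import subprocess
import sys

# Each model is loaded in a fresh Python process so that measurements do not
# contaminate each other; we report load time and resident memory (RSS).
SNIPPET = """
import os, time, psutil
import tensorflow as tf

start = time.perf_counter()
tf.keras.models.load_model("{path}")
elapsed = time.perf_counter() - start
rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
print("{path}: %.2fs, %.0f MB RSS" % (elapsed, rss_mb))
"""

for path in ("internal_activation", "external_activation"):
    subprocess.run([sys.executable, "-c", SNIPPET.format(path=path)], check=True)
```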
The models might be functionally the same, but they don’t behave the same way! As we can see, the model with the Internal Activation:
- used >100 MB less memory than External Activation,
- and loaded ~20% faster.
Note that the difference becomes even more pronounced when taking into account the startup cost of a Python process that just loads TensorFlow: merely running import tensorflow as tf takes around 3.5 seconds and consumes roughly 400 MB of RAM.
Adjusting for this TensorFlow baseline consumption, the model using Internal Activation needs 38% less memory overall. Since loading TensorFlow is invariant and necessary for any model, we will always adjust for it in the comparisons that follow.
How can we explain this?
As the number and type of weights are the same for both models, the difference must lie in the complexity of the internal, low-level representation of the model.
Let’s recall that Keras is a high-level API. Keras code is thus compiled into a lower-level graph of operations (referred to as the tf.Graph and hereafter spelled as Graph with a capital “G”) before being run by TensorFlow.
Thus, different notations of the same model can result in different Graphs, with some being more efficient than others.
The Structure of TensorFlow Graphs
A TensorFlow Graph consists of interconnected nodes, each representing an operation. In TensorFlow terms, an operation can either be an operation on existing Tensors (like multiplication or addition) or the generation of a new Tensor, such as loading an input or a Constant.
Graphs are directed and acyclic, meaning that information flows from the input layer to the output layer with no cycles or loops in between. Furthermore, they are static: once a Graph is set up, the number and arrangement of its nodes do not change.
Let’s create a hypothetical Graph showcasing the computation of a simple polynomial: f(x,y) = 2x + xy + y²
Notice, however, that there is more than one way to represent this function.
It is easy to see how different Graphs can represent the same operation, some simpler than others. This is exactly what explains the performance gap we observed between our twin models above.
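To make this concrete, here is a small sketch that traces two equivalent formulations of the polynomial into Graphs and counts their nodes (the factored form x(2 + y) + y² needs one multiplication fewer than the direct one):

```python
import tensorflow as tf

spec = (tf.TensorSpec([], tf.float32), tf.TensorSpec([], tf.float32))

@tf.function
def f_naive(x, y):
    # Direct translation of 2x + xy + y²: one op per term, then two additions.
    return 2.0 * x + x * y + y * y

@tf.function
def f_factored(x, y):
    # Equivalent factored form x(2 + y) + y²: fewer multiplications.
    return x * (2.0 + y) + y * y

for name, fn in (("naive", f_naive), ("factored", f_factored)):
    graph = fn.get_concrete_function(*spec).graph
    print(name, len(graph.get_operations()), "nodes")
```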
The challenge is to figure out a way to get consistently better graphs, which would provide us with better model performance without compromising predictive power. To do so, let’s look at two methods we can use to consistently improve our Graphs.
Two Steps to Optimize TensorFlow Graphs
Part 1 — Converting Variables to Constants
As we can see in our polynomial, Tensors can be Variables or Constants, depending on whether their values change throughout the training process.
Constants are fixed attributes of the model, such as hard-coded values in the model’s code or numerical constants applied to the Tensors.
Variables can handle data that needs to be malleable and changed during the model execution (such as the weights). Since they have more complex behavior than Constants, they are more expensive to load and manipulate.
During inference, however, the weights no longer change, so this malleability becomes pure overhead. By converting them to Constants, we keep the exact same overall graph, but represented in a simpler, more memory-efficient way.
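A tiny illustration of the behavioral difference (the values are arbitrary):

```python
import tensorflow as tf

# A Variable carries mutable state plus the machinery to update it, which is
# what training needs; a Constant is just a frozen value baked into the Graph.
weights_var = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
weights_const = tf.constant([[1.0, 2.0], [3.0, 4.0]])

weights_var.assign_add(tf.ones((2, 2)))  # in-place updates only exist on Variables
print(tf.reduce_all(weights_var == weights_const + 1.0).numpy())  # True
```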
But before we go around converting everything to Constants, there is one more subtlety of TensorFlow to be aware of. Sometimes, a graph isn’t (just) a single graph.
Part 2 — Flattening subgraphs into a single, model-wide Graph
To better manage its internal logic, TensorFlow introduces nodes into the operation graph which serve as pointers to subgraphs. This means that, during training, a TensorFlow Graph will have many nested layers of graphs.
Again, this nesting is convenient for training the model but is no longer necessary for inference. Flattening, or inlining, the subgraphs into one all-encompassing graph actually results in better performance during inference.
What we need, then, is a way to both convert Variables into Constants and inline the subgraphs, yielding a simplified Graph that requires fewer computational resources.
Implementing Optimizations in TensorFlow 2
TensorFlow has the module tensorflow.python, which hosts a wide array of back-end functions necessary to support the higher-level APIs. While usually not exposed to users (it is not included in an import tensorflow as tf call), it hosts many valuable low-level functions that are very powerful when used correctly.
Inside it resides the function convert_variables_to_constants_v2. Despite its name, it does not only convert Variables to Constants: it also inlines the graph in the process. A single function call is thus able to apply both optimizations!
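Because the function is not re-exported under the public tf namespace, it has to be imported explicitly from the back-end module:

```python
# Lives in TensorFlow's non-public back-end module, so it must be
# imported explicitly rather than accessed through tf.*
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)
```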
Introduction to Concrete Functions
Since we’re dealing with a low-level TensorFlow function, we cannot work with Keras models. Thus, we need to first convert our models into the lower-level TensorFlow objects.
We can achieve this by wrapping the model call in a tf.function and then tracing it to obtain the ConcreteFunction for our given model (for more information, see this talk by TensorFlow's team).
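A sketch of that tracing step for a Keras model called model, reading the input signature off the model itself:

```python
import tensorflow as tf

# Wrap the Keras model's call in a tf.function...
full_model = tf.function(lambda x: model(x))

# ...and trace it once with the model's input signature to obtain the
# ConcreteFunction holding the compiled graph.
concrete_func = full_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype)
)
```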
A ConcreteFunction is simply a lower-level TensorFlow object that wraps the compiled graph. With it, we can not only access the graph itself via the .graph attribute, but also other helpful attributes and methods related to this graph, such as .inputs.
Finally, we can convert the Variables in this graph to Constants with a simple call to our previously imported convert_variables_to_constants_v2.
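Putting the pieces together, the conversion itself is a one-liner on the concrete_func traced above:

```python
# Freeze the graph: Variables become Constants and nested subgraphs are inlined.
frozen_func = convert_variables_to_constants_v2(concrete_func)

# The result is still a ConcreteFunction, now backed by a much simpler graph.
print(len(frozen_func.graph.get_operations()), "operations")
print([t.name for t in frozen_func.inputs])
```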
In fact, this optimization is also found under TensorFlow Lite, TensorFlow’s highly optimized implementation for mobile and edge devices. The steps shown above can replicate the same performance improvements while still being in TensorFlow core, thus keeping its full flexibility for development and especially deployment.
In our case, since BigQuery ML does not support TensorFlow Lite models as of today, it was critical that these optimizations work directly in TensorFlow Core, so we could keep supporting BQML.
The appendix at the end of the article explains how to work with Concrete Functions so you can implement these optimizations in your own model!
Effects on Performance and Memory Usage
Let’s go back to our example using the two different implementations for the same model. If our intuition is correct, converting Variables to Constants and Inlining their respective graphs should reduce the difference in performance.
By benchmarking both models after applying the function, we can see that they now perform the same. Finally, equal models show equal performance!
But how does the optimized model compare to the initial two?
The results are striking! Adjusted for the resources required to load TensorFlow itself, we achieve a 90% reduction in memory consumption as well as a 95% improvement in load times.
Importantly, this technique is not limited to sandbox environments: it translated perfectly to our production models!
With it, we reduced the memory consumption of our production models by up to 80%. This was a game-changer: it gave us free performance gains, reduced our cloud costs, and let us deploy to BigQuery ML even some of our biggest models, which previously exceeded the resource limits of the platform.
Conclusion
TensorFlow’s abstractions are truly amazing, enabling developers to build and use incredibly complex Deep Learning models and workflows in a simple way. However, as we’ve seen, some of these abstractions can hide great opportunities for optimization and performance improvement, especially in size-sensitive production deployments.
It also shows the power of tailoring your model to the different stages of the machine learning lifecycle, and how the specific conditions of each stage create constraints that can be turned into great performance improvements. In our case, the key insight was that inference does not need Variables inside the graph; we invite developers to further explore the particularities of their own use cases at a lower level and adapt TensorFlow to their needs.
References and Comments
We would also like to thank Hung-Ju Tsai, Anavai G. Ramesh, and Gandhimathi Anand from Intel, as well as Lei Mao, who wrote great articles pointing out the benefits of inlining graphs and converting variables to constants.
Their findings kickstarted our investigation into the main mechanisms of this behavior and led to the subsequent internal adoption of the approach described above.
The issue regarding the variability in performance among equivalent implementations of the same model has been raised with the Keras team; you can follow it on GitHub here.
[Appendix] Working with Concrete Functions
Since we are now working directly with Concrete Functions and not the Keras API, we will have slightly different workflows than usual.
Thankfully, for inference we no longer need to concern ourselves with most of these changes, except for executing the inference itself and saving and loading the model.
Saving with Concrete Functions
Saving a model uses a slightly different process than with Keras. We must pass the Concrete Function as a function attribute of a tf.Module and then apply tf.saved_model.save to that module itself to save it successfully.
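A sketch of that saving step; the attribute name serve is our own illustrative choice, not a requirement:

```python
import tensorflow as tf

# Attach the frozen ConcreteFunction to a tf.Module so that
# tf.saved_model.save has a trackable object to serialize.
module = tf.Module()
module.serve = frozen_func  # "serve" is an arbitrary attribute name
tf.saved_model.save(module, "frozen_model")
```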
Loading, however, is much simpler: it suffices to use the same saved_model API directly on the saved file.
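And the corresponding loading step, assuming the same serve attribute name used above:

```python
import tensorflow as tf

# tf.saved_model.load restores the module; the frozen function is available
# again under the attribute we gave it when saving.
restored = tf.saved_model.load("frozen_model")
frozen_func = restored.serve
```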
Notice that we did not need any further manipulation: our model behaves exactly the same as it did before converting to Constants, except that it is now significantly faster!
Inference with Concrete Functions
Executing the inference for a given input is a really simple process with Concrete Functions. We can use the __call__() method via the model(inputs) notation (like we did with the tf.function previously!).
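For example, with the illustrative 128-feature model from earlier:

```python
import tensorflow as tf

# Calling the ConcreteFunction directly runs the frozen graph.
# The output mirrors the traced function's outputs (a tensor, or a
# structure of tensors for multi-output models).
batch = tf.random.uniform((32, 128), dtype=tf.float32)
predictions = frozen_func(batch)
print(predictions)
```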
Overall, despite the less usual workflow, working with Concrete Functions is not that different from Keras models. While some adjustments are needed in very particular cases, the performance benefits are well worth the transition!