Memory Hygiene With TensorFlow During Model Training and Deployment for Inference

Tanveer Khan
Published in AI For Real · 5 min read · Mar 4, 2021

If you work with TensorFlow and want to share a GPU between multiple processes, then you have most likely run into one of two situations: a single process grabs the entire GPU memory, or a second process fails with an out-of-memory error.

Let’s look at the memory allocation for a TensorFlow-based model on a GPU:

This is the GPU memory detail before loading any TensorFlow-based workload.

It can be clearly observed that the GPU has 10 GB of memory, of which only 489 MB is occupied.

Initial GPU Memory Allocation Before Executing Any TF Based Process

Now let’s load a TensorFlow-based process.

We will load an object detection model deployed as a REST API via Flask running over Twisted.
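A minimal sketch of such a service, assuming a SavedModel on disk; the model path, route, and response format are illustrative placeholders rather than the exact code used here (which additionally runs Flask over Twisted):

import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)
# Hypothetical path; loading the model is what triggers the GPU allocation.
model = tf.saved_model.load('object_detection_model/')

@app.route('/detect', methods=['POST'])
def detect():
    # Decode the request body into a batched uint8 image tensor.
    image = tf.io.decode_image(request.data, channels=3)
    outputs = model(tf.expand_dims(image, axis=0))
    return jsonify({name: t.numpy().tolist() for name, t in outputs.items()})

if __name__ == '__main__':
    app.run(port=5000)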

You can see how quickly the entire GPU memory fills up as soon as the TensorFlow model is loaded:

Full Memory Allocation

So if we try to start another process, it will fail with an out-of-memory error. We will simply run the same process on another port.

Out of memory error

The video above clearly shows the out-of-memory error: TensorFlow aggressively occupies the full GPU memory even though it does not actually need it.

This is a greedy strategy adopted by TensorFlow to avoid memory fragmentation, but it makes GPU memory a bottleneck: only one process exclusively holds all of it.

Even though a model can run in far less memory, TensorFlow’s default settings often make it occupy far more than needed. This results in non-optimal usage, and often outright waste, of the GPU’s compute power.

If an optimal amount of memory is allocated to the TensorFlow process, it will occupy less GPU memory and the remaining memory can be shared by other processes.

So you can either train multiple models in one go or run multiple models for inference at the same time.

TensorFlow provides a few options to address this situation:

First, we need to add the lines below to list the GPU(s) you have.

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')

The first option lets us restrict TensorFlow to a specified amount of GPU memory. This way you can cap memory usage and share the GPU fairly between different processes. Using this option you define the optimal memory for your process, and it will use only that much.

Setting this optimal amount can be tricky. You can use tools like Weights & Biases and TensorBoard to look at the system graphs and come up with a number for your process. These tools generate very useful graphs detailing GPU computation, utilization, memory usage, and memory transfer.
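If you prefer to stay inside TensorFlow, a rough number can also be read back at runtime. A minimal sketch, assuming TF 2.5+ where tf.config.experimental.get_memory_info is available; run a few representative inference or training steps first, then inspect the peak:

import tensorflow as tf

# ... run a few representative inference/training steps here ...

info = tf.config.experimental.get_memory_info('GPU:0')
print('current: %.0f MiB, peak: %.0f MiB'
      % (info['current'] / 1024**2, info['peak'] / 1024**2))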

We will add the code below to our process and execute it:

if gpus:
    for gpu in gpus:
        tf.config.experimental.set_virtual_device_configuration(
            gpu,
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])

In the code above we set an upper bound of 4 GB on the GPU memory (memory_limit is given in MB). So when we start the process, it occupies only 4 GB of memory instead of the full amount.
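In recent TF 2.x releases the same cap is also exposed through a non-experimental API. A minimal sketch, assuming a TensorFlow version where tf.config.set_logical_device_configuration and tf.config.LogicalDeviceConfiguration are available; like the experimental variant, it must run before the GPUs are initialized:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        # Cap this process at 4 GB (the limit is given in MB).
        tf.config.set_logical_device_configuration(
            gpu,
            [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])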

Let’s look at the nvidia-smi screenshot of memory usage after executing our process, which is bound to use only 4 GB of memory. It can be observed that only 4 GB of memory is used, instead of the 8 GB shown in the first video.

GPU Memory Allocated After Starting One Process with 4 GB Memory Allocation

Now let’s execute the other process, which previously hit the out-of-memory error.

Two processes running simultaneously within their allocated memory

Attached is the screen grab of the whole process:

We have clearly seen that with this option we can allocate/override GPU memory for the TensorFlow process and share GPU resources optimally between teams or processes.

There is one more option: setting memory growth. This option initially allocates a small amount of memory and keeps allocating more as needed. The caveat is that once memory is allocated, it is not released, to avoid memory fragmentation. So even if your process starts with little memory, it will gradually acquire more, but it will not give memory back when it no longer needs it. In my view, wherever possible we should use the first option.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
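The same growth behavior can also be requested without touching the code, via the TF_FORCE_GPU_ALLOW_GROWTH environment variable documented in the TensorFlow GPU guide. It has to be set before TensorFlow is imported, for example:

import os
# Must be set before the tensorflow import, or it has no effect.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf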

Let’s execute the process one more time and look at the memory allocation.

GPU Memory Allocation Before Executing A Process With Memory Growth Set

The process is executed, and we can see that the full memory is NOT allocated; only a few MB are allocated:

Memory Allocated After Executing Process

Now we will put some load on the GPU. To add load, we will make 100 simultaneous REST calls for object detection. These requests go to the REST server, put load on the GPU, and cause GPU memory usage to increase.
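A rough sketch of such a load test, firing the 100 calls from a thread pool; the endpoint URL and the sample image are placeholders, not the actual service used here:

import concurrent.futures
import requests

URL = 'http://localhost:5000/detect'  # hypothetical endpoint

def call_once(_):
    with open('sample.jpg', 'rb') as f:
        return requests.post(URL, data=f.read()).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
    statuses = list(pool.map(call_once, range(100)))

print(statuses.count(200), 'of 100 requests succeeded')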

In the screen grab above we can see the memory increase gradually under load, but it never occupies the full memory. However, the memory is not released even after the load on the GPU is gone (the REST service is still running, but the inference load is complete):

Memory Is Occupied Even After Load On GPU Is Finished

Conclusion

  1. A TensorFlow process by default acquires the full memory of the GPU, even though it does not need it. Even if you build a small neural network, it will acquire the complete GPU memory. This can lead to inefficient GPU utilization if your workload is not heavy.
  2. There are two settings with which you can control the memory acquired by a TensorFlow process.
  3. Specify the exact memory you want to allocate to your process. This requires tuning and experimentation to arrive at the correct number. One can refer to the GPU system graphs, which are available in Weights & Biases.
Weights & Biases Graph For Process GPU Utilisation and Memory Allocation
Weights & Biases Graph For Overall GPU Utilisation and Memory Allocation

  4. You can specify the memory growth option. This allocates a small amount of memory to the process and increases the allocation as the load grows, but it will not release the acquired memory even when the load is complete.

Tanveer Khan
AI For Real

Sr. Data Scientist with strong hands-on experience in building Real World Artificial Intelligence Based Solutions using NLP, Computer Vision and Edge Devices.