Keras shoot-out, part 2: a deeper look at memory usage

In a previous article, I used Apache MXNet and Tensorflow as Keras backends to learn the CIFAR-10 dataset on multiple GPUs.

One of the striking differences was memory usage. Whereas MXNet allocated a conservative 670MB on each GPU, Tensorflow allocated close to 100% of available memory (a tad under 11GB).

I was a little shocked by this state of affairs (must be the old-school embedded software developer in me). The model and data set (respectively ResNet-50 and CIFAR-10) didn’t seem to require that much memory after all. Diving a little deeper, I learned that this is indeed the default behaviour in Tensorflow: claim nearly all available GPU memory up front, the stated goal being to reduce memory fragmentation. Fair enough :)

Still, a fact is a fact: in this particular setup, MXNet is faster AND more memory-efficient. I couldn’t help but wonder how Tensorflow would behave if I constrained its memory usage. Let’s find out, shall we?

Tensorflow settings

As a number of folks pointed out, you can easily restrict the number of GPUs that Tensorflow uses, as well as the fraction of GPU memory that it allocates (a float value between 0 and 1). Additional information is available in the Tensorflow documentation.

Just take a look at the example below.
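
Here is a minimal sketch of such a configuration, assuming the Tensorflow 1.x API and the standalone Keras package with the Tensorflow backend. The helper names (gpu_settings, apply_to_keras) and the 0.4 fraction are illustrative choices, not taken from the original script.

```python
def gpu_settings(memory_fraction, visible_gpus="0"):
    """Validate the two knobs and return them as GPUOptions keyword arguments."""
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    return {
        # Fraction of each visible GPU's memory that the process may allocate.
        "per_process_gpu_memory_fraction": memory_fraction,
        # Comma-separated GPU ids that Tensorflow is allowed to see, e.g. "0" or "0,1".
        "visible_device_list": visible_gpus,
    }


def apply_to_keras(settings):
    """Make Keras use a Tensorflow session built with these GPU settings."""
    # Imported lazily so that gpu_settings() works without Tensorflow installed.
    import tensorflow as tf          # assumption: Tensorflow 1.x
    from keras import backend as K   # assumption: standalone Keras, Tensorflow backend

    config = tf.ConfigProto(gpu_options=tf.GPUOptions(**settings))
    K.set_session(tf.Session(config=config))


# Call this once, before building the model, for example:
# apply_to_keras(gpu_settings(0.4, visible_gpus="0"))
```

The session has to be installed before any model is created, otherwise Keras will already have opened a default session with the default memory behaviour.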

With this in mind, let’s start restricting memory usage. I’m curious to find out how low we can actually go and if there’s any consequence on training time.

Test setup

I’ll run the same script as in the previous article (keras/examples/cifar10_resnet50.py), with the following parameters:

  • 1 GPU on a p2.8xlarge instance,
  • batch size set to 32,
  • no data augmentation,
  • increasingly harsh memory usage constraints: none, 0.8, 0.6, 0.4, 0.2 and lower… if we can!

Our reference point will be MXNet: 658MB of allocated memory, 155 seconds per epoch.

Test results

After a little while, here are the results: allocated GPU memory and time per epoch for each memory fraction.

  • No restriction: 10938MB, 211 seconds.
  • 0.8: 9282MB, 211 seconds.
  • 0.6: 6994MB, 211 seconds.
  • 0.4: 4706MB, 211 seconds.
  • 0.2: 2418MB, 211 seconds.
  • 0.1: 1274MB, 212 seconds.
  • 0.08: 1045MB, 212 seconds.
  • 0.06: 816MB, 211 seconds.
  • 0.05: 702MB, 211 seconds.
  • 0.045: fails with an out-of-memory (OOM) error.
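
The memory figures above look almost perfectly linear in the fraction. The sketch below checks that with a plain least-squares fit in pure Python. The data points are copied from the list above; the unrestricted run and the failed 0.045 run are left out because they have no usable (fraction, memory) pair.

```python
# (memory fraction, allocated MB) pairs copied from the results above.
results = [
    (0.8, 9282), (0.6, 6994), (0.4, 4706), (0.2, 2418),
    (0.1, 1274), (0.08, 1045), (0.06, 816), (0.05, 702),
]


def least_squares(points):
    """Return (slope, intercept) of the best-fit line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(results)
# Largest gap between the fitted line and any measured value, in MB.
worst = max(abs(slope * x + intercept - y) for x, y in results)
print(f"allocated MB ~= {slope:.0f} * fraction + {intercept:.0f} "
      f"(worst error {worst:.1f} MB)")
```

On these numbers the fit comes out at roughly 11,440MB per unit of fraction plus about 130MB of fixed overhead, with every measurement within 1MB of the line. That would be consistent with the fraction scaling the GPU’s total memory on top of a constant baseline, although that reading is a guess from eight data points, not something confirmed by the Tensorflow documentation.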

Conclusion

Again, this is a single test and YMMV. Still, a few remarks.

By default, Tensorflow allocates as much memory as possible, but more memory doesn’t mean faster: epoch time stayed at 211 to 212 seconds at every setting. So why behave like a hog in the first place? Especially since Tensorflow can actually get down to a memory footprint similar to MXNet’s (702MB vs. 658MB here), although finding that setting is really a trial and error process.
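
One setting that may cut down on the trial and error: Tensorflow’s GPU options also include an allow_growth flag, which asks it to start with a small allocation and extend it as needed instead of claiming everything up front. The sketch below assumes the Tensorflow 1.x API; the helper names are hypothetical. This option was not part of the tests above, so its effect on epoch time in this setup is unknown.

```python
def on_demand_gpu_options(memory_cap=None):
    """Return GPUOptions keyword arguments for grow-as-needed allocation.

    memory_cap is an optional fraction in (0, 1] used as an upper bound.
    """
    options = {"allow_growth": True}
    if memory_cap is not None:
        if not 0.0 < memory_cap <= 1.0:
            raise ValueError("memory_cap must be in (0, 1]")
        options["per_process_gpu_memory_fraction"] = memory_cap
    return options


def make_session(options):
    """Build a Tensorflow session that uses the given GPU options."""
    # Imported lazily so that the helper above works without Tensorflow installed.
    import tensorflow as tf  # assumption: Tensorflow 1.x
    return tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(**options)))


# Example (requires Tensorflow and a GPU):
# session = make_session(on_demand_gpu_options(memory_cap=0.2))
```

Note that with allow_growth, memory that has been claimed is not handed back while the process runs, so the reported usage reflects the peak rather than the current need.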

This behaviour still raises a lot of questions that trouble my restless mind :)

  1. What about very large models? Would they run out of memory and would I need to tweak the memory setting to make them fit?
  2. What about CPU training? It’s possible to limit the number of cores used by Tensorflow, but I couldn’t find any way to limit RAM usage (please correct me if I’m wrong).
  3. What about inference? Is Tensorflow memory usage just as “liberal”? Would this be a problem for constrained devices like my beloved Raspberry Pi?

Oh boy. More questions than when I started. Typical :) I’ll have to investigate!

All in all, I guess I’m more comfortable with a library like MXNet that allocates memory as needed and gives me a clear view of how much is left, what the impact is when parameters are tweaked, etc.

Call it personal preference. And of course, MXNet is quite a bit faster too: 155 vs. 211 seconds per epoch in this test.

Thanks for reading. Stay tuned for more articles!