Dealing with memory leak issue in Keras model training

Anuj Arora · Published in Dive into ML/AI · Dec 3, 2020 · 2 min read


Recently, I was trying to train my Keras (v2.4.3) model with the tensorflow-gpu (v2.2.0) backend on an NVIDIA Tesla V100-DGXS-32GB. When the model was trained for a large number of epochs, I observed a memory build-up / leak: as training progressed, the job consumed more and more memory until none was left, crashing the job or the system.

One look over the internet and it was clear that this problem had been around for some time. Some users linked the issue to model.predict(), which I had included in my callbacks. In the same discussion, a suggested solution was:

  • Instead of passing an np.array to model.predict(), pass a tensor by using tf.convert_to_tensor() (a minimal sketch follows this list). The associated explanation mentions that,

for loop with a numpy input creates a new graph every iteration because the numpy array is created with a different signature. Converting the numpy array to a tensor maintains the same signature and avoids creating new graphs.

  • Going by the explanation above, another proposed solution was to replace model.predict() with model.predict_on_batch().
  • I also tried cloning the trained model using keras.models.clone_model(model) and running prediction on the clone, as in cloned_model.predict(). After the predict step, I would delete the cloned model, hoping that this would handle the memory build-up. Both of these alternatives are also sketched after the list.
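
To make the first suggestion concrete, here is a minimal sketch of how the conversion could sit inside a prediction callback. The callback class, the x_val variable and the custom-metric step are my own illustration, not code from the linked discussion:

```python
import tensorflow as tf
from tensorflow import keras

class PredictionCallback(keras.callbacks.Callback):
    """Hypothetical callback that runs model.predict() after every epoch."""

    def __init__(self, x_val):
        super().__init__()
        # Convert the numpy array to a tensor once, outside the training
        # loop, so every predict() call sees the same input signature and
        # TensorFlow does not build a new graph on each epoch.
        self.x_val = tf.convert_to_tensor(x_val, dtype=tf.float32)

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.x_val)
        # ... compute whatever custom metric the callback exists for ...

# model.fit(x_train, y_train, epochs=1000,
#           callbacks=[PredictionCallback(x_val)])
```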
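
Under the same assumptions (model and x_val defined as above), the other two workarounds look roughly like this. Note that keras.models.clone_model() copies only the architecture, so the weights have to be transferred explicitly:

```python
import gc
from tensorflow import keras

# Option 1: predict_on_batch() runs the forward pass directly on a single
# batch, skipping some of the per-call setup that model.predict() performs.
preds = model.predict_on_batch(x_val)

# Option 2: predict on a throw-away clone of the trained model, then delete
# the clone and force garbage collection, hoping the build-up goes with it.
cloned_model = keras.models.clone_model(model)
cloned_model.set_weights(model.get_weights())
preds = cloned_model.predict(x_val)
del cloned_model
gc.collect()
```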
