Dump Keras-ImageDataGenerator. Start Using TensorFlow-tf.data (Part 1)

Stop using Keras-ImageDataGenerator because…

Sunny Chugh
The Startup
4 min read · Jul 28, 2020


Why did I stop using Keras' ImageDataGenerator? Simply because it is slow: in fact, about 5 times slower than TensorFlow's tf.data when loading images.

Image classification and object detection are among the problems where Artificial Intelligence has become extremely good (of course, there is always room for improvement). TensorFlow, Keras, and PyTorch, among other open-source machine learning frameworks, are extensively used by the research community to train their own models or to use freely available pre-trained models.

Photo by the author, from a YouTube video

The first step in any machine learning problem is to have a good, clean dataset, and then to load this dataset to train your model. In this article, I will discuss two different ways to load an image dataset, using Keras or TensorFlow (tf.data), and show the performance difference between them.

Comparison (in advance)

** The cache variable will be shown/defined later in the code.

Here, I have shown a comparison of how many images per second are loaded by Keras' ImageDataGenerator and TensorFlow's tf.data (using three different settings of the built-in cache variable, as shown in the table above). The results were obtained on a workstation running Ubuntu 20.04 with 16 GB RAM and a 2.80 GHz Core i7. The dataset was downloaded from Kaggle (dogs_and_cats) and has 10,000 images in total (train: 8,000, validation: 2,000).

  • When using cache=True , tf.data loads approx. 2511 images/sec, compared to 479 images/sec with Keras.ImageDataGenerator. This shows tf.data is more than 5 times faster than Keras.ImageDataGenerator.
  • When using cache=False , tf.data is approx. 2 times faster than Keras.ImageDataGenerator.
  • When using cache='some_path.tfcache' , tf.data writes a cache file to your disk the first time you run the code, which is why the first run is slower. Once the cache file exists, tf.data loads images quickly on every subsequent iteration (approx. 4 times faster).

I personally prefer cache='some_path.tfcache'. Although its performance on the above dataset is poorer than cache=True, I have tested it on bigger datasets (gigabytes or terabytes) and found that dumping the dataset to disk performs better than cache=True, which must hold everything in RAM. In any machine learning problem we have to optimize the model parameters, so even if cache='some_path.tfcache' is slower the first time, it saves a lot of time during the second, third, and later iterations/epochs when playing with different hyperparameters and models.

Having compared the performance in advance, I will now show you the code, which you can use to test and check the performance difference yourself.

1. Keras- ImageDataGenerator

Keras provides an easy-to-use function with which we can apply various kinds of augmentations to the images, including scaling, rotation, zoom, flips, etc., in just one line of code. Further, .flow_from_directory() is used to generate batches of image data (and their labels) directly from our .jpg files in their respective directories.
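A minimal sketch of this setup (the augmentation values, image size, and batch size are illustrative, and the tiny random JPEGs stand in for the real dogs-and-cats folders):

```python
import os
import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in for the real dataset: a few random JPEGs per class subdirectory.
root = "tiny_train"
for cls in ("cats", "dogs"):
    os.makedirs(os.path.join(root, cls), exist_ok=True)
    for i in range(4):
        arr = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
        Image.fromarray(arr).save(os.path.join(root, cls, f"{i}.jpg"))

# One line configures several augmentations at once.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,     # scale pixel values to [0, 1]
    rotation_range=20,     # random rotations up to 20 degrees
    zoom_range=0.2,        # random zoom in/out
    horizontal_flip=True,  # random horizontal flips
)

# .flow_from_directory() yields (images, labels) batches straight from disk,
# inferring the class label from each subdirectory name.
train_gen = datagen.flow_from_directory(
    root,
    target_size=(160, 160),  # resize every image on load
    batch_size=4,
    class_mode="binary",     # two classes: cats vs. dogs
)

images, labels = next(train_gen)
print(images.shape)  # (4, 160, 160, 3)
```

Each call to the generator reads and augments the next batch from disk, which is exactly the per-image Python overhead that makes this approach slower than tf.data.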

2. TensorFlow- tf.data

Here, we will write our own input pipeline from scratch using tf.data. The code below might look overwhelming at first. But once understood, the same code can be used for different dataset problems with minimal changes.

  • .prefetch() the batches of data in the background while the model is training.
  • .cache() keeps the images in memory after they're loaded off disk during the first epoch. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache. For more details, check the tf.data performance guide in the TensorFlow documentation.
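A sketch of such a pipeline under stated assumptions (the directory layout, image size, and batch size are placeholders, and the dummy JPEGs stand in for the real dataset):

```python
import os
import tensorflow as tf

# Stand-in data: a few random JPEGs per class, mimicking cats/ and dogs/ folders.
root = "tiny_tfdata"
for cls in ("cats", "dogs"):
    os.makedirs(os.path.join(root, cls), exist_ok=True)
    for i in range(4):
        img = tf.random.uniform((64, 64, 3), maxval=256, dtype=tf.int32)
        tf.io.write_file(
            os.path.join(root, cls, f"{i}.jpg"),
            tf.io.encode_jpeg(tf.cast(img, tf.uint8)),
        )

CLASS_NAMES = tf.constant(["cats", "dogs"])

def process_path(path):
    # The label comes from the parent directory name.
    parts = tf.strings.split(path, os.sep)
    label = tf.argmax(tf.cast(parts[-2] == CLASS_NAMES, tf.int32))
    # Decode, resize, and scale the image.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, [160, 160]) / 255.0
    return img, label

ds = (
    tf.data.Dataset.list_files(os.path.join(root, "*", "*.jpg"))
    .map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # or .cache("some_path.tfcache") for an on-disk cache
    .shuffle(buffer_size=100)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)  # overlap loading with training
)

images, labels = next(iter(ds))
print(images.shape)  # (4, 160, 160, 3)
```

Swapping .cache() for .cache("some_path.tfcache") is the only change needed to test the on-disk caching case from the comparison above.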

How to check the loading time for images

This function loads the images in batches (using iterators) and reports how many images are loaded per second. All the images are processed in steps=1000 batches by default (you can increase or decrease this number).
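A minimal sketch of such a timing helper for a tf.data pipeline (the helper name, the steps default, and the synthetic dataset are illustrative; a Keras generator can be timed the same way by pulling batches with next() in a loop):

```python
import time
import tensorflow as tf

def benchmark(dataset, steps=1000):
    """Iterate `steps` batches and report images loaded per second."""
    start = time.perf_counter()
    num_images = 0
    for i, (images, labels) in enumerate(dataset.repeat()):
        num_images += int(images.shape[0])
        if i + 1 >= steps:
            break
    elapsed = time.perf_counter() - start
    print(f"{num_images / elapsed:.0f} images/sec over {steps} steps")
    return num_images / elapsed

# Usage with any batched dataset (a small synthetic one here):
ds = tf.data.Dataset.from_tensor_slices(
    (tf.zeros((64, 8, 8, 3)), tf.zeros((64,), dtype=tf.int64))
).batch(8)
rate = benchmark(ds, steps=50)
```

Running this once on the Keras generator and once on each tf.data cache setting reproduces the comparison table above.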

All the Code together

The dog_vs_cat dataset has train and val folders, each further divided into the respective class directories as shown, holding 10,000 pictures of cats and dogs in total.

Image by the author
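Reconstructed from the description above (folder names follow the text; the per-folder counts assume an even split of the 8,000 training and 2,000 validation images):

```
dog_vs_cat/
├── train/
│   ├── cats/   (~4,000 images)
│   └── dogs/   (~4,000 images)
└── val/
    ├── cats/   (~1,000 images)
    └── dogs/   (~1,000 images)
```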

This code has been tested with TensorFlow 2.x, and it shows that tf.data is up to 5 times quicker than Keras.ImageDataGenerator at loading images.

P.S. In the next part, I will compare the actual training times of tf.data and Keras.ImageDataGenerator using a MobileNet model running for 5 epochs on a GPU. Stay tuned…


Research Fellow @ City, University of London | PhD in Machine Learning for Photonics Applications