Dump Keras-ImageDataGenerator. Start Using TensorFlow-tf.data (Part 1)
Stop using Keras-ImageDataGenerator because…
Why did I stop using Keras ImageDataGenerator? Simply because it is slow: in my tests it was about 5 times slower than TensorFlow's tf.data at loading images.
Image classification and object detection are a few of the problems where Artificial Intelligence has become extremely good (of course, there is always room for improvement). TensorFlow, Keras, PyTorch, and other open-source machine learning frameworks are used extensively by the research community to train new models or to apply freely available pre-trained ones.
The first step in any machine learning problem is to have a good, clean dataset and then load it to train your model. In this article, I will discuss two different ways to load an image dataset, using Keras or TensorFlow (tf.data), and show the performance difference.
Comparison (in advance)
** The cache variable is defined later in the code.
Here, I have shown a comparison of how many images per second are loaded by Keras.ImageDataGenerator and by TensorFlow's tf.data (using 3 different settings of the cache variable, as shown in the above table). The results were measured on a workstation running Ubuntu 20.04 with 16 GB of RAM and a 2.80 GHz Core i7 processor. The dataset was downloaded from Kaggle (dogs_and_cats) and has 10000 images in total (train: 8000, validation: 2000).
- When using cache=True, tf.data loads approx. 2511 images/sec compared with 479 images/sec for Keras.ImageDataGenerator. This shows tf.data is more than 5 times faster than Keras.ImageDataGenerator.
- When using cache=False, tf.data is approx. 2 times faster than Keras.ImageDataGenerator.
- When using cache='some_path.tfcache', the first time you run the code tf.data writes a dump of the dataset to a cache file on disk, which is why the first run is slower. Once the dump exists, tf.data loads images quickly in every later iteration (approx. 4 times faster).
I personally prefer cache='some_path.tfcache', even though with the above dataset it performs worse than cache=True. I have tested it on bigger datasets (gigabytes to terabytes) and found that dumping the dataset to a cache file is quicker than using cache=True. In any machine learning problem we have to optimize the model parameters, so even if cache='some_path.tfcache' is slower the first time, it saves a lot of time during the second, third, and later iterations/epochs when experimenting with different hyperparameters and models.
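The three cache settings compared above could be wired into a pipeline roughly as follows. This is only a minimal sketch: the helper name apply_cache is hypothetical, and it assumes the cache variable is either a boolean or a file path string, as described above. A small range dataset stands in for the images.

```python
import os
import tempfile

import tensorflow as tf


def apply_cache(ds, cache=True):
    """Hypothetical helper: apply one of the three cache settings."""
    if isinstance(cache, str):
        # A string is treated as a file path: elements are cached on disk.
        return ds.cache(cache)
    if cache:
        # True: elements are kept in memory after the first full pass.
        return ds.cache()
    # False: no caching, every pass re-reads the source data.
    return ds


# Stand-in dataset; in the article this would be the decoded images.
ds = tf.data.Dataset.range(5)
cache_path = os.path.join(tempfile.mkdtemp(), "some_path.tfcache")

in_memory = apply_cache(ds, cache=True)
no_cache = apply_cache(ds, cache=False)
on_disk = apply_cache(ds, cache=cache_path)
```

The on-disk cache is only complete after the dataset has been iterated once from start to end, which matches the slower first run noted above.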
Having compared the performance in advance, I will now go through the code that you can use to test and check the performance difference yourself.
1. Keras- ImageDataGenerator
Keras provides an easy-to-use class, ImageDataGenerator, with which we can apply various kinds of augmentation to the images, including scaling, rotation, zoom, flips, etc., in just one line of code. Further, .flow_from_directory() is used to generate batches of image data (and their labels) directly from our jpgs in their respective directories.
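A minimal sketch of this setup is shown below. It is not the exact code behind the benchmark: the folder names cats and dogs, the image sizes, and the augmentation values are assumptions, and a handful of random-noise JPEGs stand in for the Kaggle download so the snippet runs on its own.

```python
import os
import tempfile

import tensorflow as tf

ImageDataGenerator = tf.keras.preprocessing.image.ImageDataGenerator

# Tiny stand-in dataset (assumed layout: train/cats and train/dogs).
# Point data_dir at your real train folder instead.
data_dir = os.path.join(tempfile.mkdtemp(), "train")
for class_name in ("cats", "dogs"):
    os.makedirs(os.path.join(data_dir, class_name))
    for i in range(4):
        noise = tf.random.uniform((64, 64, 3), maxval=256, dtype=tf.int32)
        path = os.path.join(data_dir, class_name, f"{class_name}_{i}.jpg")
        tf.io.write_file(path, tf.io.encode_jpeg(tf.cast(noise, tf.uint8)))

# One call configures rescaling and the augmentations.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    zoom_range=0.2,
    horizontal_flip=True,
)

# Labels are inferred from the sub-directory names.
train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(32, 32),
    batch_size=4,
    class_mode="binary",
)
```

Calling next(train_generator) then returns one batch of images with their labels. That step needs Pillow to decode the files and SciPy for the rotation and zoom transforms.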
2. TensorFlow- tf.data
Here, we will write our own input pipeline from scratch using tf.data. The tf.data code might look overwhelming at first, but once understood, the same code can be reused for different datasets and problems with minimal changes.
- .prefetch() prepares the next batches of data in the background while the model is training.
- .cache() keeps the images in memory after they're loaded off disk during the first epoch. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache. For more details check the link.
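As a rough illustration, a tf.data pipeline that uses both methods might look like the sketch below. It follows the general pattern of the TensorFlow image-loading guide rather than reproducing the benchmark code. The names CLASS_NAMES, process_path, and build_dataset, the folder names, and the image sizes are assumptions, and random-noise JPEGs again stand in for the real dataset.

```python
import os
import tempfile

import tensorflow as tf

# Tiny stand-in dataset (assumed layout: train/cats and train/dogs).
data_dir = os.path.join(tempfile.mkdtemp(), "train")
CLASS_NAMES = tf.constant(["cats", "dogs"])
for class_name in ("cats", "dogs"):
    os.makedirs(os.path.join(data_dir, class_name))
    for i in range(4):
        noise = tf.random.uniform((64, 64, 3), maxval=256, dtype=tf.int32)
        path = os.path.join(data_dir, class_name, f"{class_name}_{i}.jpg")
        tf.io.write_file(path, tf.io.encode_jpeg(tf.cast(noise, tf.uint8)))


def process_path(file_path):
    """Map a file path to an (image, integer label) pair."""
    # The label is the name of the parent directory.
    parts = tf.strings.split(file_path, os.path.sep)
    label = tf.argmax(tf.cast(tf.equal(parts[-2], CLASS_NAMES), tf.int32))
    image = tf.io.decode_jpeg(tf.io.read_file(file_path), channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)  # scale to [0, 1]
    image = tf.image.resize(image, [32, 32])
    return image, label


def build_dataset(directory, batch_size=4):
    files = tf.data.Dataset.list_files(os.path.join(directory, "*", "*.jpg"))
    ds = files.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.cache()  # in memory; pass a file path to cache on disk instead
    ds = ds.shuffle(buffer_size=1000).batch(batch_size)
    # Prepare upcoming batches in the background while the model trains.
    return ds.prefetch(tf.data.AUTOTUNE)


train_ds = build_dataset(data_dir)
```

Placing .cache() after .map() means the decoded and resized images are cached, so later epochs skip both the file reads and the decoding.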
How to check the loading time for images
The timing function loads the images in batches (using iterators) and reports how many images are loaded per second. By default it iterates over steps=1000 batches (you can increase or decrease this number).
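A timing helper along these lines could be sketched as follows. The name timeit and its signature are assumptions rather than the original function, and a dummy generator stands in for a real loader so the snippet runs without any dataset.

```python
import time


def timeit(batches, batch_size, steps=1000):
    """Pull `steps` batches from an iterable and return images per second."""
    iterator = iter(batches)
    start = time.perf_counter()
    for _ in range(steps):
        next(iterator)
    duration = time.perf_counter() - start
    images_per_sec = batch_size * steps / duration
    print(f"{steps} batches in {duration:.2f} s: {images_per_sec:.1f} images/s")
    return images_per_sec


def fake_batches(batch_size):
    """Stand-in for a real loader: yields dummy batches forever."""
    while True:
        yield [0] * batch_size


rate = timeit(fake_batches(32), batch_size=32, steps=1000)
```

The same helper accepts a Keras generator or a tf.data dataset, since both are iterable. A tf.data dataset should be given .repeat() first so it does not run out before the requested number of steps, while the generator returned by flow_from_directory loops indefinitely on its own.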
All the Code together
The dog_vs_cat dataset has train and val folders, each further divided into the respective class directories as shown, with 10000 pictures of cats and dogs in total.
This code has been tested with TensorFlow 2.x, and it shows that tf.data is about 5 times quicker than Keras.ImageDataGenerator at loading images.
P.S. In the next part, I will compare the actual training times of tf.data and Keras.ImageDataGenerator using a MobileNet model running for 5 epochs on a GPU. Stay tuned…