A guide to building a fast input pipeline: the tf.data API vs ImageDataGenerator

tf.data vs ImageDataGenerator

Rohit Kumar
2 min read · Apr 23, 2020

As per the documentation, the tf.data API enables you to build complex input pipelines from simple, reusable pieces. It makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

But wait!

Why do we need to learn and use a new API if our old ImageDataGenerator was doing fine?

Let's take a closer look at ImageDataGenerator.

It is a class that is used to generate batches of tensor image data with real-time data augmentation.

It provides methods such as flow and fit, but the most common one, and the one used most extensively, is flow_from_directory.

The flow_from_directory method takes the path to a directory and generates batches of augmented data on the fly.

The code would look like
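The original snippet is not reproduced here, so the following is only a minimal sketch of a typical flow_from_directory setup. The helper name make_generator, the image size, the batch size and the chosen augmentation options are illustrative assumptions, not the author's exact code. It expects one sub-folder per class inside data_dir.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # assumed target size, not taken from the article
BATCH_SIZE = 32        # assumed batch size


def make_generator(data_dir):
    """Return a Keras iterator that yields (images, one-hot labels) batches.

    Expects the layout data_dir/<class_name>/<image files>.
    """
    datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1.0 / 255,     # scale pixel values to [0, 1]
        horizontal_flip=True,  # one example of real-time augmentation
    )
    return datagen.flow_from_directory(
        data_dir,
        target_size=IMG_SIZE,
        batch_size=BATCH_SIZE,
        class_mode="categorical",
        shuffle=True,
    )
```

Calling `images, labels = next(make_generator("path/to/images"))` would then return one augmented batch; the path is a placeholder.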

Straightforward and simple, but very slow. Want proof? Read on.

Implementing it using the tf.data API

The tf.data.Dataset API supports writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:

  1. Create a source dataset from your input data.
  2. Apply dataset transformations to preprocess the data.
  3. Iterate over the dataset and process the elements.

Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.

Code would look like

Step 1 Create a source dataset from input data
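As a rough sketch of this step, assuming the same one-folder-per-class layout as before, the source dataset can simply be a dataset of file paths. The helper name list_image_files and the "*/*.jpg" glob pattern are assumptions made for illustration.

```python
import pathlib

import tensorflow as tf


def list_image_files(data_dir, pattern="*/*.jpg"):
    """Step 1: build a dataset of file-path strings and collect class names."""
    data_dir = pathlib.Path(data_dir)
    # Class names are taken from the sub-folder names, sorted for stable label ids.
    class_names = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
    # list_files globs the pattern and shuffles the matching paths.
    files_ds = tf.data.Dataset.list_files(str(data_dir / pattern), shuffle=True)
    return files_ds, class_names
```

At this point nothing has been read from disk except the directory listing; each element of files_ds is just a string tensor holding a path.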

Step 2 Apply dataset transformations to preprocess the data
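A possible version of this step is sketched below: each path is mapped to a decoded, resized image and an integer label derived from the parent folder name, then the dataset is batched and prefetched. The names load_image and prepare, the image size and the batch size are assumptions, and the sketch assumes JPEG files. tf.data.AUTOTUNE is available from TensorFlow 2.4; older versions expose it as tf.data.experimental.AUTOTUNE.

```python
import os

import tensorflow as tf

IMG_SIZE = (224, 224)  # assumed target size
BATCH_SIZE = 32        # assumed batch size
AUTOTUNE = tf.data.AUTOTUNE


def load_image(file_path, class_names):
    """Map a file path to an (image, integer label) pair."""
    # The class is the name of the folder that contains the file.
    parts = tf.strings.split(file_path, os.path.sep)
    label = tf.argmax(tf.cast(tf.equal(parts[-2], class_names), tf.int32))
    image = tf.io.decode_jpeg(tf.io.read_file(file_path), channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)  # scales to [0, 1]
    image = tf.image.resize(image, IMG_SIZE)
    return image, label


def prepare(files_ds, class_names):
    """Step 2: shuffle paths, decode in parallel, batch and prefetch."""
    class_names = tf.constant(class_names)
    ds = files_ds.shuffle(buffer_size=1000)  # shuffling paths is cheap
    ds = ds.map(lambda p: load_image(p, class_names),
                num_parallel_calls=AUTOTUNE)
    return ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
```

Decoding with num_parallel_calls and overlapping loading with consumption through prefetch are the usual sources of tf.data's speed advantage.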

Step 3 Iterate over the dataset
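Iteration itself is an ordinary Python loop. The sketch below uses a small synthetic dataset as a stand-in for the real image pipeline so that it runs on its own; iterate_once and demo_ds are hypothetical names.

```python
import tensorflow as tf


def iterate_once(dataset):
    """Step 3: stream through the dataset once and count the images seen."""
    total = 0
    for images, labels in dataset:
        # Each element is one batch; only this batch is held in memory.
        total += int(images.shape[0])
    return total


# Stand-in dataset: 10 blank 8x8 RGB "images" with dummy labels, in batches of 4.
demo_ds = tf.data.Dataset.from_tensor_slices(
    (tf.zeros((10, 8, 8, 3)), tf.zeros((10,), tf.int64))
).batch(4)

print(iterate_once(demo_ds))  # prints 10
```

In a training script the loop is usually implicit, since a batched dataset can be passed directly to a Keras model's fit method.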

The code looks long, but it is self-explanatory.

But is it worth it? Let's check.

Image loading time for ImageDataGenerator
Image loading time for tf.data API

One can see that tf.data is around 10 times faster: about 6444 images/sec versus about 612 images/sec for ImageDataGenerator.

Not only is it faster, it also lets us load images of different formats and from complex folder structures.
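Throughput figures like these can be obtained by timing how long it takes to pull a fixed number of batches. The helper below is one hedged way to do that and is not necessarily how the numbers above were measured; the name images_per_second and the default of 100 batches are assumptions. It accepts any iterable of (images, labels) batches, so the same function works for a Keras generator and for a tf.data.Dataset.

```python
import itertools
import time


def images_per_second(batches, num_batches=100):
    """Consume up to num_batches (images, labels) batches; return images/sec."""
    n_images = 0
    start = time.perf_counter()
    # islice stops after num_batches, which matters because Keras
    # generators loop over the data forever.
    for images, _labels in itertools.islice(batches, num_batches):
        n_images += len(images)
    elapsed = time.perf_counter() - start
    return n_images / elapsed
```

For a fair comparison, both pipelines should use the same batch size and image size, and the first run should be treated with caution because it also includes warm-up and file-system caching effects.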

Deep folder structure

To learn how to load images when the folder structure is fairly deep or complex and the image types differ, as in the image above, you will have to wait for my next tutorial.

The GitHub link for the code is https://github.com/razerspeed/dataflow

Till then, happy coding!
