Creating custom data generator for training Deep Learning Models-Part 2

Anuj shah (Exploring Neurons)
8 min read · Sep 9, 2019


Chapter -1: What is a generator function in Python and the difference between yield and return

Chapter-2: Writing a generator function to read your data that can be fed for training an image classifier in Keras.

Chapter-3: Writing generator function for different kinds of inputs — multiple input or sequence of input.

Hello Readers! Welcome back to the second chapter, where we will be writing our own custom data generator to load image data for training an image classifier, rather than storing the entire data set in a list and then feeding it to the classifier for training.

If you already know what a generator function is and want to learn how to use it to create a custom data generator, you can proceed. However, if you don’t know what a generator function is, or want a refresher, quickly go through the first chapter — what is a generator function in Python.

In order to understand how to create different data generators for different kinds of input, I will be elucidating three input scenarios; you can use them as a reference and create generator functions as per your requirements. The three input scenarios are depicted below:

Scenario-1: when a single input is to be fed to the network
Scenario-2: when a sequence of inputs (e.g., video frames) is to be fed to the network
Scenario-3: when multiple inputs need to be fed to an ensemble of models

In this post, we will create the data generator only for the first scenario; in the subsequent posts, we will do it for the second and third scenarios.

To elucidate the first scenario, I will be using a subset of the flower data set from the Kaggle competition — https://www.kaggle.com/alxmamaev/flowers-recognition.

The data set consists of a total of 4326 images distributed over 5 flower categories — Daisy, Dandelion, Rose, Sunflower, and Tulip — as shown below.

One pre-processing step that can be done is renaming all the images in each folder in the appropriate order (e.g., from a gibberish name to daisy_000001.png). The picture below elucidates what I mean:

Renaming files from gibberish names to a sequential format (000001.png, 000002.png, and so on)

The script to do this is here — https://github.com/anujshah1003/useful-scripts-for-handling-data/tree/master/renaming_multiple_files_in_sequence and a video explaining the code is here — https://www.youtube.com/watch?v=g3djohbIK3Q&t=409s
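If you just want the gist of that script, here is a minimal sketch of the idea (the folder layout and the six-digit zero-padding are my assumptions; refer to the linked script for the exact code):

import os

# Rename every file in each class folder to <class>_<running index><original extension>
root = 'flowers'   # assumed folder with one sub-folder per flower category
for class_name in sorted(os.listdir(root)):
    class_dir = os.path.join(root, class_name)
    if not os.path.isdir(class_dir):
        continue
    for idx, fname in enumerate(sorted(os.listdir(class_dir)), start=1):
        ext = os.path.splitext(fname)[1]
        new_name = '{}_{:06d}{}'.format(class_name, idx, ext)
        os.rename(os.path.join(class_dir, fname), os.path.join(class_dir, new_name))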

The template which we will be following for creating a custom data generator is taken from this amazing blog — www.jessicayung.com/using-generators-in-python-to-train-machine-learning-models/

We are going to follow this template for creating any data generator. The samples list needs to be prepared as shown above: each element in the samples list consists of an input file name and its corresponding label. If we are reading a sequence of video frames for video classification, then our samples can be structured as

samples — [[[frame1_filename,frame2_filename,…],label1], [[frame1_filename,frame2_filename,…],label2],……….].

So you can use the above format to structure any kind of input while preparing your custom data generator.

For our data set, we need to create a CSV file containing all filenames and their corresponding labels, which can then be easily loaded to create our data generator for flower recognition. So, for that, I created two CSV files: flower_recognition_train.csv and flower_recognition_test.csv. The test set is created by randomly picking 60 samples from each flower category.

Read the data path and assign labels to each flower category
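A rough sketch of this step (the folder name and class ordering are my assumptions; the exact code is in the notebook linked at the end):

import os

data_root = 'flowers_renamed'               # assumed root folder of the renamed data set
class_names = sorted(os.listdir(data_root))  # ['daisy', 'dandelion', 'rose', 'sunflower', 'tulip']
# Map every flower category to an integer label, e.g. daisy -> 0, ..., tulip -> 4
labels_map = {name: idx for idx, name in enumerate(class_names)}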

Now create two data frames, one for the train CSV and one for the test CSV, each with three columns: FileName, Label, and ClassName.

Now loop over every file in each flower category, read each file, and exclude it if it is corrupted; otherwise, include it in the train or test data frame based on the random indices chosen for the test data.

Code to create the train and test data frames

Now save the train and test data frames as train and test CSV files, which you can later load to create the data generator.
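A minimal sketch of these steps — building the data frames, skipping corrupted files, holding out 60 random images per class, and saving to CSV — might look like this; the column names match the post, but the variable names and the corruption check via cv2.imread are my assumptions:

import os
import random
import cv2
import pandas as pd

data_root = 'flowers_renamed'   # same root folder and label mapping as in the sketch above
class_names = ['daisy', 'dandelion', 'rose', 'sunflower', 'tulip']
labels_map = {name: idx for idx, name in enumerate(class_names)}

rows_train, rows_test = [], []
for class_name in class_names:
    class_dir = os.path.join(data_root, class_name)
    files = sorted(os.listdir(class_dir))
    # Randomly hold out 60 files per category for the test split
    test_idx = set(random.sample(range(len(files)), 60))
    for i, fname in enumerate(files):
        file_path = os.path.join(data_root, class_name, fname)
        # Skip the file if OpenCV cannot decode it (treated as corrupted)
        if cv2.imread(file_path) is None:
            continue
        row = {'FileName': file_path,
               'Label': labels_map[class_name],
               'ClassName': class_name}
        (rows_test if i in test_idx else rows_train).append(row)

train_df = pd.DataFrame(rows_train, columns=['FileName', 'Label', 'ClassName'])
test_df = pd.DataFrame(rows_test, columns=['FileName', 'Label', 'ClassName'])

# Save both data frames so the generator can load them later
os.makedirs('data_files', exist_ok=True)
train_df.to_csv(os.path.join('data_files', 'flower_recognition_train.csv'), index=False)
test_df.to_csv(os.path.join('data_files', 'flower_recognition_test.csv'), index=False)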

My working directory looks like this now

So now everything is in place, and we can write our data generator:

import os
import pandas as pd

def load_samples(csv_file):
    # Read the CSV and keep the three columns we care about
    data = pd.read_csv(os.path.join('data_files', csv_file))
    data = data[['FileName', 'Label', 'ClassName']]
    # Get the file names present in the first column
    file_names = list(data.iloc[:, 0])
    # Get the labels present in the second column
    labels = list(data.iloc[:, 1])
    samples = []
    for samp, lab in zip(file_names, labels):
        samples.append([samp, lab])
    return samples

samples = load_samples('flower_recognition_train.csv')

The samples are a list of 4023 items of the format [[image1_filename,label1], [image2_filename,label2],…].
Let us look at a few examples from the beginning and the end of the list:

print(samples[0:5])
# output:
[['flowers_renamed\\daisy\\daisy_000001.png', 0],
['flowers_renamed\\daisy\\daisy_000003.png', 0],
['flowers_renamed\\daisy\\daisy_000004.png', 0],
['flowers_renamed\\daisy\\daisy_000005.png', 0],
['flowers_renamed\\daisy\\daisy_000006.png', 0]]
print(samples[-15:-10])
# output:
[['flowers_renamed\\tulip\\tulip_000969.png', 4],
['flowers_renamed\\tulip\\tulip_000970.png', 4],
['flowers_renamed\\tulip\\tulip_000971.png', 4],
['flowers_renamed\\tulip\\tulip_000973.png', 4],
['flowers_renamed\\tulip\\tulip_000974.png', 4]]

You can see the file names and their corresponding labels. We are going to use this to create our data generator. The code segment shown below is taken from this blog — www.jessicayung.com/using-generators-in-python-to-train-machine-learning-models/

import os
import cv2
import numpy as np
from random import shuffle

root_dir = ''  # set this to the folder that the FileName paths in the CSV are relative to

def generator(samples, batch_size=32, shuffle_data=True, resize=224):
    """
    Yields the next training batch.
    Suppose `samples` is an array [[image1_filename,label1], [image2_filename,label2],...].
    """
    num_samples = len(samples)
    while True:  # Loop forever so the generator never terminates
        if shuffle_data:
            shuffle(samples)  # shuffle the sample list in place at the start of every pass

        # Get index to start each batch: [0, batch_size, 2*batch_size, ..., max multiple of batch_size <= num_samples]
        for offset in range(0, num_samples, batch_size):
            # Get the samples you'll use in this batch
            batch_samples = samples[offset:offset + batch_size]

            # Initialise X_train and y_train arrays for this batch
            X_train = []
            y_train = []

            # For each example
            for batch_sample in batch_samples:
                # Load image (X) and label (y)
                img_name = batch_sample[0]
                label = batch_sample[1]
                img = cv2.imread(os.path.join(root_dir, img_name))

                # apply any kind of preprocessing
                img = cv2.resize(img, (resize, resize))
                # Add example to arrays
                X_train.append(img)
                y_train.append(label)

            # Make sure they're numpy arrays (as opposed to lists)
            X_train = np.array(X_train)
            y_train = np.array(y_train)

            # The generator-y part: yield the next training batch
            yield X_train, y_train

We are defining a function that takes as input:

samples — the list of [filename, label] pairs

batch_size=32 — the number of samples to load in one batch

shuffle_data — whether to shuffle the data or not

resize=224 — the size to resize the images to (you can add parameters for any other kind of preprocessing)

We create a while loop that never terminates. Inside it, a for loop iterates over the data in steps of batch_size and loads the current batch of samples into batch_samples. To store the image samples and labels, we create two empty lists at the beginning of each batch — X_train and y_train. Then, for each sample in the current batch, we:

1. Get the image name and the corresponding label

2. Read the image using cv2 and apply any kind of pre-processing or transformations

3. Append the read image and label to the X_train and y_train lists, respectively

4. Once all the samples in the current batch are read and loaded into the X_train and y_train lists, convert them to NumPy arrays and yield them.

In the above function, where I am resizing the image, you can instead call a preprocessing function that applies various pre-processing steps such as normalization, rotation, etc.
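For example, here is a hedged sketch of such a preprocessing hook (the normalization and random flip are illustrative choices of mine, not part of the original code):

import random
import cv2
import numpy as np

def preprocess_image(img, resize=224):
    """Resize, optionally flip, and normalize a single BGR image."""
    img = cv2.resize(img, (resize, resize))
    if random.random() < 0.5:                 # simple augmentation: random horizontal flip
        img = cv2.flip(img, 1)
    return img.astype(np.float32) / 255.0     # scale pixel values to [0, 1]

Inside the generator, you would then replace the cv2.resize call with img = preprocess_image(img, resize).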

Now let us call the defined generator and check some values. Since we use a batch size of 8 and an image size of 224, the input shape is (8, 224, 224, 3), and there are 8 corresponding labels for these 8 images.

# this will create a generator object
train_datagen = generator(samples, batch_size=8, shuffle_data=True)
x, y = next(train_datagen)
print(x.shape)
# output: (8, 224, 224, 3)
print(y)
# output: [0 1 1 4 3 1 4 2]

# we can plot the data and see for ourselves
import matplotlib.pyplot as plt
fig = plt.figure(1, figsize=(12, 12))
for i in range(8):
    plt.subplot(4, 4, i + 1)
    plt.tight_layout()
    x[i] = x[i][:, :, ::-1]   # convert BGR (OpenCV) to RGB for display
    plt.imshow(x[i], interpolation='none')
    plt.title("class_label: {}".format(y[i]))
    plt.xticks([])
    plt.yticks([])

We can use this generator to train a Keras model. I wanted to use this post to explain how a data generator is created for image classification problems where just one image is the input; in the next blog, we will see how to create a data generator for multiple inputs (to feed into multiple models) or multi-channel input (say, a sequence of frames from a video). Note that Keras also has its own image data generator, which you can use for such problems — you can read this blog to see the Keras ImageDataGenerator in action: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html.
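For comparison, a minimal sketch of that built-in route (the directory name and parameters here are assumptions; see the linked Keras blog for a complete example):

from keras.preprocessing.image import ImageDataGenerator

# Built-in alternative: read batches straight from class sub-folders
datagen = ImageDataGenerator(rescale=1. / 255)
keras_train_gen = datagen.flow_from_directory(
    'flowers_renamed',          # assumed directory with one sub-folder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical')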

# Import list of train and validation data (image filenames and image labels)
train_samples = load_samples('flower_recognition_train.csv')
validation_samples = load_samples('flower_recognition_test.csv')

# Create generators
train_generator = generator(train_samples, batch_size=32)
validation_generator = generator(validation_samples, batch_size=32)

Now that we have everything ready, we can either use a pre-trained network like VGG16, ResNet, Inception, etc., or design our own network and train it using the created data generator.
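If you go the pre-trained route, a rough sketch with VGG16 (my assumption of the setup, not the network trained below) could look like this:

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, Flatten

# Frozen VGG16 backbone with a small 5-class classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False
x = Flatten()(base.output)
x = Dense(64, activation='relu')(x)
out = Dense(5, activation='softmax')(x)
pretrained_model = Model(inputs=base.input, outputs=out)
pretrained_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])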

For now, let’s define our own network. I will use Keras here, but you can use this data loader format and structure it to train PyTorch or TensorFlow models as well:

# import the necessary modules from the library
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, Activation, MaxPooling2D, Dropout

input_shape = (Config.resize, Config.resize, 3)
print(input_shape)

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape, name='conv2d_1'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), name='maxpool2d_1'))
model.add(Conv2D(32, (3, 3), name='conv2d_2'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), name='maxpool2d_2'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(Config.num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit_generator(
    train_generator,
    steps_per_epoch=num_train_samples // batch_size,
    epochs=Config.num_epochs,
    validation_data=validation_generator,
    validation_steps=num_test_samples // batch_size)
Epoch 1/10
402/402 [==============================] - 1249s 3s/step - loss: 1.1912 - acc: 0.5431 - val_loss: 1.1057 - val_acc: 0.6333
Epoch 2/10
402/402 [==============================] - 1245s 3s/step - loss: 0.7368 - acc: 0.7438 - val_loss: 1.0109 - val_acc: 0.6667
Epoch 3/10
402/402 [==============================] - 10348s 26s/step - loss: 0.4385 - acc: 0.8529 - val_loss: 1.0953 - val_acc: 0.6467
Epoch 4/10
5/402 [..............................] - ETA: 21:18 - loss: 0.2822 - acc: 0.8750

The model starts training using the created train and validation generators. This post is not about showing you how to train an efficient model, but just how to create and use a data generator.
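One caveat worth flagging: the generator above yields integer class labels, while the model is compiled with categorical_crossentropy, which expects one-hot targets. Depending on your Keras version this can raise a shape mismatch; a simple fix (my assumption, not part of the original notebook) is to either switch the loss to sparse_categorical_crossentropy or one-hot encode the labels inside the generator before yielding:

import numpy as np
from keras.utils import to_categorical

# Inside the generator, after collecting the batch labels:
y_train = [0, 1, 1, 4, 3, 1, 4, 2]                           # example integer labels for one batch
y_train = to_categorical(np.array(y_train), num_classes=5)   # shape (8, 5), one-hot encoded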

All the code in this post is available on GitHub as Python notebooks and scripts:

For creating the train and test CSV files — https://github.com/anujshah1003/custom_data_generator/blob/master/flowers_recognition/preparing_labeled_train_test_csv_files.ipynb

For creating the data generator — https://github.com/anujshah1003/custom_data_generator/blob/master/flowers_recognition/data_generator_demo.ipynb

To conclude, in this post we saw how to create our own custom data generator, and we did this for the first scenario, where a single input is fed to the network. We need to prepare a list in the format given below, which can then be used for creating the generator function:

samples — [[frame1_filename,label1], [frame2_filename,label2],……….].

In the subsequent posts, we will see how to create a generator function for the second and third scenarios, where you have multiple inputs or a sequence of inputs.

Till then, Keep Learning!! Keep Exploring Neurons!!!!

If you find my articles helpful and wish to support them — Buy me a Coffee
