Tutorial on Keras ImageDataGenerator with flow_from_dataframe

Keras flow_from_dataframe example article.

A more detailed tutorial can be found at https://medium.com/@vijayabhaskar96/tutorial-on-keras-flow-from-dataframe-1fd4493d237c

Installation of the keras-preprocessing library:
Keras can be slow to pick up changes from the keras-preprocessing library, so if you want to use the flow_from_dataframe feature, I suggest you do the following after you have installed Keras:
pip uninstall keras-preprocessing
pip install git+https://github.com/keras-team/keras-preprocessing.git
Then import ImageDataGenerator from keras_preprocessing instead of keras.preprocessing:

from keras_preprocessing.image import ImageDataGenerator

Now you can use Keras's ImageDataGenerator to perform image augmentation on images that are listed in a CSV file, by reading the CSV into a pandas DataFrame.

Most image datasets available on the internet come in one of two layouts: either the images are placed in folders named after their classes, or all the images sit in a single folder along with a CSV or JSON file that maps the image filenames to their classes.

In the former case, we already have the flow_from_directory method, which reads the images from the class folders. In the latter case, you used to have two options: write a custom generator, or move the image files into folders named after their classes so that flow_from_directory could be used. Now, with the flow_from_dataframe method, you can directly pass the pandas DataFrame that holds the mapping between the image filenames and their labels.
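As a rough sketch of the second layout, the mapping file can be loaded into a DataFrame with one row per image. The column names `filename` and `label` and the inline CSV text are made up for illustration; a real dataset would be read from its own CSV file.

```python
import io

import pandas as pd

# Hypothetical CSV contents: one row per image, all images in a single folder.
csv_text = """filename,label
img_001.png,cat
img_002.png,dog
img_003.png,cat
"""

# In practice this would be pd.read_csv("path/to/labels.csv").
df = pd.read_csv(io.StringIO(csv_text))

# The DataFrame now holds the filename-to-label mapping that
# flow_from_dataframe reads through its x_col and y_col arguments.
print(df)
print(sorted(df["label"].unique()))
```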

The best part about flow_from_dataframe is that you can pass any column or columns as target data: with class_mode="other", whatever is in a column or list of columns is treated as raw target NumPy arrays. This means you can do regression tasks that take images as inputs and numeric values as outputs. Multiple numerical target columns are supported too, so multi-output neural networks become easy to set up.
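To make "raw target NumPy arrays" concrete, the sketch below builds a small DataFrame with two numeric target columns and shows the array those columns correspond to. The column names (`age`, `height`), the filenames, and the helper `make_regression_generator` are hypothetical. The helper only wraps the flow_from_dataframe call described above and needs keras_preprocessing plus real image files on disk, so it is defined but not called here.

```python
import pandas as pd

# Hypothetical regression dataset: two numeric targets per image.
df = pd.DataFrame({
    "filename": ["a.png", "b.png", "c.png"],
    "age": [23.0, 41.0, 35.0],
    "height": [1.70, 1.82, 1.65],
})
target_cols = ["age", "height"]

# With class_mode="other" the y_col columns are used as-is, so a batch of
# targets looks like a slice of this array.
targets = df[target_cols].values
print(targets.shape)  # (3, 2)


def make_regression_generator(dataframe, image_dir, batch_size=32):
    """Sketch only: requires keras_preprocessing and the images in image_dir."""
    from keras_preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1.0 / 255)
    return datagen.flow_from_dataframe(
        dataframe=dataframe,
        directory=image_dir,
        x_col="filename",
        y_col=target_cols,
        class_mode="other",
        batch_size=batch_size,
    )
```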

flow_from_dataframe accepts all the arguments that flow_from_directory accepts, along with a few arguments of its own:

dataframe- Pandas DataFrame that contains the filenames and the classes or numeric data to be treated as target values.
directory- Path to the folder that contains all the images, or None if x_col contains absolute paths pointing to each image instead of just filenames.
x_col- The column in the dataframe that has the filenames of the images.
y_col- The column or columns in the dataframe that will be treated as raw target values if class_mode="other" (useful for regression tasks), as class names if class_mode is "binary" or "categorical", or ignored if class_mode is "input" or None.
class_mode- In addition to all the class_modes previously available in flow_from_directory, there is "other", which allows you to treat all the data in the y_col column or columns as raw target values.
drop_duplicates- Boolean, whether to drop duplicate rows based on filename. True by default.
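When the DataFrame stores only filenames, one way to use directory=None is to turn them into absolute paths first. This is a small sketch; the folder name `train_imgs` and the column names are placeholders.

```python
import os

import pandas as pd

# Hypothetical folder and filenames.
image_dir = os.path.abspath("train_imgs")
df = pd.DataFrame({"filename": ["a.png", "b.png"], "label": ["cat", "dog"]})

# Build one absolute path per row; with this column as x_col,
# directory can be set to None.
df["path"] = df["filename"].map(lambda name: os.path.join(image_dir, name))

print(df["path"].tolist())
```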

# Arguments
    dataframe: pandas-like DataFrame.
    directory: string, path to the target directory that contains all the
        images mapped in the dataframe. You could also set it to None if
        the data in the x_col column are absolute paths.
    x_col: string, column in the dataframe that contains
        the filenames of the target images.
    y_col: string or list of strings, columns in
        the dataframe that will be the target data.
    target_size: tuple of integers `(height, width)`,
        default: `(256, 256)`.
        The dimensions to which all images
        found will be resized.
    color_mode: one of "grayscale", "rgb". Default: "rgb".
        Whether the images will be converted to have
        1 or 3 color channels.
    classes: optional list of classes
        (e.g. `['dogs', 'cats']`). Default: None.
        If not provided, the list of classes will be automatically
        inferred from the y_col (the order of the classes,
        which will map to the label indices, will be alphanumeric).
        The dictionary containing the mapping from class names to class
        indices can be obtained via the attribute `class_indices`.
    class_mode: one of "categorical", "binary", "sparse",
        "input", "other" or None. Default: "categorical".
        Determines the type of label arrays that are returned:
        - `"categorical"` will be 2D one-hot encoded labels,
        - `"binary"` will be 1D binary labels,
        - `"sparse"` will be 1D integer labels,
        - `"input"` will be images identical
            to input images (mainly used to work with autoencoders),
        - `"other"` will be a numpy array of y_col data,
        - None, no labels are returned (the generator will only
            yield batches of image data, which is useful with
            `model.predict_generator()`, `model.evaluate_generator()`, etc.).
    batch_size: size of the batches of data (default: 32).
    shuffle: whether to shuffle the data (default: True).
    seed: optional random seed for shuffling and transformations.
    save_to_dir: None or str (default: None).
        This allows you to optionally specify a directory
        to which to save the augmented pictures being generated
        (useful for visualizing what you are doing).
    save_prefix: str. Prefix to use for filenames of saved pictures
        (only relevant if `save_to_dir` is set).
    save_format: one of "png", "jpeg"
        (only relevant if `save_to_dir` is set). Default: "png".
    follow_links: whether to follow symlinks inside class subdirectories
        (default: False).
    subset: Subset of data (`"training"` or `"validation"`) if
        `validation_split` is set in `ImageDataGenerator`.
    interpolation: Interpolation method used to resample the image if the
        target size is different from that of the loaded image.
        Supported methods are `"nearest"`, `"bilinear"`, and `"bicubic"`.
        If PIL version 1.1.3 or newer is installed, `"lanczos"` is also
        supported. If PIL version 3.4.0 or newer is installed, `"box"` and
        `"hamming"` are also supported. By default, `"nearest"` is used.
    drop_duplicates: Boolean, whether to drop duplicate rows based on
        filename (default: True).
# Returns
    A `DataFrameIterator` yielding tuples of `(x, y)`
    where `x` is a numpy array containing a batch
    of images with shape `(batch_size, *target_size, channels)`
    and `y` is a numpy array of corresponding labels.

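The label formats listed under class_mode can be illustrated without Keras. The sketch below mimics, with plain NumPy, what "sparse" and "categorical" labels look like for a made-up list of class names, with class indices assigned in sorted order as the `classes` entry describes. It is an illustration of the formats, not the library's own code.

```python
import numpy as np

# Made-up labels for a batch of four images.
labels = ["dog", "cat", "dog", "bird"]

# Class names in sorted order map to indices 0..n-1,
# analogous to the generator's class_indices attribute.
class_indices = {name: i for i, name in enumerate(sorted(set(labels)))}

# "sparse": 1D integer labels.
sparse = np.array([class_indices[name] for name in labels])

# "categorical": 2D one-hot labels, one row per image.
categorical = np.eye(len(class_indices))[sparse]

print(class_indices)
print(sparse)
print(categorical.shape)
```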
Example code: If you want to try a dataset, you can use https://www.kaggle.com/c/cifar-10/data

Note: the has_ext attribute is deprecated, so make sure the x_col column in your dataframe holds complete filenames, including extensions. The example dataset linked above only has file ids without filename extensions; appending ".png" with the pandas map or apply function turns them into proper filenames.
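Appending the extension could look like the sketch below. The toy frame stands in for the dataset's CSV, and the assumption here is that the `id` column holds integer ids, so each one is formatted into a string.

```python
import pandas as pd

# Stand-in for the dataset's CSV: ids without a filename extension.
df = pd.DataFrame({"id": [1, 2, 3], "label": ["frog", "truck", "deer"]})

# Turn each id into a full filename such as "1.png".
df["id"] = df["id"].map(lambda image_id: "{}.png".format(image_id))

print(df["id"].tolist())  # ['1.png', '2.png', '3.png']
```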

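import pandas as pd
from keras import optimizers
from keras.layers import (Activation, Conv2D, Dense, Dropout, Flatten,
                          MaxPooling2D)
from keras.models import Sequential
from keras_preprocessing.image import ImageDataGenerator

# The "id" column must hold full filenames, including the extension.
df = pd.read_csv("./train.csv")

# Assumption: hold out 25% of the rows for validation via validation_split;
# the subset argument then selects each part.
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.25)
flow_args = dict(dataframe=df, directory="./train_imgs", x_col="id",
                 y_col="label", class_mode="categorical",
                 target_size=(32, 32), batch_size=32)
train_generator = datagen.flow_from_dataframe(subset="training", **flow_args)
valid_generator = datagen.flow_from_dataframe(subset="validation", **flow_args)

# Small CNN for 32x32 RGB images and 10 output classes.
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=(32, 32, 3)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(optimizers.RMSprop(lr=0.0001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Number of full batches per epoch for each generator.
STEP_SIZE_TRAIN = train_generator.n // train_generator.batch_size
STEP_SIZE_VALID = valid_generator.n // valid_generator.batch_size
model.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=10)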
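The step sizes above use floor division, so when the number of samples is not a multiple of the batch size, the final partial batch is not counted as a step. A quick check with made-up sample counts (the helper names are hypothetical, and the ceiling variant is shown only for comparison):

```python
import math


def steps_floor(n_samples, batch_size):
    """Number of full batches, as computed with n // batch_size."""
    return n_samples // batch_size


def steps_ceil(n_samples, batch_size):
    """Alternative that also counts the final partial batch."""
    return math.ceil(n_samples / batch_size)


# Hypothetical split of 37500 training and 12500 validation images.
print(steps_floor(37500, 32), steps_ceil(37500, 32))
print(steps_floor(12500, 32), steps_ceil(12500, 32))
```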