Keras is to TensorFlow/Theano/CNTK what Scikit-Learn is to NumPy.

[Keras] A thing you should know about Keras if you plan to train a deep learning model on a large dataset

TLDR; Understanding this is important if you plan to migrate your Keras experiments from toy datasets to large ones:

  • The output of predict_generator() and predict() may not match if you are using a data generator created with flow_from_directory() from ImageDataGenerator(), even if your training data, model architecture, hyperparameters, and random seed are identical.
  • If you are not aware of this, predict_generator() will likely give you garbage results.
  • This is now harder to discover because some functions from Keras 1.0 that dealt with this are gone in Keras 2.0.
  • This also makes it harder to train multi-label models on large datasets.

I have to admit, I have not seen a TLDR; longer than mine.

Context: training on large datasets

Since fit() requires the entire dataset as a NumPy array in memory, for larger datasets we have to use fit_generator().

In Keras, fit() and predict() work well for smaller datasets that can be loaded into memory. In practice, though, many real-world datasets are too large to load into memory at once.

The solution is to use fit_generator() and predict_generator() with data generator functions that load images into memory in batches during training or prediction. We can write them ourselves, but Keras also provides ImageDataGenerator(), from which we can create a generator using flow() or flow_from_directory(). Since many image models already use ImageDataGenerator() for image augmentation, this is a convenient solution.
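To make the idea of a custom data generator concrete, here is a minimal sketch in plain Python. It is not what ImageDataGenerator() does internally; the names batch_generator and load_fn are hypothetical, and load_fn stands in for whatever reads one image from disk. Keras expects such a generator to loop over the data indefinitely, which is why the outer loop never ends.

```python
import random


def batch_generator(samples, labels, batch_size, load_fn, shuffle=True, seed=0):
    """Yield (inputs, targets) batches forever, loading each sample lazily.

    samples : list of file paths (or any cheap references to the data)
    labels  : list of targets, aligned with samples
    load_fn : hypothetical callable that turns one reference into real data
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    while True:  # loop indefinitely; the caller decides how many steps to draw
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Only this batch is loaded into memory
            inputs = [load_fn(samples[i]) for i in batch]
            targets = [labels[i] for i in batch]
            yield inputs, targets


# Toy usage: str.upper stands in for an image loader
demo = batch_generator(["a", "b", "c"], [0, 1, 2], 2, str.upper, shuffle=False)
first_batch = next(demo)
second_batch = next(demo)
third_batch = next(demo)  # the generator has wrapped around to the start
```

In real use, each batch would be converted to NumPy arrays before being yielded, and the number of batches per epoch would be passed to fit_generator() separately.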

flow_from_directory() infers the labels from directory structure

However, these functions differ in how they learn the labels associated with the images.

  • With fit(), or with fit_generator() using flow() from ImageDataGenerator(), we supply the labels ourselves.
  • flow_from_directory() automatically infers the labels from the directory structure of the folders containing the images. Every subfolder inside the training folder (or validation folder) is treated as a target class.
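The folder-as-label convention can be illustrated without Keras. The sketch below is only an approximation of the idea, not Keras's actual implementation; labels_from_folders is a hypothetical helper, and the cat/dog tree is made up for the demo.

```python
import os
import tempfile


def labels_from_folders(root):
    """Map each file under root to the name of its immediate parent folder."""
    labels = {}
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # stray files at the top level carry no label
        for filename in sorted(os.listdir(class_dir)):
            labels[os.path.join(class_name, filename)] = class_name
    return labels


# Build a tiny throwaway tree: cat/a.jpg, dog/b.jpg, dog/c.jpg
with tempfile.TemporaryDirectory() as root:
    for class_name, filename in [("cat", "a.jpg"), ("dog", "b.jpg"), ("dog", "c.jpg")]:
        os.makedirs(os.path.join(root, class_name), exist_ok=True)
        open(os.path.join(root, class_name, filename), "w").close()
    demo_labels = labels_from_folders(root)
```

Here every file under dog/ is labelled "dog" purely because of where it sits on disk; no label array is ever passed in.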

The Issue

When one uses flow_from_directory(), the mapping from the class labels (the folder names) to the internal one-hot vectors may not be intuitive.

Let’s say we are working with the CXR8 dataset, which now has 14 different classes. Say we have saved them in folders called Class_1, Class_2, Class_3, …, Class_12, Class_13, Class_14.

If we are passing the labels ourselves, the encoding will usually reflect the order in which we list them. If flow_from_directory() is inferring them, it sorts the directory names as strings before encoding. In this case, Class_10 comes right after Class_1, and Class_2 only comes after Class_14. The mapping will look like this:

{'class_1': 0, 'class_10': 1, 'class_11': 2, 'class_12': 3, 'class_13': 4, 'class_14': 5, 'class_2': 6, 'class_3': 7, 'class_4': 8, 'class_5': 9, 'class_6': 10, 'class_7': 11, 'class_8': 12, 'class_9': 13}

This will most likely differ from the intuitive numeric order, where Class_2 comes after Class_1 (instead of Class_10). In that case, the outputs of predict() and predict_generator() will not line up, because the same output index refers to different classes.
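To see how far apart the two orderings are, the sketch below compares a hand-written numeric order with the string-sorted order described above. The folder names are the hypothetical class_1 to class_14 from the example, and the string sort is assumed to match what flow_from_directory() does.

```python
# Order we would probably use when listing the labels by hand
manual_order = ["class_%d" % i for i in range(1, 15)]

# Order obtained by sorting the folder names as plain strings
sorted_order = sorted(manual_order)

# Positions where the same index points at two different classes
mismatches = [
    (index, manual_order[index], sorted_order[index])
    for index in range(len(manual_order))
    if manual_order[index] != sorted_order[index]
]

for index, by_hand, by_sort in mismatches[:3]:
    print("index %d: %s (by hand) vs %s (sorted)" % (index, by_hand, by_sort))
```

Only index 0 agrees between the two orderings, so almost every prediction read with the wrong mapping would be attributed to the wrong class.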

It is better practice to name your folders without a numeric index, or to zero-pad the index (class_01, class_02, etc.) instead of using class_1 and class_2.
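A quick way to check the zero-padding advice is to pad the names and confirm that string order and numeric order then agree. The helper zero_pad below is hypothetical and only handles a number at the very end of the folder name.

```python
import re


def zero_pad(name, width=2):
    """Pad the trailing number of a folder name, e.g. class_7 -> class_07."""
    match = re.fullmatch(r"(.*?)(\d+)", name)
    if match is None:
        return name  # no trailing number, leave the name unchanged
    prefix, number = match.groups()
    return prefix + number.zfill(width)


original = ["class_%d" % i for i in range(1, 15)]
padded = [zero_pad(name) for name in original]

# With padding, sorting as strings keeps the numeric order intact
print(sorted(padded) == padded)
```

The width should cover the largest index you expect, so 2 is enough for up to 99 classes.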

An Example

This blog post was prompted by Keras issue #3477. The author of that issue ran two experiments with Keras, one with image augmentation (using ImageDataGenerator()) and one without, and asks why the outputs of predict() and predict_generator() are different. This is an example of the different label ordering we just discussed.

Solution (Code)

Keras 1.0 had a couple of functions for the Sequential API, model.predict_classes() and model.predict_proba(), to deal with this, but they are gone in Keras 2.0, which I think is a good decision. The workflow across the Sequential and the Functional APIs should be similar and predictable.

The fix is fairly easy anyway. The mapping from flow_from_directory() is stored inside an attribute called class_indices in the data generator object.

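import numpy as np

# Assumes model, train_generator and test_generator already exist;
# create test_generator with shuffle=False so rows stay in file order.
predictions = model.predict_generator(test_generator)
predictions = np.argmax(predictions, axis=-1)  # most probable class index per sample
label_map = train_generator.class_indices  # folder name -> index
label_map = {v: k for k, v in label_map.items()}  # flip to index -> folder name
predictions = [label_map[k] for k in predictions]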

These predictions should now be comparable to those from predict() on a model whose labels we supplied ourselves (via fit(), or fit_generator() using flow()).
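If you need the raw probability columns rather than class names, another option is to reorder the columns so they follow the order you intended. This is a sketch in plain Python; reorder_columns is a hypothetical helper, and the three-class mapping is a shortened stand-in for a real class_indices dictionary.

```python
def reorder_columns(probabilities, class_indices, desired_order):
    """Return rows whose columns follow desired_order instead of class_indices.

    probabilities : list of rows, one probability per class, in generator order
    class_indices : dict of folder name -> column index (generator order)
    desired_order : list of folder names in the order we want the columns
    """
    columns = [class_indices[name] for name in desired_order]
    return [[row[c] for c in columns] for row in probabilities]


# Shortened example: string sorting put class_10 before class_2
class_indices = {"class_1": 0, "class_10": 1, "class_2": 2}
desired_order = ["class_1", "class_2", "class_10"]

raw = [[0.1, 0.7, 0.2]]  # one sample, columns in generator order
aligned = reorder_columns(raw, class_indices, desired_order)
print(aligned)
```

With a NumPy array, the same reordering can be written as probabilities[:, columns].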

We can manually create the mapping ourselves as well. The following code is from this SO post.

import os
from glob import glob

# Run from inside the training directory; keep only sub-folders (one per class)
class_names = [name for name in glob("*") if os.path.isdir(name)]
class_names = sorted(class_names)  # sort the folder names as strings
name_id_map = dict(zip(class_names, range(len(class_names))))

The variable name_id_map in the above code contains the same dictionary as the class_indices attribute of the generator returned by flow_from_directory().
