Keras is to TensorFlow/Theano/CNTK what Scikit-Learn is to NumPy.

[Keras] A thing you should know about Keras if you plan to train a deep learning model on a large dataset

Soumendra P
Jan 23, 2018 · 4 min read

TL;DR: Understanding this is important if you plan to migrate your Keras experiments from toy datasets to large ones:

  • The output of predict_generator() and predict() may not match if you are using a data generator created with flow_from_directory() from ImageDataGenerator(), even if your training data, model architecture, hyperparameters, and random seed are identical.
  • If you are not aware of this, predict_generator() will likely give you garbage results.
  • This is now harder to discover because some functions to deal with this from Keras 1.0 are gone in Keras 2.0.
  • This also makes it harder to train multi-label models on large datasets.

I have to admit, I have not seen a TL;DR longer than mine.

Context: training on large datasets

In Keras, fit() and predict() work well for smaller datasets that fit in memory. In practice, though, most real-world datasets are too large to load into memory at once.

The solution is to use fit_generator() and predict_generator() with data generator functions that load images into memory in batches during training or prediction. We can write these ourselves, but ImageDataGenerator() in Keras already provides such generators, which we create using flow() or flow_from_directory(). Since most deep learning models for images likely already use ImageDataGenerator() (because image augmentation is essential), this is a great solution.

However, there is a difference between how these functions learn the labels associated with the images.

  • With fit(), or with fit_generator() using flow() from ImageDataGenerator(), we supply the labels ourselves.
  • flow_from_directory() automatically infers the labels from the directory structure of the folders containing the images. Every subfolder inside the training folder (or validation folder) is treated as a target class.

The Issue

When one uses flow_from_directory(), the mapping from the class labels (the folder names) to the internal one-hot vectors may not be intuitive.

Let’s say we are working with the CXR8 dataset, which now has 14 different classes. Say we have saved them in folders called Class_1, Class_2, Class_3, …, Class_12, Class_13, Class_14.

If we pass the labels ourselves, the encoding will mostly reflect the order in which we list them. If flow_from_directory() infers them, it sorts the directory names before encoding. In this case, Class_10 will come right after Class_1, where we might expect Class_2. The mapping will look like this: Class_1 → 0, Class_10 → 1, Class_11 → 2, Class_12 → 3, Class_13 → 4, Class_14 → 5, Class_2 → 6, Class_3 → 7, and so on up to Class_9 → 13.
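The ordering described above can be reproduced with plain Python, assuming the inference amounts to a lexicographic sort of the folder names. The names below are the hypothetical Class_1 to Class_14 folders from this example; nothing is read from disk.

```python
# Hypothetical folder names from the example above (not read from disk).
folder_names = ["Class_%d" % i for i in range(1, 15)]

# Lexicographic sort: "Class_10" sorts before "Class_2" because the
# character "1" compares lower than the character "2".
sorted_names = sorted(folder_names)

# Index assigned to each class name after sorting.
class_to_index = {name: idx for idx, name in enumerate(sorted_names)}

for name in sorted_names:
    print(name, "->", class_to_index[name])
```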

This will most likely be different from the intuitive sorting where Class_2 comes after Class_1 (instead of Class_10). In that case, the outputs from predict() and predict_generator() will look different.
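When the two orderings differ, one option is to permute the columns of the generator-based output into the order we expect before comparing. This is only a sketch, assuming lexicographically sorted folder names on one side and a numeric-aware order on the other; natural_key and reorder_row are illustrative helpers, not Keras functions.

```python
import re

def natural_key(name):
    """Sort key that compares embedded numbers numerically."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", name)]

names = ["Class_%d" % i for i in range(1, 15)]
inferred_order = sorted(names)                   # order assumed for flow_from_directory()
expected_order = sorted(names, key=natural_key)  # Class_1, Class_2, ..., Class_14

# For each expected position, the column that holds that class in the inferred order.
column_for = [inferred_order.index(name) for name in expected_order]

def reorder_row(row):
    """Rearrange one row of per-class scores into the expected class order."""
    return [row[j] for j in column_for]

# A made-up prediction row with all its mass on inferred column 6 (Class_2).
row = [0.0] * 14
row[6] = 1.0
print(reorder_row(row).index(1.0))  # position of Class_2 in the expected order
```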

An Example

This blog post was prompted by Keras issue #3477. The author of that issue ran two experiments with Keras, one with image augmentation (using ImageDataGenerator()) and one without, and asks: “Why the return of predict() and predict_generator() are different?” This is an example of the different label sort order we just discussed.

Solution (Code)

Keras 1.0 had a couple of functions for the Sequential API, model.predict_classes() and model.predict_proba(), to deal with this, but they are gone in Keras 2.0, which I think is a good decision. The workflow across both the Sequential and the Functional API should be similar and predictable.

The fix is fairly easy anyway. The mapping from flow_from_directory() is stored inside an attribute called class_indices in the data generator object.
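As a rough sketch of how that attribute can be used: invert the name-to-index dictionary and look up the top-scoring column of each prediction row. The class_indices and probabilities values below are made-up stand-ins for generator.class_indices and the output of predict_generator(), and decode_predictions is an illustrative helper, not a Keras function.

```python
def decode_predictions(probabilities, class_indices):
    """Map each row of per-class scores to the name of its top class."""
    # class_indices maps name -> index; invert it to index -> name.
    index_to_class = {idx: name for name, idx in class_indices.items()}
    decoded = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda j: row[j])
        decoded.append(index_to_class[best])
    return decoded

# Stand-in for generator.class_indices; in real use this would come from
# the object returned by flow_from_directory().
class_indices = {"Class_1": 0, "Class_10": 1, "Class_2": 2}

# Stand-in for the output of predict_generator(): one row per image.
probabilities = [[0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]

print(decode_predictions(probabilities, class_indices))
```

Indexing each row by position works for plain lists as well as array rows, so the helper does not depend on a particular array library.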

Once the predicted indices are translated back to class names through class_indices, the predictions will likely match those from a model trained with fit(), or with fit_generator() using flow().

We can manually create the mapping ourselves as well. The following code is from this SO post.

The resulting name_id_map dictionary is the same as the one stored in the class_indices attribute of the generator returned by flow_from_directory().
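A minimal sketch of building name_id_map by hand, assuming each immediate subfolder of a data directory is one class. build_name_id_map and its data_dir argument are illustrative names, and this is not the snippet from the post mentioned above.

```python
import os

def build_name_id_map(data_dir):
    """Map each immediate subfolder of data_dir to an integer index."""
    # Keep only directories, then sort their names lexicographically.
    class_names = sorted(
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry))
    )
    # Pair each sorted name with its position: name -> index.
    return dict(zip(class_names, range(len(class_names))))
```

Calling build_name_id_map() on the training folder and comparing the result with the generator's class_indices is a quick sanity check before interpreting any predictions.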

Please share this with all your Medium friends and hit that clap button below to spread it around even more. Also add any other tips or tricks that I might have missed below in the comments!

Difference Engine AI

Math. Data. Growth.
