A Keras multithreaded DataFrame generator for millions of image files

Ryan Woodard

Published in Xandr-Tech, Jul 19, 2017 (20 min read)

I am a dog,
as plain as can be,
is there throttle
in your model
for a png like me?

(With props to J.D.)

Update 2017-08-04: We (led by Daniel Austin) will be presenting some results using Deep Learning (and the techniques in this post) to identify sensitive content in online advertising images in the contributed Deep Learning track at DSCO 17 in October.

The Data Science team at AppNexus uses deep learning for various image classification tasks. There are many resources to get you running deep learning models quickly but very few of them provide examples for doing so at scale. To help fill that void, this post and accompanying code demonstrate:

How to use a memory-efficient generator in Keras for deep learning (classification or regression) to process millions of image files using hundreds of GB or more of disk space. We designed it specifically to avoid the use of the following common but inefficient steps: (1) creating extra train, validation and test directories and (2) making copies of the same image files in separate binomial and multinomial directories (like the ubiquitous cats and dogs subdirectories);
How to use the same generator to efficiently implement a merged model. One example of this type of model is when the matrix output of a convolution net is concatenated with a separate standard numerical design matrix (feature matrix) that, like the images collection, is too large to fit into memory. The features of this separate matrix can be thought of as metadata for each image. The resultant matrix is then used as input into, say, a fully connected model to optimize to the known labels or values.

The use of the generator is known for classification (we’ll provide the reference) but here we show how it can also be used for regression. Implementing a generator for a merged model when both branches of the model require large amounts of data that cannot fit into memory seems to be a contribution currently lacking in the standard public sources.

We organize the file names using a simple Pandas DataFrame and, for a merged model, we store and access the auxiliary feature matrix using a disk-based bcolz array. The key is that the rows of the DataFrame and the rows of the bcolz array are in the same order.

One of our main goals is to use the original image files stored in their “natural” resting place on your system, as opposed to making extra copies in an unnecessary directory hierarchy. Who really keeps their images in dogs and cats directories? Databases were invented for this reason and they easily track millions of image files using their associated table ids, regardless of their actual paths on the file system. Our code is used like the .flow_from_directory(directory) method of the Keras ImageDataGenerator class.

This technique also allows the use of images as input into regression models, where the paradigm of naming sample directories after their classification labels breaks down (see this StackOverflow discussion). We think this ability to use the same image organization technique for regression as well as for classification is a useful contribution to the community.

We will demo code and conclude this post with a slightly different implementation of the model described in the Keras blog post from June 2016 (specifically, the section “Training a small convnet from scratch: 80% accuracy in 40 lines of code”).

Thanks to https://github.com/lengstrom/fast-style-transfer

1 tl;dr

Here are the code and some bullet points of what we address:

Keras deep learning with convolutional networks;
Millions of images, hundreds of GB or more of disk space;
Pandas DataFrame solution to organize file names (instead of subdirectory per classification label);
use of original images and file paths, not resized and relocated copies;
(but if you are copying files…) an efficient directory organization technique;
use of same images and paths for classification (binomial, multinomial) and regression;
multithreaded, thread-safe use of CPU to create queue of mini-batch design matrices feeding to GPU doing actual training;
multiple users can use the same image files on network file server;
use of existing database id to track and select files (including random sampling as needed);
merged model trained using this same id to organize other features in a separate design matrix.

Much of what we present here, in terms of code and ideas, already exists in some form in the body of two Keras issues, #1627 and #1638. The regression issue is discussed in a StackOverflow post. Our contribution is more to synthesize and clarify that information and elaborate on it in a tutorial-like way that, we hope, will help others. Our code has been tested with Keras 2.0 with the TensorFlow backend.

2 Motivation

As mentioned, this post and accompanying code are about using Keras for deep learning (classification or regression) and efficiently processing millions of image files using hundreds of GB or more of disk space without creating extra copies and sub-directories to organize. We want to use the original image files stored in their “natural” resting place on our system. This technique generalizes to regression models, where the paradigm of naming sample directories after their classification labels breaks down.

This technique works especially well when the mini-batch processing (model training) is done by GPU(s) while the multi-core CPU is building and maintaining a queue of properly transformed mini-batch design matrices. This pre-caching is a multithreaded solution already built into Keras. In our experience, this technique allows the GPU(s) to operate at essentially peak capacity without waiting on image file reading, resizing, transformation, etc., thus minimizing time spent (and, therefore, cost of) using a GPU-backed machine.
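The queueing idea can be sketched with nothing but the Python standard library. This is only an illustration of the producer/consumer pattern, not the Keras internals: `make_batch` and `train_on_batch` are hypothetical stand-ins for image loading on the CPU and a training step on the GPU.

```python
import queue
import threading


def make_batch(i, batch_size=4):
    # Hypothetical stand-in for reading, resizing and stacking image files.
    return [i * batch_size + k for k in range(batch_size)]


def train_on_batch(batch):
    # Hypothetical stand-in for one training step on the GPU.
    return sum(batch)


def producer(q, n_batches):
    # CPU side: build mini-batches; put() blocks while the queue is full.
    for i in range(n_batches):
        q.put(make_batch(i))
    q.put(None)  # sentinel: no more batches


def run(n_batches=5, max_queue_size=2):
    q = queue.Queue(maxsize=max_queue_size)
    worker = threading.Thread(target=producer, args=(q, n_batches), daemon=True)
    worker.start()
    results = []
    while True:
        batch = q.get()  # consumer side: blocks only if the queue is empty
        if batch is None:
            break
        results.append(train_on_batch(batch))
    worker.join()
    return results


results = run()
```

As long as the producer keeps the bounded queue non-empty, the consumer never waits on file I/O, which is the whole point.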

As practitioners know, this type of background efficiency is not as glorious as its more famous celebrity big sister, “machine learning”, but is absolutely necessary to get something done within a reasonable temporal and cost budget.

+-----------+      +-----+      +--------------+      +-----+
| existing  | ---> | CPU | ---> | mini-batch 1 | ---> | GPU |
| file      |      +-----+      | mini-batch 2 |      +-----+
| server    |                   | mini-batch 3 |
| structure |                   | ...          |
+-----------+                   | mini-batch n |         ^
                                +--------------+         |
                                                         |
      ^               ^              ^          +---------------+
      |               |              |          | The sexy bit, |
      |               |              |          | "machine      |
 +------------------------------------------+   |  learning".   |
 | Necessary, behind the scenes grunt work. |   +---------------+
 +------------------------------------------+

Assuming the file server has enough juice, this technique allows for multiple users/jobs/clusters to read images from the same host while training different models, be it for hyperparameter exploration within the same model family or distinctly different modeling projects.

For our use case (discussed below), we did not want to use standard data augmentation techniques, where a single image is scaled, warped, reversed, Dalíized, etc. Briefly, we are training on advertisement images, where the images are usually carefully curated by the advertisers themselves and the information that we want to identify could be lost by any warping (other than resizing) of the original. Instead, we want to use each of our millions of distinctly different images as stored with us. However, our technique presented here can be generalized to include augmentation of each image using the Keras ImageDataGenerator (see documentation in our provided code).

In order to leverage existing pre-trained models, like VGG16 and InceptionV3, images should be resized to 224x224 and 299x299, respectively. Instead of creating new directories of resized images, this can and should be done on the fly, as is done with the built-in Keras .flow_from_directory(directory) method of the ImageDataGenerator class, where the combination of a CPU queue and GPU processing is key. We include that ability in what follows.

Remember that this CPU/GPU duality already exists in Keras generator machinery–we are simply rearranging some pieces so that lists of file paths can be used instead of directories of files.

3 Current standard practice for image directories

Standard practice for the binomial model (‘cats’ vs. ‘dogs’) is to make two sub-directories, data/train/cats and data/train/dogs (then two more for the validation and two more for test). Similarly, for the multinomial model, the multiple classes are inferred from the directory structure itself (e.g., data/train/black_hair, data/train/grey_hair, data/train/white_hair).

These directory structure assumptions are built into the standard Keras .flow_from_directory(directory) method of the ImageDataGenerator class, where the documentation describes directory as:

“…path to the target directory. It should contain one subdirectory per class. Any PNG, JPG or BMP images inside each of the subdirectories directory tree will be included in the generator.”

The directory structure assumptions are fine for small datasets but, for our use case of millions of images and hundreds of GB of storage, they are inefficient: we do not want to copy all of those image files (perhaps more than once) or make a spider web of symbolic links in a misguided attempt to “simplify” the copy process.

Most likely, your image files initially do not reside in either of the directory structures just mentioned. Instead, they are very possibly in a directory hierarchy that predates your machine learning application and that reflects some internal database structure/organization of your company that was not designed for classification tasks.

In such a case, we want to build a data generator function that can safely and efficiently process directories of image files and can be sent to the Keras .fit_generator() function.

4 Moving away from the `cats` and `dogs` paradigm: using database `ids`

To make this more concrete, we turn away from cats and dogs and towards our own industry of advertising technology and the associated creatives. Creatives are the actual images for advertisements shown on a web page. Creatives come in many sizes; two of the most common (width x height in pixels) are 300x250 and 728x90. A creative is stored in our company databases with a unique id (the creative_id) and appears in multiple internal (and, sometimes, external-facing) tables so that a single creative_id is associated with dozens of metadata fields (e.g., advertiser, country, campaign, brand category, etc.) and billions of log level transaction fields (web site on which the creative is placed, time stamp and bid price of auction, etc.).

One of our main goals here is to avoid multiple copies of hundreds of GB through our network. That is, if these files already exist somewhere on our system, we would rather have our model pipeline read directly from those existing locations rather than create an arbitrary, short-lived proverbial set of dogs and cats directories. We use a simple thread-safe multi-threaded generator, based on discussion in Keras ticket #1638, using standard Keras and Python tools that can be used in the same way as the Keras .flow_from_directory(directory) method.
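Here is a minimal sketch of the locking idea, in the spirit of that ticket. The class and decorator names are ours for illustration and may not match the accompanying code, and `count_up` is just a stand-in for a generator that yields mini-batches.

```python
import functools
import threading


class ThreadSafeIterator:
    """Wrap an iterator so that only one thread at a time can advance it."""

    def __init__(self, iterator):
        self.iterator = iterator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.iterator)


def threadsafe_generator(func):
    """Decorator: make a generator function return a locked iterator."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return ThreadSafeIterator(func(*args, **kwargs))
    return wrapper


@threadsafe_generator
def count_up(n):
    # Stand-in for a generator that yields mini-batches.
    for i in range(n):
        yield i


# Four threads drain the same generator; every item is seen exactly once.
gen = count_up(1000)
seen = []


def drain():
    for item in gen:
        seen.append(item)


threads = [threading.Thread(target=drain) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two threads calling `next()` on the same plain generator at the same moment can raise `ValueError: generator already executing`.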

5 The database id is associated with other feature data

An image usually is simply one of many features associated with a higher level object on a system. For instance, at AppNexus and in the adtech industry, the image displayed on a web page as an ad is just one feature of the higher level concept creative. A creative is uploaded to our system by an advertiser for a particular advertising campaign for a particular brand. So we see that when a creative is shown, it has associated with it a tuple of (image, advertiser, brand). Each one of these has an id associated with it that represents a row in one of our database tables. For example, two creatives could be represented in our system as:

| creative_id | advertiser_id | brand_id |
|-------------+---------------+----------|
|    49557622 |         87654 |     5678 |
|     3303651 |         98765 |     1234 |

Additional metadata associated with each creative, advertiser and brand through their respective ids are found in other tables in our system.

In reality, we have hundreds of millions of images that eat up terabytes of storage so, while possible, it is not practical to make multiple copies or symbolic links for each file. We want to write once (when the creative first lands on our system, usually from an automated client upload) and read multiple times (from our multiple models, as well as other ongoing uses on our system).

Given a creative_id, the actual image file associated with it is found on our system via a method that uses the hash of the creative's content. For demonstration purposes here, we show a simpler function that translates the creative_id into an actual file path that could be used on your system (and which could be easily tweaked to use the hash of the image instead of its id). An example of this directory structure is:

cd /path/to/our/images/
tree imgs
 imgs
 ├── 622
 │   └── 57
 │       └── 49557622.png
 ├── 651
 │   └── 03
 │       ├── 3303651.png
 │       ├── 35903651.png
 │       ├── 44603651.png
 │       └── 95403651.png
 └── 756
     ├── 55
     │   ├── 31255756.png
     │   └── 95555756.png
     ├── 61
     │   ├── 1161756.png
     │   ├── 461756.png
     │   └── 5561756.png
     ├── 64
     │   └── 58164756.png
     └── 67
         ├── 3367756.png
         ├── 3767756.png
         └── 5467756.png

This directory structure is used to balance the number of subdirectories and number of files within each subdirectory. Notice that the subdirectory naming convention (below the top level imgs directory) uses the last three digits of the creative_id of our system and the level below that, the next two higher order digits. In this way, we have 1,000 subdirectories below imgs and 100 subdirectories within each of those. If each of these has 1,000 files, then we have an efficient system to store 100 million files.

Why do it this way? Even at scale, there is always the occasional need to navigate directories and use shell commands ls, cd, du, feh, etc. Exploring a directory structure with the TAB-completion features of most shells is far easier with only 1,000 files or directories than with 100 million all thrown into one big pot. So here we have included the user experience in our design of file organization, where the user is the singular us, the machine learning practitioner(s).

An added benefit of this organization is that each lowest level directory should, in theory, contain a relatively random and balanced (class-wise, binomial and multinomial) sample of images, which is convenient for some pipeline testing. This balanced randomness is because images correlated at the level of the object id (creative_id in our working example) will be placed in different directories. That is, consider an advertiser who uploads multiple ads (creatives) to our system in the same session so that they are logged in our databases with consecutive values of creative_id (e.g., 234567, 234568, 234569, …). It is likely that these creatives are visually very similar (the same ad image but with the text in German, Spanish or English, depending on the country in which the ad will be shown). Our directory structure ensures that they will be in different directories.

[In case anyone worries about exposing client data, we actually generated the directory and file names above using random numbers from a simple script.]

A simple, contrived example using the ubiquitous dog vs. cat binomial classification and showing the additional multinomial classes, would then look like:

df

       object_id   bi  multi
index                       
0         461756  dog  white
1        1161756  cat  black
2        3303651  dog  white
3        3367756  dog   grey
4        3767756  dog   grey
5        5467756  cat  black
6        5561756  dog  white
7       31255756  cat   grey
8       35903651  cat  black
9       44603651  dog  black
10      49557622  cat  black
11      58164756  dog   grey
12      95403651  cat  white
13      95555756  dog   grey

With the normal approach of “each classification label gets its own directory”, each of the 14 image files in the above table would need to be copied twice: once to the set of binomial directories and once to the set of multinomial directories. Below, we show how to avoid both copies and to use only the original location of each image file.

6 Function to create file path from database id

Let’s return to our initial goal: given just the database ids of millions of images, efficiently feed them into a Keras model. Regardless of the directory structure chosen, a function is needed to translate the id to a file path. Here is an implementation for the above naming convention:
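A minimal sketch of such a function follows. The name `file_path_from_db_id` is the one used later in this post; the `pattern` and `top` arguments and the zero-padding of short ids are our assumptions for this illustration and may differ from the accompanying code.

```python
import os


def file_path_from_db_id(db_id, pattern="%d.png", top="imgs"):
    """Return top/<last 3 digits>/<next 2 digits>/<id>.png for an integer id."""
    # Zero-pad so that ids with fewer than five digits still get both levels
    # (an assumption; the example tree above only shows longer ids).
    s = "%05d" % db_id
    return os.path.join(top, s[-3:], s[-5:-3], pattern % db_id)


# Two ids from the example tree above.
path_a = file_path_from_db_id(49557622)
path_b = file_path_from_db_id(3303651)

# Consecutive ids land in different top-level subdirectories.
consecutive = [file_path_from_db_id(i) for i in (234567, 234568, 234569)]
```

On a POSIX system, `path_a` is `imgs/622/57/49557622.png` and `path_b` is `imgs/651/03/3303651.png`, matching the tree shown earlier.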

This can even be used to process the Pandas DataFrame above, which makes it handy to keep database id, file path and binary/multinomial labels organized:
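As a sketch, assuming a path function like the one described above (repeated here so the snippet runs on its own) and a few rows taken from the example DataFrame, the file path column can be added with a single `apply`. The column name `imgpath` is the one we use later in this post.

```python
import os

import pandas as pd


def file_path_from_db_id(db_id, pattern="%d.png", top="imgs"):
    # Same illustrative convention as above: last three digits, then next two.
    s = "%05d" % db_id
    return os.path.join(top, s[-3:], s[-5:-3], pattern % db_id)


# A few rows from the example DataFrame shown earlier.
df = pd.DataFrame({
    "object_id": [461756, 1161756, 3303651, 49557622],
    "bi": ["dog", "cat", "dog", "cat"],
    "multi": ["white", "black", "white", "black"],
})

# One path per row; the label columns stay alongside it.
df["imgpath"] = df["object_id"].apply(file_path_from_db_id)
```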

Now we see the idea clearly: each object is represented by a single image path but can assume different roles (labels) in different models. We’ve only shown examples of binomial and multinomial but the idea expands easily so that the above DataFrame could include further columns for additional classification labels and regression values.

7 Train a simple Keras model on Kaggle dogs and cats

So we are all on the same page, let’s make sure we have the same setup and data and train a model. We’ll follow the example on the Keras blog by Francois Chollet from June 2016 that has been “updated to the Keras 2.0 API on March 14, 2017”. Our goal will be to use our above file_path_from_db_id() function and directory structure as a drop-in replacement at the appropriate point in this model.

We choose a slightly different setup of cat and dog images from the Keras blog but still use the same data from the Kaggle competition. Here are some shell commands to copy/paste into your terminal to extract a subset of the data:

Then we run the script linked from the Keras blog post, classifier_from_little_data_script_1.py. Since our current goal is to create a more efficient working pipeline, not to produce the best model parameters, changing the number of epochs from 50 to 5 allows us to quickly run and test our baseline. We show here results from a machine without a GPU, so that testing and pipeline debugging can be done on the cheap before more expensive GPU time is required.

Yes, of course, these are horrible model results, as expected because of decreasing the number of training samples (2000), validation samples (888) and epochs (5). But our goal here is an early necessary step in any machine learning problem: to get the pipeline working first. Only then do you make longer runs on a GPU machine with more data to better train the model.

8 Slight detour to rename Kaggle files for the purpose of this demo

The script we use below reads the same Kaggle dogs and cats images but we want to demonstrate the use of the proposed directory structure above. So in our script, there is a function get_demo_data() that creates the new directory structure from the cats and dogs subdirectories and returns a DataFrame with the new file paths. The Pandas gymnastics that does this is in the new_tricks_from_old_dogs() function, which turns this:

into Bill Murray:

Yes, we are copying files here unnecessarily but we do it only to demo the use of a different directory structure.

9 Build a generator using a DataFrame

Finally, given the simple DataFrames above containing a column of the actual filenames on your system (imgpath) instead of separate categorical directories, we can build our generator using the lovely Keras machinery and some threading safety. The main replacements we will make are the following.

From the excellent Keras blog post, the script mentioned previously, classifier_from_little_data_script_1.py, creates the training and validation generators using code similar to:

Now we make a change to demonstrate our main goal: given our above DataFrames df_train and df_valid, create a generator that Keras can use to pre-cache image data for each mini-batch using the file path names. We simply pass these DataFrames to our new generator function. The complete code is in our version of the script, classifier_from_little_data_script_dfgen.py, which imports from akmtdfgen.py. The relevant code is:
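The sketch below shows the shape of such a generator; it is not a copy of akmtdfgen.py. The `load_image` stub, the `target` column name and the `seed` argument are assumptions made so the example runs without any image files or Keras installed. To share it across worker threads it would also need a lock around `next()`, which is omitted here.

```python
import numpy as np
import pandas as pd


def load_image(path, target_size):
    # Hypothetical stand-in for reading and resizing an image file;
    # a real loader would return the decoded, rescaled pixels.
    return np.zeros(target_size + (3,), dtype="float32")


def generator_from_df(df, batch_size, target_size, loader=load_image, seed=0):
    """Yield (X, Y) mini-batches forever, reshuffling df at every epoch."""
    rng = np.random.RandomState(seed)
    n_batches = len(df) // batch_size  # the last partial batch is dropped
    while True:
        shuffled = df.iloc[rng.permutation(len(df))]
        for b in range(n_batches):
            chunk = shuffled.iloc[b * batch_size:(b + 1) * batch_size]
            X = np.stack([loader(p, target_size) for p in chunk["imgpath"]])
            Y = chunk["target"].values
            yield X, Y


df_train = pd.DataFrame({
    "imgpath": ["imgs/756/61/461756.png", "imgs/756/61/1161756.png",
                "imgs/651/03/3303651.png", "imgs/756/67/3367756.png"],
    "target": [1, 0, 1, 1],  # e.g. dog = 1, cat = 0
})

train_generator = generator_from_df(df_train, batch_size=2, target_size=(150, 150))
X, Y = next(train_generator)
```

Because `Y` is just a column of the DataFrame, the same generator serves regression by putting floating-point values in that column instead of class labels. The resulting generator object is what would be handed to the Keras .fit_generator() function in place of the one built by .flow_from_directory(directory).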

Running with these changes (on a machine without a GPU):

As above, these model results are bad because we are only getting our pipeline working with 2,000 training samples, not seeking to properly train our model.

10 Using the full 25,000 image dataset with the new generator on a GPU machine

Let’s run the same code on about 10x more data. First, separate into old-style label subdirectories only so our get_demo_data() function can find it and create the simulated directory structure and DataFrame; in general, you would not make a copy of the image files, you would simply populate the DataFrame with the actual paths to the files (apologies for beating the dead horse on this point):

Now train the model with the new generator on all of the data, but still only for 5 epochs (since our goal is to test the working pipeline, not classify dogs and cats):

How about that: 83% validation accuracy in only 5 epochs, which is about the same as using 10x less data but with data augmentation, as per our hero, Francois Chollet. Of course, this 83% is an early fluctuation and running for 50 epochs as in the blog post reaches convergence of validation accuracy at around 85% (with training accuracy around 90%).

Once all of the above is working, then adding more training data is as easy as adding new rows to the DataFrame that contain the images’ object ids and paths to the actual files. Conceptually, one could think of it as plugging in an external drive of new training images, where the new drive is mounted at /mnt/more/training/data. You will not need to copy those files to your existing data directory and that is the elegance of this technique.
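As a small illustration (the object ids and file names below are made up, and the mount point is the one from the paragraph above), adding the new drive amounts to concatenating rows:

```python
import pandas as pd

# Existing training rows, pointing at the original file locations.
df = pd.DataFrame({
    "object_id": [461756, 1161756],
    "imgpath": ["imgs/756/61/461756.png", "imgs/756/61/1161756.png"],
    "bi": ["dog", "cat"],
})

# Hypothetical new images on the freshly mounted drive; nothing is copied.
new_rows = pd.DataFrame({
    "object_id": [900001, 900002],
    "imgpath": ["/mnt/more/training/data/900001.png",
                "/mnt/more/training/data/900002.png"],
    "bi": ["cat", "dog"],
})

df_all = pd.concat([df, new_rows], ignore_index=True)
```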

11 Using the generator to train a merged model

A merged model combines the output of two (or more) Sequential models. Commonly, one branch (let’s say the left) would be a convnet trained on image data while the right branch would be some other non-image feature data, such as the output of an LSTM model (see video question example) or simply a “traditional” design matrix of numerical features. A key point relevant to our discussion is that the design matrix for the right branch will have as many rows as there are images in the training, validation and testing data, so that the output of the left model can be concatenated column-wise with the right-side design matrix.

The idea of using a generator for training and validating a merged model has been posed and explored in Keras issues 3386 and 3466 but only for design matrices that can fit into memory. Given our use case of millions of images, if the design matrix consists of thousands of features, then it is unlikely that it will fit into main memory and so we expand on our generator above to address this case.

Let’s elaborate by example: in the simple Sequential model discussed initially in this post, the generator returns a tuple of input matrix X and target vector Y for each mini-batch. Together, these X and Y represent a random sample of the main DataFrame, whose length represents all the samples in one epoch. Each row of X contains the pixel values of the image read from the file path found in the DataFrame and each cell of Y contains the target label for that image. In a merged model, we want an additional row of numerical feature data to be associated with each image. For each mini-batch, then, the generator will return the same X and Y as before and also a second matrix X2 with the same number of rows (the number of columns represents the number of features of this second matrix). All of the mini-batch matrices X2 for the epoch, when stacked on top of each other, are just the total feature matrix for the right-side model that we will call bc (to honor bcolz).

Returning (reluctantly) to our dogs and cats example, one could imagine a set of 12 numbers that represents the monthly weight (normalized, say, to the birth weight) for the first year of growth for each cat or dog in the associated image (not that such data is available for the Kaggle data, I’m just making this up…). The merged model, then, would take into account the (presumably) different growth rates of dogs and cats and might improve upon the images-only model.

Concretely, X is input into the same convolutional network as before. The output of the last convnet layer is concatenated column-wise with X2 and this resultant matrix is now input into a fully connected layer, whose output is compared with Y in the overall loss function.
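The column-wise concatenation is easiest to see in terms of array shapes. In this NumPy sketch the convnet output width of 64 is an arbitrary assumption, and the 12 features echo the monthly-weight example above; in a real model the concatenation happens inside the network rather than in NumPy.

```python
import numpy as np

batch_size = 4

# Flattened output of the last convnet layer for one mini-batch
# (64 columns is an arbitrary choice for illustration).
conv_out = np.ones((batch_size, 64), dtype="float32")

# Second input: 12 numerical features per image, same row order as X.
X2 = np.zeros((batch_size, 12), dtype="float32")

# One row per image, convnet columns followed by feature columns.
merged = np.concatenate([conv_out, X2], axis=1)
```

Row k of `merged` only makes sense if row k of `X2` describes the same image as row k of `X`, which is why the ordering matters so much in what follows.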

If the secondary matrix bc can fit into memory, then it can be passed directly to the generator and idx of the entire shuffled DataFrame from the top of the epoch can be used to correctly rearrange bc like:
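A sketch of that rearrangement, with a tiny in-memory `bc` standing in for the real feature matrix (the values are constructed so that each row can be traced back to its original position):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"object_id": [461756, 1161756, 3303651, 3367756]})

# In-memory feature matrix; row k belongs to DataFrame row k.
# Row k holds [3k, 3k + 1, 3k + 2] so that rows are easy to trace.
bc = np.arange(len(df) * 3).reshape(len(df), 3)

# One permutation per epoch, applied to both objects.
rng = np.random.RandomState(42)
idx = rng.permutation(len(df))
df_shuffled = df.iloc[idx]
bc_shuffled = bc[idx]
```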

Then chunks of df and bc for the mini-batch are found using indexes [i:j], where i and j are incremented by the size of the mini-batch during each iteration of the mini-batch loop:
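The slicing loop can be sketched with plain lists standing in for the shuffled `df` and `bc` (the contents are placeholders):

```python
# Placeholders for the already-shuffled DataFrame rows and feature rows.
df_rows = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
bc_rows = list(range(10))

batch_size = 4
batches = []

i, j = 0, batch_size
while i < len(df_rows):
    # The same [i:j] window is applied to both, so rows stay aligned.
    batches.append((df_rows[i:j], bc_rows[i:j]))
    i += batch_size
    j += batch_size
```

Note that with 10 rows and a mini-batch size of 4 the final chunk has only 2 rows; whether to keep or drop such a partial batch is a design choice.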

11.1 Feature array is too large for memory

However, our secondary feature data matrix bc is too large to fit into memory: consider a feature matrix for millions of images with thousands of columns. When the main DataFrame is shuffled at the beginning of each epoch and each mini-batch is selected as a chunk, the appropriate rows of bc must be selected in the same order as the shuffled mini-batch. In such a case, we use the splendid efficiency of bcolz.

Let’s say the index column of the shuffled mini-batch of the DataFrame is called idx and contains the values [13, 5, 1, 7]. Using the example above, our mini-batch would look like this:
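Rebuilding the example DataFrame from earlier in the post, the selection can be sketched as follows (label-based `.loc` is our choice for the illustration):

```python
import pandas as pd

# The 14-row example DataFrame from earlier in this post.
df = pd.DataFrame({
    "object_id": [461756, 1161756, 3303651, 3367756, 3767756, 5467756,
                  5561756, 31255756, 35903651, 44603651, 49557622,
                  58164756, 95403651, 95555756],
    "bi": ["dog", "cat", "dog", "dog", "dog", "cat", "dog",
           "cat", "cat", "dog", "cat", "dog", "cat", "dog"],
    "multi": ["white", "black", "white", "grey", "grey", "black", "white",
              "grey", "black", "black", "black", "grey", "white", "grey"],
})

idx = [13, 5, 1, 7]
batch = df.loc[idx]  # rows come back in the order given by idx
```

The selected rows are object ids 95555756, 5467756, 1161756 and 31255756, in that order, with binomial labels dog, cat, cat and cat.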

Those four image files are read and their pixels returned as rows of X. The index is obtained and is used to index into the bcolz array on disk via:
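To keep the following sketch runnable without bcolz installed, we swap in a NumPy memory-mapped file as the on-disk array; this is a stand-in, not what the accompanying code uses. Our understanding is that an on-disk bcolz array can be indexed with the same list of integers, but check that against your installed version.

```python
import os
import tempfile

import numpy as np

n_rows, n_features = 14, 12
path = os.path.join(tempfile.mkdtemp(), "features.dat")

# Write the full feature matrix to disk once; row k belongs to DataFrame row k.
# Here row k is simply filled with the value k so it can be traced.
on_disk = np.memmap(path, dtype="float32", mode="w+", shape=(n_rows, n_features))
on_disk[:] = np.arange(n_rows, dtype="float32")[:, None]
on_disk.flush()
del on_disk

# Later, inside the generator: open read-only and pull only the rows needed.
bc = np.memmap(path, dtype="float32", mode="r", shape=(n_rows, n_features))
idx = [13, 5, 1, 7]
X2 = np.asarray(bc[idx])  # only these four rows are materialized in memory
```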

Since bcolz arrays stored on disk are compressed and very fast to read, this is an efficient way to store and use sparse matrices without using a special sparse matrix data structure. There may be a not-insignificant performance hit, since the bcolz file on disk must be accessed during each mini-batch, but this is the trade-off made for being able to process huge amounts of feature data with limited machine memory. Well, as Billy Bragg once told me, you have to learn to take the crunchy with the smooth, I suppose.

We implemented a merged model with random X2 data for the (sigh, sadly) dogs and cats in the script classifier_from_little_data_script_dfgen_merged.py. It will not improve your model accuracy (really–it's just random feature data we created) but it will demonstrate how to train a merged model with millions of images and with feature data that will not fit into memory. Be sure to run (and understand) the generator testing code by running the other file first, `python akmtdfgen.py`.

Also, see the code for some potential pain involving multithreading if you have an older version of bcolz and/or its underlying blosc compression library. If you use conda (or not), the following will make sure you have the latest: pip install --upgrade bcolz. The docstring of akmtdfgen.py contains the output of bcolz.print_versions() used for this post.

That’s it! Now you have the tools to train your models on as much data as you can store on disk. You will be time (and budget) constrained but no longer machine memory constrained. See (and run!) the accompanying code to really understand how this all works. Keras is a moving target of goodness and we must stay on our toes.

12 Upcoming posts on merged models, `bcolz` and using Spark in lieu of a GPU

In upcoming posts, Daniel Austin and Ashutosh Sanzgiri, my Data Science colleagues and our resident deep learning experts, will further discuss our holistic use of other technologies at AppNexus, elaborating on using merged models for image classification at scale and on using Spark MLlib and a CPU-cluster in lieu of a GPU-backed machine.

13 Acknowledgments

First, I want to give a huge thank you to Daniel Austin and Allison Krug at AppNexus for making this post happen: he drove the machine learning, she drove the writing. Thanks also to Ashutosh Sanzgiri, Lei Hu, Andre Bickford and Mike Wright for providing great, helpful feedback during writing. Besides the great people at Keras (Francois!) and the infinite font of information at StackOverflow, I also want to thank our fantastic colleagues at AppNexus who have helped us understand, build and use our deep learning applications: my fellow Portlandians Daniel, Ashutosh and Sam Seljan; and the New York gang of Lei, Moussa Taifi, Kannan Sankaran and Alex Tandy. This has been and continues to be a fun and productive true collaboration with a wonderful team.