Writing an Image Data Preprocessor (using DAVIS 2019)

Timothy Lin
6 min read · Mar 6, 2019


Montage of images from the DAVIS dataset (credit: DAVIS Challenge)

When starting out in any new field, the hardest thing is often just to jump in and start playing around. In deep learning, the first thing to do (and usually the linchpin) is to look at the data, so we’ll want an organized way to load the image data and work with it in our Python code.

The goal here is to get you from having a dataset to implementing a basic (but extensible) image processing pipeline that we can feed straight into Keras. Our working example will be the DAVIS 2019 Challenge dataset, but this will apply to other image-based datasets (Berkeley DeepDrive 100K, nuScenes 3D Detection, Google Image Captioning, etc.), and most of it will also apply to any supervised dataset.

The Parts

  1. Keeping Track of the Data
  2. Dealing with Large Data Sets using Generators
  3. Putting It All Together (in a Class)

Keeping Track of the Data

It’s not that we’re afraid we’re going to lose the data; we just want an organized way to access the images.

If we just have a set of disparate images, then all we need is a list of their filenames. We can generate a list of all the files in a particular directory using the os package.
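Here’s a minimal sketch (the image_dir path is a placeholder; point it at wherever your copy of the dataset lives):

```python
import os

# Placeholder path: DAVIS keeps its 480p frames under JPEGImages/480p.
image_dir = os.path.join("DAVIS", "JPEGImages", "480p")

files = []
if os.path.exists(image_dir):
    # os.walk visits every folder under image_dir,
    # yielding (dirpath, dirnames, filenames) at each stop.
    for dirpath, dirnames, filenames in os.walk(image_dir):
        for filename in filenames:
            files.append(os.path.join(dirpath, filename))
```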

Now, files is a list of (the filenames for) all the images that we have access to in that folder. os.path.exists(path) checks whether the path exists on the filesystem, os.walk(path) returns a generator that “walks” the directory tree starting at the path, and os.path.join(path1, path2, ...) takes multiple parts of a path (in this case, the path to the folder and then the filename) and makes a single path string (taking care of “/” so you don’t have to).

If we have videos, that only makes our code a little bit more complex (depending on how “video” information is stored). Instead of storing a list of all the images, we’ll store a dictionary, where keys are the video names and the values are lists of the images in that video. In DAVIS, images are placed in folders based on the video, so we can get the list of videos (and the lists of images) pretty easily.
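A sketch of that, assuming the standard DAVIS layout of one subfolder per video:

```python
import os

image_dir = os.path.join("DAVIS", "JPEGImages", "480p")

# Map each video name to a sorted list of the frame paths in its folder.
video_files = {}
for video in sorted(os.listdir(image_dir)):
    video_dir = os.path.join(image_dir, video)
    if os.path.isdir(video_dir):
        video_files[video] = [
            os.path.join(video_dir, frame)
            for frame in sorted(os.listdir(video_dir))
        ]
```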

Why do we want to keep the images sorted by video though? Sometimes, we want to be able to just see the images from a single video source. Also, our validation split should separate by video. If there are images from a video in both the training and validation set, the validation scores are not as meaningful as they should be (look up “data leakage”).

We can use the same code to load the files for either the input images or the output (target) masks.

Note: One very important thing I’m leaving out is linking each image to its corresponding mask. It might be a fun exercise to figure out.

Loading Images

Now that we have the file paths for the images we want to load, there are a lot of Python image processing libraries that can load them: matplotlib, scikit-image, opencv, pillow, and imageio, just to name a few. The code is really simple and almost identical in every case:

Different ways to read images from an image file in Python
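Something like this (path is a placeholder; each call is that library’s standard image reader):

```python
import matplotlib.pyplot as plt
import skimage.io
import cv2
from PIL import Image
import imageio

path = "some_image.jpg"  # placeholder

img = plt.imread(path)         # matplotlib -> numpy ndarray
img = skimage.io.imread(path)  # scikit-image -> numpy ndarray
img = cv2.imread(path)         # opencv -> numpy ndarray (note: BGR channel order)
img = Image.open(path)         # pillow -> PIL Image object
img = imageio.imread(path)     # imageio -> Image (an ndarray subclass)
```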

I would recommend going with one of the first three (matplotlib, scikit-image, or opencv), as those return a numpy ndarray. (You’d have to convert the others to an ndarray before they’re useful.)

When scripting this, to check that the image loads correctly, just plot the image using matplotlib: plt.imshow(img).

Loading Instance Masks

While we can load the output masks as images using the code above, we also need to do some preprocessing on these images before they can be used for training. The big issue is that we need to one-hot encode the images. They usually come as a single channel (occasionally 3), but need to be one-hot encoded into a 3D numpy array. There’s a lot of code out there to do this for you (you could easily find it on StackOverflow, GitHub, or on a Kaggle starter kernel), but I think it’s worth the exercise to do it once yourself.
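If you want something to check your attempt against, here’s one minimal way with NumPy, assuming the mask comes in as a 2D array of integer class/instance IDs:

```python
import numpy as np

def one_hot_encode(mask, num_classes):
    """Turn an (H, W) integer mask into an (H, W, num_classes) binary array."""
    one_hot = np.zeros(mask.shape + (num_classes,), dtype=np.uint8)
    for c in range(num_classes):
        one_hot[..., c] = (mask == c)  # 1 wherever the mask equals class c
    return one_hot
```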

Dealing with Large Data Sets using Generators

In deep learning, we usually work with very large datasets (on the order of several hundred gigabytes to terabytes). Most of the time, the entire dataset will not fit into memory (even if it does, we don’t really want to clog up memory with data we won’t be using for a while), so we need to use a generator pattern to load a few batches at a time.

Our goal is to be able to do this:
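Something like this, where generate_data is the generator we’ll build below and process is a stand-in for whatever per-batch work you’re doing:

```python
for images, masks in generate_data(batch_size=8):
    process(images, masks)  # stand-in: train, visualize, compute stats, ...
```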

While this looks just like a normal for-loop in Python, the generator is doing something special under the hood. A for-loop over a list of data first creates the entire list (loading all of the data into memory) and then works on each element. A for-loop over a generator is instead lazy and only loads the next element into memory when it’s asked for. At any point in time, only a few elements from the data set are in memory.

Okay, so that’s the goal, how do we actually make a generator?

An Example: Fibonacci Numbers

Let’s step back and look at a simpler example: generating an infinite stream of Fibonacci numbers. If you remember, the rule for generating the next Fibonacci number is to add the previous two. If we were to print out a stream of Fibonacci numbers forever, the code would look something like this:

Printing a stream of Fibonacci numbers without generators
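A minimal version:

```python
# Print the Fibonacci stream forever: each number is the sum of the previous two.
prev, curr = 0, 1
while True:
    print(curr)
    prev, curr = curr, prev + curr
```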

To restyle this as a generator, we write a function that does all of this, but instead of returning prev or curr, it will yield the next number. This works because a yield statement doesn’t hand control of execution back to the caller for good; it yields control and expects to get it back at some point. All of the local variables of the function are saved, and the function continues executing where it left off.

Fibonacci generator function
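Here’s what that looks like, with a small driving loop so it doesn’t print forever:

```python
def fibonacci():
    prev, curr = 0, 1
    while True:
        yield curr  # hand curr to the caller; resume here on the next request
        prev, curr = curr, prev + curr

# The caller drives the generator: each iteration resumes fibonacci()
# exactly where it left off, with prev and curr intact.
for n in fibonacci():
    if n > 100:
        break
    print(n)
```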

There’s actually a lot more “magic” going on under the hood with Python here than I’m describing. I won’t go into that here, but if you’re interested in the details, check out Jeff Knupp’s comprehensive article on generators.

Generators for Data

The pattern for data generators that work with Keras is pretty similar.

General pattern for what data generators look like
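A sketch of the skeleton; load_input and load_target are stand-ins for whatever per-file loading functions you wrote above:

```python
import numpy as np

def generate_data(file_list, batch_size, load_input, load_target):
    while True:  # Keras expects the generator to loop indefinitely
        np.random.shuffle(file_list)  # reshuffle on every pass through the data
        for i in range(0, len(file_list) - batch_size + 1, batch_size):
            batch_files = file_list[i:i + batch_size]
            inputs = np.stack([load_input(f) for f in batch_files])
            targets = np.stack([load_target(f) for f in batch_files])
            yield inputs, targets
```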

Let’s fill in the details using the DAVIS data set:

Sample data generator for the DAVIS data set
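One possible fill-in, assuming the image and mask dictionaries were built the same way (so the path lists line up frame-for-frame) and that all frames share one resolution, as the 480p DAVIS images do:

```python
import numpy as np
import matplotlib.pyplot as plt

def generate_data(image_files, mask_files, batch_size):
    # Flatten the {video: [frame paths]} dicts into two aligned lists.
    image_paths, mask_paths = [], []
    for video in sorted(image_files):
        image_paths.extend(image_files[video])
        mask_paths.extend(mask_files[video])

    indices = np.arange(len(image_paths))
    while True:
        np.random.shuffle(indices)
        for i in range(0, len(indices) - batch_size + 1, batch_size):
            batch = indices[i:i + batch_size]
            images = np.stack([plt.imread(image_paths[j]) for j in batch])
            # Masks would still need the one-hot step from earlier.
            masks = np.stack([plt.imread(mask_paths[j]) for j in batch])
            yield images, masks
```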

Using Generators

Now that we have a generator for our data, we can use it ourselves in a for-loop like above (e.g. to print out the input image and output masks to compare), but we don’t have to do that for training Keras models. The Keras Model and Sequential classes have methods of different “flavors.” You have the usual fit(), predict(), and evaluate() methods that take the entire data set as a parameter, but you also have versions that take generators as parameters: fit_generator(), predict_generator(), and evaluate_generator().

Options for using the generator
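A sketch, assuming you’ve already split the file dictionaries into train/validation sets, built a compiled Keras model, and counted your training images:

```python
batch_size = 8
train_gen = generate_data(train_image_files, train_mask_files, batch_size)

# Option 1: drive the generator ourselves, e.g. to eyeball a batch.
images, masks = next(train_gen)

# Option 2: hand it straight to Keras (the Keras 2.x generator API).
model.fit_generator(
    train_gen,
    steps_per_epoch=num_train_images // batch_size,
    epochs=10,
)
```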

Putting It All Together in a Class

To streamline this process, it would be nice to put all of this data preprocessing work into a class (one class for every dataset we use).

Full DAVIS preprocessor class
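Here’s a sketch of what that class might look like, pulling together the file tracking and the generator (the class name is mine, the paths assume the standard DAVIS folder layout, and the one-hot step is still left as an exercise):

```python
import os
import numpy as np
import matplotlib.pyplot as plt

class DavisPreprocessor:
    """Tracks DAVIS image/mask paths and serves batches via a generator."""

    def __init__(self, root, resolution="480p"):
        self.image_dir = os.path.join(root, "JPEGImages", resolution)
        self.mask_dir = os.path.join(root, "Annotations", resolution)
        self.videos = sorted(os.listdir(self.image_dir))
        # One sorted list of frame paths per video, for images and masks alike.
        self.image_files = {v: self._frames(self.image_dir, v) for v in self.videos}
        self.mask_files = {v: self._frames(self.mask_dir, v) for v in self.videos}

    @staticmethod
    def _frames(folder, video):
        video_dir = os.path.join(folder, video)
        return [os.path.join(video_dir, f) for f in sorted(os.listdir(video_dir))]

    def generate_data(self, batch_size):
        image_paths = [p for v in self.videos for p in self.image_files[v]]
        mask_paths = [p for v in self.videos for p in self.mask_files[v]]
        indices = np.arange(len(image_paths))
        while True:
            np.random.shuffle(indices)
            for i in range(0, len(indices) - batch_size + 1, batch_size):
                batch = indices[i:i + batch_size]
                images = np.stack([plt.imread(image_paths[j]) for j in batch])
                masks = np.stack([plt.imread(mask_paths[j]) for j in batch])
                yield images, masks
```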

And with all that work done, it’s really simple to use:

Simple usage of the preprocessor class
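Something like:

```python
import matplotlib.pyplot as plt

davis = DavisPreprocessor("DAVIS")  # path to the unzipped dataset
gen = davis.generate_data(batch_size=8)

images, masks = next(gen)  # pull one batch to sanity-check
plt.imshow(images[0])
plt.show()
```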

Hopefully this short tutorial gives you some control over the huge (or even not-so-huge) amounts of data you’re working with. There are a lot of things that weren’t covered that you’d need to make this fully functional, though. Some ideas:

  • A way to store/access the output classes for each instance (maybe also instance IDs)
  • A lot of other preprocessing steps can be added (normalization, one-hot encoding, scaling/padding, augmentation, etc.) as needed
  • Matching input images with their corresponding masks
  • Splitting into train/val sets (on videos)
  • Parameterizing the generate_data() method (e.g. do you always want to shuffle?)

Let me know if you have any questions or suggestions!
