A new approach to dataset creation for deep learning

Anthony Chaudhary
Published in Diffgram
Jun 4, 2019

How you can use active directories to build active data.

Motivation

Creating datasets is challenging. Usually it’s thought of as a static process. Data is collected for some period of time, and then the data gets labelled, models get trained, and the results are … well the results. It’s difficult to “go back” since the process is broken up into discrete steps.

One of the best ways to improve deep learning performance is to improve the data. This could mean changing the way the data is collected, changing what classes are used to represent the data, or adding more data.

What if there was a way to actively improve datasets? What if we expected the data to change?

Same file, mirrored in two different directories

Imagine if we could:

  • Change one annotated image and retrain on demand.
  • Create multiple sets that are automatically updated as the master set is annotated.
  • Change your datasets as often as you like, without having to manage the versions.

Introducing Diffgram Directories

Directories represent a set of files. A File contains raw data (like an image) and Instances of encoded meaning (boxes, polygons, etc.). A File works similarly to how we normally think about files, e.g. on a desktop, with a few twists.
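As a rough mental model, the relationship between these three concepts might be sketched like this. Note these class names are illustrative only, not the actual Diffgram SDK:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the data model described above;
# these classes are hypothetical, not the Diffgram SDK API.

@dataclass
class Instance:
    label: str    # e.g. "dog"
    kind: str     # e.g. "box" or "polygon"
    coords: list  # encoded geometry

@dataclass
class File:
    name: str
    raw_data: bytes                               # e.g. image bytes
    instances: list = field(default_factory=list) # annotations on this file

@dataclass
class Directory:
    name: str
    files: list = field(default_factory=list)     # references to Files, not copies
```

The key point is the last comment: a Directory holds references to Files, which is what makes the mirroring below possible.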

Example

Mirrored files

A mirrored file reflects changes across all directories the File is in. Like a shortcut! This means, for example, if you wanted to create a separate directory of, say, all the dog images, you would:

  • Upload Cat and Dog images
  • Select the Dog images
  • Mirror them to the “Dogs” directory.

Now as the images get labelled, the labels will be accessible in both directories.
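In toy form (hypothetical names again, not the SDK), a mirrored file is just the same object referenced from two directories, so an annotation added through either one is visible in both:

```python
# Toy illustration of mirroring: both directories hold a reference
# to the SAME file object, so labels show up in both places.
class File:
    def __init__(self, name):
        self.name = name
        self.instances = []  # annotations (boxes, polygons, ...)

default_dir = []  # holds the uploaded Cat and Dog images
dogs_dir = []     # the "Dogs" directory

f = File("dog_001.jpg")
default_dir.append(f)
dogs_dir.append(f)  # mirror: a reference, like a shortcut

# Annotate via the default directory...
default_dir[0].instances.append({"label": "dog", "type": "box"})

# ...and the label is accessible from the Dogs directory too.
print(dogs_dir[0].instances)  # [{'label': 'dog', 'type': 'box'}]
```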

Export or train from any directory

Simply select the directory when you go to export:

Or train!

Copy, without Instances

When copying a file, it’s optional to copy the Instances. To illustrate the need for this, consider a case where there are 25 examples of one class, for every 1 example of another.

In this case, it may make more sense to actually train two separate models. Imagine a dataset of 1000 images where:

  • ~50 Trees in every image
  • ~2 Dogs in every image

If we naively annotated all the images, we would need 52 * 1,000 = 52,000 instances. That's 52k instances, but only 2k of them are dogs!

Illustration only. ¹ Plus there are 4 dogs in this picture!

Instead, what if we create a model with, say, 100 images of Trees, then create a separate model for the Dogs?

  • Trees, 100 Images * 50 Instances = 5,000 Instances
  • Dogs, 3,000 Images * 2 Instances = 6,000 Instances

Now we have more balanced datasets, with 3x as many Dog instances. And we annotated only 11k instances. Or roughly 5x less work!
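The arithmetic above can be checked directly (all numbers are from the example; the "roughly 5x" figure is in fact closer to 4.7x):

```python
# Naive approach: annotate every instance in every image.
images = 1000
trees_per_image, dogs_per_image = 50, 2
naive = images * (trees_per_image + dogs_per_image)  # 52,000 instances
naive_dogs = images * dogs_per_image                 # only 2,000 dogs

# Split approach: a small Tree set and a larger Dog set.
split_trees = 100 * trees_per_image   # 5,000 instances
split_dogs = 3000 * dogs_per_image    # 6,000 instances
split = split_trees + split_dogs      # 11,000 instances total

print(naive, split)                # 52000 11000
print(split_dogs / naive_dogs)     # 3.0x the Dog instances
print(round(naive / split, 1))     # roughly 4.7x less annotation work
```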

To do this in Diffgram we:

  • Upload all the images to the default directory (Let’s assume this will be “Trees”)
  • Create a new directory “Dogs”
  • Copy the files. Since we have the option to copy only the images (not the Instances), we could even do this after annotating the trees, i.e. we can do it at any time; it doesn't have to be "upfront".
  • Annotate only the Dogs in the new directory.
  • Now when we train models, we choose the Dogs directory as the source for the Dog model and the Trees directory as the source for the Trees model. Both models can train at the same time, and both can run together locally.
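Sketched generically (the train_model function and the directory lists here are stand-ins, not Diffgram calls), the two-model setup amounts to pointing each training run at a different directory:

```python
# Hypothetical stand-ins: a directory modeled as a list of file paths,
# and train_model as whatever training entry point you already use.
def train_model(name, files):
    print(f"training {name} on {len(files)} files")
    return {"name": name, "num_files": len(files)}

trees_dir = [f"trees/img_{i}.jpg" for i in range(100)]
dogs_dir = [f"dogs/img_{i}.jpg" for i in range(3000)]

# Each directory is the source for its own model; the two runs are
# independent, so they can train in parallel.
tree_model = train_model("trees", trees_dir)
dog_model = train_model("dogs", dogs_dir)
```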


Thanks for reading!

Try this yourself — on your own data

Create a free account

Tutorials:

Create Training Data

Train and deploy models in your app using our SDK.

Validate an existing deep learning model

1. Image credit: http://www.fannygott.com/update-on-spy/screen-shot-2017-04-06-at-12-47-22-pm
