How to optimize your custom Dataset creation?

Thibaut Lucas
6 min read · Mar 3, 2020


Welcome to my article series on Dataset Creation Optimization. This guide is meant to help you find the right tools and techniques to get from the idea of your project to the training of your algorithm as fast as possible.

Indeed, a data science project is often iterative: you get some data, annotate it, train on it, test the results, and then go back to data collection because you didn’t reach the accuracy you wanted. BUT it doesn’t have to be this way, and that’s what I will explain in this guide.

This guide is in four parts:

  • Part 1: Introduction to the tools and techniques
  • Part 2: Good practices to assist and accelerate annotation
  • Part 3: Use Active Learning to select the best data to annotate and train on
  • Part 4: Online Learning and why it’s crucial to break the iteration loop

Why do I need a custom Dataset?

You woke up this morning with a brilliant idea for an application that will leverage Computer Vision or NLP to disrupt the world of tomorrow. Fine, but did you know that you will need to build a large dataset to get a powerful model?

I think that you should consider it for two main reasons.

1) There isn’t a pre-trained model for every use-case

While there are plenty of powerful pre-trained models available on the web nowadays, they are usually trained on large public datasets like COCO and will not help you if you want to do, say, anomaly detection on power lines.

2) Performance

The second reason is the most important one: today’s neural networks provide really narrow expertise on specific use cases, and we have not yet managed to make “AI” generalize.

Say, for example, that you want to build a Deep Learning system to count people in your shop.

A pre-trained model will surely do the work if your camera is placed above your front door, but what will happen if you decide to place the camera elsewhere, with a different point of view than the one the training Dataset provided?

Spoiler alert: the results won’t be terrible, but they won’t be good enough to generate value.

So, how do I optimize my custom Dataset creation?

Custom Dataset creation can be broken down into 3 questions:

  1. Do I have enough pictures in my Dataset?
  2. Do I have the right pictures in my Dataset?
  3. How can I annotate them quickly?

I’ll briefly answer each question in this article, and I’ll develop each point in more depth in the separate articles that come next!

How many pictures do I need?

You can find multiple answers to that question, but there is a popular “rule of thumb” for training data quantity.

This rule says that you should have 1,000 examples per class, but things get trickier if you aim for semantic segmentation or object detection use cases.
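As a very first sanity check against this rule of thumb, here is a minimal sketch that counts annotated examples per class. It assumes your annotations live in a COCO-style JSON file; the path below is only a placeholder.

```python
import json
from collections import Counter

# Rough sanity check against the "1,000 examples per class" rule of thumb.
# Assumes a COCO-style annotation file; the path is a placeholder.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])

for name, count in counts.most_common():
    status = "OK" if count >= 1000 else "probably needs more data"
    print(f"{name}: {count} examples ({status})")
```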

A note from Alexey (author of the most popular darknet fork) on how to build an object detection dataset:

For each object which you want to detect — there must be at least 1 similar object in the Training dataset with about the same: shape, side of object, relative size, angle of rotation, tilt, illumination. So desirable that your training dataset include images with objects at diffrent: scales, rotations, lightings, from different sides, on different backgrounds — you should preferably have 2000 different images for each class

There are programmatic methods to know whether you have enough pictures in your Dataset; I’ll tell you about them in Part 3 of this series.

Do I have the right pictures in my Dataset?

ALERT: THIS IS REALLY IMPORTANT.

A huge challenge in computer vision and deep learning is to know if all the pictures in a Dataset are relevant for training.

Not paying attention to this guarantees that you will pay the maximum price in compute instances and spend the maximum time in training. Why endure this suffering if you can avoid it?

But how can I determine whether a picture is relevant or not?

You can’t… but your computer can. At Picsell.ia, we care a lot about this question; this research field is connected to active learning, and we will publish a detailed article about it soon (Part 3 of this series).

But to give you a quick answer, everything boils down to the statistical distribution of the pictures’ convolutional features.

One simple way to see if all your data is relevant is to extract the features for each image, then use PCA to reduce the dimensionality of this big matrix, and finally apply clustering to identify the classes that you would like to detect.

Then you can compute the homogeneity of each cluster: if homogeneity is very high, your pictures probably look too much alike, and it’s likely that not all of your data is relevant.
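To make this concrete, here is a minimal sketch of that pipeline. The feature extractor (a pre-trained ResNet-18), the image folder, and the number of PCA components and clusters are all assumptions on my side, and the mean distance to the cluster centre is used as a simple stand-in for a homogeneity measure.

```python
import glob
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Pre-trained ResNet-18 with its classification head removed: a generic feature extractor.
backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Extract one feature vector per image ("my_dataset/" is a hypothetical folder).
features = []
for path in glob.glob("my_dataset/*.jpg"):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        features.append(backbone(img).squeeze(0).numpy())
features = np.array(features)

# Reduce the dimensionality of this big matrix with PCA, then cluster.
n_components = min(50, features.shape[0], features.shape[1])
reduced = PCA(n_components=n_components).fit_transform(features)
kmeans = KMeans(n_clusters=5, random_state=0).fit(reduced)  # 5 clusters is arbitrary

# Homogeneity proxy: mean distance of each image to its cluster centre.
# A very small value means highly homogeneous clusters, i.e. your pictures are
# probably redundant and not all of them bring new information.
spread = np.linalg.norm(reduced - kmeans.cluster_centers_[kmeans.labels_], axis=1).mean()
print(f"mean distance to cluster centre: {spread:.3f}")
```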

But once again, I’ll explain more and share complete source code in Part 3 of this series.

Now, how can I annotate these pictures quickly?

Manual annotation is really time consuming, but fortunately there are a few techniques that can help you speed up this process.

1. Pre-annotate images with an open pre-trained model

Sometimes, you need to annotate objects that are really similar to objects present in the COCO dataset, for example.

Let’s say you want to annotate basketball players: you can first run Mask R-CNN on all your pictures and then change the labels to match your classes.

(Image: French Basketball team winning over the USA ;)

Once your pictures are pre-annotated, you just have to change the labels to your own classes, like “shooting”, “defending”, etc.
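As an illustration, here is a minimal sketch of that pre-annotation step with torchvision’s Mask R-CNN pre-trained on COCO. The image path, confidence threshold, and the way draft labels are stored are all assumptions on my side.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Mask R-CNN pre-trained on COCO, used only to produce draft annotations.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

COCO_PERSON_ID = 1  # "person" in the COCO label map

img = Image.open("game_frame.jpg").convert("RGB")  # hypothetical picture
with torch.no_grad():
    prediction = model([to_tensor(img)])[0]

draft_annotations = []
for label, score, mask in zip(prediction["labels"], prediction["scores"], prediction["masks"]):
    if label.item() == COCO_PERSON_ID and score.item() > 0.7:
        draft_annotations.append({
            "mask": (mask[0] > 0.5).numpy(),  # binary mask for one player
            "label": "player",  # to be refined by hand into "shooting", "defending", ...
        })

print(f"{len(draft_annotations)} players pre-annotated, ready for manual relabeling")
```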

2. Use computer vision algorithms (without AI)

Deterministic computer vision is really powerful.

Algorithms like GrabCut can precisely segment lots of things in a picture if you help them just a little bit. Using this type of interactive segmentation can drastically reduce the annotation time spent per image.
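For example, with OpenCV you can run GrabCut from a rough bounding box drawn by the annotator; the image path and box coordinates below are just placeholders.

```python
import cv2
import numpy as np

# Load the picture and the rough bounding box drawn by the annotator (placeholders).
img = cv2.imread("power_line.jpg")
rect = (50, 50, 300, 200)  # (x, y, width, height) around the object

mask = np.zeros(img.shape[:2], dtype=np.uint8)
bgd_model = np.zeros((1, 65), dtype=np.float64)
fgd_model = np.zeros((1, 65), dtype=np.float64)

# GrabCut iteratively separates foreground from background inside the box.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as (probable) foreground become the segmentation mask.
segmentation = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
cv2.imwrite("power_line_mask.png", segmentation)
```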

Super-pixel segmentation is another CV algorithm that can help you a lot: you can select whole regions generated by this segmentation, which dramatically reduces the number of clicks needed.
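The sketch below illustrates the idea with scikit-image’s SLIC algorithm: the annotator clicks on superpixels instead of drawing every contour. The file name, number of segments, and clicked pixel are assumptions on my side.

```python
from skimage import io, segmentation

# Split the picture into superpixels: visually coherent regions the annotator
# can select with a single click instead of outlining them pixel by pixel.
img = io.imread("shop_camera.jpg")  # hypothetical picture
superpixels = segmentation.slic(img, n_segments=300, compactness=10)

# Toy example: pretend the annotator clicked on pixel (120, 200); the whole
# superpixel containing that pixel is added to the annotation in one click.
clicked_region = superpixels == superpixels[120, 200]
print(f"{clicked_region.sum()} pixels annotated with a single click")
```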

3. Deep Learning-based interactive segmentation

Another method to speed up polygon annotation is CNN-based interactive segmentation; the state of the art is the Deep Extreme Cut (DEXTR) algorithm: https://github.com/scaelles/DEXTR-PyTorch

It allows you to click on only the 4 extreme points of an object (north, south, east, west) to generate its segmentation.

4. Online learning

This is the Holy Grail of data annotation. Online learning is the concept of training a neural network sequentially, which means you iteratively train your NN with each new annotation that you create.

In other words, you build your model just by annotating data, and this model then labels the next images for you: a virtuous circle.
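To give a rough feel for the mechanics, here is a very simplified sketch of that loop. The model, class names, learning rate, image stream, and annotation function are all placeholders, not a production recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# A pre-trained backbone that gets a small update after every new annotation.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 3)  # e.g. "shooting", "defending", "other"
model.eval()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def image_stream():
    """Placeholder: yields preprocessed image tensors as they get collected."""
    for _ in range(10):
        yield torch.randn(3, 224, 224)

def annotate(image):
    """Placeholder: the model suggests a label, the human confirms or fixes it."""
    with torch.no_grad():
        return model(image.unsqueeze(0)).argmax(dim=1)

for image in image_stream():
    label = annotate(image)  # the annotator's (confirmed) label
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(image.unsqueeze(0)), label)  # one small online update
    loss.backward()
    optimizer.step()
    model.eval()  # back to inference mode for the next suggestion
```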

You can find a great article about online learning here:

Part 4 of this series will dig deeper into Online Learning.

Thibaut… all you said is great, but how can I manage all of this by myself?

Don’t worry: this article series has 4 parts, and I will walk you through most of the things written here!

So stay tuned for the next 3 articles; we will talk maths, Python, and deep learning applied to Dataset creation!

  • Part 1: Introduction to the tools and techniques (you’re done with this one now)
  • Part 2: Good practices to assist and accelerate annotation
  • Part 3: Use Active Learning to select the best data to annotate and train on
  • Part 4: Online Learning and why it’s crucial to break the iteration loop

Don’t forget to support us if you want the other parts to come out quickly!

In the meantime, you can start optimizing your custom Dataset creation by using our web platform; we’ve already built everything up for you!

You can find our platform here: www.picsellia.com

I wish you a nice day and a great Dataset creation Journey :)
