Introducing What To Label

Igor Susmelj
Published in Lightly
Feb 24, 2020 · 4 min read


Are you curious about research areas such as active, self-supervised, and semi-supervised learning and how we can optimize datasets rather than optimizing deep learning models? You’re in good company, and this blog post will tell you all about it!

In this post, you will learn about our journey as an emerging company in this new field and what we have learned about why and how focusing on dataset optimization can improve deep learning models. Plenty of blogs and tutorials already cover architecture search, hyperparameter optimization, and similar topics, so we won't discuss those here.

Illustration showing a common problem with data annotation: Where should I start?

Should I spend time on optimizing my dataset?

Much recent research fixes a dataset and treats it as a benchmark for comparing architectures or training and regularization strategies. Recent papers such as Billion-scale semi-supervised learning for image classification or Self-training with Noisy Student use pre-training on larger datasets to boost test accuracy on the famous ImageNet dataset. Still, the focus remains on the architecture or the training procedure.

We propose a less explored area of research: fixing the architecture and training method while varying the training data.
Let me illustrate the process: A common dataset…
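To make the idea concrete, here is a minimal sketch of that experimental setup, assuming a PyTorch/torchvision environment with CIFAR-10 as a stand-in dataset. The random subset used here is only a placeholder for a smarter selection strategy; the point is that the model and training recipe stay fixed while only the training data changes.

```python
# Minimal sketch: keep the model and training recipe fixed, vary only the
# training subset. A random subset stands in for a smarter selection method.
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader, Subset


def train_and_evaluate(subset_indices, epochs=10):
    """Train a fixed model on a subset of CIFAR-10 and return test accuracy."""
    transform = T.Compose([T.ToTensor()])
    train_set = torchvision.datasets.CIFAR10(
        root="data", train=True, download=True, transform=transform
    )
    test_set = torchvision.datasets.CIFAR10(
        root="data", train=False, download=True, transform=transform
    )

    train_loader = DataLoader(Subset(train_set, subset_indices), batch_size=128, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=256)

    # Fixed architecture and training recipe for every experiment.
    model = torchvision.models.resnet18(num_classes=10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    # Evaluate on the held-out test set.
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    return correct / len(test_set)


# Compare labeling budgets: only the selected data changes, never the model.
budget = 5000
random_indices = torch.randperm(50000)[:budget].tolist()
print("random subset accuracy:", train_and_evaluate(random_indices))
```

Swapping `random_indices` for indices chosen by an active learning or self-supervised selection strategy, while leaving everything else untouched, is exactly the kind of comparison this line of research is about.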
