Dataset Management for Computer Vision

ODSC - Open Data Science
6 min read · Mar 12, 2021

When building computer vision solutions, the emphasis is usually on the modeling side and on leveraging the latest algorithm. While the model is important, in my experience an even more important component of delivering a successful solution is building and maintaining a suitable dataset through efficient dataset management.

Why dataset management is key

From a technical standpoint:

Let’s look at things from a representation theory standpoint. For any supervised task, modeling is essentially about teaching a model a representation of the world that allows going from the input data to our labels.

In “classic machine learning,” we do this via feature engineering: essentially, reshaping the surface of the loss function to make it more tractable, so that it is easier for a model to learn.

The same holds true in computer vision. One difference is that when using deep learning, we have little explicit control over the feature engineering part of the process.

The job of making the loss landscape smoother can however still be carried out by building and maintaining a suitable dataset. That’s a big, golden knob we can turn and spin to control our experiments with whatever fancy model we use.

From a more intuitive standpoint:

Before training, a model is a blank canvas*. It doesn’t know what a line is, it doesn’t know what an object is — we are teaching it everything we need it to know, and we are doing this by showing it pictures and labels.

This means that we need both:

  • to be extremely careful in what we teach
  • to try and find the easiest ways to pass the message.

* In many cases, one starts with a model that has already been pre-trained, so it actually knows something about the world. There are some nuances to it of course, but the same principles discussed here apply.

What to look out for — the “ah moments”

Looking back at the work that I have done, I have tried to distill some key aspects to look out for.

Those listed below have all led me to some “ah” moments, either when building or when maintaining a model, and as such, I now explicitly set them as action points when working on a new model:

  • Disentangle business goals from the modeling task
  • Understand the domain well
  • Think about corner cases and assumptions, and express them in the dataset
  • Think about bias in the input data
  • Think about data drift once live — how to catch it, how to act on it

Corner cases and assumptions

There are many cases in which it might be hard for a model to perform well at identifying what we want it to identify.

This could be due to images being of poor quality, being taken from a different angle, being partially obstructed, etc. Many of these can be called “corner cases,” but they might be prominent enough to be key to the success or failure of our project.

Unfortunately, we don’t know how much of a corner case they really are unless we explicitly ask ourselves the question.

A simple example explains best how I approach asking and trying to answer it:

Let’s imagine we want to teach a model to recognize dogs. In our training set, we might find the pictures below, and we might wonder how to deal with them.

Do we show them to the model? What do we tell the model?

Those are all cases that our model might face when deployed live. If we have these examples in our training dataset, the best thing to do is to tackle them explicitly, whether or not we end up being successful, and whether or not “tackling them” ends up meaning that we ignore them.

Case 1: full dog

This is a simple example. Here we would simply draw a box around the dog (assuming an Object Detection task for simplicity), and we are good.

Case 2: half-a-dog?

What if, instead of this picture, we had one where we only got part of the dog? How would we approach it?

Case 3: what’s that?

Taking case 2 to an extreme, what if we just have cases where we can only see a tail?

What to do in these cases

There are not always clear-cut answers. One approach that has worked well for me across domains is to put myself in the “model’s shoes.” What might it think? What would I want it to think?

If we label the tail alone as “Dog,” we are effectively telling the model that the tail IS a “Dog” — i.e. the tail and a full complete dog are the very same thing.

Leaving aside ontological and philosophical considerations about this (which are also super intriguing!), in practical terms this can be problematic because the model learns based on what it sees. Maybe the model could learn to recognize the head, the ears, and the shape of a dog — but we make the job that much harder by passing it conflicting information.

What’s more interesting is that this is not necessarily bound to fail. Maybe the model can effectively learn underlying concepts such as body parts without being expressly told to. Or maybe it can’t. (For reference: recent AI research has been investigating this. It is beyond the scope of this article, but look up “disentangling representations.”)

My line of action in this case is:

  1. Consider whether it is important to take on these corner cases
  2. If it is, try to be explicit about them: I set up an experimental framework to try and take them on.

So, back to the example, I would first ask:

  • Whether we can expect to have many pictures of tails alone in our images, for whatever reason
  • What the impact on the bottom line would be if the model failed to recognize tails: maybe it matters, maybe it doesn’t, depending on the application
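
The first of these points can be estimated roughly by counting, assuming the annotations already record how much of the dog is visible in each box. The sketch below is only an illustration: the `visibility` field, its values, and the `corner_case_share` helper are hypothetical names introduced here, not part of any standard annotation format.

```python
from collections import Counter

# Hypothetical annotation records, one per labeled box.
# The "visibility" field is an assumption made for this sketch.
annotations = [
    {"image": "img_001.jpg", "visibility": "full"},
    {"image": "img_002.jpg", "visibility": "partial"},
    {"image": "img_003.jpg", "visibility": "tail_only"},
    {"image": "img_004.jpg", "visibility": "full"},
]


def corner_case_share(records, key="visibility"):
    """Return the fraction of records falling under each value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


shares = corner_case_share(annotations)
print(shares)
```

A small share of tail-only boxes does not by itself mean they can be ignored; the second point, the impact on the application, still has to be judged separately.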

It’s important to ask ourselves whether that matters, as accounting for corner cases has trade-offs and can increase the effort and cost needed to build the model.

If the answers to both questions point to us needing to take on the problem, I would then annotate the input images explicitly, so that I can test what works:

  • Use “dog” to represent both full dogs and tails
  • Use “dog” to represent full dogs, ignore tails
  • Use “dog” to represent full dogs, use “tail” to represent tails

An efficient way to do this is to have granular annotations in the input data, which we can then easily combine in code as we wish in order to run explicit experiments.
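
As a minimal sketch of this idea, assume each box is stored once with a granular label such as `full_dog` or `dog_tail`. These label names, the scheme names, the `[x, y, width, height]` box layout, and the `apply_scheme` helper are all hypothetical choices made for illustration. Each of the three experiments then becomes a mapping from granular labels to training labels, with `None` meaning the box is dropped:

```python
# Granular labels stored once in the dataset (hypothetical names).
# Boxes are assumed to be [x, y, width, height] in pixels.
annotations = [
    {"image": "img_001.jpg", "bbox": [34, 50, 200, 180], "label": "full_dog"},
    {"image": "img_002.jpg", "bbox": [10, 20, 40, 60], "label": "dog_tail"},
    {"image": "img_003.jpg", "bbox": [5, 15, 220, 160], "label": "full_dog"},
]

# One mapping per experiment; None means "ignore this box".
SCHEMES = {
    "tails_as_dog": {"full_dog": "dog", "dog_tail": "dog"},
    "ignore_tails": {"full_dog": "dog", "dog_tail": None},
    "separate_tail": {"full_dog": "dog", "dog_tail": "tail"},
}


def apply_scheme(records, mapping):
    """Return copies of the records with labels remapped for one experiment."""
    remapped = []
    for record in records:
        target = mapping[record["label"]]  # a KeyError flags an unmapped label
        if target is None:
            continue  # this scheme ignores the box
        remapped.append({**record, "label": target})
    return remapped


for name, mapping in SCHEMES.items():
    labels = [r["label"] for r in apply_scheme(annotations, mapping)]
    print(name, labels)
```

Because the granular labels are never overwritten, adding a fourth experiment only requires a new mapping, not a new round of annotation.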

That is all for now! If you found this article on dataset management interesting, I will go into more detail and touch on other points during my ODSC East 2021 talk, “Dataset Management for Computer Vision: Possibly the Most Underrated and Important Component to Delivering Successful Computer Vision Solutions in Real-life.”
