All data is not created equal

Why more data isn’t better

Justin Tenuto
Alectio
4 min read · Mar 26, 2020


Perhaps the most pervasive misconception in data science is "the more data, the better." Just think about how many people you know who have been collecting and hoarding as much of it as possible. Think of how many colleagues and bosses have admonished you to do the same. There's this idea that piling up more and more data will make models more accurate. Some even assume that collecting more data can magically fix struggling models.

None of that is true.

None of that is true because not all data is created equal. Broadly speaking, any dataset can be broken into three buckets: useful, useless, and hurtful.

Useful data is, well, useful. It’s well labeled and it helps your model. For a facial recognition model, think a clear, composed photograph of a close friend.

What facial recognition algorithm wouldn’t want this image?

Useless data isn’t inherently helpful or hurtful. This can actually be a pretty big category (and we’ll get into specifics in a moment), but for the same facial recognition model, think about a picture where someone’s wearing novelty sunglasses or you’re seeing mostly the back of their head.

I’m sorry but your glasses are too large

Hurtful data makes your model perform worse. The simplest example? Mislabeled data. If you feed that facial recognition algorithm pictures of cats labeled as human faces, you’d expect some weird results to come out of that model.
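One common way to surface candidates for mislabeling (a generic technique, not necessarily how Alectio does it) is to look for samples where an already-trained model confidently disagrees with the label it was given. Here's a minimal sketch, assuming you have per-sample predicted class probabilities; the function name and threshold are illustrative:

```python
import numpy as np

def flag_suspect_labels(probs, labels, threshold=0.9):
    """Flag samples where the model confidently disagrees with the given label.

    probs:     (n_samples, n_classes) predicted class probabilities
    labels:    (n_samples,) integer labels as provided by annotators
    threshold: minimum confidence in a *different* class to raise a flag
    """
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)      # model's most likely class
    confidence = probs.max(axis=1)        # how sure the model is about it
    disagrees = predicted != labels       # prediction contradicts the label
    return np.where(disagrees & (confidence >= threshold))[0]

# Hypothetical usage: send the flagged images back for human review
# instead of training on them blindly.
# suspect_idx = flag_suspect_labels(model_probs, annotator_labels)
```

Samples flagged this way aren't automatically wrong, of course; they're just the cheapest place to start looking for cat pictures labeled as faces.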

Sure, she’s adorable but this isn’t helping our algorithm

In other words, having more data isn’t always a good thing. In fact, if you have more useless or harmful data, it’s actively a bad thing! That’s why smart companies prioritize data curation before training their models and intelligent data collection as they continue to train them. Knowing what you’re feeding your model is paramount in setting it up for success. And it’s exactly what we do here at Alectio.

Now, we realize the examples above were a bit breezy, so we want to dig in a bit and give you some additional color. Conceptually, useful data is the easy part. It's well labeled and it improves performance. We won't spend much time here, although our process at Alectio centers around identifying this exact sort of data. It's just that the concept isn't difficult. It's the flip side of that old "garbage in, garbage out" adage: train your model on well-labeled, well-curated data and it's going to perform better than if you trained it on the opposite. Academic datasets, notably, are full of this kind of data. It's why they tend to be so easy to work with.

Hurtful data, meanwhile, isn't always just a mislabeled record. Think of data that comes from faulty sensors, is somehow corrupted, or is just plain spam. There are many flavors of hurtful data, but the more you train your model on it, the worse off it'll be and the harder it will be to bring it back to something approaching accuracy later on.
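What most of these flavors have in common is that they can often be caught with cheap sanity checks before they ever reach the training loop. A rough sketch for tabular sensor data, where the column names and physical ranges below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical physical limits per sensor column; real ranges come from your domain.
VALID_RANGES = {"temperature_c": (-40, 85), "speed_kmh": (0, 250)}

def drop_hurtful_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that are corrupted or physically implausible."""
    clean = df.dropna(subset=list(VALID_RANGES))            # corrupted / missing readings
    for column, (low, high) in VALID_RANGES.items():
        clean = clean[clean[column].between(low, high)]      # faulty-sensor values out of range
    return clean

# Usage sketch:
# clean_df = drop_hurtful_rows(raw_sensor_df)
```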

Useless data is a bit more nuanced. Redundant data, for example, isn't inherently useless. Sticking with our hypothetical facial recognition algorithm, multiple pictures of the same person are certainly redundant, but if the person's wearing brighter makeup, or the image quality varies, or the angle of the photograph is different? Those could actually be valuable. A lot of redundant labels in a single image (like a bunch of cars in a frame used to train an autonomous vehicle algorithm) might be fine unless there are simply too many instances. It's nuanced, and there are no hard-and-fast rules here.

Duplicative data, on the other hand, is more often useless. Here, think of taking twelve pictures back-to-back of the same person in the same lighting. Worse, duplicative data can skew your training set, introducing class imbalance and nudging the model toward overfitting.
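One way to separate those back-to-back shots from usefully-redundant ones (again, a sketch of a common approach, not a description of Alectio's pipeline) is to compare image embeddings: pairs above a very tight similarity threshold are good candidates to drop, while moderately similar images may still carry new lighting, angles, or occlusions worth keeping. Assuming you already have a feature vector per image from any pretrained encoder:

```python
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.98):
    """Return index pairs whose cosine similarity exceeds the threshold.

    embeddings: (n_images, dim) feature vectors from any pretrained encoder.
    A very high threshold catches back-to-back shots of the same scene;
    moderately similar images (same face, new angle or lighting) survive.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sims = X @ X.T                                      # pairwise cosine similarity
    i, j = np.triu_indices(len(X), k=1)                 # upper triangle, no self-pairs
    mask = sims[i, j] >= threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))

# Usage sketch: keep one image from each near-duplicate pair,
# then re-check your class balance before training.
```

The threshold here is the judgment call: set it too low and you throw away the valuable variation described above; set it too high and the twelve identical shots stay in.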

Lastly, it’s worth noting that outliers aren’t necessarily good or bad. They’re just atypical. A mislabeled image is an outlier because it’s bad. But take something like images of a car accident when you’re training an autonomous driving model. It’s very real and very important but it might be an outlier in your dataset.

Here at Alectio, we help our customers find the good data in vast, unlabeled datasets. We use active learning, reinforcement learning, meta-learning, information theory, entropy analysis, topological data analysis, Data Shapley, and more to show you which data your model should learn from and which data you should avoid. We're chiefly interested not in data quality but in data value. We're here to help you find the right ingredients to make your models successful, all while reducing labeling budgets and timelines.
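To make one of those ingredients concrete, here's what the simplest textbook form of entropy-based active learning looks like: score each unlabeled sample by the entropy of the model's predicted class distribution, then send only the most uncertain ones out for labeling. This is a generic sketch of uncertainty sampling, not Alectio's actual method, and the names and budget are illustrative:

```python
import numpy as np

def entropy_scores(probs, eps=1e-12):
    """Predictive entropy per sample; higher means the model is more uncertain."""
    p = np.clip(np.asarray(probs), eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_labeling(probs, budget):
    """Pick the `budget` most uncertain samples to label next (uncertainty sampling)."""
    scores = entropy_scores(probs)
    return np.argsort(scores)[::-1][:budget]

# One round of a basic active-learning loop (names are hypothetical):
# probs = model.predict_proba(unlabeled_pool)
# to_label = select_for_labeling(probs, budget=500)
# ...send unlabeled_pool[to_label] to annotators, add the results to the
# training set, retrain, and repeat with the updated model.
```

The point of loops like this is exactly the thesis of this post: you label and train on the records the model can actually learn from, rather than everything you happened to collect.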

Stop trying to fix underperforming models by throwing more and more data at them. The answer to poor model performance isn't more data. It's better data. It's useful data. It's removing the useless and harmful information and focusing on the good stuff. That gets you to better performance in less time with less data. And we'd love to help you get there.

Want to learn if we can help? Reach out to info@alectio.com

