The unreasonable effectiveness of combining datasets

Ilia Zintchenko
Apr 24

A machine learning model generally becomes more accurate as it sees more data. However, each new observation also adds less and less new information. The model starts to overfit to our specific dataset, becoming increasingly blind to outliers. This is an especially dangerous game to play in a rapidly changing macroeconomic environment.

To increase the generalization ability of our model and make it robust against outliers, a simple idea would be to train on multiple datasets at once. The more datasets a single model can perform well on, the less of a “surprise” an outlier observation would be. Let’s look at a toy example.

Imagine two datasets of pictures of dogs playing with balls of different colors. The goal is to train a model to locate the ball in each picture. In the first dataset, the dogs are playing with blue balls.

[Image: The model trained on dogs playing with blue balls thinks all balls are blue.]

A model trained on this dataset will falsely learn that all balls must be blue. In the second dataset, all dogs are playing with yellow balls.

[Image: The model trained on dogs playing with yellow balls thinks all balls are yellow.]

Likewise, this model will falsely learn that all balls must be yellow. When a dog playing with an orange ball comes along, both models fail to locate it, as neither has seen a ball of any other color.

[Image: Both models 1 and 2 fail to find the orange ball.]

If, however, we train a model on both of these datasets simultaneously,

[Image: The model trained on images of dogs playing with both blue and yellow balls learns that balls can come in different colors.]

it will learn that balls can come in a variety of colors and locate the orange ball correctly.

[Image: Model 3 successfully locates the orange ball.]
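The toy example above can be made concrete with a few lines of code. In the sketch below, "locating the ball" is reduced to picking, among candidate image patches, the one whose average color is closest to a ball color seen during training. The patch names and all RGB values are invented for illustration, and the nearest-color rule is a stand-in for a real detector, not an actual vision model:

```python
import numpy as np

# Hypothetical average RGB colors of candidate patches in a new picture.
patches = {
    "grass":       np.array([0.2, 0.6, 0.2]),
    "dog_fur":     np.array([0.4, 0.3, 0.15]),
    "orange_ball": np.array([1.0, 0.5, 0.0]),
}

blue   = np.array([0.0, 0.0, 1.0])   # ball color seen in dataset 1
yellow = np.array([1.0, 1.0, 0.0])   # ball color seen in dataset 2

def locate_ball(known_ball_colors, patches):
    # Pick the patch whose color is closest to any ball color
    # seen during training.
    def score(color):
        return min(np.linalg.norm(color - k) for k in known_ball_colors)
    return min(patches, key=lambda name: score(patches[name]))

print(locate_ball([blue], patches))          # mistakes the dog's fur for the ball
print(locate_ball([blue, yellow], patches))  # finds the orange ball
```

The model that has only seen blue balls picks the dog's brown fur, which happens to be the least unlike blue of the three patches. The model trained on both datasets has seen yellow balls, and orange is close enough to yellow for it to pick the right patch.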

In a commercial setting, different companies tend to have different distributions of customers, and hence different datasets. These distributions vary with geography, market sector, positioning and other factors. By training across the datasets of multiple organizations, a model can be much better prepared to handle customers it has not seen before.
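In the simplest case, training across datasets just means pooling the observations before fitting. The sketch below (synthetic data; the `is_outlier` helper is a hypothetical toy novelty detector, not any specific production method) shows how a model fit on one company's data can flag a perfectly ordinary observation from another company's distribution as an outlier, while the pooled model does not. Note that naively pooling raw data like this ignores the privacy constraints a real cross-organization setup has to address; it only illustrates the statistical benefit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical datasets with different feature distributions,
# e.g. the customers of two different companies.
data_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
data_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))

# Training on multiple datasets at once can be as simple as pooling them.
pooled = np.concatenate([data_a, data_b])

def is_outlier(train_data, x, threshold=2.0):
    # Toy novelty detector: an observation is an outlier if it is far
    # from every training point.
    distances = np.linalg.norm(train_data - x, axis=1)
    return distances.min() > threshold

x_new = np.array([4.5, 4.8])  # an ordinary customer of company B

print(is_outlier(data_a, x_new))  # True: company A's model is surprised
print(is_outlier(pooled, x_new))  # False: the pooled model has seen similar points
```

The pooled model is not smarter; it has simply seen a wider slice of the world, so fewer observations look like outliers to it.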

Ntropy is building a network that lets data consumers train machine learning models on multi-organization data with minimal engineering overhead, and lets data producers seamlessly monetise their data at no additional privacy risk. Check out https://ntropy.network for more info.

