The unreasonable effectiveness of combining datasets

Ilia Zintchenko
Ntropy
Published in
3 min readApr 24, 2020

A machine learning model generally gets more and more accurate with more data. However, each new observation also adds less and less new information. The model starts to overtrain on our specific dataset, becoming increasingly blind to outliers. This is an especially dangerous game to play in a rapidly changing macroeconomic environment.

To increase the generalization ability of our model and make it robust against outliers, a simple idea would be to train on multiple datasets at once. The more datasets a single model can perform well on, the less of a “surprise” an outlier observation would be. Let’s look at a toy example.

Imagine two different datasets of pictures of dogs playing with different color balls. The goal is to train a model to locate the ball in each picture. In the first dataset, dogs are playing with blue balls

The model trained on dogs playing with blue balls thinks all balls are blue.

A model trained on this dataset will falsely learn that all balls must be blue. In the second dataset, all dogs are playing with yellow balls.

The model trained on dogs playing with blue balls thinks all balls are yellow.

Equivalently, this model will falsely learn that all balls must be yellow. When a dog playing with an orange ball comes along, both of these models fail to locate it, as neither model has seen a ball of a different color.

Both models 1 and 2 fail to find the orange ball.

If we, however, train a model on both of these datasets simultaneously,

The model trained on images of dogs playing with both blue and yellow balls learns that balls can come in different colors.

it will learn that balls can come in a variety of colors, and be able to locate a ball the orange ball correctly.

Model 3 successfully locates the orange ball.

In a commercial setting, different companies tend to have different distributions of customers and corresponding datasets. Variations in the distributions depend on multiple factors, including geography, market sector, positioning, etc. By training across datasets of multiple organizations, a model can be much better prepared to handle customers it has not seen before.

Ntropy is building a network to allow data consumers to train machine learning models on multi-organization data with minimal engineering overhead and data producers to seamlessly monetise their data pool at no additional privacy risk. Check out https://ntropy.network for more info.

--

--