What is Data Mismatch in Machine Learning, and what are its potential solutions?

[ML0to100] — S1E17

Sanidhya Agrawal
3 min read · Jun 10, 2020

In some cases, it’s easy to get a large amount of data for training, but this data probably won’t be perfectly representative of the data that will be used in production.

For example-

Suppose you want to create a mobile app to take pictures of flowers and automatically determine their species.

  • You can easily download millions of pictures of flowers on the web, but they won’t be perfectly representative of the pictures that will actually be taken using the app on a mobile device. Perhaps you only have 10,000 representative pictures (i.e., actually taken with the app).

The most important rule to remember is that the validation set and the test set must be as representative as possible of the data you expect to use in production.

So they should be composed exclusively of representative pictures: you can shuffle those representative images and put half in the validation set and half in the test set (making sure that no duplicates or near-duplicates end up in both sets).
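A minimal sketch of that split, assuming the 10,000 app pictures sit in a hypothetical data/app_photos folder (near-duplicate removal is not shown; it typically needs hashing or perceptual-similarity checks):

```python
import random
from pathlib import Path

# Hypothetical folder of representative (app-taken) pictures.
app_image_paths = sorted(Path("data/app_photos").glob("*.jpg"))

random.seed(42)                         # make the shuffle reproducible
random.shuffle(app_image_paths)

half = len(app_image_paths) // 2
valid_paths = app_image_paths[:half]    # ~5,000 images → validation set
test_paths  = app_image_paths[half:]    # ~5,000 images → test set
```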

But after training your model on the web pictures, if the model's performance on the validation set is disappointing, you will not know whether this is because your model has overfit the training set, or whether it is just due to the mismatch between the web pictures and the mobile app pictures.

One solution is to hold out some of the training pictures (from the web) in yet another set that Andrew Ng calls the train-dev set.

So you’ve got 10,000 representative images and 500,000 images from the web. Create the following sets:

train set → most of the web images

train-dev set → the remaining (held-out) web images

validation set and test set → representative (app) images only, half in each
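A minimal sketch of how those sets could be built, assuming the web images live in a hypothetical data/web_photos folder and that valid_paths / test_paths were created from the app pictures as in the earlier snippet; the 20,000 hold-out size is illustrative:

```python
import random
from pathlib import Path

# Hypothetical folder holding the ~500,000 web-scraped flower pictures.
web_image_paths = sorted(Path("data/web_photos").glob("*.jpg"))

random.seed(42)
random.shuffle(web_image_paths)

n_train_dev = 20_000                              # illustrative hold-out size
train_dev_paths = web_image_paths[:n_train_dev]   # held-out web images only
train_paths     = web_image_paths[n_train_dev:]   # remaining web images for training
# valid_paths and test_paths: app images only, split as shown earlier.
```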

Train the model on the training set (the web images), then:

  • Evaluate it on the train-dev set → if it performs well, the model is not overfitting the training set.
  • If it performs poorly on the train-dev set → it has probably overfit the training set, so try simplifying or regularizing the model, getting more training data, or cleaning up the training data.
  • If it performs well on the train-dev set but poorly on the validation set → the problem must come from the data mismatch between the web pictures and the app pictures (a sketch of this diagnostic logic follows the list).
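A small, self-contained sketch of that decision logic, assuming you already have error rates for the train-dev and validation sets (for example from Keras's model.evaluate); both thresholds are illustrative, not from the article:

```python
def diagnose(train_dev_error: float, valid_error: float,
             overfit_threshold: float = 0.05, mismatch_gap: float = 0.05) -> str:
    """Error rates are fractions in [0, 1]; both thresholds are illustrative."""
    if train_dev_error > overfit_threshold:
        # Poor even on held-out web images → the model overfit the training set.
        return ("Overfitting: simplify or regularize the model, "
                "get more training data, or clean the training data.")
    if valid_error - train_dev_error > mismatch_gap:
        # Fine on web images but poor on app images → data mismatch.
        return "Data mismatch between the web pictures and the app pictures."
    return "No obvious overfitting or data-mismatch problem."

# Example: low train-dev error but much higher validation error → data mismatch.
print(diagnose(train_dev_error=0.04, valid_error=0.25))
```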

You can try to tackle this problem by preprocessing the web images to make them look more like the pictures that will be taken by the mobile app, and then retraining the model.
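One possible way to do that preprocessing (an assumption, not a prescribed recipe) is to degrade the clean web images so they look more like phone-camera shots, for example by resizing, blurring slightly, and re-compressing them with Pillow:

```python
from io import BytesIO
from PIL import Image, ImageFilter

def appify(web_image: Image.Image, size=(224, 224)) -> Image.Image:
    """Make a clean web photo look more like a picture taken with the app."""
    img = web_image.convert("RGB").resize(size)             # phone-like resolution
    img = img.filter(ImageFilter.GaussianBlur(radius=1))    # mild blur
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=60)                # compression artifacts
    buf.seek(0)
    return Image.open(buf)

# Usage (hypothetical path):
# processed = appify(Image.open("data/web_photos/rose_001.jpg"))
```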

Read Next- No Free Lunch Theorem! [S1E18]

Summary cheat sheets, notes, flash cards, Google Colab notebooks, code, etc. will all be provided in further lessons as required.

Read through the whole ‘S1’ [ML0to100] series to learn about-

  • What Machine Learning is, what problems it tries to solve, and the main categories and fundamental concepts of its systems.
  • The steps in a typical Machine Learning project
  • Learning by fitting a model to data
  • Optimizing a cost function
  • Handling, cleaning and preparing data
  • Selecting and engineering features
  • Selecting a model and tuning hyperparameters using cross-validation
  • The challenges of Machine Learning, in particular, underfitting and overfitting (the bias/variance trade-off)
  • The most common learning algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods
  • Reducing the dimensionality of the training data to fight the “curse of dimensionality”
  • Other unsupervised learning techniques, including clustering, density estimation, and anomaly detection

Part II, Neural Networks and Deep Learning, covers the following topics:

  • What neural nets are and what they’re good for
  • Building and training neural nets using TensorFlow and Keras
  • The most important neural net architectures: feedforward neural nets for tabular data, convolutional nets for computer vision, recurrent nets and long short-term memory (LSTM) nets for sequence processing, encoder/decoders and Transformers for natural language processing, and autoencoders and generative adversarial networks (GANs) for generative learning
  • Techniques for training deep neural nets
  • How to build an agent (e.g., a bot in a game) that can learn good strategies through trial and error, using Reinforcement Learning
  • Loading and preprocessing large amounts of data efficiently
  • Training and deploying TensorFlow models at scale

Disclaimer — This series is based on notes I created for myself from various books I’ve read, so some of the text may be an exact quote from one of those books. I would cite the source, but since these notes are a compilation, even I don’t know which book a given paragraph came from. The upside for the reader is that the best of the most promising ML books on the market is compiled in one place.
