You have learnt Machine Learning all wrong

Liz Waithaka
Women in Technology
3 min read · Nov 14, 2023

Machine learning tutorials have taught you this workflow (sketched in code after the list):

  1. Load your dataset
  2. Preprocess it
  3. Split it (into train, validation, and test sets)
  4. Finally, build your model
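
To make the problem concrete, here is a minimal sketch of that tutorial workflow using scikit-learn. The file name `data.csv` and the column `target` are placeholders, and only a train/test split is shown for brevity:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. Load your dataset ("data.csv" and "target" are placeholder names)
df = pd.read_csv("data.csv")
X, y = df.drop(columns=["target"]), df["target"]

# 2. Preprocess it -- the scaler is fit on ALL rows, test rows included
X_scaled = StandardScaler().fit_transform(X)

# 3. Split it -- too late: test-set statistics already shaped every feature
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# 4. Finally, build your model
model = LogisticRegression().fit(X_train, y_train)
```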

Please unlearn this process. There is a huge problem with it: “foresight bias”.

Imagine you’re presented with a series of historical events and asked to predict their outcomes. Because the events have already unfolded, it’s easy to feel an ‘I-knew-it-all-along’ confidence: a belief that you could have accurately foreseen them from the information available at the time.

This confidence arises only because we already hold information that influences our prediction. The same thing happens to machine learning algorithms.

Preprocessing data before splitting it into training and testing sets can lead to a specific danger related to foresight bias: inadvertently leaking information from the test set into the training set.

Here’s why it’s a concern:

  1. Foresight in Feature Engineering: During preprocessing, feature engineering might involve techniques that consider the entire dataset, including the test set. This can lead to the creation of features that incorporate information from the test set, effectively providing the training algorithm with insights it shouldn’t have access to during model training.
  2. Information Leakage: If preprocessing involves scaling, normalization, or imputation based on the statistical properties of the entire dataset (instead of just the training set), the model indirectly learns patterns from the test set. This inflates performance metrics during evaluation while failing to reflect the model’s actual predictive ability on new, unseen data (see the demonstration after this list).
  3. Overfitting to Test Data Features: Preprocessing steps, such as handling missing values or outliers, might be influenced by the characteristics of the test set if applied without proper separation of training and testing data. Consequently, the model could overfit to the test set’s peculiarities, compromising its ability to generalize to new, future data.
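
To see why the second point matters, here is a small demonstration with synthetic numbers: when a scaler is fit on the full dataset, the test rows pull the mean and standard deviation that get applied to the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature: training rows cluster low, test rows cluster high
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [11.0]])

# Leaky: the scaler's statistics include the test rows
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))
print(leaky.mean_)  # [5.4] -- dragged upward by the test set

# Clean: the scaler's statistics come from the training rows only
clean = StandardScaler().fit(X_train)
print(clean.mean_)  # [2.0] -- the test set was never seen
```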

Worry not! There is a solution for this:

  1. Split the dataset before you do anything else. This ensures that preprocessing is performed solely based on information available in the training data, avoiding any information leakage from the test set.
  2. Ensure that imputation of missing values and scaling/normalization are fitted solely on the training data within each cross-validation fold (see the sketch after this list). This prevents information from the test set (which should represent unseen future data) from leaking into these preprocessing steps.
  3. Evaluate your model on the test data as few times as possible. Ideally, use your test data only once.
  4. If you reuse your test data, don’t use the results of your model as feedback to improve it. This can lead to a situation where the model becomes overly specialized to perform well on that specific test set but might not generalize well to new, unseen data.
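
Putting these guidelines together, here is a minimal sketch of the safer order using a scikit-learn Pipeline, which re-fits the imputer and scaler on the training portion of each cross-validation fold automatically. As before, `data.csv` and `target` are placeholder names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data.csv")  # placeholder file name
X, y = df.drop(columns=["target"]), df["target"]

# 1. Split FIRST, before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Bundle preprocessing with the model so every cross-validation
#    fold fits the imputer and scaler on its own training split only
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("CV accuracy:", scores.mean())

# 3. Touch the test set once, at the very end
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
```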

By following these guidelines, you maintain the integrity of model evaluation, ensuring that the model’s performance estimates remain accurate and reliable for real-world application on unseen data.

AI Enthusiast || Machine Learning || Data Scientist || StoryTelling || GitHub: https://github.com/liznjoki