The Three Crucial Data Sets in Machine Learning: Training, Validation, and Testing Sets

Prasan N H
3 min read · Nov 30, 2023


In machine learning, algorithms learn from data in order to make predictions. They do this by constructing a mathematical model from input data, a process that requires carefully dividing the data into three distinct sets: the training, validation (also known as the 'development' set), and test sets.
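As a minimal sketch of such a split, assuming scikit-learn and its built-in Iris data, a 60/20/20 division might look like this (the ratios and random seed are illustrative choices, not prescriptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for the final test set first...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
# ...then split the remainder into 60% training and 20% validation
# (0.25 of the remaining 80% equals 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)
```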

Training Stage:

The journey begins with the training stage, where the model is fitted to a designated training dataset to discern underlying patterns and relationships. In supervised learning, this training dataset comprises input features paired with corresponding target outputs (labelled examples). The model's parameters are adjusted to improve predictive performance, as measured by the training error.
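Continuing the split sketch above, fitting a model to the training set and measuring its training error could look like this (LogisticRegression is simply an illustrative choice of supervised learner):

```python
from sklearn.linear_model import LogisticRegression

# Fit the model's parameters to the labelled training examples.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Training error: the error rate on the same data the model was fitted to.
train_error = 1 - model.score(X_train, y_train)
```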

Validation Stage:

Following the training stage, the fitted model is used to predict responses for observations in a separate dataset known as the validation set. This dataset provides an unbiased evaluation of the model while its hyperparameters are being tuned. Repeated evaluation on the validation set guides this fine-tuning, ultimately leading to the selection of the model with the lowest validation error, which serves as an estimate of the test error rate on unseen data.
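A minimal sketch of this tuning loop, continuing from the code above and varying the regularization strength C (an illustrative choice of hyperparameter):

```python
# Evaluate each candidate on the validation set and keep the one
# with the lowest validation error.
best_error, best_model = float("inf"), None
for C in [0.01, 0.1, 1.0, 10.0]:
    candidate = LogisticRegression(C=C, max_iter=1000)
    candidate.fit(X_train, y_train)
    val_error = 1 - candidate.score(X_val, y_val)
    if val_error < best_error:
        best_error, best_model = val_error, candidate
```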

Testing Stage:

The test dataset, distinct from both the training and validation datasets, plays a crucial role in providing a fair and unbiased evaluation of the final model. This dataset, untouched during the training stage, allows for a one-time assessment of the model’s performance. The test set is pivotal in gauging how well the model generalizes to new, unseen data.
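Continuing the sketch, the one-time evaluation on the untouched test set might be:

```python
# Touch the test set exactly once, after all tuning is finished.
test_error = 1 - best_model.score(X_test, y_test)
print(f"Estimated generalization error: {test_error:.3f}")
```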

[Figure: Training, validation, and testing process architecture diagram]

Considerations and Challenges:

Determining the sizes of the training, validation, and test sets, and the strategy for dividing the data, is a nuanced process that depends heavily on the specific problem and the volume of available data. Note that the terms 'validation set' and 'test set' are sometimes used interchangeably in the literature; this is incorrect, as the two sets serve distinct purposes.

Data Leakage:

A critical pitfall in machine learning is data leakage (also called target leakage), which arises when information from outside the training dataset, such as the test set, influences the model training process, resulting in artificially inflated model performance. This phenomenon often leads to overfitting and poor generalization to new data. Examples of data leakage include feature leakage (duplicate or proxy columns that encode the target) and row-wise leakage (duplicate rows shared between datasets due to oversampling or up-sampling).
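One common source of leakage is fitting a preprocessing step, such as a scaler, on the full dataset before splitting, which lets test-set statistics influence training. A minimal sketch of avoiding this with a scikit-learn Pipeline, which learns the scaling from the training data only:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler's mean and variance are learned from X_train alone, so no
# information from the validation or test sets leaks into training.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
val_error = 1 - pipeline.score(X_val, y_val)
```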


Conclusion: Navigating the intricate stages of machine learning involves a meticulous approach to dataset division and vigilant consideration of data leakage. A well-crafted model, free from leakage, ensures optimal performance and generalizability, laying the foundation for successful machine learning endeavours.
