What Is Validation Data, And What Is It Used For?

Korra Pickell
Published in Artificialis · 3 min read · Oct 1, 2021

When training a neural network, the ultimate goal is to have your model perform well at a specified task, given data that it has never seen before.

This data could be a picture an end-user took in a mobile app, a text prompt someone filled out on your website, or anything in between. Regardless, a good model needs to generalize across the problem space so that it can perform accurately on these brand-new inputs.

This is why we split our preexisting data into three sections: training data, validation data, and testing data. Keep in mind that this split is made only after the full data set has been randomly shuffled, to ensure that each of the three sections provides a diverse representation of the problem space.
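As a sketch of this shuffle-then-split step, here is one way it might look in NumPy, assuming the 70% / 10% / 20% proportions discussed below (the array names and data are purely illustrative):

```python
import numpy as np

# Illustrative data set: 1,000 samples with 4 features each, plus labels.
X = np.arange(4000, dtype=float).reshape(1000, 4)
y = np.arange(1000)

# Shuffle the full data set first, so each split is a diverse
# sample of the problem space.
rng = np.random.default_rng(seed=42)
indices = rng.permutation(len(X))
X, y = X[indices], y[indices]

# 70% training, 10% validation, 20% testing.
n_train = int(0.7 * len(X))
n_val = int(0.1 * len(X))

X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]

print(len(X_train), len(X_val), len(X_test))  # → 700 100 200
```

Shuffling before slicing is the important part: slicing an unshuffled data set (say, one sorted by class) would give each section a skewed view of the problem.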

> Training Data

Typically, around 70% of your entire data set is designated as the training data. This is the data the model is directly trained on. During each epoch of training, the training data is fed to the model in chunks, referred to as batches, for efficiency, and the weights of the model are updated based on this data.
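What "chunks at a time" means can be sketched in a couple of lines; the batch size of 32 and the 700-sample training array here are illustrative assumptions, not values from the article:

```python
import numpy as np

# Hypothetical training set: 700 samples, 4 features each.
X_train = np.arange(700 * 4, dtype=float).reshape(700, 4)
batch_size = 32

# One epoch walks the training data in fixed-size batches;
# the final batch may be smaller if the set doesn't divide evenly.
batches = [X_train[i:i + batch_size] for i in range(0, len(X_train), batch_size)]

print(len(batches))      # 700 samples / 32 per batch → 22 batches
print(len(batches[-1]))  # last batch holds the remaining 28 samples
```

In practice a framework's data loader handles this for you, but the idea is the same: the weights are updated once per batch rather than once per sample or once per epoch.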

The training data is primarily responsible for teaching your model to learn the task you have specified.

> Validation Data

The validation data usually consists of around 10% of the total data set. At the end of each epoch of training, the model is evaluated against the validation data. We look at the results of these evaluations to diagnose and troubleshoot internal model issues such as overfitting and underfitting.
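The shape of that end-of-epoch check can be sketched with a deliberately tiny stand-in for a real model: a single-weight linear fit trained by gradient descent on made-up data. Everything here (the data, the learning rate, the loop length) is a hypothetical example, chosen only to show where the validation evaluation sits in the loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: the true relationship is y = 3x.
X_train = rng.uniform(-1, 1, 500)
y_train = 3.0 * X_train
X_val = rng.uniform(-1, 1, 100)
y_val = 3.0 * X_val

w = 0.0   # the model's single trainable weight
lr = 0.5  # learning rate

for epoch in range(20):
    # Weight update on the training data (one gradient step on MSE).
    grad = -2 * np.mean((y_train - w * X_train) * X_train)
    w -= lr * grad

    # End-of-epoch evaluation on the validation data. This number,
    # not the training loss, is what we watch to diagnose issues.
    val_loss = np.mean((y_val - w * X_val) ** 2)
```

If the training loss keeps dropping while `val_loss` climbs, the model is overfitting; if both stay high, it is underfitting. The validation curve is what tells you which.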

The primary purpose of the validation data is to guide the engineer through the process of tuning hyper-parameters.

The validation data is very similar to the testing data, but the importance of keeping these two sections separate will be explained in a minute.

> Testing Data

The testing data takes up about 20% of the full data set. This section serves as a final, held-out evaluation of the model's performance.

The model is only evaluated on this data after all overfitting and underfitting problems have been diagnosed and rectified using the validation evaluations.

> So What Is the Difference Between Validation Data and Testing Data?

The true purpose of splitting up a data set into these three sections is to give the model a generalized, and unbiased, understanding of the problem space.

A model can never truly be unbiased, but it is the goal of every engineer to get as close to that outcome as possible. During the training process, the model inevitably acquires bias towards certain solutions:

  • The training process biases the model towards the training data through direct weight updates.
  • You, the engineer, bias the model towards the validation data when you tune hyper-parameters to troubleshoot model issues.
  • Therefore, the testing data is the closest you have to an unbiased evaluation of the performance of the model.

This is why we keep a small portion of our data as a validation set. The more bias we can remove, the better our model can generalize. With only training and testing data, you would be tuning against your test set, producing a model that is more likely to fail to generalize to new data.

The validation data set allows us to tune hyper-parameters and diagnose issues without the model gaining bias towards the testing data, allowing for an objective evaluation of the model.


Hello! I am Korra, a machine learning enthusiast with primary interests in using AI/ML to expand human capabilities.