A Simple Introduction to Training, Validation, and Testing of a Model - Part 1

Sumit Kumar
Published in Analytics Vidhya
7 min read · Sep 1, 2020

Decoding the importance of validation and test sets for everyone.

In this article, we will learn the importance of the validation set and the techniques used to split the original dataset into subsets (train, validation, and test). We will first understand how it works, followed by the code, for a better learning experience.

It is essential to test our model on unseen data to check if it will generalize to new cases.

There are two ways to check the performance of the model:

  1. Build the model and put it directly into production. This way we can see how it performs on new (unseen) data, but if the model turns out to be bad, the users will not be happy.
  2. The smarter way is to split the data into two parts, use one to train the model, and keep the other for testing. The error rate produced on the test set is called the generalization error.

We usually go with the second method as it is safer and more reliable.

In the figure below, we can see how we keep one chunk of the data for training and the other for testing. Usually, we take the test set as 20% of the original data, but this can be changed as per the requirement.

Model building is an iterative process: once we build our model, we keep improving it.

The steps involved in model building are:

  1. Hypothesis Generation
  2. Dataset Creation
  3. Modeling
  4. Evaluation

We have discussed this in the previous article; here’s the link to it.

How can we decide if the model fits the data?

Model evaluation is based on the performance of the model on test data. In some cases, we might need to tune some hyperparameters to change our model and make it perform better.

Evaluation metrics such as the generalization error help us to compare different models and decide which one has a better model fit.

Based on the model-fit we can classify it into 3 different categories:

  1. Under-fit
  2. Over-fit
  3. Best-fit

We will discuss these in detail in coming articles; for now, all we need to know is that we want to achieve a good fit, and that both overfitting and underfitting are undesirable.

Problems linked with splitting the dataset into two: train and test

Suppose we make changes to our model multiple times and finally achieve a low generalization error (for example, 5%). We then launch our model, and it ends up not performing well.

What do you think went wrong here?

Well, we made changes to our model multiple times in order to lower its generalization error on the test data. In doing so, the model was tuned to that particular test set, which means we were no longer evaluating it on completely new (unseen) data.

This might lead to a state where the model will not generalize well.

Solution: Creating a Validation Set

To solve this issue, we will use a Validation Set.

We can split the existing dataset into three parts: train, validation, and test.

Now that we have three sets we will use the training set to train the model, the validation set to optimize the model, and the test set to check how the model performs on unseen data.

In the figure below, we can see how we split the data into train, validation, and testing. Usually, we use this proportion but it can be changed as per the requirement.

How to create a Validation Set?

Techniques used to generate the validation set:

  • Hold-out Validation
  • Stratified Hold-out Validation
  • k-fold cross-validation
  • Leave-one-out validation

In this article, we will learn the first two techniques; the rest we will cover in future articles.

Hold-out Validation

Steps involved to carry out this technique are:

  1. Take the data and shuffle it (randomize the order of the rows)
  2. Split the data into train and test
  3. Split the training data further into train and validation set

This technique is simple: all we need to do is set aside parts of the original dataset and use them for testing and validation. The splitting can easily be done using various libraries in Python (an interpreted, high-level, general-purpose programming language), such as sklearn.

Issues linked with Hold-out validation

Because the split is random, the distribution of the variables in each set (train, validation, test) can be different, and as a result our model may not generalize well.

Stratified Hold-out Validation

The issues related to the Hold-out validation technique are solved in this technique.

Here we make sure that each set has a similar distribution of the target variable, which eventually helps us build a better model.

Now that we know what these two techniques are, let’s have a look at the code.

We will be using Python 3.

Libraries used:

  • Pandas
  • Numpy
  • Matplotlib
  • Sklearn

We will be using a preprocessed Titanic dataset here to understand how the hold-out and stratified hold-out techniques work:
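Here is a minimal sketch of loading the data with pandas. The file name titanic_preprocessed.csv is an assumption; substitute the path to your own copy of the preprocessed dataset.

```python
import pandas as pd

# Load the preprocessed Titanic dataset into a DataFrame
# (the file name below is an assumption; replace it with your own path)
df = pd.read_csv("titanic_preprocessed.csv")
```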

Here df will now have the dataset that we want to use.
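To get a quick look at it, we can check its shape and preview the first few rows (a small sketch, assuming df was loaded as above):

```python
# Dimensions of the dataset: (number of rows, number of columns)
print(df.shape)

# Preview the first 5 rows
print(df.head())
```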

In the preview above, we can see the first 5 rows of the data and its 25 columns, where Survived is our target (dependent) variable and the rest are the independent variables.

Even though we are working with clean data, we will still check whether there are any missing values; as we can see below, there are none in our dataset.
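A quick way to confirm this, assuming the df loaded above:

```python
# Count missing values in each column; all zeros means no missing data
print(df.isnull().sum())
```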

We will store the independent variables as df_x and the target (dependent) variable as df_y.
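A sketch of that split into features and target, assuming the target column is named Survived as in the preview above:

```python
# Independent variables (all columns except the target)
df_x = df.drop(columns=["Survived"])

# Target (dependent) variable
df_y = df["Survived"]
```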

We will now import the train_test_split function from the sklearn library, as it provides a very simple way to split our data.

Here, we will not use stratification for Hold-out Validation. We also set a random state so that each time we run the code we get the same splits.
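A minimal sketch of this first split into training and test sets; the 80/20 proportion and the random_state value are assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set (no stratification here)
train_x, test_x, train_y, test_y = train_test_split(
    df_x, df_y, test_size=0.2, random_state=42
)
```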

Now, we will use train_test_split again, but this time on the training data, to split it further into a training and a validation set.
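Continuing the sketch, the training set from above is split again to carve out a validation set (again, the proportion and random_state are assumptions):

```python
# Split the training data further: 80% train, 20% validation
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.2, random_state=42
)
```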

We now have a train, validation, and test set. Let’s check the distribution of our target class in all three sets.
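One simple way to compare the class distributions, assuming the variables created above:

```python
# Proportion of each Survived class in the train, validation, and test sets
print(train_y.value_counts(normalize=True))
print(val_y.value_counts(normalize=True))
print(test_y.value_counts(normalize=True))
```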

As we can see, the distribution of the target class in each set is not similar, and therefore our model will not be able to generalize well.

The solution to this problem is Stratified Hold-out Validation.

Let’s see how it works.

We will use the same lines of code here as well; the only difference is that we pass a stratification argument, as shown in the sketch below.

In this case, we stratify the data with respect to our target variable: df_y, the subset of the data that holds the target class.
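A sketch of the same two splits, now with stratification on the target variable (the proportions and random_state values are assumptions, as before):

```python
# First split: stratify on the full target variable
train_x, test_x, train_y, test_y = train_test_split(
    df_x, df_y, test_size=0.2, random_state=42, stratify=df_y
)

# Second split: stratify on the training labels to create the validation set
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.2, random_state=42, stratify=train_y
)

# Check the class distribution in each set again
print(train_y.value_counts(normalize=True))
print(val_y.value_counts(normalize=True))
print(test_y.value_counts(normalize=True))
```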

It can be seen that the distribution of the target class is now similar which is a good thing as our model will now be able to generalize well.

Issues with Stratified Hold-out Validation Technique

The problem with the hold-out and stratified hold-out validation techniques is that, in order to generate a validation set, we take a subset of the training set that we can no longer use for training. Therefore, we have less data for training, which can be a disadvantage.

Also, since we keep tuning against a single validation set, the model might end up overfitting to it.

Solution:

This issue can be resolved by using K-fold Cross-Validation. We will talk about K-fold Cross-Validation and Leave One-Out Validation techniques in the next article.

REFERENCES

  • Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 1st Edition, Chapter 1.
  • Applied Machine Learning Course- Analytics Vidhya

Congratulations! You just finished learning the following topics:

  • Importance of train test split.
  • Importance of validation set
  • Techniques used to generate a validation set and their disadvantages.

If you have any questions, you can post them in the comments; I would be more than happy to address them.

You can also find me on LinkedIn.

Any suggestions for improvement and feedback will be appreciated.

If you like my work, please consider following me, I will be writing more articles on Data Science.
