Machine Learning — 101 (Part — II)
Welcome back, champ! I hope you have read the previous blog, Part-I of ML 101. If not, read it here: Machine Learning 101 (Part-I).
In the previous blog, we built a basic intuition of what ML is. Now we will take a peek at data.
We haven’t talked about data yet, and it is really important for any data science problem. Always remember: Machine Learning is just a puppet. It stays idle until the puppeteer (data) does the heavy lifting.
In a typical Machine Learning problem, we work with three kinds of data:
- Training data (or) split
- Validation data (or) split
- Test data (or) split
Before getting into it, imagine how you prepare for your exams. Keep this in mind; it will help us understand these splits better.
Training data (or) split
Our machine learning model is first fitted on the training set. We always train our model on the training set; this is where the model learns.
To keep it simple: before appearing for an exam, we usually prepare ourselves. This includes study materials, mock tests, etc.
The learning materials (or) course materials we consume while preparing for the exam can be called the training set. Here you are getting trained by learning all the stuff needed for the exam. This is the initial step of your preparation.
The training set is usually a subset of our whole dataset. In practice, people often set aside around 75% of their data as the training set, though the exact share varies from project to project. With this 75% of the data, we train our model and prepare it as well as we can.
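To make the split concrete, here is a minimal sketch using only Python's standard library. The 75% training share comes from the text above; the 10% validation share, the fixed seed, and the function name `train_val_test_split` are assumptions picked for illustration, not a rule.

```python
import random

def train_val_test_split(rows, train_frac=0.75, val_frac=0.10, seed=42):
    """Shuffle rows and cut them into train, validation, and test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test split")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left becomes the test split
    return train, val, test

data = list(range(100))  # stand-in for 100 examples (illustrative only)
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters: if the data is sorted (say, by date or by label), slicing it without shuffling would give splits that look very different from each other.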
Validation data (or) split
We have learned something, but we have to evaluate ourselves and figure out how well we are prepared, don’t we?
In Machine Learning, we use a validation set to tune our model. In other words, we adjust our model here and evaluate it to figure out: is it doing fine? Did it learn well?
This is an important step in machine learning, and it leads us to an important topic called Generalization. To say that an ML model is doing well, it should be able to generalize well.
Generalization: The ability of a machine learning model to perform well on data it hasn’t seen before.
Our model should be able to perform well on data it hasn’t seen before. It’s like the mock tests you take before appearing for the exam. During this period, you solve different problems that you didn’t come across while you were learning.
By solving these problems, we are able to evaluate where we stand. If our performance is bad, we go back to consume some more material and then return to the mock test. This is an iterative process.
We use the validation set as our mock test to see how well our model has learned. If it hasn’t learned well, we go back, make some changes to the model, train it again on the training set, and evaluate it on the validation set again.
The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters is called a validation set.
For now, think of hyperparameters as the knobs on an oven: you set the temperature and the timer before you cook, and the food turns out differently depending on those settings. We will come back to this in later blogs.
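The mock-test loop can be sketched in a few lines. This is only an illustration under assumptions of my own: the data is synthetic, the model is a tiny hand-written k-nearest-neighbours regressor, and the hyperparameter being tuned is `k`. The test split is touched only once, at the very end.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, points, k):
    """Mean squared error of k-NN predictions over the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
# Synthetic data: y = x^2 plus noise (purely illustrative)
xs = [rng.uniform(-3, 3) for _ in range(200)]
data = [(x, x * x + rng.gauss(0, 0.5)) for x in xs]
train, val, test = data[:150], data[150:175], data[175:]

# Mock tests: try each hyperparameter value and score it on the validation set
candidates = [1, 3, 5, 10, 25, 50]
val_errors = {k: mean_squared_error(train, val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# The real exam: evaluate once on the test set with the chosen k
test_error = mean_squared_error(train, test, best_k)
print("best k:", best_k, "validation MSE:", round(val_errors[best_k], 3))
print("test MSE:", round(test_error, 3))
```

Notice that the training data never changes inside the loop. Only the knob (`k`) changes, and the validation score decides which setting wins.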
Test data (or) split
This is the set that remains untouched until the end of the Machine Learning project workflow. After training and tuning your model, this is where you evaluate it and compare the results.
The test set is the sample of data used to provide an unbiased evaluation of the final model fit on the training dataset. It is a set of examples used only to assess performance.
If you peek at the test set while building or tuning the model, you may end up making choices that happen to suit it. When you then estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that does not perform as well as expected. This is called data snooping bias.
In our analogy, the test set is the exam itself, where you appear and apply whatever you have learned and practiced during the preparation.
Things to be clear
- The terms test set and validation set are sometimes used in a way that flips their meaning in both industry and academia.
- In the erroneous usage, the “test set” becomes the development set, and the “validation set” is the independent set used to evaluate the performance of a fully specified classifier (or) model.
- Never train on test data. If you see surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set.
- Don’t pick a test set with different characteristics than the training set.
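One cheap sanity check for the "never train on test data" rule is to look for rows that show up in both splits. The sketch below assumes each example is a plain tuple and only catches exact duplicates; the helper name `find_leaked_rows` and the sample rows are made up for illustration.

```python
def find_leaked_rows(train_rows, test_rows):
    """Return test rows that also appear verbatim in the training rows."""
    seen = {tuple(row) for row in train_rows}  # set gives fast membership checks
    return [row for row in test_rows if tuple(row) in seen]

# Hypothetical examples: two numeric features and a label
train_rows = [(5.1, 3.5, "a"), (4.9, 3.0, "a"), (6.2, 2.9, "b")]
test_rows = [(4.9, 3.0, "a"), (5.9, 3.2, "b")]

leaked = find_leaked_rows(train_rows, test_rows)
print(f"{len(leaked)} of {len(test_rows)} test rows also appear in the training data")
```

An empty result does not prove the splits are clean (near-duplicates and other kinds of leakage slip through), but a non-empty one is a clear signal to stop and fix the split.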
This is a basic intro to how we split our data. I intentionally left out certain things and will explain them in another series of blogs.
Things like cross-validation, hyperparameter tuning, overfitting and underfitting, etc. will be coming up. Hope you like it! Leave feedback. The goal is to make things simpler; that’s what I am trying to do with these blogs.
Have a great day!