Cross-Validation in Machine Learning Misinterpreted
In machine learning, we use cross-validation to choose optimal hyperparameters while training a model. If the model performs well on unseen data points, we say it generalizes well.
We will use the k-nearest neighbors (k-NN) algorithm as the running example in this post.
Why can't we use the test data itself to find the optimal hyperparameters?
Many people, especially those new to the field, ask this question. I will explain in detail why the test data should not be used for this. Consider a dataset that is randomly sampled into train and test partitions.

Note: The train and test partitions are mutually exclusive and exhaustive.
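As a minimal sketch of such a split (using scikit-learn's train_test_split on a synthetic dataset; the dataset and proportions are my own choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset, purely for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Randomly sample the data into mutually exclusive, exhaustive train/test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```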

Suppose we use the test data to decide the optimal hyperparameters.
Say the highest accuracy is obtained at k=3. We then train the model with k=3 as the hyperparameter and believe we have the best model.
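A sketch of this flawed procedure (continuing the hypothetical split above, with scikit-learn's k-NN classifier) might look like this; note that the test set is driving the choice of k:

```python
from sklearn.neighbors import KNeighborsClassifier

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_test, y_test)  # the test data is deciding the hyperparameter
    if acc > best_acc:
        best_k, best_acc = k, acc

# Suppose k=3 wins; we would then report best_acc as "the" performance --
# but that number was obtained by peeking at the test set.
```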
Problem: This misses the basic goal, which is generalization. The fundamental objective is to perform well on unseen data points. By using the test data to pick the hyperparameter, we have exposed it to the learning procedure, so the test accuracy no longer tells us how well the model generalizes.
Solution: This is where cross-validation comes in. We carve out another random partition of the data, the validation data, and use it to choose the optimal hyperparameters.
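One way to carve out such a validation partition (again a sketch, with hypothetical proportions) is a second random split of the training data:

```python
# Split the original training partition again: one part for fitting, one for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
# Result: roughly 60% fit, 20% validation, 20% test of the original data.
```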
Why does breaking the data into another partition (validation data) solve the problem?
When the validation data is used to find the optimal hyperparameters, the test data remains truly unseen by the trained model. Measuring accuracy on the test data then gives an honest estimate of how well the model generalizes and will perform on future unseen data points.
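Putting it together, a sketch of the corrected flow (continuing the hypothetical splits above): pick k on the validation set, then touch the test set only once for the final estimate.

```python
# Choose k using only the validation partition.
best_k, best_val_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    acc = model.score(X_val, y_val)
    if acc > best_val_acc:
        best_k, best_val_acc = k, acc

# Retrain with the chosen k and evaluate once on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_fit, y_fit)
test_acc = final_model.score(X_test, y_test)  # estimate of generalization on unseen data
```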

