Split up the Data — Part 1
Machine Learning? How can a machine even learn?
For a long time I wondered what this fancy term meant and how it could actually be possible! Let’s figure it out today. To get a sense of what happens when we say a machine is learning, why not first understand how we learn?
Well, let us glance back at the days when we learnt mathematics as kids. I still remember how my maths teacher used to teach us basic addition and subtraction in class. Initially she helped us count on our fingers and then add or subtract those numbers. With a lot of practice, and sometimes getting beaten by her, we finally learnt addition and subtraction. Mathematically speaking, after a lot of practice (training), our brain learnt an addition function: given numbers to add, it processes them to give the sum.
Similarly, Machine Learning is all about finding that magic function “f”: we give it some input and expect some output. (There are exceptions, e.g. the KNN algorithm, which stores the training data instead of learning an explicit function.)
Now that we had become pretty comfortable with addition, it was time for the performance evaluation phase, aka the terror phase of student life: “EXAMS”. I remember that whenever my teacher gave the same question we had solved in class, “2+4 = 6”, most of the students performed very well in the exam, because we had seen that question many times. But if she instead gave a different addition, or maybe a word problem unlike those taught in class, we would not score as much. I hope word problems were not a nightmare for me alone.
Nevertheless, what do you think is the best way to evaluate a student’s learning, i.e. whether or not the student has actually grasped the concept? Is it to give her/him the same problem in the exam that was discussed in class? Or to give a different question that she/he has not seen earlier?
You thought of it correctly! Obviously, different problems in the exam give a better evaluation of a student’s learning, because they check whether she/he can generalize the concept rather than just mugging up (memorizing) the answers.
Similarly, in Machine Learning we split our whole dataset into Train data and Test data. While training an ML model we use only the Train data and keep the Test data away from it, so that the Test data can later tell us whether the model generalizes well.
Yes, you guessed it right: by training a model, I mean learning the magic function “f”.
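To see why exam questions should differ from classroom questions, here is a minimal sketch of a "student" that only memorizes. The `Memorizer` class, the `accuracy` helper and the example sums are all made up for illustration; a real ML model would learn a pattern rather than a lookup table.

```python
class Memorizer:
    """A 'student' that only remembers the exact questions seen in class."""

    def __init__(self):
        self.answers = {}

    def train(self, examples):
        # Store every (question, answer) pair exactly as it was shown.
        for question, answer in examples:
            self.answers[question] = answer

    def predict(self, question):
        # Returns None for any question that was never seen in training.
        return self.answers.get(question)


def accuracy(model, examples):
    """Fraction of examples the model answers correctly."""
    correct = sum(1 for question, answer in examples
                  if model.predict(question) == answer)
    return correct / len(examples)


# Classroom questions (Train data) and unseen exam questions (Test data).
train_examples = [((2, 4), 6), ((1, 3), 4), ((5, 2), 7)]
test_examples = [((3, 3), 6), ((4, 1), 5)]

student = Memorizer()
student.train(train_examples)

print("score on classroom questions:", accuracy(student, train_examples))
print("score on new exam questions:", accuracy(student, test_examples))
```

The memorizer scores perfectly on the questions it has already seen and fails on every new one, which is exactly the situation that a separate Test set is meant to expose.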
Oh wait! Did I use the word Dataset?
Let’s see what it is. A dataset comprises features and labels. Briefly, features are inputs and labels are outputs. For example, we can say that a cat has short ears, light-coloured eyes, a flat mouth and long whiskers, whereas a dog has long ears, dark-coloured eyes, a bulging mouth and short whiskers. These are features because they define the class, or label, which is either dog or cat.
Each example (one animal) is a row, and each feature, along with the label, is a column. A lot of such rows and columns collectively is known as a Dataset.
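As a rough sketch, the cat and dog description above could be written down as a tiny dataset like this. The rows are invented for illustration and the variable names `dataset`, `features` and `labels` are just assumptions, not a fixed convention.

```python
# A toy dataset: each row is one animal, each column is one feature,
# and the last column is the label. The values are invented examples.
dataset = [
    # (ears,   eyes,    mouth,     whiskers, label)
    ("short", "light", "flat",    "long",  "cat"),
    ("long",  "dark",  "bulging", "short", "dog"),
    ("short", "light", "flat",    "long",  "cat"),
    ("long",  "dark",  "bulging", "short", "dog"),
]

# Features (inputs) are all columns except the last one;
# labels (outputs) are the last column.
features = [row[:-1] for row in dataset]
labels = [row[-1] for row in dataset]

print("features of first row:", features[0])
print("label of first row:", labels[0])
```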
With the given training data we now learn the magic function “f”, so that if some input x is fed to f, it returns the output y.
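Here is one very small sketch of what "learning f" can look like. It assumes the relationship between x and y is a straight line and fits it with ordinary least squares; the helper `fit_line` and the numbers are made up for illustration, and real models and datasets are usually far more complex.

```python
def fit_line(xs, ys):
    """Learn f(x) = w * x + b from training pairs using least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x

    def f(x):
        return w * x + b

    return f


# Training data: inputs x with their known outputs y (here y = 2x + 1).
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

f = fit_line(train_x, train_y)

# Feed an input that was never seen during training.
print("f(10) =", f(10))
```

The function `f` was never told the rule "double it and add one"; it recovered that rule from the training pairs, so it can answer for an input like 10 that it has not seen.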
Now, to evaluate our model, that is, to measure its performance, we feed the Test data to the model and find the accuracy, i.e. the fraction of test examples it gets right. If the model gives a good accuracy we accept it; otherwise we make a few changes to the model and re-evaluate it.
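A minimal sketch of this evaluation step follows. The `toy_model` below is a hand-written rule standing in for a learned function “f”, and the test rows are invented, so treat the whole thing as an illustration rather than a real experiment.

```python
def accuracy(model, test_rows):
    """Fraction of test rows for which the model predicts the true label."""
    correct = sum(1 for features, label in test_rows
                  if model(features) == label)
    return correct / len(test_rows)


def toy_model(features):
    # Stand-in for a learned function f: looks only at ear length.
    ears, eyes, mouth, whiskers = features
    return "dog" if ears == "long" else "cat"


# Test data the model has not seen: (features, true label) pairs.
test_rows = [
    (("short", "light", "flat", "long"), "cat"),
    (("long", "dark", "bulging", "short"), "dog"),
    (("long", "light", "flat", "long"), "cat"),   # the rule gets this one wrong
    (("short", "dark", "flat", "long"), "cat"),
]

print("test accuracy:", accuracy(toy_model, test_rows))
```

Here the rule gets three of the four test rows right, so the accuracy is 0.75. Whether that is "good" depends on the problem.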
The train/test split can be in a ratio such as 80:20 or 70:30. The idea is that more data is reserved for training, while relatively less data is enough for testing the model.
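An 80:20 split can be sketched in plain Python as below. The function name `split_train_test`, its parameters and the fixed seed are assumptions made for this example; in practice, libraries such as scikit-learn offer a ready-made `train_test_split` helper.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy, so the original is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test


rows = list(range(10))  # stand-in for 10 rows of a dataset
train, test = split_train_test(rows, test_fraction=0.2)

print("train size:", len(train))
print("test size:", len(test))
```

Shuffling before splitting avoids, for example, all the cats ending up in Train and all the dogs in Test if the rows happened to be sorted by label. No row appears in both parts.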
In this way the Train and Test data each have their own importance, and care must be taken to keep them separate and independent.
Takeaway points:
- Training and Testing data must always be kept separate.
- Splitting can be done in a ratio such as 80:20 or 70:30.
- A dataset comprises features and labels.
- A feature is an input and a label is an output.
To be continued…
Part 2 of this blog is available below:
https://medium.com/@imabhi1216/split-it-up-part-2-8821b862fa90
Originally published at https://myselfdatascientist.blogspot.com on May 11, 2022.