Machine Learning: The Goblet of Solutions (Episode II)

In my last post, I went over the very basics of machine learning. At the moment you only know what classification and clustering mean, and what they can do. Can you write a few lines of code to perform classification with this much knowledge? No! Your training is not complete yet.

Estimating or predicting quantities is a problem that machine learning can solve effectively. How accurate those predictions are, however, is why there is so much research going on in the field.
First, let’s think of a prediction as an educated guess that has a very high chance of coming true. For instance, when you see dense dark clouds, there is usually a very good chance of rain. It isn’t awfully hard to come up with this prediction, right?
Forming this educated guess statistically, with math and equations, is the core of machine learning algorithms. The things that lead us to predict something are called predictors, features, or attributes. In the above example, we guessed rain from dense dark clouds, and therefore dark clouds are a predictor for rain.


If it always rained when dark clouds showed up, there would be no guessing involved, and dark clouds would indicate a 100% chance of rain. If you know the very basics of programming, a simple if statement checking for dark clouds would be sufficient to answer whether it will rain or not.
Unfortunately, life is not so easy, and the chance of rain when dark clouds appear can be anywhere between 0% and 100%. This chance is called probability. Every answer in machine learning, or in life, has some probability associated with it. Our main aim in machine learning is to give the answer with the highest probability.
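
To make the difference between the if statement and a probability concrete, here is a minimal Python sketch. The observation log is invented purely for illustration, and the function names `will_rain_naive` and `rain_probability` are my own, not part of any library.

```python
# Hypothetical log of past days: (dark_clouds_seen, did_it_rain).
# These values are invented for illustration only.
observations = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, False), (False, True), (False, False),
]


def will_rain_naive(dark_clouds):
    """The plain if statement: only correct if dark clouds always mean rain."""
    if dark_clouds:
        return True
    return False


def rain_probability(log, dark_clouds):
    """Fraction of logged days with the given sky on which it actually rained."""
    outcomes = [rained for clouds, rained in log if clouds == dark_clouds]
    if not outcomes:
        raise ValueError("no observations for this kind of sky")
    return sum(outcomes) / len(outcomes)


print(will_rain_naive(True))                  # the rule always answers True
print(rain_probability(observations, True))   # 3 of 4 cloudy days -> 0.75
print(rain_probability(observations, False))  # 1 of 4 clear days -> 0.25
```

The if statement gives the same answer every time, while the second function admits that dark clouds were followed by rain on only some of the logged days.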

Let us dive deeper and learn how to use machine learning in practical scenarios, but this time with young Voldemort as an analogy.
Young Voldemort has shown some potential in the past by burning his closet, chasing high school kids, talking to snakes and all that jazz, but he has been unable to grow a long enough nose and has developed some very serious self-esteem issues. Let’s try to predict how long Voldemort’s nose will be in the coming years.


What do we need to predict something like this? Data! The kind of data we are looking for is Voldemort’s family history. Specifically, how big were the noses of Voldemort’s ancestors, and what could be a potential predictor of that size?

As it turns out, the wand size of each of Voldemort’s family members acted as a predictor for their nose size.
The following data was acquired from old records at the Ministry of Magic.

Have you heard the term pattern used in the context of data? A pattern is an “almost” consistent mathematical relationship between columns. The relationship can take many forms. For instance, two columns could increase and decrease together, or one could go up whenever the other goes down. One column could be the square of the numbers in the second column. As a data scientist, you are always interested in these relationships.
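
One common way to put a number on "two columns move together" is the Pearson correlation coefficient. The sketch below is only an illustration: the three columns are made up (they are not the Ministry's records), and `pearson` is a helper I wrote for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of the same length (at least 2 rows)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Made-up columns for illustration only.
col_a = [1.0, 2.0, 3.0, 4.0, 5.0]
rises_together = [2.1, 3.9, 6.2, 8.0, 9.8]  # goes up when col_a goes up
moves_opposite = [9.0, 7.2, 4.9, 3.1, 1.0]  # goes down when col_a goes up

print(round(pearson(col_a, rises_together), 3))  # close to +1
print(round(pearson(col_a, moves_opposite), 3))  # close to -1
```

A value near +1 means the columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no straight-line relationship. Note that this measure only captures straight-line patterns; a relationship like "one column is the square of the other" needs a different check.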

What’s the first thing you should do when you acquire data? Plot it! Always plot it. Right now these are just numbers, and the hidden pattern is almost impossible to see unless you spend time staring at the data. I have hidden a pattern in the data, and you are welcome to find it by just staring at the table. Maybe you can do it, but imagine if this table had 1,000 rows! Hence, plot it. Please.

Here’s what our data looks like when plotted:

A few easy deductions from the plot: the data is linear in nature.
Two quantities are called linear when one changes by a roughly constant amount for each unit change in the other. Here, the two columns increase and decrease together at a steady rate. Linear data forms a straight line when plotted.
Merope has the largest wand and nose size, while Tom has the smallest wand and nose size. So we deduce that a small wand owner will end up with a small nose in Voldemort’s family!

However, one data point in this plot defies this pattern. Look at Thomas’s data point in the plot. The unusual spike is a result of Thomas having a medium wand size but a bigger nose, which is in direct defiance of our assumption that a big nose goes with a big wand. Such a data point is called an outlier or anomaly.

The chance of a data point not following a given pattern is small, and which points stray is fairly random. In other words, it won’t hurt you here to ignore the outlier, but it may hurt you to keep it, because it can skew the prediction results.
Getting rid of the outlier results in a smooth straight line, which is what you want when you are just beginning to work with data.
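
Spotting and dropping the outlier can also be done in code rather than by eye. The following is a rough sketch of one simple heuristic, not the only or the standard method: it flags anyone whose nose-to-wand ratio is far from the family's median ratio, which assumes the trend line passes roughly through the origin. All numbers are invented for illustration (they are not the values from the Ministry's records), "Relative A" and "Relative B" are placeholder names, and `find_outliers` and its `tolerance` parameter are my own.

```python
from statistics import median

# Invented (wand_size, nose_size) pairs for illustration only.
family = {
    "Tom": (9.0, 3.1),
    "Relative A": (10.5, 3.4),
    "Thomas": (12.0, 4.8),      # medium wand, unusually big nose
    "Relative B": (13.5, 4.6),
    "Merope": (15.0, 5.0),
}


def find_outliers(records, tolerance=0.10):
    """Names whose nose/wand ratio differs from the median ratio by more than `tolerance`."""
    ratios = {name: nose / wand for name, (wand, nose) in records.items()}
    typical = median(ratios.values())
    return [
        name for name, ratio in ratios.items()
        if abs(ratio - typical) / typical > tolerance
    ]


outliers = find_outliers(family)
cleaned = {name: pair for name, pair in family.items() if name not in outliers}

print(outliers)         # ['Thomas']
print(sorted(cleaned))  # ['Merope', 'Relative A', 'Relative B', 'Tom']
```

The 10% tolerance is arbitrary. With real data you would pick it after looking at the plot, which is one more reason to plot first.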

Given Voldemort’s wand size is 13.5 inches, what is his nose size? 
Maybe you randomly estimated a number in your mind, but are you sure it will be accurate to the last digit? I highly doubt it. Science and technology don’t work on random guesses.

So how do we predict a nose size for Voldemort that has a very good chance of being right? We use a machine learning algorithm called Linear Regression.

I explained what regression is in my last post. It is like classification, except that the labels are continuous in nature rather than a fixed set of categories. In other words, you are predicting a value for the data, and this value is a continuous number.
Since this regression is being performed on linear data in this case, the algorithm is called Linear Regression.
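
The difference in the kind of answer can be sketched in a few lines of Python. Both rules below are made up by hand. Neither is learned from data, the threshold and slope are placeholders, and neither is the linear regression algorithm itself. The function names are hypothetical too.

```python
def classify_nose(wand_inches):
    """Classification-style answer: one label from a small fixed set."""
    return "long" if wand_inches >= 12.0 else "short"


def estimate_nose(wand_inches):
    """Regression-style answer: any number on a continuous scale."""
    return 0.34 * wand_inches  # placeholder slope, not a fitted value


print(classify_nose(13.5))                  # a label: 'long'
print(type(estimate_nose(13.5)).__name__)   # a number: 'float'
```

Linear regression's job is to learn the right numbers for the second kind of rule from the data instead of having someone make them up.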

The steps we have taken so far are purely data processing and cleaning steps. We haven’t begun the algorithm yet, but we are ready for it. If this is your first time with linear regression, then I am sure even this much information can be slightly overwhelming.
I will continue with the main algorithm in my next post.
After all, we gotta get back to Voldemort with answers before the scumbag who lived (Harry Potter) takes total advantage of the dark lord’s insecurities.
Follow me here to stay updated with my posts.