How Statistics Plays an Important Role in Machine Learning

Girish Bhide
Sep 11, 2020


The Background –

A couple of years ago I started reading about data science, machine learning and analytics. I am a mechanical engineer, so I had no background in coding at all, though I had learned some statistics concepts during my graduation. When I started reading about machine learning and what the algorithms are used for, I had no clue how important statistics is, not only in executing the algorithms but also in interpreting their outputs. In this article, I will try to explore how statistics and machine learning are related. Let us begin…

Let Us Try to Explore with an Example –

One of the most basic machine learning algorithms is linear regression, which is used for regression problems.

Buzz Lightyear here is like the coding language: it makes everything look clear and simple. But after looking at what goes on behind the code, my expression was more like Woody's 😥.

In a programming language or statistical software, linear regression seems to be just one line of code. Simple, isn’t it? Well, the answer to that question is yes. But what goes on behind that line of code is more interesting and a little more complex.

First, let us look at the output RStudio gives for linear regression. I will discuss how the regression is actually executed later in this article. The output will give you a clear idea of why knowledge of statistics is so important for machine learning.

After you write and execute one line of code, the software gives you something like this. Hmm… it looks more like cryptography: all those asterisks, p-values, t-values, degrees of freedom and so on. This is where statistics helps us decipher the output. All of those are statistical terms that lead us to the interpretation we need. This was when I realized I needed to learn statistics in more detail.

(I am not going to discuss how to interpret the software's results here; I will discuss that some other day. 🤓)

By now you might have an idea of why statistics plays an important role in machine learning. Now let us go back to the code that generated this cryptic output. One line of code for linear regression in RStudio looks something like this…

Easy, right? Let us go behind the scenes and see how it works…

Behind the scenes –

Consider the problem from the code and output images above, in which we wish to study the regression of CONSUME (call it Y) on the variables PRICE, INC, TEMP, etc. (call them X1, X2, X3, …). To keep things simple, take a single predictor x. We assume that, for each value of x, the variable Y can be represented in the form –

Y = β0 + β1*x + ε

Where ε is a random variable that has a normal distribution with mean 0 and variance σ².

“Regression analysis is the hydrogen bomb of the statistics arsenal.”

― Charles Wheelan

Here the term linear refers to the fact that the regression function E(Y|x) = β0 + β1*x is a linear function of the parameters β0 and β1. This is how the statistical model for linear regression looks.
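To make the model Y = β0 + β1*x + ε concrete, here is a minimal sketch in Python (not the RStudio code from this article). It simulates data from the model and recovers β0 and β1 with the usual least-squares formulas. The function name, the simulated values and the chosen "true" parameters are assumptions made for illustration only; this is not the CONSUME dataset.

```python
import random


def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Simulated data (hypothetical values): errors are normal with mean 0, sd sigma
random.seed(0)
true_b0, true_b1, sigma = 2.0, 0.5, 1.0
x = [i / 10 for i in range(200)]
y = [true_b0 + true_b1 * xi + random.gauss(0, sigma) for xi in x]

b0_hat, b1_hat = fit_simple_linear_regression(x, y)
print(f"estimated intercept: {b0_hat:.3f}, estimated slope: {b1_hat:.3f}")
```

The estimates should land close to the values used in the simulation, but not exactly on them, because of the random error ε.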

The important thing here is that we made some assumptions on which linear regression rests.

1. Normality: we assumed that the error term ε is normally distributed (not the variables X themselves).

2. Independence: we assumed that the errors, and hence the observations of Y, are independent of one another.

3. Homoscedasticity: we assumed that the errors have the same variance σ² for every value of X.
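In practice these assumptions are usually checked on the residuals (observed minus fitted values), which stand in for the error term ε. Below is a rough sketch of such checks in Python, using only the standard library. The data are simulated, the helper `pearson` is a hypothetical name, and the three checks are crude rules of thumb for illustration, not formal tests.

```python
import math
import random
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


# Simulated data that satisfies the assumptions (hypothetical values)
random.seed(1)
x = [i / 10 for i in range(400)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

# Least-squares fit of y = b0 + b1*x
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Residuals: the observable stand-in for the unobserved errors
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Homoscedasticity: spread should be similar for small and large x
half = len(x) // 2
spread_ratio = statistics.stdev(residuals[half:]) / statistics.stdev(residuals[:half])

# Normality: roughly 68% of normal values lie within one standard deviation
sd_all = statistics.stdev(residuals)
share_within_1sd = sum(abs(r) <= sd_all for r in residuals) / len(residuals)

# Independence: consecutive residuals should be nearly uncorrelated
lag1_corr = pearson(residuals[:-1], residuals[1:])

print(f"spread ratio (high x / low x): {spread_ratio:.2f}")
print(f"share within 1 sd: {share_within_1sd:.2f}")
print(f"lag-1 correlation: {lag1_corr:.2f}")
```

A spread ratio far from 1, a share far from 0.68, or a lag-1 correlation far from 0 would each hint that the corresponding assumption deserves a closer look.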

Now, you can run the model without knowing what statistics goes on in the background, and it will still give you results. But those results cannot be trusted to support your “hypothesis” if the assumptions and other important parameters were not taken into consideration.

Hypothesis is again a term used in statistics and in research. Put simply, it is a statement that you test against the data you have. So, with the help of your data, you either find support for your statement (hypothesis) or reject it.
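For linear regression, a common hypothesis to test is "the slope β1 is zero", meaning x has no linear effect on Y. The t-values, p-values and degrees of freedom in regression output come from tests like this one. Here is a hedged Python sketch of that calculation for a single predictor. The function name and the simulated data are assumptions for illustration, and the p-value uses a normal approximation to Student's t distribution, which is only reasonable when there are many observations.

```python
import math
import random
from statistics import NormalDist


def slope_t_test(x, y):
    """Test H0: slope = 0 in y = b0 + b1*x + error.

    Returns (b1, se_b1, t_value, df, p_approx).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar

    df = n - 2  # degrees of freedom: n observations minus 2 estimated parameters
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / df)      # residual standard error
    se_b1 = sigma_hat / math.sqrt(s_xx)  # standard error of the slope
    t_value = b1 / se_b1

    # Assumption: for large df, Student's t is close to the standard normal,
    # so this two-sided p-value is an approximation, not the exact t-based one.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_value)))
    return b1, se_b1, t_value, df, p_approx


# Simulated data with a real linear effect (hypothetical values)
random.seed(2)
x = [i / 10 for i in range(100)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

b1, se_b1, t_value, df, p_approx = slope_t_test(x, y)
print(f"slope={b1:.3f}, se={se_b1:.3f}, t={t_value:.2f}, df={df}, p≈{p_approx:.4f}")
```

A large absolute t-value (and therefore a small p-value) is evidence against the hypothesis that the slope is zero.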

Okay, so this is what goes on in the background when we execute the code. (I have not mentioned all the steps, just a glimpse of them; I do not want to irritate you by adding lots of equations 😀.) The software does all the complex work for us and gives us a clean output.

My Opinion on the Importance of Statistics –

Statistics is the base of machine learning. With statistical knowledge, going through machine learning algorithms becomes much easier and more interesting. We can choose a model that fits our dataset well and then judge properly whether the results support our hypothesis.

Interpreting results becomes easier, and this knowledge is also very useful for spotting and rectifying errors or steps that are going wrong.

So, I suggest that before heading directly to machine learning algorithms, one should learn the statistical methods that are helpful in machine learning.

Comment below if you noticed the regression line in The Matrix 😁😉

Cheers !!!
