Data Science & Linear Regression (Part I)
Why have I decided to write this article?
I work as a mentor in the field of data science and machine learning. My day-to-day job is to resolve students' technical doubts on an online discussion forum. During my tenure, I have felt that students pay X amount of money for a course that lacks content and sometimes provides wrong information. One can learn data science and machine learning for much less than what one pays for such a course. There are ample articles and videos on the net; the internet has more depth than the ocean, and it is up to us to dive in and be ready to submerge.
I have decided to keep this article as layman-friendly as I can, so I won't be using many technical definitions. Let's begin with a brief look at what data science is, and we will end the article with linear regression.
What is Data Science?
Data science is the science of extracting hidden patterns from data. We need to focus on the following words in this line:
- Data: It refers to the volume of data that is generated daily. When we generate these data points, we leave behind our behavioral patterns and habits in the data set, i.e. data can describe your behavioral patterns and habits. For example: Mr. Modi always purchases clothes during promotional offers or when there is a clearance sale.
- Pattern: Patterns are nothing but trends, cycles, associations, rules or clusters in a data set.
- a) Trend: for example, as the temperature rises during the day, the volume of ice cream sales increases.
- b) Cycles: for example, jackets sell more in winter than in summer.
- c) Association: for example, product A is always purchased when somebody buys product B.
We can determine all of the above by performing various kinds of statistical analysis on the data and representing it graphically, i.e. by doing Exploratory Data Analysis, a.k.a. EDA. I will post an in-depth article on it in the future, which will be beneficial for you. It is often said that around 80% of a data analyst's time goes into EDA.
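To make the "association" idea a little more concrete, here is a minimal sketch of the kind of check one might do during EDA. The baskets are made up purely for illustration, and `confidence` is a hypothetical helper name, not something from a library.

```python
# Made-up shopping baskets, purely for illustration (not real data).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "bread"},
    {"milk"},
]


def confidence(baskets, if_bought, then_bought):
    """Share of baskets containing `if_bought` that also contain `then_bought`."""
    with_first = [b for b in baskets if if_bought in b]
    if not with_first:
        return 0.0
    with_both = [b for b in with_first if then_bought in b]
    return len(with_both) / len(with_first)


# How often is bread bought when butter is bought, and the other way round?
butter_to_bread = confidence(baskets, "butter", "bread")
bread_to_butter = confidence(baskets, "bread", "butter")
print(butter_to_bread, bread_to_butter)
```

In this toy data every basket with butter also has bread, but not every basket with bread has butter, so an association can be stronger in one direction than the other.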
Now that I have told you what data science is, let us move on to what machine learning is!
What is Machine Learning?
Kindly look at this picture and give it a thought. I will tell you the essence of it at the end of the article!
Is machine learning a part of data science? Obviously; one does not need to ask this. But why do we use it? We cannot express each and every pattern in the data just by performing statistical analysis and portraying the data with visual graphics, and so we cannot capture them all in a single hand-written program. This is where machine learning comes in, as it solves tasks that are difficult to solve algorithmically. For example: face recognition on your phone, voice understanding, driving a car (Google Self-Driving Car), diagnosing diseases from symptoms (Watson), recommending products and books (Amazon), movies (Netflix) and music (Spotify), or performing the functions of a personal assistant (Siri, Cortana).
Now the question arises: what is the difference between a program and machine learning?
In traditional programming you hard-code the behavior of the program, i.e. you define the behavior or set of behaviors yourself. In machine learning, you leave a lot of that to the machine: the machine learns the behavior from data.
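A toy sketch may help to see this difference. Everything below is an assumption made for illustration: the temperature readings and labels are made up, and the function names are hypothetical. The first function has its rule written by hand, while the second picks its rule from labelled examples.

```python
# Traditional programming: the rule is written by hand.
def is_hot_hardcoded(temperature_c):
    return temperature_c > 30  # threshold chosen by the programmer


# Machine learning (toy version): the threshold is picked from labelled data.
# The readings and labels below are made up for illustration.
examples = [(18, False), (22, False), (24, False), (27, True), (31, True), (35, True)]


def learn_threshold(examples):
    """Try each observed temperature as a cut-off and keep the one with the fewest mistakes."""
    best_threshold, best_errors = None, len(examples) + 1
    for candidate, _ in sorted(examples):
        # A mistake is any example where "temp >= candidate" disagrees with its label.
        errors = sum((temp >= candidate) != label for temp, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


learned = learn_threshold(examples)


def is_hot_learned(temperature_c):
    return temperature_c >= learned


print(learned, is_hot_hardcoded(28), is_hot_learned(28))
```

If the labelled examples change, the learned threshold changes with them, while the hard-coded rule stays the same until somebody edits the program.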
Wow, I still have not told you what machine learning is! Let me explain it with an example, and you can formulate the definition yourself.
There is a little boy in my house; his name is Aryan. He is 4 years old. He is really cute and he has been trying to learn to walk. Every time he tries to stand on his legs, he falls (Data). I have seen him do this many times, and trust me, I think I would have been just like him when I was his age. Today, after a lot of trials and failures, he finally stood up and took baby steps (Learnt).
With this example, all I am trying to tell you is that every time he tried to stand and failed, that attempt became part of his set of data. With every attempt, he learnt the angles at which he has to place his legs in order to stand, i.e. he gained experience from the data, i.e. he learnt from the data. And finally, Aryan stood up (Output).
I think this will have cleared up your understanding of all the definitions you have read on the net and in any online course.
Talking of machine learning algorithms, we broadly divide them into two parts:
- Supervised Learning: Here, the algorithm needs some kind of initial help to identify hidden patterns. There are two stages in supervised learning: the first is training and model building, the second is testing and checking accuracy. In supervised learning, the data set has N independent variable(s) and a dependent variable. We predict the value of the dependent variable based on the N independent variable(s). Hence, we assume that a relationship exists between the independent and dependent variables. Such a model is also called an alternate hypothesis, because unless you demonstrate and prove it, whatever patterns you have found are merely conjecture.
- Unsupervised Learning: This works on an unlabeled data set, i.e. there is no dependent variable. It just helps you to identify hidden patterns in the data. It finds patterns by calculating distances and similarities between data points.
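The two flavours above can be sketched in a few lines. Please treat this as an illustration under assumptions: the data is made up, and the 1-nearest-neighbour rule is only a stand-in model chosen because it is short, not something this article covers.

```python
import math

# Made-up labelled data: (hours_studied, hours_slept) -> passed the exam?
labelled = [
    ((1.0, 4.0), False), ((2.0, 5.0), False), ((1.5, 6.0), False),
    ((6.0, 7.0), True), ((7.0, 6.5), True), ((8.0, 8.0), True),
    ((2.5, 4.5), False), ((6.5, 7.5), True),
]

# Stage 1: training and model building.
# For this stand-in model, "building" just means remembering the training rows.
train, test = labelled[:6], labelled[6:]


def predict(features):
    """Return the label of the closest training point (1-nearest-neighbour)."""
    nearest = min(train, key=lambda row: math.dist(row[0], features))
    return nearest[1]


# Stage 2: testing and checking accuracy on rows the model has not seen.
correct = sum(predict(x) == y for x, y in test)
accuracy = correct / len(test)

# Unsupervised flavour: ignore the labels and look only at distances between points.
points = [x for x, _ in labelled]
closest_pair_distance = min(
    math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
)

print(accuracy, closest_pair_distance)
```

The supervised part needs the True/False labels to measure accuracy, while the unsupervised part never looks at them.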
Since we have talked about supervised learning, let us move to our first algorithm.
Linear regression is a supervised learning model. In it, the models generated are represented in the form of an equation, where the equation can be:
- 2 dimensions -> equation of a line
- 3 dimensions -> equation of a plane
- n dimensions -> equation of a hyperplane
The term Regression tells us that we will predict real-number values. The term Linear refers to the method used to generate the model, which uses a linear combination of the independent variables, i.e. y is a combination of the independent variables xi, each with its own weight and all of the same (first) degree.
Equation of Y:
y= Intercept + Coefficient * Independent Variable
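As a small sketch, the equation can be written as a function. The name `predict` and the numbers below are only assumptions for illustration; with more than one independent variable, each one simply gets its own coefficient.

```python
def predict(intercept, coefficients, features):
    """y = intercept + sum of (coefficient_i * independent_variable_i)."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per independent variable")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# One independent variable: the equation of a line.
y_line = predict(2.0, [3.0], [4.0])  # 2 + 3*4

# Two independent variables: the equation of a plane.
y_plane = predict(1.0, [0.5, 2.0], [4.0, 3.0])  # 1 + 0.5*4 + 2*3

print(y_line, y_plane)
```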
With all the independent variables available in our data set, we need to find what relationship they have with the dependent variable. The relationship between the dependent variable and an independent variable can be expressed in terms of correlation, which ranges from -1 to 1.
- If the correlation is close to 1 or -1, there is a strong linear relationship (it is perfect only when the correlation is exactly 1 or -1).
- If the correlation is close to 0, investigate the variable further before dropping that independent variable.
The formula for correlation is given below:
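Assuming the formula meant here is the standard Pearson correlation coefficient, it can be written as follows, with Xbar and Ybar shown as \bar{x} and \bar{y}:

```latex
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```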
- In linear regression we treat the X-variable(s) as independent variables and the Y-variable as the dependent variable.
- (X-Xbar) states how dissimilar X is from its central value (its mean).
- The numerator of the formula is the covariance, i.e. how the data points vary together in the mathematical space.
- (X-Xbar)(Y-Ybar) is an area, which is the degree of freedom that a data point enjoys in being so far away from the central point.
In the above diagram, the black data point (xi, yi) lies away from the central values (Xbar, Ybar).
- Xbar and Ybar are the expected values of the prediction, but data points move away from them because of the random factor in the data.
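Here is a minimal sketch of how such a correlation could be computed by hand, assuming the usual Pearson formula. The function name and the temperature and ice cream numbers are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: how x and y move together around their central values.
    co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: how much x and y each spread around their own central value.
    x_spread = sum((x - x_bar) ** 2 for x in xs)
    y_spread = sum((y - y_bar) ** 2 for y in ys)
    if x_spread == 0 or y_spread == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return co_variation / math.sqrt(x_spread * y_spread)


# Made-up numbers: daily temperature (in Celsius) and ice creams sold.
temperature = [20, 22, 25, 27, 30, 33]
ice_cream_sales = [110, 125, 150, 160, 190, 215]

r = pearson_correlation(temperature, ice_cream_sales)
print(round(r, 3))
```

A value of r close to 1 here would say that sales rise almost in a straight line with temperature, which is exactly the kind of independent variable we would like to keep.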
Now we have the good independent variables, the ones that are correlated with the dependent variable. To fit a regression line, our algorithm must find the best surface, the one that minimizes the Sum of Squared Errors (SSE).
In the above image, we have drawn a scatter plot of our independent and dependent variables, x and y. The blue points are the actual data points, and the red line shows the predicted surface fitted among them.
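Assuming the usual definition, the Sum of Squared Errors that the fitted line minimizes can be written as:

```latex
SSE = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
```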
Here, Y_hat are the predicted values and Y are the actual values. Our goal is to minimize the SSE.
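For a single independent variable, the line that minimizes the SSE has a well-known closed form (ordinary least squares). The sketch below assumes that setting; the data and function names are made up for illustration, and the next part of this series may well build the model differently.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """SSE: the sum of (actual - predicted)^2 over all data points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, intercept, slope)

# Nudging the fitted line in any direction should only make the SSE larger.
shifted_sse = sum_squared_errors(xs, ys, intercept + 0.5, slope)
tilted_sse = sum_squared_errors(xs, ys, intercept, slope + 0.1)

print(intercept, slope, best_sse, shifted_sse, tilted_sse)
```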
In the image above, SSR (Sum of Squares due to Regression) is the variation captured by the model. The closer SSR is to SST (Total Sum of Squares = SSR + SSE), the better the model.
One more prominent term used in linear regression is R², the coefficient of determination. It tells us how well the best-fit line of the linear model has fitted the data points.
The whole rectangular area (abcd) is the degree of freedom enjoyed by the data point, represented by the box with diagonal lines. How much of this area is captured by the predicted values is represented by the box marked ‘S’ (dcfe), which states the percentage of error captured by your model for that data point. The ratio of the error captured by your model to the degree of freedom enjoyed by the data point is called the coefficient of determination. Coefficient of determination = dcfe/abcd.
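The three sums and R² can be checked numerically. This is a sketch under assumptions: it uses a least-squares line with an intercept (the case in which SST = SSR + SSE holds), made-up data, and a hypothetical helper name.

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for a least-squares line fitted to xs and ys."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    predicted = [intercept + slope * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)                  # total variation in y
    ssr = sum((p - y_bar) ** 2 for p in predicted)           # variation captured by the model
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))   # variation left unexplained
    return sst, ssr, sse


# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst  # the same number as 1 - sse / sst

print(sst, ssr, sse, r_squared)
```

The closer `r_squared` gets to 1, the more of the total variation in y is captured by the fitted line.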
We use another metric to judge the accuracy of a model, called adjusted R². It is R² minus fluke, where fluke means an apparent relationship between the X and Y variables that arises by statistical chance.
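One common way to write this down is the standard adjusted R² formula, which shrinks R² more as you add independent variables without adding data. The sketch below assumes that formula; the function name and the numbers are only for illustration.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same R² and same number of rows, but a different number of independent variables.
with_one_variable = adjusted_r_squared(0.90, n_samples=50, n_predictors=1)
with_ten_variables = adjusted_r_squared(0.90, n_samples=50, n_predictors=10)

print(with_one_variable, with_ten_variables)
```

With the same R², the model that needed ten variables gets the lower adjusted score, because more of its fit could be down to fluke.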
This was the first part on linear regression, covering just the basic concepts. In the next part, I will take you a little deeper and talk about the working of linear regression, and how to build it both from scratch and with existing libraries. You will find a new article every week. Stay tuned.
Now, the essence of the picture above is that, most of the time, our data is not sufficient for our model to learn from it efficiently. Hence, we let the data pile up and then build a model.