Rajesh S. Brid
Sep 28, 2018 · 18 min read

As a step toward understanding Data Science and Machine Learning, this third article in the series gives a brief introduction to Machine Learning, to Regression, and more specifically to Linear Regression.

Introduction to Machine Learning and Regression :

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” — Tom Mitchell, Carnegie Mellon University


Machine learning tasks (algorithms) are typically classified into three broad categories, depending on the nature of the learning “signal” or “feedback” available to a learning system. These are:

  • Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs. Supervised Learning is the Machine Learning task of inferring a function from labelled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyses the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way.
  • Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end.
  • Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal or not. Another example is learning to play a game by playing against an opponent.

Regression is a statistical method that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables). In its linear form, regression analyses the relation between one variable and some other variable(s) assuming a linear relation; this form is also referred to as least squares regression or Ordinary Least Squares (OLS). It is a very powerful technique for prediction in Machine Learning, and it is a Supervised Learning algorithm.

The overall idea of regression is to examine two things:

(1) does a set of predictor variables do a good job in predicting an outcome (dependent) variable?

(2) which variables in particular are significant predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable?

Different types of Regression:

  • Linear Regression
  • Logistic Regression

Linear Regression

Linear regression is the simplest and most widely used statistical technique for predictive modelling (analysis).

The idea is to find the red line that best fits the blue points, which are the actual samples. With linear regression, the points are approximated by a single straight line rather than connected exactly. This example uses simple linear regression, where the sum of the squared vertical distances between the red line and each sample point is minimized.

Linear regression is a way to explain the relationship between a dependent variable and one or more explanatory variables using a straight line. It is a special case of regression analysis.

Linear regression was the first type of regression analysis to be studied rigorously. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters. What is more, the statistical properties of the resulting estimators are easier to determine.

“Almost all of statistics is linear regression, and most of what is left over is non-linear regression.” - Robert I. Jennrich (University of California at L.A.)

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

  • Linear regression can be used to fit a predictive model to a set of observed values (data). This is useful if the goal is prediction, forecasting, or error reduction. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.
  • Given a variable y and a number of variables X1, …, Xp that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which Xj has no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y.

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.

Some examples of statistical relationships might include:

· Height and weight — as height increases, you’d expect weight to increase, but not perfectly.

· Weight for Age — as the baby grows older, the weight increases.

· Alcohol consumed and blood alcohol content — as alcohol consumption increases, you’d expect one’s blood alcohol content to increase, but not perfectly.

· Vital lung capacity and pack-years of smoking — as amount of smoking increases (as quantified by the number of pack-years of smoking), you’d expect lung function (as quantified by vital lung capacity) to decrease, but not perfectly.

· Driving speed and gas mileage — as driving speed increases, you’d expect gas mileage to decrease, but not perfectly.

Linear regression models try to make the vertical distance between the line and the data points (i.e., the residuals) as small as possible. This is called “fitting the line to the data.” Often, linear regression models try to minimize the sum of the squares of the residuals (least squares), but other ways of fitting exist. They include minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or minimizing a penalized version of the least squares loss function as in ridge regression. The least squares approach can also be used to fit models that are not linear.

In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X. When there is only one predictor variable, the prediction method is called simple regression. In simple linear regression, the predictions of Y when plotted as a function of X form a straight line.

For example, let’s take sales numbers for umbrellas for the last 24 months and find out the average monthly rainfall for the same period. Plot this information on a chart, and the regression line will demonstrate the relationship between the independent variable (rainfall) and dependent variable (umbrella sales):

[Figure: Linear regression analysis]

Mathematically, a linear regression is defined by this equation:

y = bx + a + ε


· x is an independent variable.

· y is a dependent variable.

· a is the Y-intercept, which is the expected mean value of y when all x variables are equal to 0. On a regression graph, it’s the point where the line crosses the Y axis.

· b is the slope of the regression line, which is the rate of change of y as x changes.

· ε is the random error term, which is the difference between the actual value of a dependent variable and its predicted value.

The linear regression equation always has an error term because, in real life, predictors are never perfectly precise.
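As a rough illustration of the equation above, the following sketch estimates a and b with the usual closed-form least-squares formulas for a single predictor. The function name `fit_line` and the rainfall and sales figures are hypothetical, made up for this example only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data: monthly rainfall (mm) and umbrella sales (units)
rainfall = [10, 20, 30, 40, 50]
sales = [27, 43, 66, 84, 105]

a, b = fit_line(rainfall, sales)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict sales for a new rainfall value using the fitted line
predicted = b * 35 + a
print(f"predicted sales at 35 mm: {predicted:.2f}")
```

The error term ε does not appear in the code because it is unobserved; the fitted line only gives the predicted value for each x.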


The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.

Residual = Observed value - Predicted value
e = y - ŷ

Both the sum and the mean of the residuals are equal to zero when the line is fitted by least squares with an intercept. That is, Σ e = 0 and ē = 0.
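The zero-sum property can be checked numerically. This is a minimal sketch that assumes the line is fitted by ordinary least squares with an intercept; the helper `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical observations
xs = [10, 20, 30, 40, 50]
ys = [27, 43, 66, 84, 105]

a, b = fit_line(xs, ys)

# Residual e = observed y minus predicted y-hat, one per data point
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]
total = sum(residuals)
mean_residual = total / len(residuals)

print("residuals:", [round(e, 2) for e in residuals])
print("sum is numerically zero:", abs(total) < 1e-9)
```

Because of floating-point rounding, the sum is compared against a small tolerance rather than tested for exact equality with zero.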

Residual Plots

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

The chart below (fig. 1) displays the residual (e) against the independent variable (X) as a residual plot. The residual plot shows a fairly random pattern: the first residual is positive, the next two are negative, the fourth is positive, and the last residual is negative. This random pattern indicates that a linear model provides a decent fit to the data.

[Figures: fig. 1 and fig. 2, residual plots showing a random pattern]

The residual plot (fig. 2) also shows a random pattern, indicating a good fit for a linear model. A random pattern of residuals supports a linear model; a non-random pattern supports a non-linear model. The sum of the residuals is always zero, whether the data set is linear or nonlinear.
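To see what a non-random pattern looks like, here is a small sketch that fits a straight line to data that is actually curved and prints the residuals in order of x. The data (y = x²) and the helper names are hypothetical, chosen only to make the pattern obvious.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def residuals(xs, ys):
    """Residuals (observed minus predicted) of the least-squares line."""
    a, b = fit_line(xs, ys)
    return [y - (b * x + a) for x, y in zip(xs, ys)]


# Hypothetical curved data: y grows with the square of x
xs = [0, 1, 2, 3, 4, 5, 6]
curved = [x ** 2 for x in xs]

curved_res = residuals(xs, curved)
print([round(e, 2) for e in curved_res])
```

The residuals are positive at both ends and negative in the middle, a U shape rather than a random scatter, which suggests that a straight line is the wrong model for this data. Their sum is still zero.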


Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.

  • It could have an extreme X value compared to other data points.
  • It could have an extreme Y value compared to other data points.
  • It could have extreme X and Y values.
  • It might be distant from the rest of the data, even without extreme X or Y values.

Each type of outlier is depicted graphically in the scatterplots below.

[Figure: four scatterplots, titled Extreme X value, Extreme Y value, Extreme X and Y, and Distant data point]

An outlier is an extreme value of a variable. The outlier may be quite large or small (where large and small are defined relative to the rest of the sample).

i. An outlier may affect the sample statistics, such as a correlation coefficient. It is possible for an outlier to affect the result, for example, such that we conclude that there is a significant relation when in fact there is none or to conclude that there is no relation when in fact there is a relation.

ii. The researcher must exercise judgment (and caution) when deciding whether to include or exclude an observation.

Influential Points

An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.

This type of analysis is illustrated below. The scatterplots are identical, except that one plot includes an outlier. When the outlier is present, the slope is flatter (-3.32, compared with -4.10 without it), so this outlier would be considered an influential point.

[Figure: two scatterplots, without and with the outlier]

Without outlier: regression equation ŷ = 104.78 - 4.10x, coefficient of determination R² = 0.94
With outlier: regression equation ŷ = 97.51 - 3.32x, coefficient of determination R² = 0.55

The charts below compare regression statistics for another data set with and without an outlier. Here, one chart has a single outlier, located at the high end of the X axis (where x = 24). As a result of that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would be considered an influential point.

[Figure: two scatterplots, without and with the outlier]

Without outlier: regression equation ŷ = 92.54 - 2.5x, slope b = -2.5, coefficient of determination R² = 0.46
With outlier: regression equation ŷ = 87.59 - 1.6x, slope b = -1.6, coefficient of determination R² = 0.52

Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller. In the first example above, the coefficient of determination is smaller when the influential point is present (0.94 vs. 0.55). In the second example, it is bigger (0.46 vs. 0.52).
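For reference, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes a least-squares line with an intercept; the function names and data are hypothetical and are not the data behind the examples above.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(xs, ys):
    """R^2 = 1 - SS_res / SS_tot for the least-squares line."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data that lies close to a straight line
xs = [10, 20, 30, 40, 50]
ys = [27, 43, 66, 84, 105]

r2 = r_squared(xs, ys)
print(f"R^2 = {r2:.4f}")
```

A value close to 1 means the line explains most of the variation in y; a value close to 0 means it explains very little.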

If your data set includes an influential point, here are some things to consider.

  • An influential point may represent bad data, possibly the result of measurement error. If possible, check the validity of the data point.
  • Compare the decisions that would be made based on regression equations defined with and without the influential point. If the equations lead to contrary decisions, use caution.

Data sets with influential points can be linear or nonlinear. With respect to regression, outliers are influential only if they have a big effect on the regression equation. Sometimes, outliers do not have big effects. For example, when the data set is very large, a single outlier may not have a big effect on the regression equation.
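The with-and-without comparison described in this section can be sketched in a few lines. The data set below is hypothetical: five points on a falling line plus one extra point at x = 24 that does not follow the trend.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical data without the outlier
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 6, 4, 2]

# The same data with one extreme point appended
xs_out = xs + [24]
ys_out = ys + [10]

_, slope_without = fit_line(xs, ys)
_, slope_with = fit_line(xs_out, ys_out)

print(f"slope without outlier: {slope_without:.2f}")
print(f"slope with outlier:    {slope_with:.2f}")
```

If the two slopes differ enough to change a decision, the extra point is influential and deserves a closer look before it is kept or dropped.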

Assumptions of Linear regression :

A. A linear relationship exists between dependent and independent variable. Note: if the relation is not linear, it may be possible to transform one or both variables so that there is a linear relation.

B. The independent variable is uncorrelated with the residuals; that is, the independent variable is not random.

C. The expected value of the disturbance term is zero; i.e. E(εi) = 0

D. There is a constant variance of the disturbance term; that is, the disturbance or residual terms are all drawn from a distribution with an identical variance. In other words, the disturbance terms are homoskedastic. [A violation of this is referred to as heteroskedasticity.]

E. The residuals are independently distributed; that is, the residual or disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as autocorrelation.]

F. The disturbance term (a.k.a. residual, a.k.a. error term) is normally distributed.


Linear regression is the simplest and most widely used statistical technique for predictive modeling. It basically gives us an equation in which the features are the independent variables on which our target variable (sales, in our case) depends.

The linear regression equation is represented as:

Y = Θ0 + Θ1X1 + Θ2X2 + … + ΘnXn

where Y is the dependent variable, the X’s are the independent variables and all thetas are the coefficients. Coefficients are basically the weights assigned to the features, based on their importance. For example, if we believe that sales of an item depend more on the type of location than on the size of the store, then a smaller outlet in a tier 1 city would sell more than a bigger outlet in a tier 3 city. Therefore, the coefficient of location type would be larger than that of store size.
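The idea that coefficients act as weights can be illustrated with a small sketch. Everything numeric here is hypothetical: the coefficient values, the numeric "location score" used to encode the city tier, and the store sizes are invented for the example.

```python
def predict(theta0, thetas, features):
    """Return theta0 + theta1*x1 + ... + thetan*xn."""
    return theta0 + sum(t * x for t, x in zip(thetas, features))


# Hypothetical coefficients: an intercept, then one weight per feature.
# Feature order: [location score, store size]
theta0 = 100.0
thetas = [50.0, 2.0]  # location is weighted far more heavily than size

# A small outlet in a tier 1 city (high location score)
small_tier1 = predict(theta0, thetas, [3, 20])

# A big outlet in a tier 3 city (low location score)
big_tier3 = predict(theta0, thetas, [1, 60])

print("small tier 1 outlet:", small_tier1)
print("big tier 3 outlet:  ", big_tier3)
```

With these made-up weights the smaller tier 1 outlet is predicted to sell more, because the larger location coefficient outweighs the difference in size.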

For linear regression with only one feature, i.e., only one independent variable, the equation becomes:

Y = Θ1*X + Θ0

This equation is called a simple linear regression equation, which represents a straight line, where ‘Θ0’ is the intercept, ‘Θ1’ is the slope of the line. Take a look at the plot below between sales and MRP.

We can see that the sales of a product increase as its MRP increases. The dotted red line represents our regression line, or the line of best fit. Now, how do we find this line?

The Line of Best Fit

As shown below, there can be so many lines which can be used to estimate Sales according to their MRP. So how would you choose the best fit line or the regression line?

The main purpose of the best fit line is that our predicted values should be as close as possible to the actual (observed) values, because there is no point in predicting values which are far away from the real values. In other words, we try to minimize the difference between the predicted values and the observed values, which is termed the error. A graphical representation of the error is shown below. These errors are also called residuals. The residuals are indicated by the vertical lines showing the difference between the predicted and actual values.

Our main objective is to find out the error and minimize it. To calculate the error, we know that error is the difference between the value predicted by us and the observed value. Let’s consider three ways through which we can calculate error:

  • Sum of residuals (∑(Y - h(X))): it might result in cancelling out of positive and negative errors.
  • Sum of the absolute value of residuals (∑|Y - h(X)|): the absolute value would prevent cancellation of errors.
  • Sum of squares of residuals (∑(Y - h(X))²): it’s the method mostly used in practice, since here we penalize a higher error value much more than a smaller one, so that there is a significant difference between making big errors and small errors, which makes it easy to differentiate and select the best fit line.

Therefore, the sum of squares of these residuals is denoted by:

SSE = ∑ (y - h(x))², summed over all m rows of the training set

where, h(x) is the value predicted by us, h(x) =Θ1*x +Θ0 , y is the actual values and m is the number of rows in the training set.
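The three ways of measuring error can be compared on a small example. In this sketch the data and the candidate line h(x) = 2x + 1 are hypothetical, and the function name `error_measures` is made up for illustration.

```python
def error_measures(xs, ys, theta0, theta1):
    """Return (sum, sum of absolute values, sum of squares) of residuals."""
    residuals = [y - (theta1 * x + theta0) for x, y in zip(xs, ys)]
    total = sum(residuals)
    total_abs = sum(abs(r) for r in residuals)
    total_sq = sum(r * r for r in residuals)
    return total, total_abs, total_sq


# Hypothetical training set with m = 4 rows
xs = [1, 2, 3, 4]
ys = [5, 3, 8, 8]

# Candidate line h(x) = 2x + 1 gives residuals 2, -2, 1, -1
total, total_abs, total_sq = error_measures(xs, ys, theta0=1, theta1=2)

print("sum of residuals:         ", total)
print("sum of absolute residuals:", total_abs)
print("sum of squared residuals: ", total_sq)
```

The plain sum comes out as zero even though no point lies on the line, which is exactly the cancellation problem mentioned above. The squared measure also weights the residuals of size 2 four times as heavily as those of size 1.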

The Cost Function

So let’s say, you increased the size of a particular shop, where you predicted that the sales would be higher. But despite increasing the size, the sales in that shop did not increase that much. So the cost applied in increasing the size of the shop, gave you negative results.

So, we need to minimize these costs. Therefore we introduce a cost function, which is basically used to define and measure the error of the model:

J(Θ0, Θ1) = (1/2m) ∑ (y - h(x))²

If you look at this equation carefully, it is just the sum of squared errors multiplied by a factor of 1/2m in order to ease the mathematics.

So in order to improve our prediction, we need to minimize the cost function. For this purpose we use the gradient descent algorithm.
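A minimal sketch of such a cost function is shown below. It assumes the cost is the sum of squared errors scaled by 1/2m, as described above, with h(x) = Θ1*x + Θ0; the function name `cost` and the data are hypothetical.

```python
def cost(xs, ys, theta0, theta1):
    """Sum of squared errors scaled by 1/(2m), for h(x) = theta1*x + theta0."""
    m = len(xs)
    squared_errors = [(y - (theta1 * x + theta0)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


# Hypothetical training set with m = 4 rows
xs = [1, 2, 3, 4]
ys = [5, 3, 8, 8]

# Cost of two candidate lines; the lower value is the better fit
cost_a = cost(xs, ys, theta0=1, theta1=2)  # h(x) = 2x + 1
cost_b = cost(xs, ys, theta0=0, theta1=0)  # h(x) = 0, a deliberately bad line

print("cost of h(x) = 2x + 1:", cost_a)
print("cost of h(x) = 0:     ", cost_b)
```

Comparing the two values shows how the cost function ranks candidate lines: the line that stays closer to the observed values gets the smaller cost.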


Gradient Descent

Consider an example: we need to find the minimum value of this equation:

Y = 5x + 4x². In mathematics, we take the derivative of this equation with respect to x and equate it to zero: dY/dx = 5 + 8x = 0, which gives x = -5/8. This is the point where the equation is minimum, and substituting that value gives the minimum value of the equation, Y = -25/16.

Gradient descent works in a similar manner. It iteratively updates Θ, to find a point where the cost function would be minimum.
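The iterative update can be sketched as follows for the single-feature case h(x) = Θ1*x + Θ0. This is one common form of batch gradient descent, not necessarily the exact variant the figures were based on; the learning rate, the iteration count and the data are assumptions chosen so that this small example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit h(x) = theta1*x + theta0 by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Error of the current line on every training row: h(x) - y
        errors = [(theta1 * x + theta0) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the 1/(2m)-scaled squared-error cost
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters together, stepping against the gradient
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Hypothetical data generated from y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

theta0, theta1 = gradient_descent(xs, ys)
print(f"theta0 = {theta0:.4f}, theta1 = {theta1:.4f}")
```

If the learning rate is too large the updates overshoot and the cost grows instead of shrinking; if it is too small, many more iterations are needed.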

Disadvantages of Linear Regression :

Linear regression is a statistical method for examining the relationship between a dependent variable, denoted as y, and one or more independent variables, denoted as x. The dependent variable must be continuous, in that it can take on any value, or at least close to continuous. The independent variables can be of any type. Although linear regression cannot show causation by itself, the dependent variable is usually affected by the independent variables.

Linear Regression Is Limited to Linear Relationships

By its nature, linear regression only looks at linear relationships between dependent and independent variables. That is, it assumes there is a straight-line relationship between them. Sometimes this is incorrect. For example, the relationship between income and age is curved, i.e., income tends to rise in the early parts of adulthood, flatten out in later adulthood and decline after people retire. You can tell if this is a problem by looking at graphical representations of the relationships.

Linear Regression Only Looks at the Mean of the Dependent Variable

Linear regression looks at a relationship between the mean of the dependent variable and the independent variables. For example, if you look at the relationship between the birth weight of infants and maternal characteristics such as age, linear regression will look at the average weight of babies born to mothers of different ages. However, sometimes you need to look at the extremes of the dependent variable, e.g., babies are at risk when their weights are low, so you would want to look at the extremes in this example.

Just as the mean is not a complete description of a single variable, linear regression is not a complete description of relationships among variables.

Linear Regression Is Sensitive to Outliers

Outliers are data that are surprising. Outliers can be univariate (based on one variable) or multivariate. If you are looking at age and income, univariate outliers would be things like a person who is 118 years old, or one who made Rs. 12 million last year. A multivariate outlier would be an 18-year-old who made Rs. 200,000. In this case, neither the age nor the income is very extreme, but very few 18-year-old people make that much money. Outliers can have huge effects on the regression.

Data Must Be Independent

Linear regression assumes that the data are independent. That means that the scores of one subject (such as a person) have nothing to do with those of another. This is often, but not always, sensible. Two common cases where it does not make sense are clustering in space and time.

A classic example of clustering in space is student test scores, when you have students from various classes, grades, schools and school districts. Students in the same class tend to be similar in many ways, i.e., they often come from the same neighborhoods, they have the same teachers, etc. Thus, they are not independent.

Examples of clustering in time are any studies where you measure the same subjects multiple times. For example, in a study of diet and weight, you might measure each person multiple times. These data are not independent because what a person weighs on one occasion is related to what he or she weighs on other occasions.

Common Problems with Linear Regression

1. Non-linearity of the response-predictor relationships : If the true relationship between the response and predictors is far from linear, then virtually all conclusions that can be drawn from the model are suspect and prediction accuracy can be significantly reduced. Residual plots are a useful graphical tool for identifying non-linearity.

2. Correlation of error terms : An important assumption of linear regression is that the error terms, ϵ1,ϵ2,…,ϵn, are uncorrelated. Correlated error terms can make a model appear to be stronger than it really is.

3. Non-constant variance of error terms : Linear regression also assumes that the error terms have a constant variance. Standard errors, confidence intervals, and hypothesis testing all depend on this assumption. One way to address this problem is to transform the response Y.

4. Outliers : An outlier is a point for which the actual value is far from the value predicted by the model. Excluding outliers can result in improved residual standard error (RSE) and improved R² values, usually with negligible impact on the least squares fit. However, outliers should be removed with caution, as they may indicate a missing predictor or other deficiency in the model.

5. High-leverage points : Observations with high leverage are those that have an unusual value for the predictor for the given response. High leverage observations tend to have a sizable impact on the estimated regression line and as a result, removing them can yield improvements in model fit.

6. Collinearity : Collinearity refers to the situation in which two or more predictor variables are closely related to one another. It can pose problems for linear regression because it can make it hard to determine the individual impact of collinear predictors on the response. A way to detect collinearity is to generate a correlation matrix of the predictors.

Multicollinearity is the collinearity which exists between three or more variables even if no pair of variables have high correlation. Multicollinearity can be detected by computing the variance inflation factor (VIF).

One way to handle collinearity is to drop one of the problematic variables. Another way of handling collinearity is to combine the collinear predictors together into a single predictor by some kind of transformation such as an average.
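As a rough illustration of detecting collinearity, the sketch below computes the Pearson correlation between two predictors and the resulting variance inflation factor. It relies on the fact that, with exactly two predictors, the VIF reduces to 1 / (1 - r²); with more predictors, each VIF would need a separate regression of one predictor on all the others. The predictor names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictors that move almost in lockstep
store_size = [10, 20, 30, 40, 50]
shelf_count = [21, 39, 62, 80, 99]

r = pearson(store_size, shelf_count)

# With only two predictors, VIF = 1 / (1 - r^2)
vif = 1 / (1 - r ** 2)

print(f"correlation = {r:.4f}")
print(f"VIF         = {vif:.1f}")
```

A correlation this close to 1 means the two predictors carry nearly the same information, so dropping one or combining them, as described above, would be reasonable options.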


We have now explored the various aspects of Machine Learning and Regression, as well as the various terminologies associated with them. We have also seen the benefits and pitfalls of Linear Regression.


True regression functions are never linear. Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

Linear Regression:

· Simplest regression algorithm

· Very fast to train and to predict with

· Good at numerical data with lots of features

· Output is a value from a continuous numerical range

· Linear hypothesis

· Uses Gradient Descent


“Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.”

- Atul Butte, Stanford University


