Linear Regression: A Data Science Algorithm Every Data Scientist Should Know

Srishti Sawla
7 min read · Apr 8, 2019


Hi folks, I work as a Data Scientist at Bharti AXA General Insurance, and while interviewing candidates for Data Science positions, the first thing I assess is the candidate’s knowledge of the basic Data Science algorithms. One should know the basic algorithms in depth before moving on to the advanced ones.

Today’s blog post is all about Linear Regression, the most common Data Science algorithm every data scientist should know.


Introduction to Linear Regression

Linear Regression is a linear model that assumes a linear relationship between the input variables (independent variables, ‘x’) and the output variable (dependent variable, ‘y’), such that ‘y’ can be calculated from a linear combination of the input variables. With a single input variable, the method is referred to as Simple Linear Regression; with multiple input variables it is referred to as Multiple Linear Regression.

Linear Regression Model Representation

In a Simple Linear Regression model with a single x and y, the form of the model is:

y = b0 + b1 * x

where b0 is the intercept (the value of y when x is zero) and b1 is the slope, i.e. the coefficient of the input variable.

In higher dimensions, when we have more than one input variable, the line is replaced by a plane or hyperplane.
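
As a small illustrative sketch (the data and variable names below are invented for this example), fitting such a model with scikit-learn returns the intercept b0 and one coefficient per input variable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two input variables (x1, x2) and one output variable y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
y = np.array([5.1, 5.9, 11.2, 12.1, 16.0])

model = LinearRegression()
model.fit(X, y)

print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)
print("Prediction for [6, 6]:", model.predict([[6, 6]]))
```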


Assumptions of Linear Regression

Source: http://people.duke.edu/~rnau/testing.htm


There are four principal assumptions which justify the use of linear regression models for purposes of inference or prediction:

(i) linearity and additivity of the relationship between dependent and independent variables:

(a) The expected value of the dependent variable is a straight-line function of each independent variable, holding the others fixed.

(b) The slope of that line does not depend on the values of the other variables.

(c) The effects of different independent variables on the expected value of the dependent variable are additive.

(ii) statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)

(iii) homoscedasticity (constant variance) of the errors

(a) versus time (in the case of time series data)

(b) versus the predictions

(c) versus any independent variable

(iv) normality of the error distribution.

If any of these assumptions is violated (i.e., if there are nonlinear relationships between dependent and independent variables or the errors exhibit correlation, heteroscedasticity, or non-normality), then the forecasts, confidence intervals, and scientific insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.

Violations of linearity or additivity are extremely serious: if you fit a linear model to data which are nonlinearly or non-additively related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

Violations of independence are potentially very serious in time series regression models: serial correlation in the errors (i.e., correlation between consecutive errors or errors separated by some other number of periods) means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly mis-specified model. Serial correlation (also known as “autocorrelation”) is sometimes a byproduct of a violation of the linearity assumption, as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time.

Independence can also be violated in non-time-series models if errors tend to always have the same sign under particular conditions, i.e., if the model systematically underpredicts or overpredicts what will happen when the independent variables have a particular configuration.

Violations of homoscedasticity (which are called “heteroscedasticity”) make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

Violations of normality create problems for determining whether model coefficients are significantly different from zero and for calculating confidence intervals for forecasts. Sometimes the error distribution is “skewed” by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumptions of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
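
As a hedged sketch of how these assumptions are commonly checked in practice (the data and variable names below are invented, and the specific tests are just two common choices, not the only ones), one can fit a model and inspect its residuals, for example with statsmodels and scipy:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

# Toy data: one input variable and a noisy linear response
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)

# Fit OLS and pull out the residuals
X = sm.add_constant(x)          # adds the intercept column
results = sm.OLS(y, X).fit()
residuals = results.resid

# (ii) independence: a Durbin-Watson statistic near 2 suggests no serial correlation
print("Durbin-Watson:", durbin_watson(residuals))

# (iv) normality: Shapiro-Wilk test (a small p-value suggests non-normal errors)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# (i), (iii): a residuals-vs-fitted plot should show no pattern and constant spread
# import matplotlib.pyplot as plt
# plt.scatter(results.fittedvalues, residuals); plt.axhline(0); plt.show()
```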

Techniques to build a Linear Regression model

The most common techniques for building a Linear Regression model are:

1. Ordinary Least Squares

2. Gradient Descent

3. Regularization

Ordinary Least Squares Method

The Ordinary Least Squares method is the standard way to estimate the coefficients of a linear regression model, simple or multiple. OLS corresponds to minimizing the sum of squared differences between the observed and predicted values.

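As a minimal sketch (toy data and variable names made up for illustration), the OLS coefficients can be computed in closed form from the normal equation, b = (XᵀX)⁻¹Xᵀy, here solved with NumPy’s least-squares routine:

```python
import numpy as np

# Toy data: single input variable x and output y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x), x])

# np.linalg.lstsq solves the least-squares problem (same solution as the normal equation)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = b

predictions = X @ b
sse = np.sum((y - predictions) ** 2)   # the sum of squared errors being minimized
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}, SSE = {sse:.3f}")
```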

Gradient Descent

When there are one or more inputs, you can optimize the values of the coefficients by iteratively minimizing the error of the model on your training data.

This operation is called Gradient Descent and works by starting with random values for each coefficient. The sum of the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor, and the coefficients are updated in the direction that minimizes the error. The process is repeated until a minimum sum of squared errors is achieved or no further improvement is possible.

When using this method, you must select a learning rate (alpha) parameter that determines the size of the improvement step to take on each iteration of the procedure.

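A rough sketch of this procedure for simple linear regression follows; the data, learning rate and iteration count are arbitrary choices for illustration, not prescriptions:

```python
import numpy as np

# Toy data for y ≈ b0 + b1 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = 0.0, 0.0      # (arbitrary) starting values for the coefficients
alpha = 0.01           # learning rate: size of each update step
n = len(x)

for _ in range(5000):
    predictions = b0 + b1 * x
    errors = predictions - y
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = (2.0 / n) * np.sum(errors)
    grad_b1 = (2.0 / n) * np.sum(errors * x)
    # Update the coefficients in the direction that reduces the error
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```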

Regularization

There are extensions of the training of the linear model called regularization methods. These seek to minimize the sum of the squared error of the model on the training data (as in ordinary least squares) while also reducing the complexity of the model (for example, the number of coefficients or their total absolute size).

Two popular examples of regularization procedures for linear regression are:

  • Lasso Regression: where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients (called L1 regularization).
  • Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared values of the coefficients (called L2 regularization).

These methods are effective when there is collinearity in your input values and ordinary least squares would overfit the training data.
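
A small illustrative sketch with scikit-learn (the toy data and the alpha values are arbitrary; in practice alpha is tuned, e.g. by cross-validation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Toy data with two highly correlated (collinear) input variables
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = x1 + rng.normal(0, 0.01, 50)          # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 1, 50)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(f"{name:12s} coefficients: {model.coef_}")
```

With collinear inputs, plain OLS tends to produce large, unstable coefficients of opposite sign, while the regularized models keep them small (Ridge) or drive one toward zero (Lasso).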

Applications of Linear Regression


Linear Regression is a very powerful statistical technique and can be used to generate insights into consumer behaviour, business performance and the factors influencing profitability. Linear regression can be used in business to evaluate trends and make estimates or forecasts. For example, if a company’s sales have increased steadily every month for the past few years, by running a linear regression on the monthly sales data the company could forecast sales in future months.

Linear regression can also be used to analyze the effectiveness of marketing, pricing and promotions on the sales of a product. For instance, if company XYZ wants to know whether the funds it has invested in marketing a particular brand have given it a substantial return on investment, it can use linear regression. The beauty of linear regression is that it enables us to capture the isolated impact of each marketing campaign while controlling for the other factors that could influence sales. In real-life scenarios, multiple advertising campaigns run during the same time period. Supposing two campaigns are run on TV and radio in parallel, a linear regression can capture the isolated as well as the combined impact of running these ads together.

Linear Regression can also be used to assess risk in the financial services or insurance domain. For example, a car insurance company might conduct a linear regression to come up with a suggested premium table using the predicted claims-to-Insured-Declared-Value ratio. The risk can be assessed based on the attributes of the car, driver information or demographics. The results of such an analysis might guide important business decisions.

In the credit card industry, a financial company may be interested in minimizing the risk in its portfolio and may want to understand the top five factors that cause a customer to default. Based on the results, the company could implement specific EMI options so as to minimize default among risky customers.

I hope you enjoyed reading. Happy learning!
