ML Core Concepts: Linear Regression

Emily Strong
The Data Nerd
Feb 5, 2022 · 5 min read

[Figure: a linear regression plot]

Linear regression is one of the simplest machine learning algorithms, and a good place to start for understanding some of the core concepts of the field. The algorithm has limitations that often require modifications for real-world situations (covered in the next few posts), but its simplicity makes it easy to interpret, and it is immediately recognizable to anyone who remembers y = mx + b from high school algebra.

What Problem Does It Solve?

A linear regression models the relationship between independent variables and the target value you want to predict, the dependent variable. It can have a single independent variable (known as simple linear regression) or many independent variables, each with its own relationship to the dependent variable (known as multiple linear regression).

This is useful for problems where there is a linear relationship between the variables. An example of such a situation is a home’s sale price. There is typically a price per square foot that is determined by the location, and then other features of the home, such as the number of bathrooms or the need for major repairs, may add or subtract value.

The Math

The mathematical model of linear regression is:

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

In this model, Y is the dependent variable; β0 is the intercept, or the value of Y when all X values are zero; β1 through βp are the unique coefficients for each independent variable; and ε is an error term.

There are several ways to arrive at these coefficients and intercept given a set of data to train on, but the most common is ordinary least squares (OLS) estimation. OLS estimation minimizes the sum of squared residuals, the differences between the estimated and observed values. Calculating the coefficients for the model is as simple as beta = (X^T * X)^(-1) * X^T * y. In code, this is written as:

import numpy as np

# OLS closed-form solution: beta = (X^T X)^(-1) X^T y
beta = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))

If you are not familiar with this notation, X is the matrix of independent variables, y is the vector of the dependent variable, and beta is the vector of coefficients. To get a predicted value, you simply take the dot product of features and their coefficients. To calculate an intercept with this method, add a column to your matrix of independent variables with the constant value 1.
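
To make that concrete, here is a minimal end-to-end sketch. The feature values, target values, and the name X_design are all made up for illustration; the approach is simply the closed-form OLS solution applied after adding a column of ones for the intercept.

import numpy as np

# Hypothetical training data: five homes, two features (square footage, bathrooms)
X = np.array([[1500, 2], [2000, 3], [1200, 1], [1800, 2], [2500, 4]], dtype=float)
y = np.array([300000, 420000, 250000, 360000, 520000], dtype=float)

# Add a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# OLS closed-form solution: beta = (X^T X)^(-1) X^T y
beta = np.dot(np.linalg.inv(np.dot(X_design.T, X_design)), np.dot(X_design.T, y))

# Predictions are the dot product of features and coefficients
y_pred = np.dot(X_design, beta)

print(beta)    # [intercept, coefficient for square footage, coefficient for bathrooms]
print(y_pred)  # fitted sale prices for the training homes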

Interpretation

One of the greatest strengths of linear regression is its interpretability. In this algorithm, every beta coefficient represents how much the target variable increases or decreases when the corresponding independent variable increases by one unit if all of the other variables are held constant. Going back to our home sale price example, if the price per square foot is $2 and two homes are identical in the features considered for calculating the price except one is 100 square feet larger, we would expect that house to sell for $200 more.
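
If you would rather not implement the math yourself, scikit-learn's LinearRegression exposes the same quantities directly. The snippet below is a sketch using the same hypothetical home data as above; it simply reads the fitted coefficients and intercept off the model.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical home data: [square footage, number of bathrooms]
X = np.array([[1500, 2], [2000, 3], [1200, 1], [1800, 2], [2500, 4]], dtype=float)
y = np.array([300000, 420000, 250000, 360000, 520000], dtype=float)

model = LinearRegression()  # fits an intercept by default
model.fit(X, y)

# Each coefficient is the expected change in sale price for a one-unit increase
# in the corresponding feature, holding the other features constant
print(model.intercept_)
print(model.coef_)  # [change per additional square foot, change per additional bathroom]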

Assumptions and Limitations

Linear regression is what is known as a biased model: it makes strong assumptions about the data, and when those assumptions do not hold, its predictions suffer. The assumptions it makes are:

  1. The dependent variable is a linear combination of the independent variables.
  2. The errors of the model are normally distributed. (This is often stated as multivariate normality, meaning that any linear combination of the variables is normally distributed as well.)
  3. None of the independent variables are strongly correlated with each other. High correlation between two or more of the independent variables is known as multicollinearity.
  4. The errors in predictions are independent of each other. When the residuals are not independent, it is known as autocorrelation.
  5. Homoscedasticity (constant variance): the variance of the errors is constant rather than dependent on the values of the independent variables.

Many real-world data sets do not meet all of these assumptions.
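
As a rough sketch of how you might probe two of these assumptions with standard tools (reusing the hypothetical data from the earlier snippets), pairwise correlations among the features hint at multicollinearity, and a residuals-versus-predictions plot is a common visual check for homoscedasticity.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data reused from the earlier snippets
X = np.array([[1500, 2], [2000, 3], [1200, 1], [1800, 2], [2500, 4]], dtype=float)
y = np.array([300000, 420000, 250000, 360000, 520000], dtype=float)

# Multicollinearity check: pairwise correlations between the features
print(np.corrcoef(X, rowvar=False))  # off-diagonal values near +/-1 are a warning sign

# Homoscedasticity check: residuals should show no pattern against predictions
X_design = np.column_stack([np.ones(len(X)), X])
beta = np.dot(np.linalg.inv(np.dot(X_design.T, X_design)), np.dot(X_design.T, y))
residuals = y - np.dot(X_design, beta)

plt.scatter(np.dot(X_design, beta), residuals)
plt.axhline(0, color="gray")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()  # a funnel shape suggests non-constant variance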

On top of these assumptions, linear regression has a practical limitation on the number of independent variables that can be used. In particular, linear regression is not appropriate if there are more columns of features than rows of training data; in that case the matrix X^T X in the OLS solution cannot be inverted, and the coefficients cannot be uniquely estimated. Each feature adds a source of error to the model, and the more features there are, the noisier the estimates become. We express this as degrees of freedom (which should sound familiar from introductory statistics, where it is used in determining statistical significance); when calculating model error in linear regression, the degrees of freedom are the number of training rows minus the number of features. As the number of features increases, the complexity of the model increases and there is a greater risk of overfitting. That is, the model will be able to accurately predict for the training data but will not generalize well.

What can you do to avoid this? If possible, obtain more training data. The more data a model is trained on, the better it will generalize. If that isn’t possible (or in addition to more data), you can use regularization, which is a feature of other linear algorithms such as Lasso and Ridge Regression. These are the subjects of future posts, so I will go into the concept of regularization in more depth then; for now, know that regularization helps reduce the chances of coefficients taking extreme values.
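
As a quick taste of what that looks like in practice (a sketch, not the full treatment the future posts will give), scikit-learn's Ridge estimator applies an L2 penalty whose strength is set by its alpha parameter. The data below is hypothetical.

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data reused from the earlier snippets
X = np.array([[1500, 2], [2000, 3], [1200, 1], [1800, 2], [2500, 4]], dtype=float)
y = np.array([300000, 420000, 250000, 360000, 520000], dtype=float)

# alpha controls the strength of the L2 penalty; larger values shrink the coefficients more
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

print(ridge.coef_)       # coefficients are pulled toward zero relative to plain OLS
print(ridge.intercept_)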

Finally, linear regression can be strongly influenced by outliers. Clipping outliers during data cleansing can address this, as can regularization. It is also good practice to scale your data, particularly when using regularization, so that variables with larger magnitudes of values do not have a disproportionate influence on the model.
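
As a minimal sketch of those two preprocessing steps, with made-up values and arbitrarily chosen percentile thresholds, numpy's clip can cap extreme values and scikit-learn's StandardScaler can put the features on a common scale:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw feature matrix with one exaggerated outlier in square footage
X = np.array([[1500, 2], [2000, 3], [1200, 1], [9000, 4], [1800, 2]], dtype=float)

# Clip each column to chosen percentiles to limit the influence of outliers
lower, upper = np.percentile(X, [5, 95], axis=0)
X_clipped = np.clip(X, lower, upper)

# Standardize so each feature has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X_clipped)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column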

Related Algorithms

There are quite a few linear algorithms out there, including the previously mentioned Ridge Regression and Lasso, Elastic Net, and Generalized Linear Models. The Scikit-Learn User Guide on Linear Models provides an overview of these and other linear algorithms.

Further Resources

Penn State: STAT 501 Regression Methods Course Notes

Machine Learning Mastery: A Gentle Introduction to Degrees of Freedom in Machine Learning

Linear regression and other key concepts for linear models in real-world settings are covered in my Machine Learning Flashcards: Linear Models deck. Check it out on Etsy!


Emily Strong
The Data Nerd

Emily Strong is a senior data scientist and data science writer, and the creator of the MABWiser open-source bandit library. https://linktr.ee/thedatanerd