Understanding Linear Regression — A Beginner’s Guide

Bhargav Borah
8 min read · Oct 28, 2023


We're going to unravel linear regression with some behind-the-scenes math!

Linear regression is a fundamental concept in statistical analysis that allows us to understand and predict the relationship between a dependent variable and one or more independent variables. It is widely used in various fields, such as economics, finance, social sciences, and machine learning.

At its core, linear regression aims to find the best-fitting line (or hyperplane) that represents the relationship between the dependent variable and the independent variable(s). This line (or hyperplane) can then be used to make predictions or estimate values based on new data.
The dependent variable in linear regression refers to the outcome we try to predict or explain. On the other hand, independent variables are factors that may influence or impact the dependent variable.
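
As a quick, concrete illustration (a minimal sketch using scikit-learn with made-up numbers, not data from this article), here is a line being fit to a handful of points and used to predict a new value:

```python
# Minimal sketch: fit a line with scikit-learn and predict for new data (toy numbers).
import numpy as np
from sklearn.linear_model import LinearRegression

# One independent variable (e.g., hours studied) and one dependent variable (e.g., exam score).
X = np.array([[1], [2], [3], [4], [5]])   # shape (n_samples, n_features)
y = np.array([52, 58, 65, 70, 78])

model = LinearRegression()
model.fit(X, y)

print("slope (weight):", model.coef_[0])
print("intercept (bias):", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])
```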

Applications of Linear Regression

  • In finance, linear regression is commonly used to analyze stock prices and predict future trends. By examining historical data, analysts can identify patterns and make informed investment decisions.
  • In healthcare, linear regression helps researchers study the impact of different factors on patient outcomes. It enables them to determine the relationship between variables such as age, lifestyle choices, and disease progression.
  • Furthermore, linear regression plays a crucial role in marketing and sales. It allows businesses to analyze customer behavior and predict sales based on factors like advertising expenditure or pricing strategies.

Not limited to these fields alone, linear regression finds applications in various other domains such as economics, social sciences, environmental studies, and more. Its versatility makes it an indispensable tool for data analysis and prediction.

Assumptions of Linear Regression

When we apply linear regression to a problem, we make certain assumptions about the data. If these assumptions do not hold, linear regression yields unreliable predictions. We often preprocess and transform the data so that these assumptions hold, at least approximately.

Linearity

Linear regression assumes that there is a linear relationship between the dependent variable and the independent variable(s). The further our data departs from this assumption, the less reliable the predictions become.

Independence

Another critical assumption is the independence of observations. Each observation should be independent of the others, meaning that there should be no hidden correlation or dependence between the data points. Violation of this assumption can lead to biased and unreliable estimates.
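
One common way to probe independence of the errors (a sketch with statsmodels; the residuals here are randomly generated stand-ins, not from this article) is the Durbin-Watson statistic, which measures autocorrelation in the residuals; values near 2 suggest independence:

```python
# Sketch: Durbin-Watson statistic on residuals; values near 2 indicate little autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, size=100)   # stand-in for residuals from a fitted model

print("Durbin-Watson statistic:", durbin_watson(residuals))
```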

Homoscedasticity

Homoscedasticity refers to the assumption that the variability of the errors (residuals) is constant across all levels of the independent variables. In other words, the spread of the residuals should be uniform as we move along the predictor variables. If the residuals exhibit a pattern, such as a cone or a funnel shape, it may indicate heteroscedasticity, violating the assumption.
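
One way to check this formally (a sketch using statsmodels on a toy dataset; the Breusch-Pagan test is one of several options, not something prescribed by this article) is shown below; a small p-value suggests heteroscedasticity:

```python
# Sketch: Breusch-Pagan test for heteroscedasticity on a toy OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3 * x + 2 + rng.normal(0, 1, size=200)    # constant-variance noise by construction

X = sm.add_constant(x)                        # add the intercept column
ols = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)    # large p-value => no evidence of heteroscedasticity
```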

Normality of Residuals

The next assumption is the normality of the residuals. The residuals should follow a normal distribution with a mean of zero. Deviations from normality can affect the validity of the statistical tests, confidence intervals, and p-values associated with the regression coefficients. Normality can be assessed by plotting a histogram or using formal statistical tests.
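
As one example of such a formal test (a sketch with SciPy; the residuals here are randomly generated stand-ins), the Shapiro-Wilk test can be applied to the residuals:

```python
# Sketch: Shapiro-Wilk normality test; a large p-value gives no evidence against normality.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, size=200)   # stand-in for residuals from a fitted model

stat, p_value = shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)
```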

No Multicollinearity

Linear regression assumes that the independent variables are not highly correlated with each other. High correlation between predictors can lead to multicollinearity, which can make it difficult to separate the individual effects of each predictor. Multicollinearity can also result in unstable and unreliable coefficient estimates.
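
A common diagnostic for this (a sketch with statsmodels; the 5-10 threshold in the comment is a rule of thumb, not a value from this article) is the variance inflation factor (VIF) of each predictor:

```python
# Sketch: variance inflation factors; a VIF above roughly 5-10 hints at multicollinearity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)      # deliberately correlated with x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))   # columns: const, x1, x2, x3
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))
```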

No Outliers

Outliers are observations that deviate significantly from the general pattern of the data. These can have a substantial impact on the regression model, influencing the coefficient estimates and overall fit. Additionally, influential observations, also known as leverage points, can have a disproportionate effect on the regression line. It is essential to identify and scrutinize outliers and influential observations.
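
A simple first pass at flagging outliers (a sketch with SciPy; Cook's distance or IQR-based rules are common alternatives not covered here) is to look for points whose z-score exceeds some threshold, say 3:

```python
# Sketch: flag potential outliers as points more than 3 standard deviations from the mean.
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
data = rng.normal(50, 5, size=200)
data[10] = 120                              # inject an obvious outlier

z = zscore(data)
print("indices flagged as outliers:", np.where(np.abs(z) > 3)[0])
```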

Linear Regression — Behind the Scenes

In this section, I will explain the math behind linear regression. So buckle up, and let's get into this together. For the purposes of visualization, I will use a dataset with only one feature (so that it can be plotted on a 2D graph), but I will make sure the formulas generalize to any number of features.

Equation of a Straight Line

Remember y = mx + c from high school? It is the starting point of the linear regression algorithm. Here, m represents the steepness or slope of the straight line, and c is its y-intercept.

A plot depicting y=mx+c

In machine learning, we write this equation a little differently:

f_w,b(x) = wx + b

Here, f_w,b(x) is called the hypothesis (basically the prediction). It denotes the prediction made by a model with weight w and bias b, where x represents the single feature of the dataset.

In most real-life datasets, hardly any reliable model can be made with a single feature. The hypothesis function used for a model with n features is:

f_w,b(x) = w_1·x_1 + w_2·x_2 + … + w_n·x_n + b

Now, let’s make the notation concise. Consider two vectors, one containing the feature values, and the other containing the corresponding weights.

Here, w = [w_1, w_2, …, w_n] and x = [x_1, x_2, …, x_n]. The hypothesis can then be written compactly as a dot product:

f_w,b(x) = w · x + b
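
In code, this compact form is just a dot product (a minimal sketch with NumPy; the weights and feature values are made up for illustration):

```python
# Sketch: the hypothesis f_w,b(x) = w . x + b as a NumPy dot product.
import numpy as np

w = np.array([0.5, -1.2, 3.0])    # one weight per feature (made-up values)
b = 4.0                           # bias
x = np.array([2.0, 1.0, 0.5])     # a single example with n = 3 features

prediction = np.dot(w, x) + b
print(prediction)                 # 0.5*2.0 - 1.2*1.0 + 3.0*0.5 + 4.0 = 5.3
```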

Fitting a Straight Line to the Data Points

In the plot above, the blue line is a good fit to the data, while the green line fits the data poorly. But which metric do we use to gauge the goodness of a fit? How about the residual, i.e., the difference between the actual value and the predicted value? That sounds reasonable! The better the fit, the smaller the residuals.

Plot depicting residuals for the good fit line
Plot depicting residuals for the bad fit line

In order to account for the residuals of all the data points, we use a cost function called the mean squared error (MSE):

J(w, b) = (1 / 2m) · Σ ( f_w,b(x_i) − y_i )²

Here, m is the number of examples the model is being trained on, (x_i, y_i) is the i-th training example, and the sum runs over i = 1, …, m.

You might be wondering why the mean of the squared residuals is scaled down by a factor of 2. You will see shortly that it makes the math (specifically, the derivatives) a little more elegant. But even if you don't divide by 2, the algorithm will work just fine.

Cost Function

Now let us expand the expression for the MSE by substituting in the hypothesis:

J(w, b) = (1 / 2m) · Σ ( w · x_i + b − y_i )²

We will use J(w, b) to represent the cost function. Our goal is to minimize the value of J. Since J is a function of w and b, we need to determine the values of w and b that minimize the cost function.
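
As code, the cost function might look like this (a minimal sketch with NumPy; compute_cost and the toy numbers are illustrative, not from the article):

```python
# Sketch: mean squared error cost J(w, b) for linear regression, vectorized with NumPy.
import numpy as np

def compute_cost(X, y, w, b):
    """X: (m, n) feature matrix, y: (m,) targets, w: (n,) weights, b: scalar bias."""
    m = X.shape[0]
    residuals = X @ w + b - y          # f_w,b(x_i) - y_i for all m examples at once
    return np.sum(residuals ** 2) / (2 * m)

# Toy usage: a perfect fit gives zero cost.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(X, y, w=np.array([2.0]), b=0.0))   # 0.0
```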

Gradient Descent

Optimization lies at the core of machine learning algorithms. One of the key techniques for optimization is the gradient descent algorithm. At its essence, gradient descent is used to find the values of weights and bias that would minimize the cost function.

Imagine standing on a mountainside, aiming to reach the lowest valley. You would want to take steps downhill to descend effectively. Similarly, in optimization, we seek the lowest point on the cost function's surface. The gradient acts as a compass, indicating the direction of steepest ascent; moving against it is equivalent to descending toward the minimum most effectively.

Before we dive into the math behind gradient descent, let's first have a look at the cost function in the case of linear regression, considering a single-feature dataset (so there is only one weight and one bias). For the MSE, this cost surface is a convex bowl with a single global minimum.

At the beginning, we initialize the weight(s) and bias to random values. As we train the model, the weight(s) and bias are updated so that they approach the minimum. For the single-feature case, each gradient descent step updates them simultaneously as:

w := w − α · ∂J(w, b)/∂w
b := b − α · ∂J(w, b)/∂b

For models with more than one feature, the above equations generalize to:

w_j := w_j − α · ∂J(w, b)/∂w_j   (for j = 1, …, n)
b := b − α · ∂J(w, b)/∂b

α denotes the learning rate of the algorithm. Its optimal value varies from problem to problem. If it is too low, training proceeds very slowly. If it is too high, the weights and bias overshoot the minimum and never converge. Arriving at the 'right' learning rate can take a fair amount of trial and error.

Evaluating the partial derivatives for the MSE cost, we get:

∂J(w, b)/∂w_j = (1 / m) · Σ ( f_w,b(x_i) − y_i ) · x_i,j
∂J(w, b)/∂b = (1 / m) · Σ ( f_w,b(x_i) − y_i )

where x_i,j is the j-th feature of the i-th example. (Notice how the factor of 2 from differentiating the square cancels the 1/2 in the cost function; this is the elegance mentioned earlier.)

The process of updating the weights and bias is repeated until convergence (i.e., until the new and old values differ negligibly). When it is over, we have the final weights and bias, which we use to predict the target variable.
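
Putting the update rule and the gradients together, a bare-bones batch gradient descent might look like this (a sketch; gradient_descent, the zero initialization, and the learning-rate and iteration choices are illustrative, not tuned values):

```python
# Sketch: batch gradient descent for linear regression with the MSE cost.
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=10_000):
    m, n = X.shape
    w = np.zeros(n)                    # weights (could also start at small random values)
    b = 0.0                            # bias
    for _ in range(num_iters):
        residuals = X @ w + b - y      # f_w,b(x_i) - y_i for all m examples
        dw = (X.T @ residuals) / m     # partial derivatives w.r.t. each weight
        db = np.sum(residuals) / m     # partial derivative w.r.t. the bias
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy usage: recover y = 3x + 1 from noiseless data.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1
w, b = gradient_descent(X, y)
print(w, b)                            # approaches [3.0] and 1.0
```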

Now suppose the dataset you are training linear regression on has many features (n) and many data points (m). That means that, while performing gradient descent, we need to sum over all m training examples, for each of the n weights, on every iteration. That would be computationally pretty expensive!

One way to get around this problem is to use mini-batch gradient descent. Here, each iteration computes the gradient on a much smaller, randomly selected subset of the training data. It is computationally efficient and works just fine (albeit a little noisily)!
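
For completeness, here is what the mini-batch variant could look like (a sketch; the batch size of 32 and the fixed random seed are arbitrary illustrative choices):

```python
# Sketch: mini-batch gradient descent - each update uses a random subset of the data.
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, num_iters=5_000, batch_size=32):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    rng = np.random.default_rng(0)
    for _ in range(num_iters):
        idx = rng.choice(m, size=min(batch_size, m), replace=False)  # random mini-batch
        Xb, yb = X[idx], y[idx]
        residuals = Xb @ w + b - yb
        w -= alpha * (Xb.T @ residuals) / len(idx)
        b -= alpha * np.sum(residuals) / len(idx)
    return w, b
```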

Thank you for reading! I hope you learned something new today. May this knowledge guide you on your quest for harnessing the power of data.

Happy learning and model building!

Celebrate and endorse the article by showering it with your appreciation through claps and shares, a gesture that not only boosts my confidence but also ensures the dissemination of knowledge.

Stay tuned for upcoming installments in the series dedicated to unraveling the fundamentals of data science and machine learning.

Also, do connect with me on LinkedIn.
