Foundational Machine Learning: A Practical Guide for Linear Regression

Arushi Aggarwal
5 min read · Jun 11, 2024


In our data-driven world, the ability to extract meaningful insights from complex datasets has become a crucial skill. Linear regression is a fundamental statistical technique that serves as a powerful tool for understanding relationships between variables and making informed predictions and insights. Whether you’re an upcoming data scientist, a business analyst, or an experienced researcher, mastering linear regression can open doors to a wide range of applications, from forecasting sales trends to modeling environmental patterns.

In this article, I will build a comprehensive toolkit that you can reference whenever you are learning or applying linear regression.

We will:

  • Understand linear regression, from its theoretical foundations to its underlying assumptions
  • Explore the different types of linear regression (simple and multiple)
  • Learn how to calculate the line of best fit using the least squares method
  • Implement linear regression models using Python in a way that builds foundational skills
  • Evaluate model performance with metrics such as R-squared and mean squared error

What is linear regression?

Linear regression is a fundamental statistical technique used to model the relationship between a dependent (response) variable and one or more independent (explanatory) variables. The goal of linear regression is to find the best-fitting straight line that describes the relationship between these variables. In the context of Machine Learning, it is a supervised learning model that is used in data analysis because it quantifies a pattern or trend in the data. We can use the line of best fit to predict a response for data points for which we don't have answers.

Underlying Assumptions

There are four main assumptions associated with a linear regression model.

  • Linearity: The relationship between the independent and dependent variables is linear.
  • Independence: The observations in the dataset are independent of each other.
  • Homoscedasticity: The variance of the residuals (errors) is constant across all levels of the independent variable(s).
  • Normality: The residuals follow a normal distribution or, for any value of x, y is normally distributed.
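In practice, these assumptions are usually checked by inspecting the residuals after a model has been fit. Below is a minimal sketch (not part of the original example) of two such checks: a residuals-vs-fitted plot for homoscedasticity and a Shapiro-Wilk test for normality. The X and y arrays here are made-up placeholder data.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

#made-up, roughly linear data purely for illustration
X = np.arange(1, 31).reshape(-1, 1)
y = 3 * X.ravel() + np.random.normal(0, 4, 30)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

#homoscedasticity check: residuals vs fitted values should show no funnel shape
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()

#normality check: a Shapiro-Wilk p-value well above 0.05 is consistent with normal residuals
print(stats.shapiro(residuals))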

Different Types of Linear Regression

There are two main types of linear regression:

  • Simple Linear Regression: In this model, there is a single independent variable and a single dependent variable.
  • Multiple Linear Regression: In this model, there is more than one independent variable and a single dependent variable.

An example of simple linear regression is predicting a student’s score on a test (y) based on the number of hours they studied (x). An example of multiple linear regression is predicting a house’s selling price (y) based on its size (x1), the number of bedrooms (x2), and the age of the house (x3); a quick code sketch of this multiple case is shown below. In this article I will mainly be focusing on simple linear regression.
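To make the multiple-regression case concrete, here is a small hedged sketch of the house-price example. The numbers are invented purely for illustration, and the interface mirrors the simple case demonstrated later in this article.

import numpy as np
from sklearn.linear_model import LinearRegression

#columns: size in square feet (x1), bedrooms (x2), age in years (x3) -- made-up values
X = np.array([
    [1500, 3, 10],
    [2100, 4, 5],
    [1200, 2, 30],
    [1800, 3, 15],
    [2500, 4, 2],
])
y = np.array([300000, 420000, 220000, 340000, 500000])  #made-up selling prices

model = LinearRegression().fit(X, y)
print(model.coef_)                     #one coefficient per feature (x1, x2, x3)
print(model.intercept_)
print(model.predict([[1600, 3, 12]]))  #price estimate for a new house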

How Linear Regression Works

Linear regression involves finding the line that best fits a given set of data points. We achieve this by minimizing the sum of the squared differences between the actual data points and the predicted values on the line. This process is known as the “least squares” method. To find the line of best fit, we need to estimate the values of the slope (m) and the intercept (b) that minimize the sum of squared residuals (errors). Residuals are the vertical distances between the data points and the line of best fit. I will break down this process below.

To start off, the line of best fit takes the form y = mx + b. Since we are attempting to make the sum of the squared errors between each data point and its predicted value as small as possible, the method is called the least-squares method.

In y = mx + b

  • y is the dependent variable
  • m is the slope of the line
  • x is the independent variable
  • b is the intercept
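Before looking at the formulas, it helps to see the quantity that least squares actually minimizes. Below is a tiny sketch, with made-up numbers, that computes the sum of squared residuals for one candidate (m, b) pair; different choices of m and b give different sums, and the line of best fit is the one with the smallest.

import numpy as np

#made-up data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

#a candidate line y = mx + b (values chosen by eye, not optimal)
m, b = 2.0, 0.1
predicted = m * x + b
residuals = y - predicted      #vertical distances from each point to the line
sse = np.sum(residuals ** 2)   #the sum of squared errors that least squares minimizes
print(sse)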

Essentially we use the least-squares method to find y = mx + b through the following formulas and definitions:

Slope (m) Formula: m = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]

Intercept (b) Formula: b = [(∑y) − m(∑x)] / n

In which:

  • n is the number of data points,
  • ∑xy is the sum of the product of each pair of x and y values,
  • ∑x is the sum of all x values,
  • ∑y is the sum of all y values,
  • ∑x² is the sum of the squares of x values.
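These formulas translate directly into a few lines of NumPy. The sketch below reuses the made-up data points from the earlier snippet and computes m and b by hand; the values match what a least-squares fit (for example, scikit-learn's LinearRegression) would recover on the same data.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(x)

#slope and intercept from the least-squares formulas above
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n
print(m, b)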

We can also perform these calculations in code, which I will demonstrate below. We will be using the scikit-learn library, which is commonly used for Machine Learning.

#import the necessary libraries to use datasets and plot the data

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

#define a dataset using numpy arrays and plot those
X = np.array([20, 35, 42, 86, 55, 40, 65, 75, 92, 80, 25, 30])
y = np.array([34, 56, 49, 67, 49, 48, 60, 68, 75, 70, 38, 45]) + np.random.normal(0, 2, 12)

#convert to a 2D array
X = X.reshape(-1, 1)

#create and fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

#plot the data and the regression line
plt.scatter(X[:, 0], y, color='red', label='Data')
plt.plot(X[:, 0], y_pred, color='blue', linewidth=3, label='Linear Regression')
plt.xlabel('Feature 1')
plt.ylabel('Target')
plt.title('Linear Regression Example')
plt.legend()
plt.show()

To break down the code above: after importing the necessary libraries, we create our own dataset using NumPy arrays and add random noise so the points don’t fall perfectly on a line. Next, scikit-learn’s linear regression model requires the feature array X to be two-dimensional, so we reshape the 1D array. We then create an instance of the LinearRegression class from scikit-learn and fit the model to the input data X and target values y. Finally, we use the fitted model to make predictions and plot the data along with the regression line.
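Once the model is fitted, it also exposes the estimated slope and intercept, and predict can be applied to x values that were not in the training data. The sketch below assumes the model fitted above is still in scope; the value x = 50 is just an illustrative choice.

#inspect the fitted line and predict for an unseen x value
print(model.coef_[0])         #estimated slope m
print(model.intercept_)       #estimated intercept b
print(model.predict([[50]]))  #predicted y for a hypothetical x = 50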

Evaluating Model Performance

In Machine Learning, to measure the overall performance of predictive models we usually look at the following values:

  • Mean Absolute Error (MAE): Measures the average magnitude of the errors between the predicted and actual values. It does not heavily penalize large errors.
  • Mean Squared Error (MSE): Measures the average of the squared differences between the predicted and actual values. Because the errors are squared, it penalizes large errors more heavily.
  • R-squared (R²) Score: This is the coefficient of determination and is the proportion of the variation in the dependent variable which is predicted from the independent variable.
  • Root Mean Squared Error (RMSE): Measures the average distance between the predicted and actual values. This is an absolute measure of fit and not a normalized measure.

You can use the information above to interpret your results and use sklearn.metrics to import relevant methods such as:

  • mean_absolute_error()
  • mean_squared_error()
  • r2_score()
  • rmse = np.sqrt(mean_squared_error())
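As a rough sketch, the metrics for the earlier example could be computed like this, assuming y and y_pred from the code above are still in scope:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)            #RMSE is the square root of MSE
r2 = r2_score(y, y_pred)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R²: {r2:.3f}")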

Summary

Whether you’re a complete newcomer, a student seeking practical experience, or a professional looking to enhance your skills, I hope this guide equips you with the tools and knowledge to implement a linear regression model and apply it to whatever predictive analysis you need.

I hope this helps and please comment down below any questions and I will do my best to answer them.


Arushi Aggarwal is currently a junior at Cornell University studying Computer and Information Science