Demystifying Machine Learning: Simple Linear Regression

Raphael Schols
12 min read · Feb 25, 2024


Source: cottonbro studio

In today's landscape, machine learning techniques have become accessible through drag-and-drop software solutions and pre-built libraries within programming languages.

However, understanding the inner workings remains invaluable. Comprehension of what is happening under the hood empowers a practitioner to better understand the outcomes as well as recognize the limitations of certain techniques.

This article unravels one of the most elementary machine learning algorithms, namely linear regression.

It explores core concepts of supervised machine learning and the distinction between independent and dependent variables. Furthermore, we will delve into the mathematics. Finally, I’ll illustrate a simple example using Python to predict Ethereum’s price through the Bitcoin price.

Linear Regression

Linear regression is a statistical technique and a popular machine-learning algorithm, used for predictive analysis.

It is commonly the introductory technique taught in machine learning courses, given its simplicity. Moreover, it provides a solid foundation for concepts used in many other machine-learning algorithms.

Linear regression is used in various fields for various purposes: in medicine to study the relationship between blood pressure and variables such as weight and sex, in finance to determine the relationship between profitability and factors such as credit risk and liquidity, and in retail to forecast sales from traffic and conversion.

Definition

It is a supervised machine learning model that is useful when your independent and dependent variables have a linear relationship and the dependent variable (i.e. what you wish to predict) is a continuous quantity. It uses the line of best fit to establish the relationship between the two variables.

This might sound like gibberish if you have never been introduced to these concepts, so it is broken down below.

Supervised Machine Learning

Supervised machine learning is one of four general machine-learning paradigms. In supervised machine learning a model learns the relationship between input-output pair examples (labeled data). Upon establishing a relationship, the model can make predictions for unlabeled (unseen) data solely by processing inputs.

In supervised machine learning data is split into two parts: training data (the input-output pairs) and test data (inputs, where outputs are hidden). The training data is used to train a model. The test data is used to measure how well the model performs on unseen data as illustrated below.

Supervised Machine learning — Source: https://www.javatpoint.com/supervised-machine-learning

The other three paradigms, which are out of the scope of this article, are unsupervised machine learning, semi-supervised machine learning, and reinforcement learning.

Dependent and Independent

The inputs and outputs are, respectively, our independent and dependent variables. Both go by many names (be prepared to learn a bunch of synonyms).

Independent variables

The independent variables, a.k.a. regressors, controlled variables, explanatory variables, predictors, or simply X, are the variables that serve as the input for a function. The magnitudes of these values are related to the magnitude of the dependent variable.

Dependent variables

The dependent variable, a.k.a. the regressand, response variable, target variable, predicted variable, or simply Y, is the value we wish to predict and that is subject to (a function of) the independent variables.

Example data

Let’s simplify this with an example. Say we wish to predict the prices of houses. We have a dataset that contains the columns owner name, square meters, and price.

We can quite confidently say that the name of the owner does not relate to the value of a house (unless, perhaps, the name is Rothschild).

Generally speaking, however, the larger a house the more expensive it tends to be. Of course, more factors influence the price of a house but for the sake of simplicity, we will use one variable: square meters.

Since we aim to predict housing prices, price is our dependent variable (Y), because it “depends” on our independent variable X.

Housing data set: Example of first 7 rows. Source: Author

Solely using one independent variable is known as simple linear regression.

Continuous and discrete values

Linear Regression is a useful technique to predict continuous values. Continuous values are values that can be expressed as decimals and can therefore take an infinite number of values between a given interval. Examples here are prices, heights, distances, and volume.

Discrete values are countable, categorical values that take a finite set of possibilities, e.g. 0 or 1, or cat, dog, and bird. For these types of values, other machine learning techniques are more suitable, e.g. logistic regression or decision trees.

Linear relationship

For linear regression to be an effective technique, the independent and dependent variables should have a linear relationship. This means that an increase in our independent variable should result in a proportional increase (or decrease) in our dependent variable.

Let’s visualize the housing data by plotting the independent variable square meters against the dependent variable price on a scatter plot. I have created dummy training data for this example.

We can discern from the figure that each increase in square meters comes with a corresponding increase in price, with the points resembling an almost straight line across the graph.

Scatterplot for house price, Source: Author
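As a side note, a plot like the one above can be produced with a few lines of Python. The square-meter and price values below are made-up dummy numbers for illustration, not the actual data behind the figure.

import matplotlib.pyplot as plt

# Dummy training data: square meters (X) and prices (Y)
sqm = [45, 60, 72, 85, 90, 110, 120, 150]
price = [150000, 210000, 240000, 265000, 300000, 340000, 380000, 450000]

# Scatter plot of the independent variable against the dependent variable
plt.scatter(sqm, price, color='blue')
plt.xlabel('Square meters')
plt.ylabel('Price (euros)')
plt.title('House price vs. square meters')
plt.show()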

The line of best fit

Linear regression aims to find a straight line that minimizes the overall differences between actual values and predicted values. This line represents the relationship between the dependent and independent variables. It does this through the linear equation.

The graph below visualizes this. The blue dots are our observations and the red line is the result of the regression model (our predictions).

The linear equation

Linear regression is a parametric model. This means it assumes a fixed functional form, described by parameters that are estimated from the data.

The linear equation: Y = β₀ + β₁X + ϵ

When we say “training” a model, we are looking for the parameters (β₀, β₁) that transform any input X to an output Y with the minimum amount of error (ϵ ).

  • Y: Represents dependent variables (the housing prices in the example). It represents an array (vector) of observed prices. In other words, it is not a single value.
  • X: Represents our independent variables (square meters in the example). X is also an array of values, not a single value.
  • β₀: The intercept of the line. This value represents where the line begins, i.e. what value Y would be if X were equal to zero.
  • β₁: The slope of the line, i.e. how Y changes when X changes.
  • ϵ: The residual or error. The collective differences between predictions and actual observations.

In the current scenario, we are looking for values β₀ and β₁, such that we can transform any square meter into a price.
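As a minimal sketch of what the equation does once β₀ and β₁ are known, the snippet below turns square meters into predicted prices. The coefficient values are placeholders chosen for illustration, not results derived from the housing data.

import numpy as np

# Placeholder parameters (illustrative values only)
b0 = 120000.0  # intercept: predicted price at 0 square meters
b1 = 1900.0    # slope: price increase per extra square meter

X = np.array([50, 75, 100, 125])  # square meters (independent variable)
Y_pred = b0 + b1 * X              # the linear equation applied to every input
print(Y_pred)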

Ordinary Least Squares (OLS)

To find β₀, β₁ we need to minimize the error. Specifically, we need to minimize the collective error between the predicted and actual values of our training data. The error is represented by the green vertical lines in the figure below.

Linear regression with residuals, Source: Author

One effective technique for achieving this is Ordinary Least Squares, or OLS for short. It does this by minimizing the residual sum of squares (RSS).

The Residual Sum of Squares (RSS)

The RSS is the sum of the squared differences between the observed values and the predicted values: in essence, the sum of the green lines squared. The errors are squared to prevent negative and positive differences from cancelling out. Mathematically, the RSS is denoted as follows:

RSS = ∑ᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²
  • ∑ (Sigma): means the sum
  • i: represents each observation
  • n: represents the last observation, or simply where to stop
  • (yᵢ − (β₀ + β₁xᵢ))²: Represents the squared difference between an observed outcome (yᵢ) and our prediction (β₀ + β₁xᵢ).

The last term is therefore summed (∑) for every observation (i) until we reach (n). If you are familiar with for loops in Python, this is essentially a for loop.
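To make that analogy explicit, here is a small sketch of the RSS written as a Python for loop. The data and the candidate values for β₀ and β₁ are arbitrary examples.

# Observed data (dummy values) and candidate parameters
x = [45, 60, 72, 85]
y = [150000, 210000, 240000, 265000]
b0, b1 = 120000, 1900  # arbitrary candidate values for β₀ and β₁

rss = 0
for i in range(len(x)):              # i runs over every observation up to n
    prediction = b0 + b1 * x[i]      # β₀ + β₁xᵢ
    rss += (y[i] - prediction) ** 2  # squared residual, accumulated like the ∑
print(rss)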

To ensure that β₀ and β₁ have values that result in the lowest possible RSS, we can use the following equations.

Finding β₁

Mathematically, we can find β₁ through the following equation:

β₁ = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)²

While this might appear complex initially, it is quite straightforward. For the numerator, we first calculate the difference between each square-meter value and the average square meters (X − X̄). Next, we calculate the difference between each price and the average price (Y − Ȳ). Then we multiply the two. We do this for each observation and sum the results.

Below is a demonstration of how this process works using the simplified dataset.

Find β₁ getting the numerator, Source Author.

For the denominator, we square the difference between each square-meter value and the average square meters, and then sum the squared differences for all observations.

Find β₁ getting the Denominator, Source Author.

The β₁ in our example would be 396,222,393 / 206,582 = 1,917.99.

Finding β₀

Mathematically, we can find the β₀ that yields the lowest RSS through the following equation:

β₀ = Ȳ − β₁X̄

Since we have found β₁, determining β₀ is simple. We multiply β₁ by the average of X and subtract the result from the average of Y. Using the example, this calculation yields β₀ = 672,376.86 − 1,917.99 × 287 ≈ 122,193.72.
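Both estimates can be written out in a few lines of NumPy, mirroring the two equations above. The arrays below are stand-ins for the square-meter and price columns, not the actual example data.

import numpy as np

# Stand-in data for square meters (x) and prices (y)
x = np.array([45, 60, 72, 85, 90], dtype=float)
y = np.array([150000, 210000, 240000, 265000, 300000], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# β₁ = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)²
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()

# β₀ = ȳ − β₁x̄
b0 = y_mean - b1 * x_mean

print(b0, b1)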

Evaluation metrics

As you may recall, supervised machine learning splits the data into training data and test data. Once the parameters are determined, the training of the model is complete, after which the linear equation is applied to the test data.

Building on the previous example, the linear equation derived from the data is Y = 122,193.72 + 1,917.99 × X. To evaluate the model, our test data is plugged in for X.

Evaluation metrics exist to evaluate how well the model performs.

The Mean Squared Error (MSE)

The mean squared error quantifies the average squared difference between the model’s predictions and the actual outcomes of the test data. Unlike the RSS, the MSE is not used to fit the line but to evaluate it. The calculation of the MSE closely resembles that of the RSS:

MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²

The MSE on its own does not offer a lot of insight. However, it serves to compare the results of different models, where the lowest MSE is considered the best.

The Root Mean Squared Error

The MSE is not very intuitive as it uses squared differences. The Root Mean Squared Error (RMSE) serves the same purpose as the MSE but is more intuitive: it takes the square root of the MSE and is therefore expressed in the units of your target variable.

RMSE = √MSE
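As a sketch, both metrics take only a couple of lines. The arrays below are small dummy stand-ins for the observed test values and the model's predictions.

import numpy as np

y_test = np.array([250000, 310000, 405000])       # observed test values (dummy)
y_test_pred = np.array([240000, 330000, 390000])  # model predictions (dummy)

mse = ((y_test - y_test_pred) ** 2).mean()  # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error, in the target's units
print(mse, rmse)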

R squared (R²)

The R squared, or coefficient of determination, is a statistical measure expressed as a value between 0 and 1. It determines how much of the variation in the dependent variable can be explained by the variation in the predictors. In other words, it quantifies the strength of the relationship.

R² = 1 − RSS / TSS

The TSS is the total sum of squares: the sum of all squared differences between our dependent variables and their mean.

TSS = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²
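Continuing the sketch with the same dummy arrays, R² follows directly from the RSS and the TSS.

rss = ((y_test - y_test_pred) ** 2).sum()    # residual sum of squares
tss = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
r_squared = 1 - rss / tss
print(r_squared)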

Multiple Linear regression

The main focus of this article is simple linear regression, which involves one independent variable. However, linear regression can be extended to include multiple independent variables.

Expanding the analysis with multiple variables that affect the dependent variable adjusts the linear equation as follows.

Linear equation for multiple independent variables: Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ϵ

Visually, it becomes more challenging. Instead of fitting a line, we are fitting a plane. With two independent variables it is still possible to visualize; beyond that, it becomes practically impossible.

In the example below the number of police incidents is added as an independent variable representing the frequency of incidents in the residential neighborhood of each house.

3D Regression with 2 independent variables. Source Author
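In code, the extension is straightforward: scikit-learn's LinearRegression accepts several feature columns at once. The sketch below uses hypothetical column names (sqm, police_incidents, price) and made-up values purely to illustrate the idea.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data with two independent variables
df = pd.DataFrame({
    'sqm':              [45, 60, 72, 85, 90, 110],
    'police_incidents': [12, 8, 15, 3, 5, 2],
    'price':            [150000, 210000, 240000, 265000, 300000, 340000],
})

model = LinearRegression()
model.fit(df[['sqm', 'police_incidents']], df['price'])  # fit a plane instead of a line
print(model.intercept_, model.coef_)                     # β₀ and [β₁, β₂]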

Linear regression in Python

Python provides several libraries for linear regression; one commonly used library is Scikit-Learn.

In this example, we aim to predict the price of cryptocurrency Ethereum. It is well known that cryptocurrencies exhibit strong correlations. Therefore, we will use the Bitcoin price as our independent variable. The data is fetched using the Yahoo Finance Library.

First, we import our required libraries in Python.

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Following, we import our data using the Yahoo Finance Library.

# Get Bitcoin data
BTC_Ticker = yf.Ticker("BTC-EUR")
# Get ETH data
ETH_Ticker = yf.Ticker("ETH-EUR")
# Get the maximum period available for each ticker
BTC_Data = BTC_Ticker.history(period="max").reset_index()
ETH_Data = ETH_Ticker.history(period="max").reset_index()
# Print the first 5 rows of the data set
ETH_Data.head()

The data shows each day’s open, highest, lowest, and closing prices. The closing prices of each day will be used in this example. Thus, the price of Ethereum or Bitcoin at the end of the day.

The code blocks below merge the datasets on the same date, ensuring that we have both the Bitcoin price and Ethereum price recorded for the same day in one dataset.

# Ensure that the Bitcoin data has the same date range as ETH
BTC_min_date = ETH_Data['Date'].min()
BTC_max_date = ETH_Data['Date'].max()
BTC_Data = BTC_Data[
    (BTC_Data['Date'] >= BTC_min_date) &
    (BTC_Data['Date'] <= BTC_max_date)
]
# Merge the data sets on Date
BTC_ETH = BTC_Data.merge(
    ETH_Data,
    on='Date',
    suffixes=('_BTC', '_ETH'))

# Show the first 5 rows of the merged data set
BTC_ETH[['Date', 'Close_ETH', 'Close_BTC']].head()
Merged data, Source: Author

We aim to predict the ETH price making it our dependent variable (y). The independent variable (X) is the BTC price. In the following code block, we will split our data set into training data and test data.

Thereafter, we will fit the model on the BTC and ETH price training data and print β₀ and β₁.

# Apply linear regression to predict the price of ETH using the price of BTC
X = BTC_ETH[['Close_BTC']]
y = BTC_ETH['Close_ETH']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Linear model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions on the test data
y_test_pred = model.predict(X_test)
# Print the first 5 predictions
print(y_test_pred[:5])
# Print the slope (B1) and the intercept (B0)
print("B1: ", model.coef_[0])
print("B0: ", model.intercept_)
The betas, Source: Author

The y_test_pred variable contains the predictions made on the test data. For our linear model, β₁ is ~0.065 and β₀ is ~ -154.60.

In the next code block, we apply our model to the test data and evaluate the results, after which the predictions on the test data are visualized.

# Apply the model to the test data and visualize the results
y_test_pred = model.predict(X_test)

# Print the mean squared error
print("MSE: ", mean_squared_error(y_test, y_test_pred))
# Print the root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_test_pred)))
# Print the R2 score
print("R2: ", model.score(X_test, y_test))
# Plot the data
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_test['Close_BTC'], y=y_test, color='blue', label='Test Data')
sns.lineplot(x=X_test['Close_BTC'], y=y_test_pred, color='red', label='Linear Model (Predictions)')
plt.show()
Evaluation metrics
Linear Regression on test data

The RMSE tells us that the average distance between the model’s predictions and the actual observations is 357 euros. Considering the scale of the ETH price range (roughly 0 to 4,000 euros), the model performs reasonably well. However, keep in mind that in practice one would compare the results of several models to see which yields the lowest error.

The graph shows that the model captures the overall trend. However, it also shows occasional significant deviations between the model and the actual values. This is inherent in cases where there are dispersed points and a linear model is applied.

The limitations of linear regression

A disadvantage of linear regression is poor performance when variables have non-linear relationships; in these cases, a linear model has low predictive power. Furthermore, the technique is unsuitable for predicting categorical values. Additionally, it is sensitive to outliers, which can lead to biased estimates.

Finally, it is crucial to recognize that machine learning models do not analyze causation but rather correlation.

Sources:

Schneider A, Hommel G, Blettner M. Linear regression analysis: part 14 of a series on evaluation of scientific publications. Dtsch Arztebl Int. 2010 Nov;107(44):776–82. doi: 10.3238/arztebl.2010.0776. Epub 2010 Nov 5. PMID: 21116397; PMCID: PMC2992018.

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. https://doi.org/10.1017/9781108679930

Wikipedia contributors. (2024, February 13). Linear regression. Wikipedia. https://en.wikipedia.org/wiki/Linear_regression

Bakar, N. M. A., & Tahir, I. M. (2009). Applying multiple linear regression and neural network to predict bank performance. International Business Research, 2(4). https://doi.org/10.5539/ibr.v2n4p176

Panay, B., Baloian, N., Pino, J. A., Peñafiel, S., Frez, J., Fuenzalida, C., Sanson, H., & Zurita, G. (2021). Forecasting key retail performance indicators using interpretable regression. Sensors, 21(5), 1874. https://doi.org/10.3390/s2
