What Are Linear AI Models?
Linear regression explained, including a scikit-learn (sklearn) implementation.
Linear Models
Linear models are parametric models suited to problems where the relationship between the variables is approximately linear.
Table of Contents
· Linear Models
∘ Table of Contents
∘ Fitting a Straight Line
∘ Linear Regression
∘ Loss Function and Squared-Error Loss
∘ Predicting House Prices Using Linear Regression
∘ Solution
· Applicable Use Cases for Linear Regression
∘ Advantages
∘ Disadvantages
∘ Practical Examples
· Coding
Fitting a Straight Line
The simplest form of a linear model fits a straight line to a set of observations. Imagine looking at houses in an area and comparing their prices to their sizes. We plot size (in square feet) against price in the graph below.
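As a minimal sketch, the observations (the same five data points used in the calculations later in this article) can be plotted with Matplotlib:

import matplotlib.pyplot as plt

# The five observations used throughout this article: size (sq ft) vs. price ($)
sizes = [1500, 1800, 2000, 2200, 2500]
prices = [300_000, 350_000, 400_000, 420_000, 450_000]

plt.scatter(sizes, prices, color="blue", label="Observations")
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.title("House Prices Based on Size")
plt.legend()
plt.show()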
What would a reasonable price be for a 1600 sq ft home? There is no data for this. However, if we can draw a straight line through all the points, we can make a reasonable estimate.
Linear Regression
To accomplish this, we use linear regression. Remember that a straight line is described by the equation y = kx + m, where k is the slope and m is the y-intercept. To fit a line to the observations, we have to find the slope and y-intercept parameters that minimize the loss.
Loss Function and Squared-Error Loss
To do so, we will use the squared-error loss function (L2), which measures the distance from the predictions to the actual data points. The loss function is defined as:
L(k, m) = Σ(yi - (k*xi + m))^2
The squared-error loss function is convex, meaning there is a unique solution that minimizes the error. By taking the derivatives of the loss function with respect to k and m and setting them to zero, we find the unique equations for the parameters that minimize the loss:
k = (N(Σxiyi) - (Σxi)(Σyi)) / (N(Σxi^2) - (Σxi)^2)
m = (Σyi - k(Σxi)) / N
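As a sketch of how these formulas translate into code (plain Python, with illustrative function names):

def squared_error_loss(k, m, xs, ys):
    # L(k, m): sum of squared distances between predictions and observations
    return sum((y - (k * x + m)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    # Closed-form least-squares solution for slope k and y-intercept m
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    k = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    m = (sum_y - k * sum_x) / n
    return k, m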
Predicting House Prices Using Linear Regression
Now, we return to the previous problem of predicting house prices. To solve it, we begin by calculating the slope and y-intercept parameters. Since the m-parameter depends on k, we start with k, given that we have N = 5 observations:
N(Σxiyi) = 5(1500*300,000 + 1800*350,000 + 2000*400,000 + 2200*420,000 + 2500*450,000)
N(Σxiyi) = 19645000000
Σxi = 1500 + 1800 + 2000 + 2200 + 2500
Σxi = 10000
Σyi = 300,000 + 350,000 + 400,000 + 420,000 + 450,000
Σyi = 1920000
Σxi^2 = 1500^2 + 1800^2 + 2000^2 + 2200^2 + 2500^2
Σxi^2 = 20580000
k = (N(Σxiyi) - (Σxi)(Σyi)) / (N(Σxi^2) - (Σxi)^2)
k = (19645000000 - 10000*1920000) / (5*20580000 - 10000^2)
k = 153.448276
Continuing by calculating m.
m = (Σyi - k(Σxi)) / N
m = (1920000 - 153.448276*10000) / 5
m = 77103.448
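To double-check the arithmetic, we can run the fit_line sketch from above on the same five observations:

sizes = [1500, 1800, 2000, 2200, 2500]
prices = [300_000, 350_000, 400_000, 420_000, 450_000]

k, m = fit_line(sizes, prices)  # fit_line as sketched earlier
print(f"k = {k:.6f}, m = {m:.3f}")  # k = 153.448276, m = 77103.448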
Solution
Using linear regression, we have found the parameters for the slope and y-intercept that minimize the error. We have found that the equation: y = 153.448276x + 77103.448
fits our dataset with the least distance (error) to the observations. We continue by plotting this line.
Finally, let’s answer our question: What would a reasonable price be for a 1600-square-foot home?
y = 153.448276*1600 + 77103.448
y = 322,620.69
Answer: A 1600-square-foot home would cost approximately $322,620.
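Continuing the sketch from above, the same prediction in code:

print(f"Estimated price: ${k * 1600 + m:,.2f}")  # Estimated price: $322,620.69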
Applicable Use Cases for Linear Regression
Linear regression is just one tool in the enormous toolbox of artificial intelligence and machine learning algorithms. This section will list the advantages and disadvantages of using linear regression. Later, we will give a few practical examples.
Advantages
- Works well when the relationship between variables is approximately linear.
- Simple to implement, interpret, and train.
Disadvantages
- Does not work for non-linear data.
- Sensitive to outliers (see the sketch after this list).
- Sensitive to multicollinearity among features.
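To make the outlier point concrete, here is a minimal sketch with made-up numbers, using NumPy's polyfit (degree 1 is an ordinary least-squares line fit):

import numpy as np

# Hypothetical data: y = 2x exactly, then the same data with one extreme point
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_outlier = y_clean.copy()
y_outlier[-1] = 40.0  # a single outlier

k_clean, m_clean = np.polyfit(x, y_clean, deg=1)
k_out, m_out = np.polyfit(x, y_outlier, deg=1)
print(k_clean, m_clean)  # ≈ 2.0 and ≈ 0.0: recovers the true line
print(k_out, m_out)      # ≈ 8.0 and ≈ -12.0: one point drags the whole fit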
Practical Examples
- Predicting Housing Prices: In this article, we used this example, which works well since we have linearly separable data and few outliers. Linear regression is appropriate because it is easy to implement, interpret, and train.
- Predicting Customer Behaviour: Imagine a scenario where we have access to massive amounts of customer data and want to predict purchasing patterns. In this case, there is a high chance that some features are highly correlated (multicollinearity), and there may be no linear relationship between the variables. Linear regression should not be used here, since it is sensitive to multicollinearity and presumes a linear relationship between the variables.
Coding
Calculating the linear regression lines by hand can become tedious, especially if you have a larger parameter space or more observations. Therefore, it is essential to use tools to speed up and ease the process.
Imagine the same problem, but now we have 1000 observations. The problem is visualized in the left graph below. Doing the calculations by hand would take a long time. However, with code, we can solve it very quickly.
Using Pandas and sklearn, we can load the data and perform linear regression. We begin by loading and inspecting the data using Pandas.
import pandas as pd
data = pd.read_csv("data/houses.csv")
print(data)
Size Price
0 3674 701480.235685
1 1360 278590.638446
2 1794 358791.708572
3 1630 303157.220416
4 1595 264450.702912
.. ... ...
995 3677 622883.786597
We continue by reshaping the data and running linear regression.
from sklearn.linear_model import LinearRegression

# sklearn expects a 2-D feature matrix, so reshape the sizes into an (N, 1) column
sizes = data["Size"].values.reshape(-1, 1)
prices = data["Price"]

model = LinearRegression()
model.fit(sizes, prices)

k = model.coef_[0]  # slope
m = model.intercept_  # y-intercept
print(f"The slope {k=}, the y-intercept {m=}")
The slope k=153.7676635907488, the y-intercept m=79758.3074173322
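Instead of plugging the parameters in by hand, we can also ask the fitted model directly; note that predict, like fit, expects a 2-D input:

import numpy as np

size = np.array([[1600]])  # shape: (1 sample, 1 feature)
print(model.predict(size)[0])  # ≈ 325786.57 with the parameters fitted above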
We finally plot the linear regression line using Matplotlib.
import matplotlib.pyplot as plt
import numpy as np

# Two x-values spanning the data are enough to draw a straight line
x = np.array([500, 4000])
y = k * x + m

plt.figure(figsize=(10, 6))
plt.scatter(sizes, prices, color="blue", label="Data Points")
plt.plot(x, y, color="red", linestyle="--", label=f"Regression Line: y = {k:.2f}x + {m:.2f}")
plt.title("House Prices Based on Size")
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.legend()
plt.grid(True)
plt.show()
Further Reading
If you want to learn more about programming and, specifically, machine learning, see the following Coursera course:
Note: If you use my links to order, I’ll get a small kickback.