Linear Regression

Barliman Butterbur
2 min read · Aug 10, 2023


Linear regression is one of the most important machine learning algorithms. It assumes a linear relationship between the input features and the output.

Figure: linear regression equation
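
As a minimal sketch, assuming the usual form y = b0 + b1*x1 + ... + bn*xn, the one-feature case can be fitted by hand with ordinary least squares. The helper name `fit_line` and the toy data are made up for illustration; scikit-learn's LinearRegression solves the same least-squares problem for any number of features.

```python
# Ordinary least squares for a single feature: y = intercept + slope * x.
# fit_line and the data below are illustrative, not part of scikit-learn.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 1 + 2x, no noise
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Because the toy data lies exactly on a line, the fit recovers the intercept 1 and the slope 2 that generated it.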

Despite being one of the fundamental ML algorithms, it is not easy to find a short, to-the-point example of how to use it. Since it is a beginner's topic, a lot of articles spend their time explaining what input data is, the correlation between various features, graphs, etc. We will start with a basic example first and expand upon that.

Here is an example of linear regression using the California housing dataset.

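from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the dataset (cached under data/) as feature matrix X and target y
X, y = fetch_california_housing(data_home='data/', return_X_y=True)
# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
lr = LinearRegression()
lr.fit(X=X_train, y=y_train)
# score() returns the R^2 coefficient of determination on the test set
print(lr.score(X_test, y_test))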

Since we are using a fixed random state (and assuming there are no changes to the dataset), you should see the following score:

0.5757877060324524
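
The number returned by `score` is the R² coefficient of determination: 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets. A minimal sketch of the calculation, with a hypothetical helper `r2_manual` and made-up values rather than the housing data:

```python
# R^2 = 1 - (residual sum of squares) / (total sum of squares).
# r2_manual and the numbers below are illustrative only.
def r2_manual(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2 = r2_manual(y_true, y_pred)
print(r2)

# Always predicting the mean of y_true gives a score of exactly 0
baseline = r2_manual(y_true, [6.0] * len(y_true))
print(baseline)
```

Read this way, a score of about 0.58 says the linear model explains a bit more than half of the variance in the test-set house values.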

Now let us add k-fold cross-validation and see the score across 10 different folds. Again, you should see the same results since we are using a fixed random state.
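
Conceptually, k-fold splitting shuffles the row indices once, cuts them into k roughly equal folds, and lets each fold serve as the test set exactly once. The sketch below illustrates that idea with a hypothetical helper `k_fold_indices`. It is not scikit-learn's implementation and uses a different random number generator, so it will not produce the same indices as KFold.

```python
import random


# Illustrative k-fold splitter; k_fold_indices is a made-up helper name.
def k_fold_indices(n_samples, n_splits, seed):
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=23, n_splits=10, seed=42))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, so each row is used for evaluation once and for training in the other nine rounds. The scikit-learn version follows.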

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = fetch_california_housing(data_home='data/', return_X_y=True)
# 10 shuffled folds; the fixed seed keeps the folds identical between runs
kf = KFold(n_splits=10, random_state=42, shuffle=True)
lr = LinearRegression()

for train_index, test_index in kf.split(X):
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    # Refit on this fold's training rows, then print R^2 on its held-out rows
    lr.fit(X=X_train, y=y_train)
    print(lr.score(X_test, y_test))

0.5808353312067788
0.5701791688326285
0.6343601995053072
0.5945333015201937
0.6156131001308487
0.6026863788135097
0.5907293729511975
0.6398831518451857
0.5778060261606383
0.5941261976876657
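
Ten separate numbers are hard to compare at a glance, so fold scores are commonly summarized as a mean and a spread. A small sketch using only the standard library, with the scores copied from the output above:

```python
import statistics

# Per-fold R^2 scores copied from the k-fold run above
scores = [
    0.5808353312067788, 0.5701791688326285, 0.6343601995053072,
    0.5945333015201937, 0.6156131001308487, 0.6026863788135097,
    0.5907293729511975, 0.6398831518451857, 0.5778060261606383,
    0.5941261976876657,
]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample standard deviation
print(f"R^2: {mean_score:.4f} +/- {std_score:.4f}")
```

The mean comes out at about 0.60, slightly above the single train/test split, which shows how much one score can depend on which rows end up in the test set.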

We can also plot the predicted vs. actual test output to quickly get some idea of the accuracy.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import pandas as pd
import matplotlib.pyplot as plt

X, y = fetch_california_housing(data_home='data/', return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
lr = LinearRegression()
lr.fit(X=X_train, y=y_train)
print(lr.score(X_test, y_test))

# Put predictions and true values side by side, one column each
y_predict = lr.predict(X_test)
df = pd.DataFrame({'Predicted': y_predict, 'Actual': y_test})
# Plot only the first 200 test points so the two lines stay readable
plt.plot(df[:200])
plt.legend(['Predicted', 'Actual'])
plt.show()

Predicted vs. actual for the first 200 data points
