Implementing Linear Regression on California Housing Dataset

Debarshi Raj Basumatary
4 min read · Jul 17, 2023

In this article, I will walk you through a basic linear regression implementation using Python and scikit-learn.


Make sure you have the required packages:
1. pandas
2. NumPy
3. scikit-learn
4. Matplotlib

We will use the California Housing Data from scikit-learn to predict the median house value.

Step 1: Import all the packages.

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Step 2: Fetching California housing data.

housing = fetch_california_housing(as_frame=True)
housing = housing.frame
housing.head()

Step 3: Visualising the Data.

housing.hist(bins=50, figsize=(12,8))
plt.show()

From the above histograms of the different features, we can conclude that:
1. The features are distributed on very different scales.
2. The values in the HouseAge and MedHouseVal columns are capped, at roughly 52 and 5 respectively.

For better accuracy, we should preprocess those features. We can either perform feature engineering or clean the capped instances, as sketched below.
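For example, here is a minimal sketch of the cleaning option. It assumes the caps sit at each column's maximum value (about 52 for HouseAge and 5 for MedHouseVal, judging from the histograms), which you should verify before dropping anything. The rest of this walkthrough keeps the full dataframe.

# Sketch: drop the capped instances.
# Assumption: the caps sit at each column's maximum value.
capped = (
    (housing['HouseAge'] >= housing['HouseAge'].max()) |
    (housing['MedHouseVal'] >= housing['MedHouseVal'].max())
)
housing_uncapped = housing[~capped]
print(f"Dropped {capped.sum()} of {len(housing)} rows")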

Now we plot the housing value with respect to longitude and latitude, i.e. by location.

housing.plot(kind="scatter", x="Longitude", y="Latitude",
             c="MedHouseVal", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10,7),
             s=housing['Population']/100, label="population", alpha=0.7)
plt.show()

The above plot displays a map of California, with the color map corresponding to house value and the radius of each circle corresponding to the population of the area. Based on this plot, we can conclude that:
1. Houses near the ocean are worth more.
2. Houses in high-population-density areas are also worth more, though the effect weakens further from the ocean.
3. There are some outliers.

Next, we plot a scatter matrix to see how the features correlate with one another.

attributes = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup','MedHouseVal']
scatter_matrix(housing[attributes], figsize=(12,8))
plt.show()

If we check the correlations against MedHouseVal, we can see that all the other features show a somewhat weak correlation, except for MedInc (median income). Let's explore further.

housing.plot(kind="scatter", x="MedInc",y="MedHouseVal")
plt.show()

The above plot shows a strong linear correlation between median income and house value. We can also see that the capped instances form horizontal lines. These can cause problems and should be preprocessed.

The code below displays the numerical correlation of each feature against MedHouseVal.

corr = housing.corr()
corr['MedHouseVal'].sort_values(ascending=True)

As expected, MedInc (median income) shows a strong correlation.

Step 4: Validate Data.

housing.isna().sum()

There are no null values.

housing.dtypes

All the features are floats, and there are no categorical features.

Step 5: Split the dataset for testing and training

X = housing.iloc[:,:-1]
y = housing.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, we randomly split the data into training and test sets using the train_test_split() method: 80% is kept for training and 20% for testing.
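One optional refinement, borrowed from the Hands-On Machine Learning book listed in the references: because MedInc turns out to be the strongest predictor, you can stratify the split on binned income so that the training and test sets share the same income distribution. The bin edges below are a common choice for this dataset, not something dictated by the data.

# Optional sketch: stratified split on binned median income.
income_cat = pd.cut(housing['MedInc'],
                    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                    labels=[1, 2, 3, 4, 5])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=income_cat)

If you use this variant, the remaining steps are unchanged.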

Step 6: Fitting the model

regression_pipeline = Pipeline([
('scaler', StandardScaler()),
('regressor', LinearRegression())
])
regression_pipeline.fit(X_train,y_train)

For better accuracy, standard scaling is applied. The pipeline first applies the StandardScaler() transformer to the features and then fits the linear regression model. Using a Pipeline makes the code cleaner and more reusable, and reduces a lot of boilerplate.
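For intuition, here is roughly what the pipeline does for you under the hood. The key detail it gets right automatically is that the scaler is fitted on the training data only and then reused, unchanged, on the test data.

# Manual equivalent of the two-step pipeline above.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit scaling stats on train only
model = LinearRegression().fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)        # reuse the same fitted scaler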

Step 7: Prediction and Evaluation

y_pred = regression_pipeline.predict(X_test)
r2_score( y_test, y_pred)

The R² score achieved is only 0.57, which is not very high. With feature engineering and preprocessing, I was able to improve the score to 0.62. However, I believe that is close to the best plain linear regression can achieve here. To score higher, regularized regression or a different algorithm, such as random forest, should be used, as sketched below.
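Because the model is just the last step of the pipeline, trying those alternatives is a small swap. Here is a sketch with untuned defaults, so treat the resulting scores as illustrative rather than a promise.

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Swap in a regularized linear model or a random forest; the rest of
# the workflow (fit, predict, score) is unchanged.
for name, model in [('ridge', Ridge(alpha=1.0)),
                    ('forest', RandomForestRegressor(random_state=42))]:
    pipe = Pipeline([('scaler', StandardScaler()), ('regressor', model)])
    pipe.fit(X_train, y_train)
    print(name, r2_score(y_test, pipe.predict(X_test)))

Scaling does nothing useful for a tree-based model like random forest, but leaving it in the pipeline is harmless and keeps the comparison uniform.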

Conclusion:

Here, I have demonstrated how to perform basic linear regression using the scikit-learn library. If you would like to experiment with other algorithms from the same library, go ahead; the implementation process remains the same.

Here is the link to the full code: Notebook

References:

  1. Scikit-Learn Documentation
  2. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow
