Regression: A Step-by-Step Code Walkthrough

InnovationHub · Jul 11, 2023

This code walkthrough draws inspiration from the excellent book “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” to present a comprehensive, step-by-step explanation.

For an in-depth understanding of linear and multiple regression, I highly recommend my previous blog posts, Multiple Regression without iteration (Mathematical Intuition) and Linear Regression without iteration (Mathematical Intuition). Those articles delve into the mathematics behind these regression techniques and highlight the key concepts.

Let’s dive into the code

IMPORTING DATA:-

There are two methods for importing data: static and dynamic. The static method reads data from your local directory, whereas the dynamic method pulls data from an online source, which is ideal when many developers are working with the same data.

Static method:

In Google Colab, you can use the file-upload button to upload data downloaded from open data repositories such as Kaggle or AWS datasets, and then run the code below to read it.

import pandas as pd

# Read the uploaded CSV (adjust the path to wherever your file was uploaded)
dataset = pd.read_csv("/MultipleRegression.csv")
print(dataset)

Dynamic method:

You can also import data directly from Kaggle or GitHub through a link, as shown below.

import os
import urllib.request

Root = "https://raw.githubusercontent.com/NithishaRaghavaraju/ML_Datasets/master/"
Path = os.path.join("Housing")
URL = Root + "MultipleRegression.csv"

housing_url = URL
housing_path = Path
# Create the local directory and download the CSV into it
os.makedirs(housing_path, exist_ok=True)
csv_path = os.path.join(housing_path, "Regression.csv")
urllib.request.urlretrieve(housing_url, csv_path)

import pandas as pd

def load_dataset(housing_path=Path):
    path = os.path.join(housing_path, "Regression.csv")
    return pd.read_csv(path)

dataset = load_dataset()
dataset.head()

NOTE:- The MultipleRegression dataset contains car details such as carlength, carwidth, stroke, price, and so on; price is the value to be predicted.

ANALYSING DATA:-

To see which attributes are categorical and which are numerical, run the code below.

dataset.info()

Remove the unnecessary attributes.

dataset = dataset.drop(["ID","symboling","name","enginetype"],axis=1)

The output of the code below is largely self-explanatory. The standard deviation measures the dispersion of the values. The 25%, 50%, and 75% rows show the corresponding percentiles: the value below which a given percentage of observations falls. For example, 25% of cars have a price below 7788, 50% below 10295, and 75% below 16503. These are commonly referred to as the 25th percentile, the median, and the 75th percentile.

dataset.describe()

Output:-
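To read a particular percentile programmatically rather than from the describe() table, you can use pandas' quantile method. A minimal sketch, assuming the dataset and its price column loaded above:

# Illustrative only: the same percentiles computed directly
print(dataset["price"].quantile([0.25, 0.5, 0.75]))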

Run the code below to view the attribute values as histograms; bins sets the number of bars in each histogram, and figsize sets the size of the overall figure.

import matplotlib.pyplot as plt
dataset.hist(bins=50,figsize=(20,15))
plt.show()

From this output you can see how the values are distributed: some attributes are skewed, with values stretching further to the right than to the left or vice versa. You can transform such attributes later to bring their distributions closer to a bell shape.
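For instance, a heavily right-skewed attribute can often be made more symmetric with a log transform. A minimal sketch, using price purely as an example of a positive, skewed column:

import numpy as np
import matplotlib.pyplot as plt

# Compare the raw and log-transformed distributions of a skewed attribute
dataset["price"].hist(bins=50)
plt.show()
np.log1p(dataset["price"]).hist(bins=50)
plt.show()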

SPLITTING DATA:-

By dividing the data, we can later test how well our model performs with new data.

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=1)

Now we will look at the correlation between attributes.

# numeric_only=True restricts the correlation matrix to numerical columns
corr_matrix = dataset.corr(numeric_only=True)
corr_matrix["price"].sort_values(ascending=False)

Output:-

From this output we can conclude that enginesize is an important attribute, so let's split the data carefully so that both the train and test sets contain a representative range of its values.

dataset["enginesize"].hist(bins=50,figsize=(10,5))\
plt.show()

Most of the values fall between 80 and 150, so let's create a category attribute with five categories: category 1 ranges from 0 to 80, category 2 from 80 to 100, and so on:

import numpy as np
dataset["enginesize_cat"] = pd.cut(dataset["enginesize"],bins=[0, 80, 100, 120, 150, np.inf],labels=[1, 2, 3, 4,5])
dataset["enginesize_cat"].hist()

Output:-

Splitting the data so that significant enginesize values are present in both the train and test sets:

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(dataset, dataset["enginesize_cat"]):
    strat_train_set = dataset.loc[train_index]
    strat_test_set = dataset.loc[test_index]

for set_ in (strat_train_set, strat_test_set):
    set_.drop("enginesize_cat", axis=1, inplace=True)

dataset = strat_train_set.copy()
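To verify that the stratification worked, you can compare the enginesize_cat proportions in the stratified test split with those in the full dataset. A small sketch, using the variables defined above; note that it has to run before enginesize_cat is dropped in the loop above:

# Category proportions should be nearly identical in the full data and the test split
print(dataset["enginesize_cat"].value_counts() / len(dataset))
print(strat_test_set["enginesize_cat"].value_counts() / len(strat_test_set))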

HANDLING BOTH NUMERICAL AND CATEGORICAL DATA:-

Separate the numerical and categorical data so that you can handle them separately.

# separating labels and features
dataset_features = strat_train_set.drop("price", axis=1)
dataset_labels = strat_train_set["price"]

# Drop string columns from the numerical copy and numerical columns from the
# categorical copy, based on the type of each column's first value
dataset_features_num = dataset_features
dataset_features_cat = dataset_features
for i in dataset_features.columns:
    value = dataset_features[i].iloc[0]
    if type(value) == str:
        dataset_features_num = dataset_features_num.drop([i], axis=1)
    else:
        dataset_features_cat = dataset_features_cat.drop([i], axis=1)
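As an aside, pandas can perform this split more concisely with select_dtypes; a sketch of an equivalent approach (not the code used in the rest of this walkthrough):

# Alternative: let pandas separate numeric and non-numeric columns directly
dataset_features_num = dataset_features.select_dtypes(include="number")
dataset_features_cat = dataset_features.select_dtypes(exclude="number")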

Most machine learning algorithms prefer to work with numbers, so let's convert these categories from text to numbers. The most common method is one-hot encoding. (Refer to this YouTube video, https://youtu.be/9yl6-HEY7_s, to learn more about one-hot encoding.)
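To make the idea concrete, here is a minimal, self-contained sketch of OneHotEncoder on a made-up column (the category values are invented for illustration):

from sklearn.preprocessing import OneHotEncoder

# Toy example: two fuel-type categories become two binary columns
encoder = OneHotEncoder()
toy = [["gas"], ["diesel"], ["gas"]]
encoded = encoder.fit_transform(toy)
print(encoder.categories_)  # the categories found during fitting
print(encoded.toarray())    # one row per sample, one column per category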

Now apply feature scaling to the numerical data so that all attributes end up on a comparable scale; here we use standardization, which gives each attribute zero mean and unit variance.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Numerical pipeline: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

num_attribs = list(dataset_features_num)
cat_attribs = list(dataset_features_cat)

# Apply the numerical pipeline to numerical columns and one-hot encoding to
# categorical columns, then concatenate the results
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
dataset_prepared = full_pipeline.fit_transform(dataset_features)
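As an optional sanity check (a sketch, assuming the variable names defined above), you can run just the numerical pipeline and confirm that each column now has roughly zero mean and unit variance:

# Standardized columns should have mean ~0 and standard deviation ~1
scaled_num = num_pipeline.fit_transform(dataset_features_num)
print(scaled_num.mean(axis=0).round(2))
print(scaled_num.std(axis=0).round(2))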

TRAINING AND EVALUATING ON DIFFERENT ML MODELS:-

Model 1:- LINEAR REGRESSION

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(dataset_prepared, dataset_labels)

#calculating mean squared error
from sklearn.metrics import mean_squared_error
dataset_predictions = lin_reg.predict(dataset_prepared)
lin_mse = mean_squared_error(dataset_labels, dataset_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
#OUTPUT :- 1983.9432044032753

from sklearn.metrics import r2_score
r2_score(dataset_labels,dataset_predictions)
#OUTPUT :- 0.9378733314555263

A training R² of about 0.94 is good, but let's also check how the model performs on validation data.

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

from sklearn.model_selection import cross_val_score
# cross_val_score expects a score where higher is better, so it returns
# negative MSE; negate it before taking the square root to get RMSE
scores = cross_val_score(lin_reg, dataset_prepared, dataset_labels, scoring="neg_mean_squared_error", cv=10)
Lin_rmse_scores = np.sqrt(-scores)

display_scores(Lin_rmse_scores)
#OUTPUT :- Scores: [2400.80093095 1658.19418067 2416.35394522 2840.92269957 2402.32610304 3604.18389792 2276.06538738 1673.08553587 2959.94044758 5687.91915051] Mean: 2791.979227870859 Standard deviation: 1110.23664076687

As you can see, the scores on the validation set are considerably worse than on the training set.

Model 2:- DECISION TREE

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(dataset_prepared, dataset_labels)

dataset_predictions = tree_reg.predict(dataset_prepared)
tree_mse = mean_squared_error(dataset_labels, dataset_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
#OUTPUT:- 303.6285338787609

from sklearn.metrics import r2_score
r2_score(dataset_labels,dataset_predictions)
#OUTPUT:- 0.9985448600623227

A training R² of about 0.99 looks excellent, but it may be a sign of overfitting, so let's check the validation set.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, dataset_prepared, dataset_labels,
scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

display_scores(tree_rmse_scores)

#OUTPUT:- Scores: [3606.0749723 3274.85132352 2368.85818859 2706.39157421 2232.30014153 4152.46041522 1881.75807093 2165.70824515 3789.15809415 2598.74148928] Mean: 2877.630251489709 Standard deviation: 736.5790641801335

As you can see, the validation scores are much worse than the training scores, another sign of overfitting; the mean RMSE is close to that of linear regression, though the variation across folds is lower.

Model 3:- RANDOM FOREST

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(dataset_prepared, dataset_labels)

dataset_predictions = forest_reg.predict(dataset_prepared)
forest_mse = mean_squared_error(dataset_labels, dataset_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
#Output:- 855.542197109679

from sklearn.metrics import r2_score
r2_score(dataset_labels,dataset_predictions)
#Output:- 0.9884467953899244

A training R² of about 0.99 again looks excellent, but it may also reflect overfitting, so let's check the validation set.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(forest_reg, dataset_prepared, dataset_labels,scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)

display_scores(forest_rmse_scores)
#OUTPUT:- Scores: [2734.44328867 1188.14002497 1838.13175821 3073.10865759 2771.873636812327.09898295 1398.38354161 2146.56526885 2450.01392878 2253.57101669] Mean: 2218.1330105124134 Standard deviation: 570.3932128584376

Compared to the decision tree and linear regression, the random forest performs best on the validation set.

TESTING THE MODEL:-

X_test = strat_test_set.drop("price", axis=1)
y_test = strat_test_set["price"].copy()
X_test_prepared = full_pipeline.transform(X_test)

#TESTING USING LINEAR REGRESSION
final_predictions = lin_reg.predict(X_test_prepared)
from sklearn.metrics import r2_score
r2_score(y_test,final_predictions)
#output:- 0.7529563908162089

#TESTING USING DECISION TREE
final_predictions = tree_reg.predict(X_test_prepared)
from sklearn.metrics import r2_score
r2_score(y_test,final_predictions)
#output:- 0.8306763271550477

#TESTING USING RANDOM FOREST
final_predictions = forest_reg.predict(X_test_prepared)
from sklearn.metrics import r2_score
r2_score(y_test,final_predictions)
#output:- 0.9138544598709686

According to the results above, the random forest achieves the highest test-set R², about 0.91. Therefore the random forest is the best of the three models for this data.

Refer to the GitHub link below for the complete code.

GITHUB LINK : https://github.com/NithishaRaghavaraju/ML_REGRESSION
