Linear Regression lab for dummies [in Python]

Mahsa Mir
Published in The Startup
8 min read · Jun 26, 2020

Today we’ll learn how to use linear regression to assist businesses with decision-making. So here we go 🤓

Understanding Machine Learning through Memes

Road-map:
1- Business understanding: understand the problem you are solving and define a goal
2- Data understanding: the difference between types of data
3- Data preparation: handling missing values / data scaling
4- Data modeling:
- Simple/Multiple Linear Regression
- Polynomial Regression
5- Model evaluation: is the model doing a good job on the test data?

Step 1: Business understanding
- Specify the key variables that will be used in the model
- Define related metrics
- Business understanding must be SMART-compliant:
Specific / Measurable / Achievable / Relevant / Time-bound

Step 2: Data understanding (Data Type)
It is really important to know what type of data we are working with:

1- Numerical:
- Expressed in numbers and has measurement meaning
- Broken down into interval and ratio data
-> Interval: a house’s temperature (23°C, 18°C)
-> Ratio: a house’s size (1,200 sq. ft., 1,020 sq. ft.)

2- Categorical:
- Represents data divisible into groups
- Broken down into nominal and ordinal data
-> Nominal: a house’s color (Red, Green)
-> Ordinal: a house’s door number (#1201, #1208)

Real-life example: think about the houses in a neighborhood
Different Data Type vs Operations
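As a quick illustration (a minimal sketch with a made-up houses DataFrame, not part of the data set we use below), pandas can separate numerical from categorical columns for you:

import pandas as pd
#hypothetical neighborhood data, for illustration only
houses = pd.DataFrame({
    'temp_c': [23, 18],            #numerical (interval)
    'size_sqft': [1200, 1020],     #numerical (ratio)
    'color': ['Red', 'Green'],     #categorical (nominal)
    'door_no': ['#1201', '#1208']  #categorical (ordinal)
})
print(houses.select_dtypes(include='number').columns.tolist())  #['temp_c', 'size_sqft']
print(houses.select_dtypes(exclude='number').columns.tolist())  #['color', 'door_no']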

Step 3: Data preparation
Now, enough with the theory; let’s dive into practice ^__*

- Source data set: Kaggle (Insurance)

Import our Python libraries and the data set, and get some basic information:

#importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#importing Data
Data = pd.read_csv('DataSet_Kaggle_Insurance.csv')
Data.head()
#get some information about our Data-Set
Data.info()
Data.describe()
#visualizing data
sns.pairplot(Data['age sex bmi children charges'.split()])
#sex is still a string column here, so restrict the correlation matrix to numeric columns
sns.heatmap(Data['age sex bmi children charges'.split()].corr(numeric_only=True), annot=True)
pair-plot — seaborn
heatmap — seaborn

Handling of Missing Data

Before applying any method, we need to check how many values are missing and find a way to handle them. Choose one of the options below depending on your data set (here I picked “fill the NaN values”).

  • Drop the column where NaN exists
  • Drop the row where NaN exists
  • Fill the NaN values (the most common idea)
#check how many values are missing (NaN)
Data.isnull().sum()
#fill the missing values (NaN) with the mean of the column
Data['bmi'] = Data['bmi'].fillna(Data['bmi'].mean())
#double-check how many values are missing (NaN) after filling
Data.isnull().sum()
Before and after filling NaN values
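For completeness, here is a minimal sketch of the two drop-based alternatives (done on copies, so the original Data is untouched):

#option 1: drop any column that contains NaN values
Data_cols_dropped = Data.dropna(axis=1)
#option 2: drop any row that contains NaN values
Data_rows_dropped = Data.dropna(axis=0)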

Handling of Categorical Data

Since machine learning models are based on mathematical equations, we need to encode the categorical variables.

Opt. 1- Label Encoding: when there are two distinct values
in our Data-Set: sex (male/female), smoker (yes/no)

Opt. 2- One-Hot Encoding: when there are three or more distinct values
in our Data-Set: region (southwest/southeast/northwest/northeast)

#importing required libraries
from sklearn.preprocessing import LabelEncoder
#using Label Encoder for converting sex and smoker columns
labelencoder = LabelEncoder()
Data['sex'] = labelencoder.fit_transform(Data['sex'])
Data['smoker'] = labelencoder.fit_transform(Data['smoker'])
Data.head()
#importing required library
from sklearn.preprocessing import OneHotEncoder
#using One-Hot Encoding for converting region column
ohe = OneHotEncoder()
ohe_Data = pd.DataFrame(ohe.fit_transform(Data[['region']]).toarray())
ohe_Data.columns = 'northeast northwest southeast southwest'.split()
#merge main Data with ohe_Data (the original region column is dropped later)
Data = Data.join(ohe_Data)
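As a side note, pandas offers a one-line shortcut that achieves the same result (a sketch, not what the rest of this article uses):

#equivalent shortcut: one-hot encode the region column directly with pandas
Data_dummies = pd.get_dummies(Data, columns=['region'])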

Splitting the Data-Set into Training Set and Test Set

The data is divided into a Training set and a Test set: we use the Training set to let the algorithm learn the data’s behavior, and then check the accuracy of our model on the Test set.

Features (X): the columns fed into our model, used to make predictions.
Target (y): the variable that will be predicted from the features.

#define X variables and our target(y)
X = Data.drop(['charges','region'],axis = 1)
y = Data['charges']
#splitting Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
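A quick, optional sanity check confirms the 67/33 split:

#sanity check: roughly 67% of rows in train, 33% in test
print(X_train.shape, X_test.shape)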

Feature Scaling

Feature scaling helps us see all the variables through the same lens (on the same scale). It can be done in two ways:
Opt. 1- Normalization:
-> Rescales data into the range [0,1]
-> Outliers from the data set are lost

Opt. 2- Standardization (more recommended):
-> Rescales data to have a mean of 0 and a standard deviation of 1
-> Less affected by outliers
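Under the hood the two rescalings are simple formulas; a minimal numpy sketch (on a made-up array) shows them before we use the sklearn scalers below:

import numpy as np
x = np.array([1.0, 5.0, 10.0])
#normalization: (x - min) / (max - min) -> values in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())
#standardization: (x - mean) / std -> mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()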

#normalization scaler - fit&transform on train, transform only on test
from sklearn.preprocessing import MinMaxScaler
n_scaler = MinMaxScaler()
X_train = n_scaler.fit_transform(X_train.astype(float))
X_test = n_scaler.transform(X_test.astype(float))
#standardization scaler - fit&transform on train, transform only on test
#(in practice you would pick one of the two scalers, not apply both)
from sklearn.preprocessing import StandardScaler
s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(float))
X_test = s_scaler.transform(X_test.astype(float))

Step 4: Data Modeling

1- Supervised Learning Prediction
- Regression: predict a numerical variable (covered in this article)
- Classification: predict a categorical variable (will be covered soon)

2- Unsupervised Learning Prediction
- Clustering: discover the inherent groupings in the data
- Association: discover rules that describe portions of the data

IMPORTANT

It is really easy to take action based on the results of linear regression, but keep in mind that we need to check some assumptions before and after using any regression model; if our assumptions are not satisfied, our results may not make sense. You can find a very useful reference from Jeff Macaluso here.
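As one example of such a check (a minimal sketch, assuming the fitted regressor and the X_test/y_test split we create below), plotting residuals against predictions makes violations of the linearity and constant-variance assumptions easy to spot:

#residuals vs predictions: a random, centered cloud supports the assumptions
y_hat = regressor.predict(X_test)
residuals = y_test - y_hat
plt.scatter(y_hat, residuals)
plt.axhline(0, color='red', linewidth=1)
plt.xlabel('predicted charges')
plt.ylabel('residuals')
plt.show()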

Linear Regression

Overall, the purpose of a regression model is to understand the relationship between the features and the target. The theory below helped me understand the concept of regression, so let’s review it (it won’t be boring, I promise):

Data Science Training — Kiril Eremenko

Question: how does an employee’s salary depend on their experience?

  • Regression model: find the line that best fits the data (we will talk about this concept later)
  • Constant: the point where the fitted line crosses the vertical axis, i.e. when experience is zero, salary is about 30K.
  • Coefficient (line slope): if somebody gains 1 year of experience, his/her salary will rise by 10K; hence, the bigger the coefficient, the more effect experience has on salary, and vice versa.
Data Science Training — Kiril Eremenko
  • Red cross: actual salary
  • Green cross: predicted salary
  • Model line (black line): tells us where that person should be in terms of salary (green cross)
  • Residual (green line): the difference between the actual salary and the predicted salary
  • What regression does: it draws several possible trend lines, computes the sum of squared differences for each, and picks the line with the minimum sum of squares, which is the best-fit line (see the sketch below).
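A minimal numpy sketch of that idea (with made-up salary data and an assumed constant of 30K): among a few candidate slopes, the best fit is the one with the smallest sum of squared residuals.

import numpy as np
experience = np.array([1, 2, 3, 4, 5])
salary = np.array([40, 52, 58, 71, 80])  #in K, made up for illustration
for slope in [5.0, 10.0, 15.0]:
    predicted = 30 + slope * experience  #candidate trend line
    sse = np.sum((salary - predicted) ** 2)
    print(f'slope {slope}: sum of squared residuals = {sse:.1f}')
#the slope near 10 gives the smallest sum, i.e. the best fit of the three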

Simple Linear Regression

Simple linear regression involves two variables:

#get one X variable and our target(y)
X = Data['bmi'].values.reshape(-1,1)
y = Data['charges'].values.reshape(-1,1)
#splitting Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#evaluate the model (intercept and slope)
print(regressor.intercept_)
print(regressor.coef_)

The result should be approximately 13201.18 and 2192.51 respectively,
which means that for every one-unit increase in the bmi feature, the predicted charge increases by about 2192.51 (the slope).
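To make this interpretation concrete, here is a prediction by hand (a sketch using the approximate values printed above; bmi = 30 is an arbitrary example):

#predicted charge = intercept + coefficient * bmi
#e.g. for bmi = 30: 13201.18 + 2192.51 * 30 ≈ 78976.48
print(13201.18 + 2192.51 * 30)

Now it’s time to make our predictions on the whole test set: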

#predicting the test set result
y_pred = regressor.predict(X_test)
#compare actual output values with predicted values
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
# visualize comparison result as a bar graph
df1 = df.head(20)
df1.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
#prediction vs test set
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()
Actual and predicted values — bar graph

Well, our model is not very precise, but the predicted values are reasonably close to the actual ones! Here we plot the regression line against the test set:

prediction vs test set

Now we need to complete our model evaluation; we will calculate:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
# evaluate the performance of the algorithm
from sklearn import metrics
#(MAE):mean of the absolute value of the errors
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
#(MSE) is the mean of the squared errors
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
#(RMSE): square root of the mean of the squared errors
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
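For intuition, the three metrics are simple numpy one-liners (a sketch equivalent to the sklearn calls above):

errors = y_test - y_pred
print('MAE :', np.mean(np.abs(errors)))
print('MSE :', np.mean(errors ** 2))
print('RMSE:', np.sqrt(np.mean(errors ** 2)))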

By comparing the root mean squared error with the mean value of our target, we can judge whether or not our algorithm is accurate.
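A common rule of thumb (my own heuristic, not a hard rule) is that an RMSE well below roughly 10% of the target’s mean suggests a reasonably accurate model:

#compare RMSE against the mean of the target
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE as a fraction of mean charges:', rmse / y_test.mean())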

Multiple Linear Regression

Here we have more than one feature; the steps to perform multiple linear regression are almost the same as for simple linear regression:

#get multiple X variables and our target(y)
X = Data.drop(['charges', 'region'],axis = 1)
y = Data['charges']
#splitting Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#standardization scaler - fit&transform on train, transform only on test
from sklearn.preprocessing import StandardScaler
s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(float))
X_test = s_scaler.transform(X_test.astype(float))
#Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#evaluate the model (intercept and slope)
print(regressor.intercept_)
print(regressor.coef_)
#predicting the test set result
y_pred = regressor.predict(X_test)
#put results as a DataFrame
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
#checking accuracy of Model
print('Linear Regression Model:')
print("Train Score {:.2f}".format(regressor.score(X_train, y_train)))
print("Test Score {:.2f}".format(regressor.score(X_test, y_test)))

The table below means that for a one-unit increase in the age variable there is an increase of about 3.59 units in the target (charges). (Note that since we standardized the features, one unit here is one standard deviation.)

Optimal coefficients for all features found by the regression model

Let’s compare actual output and predicted value:

#compare actual output values with predicted values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(10)
df1
# evaluate the performance of the algorithm (MAE - MSE - RMSE)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Polynomial Regression

Polynomial regression lets a linear model capture curved relationships: we expand the features with their powers and pairwise products (here up to degree 2) and then fit an ordinary linear regression on the expanded features, so the model is still linear in its coefficients.
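A quick illustration of what the expansion does (a sketch on a made-up two-feature row):

from sklearn.preprocessing import PolynomialFeatures
#degree-2 expansion of [a, b] -> [1, a, b, a^2, a*b, b^2]
print(PolynomialFeatures(degree=2).fit_transform([[2, 3]]))
#[[1. 2. 3. 4. 6. 9.]]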

# PolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures
X = Data.drop(['charges','region'],axis = 1)
y = Data['charges']
#expand the features with degree-2 polynomial terms
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_train,X_test,y_train,y_test = train_test_split(X_poly,y, test_size = 0.33, random_state = 0)
#standard scaler (fit&transform on train, transform only on test)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(float))
X_test = sc.transform(X_test.astype(float))
#fit and predict model
poly_lr = LinearRegression().fit(X_train,y_train)
y_pred = poly_lr.predict(X_test)
#checking accuracy of Polynomial Regression Model
print('Polynomial Regression Model:')
print("Train Score {:.2f}".format(poly_lr.score(X_train,y_train)))
print("Test Score {:.2f}".format(poly_lr.score(X_test, y_test)))
#evaluate the model - Coefficient and constant
print(poly_lr.intercept_)
print(poly_lr.coef_)
#compare actual output values with predicted values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(10)
df1
# evaluate the performance of the algorithm (MAE - MSE - RMSE)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

This was a handy guide that I hope helped you understand linear regression as much as writing it helped me. Thanks for reading 🤓

Mahsa Mir

Eager learner | a big believer in saving time | LinkedIn: mahsamirgholami