## Data Science

# Linear Regression Basics for Absolute Beginners

## Tutorial on simple and multiple regression analysis using NumPy, Pylab, and Scikit-learn

# 1. Introduction

Regression models are the most popular machine learning models. Regression models are used for predicting target variables on a continuous scale. Regression models find applications in almost every field of study, and as a result, it is one of the most widely used machine learning models. This article will discuss the basics of linear regression and is intended for beginners in the field of data science.

Using the cruise ship dataset **cruise_ship_info.csv**, we will demonstrate simple and multiple regression analysis using NumPy, Pylab, and Scikit-learn. Because this is just an introductory tutorial, no distinction between inliers and outliers shall be made (outliers can be handled using more robust methods such as the RANSAC regression).

# 2. Data Analysis

## 2.1 Import Necessary Libraries

import numpy as npimport pandas as pdimport pylabimport matplotlib.pyplot as pltimport seaborn as snsimport matplotlib.pyplot as pltfrom sklearn.metrics import r2_scorefrom sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import LinearRegressionfrom sklearn.pipeline import Pipelinepipe_lr = Pipeline([('scl', StandardScaler()),

('lr', LinearRegression())])

## 2.2 Read dataset and display columns

df = pd.read_csv("cruise_ship_info.csv")df.head()

## 2.3 Calculate the covariance matrix

cols = ['Age', 'Tonnage', 'passengers', 'length',

'cabins','passenger_density','crew']

stdsc = StandardScaler()X_std = stdsc.fit_transform(df[cols].iloc[:,range(0,7)].valuescov_mat = np.cov(X_std.T)

## 2.4 Generate a heatmap for visualizing the covariance matrix

`plt.figure(figsize=(10,10))sns.set(font_scale=1.5)hm = sns.heatmap(cov_mat,`

cbar=**True**,

annot=**True**,

square=**True**,

fmt='.2f',

annot_kws={'size': 12},

yticklabels=cols,

xticklabels=cols)

plt.title('Covariance matrix showing correlation coefficients')

plt.tight_layout()

plt.show()

# 3. Simple Linear Regression

In simple linear regression, there is only one predictor variable. Since our goal is to predict the crew variable, we see from **Figure 1** that the cabins variable correlates the most with the crew variable. Hence our simple regression model can be expressed in the form:

where m is the slope or regression coefficient, and c is the intercept. The model will be evaluated using the R2 score metric which can be calculated as follows:

The R2 score takes values between 0 and 1. When R2 is close to 1, it means the predicted values agree closely with the actual values. If R2 is close to zero, then it means the predictive power of the model is very poor.

Let’s now define and plot our independent and dependent variables:

X = df['cabins']y = df['crew']plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)plt.xlabel('cabins')plt.ylabel('crew')plt.title('scatter plot of crew vs. cabins')plt.show()

## 3.1 Simple linear regression using numpy

z = np.polyfit(X,y,1)p = np.poly1d(z)print(p)

**Output**: 0.745 x + 1.216

This shows that the fitted slope is m = 0.745, and the intercept is c = 1.216.

y_pred_numpy = p(X)R2_numpy = 1 - ((y-y_pred_numpy)**2).sum()/((y-y.mean())**2).sum()print(R2_numpy)

**Output**: R2_numpy = 0.9040636287611352

`print(r2_score(y, y_pred_numpy))`

**Output**: 0.9040636287611352

Let’s now plot the actual and predicted values:

plt.figure(figsize=(10,7))plt.scatter(X,y,c='steelblue', edgecolor='white', s=70,

label='actual')plt.plot(X,y_pred_numpy, color='black', lw=2, label='predicted')plt.xlabel('cabins')plt.ylabel('crew')plt.title('actual and fitted plots')plt.legend()plt.show()

## 3.2 Simple linear regression using Pylab

degree = 1model= pylab.polyfit(X,y,degree)print(model)

**Output**: array([0.7449974 , 1.21585013]). We see again that the slope is m = 0.745, and the intercept is c = 1.216.

y_pred_pylab = pylab.polyval(model,X)R2_pylab = 1 - ((y-y_pred_pylab)**2).sum()/((y-y.mean())**2).sum()print(R2_pylab)

**Output**: R2_pylab = 0.9040636287611352

`print(r2_score(y, y_pred_pylab))`

**Output**: 0.9040636287611352

## 3.3 Simple linear regression using scikit-learn

lr = LinearRegression()lr.fit(X.values.reshape(-1,1),y)print(lr.coef_)print(lr.intercept_)

**Output**: [0.7449974], 1.2158501299368671. We see again that the slope is m = 0.745, and the intercept is c = 1.216.

y_pred_sklearn = lr.predict(X.values.reshape(-1,1))R2_sklearn = 1 - ((y-y_pred_sklearn)**2).sum()/((y-y.mean())**2).sum()print(R2_sklearn)

**Output**: R2_sklearn = 0.9040636287611352

`print(r2_score(y, y_pred_sklearn))`

**Output**: 0.9040636287611352

We observe that all 3 methods for basic linear regression (NumPy, Pylab, and Scikit-learn) gave consistent results.

# 4. Multiple Linear Regression with Scikit-Learn

From the covariance matrix plot above (**Figure 1**), we see that the “crew” variable correlates strongly (correlation coefficient ≥ 0.6) with 4 predictor variables: “Tonnage”, “passengers”, “length, and “cabins”. We can, therefore, build a multiple regression model of the form:

where X is the features matrix, w_0 is the intercept, and w_1, w_2, w_3, and w_4 are the regression coefficients.

## 4.1 Define features matrix and the target variable

cols_selected = ['Tonnage', 'passengers', 'length', 'cabins','crew']df[cols_selected].head()X = df[cols_selected].iloc[:,0:4].values # features matrix y = df[cols_selected]['crew'].values # target variable

## 4.2 Model building and evaluation

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=0)sc_y = StandardScaler()y_train_std = sc_y.fit_transform(y_train[:,np.newaxis]).flatten()pipe_lr.fit(X_train, y_train_std)y_train_pred = sc_y.inverse_transform(pipe_lr.predict(X_train))y_test_pred = sc_y.inverse_transform(pipe_lr.predict(X_test))r2_score_train = r2_score(y_train, y_train_pred)r2_score_test = r2_score(y_test, y_test_pred)print('R2 train for lr: %.3f' % r2_score_train)print('R2 test for lr: %.3f ' % r2_score_test)

**Output:**

R2 train for lr: 0.912

R2 test for lr: 0.958

## 4.3 Plot the output

plt.scatter(y_train, y_train_pred, c='steelblue', edgecolor='white', s=70, label='fitted')plt.plot(y_train, y_train, c = 'red', lw = 2,label='ideal')plt.xlabel('actual crew')plt.ylabel('predicted crew')plt.legend()plt.show()

# 5. Summary

In summary, we’ve presented a tutorial on simple and multiple regression analysis using different libraries such as NumPy, Pylab, and Scikit-learn. Linear regression is the most popular machine learning algorithm. A thorough understanding of linear regression would serve as a good foundation for understanding other machine learning algorithms such as logistic regression, K-nearest neighbor, and support vector machine.

# Additional Data Science/Machine Learning Resources

Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

Essential Maths Skills for Machine Learning

3 Best Data Science MOOC Specializations

5 Best Degrees for Getting into Data Science

5 reasons why you should begin your data science journey in 2020

Theoretical Foundations of Data Science — Should I Care or Simply Focus on Hands-on Skills?

Machine Learning Project Planning

How to Organize Your Data Science Project

Productivity Tools for Large-scale Data Science Projects

A Data Science Portfolio is More Valuable than a Resume

Data Science 101 — A Short Course on Medium Platform with R and Python Code Included

** For questions and inquiries, please email me**: benjaminobi@gmail.com