Machine Learning Process Tutorial

An extensive tutorial illustrating the machine learning process using the cruise ship dataset, with Python code included.

Introduction

Figure 1. Illustrating the Machine Learning Process. Image by Benjamin O. Tayo.

In this article, we present a practical tutorial on the machine learning process using the cruise ship dataset cruise_ship_info.csv. The dataset and the Jupyter notebook for this tutorial can be downloaded here: https://github.com/bot13956/Machine_Learning_Process_Tutorial.

1. Problem Framing

Objective: The goal of this project is to build a regression model that predicts the “crew” size for potential cruise ship buyers, using the cruise ship dataset cruise_ship_info.csv.

2. Data Analysis

2.1 Import necessary libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2.2 Read dataset and display columns

df = pd.read_csv("cruise_ship_info.csv")
df.head()
Table 1. First five rows of the dataset.

2.3 Calculate the covariance matrix

cols = ['Age', 'Tonnage', 'passengers', 'length',
        'cabins', 'passenger_density', 'crew']
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_std = stdsc.fit_transform(df[cols].values)
cov_mat = np.cov(X_std.T)
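Note that for standardized features (zero mean, unit variance), the covariance matrix coincides with the Pearson correlation matrix, which is why the heatmap in the next step can be read directly as correlation coefficients. A quick sanity check on synthetic data (not the cruise ship dataset) illustrates this:

```python
import numpy as np

# Synthetic data with some correlation between columns 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.5 * X[:, 0]

# Standardize manually (population std, ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of standardized data (bias=True divides by N, matching
# the ddof=0 standardization) equals the correlation matrix.
cov_of_std = np.cov(X_std.T, bias=True)
corr = np.corrcoef(X.T)

print(np.allclose(cov_of_std, corr))  # True
```

This is why the heatmap title below refers to "correlation coefficients" even though `np.cov` is used.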

2.4 Generate a heatmap for visualizing the covariance matrix

plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients')
plt.tight_layout()
plt.show()
Figure 2. Covariance matrix plot.

2.5 Feature selection using covariance matrix plot

cols_selected = ['Tonnage', 'passengers', 'length', 'cabins', 'crew']
df[cols_selected].head()
Table 2. First five rows of the selected features and the target variable.
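The selection above is done by reading the heatmap visually. A programmatic alternative (not from the tutorial, and the 0.6 cutoff is an illustrative assumption) is to keep only features whose absolute correlation with the target exceeds a threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical helper: keep features whose |correlation| with the
# target column exceeds a chosen threshold.
def select_features(df: pd.DataFrame, target: str, threshold: float = 0.6):
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

# Tiny synthetic demo (column names mirror the cruise ship dataset):
rng = np.random.default_rng(1)
n = 300
cabins = rng.normal(size=n)
demo = pd.DataFrame({
    'Age': rng.normal(size=n),                      # unrelated to crew
    'cabins': cabins,                               # strongly related
    'crew': 0.9 * cabins + 0.1 * rng.normal(size=n),
})
print(select_features(demo, 'crew'))  # ['cabins']
```

Thresholds like this are a heuristic; on the real dataset the visual inspection and the programmatic cut should agree on the strongly correlated features.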

2.6 Define your features matrix and target variable

X = df[cols_selected].iloc[:, 0:4].values   # features matrix
y = df[cols_selected]['crew'].values        # target variable

The features matrix and the target variable obtained above can then be used for model building.

3. Model Building

Since this is a regression problem, we will implement and compare three regression algorithms: Linear Regression (LR), KNeighbors Regression (KNR), and Support Vector Regression (SVR).

The dataset is divided into training and test sets, with cross-validation on the training set playing the role of a separate validation set. Hyperparameter tuning is used to fine-tune the model and guard against overfitting, while cross-validation checks that the model generalizes beyond the folds it was fit on. After fine-tuning, the model is applied to the held-out test set; its performance there approximates what would be expected when the model makes predictions on unseen data.
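In the code below, each model is wrapped in a scikit-learn `Pipeline` so that the `StandardScaler` is refit on each cross-validation training fold, avoiding data leakage from the validation fold. A minimal sketch of the pattern on synthetic data (not the cruise ship dataset):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Nearly linear synthetic data, so R2 should be high.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + 0.1 * rng.normal(size=100)

# Scaler + model in one pipeline: the scaler is fit only on each
# training fold inside cross_val_score, never on the held-out fold.
pipe = Pipeline([('scl', StandardScaler()), ('lr', LinearRegression())])
scores = cross_val_score(pipe, X, y, scoring='r2', cv=5)
print(scores.mean() > 0.9)  # True
```

The same pipeline object can later be fit on the full training set and applied to the test set, exactly as done in section 3.4.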

3.1 Model building and evaluation

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('lr', LinearRegression())])
pipe_knr = Pipeline([('scl', StandardScaler()),
                     ('knr', KNeighborsRegressor(n_neighbors=3))])
pipe_svr = Pipeline([('scl', StandardScaler()),
                     ('svr', SVR(kernel='linear', C=1.0))])

sc_y = StandardScaler()
train_score_lr = []
train_score_knr = []
train_score_svr = []
n = 15
for i in range(n):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=i)
    # standardize the target using the training set only
    y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()
    train_score_lr = np.append(
        train_score_lr,
        np.mean(cross_val_score(pipe_lr, X_train, y_train_std,
                                scoring='r2', cv=10)))
    train_score_knr = np.append(
        train_score_knr,
        np.mean(cross_val_score(pipe_knr, X_train, y_train_std,
                                scoring='r2', cv=10)))
    train_score_svr = np.append(
        train_score_svr,
        np.mean(cross_val_score(pipe_svr, X_train, y_train_std,
                                scoring='r2', cv=10)))

train_mean_lr = np.mean(train_score_lr)
train_std_lr = np.std(train_score_lr)
train_mean_knr = np.mean(train_score_knr)
train_std_knr = np.std(train_score_knr)
train_mean_svr = np.mean(train_score_svr)
train_std_svr = np.std(train_score_svr)

3.2 Output from machine learning model

print('R2 train for lr: %.3f +/- %.3f' % (train_mean_lr, train_std_lr))
print('R2 train for knr: %.3f +/- %.3f' % (train_mean_knr, train_std_knr))
print('R2 train for svr: %.3f +/- %.3f' % (train_mean_svr, train_std_svr))

3.3 Generate visualization of cross-validation score

plt.figure(figsize=(15,11))
plt.plot(range(n), train_score_lr, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='blue', markersize=10,
         label='Linear Regression')
plt.plot(range(n), train_score_knr, color='green', linestyle='dashed',
         marker='s', markerfacecolor='green', markersize=10,
         label='KNeighbors Regression')
plt.plot(range(n), train_score_svr, color='red', linestyle='dashed',
         marker='^', markerfacecolor='red', markersize=10,
         label='Support Vector Regression')
plt.grid()
plt.ylim(0.7,1)
plt.title ('Mean cross-validation R2 score vs. random state parameter', size = 14)
plt.xlabel('Random state parameter', size = 14)
plt.ylabel('Mean cross-validation R2 score', size = 14)
plt.legend()
plt.show()
Figure 3. Mean cross-validation R2 scores for the different regression models.

3.4 Model’s performance on test set

pipe_lr.fit(X_train, y_train_std)
pipe_knr.fit(X_train, y_train_std)
pipe_svr.fit(X_train, y_train_std)

# predictions are in standardized units; map them back to the
# original crew scale before scoring against y_test
r2_score_lr = r2_score(
    y_test,
    sc_y.inverse_transform(pipe_lr.predict(X_test).reshape(-1, 1)))
r2_score_knr = r2_score(
    y_test,
    sc_y.inverse_transform(pipe_knr.predict(X_test).reshape(-1, 1)))
r2_score_svr = r2_score(
    y_test,
    sc_y.inverse_transform(pipe_svr.predict(X_test).reshape(-1, 1)))

print('R2 test for lr: %.3f' % r2_score_lr)
print('R2 test for knr: %.3f' % r2_score_knr)
print('R2 test for svr: %.3f' % r2_score_svr)
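Because the models were trained on a standardized target, their predictions must be mapped back with `inverse_transform` before being compared to `y_test`. A small round-trip sketch of that pattern (synthetic crew-like values, not the tutorial data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([3.55, 6.7, 19.1, 10.0, 9.2])   # crew-like values

# Standardize the target, as done before training.
sc_y = StandardScaler()
y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()

# A model trained on y_std predicts in standardized units;
# inverse_transform restores the original crew scale.
y_back = sc_y.inverse_transform(y_std[:, np.newaxis]).flatten()
print(np.allclose(y_back, y))  # True
```

Skipping the inverse transform would make the R2 scores meaningless, since the predictions and `y_test` would live on different scales.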

4. Application

In this stage, the final machine learning model is selected and put into production. The model is evaluated in a production setting in order to assess its performance. Any gap between the model’s experimental performance and its performance in production has to be analyzed; this feedback can then be used to fine-tune the original model.
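Putting the selected model into production typically means persisting the fitted pipeline and reloading it in the serving environment. A minimal sketch using `joblib` (the file name and toy data are illustrative, not from the tutorial):

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Fit a pipeline on toy data standing in for the cruise-ship features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, -0.3, 2.0])

pipe = Pipeline([('scl', StandardScaler()), ('lr', LinearRegression())])
pipe.fit(X, y)

# Persist, reload, and verify that predictions match.
joblib.dump(pipe, 'crew_model.joblib')     # illustrative file name
loaded = joblib.load('crew_model.joblib')
print(np.allclose(pipe.predict(X), loaded.predict(X)))  # True
```

Persisting the whole pipeline (scaler plus model) ensures that production inputs receive exactly the same preprocessing as the training data.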

Based on the results from section 3, we observe that Linear Regression and Support Vector Regression perform at almost the same level, and both perform better than KNeighbors Regression. The final model selected could therefore be either Linear Regression or Support Vector Regression.

In summary, we have discussed the main stages of a machine learning process and illustrated the practical steps involved. We have illustrated our calculations using a regression problem, but the same process applies equally to classification projects.

Additional Data Science/Machine Learning Resources

Data Science Curriculum

Essential Maths Skills for Machine Learning

3 Best Data Science MOOC Specializations

5 Best Degrees for Getting into Data Science

5 reasons why you should begin your data science journey in 2020

Theoretical Foundations of Data Science — Should I Care or Simply Focus on Hands-on Skills?

Machine Learning Project Planning

How to Organize Your Data Science Project

Productivity Tools for Large-scale Data Science Projects

A Data Science Portfolio is More Valuable than a Resume

Data Science 101 — A Short Course on Medium Platform with R and Python Code Included

For questions and inquiries, please email me: benjaminobi@gmail.com

The Startup

Medium's largest active publication, followed by +752K people. Follow to join our community.

Benjamin Obi Tayo Ph.D.

Written by

Physicist, Data Science Educator, Writer. Interests: Data Science, Machine Learning, AI, Python & R, Predictive Analytics, Materials Sciences, Biophysics
