Machine Learning Process Tutorial
An extensive tutorial illustrating the machine learning process using the cruise ship dataset, with Python code included.
The machine learning process includes 4 main stages: Problem Framing, Data Analysis, Model Building, and Model Application.
In this article, we present a practical tutorial of the machine learning process using the cruise ship dataset cruise_ship_info.csv. The dataset and Jupyter notebook for this tutorial can be downloaded from here: https://github.com/bot13956/Machine_Learning_Process_Tutorial.
1. Problem Framing
Define your project goals. What do you want to find out? Do you have the data to analyze?
Objective: The goal of this project is to build a regressor model that recommends the “crew” size for potential cruise ship buyers using the cruise ship dataset cruise_ship_info.csv.
2. Data Analysis
Import the dataset and analyze the features in order to select those that correlate with the target variable.
2.1 Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2.2 Read dataset and display columns
df = pd.read_csv("cruise_ship_info.csv")
df.head()
2.3 Calculate the covariance matrix
cols = ['Age', 'Tonnage', 'passengers', 'length',
        'cabins', 'passenger_density', 'crew']

from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_std = stdsc.fit_transform(df[cols].iloc[:, range(0, 7)].values)
# Since the features are standardized, this covariance matrix
# is the same as the correlation matrix
cov_mat = np.cov(X_std.T)
2.4 Generate a heatmap for visualizing the covariance matrix
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients')
plt.tight_layout()
plt.show()
2.5 Feature selection using covariance matrix plot
From the covariance matrix plot above, we see that the “crew” variable correlates strongly (correlation coefficient ≥ 0.6) with 4 predictor variables: “Tonnage”, “passengers”, “length”, and “cabins”.
cols_selected = ['Tonnage', 'passengers', 'length', 'cabins', 'crew']
df[cols_selected].head()
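As a sanity check, the selection threshold above can also be applied programmatically with pandas' `corr()`. The sketch below uses a small hypothetical sample with the same columns as cruise_ship_info.csv (the numbers are made up for illustration); on the real data you would compute the correlations on the full DataFrame instead.

```python
import pandas as pd

# Hypothetical mini-sample shaped like cruise_ship_info.csv (values are illustrative)
df = pd.DataFrame({
    'Age': [6, 25, 11, 22, 15],
    'Tonnage': [110.0, 105.0, 30.3, 70.4, 46.0],
    'passengers': [29.74, 27.34, 6.94, 17.91, 14.52],
    'length': [9.53, 9.51, 5.94, 8.03, 7.27],
    'cabins': [14.87, 13.67, 3.55, 8.92, 7.26],
    'passenger_density': [36.99, 38.4, 43.66, 39.3, 31.68],
    'crew': [11.5, 10.68, 4.7, 9.2, 6.6],
})

# Correlation of every predictor with the target 'crew'
corr_with_crew = df.corr()['crew'].drop('crew')

# Keep only predictors with |r| >= 0.6
selected = corr_with_crew[corr_with_crew.abs() >= 0.6].index.tolist()
print(selected)
```

On this toy sample the threshold keeps the same four predictors identified from the heatmap.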
2.6 Define your features matrix and target variable
X = df[cols_selected].iloc[:, 0:4].values  # features matrix
y = df[cols_selected]['crew'].values       # target variable
The features matrix and the target variable obtained above can then be used for model building.
3. Model Building
Pick the machine learning tool that matches your data and desired outcome. Train the model with available data.
Since our goal is to use regression, we will implement 3 different regression algorithms: Linear Regression (LR), KNeighbors Regression (KNR), and Support Vector Regression (SVR).
The data set has to be divided into training, validation, and test sets. Hyperparameter tuning is used to fine-tune the model in order to prevent overfitting. Cross-validation is performed to ensure the model performs well on the validation set. After fine-tuning model parameters, the model is applied to the test data set. The model’s performance on the test data set is approximately equal to what would be expected when the model is used for making predictions on unseen data.
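The hyperparameter tuning mentioned above can be carried out with scikit-learn's GridSearchCV, which runs cross-validation over a grid of candidate parameter values. The sketch below tunes the SVR regularization strength C; the data here is a synthetic stand-in for the cruise ship features (the numbers are illustrative, not from the dataset).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical stand-in for the cruise ship features and crew target
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(100, 4))
y = X @ np.array([0.05, 0.1, 0.2, 0.3]) + rng.normal(0, 1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)

pipe = Pipeline([('scl', StandardScaler()), ('svr', SVR(kernel='linear'))])

# Search over the SVR regularization strength C with 10-fold cross-validation
param_grid = {'svr__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, param_grid, scoring='r2', cv=10)
grid.fit(X_train, y_train)

print(grid.best_params_)
print('best CV R2: %.3f' % grid.best_score_)
```

The same pattern extends to KNeighborsRegressor (e.g. tuning `n_neighbors`) by changing the pipeline and the parameter grid.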
3.1 Model building and evaluation
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('lr', LinearRegression())])
pipe_knr = Pipeline([('scl', StandardScaler()),
                     ('knr', KNeighborsRegressor(n_neighbors=3))])
pipe_svr = Pipeline([('scl', StandardScaler()),
                     ('svr', SVR(kernel='linear', C=1.0))])

sc_y = StandardScaler()

train_score_lr = []
train_score_knr = []
train_score_svr = []

n = 15
for i in range(n):
    # 60/40 train/test split (split fraction assumed); the random state
    # is varied across runs to study its effect on the scores
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                        random_state=i)
    y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()
    train_score_lr = np.append(train_score_lr,
                               np.mean(cross_val_score(pipe_lr, X_train, y_train_std,
                                                       scoring='r2', cv=10)))
    train_score_knr = np.append(train_score_knr,
                                np.mean(cross_val_score(pipe_knr, X_train, y_train_std,
                                                        scoring='r2', cv=10)))
    train_score_svr = np.append(train_score_svr,
                                np.mean(cross_val_score(pipe_svr, X_train, y_train_std,
                                                        scoring='r2', cv=10)))

train_mean_lr = np.mean(train_score_lr)
train_std_lr = np.std(train_score_lr)
train_mean_knr = np.mean(train_score_knr)
train_std_knr = np.std(train_score_knr)
train_mean_svr = np.mean(train_score_svr)
train_std_svr = np.std(train_score_svr)
3.2 Output from machine learning model
print('R2 train for lr: %.3f +/- %.3f' % (train_mean_lr, train_std_lr))
print('R2 train for knr: %.3f +/- %.3f' % (train_mean_knr, train_std_knr))
print('R2 train for svr: %.3f +/- %.3f' % (train_mean_svr, train_std_svr))
3.3 Generate visualization of cross-validation score
plt.figure(figsize=(15, 10))
plt.plot(range(n), train_score_lr, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='blue', markersize=10,
         label='Linear Regression')
plt.plot(range(n), train_score_knr, color='green', linestyle='dashed',
         marker='s', markerfacecolor='green', markersize=10,
         label='KNeighbors Regression')
plt.plot(range(n), train_score_svr, color='red', linestyle='dashed',
         marker='^', markerfacecolor='red', markersize=10,
         label='Support Vector Regression')
plt.grid()
plt.title('Mean cross-validation R2 score vs. random state parameter', size=14)
plt.xlabel('Random state parameter', size=14)
plt.ylabel('Mean cross-validation R2 score', size=14)
plt.legend(fontsize=14)
plt.show()
3.4 Model’s performance on test set
pipe_lr.fit(X_train, y_train_std)
pipe_knr.fit(X_train, y_train_std)
pipe_svr.fit(X_train, y_train_std)

# Note: inverse_transform expects a 2-D array in recent scikit-learn versions,
# hence the reshape/flatten around the 1-D predictions
r2_score_lr = r2_score(y_test,
                       sc_y.inverse_transform(pipe_lr.predict(X_test)[:, np.newaxis]).flatten())
r2_score_knr = r2_score(y_test,
                        sc_y.inverse_transform(pipe_knr.predict(X_test)[:, np.newaxis]).flatten())
r2_score_svr = r2_score(y_test,
                        sc_y.inverse_transform(pipe_svr.predict(X_test)[:, np.newaxis]).flatten())

print('R2 test for lr: %.3f' % r2_score_lr)
print('R2 test for knr: %.3f' % r2_score_knr)
print('R2 test for svr: %.3f' % r2_score_svr)
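As a design note, the manual scaling and inverse-scaling of the target with sc_y can be wrapped up by scikit-learn's TransformedTargetRegressor, which standardizes y during fit and automatically un-standardizes predictions. A minimal sketch on synthetic stand-in data (the numbers are illustrative, not from the cruise ship dataset):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cruise ship features and crew target
rng = np.random.RandomState(1)
X = rng.uniform(0, 100, size=(120, 4))
y = X @ np.array([0.05, 0.1, 0.2, 0.3]) + rng.normal(0, 1, size=120)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# The transformer scales y at fit time and inverts the scaling at predict time,
# so no explicit sc_y bookkeeping is needed
model = TransformedTargetRegressor(
    regressor=Pipeline([('scl', StandardScaler()), ('lr', LinearRegression())]),
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)
print('R2 test: %.3f' % r2_score(y_test, model.predict(X_test)))
```

This keeps the evaluation code identical for all three pipelines and avoids the easy mistake of forgetting the inverse transform.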
4. Model Application
Score your final model to generate predictions. Make your model available for production. Retrain your model as needed.
In this stage, the final machine learning model is selected and put into production. The model is evaluated in a production setting in order to assess its performance. Any discrepancies between the experimental model and its actual performance in production have to be analyzed and used to fine-tune the original model.
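One practical step when putting the model into production is persisting the fitted pipeline so a serving process can reload it without retraining. A minimal sketch using joblib (the file name and the synthetic training data are illustrative assumptions):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cruise ship features and crew target
rng = np.random.RandomState(2)
X = rng.uniform(0, 100, size=(50, 4))
y = X @ np.array([0.05, 0.1, 0.2, 0.3])

# Scaling + linear regression pipeline, as in section 3
model = Pipeline([('scl', StandardScaler()), ('lr', LinearRegression())])
model.fit(X, y)

# Persist the fitted pipeline so a production service can reload it
path = os.path.join(tempfile.gettempdir(), 'crew_model.joblib')
joblib.dump(model, path)
loaded = joblib.load(path)

# The reloaded pipeline reproduces the original predictions
print(np.allclose(model.predict(X), loaded.predict(X)))
```

Saving the whole pipeline (scaler plus regressor) rather than the regressor alone ensures that production inputs are preprocessed exactly as during training.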
Based on the results from section 3, we observe that Linear Regression and Support Vector Regression perform at almost the same level, and both perform better than KNeighbors Regression. So the final model selected could be either Linear Regression or Support Vector Regression.
In summary, we have discussed the main stages of a machine learning process and illustrated the practical steps involved. We have illustrated our calculations using a regression problem, but the same process applies equally to classification projects.
For questions and inquiries, please email me: firstname.lastname@example.org