In the previous article (Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot), we’ve shown that a covariance matrix plot can be used for feature selection and dimensionality reduction.
Using the cruise ship dataset cruise_ship_info.csv, we found that, out of the 6 predictor features [‘age’, ‘tonnage’, ‘passengers’, ‘length’, ‘cabins’, ‘passenger_density’], and assuming important features have a correlation coefficient of 0.6 or greater with the target variable, the target variable “crew” correlates strongly with 4 predictor variables: “tonnage”, “passengers”, “length”, and “cabins”. We were therefore able to reduce the dimension of our feature space from 6 to 4.
Now, suppose we want to build a model on the new feature space for predicting the crew variable. Our model can be expressed in the form:

crew = f(tonnage, passengers, length, cabins)

where f is a linear function whose coefficients are learned from the data.
In this article, we show how we can train, test, and evaluate our model using a method called k-fold cross-validation.
What is k-fold Cross-validation?
Cross-validation is a method of evaluating a machine learning model’s performance across random samples of the dataset. Averaging over several random partitions reduces the chance that one particular split biases the estimate. Cross-validation thus helps us obtain reliable estimates of the model’s generalization error, that is, how well the model performs on unseen data.
In k-fold cross-validation, the dataset is randomly partitioned into training and testing sets. The model is trained on the training set and evaluated on the testing set. This process is repeated k times, each time with a different random partition, and the training and testing scores are then averaged over the k folds.
Here is the k-fold cross-validation procedure in pseudocode (a minimal sketch of the steps implemented in the code below):
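for i = 1 to k:
    randomly split the dataset into a training set and a testing set
    standardize the features and the target using the training set
    fit the model on the training set
    predict on the training and testing sets
    compute and store the R² score for each set
average the k training scores and the k testing scores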
Implementing k-fold cross-validation
1. Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Read dataset and select important features
df = pd.read_csv("cruise_ship_info.csv")
cols_selected = ['Tonnage', 'passengers', 'length', 'cabins', 'crew']
3. Model building and evaluation using k-fold cross-validation
X = df[cols_selected].iloc[:, 0:4].values
y = df[cols_selected]['crew'].values

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

sc_y = StandardScaler()

train_score = []
test_score = []

for i in range(10):
    # new random 60/40 train/test split for each of the k = 10 folds
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=i)

    # standardize the target using the training fold only
    y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()

    # pipeline: standardize the features, then fit a linear regression
    pipe_lr = Pipeline([('scl', StandardScaler()),
                        ('slr', LinearRegression())])
    pipe_lr.fit(X_train, y_train_std)

    # predict in standardized units, then map back to the original scale
    y_train_pred = sc_y.inverse_transform(
        pipe_lr.predict(X_train)[:, np.newaxis]).flatten()
    y_test_pred = sc_y.inverse_transform(
        pipe_lr.predict(X_test)[:, np.newaxis]).flatten()

    # record the R² score on the training and testing sets
    train_score = np.append(train_score, r2_score(y_train, y_train_pred))
    test_score = np.append(test_score, r2_score(y_test, y_test_pred))
4. The output from model training, testing, and evaluation
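The fold scores collected in the loop can then be summarized; a minimal sketch, using the train_score and test_score arrays from above (the exact numbers depend on the random splits):

train_mean, train_std = np.mean(train_score), np.std(train_score)
test_mean, test_std = np.mean(test_score), np.std(test_score)
print('train score: %.3f +/- %.3f' % (train_mean, train_std))
print('test score: %.3f +/- %.3f' % (test_mean, test_std))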
In summary, we’ve shown how k-fold cross-validation can be used to evaluate the performance of a machine learning model. The scores from a cross-validation calculation are a good estimate of how the model will perform when deployed into production and tested against real unseen data. Cross-validation is therefore a powerful technique for obtaining reliable estimates of a model’s generalization error, that is, how well the model performs on unseen data.
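For comparison, scikit-learn also provides a built-in helper, cross_val_score, that performs a similar evaluation in a few lines. Here is a minimal sketch, assuming the same X and y as above; note that the target is left unstandardized (which does not change the R² scores for ordinary least squares) and that cv=10 uses 10 consecutive folds rather than repeated random 60/40 splits, so the individual fold scores will differ slightly:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# same pipeline as above: standardize the features, then fit a linear regression
pipe = Pipeline([('scl', StandardScaler()),
                 ('slr', LinearRegression())])

# scoring defaults to the estimator's score method, which is R² for regressors
scores = cross_val_score(pipe, X, y, cv=10)
print('CV R^2: %.3f +/- %.3f' % (scores.mean(), scores.std()))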
References
- Training a Machine Learning Model on a Dataset with Highly-Correlated Features.
- Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot.
- Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
- Benjamin O. Tayo, Machine Learning Model for Predicting a Ship’s Crew Size, https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size.