# Training a Machine Learning Model on a Dataset with Highly-Correlated Features

In the previous article (Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot), we’ve shown that a covariance matrix plot can be used for feature selection and dimensionality reduction.

Using the cruise ship dataset cruise_ship_info.csv, we found that out of the 6 predictor features [‘age’, ‘tonnage’, ‘passengers’, ‘length’, ‘cabins’, ‘passenger_density’], if we assume important features have a correlation coefficient of 0.6 or greater with the target variable, then the target variable “crew” correlates strongly with 4 predictor variables: “tonnage”, “passengers”, “length”, and “cabins”.

We were therefore able to reduce the dimension of our feature space from 6 to 4.

Now, suppose we want to build a model on the new feature space for predicting the crew variable. Our model can be expressed in the form:

y = Xw

where X is the feature matrix, and w the weights to be learned during training.

Looking at the covariance matrix plot, we also see that there are strong correlations among the features (predictor variables) themselves.

# How do we deal with the problem of correlations between features?

In this article, we shall use a technique called Principal Component Analysis (PCA) to transform our features into a space where the features are independent or uncorrelated. We shall then train our model on the PCA space. You may find out more about PCA from this article: Machine Learning: Dimensionality Reduction via Principal Component Analysis.
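To see why PCA removes these correlations, here is a minimal sketch on synthetic data (a stand-in for the real dataset, which is available on GitHub): four highly-correlated features go in, and the PCA-transformed components come out uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, highly-correlated features (a stand-in for tonnage/passengers/length/cabins)
rng = np.random.RandomState(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 1)) for _ in range(4)])

# Off-diagonal correlations between the raw features are close to 1
print(np.corrcoef(X, rowvar=False).round(2))

# After the PCA transform, the sample correlations between components vanish
Z = PCA(n_components=4).fit_transform(X)
corr = np.corrcoef(Z, rowvar=False).round(2)
print(corr)
```

The principal component scores are uncorrelated by construction: PCA diagonalizes the sample covariance matrix, so each component carries variance along a direction orthogonal to all the others.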

# Training the Model on the PCA Space

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
df = pd.read_csv("cruise_ship_info.csv")
df.head()
```
```python
cols_selected = ['Tonnage', 'passengers', 'length', 'cabins', 'crew']
df[cols_selected].head()
```
```python
from sklearn.model_selection import train_test_split

X = df[cols_selected].iloc[:, 0:4].values
y = df[cols_selected]['crew']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
```
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

sc_y = StandardScaler()  # scaler for the target variable (missing in the original listing)
train_score = []
test_score = []
cum_variance = []

for i in range(1, 5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

    # Standardize the target for training
    y_train_std = sc_y.fit_transform(y_train.values[:, np.newaxis]).flatten()

    pipe_lr = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=i)),
                        ('slr', LinearRegression())])
    pipe_lr.fit(X_train, y_train_std)

    # Predict in standardized units, then transform back to the original scale
    y_train_pred = sc_y.inverse_transform(pipe_lr.predict(X_train)[:, np.newaxis]).flatten()
    y_test_pred = sc_y.inverse_transform(pipe_lr.predict(X_test)[:, np.newaxis]).flatten()

    train_score = np.append(train_score, r2_score(y_train, y_train_pred))
    test_score = np.append(test_score, r2_score(y_test, y_test_pred))
    cum_variance = np.append(cum_variance,
                             np.sum(pipe_lr.named_steps['pca'].explained_variance_ratio_))
```
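A small table makes the loop's results easier to read. This is a sketch of how the summary could be assembled; the score values below are illustrative placeholders only, and in practice you would substitute the `train_score`, `test_score`, and `cum_variance` arrays produced by the loop above.

```python
import numpy as np
import pandas as pd

# Illustrative placeholder values ONLY -- substitute the arrays from the training loop
train_score = np.array([0.90, 0.91, 0.92, 0.92])
test_score = np.array([0.88, 0.89, 0.90, 0.90])
cum_variance = np.array([0.95, 0.98, 0.99, 1.00])

results = pd.DataFrame({'n_components': range(1, 5),
                        'train_R2': train_score,
                        'test_R2': test_score,
                        'cum_variance': cum_variance})
print(results.to_string(index=False))
```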

Here is the output of the regression model on the PCA space (the train and test R² scores and the cumulative explained variance for each number of components):

Based on this summary, we see that 95 percent of the variance is contributed by the first principal component alone. This means that in the final model, only the first principal component PC1 could be used since the other 3 components PC2, PC3, and PC4 contribute only about 5% of the total variance.
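A final model along these lines would keep only the first principal component. The sketch below shows the idea on synthetic correlated data (the variable names and the data-generating setup are assumptions, not the article's actual dataset); with the real data, `X` and `y` would come from the train/test split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in: four correlated predictors driven by one latent "ship size"
rng = np.random.RandomState(0)
size = rng.normal(loc=50, scale=10, size=(158, 1))
X = np.hstack([size + rng.normal(scale=2, size=(158, 1)) for _ in range(4)])
y = 0.1 * size.ravel() + rng.normal(scale=0.5, size=158)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Final model: scale, keep only the first principal component, then regress
final_model = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=1)),
                        ('slr', LinearRegression())])
final_model.fit(X_train, y_train)

print(final_model.named_steps['pca'].explained_variance_ratio_)  # PC1 dominates
print(final_model.score(X_test, y_test))
```

Because the predictors all track one latent quantity, PC1 captures most of the variance and a single-component model loses little predictive power.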

PCA for dimensionality reduction may not seem like a big deal for a dataset with 4 features, but for a complex dataset with hundreds or even thousands of features, PCA is a powerful tool: it removes correlations between features and reduces the computational cost of model training, testing, and evaluation.

In summary, we have shown how the PCA algorithm can be implemented on the cruise ship dataset using Python’s scikit-learn package. PCA is a powerful tool for model building, especially when dealing with a complex dataset with highly-correlated features. You can download the entire dataset and code for this article on GitHub.

# References

1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
2. Benjamin O. Tayo, Machine Learning Model for Predicting a Ship’s Crew Size, https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size.

Written by

## Towards AI
