Bank Data: PCA

Zaki Jefferson · Analytics Vidhya · Aug 24, 2020

The dataset used in this example is available on Kaggle. This post walks through the process of applying PCA to the bank data.

What is PCA?

PCA, Principal Component Analysis, is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables called principal components.

PCA is a tool mostly used for exploratory data analysis (EDA) and in machine learning predictive modeling. You can also use PCA for dimensionality reduction, also known as feature extraction. This becomes useful when you want to simplify a dataset by reducing the number of features it contains: fewer features lower the computational complexity of the model, so your models run faster. A small sketch of this is shown below.
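As a quick illustration of that reduction, sklearn's PCA accepts a float between 0 and 1 for n_components and keeps just enough components to explain that fraction of the variance. A minimal sketch on synthetic data (the shapes and seed here are made up purely for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data: 200 samples, 26 features (illustrative only)
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 26)) + 0.1 * rng.normal(size=(200, 26))

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # roughly (200, 5): far fewer columns than 26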

Step 1: Scale

Always remember to scale your data before performing PCA. Scaling matters because PCA picks out directions of maximum variance, so a variable measured on a larger scale will dominate the components simply because of its units, not because it is more informative.

In this step, our dependent variable is Loan Status: we perform a train test split and then use MinMaxScaler to scale the data, as sketched below.
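Here is a sketch of what that looks like, assuming the Kaggle data is already loaded in a DataFrame df with a 'Loan Status' column (the column name and split sizes are assumptions, not taken from the original code):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 'Loan Status' is the dependent variable; everything else is a feature
X = df.drop(columns=['Loan Status'])
y = df['Loan Status']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Fitting the scaler on the training split alone avoids leaking information from the test set into the scaling.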

We use MinMaxScaler over StandardScaler because our data is not Gaussian-distributed, which StandardScaler expects in order to work properly. We also won't be using any machine learning models that assume Gaussian-distributed data. This saves us from transforming the data: one less step to do.

Step 2: PCA

After the data is scaled, we can move on to performing PCA. We use the PCA class from sklearn and fit it on our training data, as sketched below.
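Continuing from the scaled arrays above, a minimal sketch (keeping all components for now so we can inspect the full variance breakdown):

from sklearn.decomposition import PCA

# Fit PCA on the scaled training data; transform both splits
pca = PCA()
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)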

pca.explained_variance_ratio_

The attribute above reports the proportion of the total variance explained by each principal component, in decreasing order. The higher the ratio, the more of the original information that component captures.
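For instance, to read off how many components are needed to reach the 95% threshold used in the plot below, a small sketch on the pca object fitted above:

import numpy as np

# Cumulative variance explained by the first k components, for each k
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95)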

We can also graph these ratios to see how many components we would want to keep.

import matplotlib.pyplot as plt
import numpy as np

# Cumulative explained variance per number of components
y = np.cumsum(pca.explained_variance_ratio_)
xi = np.arange(1, len(y) + 1)  # 1-based, human-readable component labels

# Setting width and height
fig, ax = plt.subplots(figsize=(15, 10))

plt.ylim(0.0, 1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')
plt.xlabel('Number of components')
plt.xticks(xi)
plt.ylabel('Cumulative explained variance')
plt.title('The number of components needed to explain variance')

# Mark the 95% cut-off threshold
plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color='red', fontsize=16)

ax.grid(axis='x')
plt.show()
