Machine Learning: Dimension Reduction

Emre Can Yesilyurt
Machine Learning Turkiye
May 7, 2020

Hello! This article explains the benefits of dimension reduction and shows how to apply it.

Today, even a simple IoT sensor produces multidimensional data. The first problem we encounter is visualizing multidimensional datasets: it is impossible to visualize a twenty-dimensional dataset directly, but reducing it to two or three dimensions makes visualization possible. In addition, calculating distances between data points in twenty dimensions is expensive. In other words, processing multidimensional data carries a high computational cost.

Dimension reduction means looking at the data from a different point of view. It can be thought of as taking a picture of the data from a new angle. Looking at the data from a different perspective can reveal new qualities and features.

(Image source: visiondummy.com)

While extracting new features (feature extraction), we can also lose some information. For example, examine the following sample data, reduced from 3 dimensions to 2.

(Image source: https://www.osgdigitallabs.com/blogs/2018/4/3/dimensionality-reduction)

When you examine the image, you will see that some data points disappear when the 3-dimensional dataset is reduced to 2 dimensions. This happens because overlapping data points are projected onto the same location in the new space. I must stress that Principal Component Analysis (PCA) does not guarantee that no information is lost.
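
To make this concrete, here is a minimal sketch (my own illustration, with made-up coordinates) of two distinct 3-dimensional points collapsing onto the same 2-dimensional point when the third coordinate is dropped:

import numpy as np

# Two distinct points in 3-D that differ only along the third axis.
a = np.array([1.0, 2.0, 5.0])
b = np.array([1.0, 2.0, -3.0])

# Project onto the first two axes by dropping the third coordinate.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(P @ a)  # [1. 2.]
print(P @ b)  # [1. 2.] -> the two points now overlap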

In its simplest form, Principal Component Analysis works by finding the directions along which the data varies the most, each expressed as a linear combination of the original features, and projecting the data onto the top few of them. It uses the eigenvectors and eigenvalues of the covariance matrix to perform these operations.
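
As a rough sketch of that idea (my own illustration on random data, not part of the original post), PCA can be written in a few lines of numpy by eigendecomposing the covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features

X_centered = X - X.mean(axis=0)          # PCA assumes centered data
cov = np.cov(X_centered, rowvar=False)   # 5x5 covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # largest variance first
components = eigvecs[:, order[:2]]       # keep the top 2 directions

X_reduced = X_centered @ components      # project 5-D data down to 2-D
print(X_reduced.shape)                   # (100, 2)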

For a more rigorous statistical explanation, you can watch the video below. The purpose of this article is to build an essential intuition.

How to use PCA?

Let’s reduce the Marketing campaign dataset from 29 dimensions to 3 dimensions. This way, it will be possible to visualize the data in 3D.

Importing libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sys
if not sys.warnoptions:
    # Silence warnings unless the interpreter was started with -W options.
    import warnings
    warnings.simplefilter("ignore")

from sklearn.preprocessing import LabelEncoder

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

py.init_notebook_mode(connected=True)

Routine operations.

dataset = pd.read_csv('marketing_campaign.csv', sep=';')
dataset.info()                  # inspect column types and missing values
dataset.dropna(inplace=True)    # drop rows with missing values

Data preprocessing

dataset["Dt_Customer"] = pd.to_datetime(dataset["Dt_Customer"])
dates = []
for i in dataset["Dt_Customer"]:
i = i.date()
dates.append(i)

days = []
d1 = max(dates)
for i in dates:
delta = d1 - i
days.append(delta)
dataset["Customer_For"] = days
dataset["Customer_For"] = pd.to_numeric(dataset["Customer_For"], errors="coerce")
dataset["Living_With"] = dataset["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})dataset["Education"] = dataset["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Graduate", "PhD":"Graduate"})le = LabelEncoder()
dataset['Education'] = dataset[['Education']].apply(le.fit_transform)
dataset['Living_With'] = dataset[['Living_With']].apply(le.fit_transform)
to_drop = ["Marital_Status", "Dt_Customer", "ID", ]
dataset = dataset.drop(to_drop, axis=1)

Before applying PCA, the features are scaled to a common range; standardizing to zero mean and unit variance keeps large-scale features from dominating the components.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
scaled_data = sc.fit_transform(dataset)   # zero mean, unit variance per feature

We are ready!

from sklearn.decomposition import PCA

# Project the 29-dimensional data onto its top 3 principal components.
pca = PCA(n_components=3)
pca.fit(scaled_data)
pca_data = pd.DataFrame(pca.transform(scaled_data),
                        columns=["col1", "col2", "col3"])
pca_data.describe().T
pca_data.head()

# 3-D scatter plot of the projected data.
scene = dict(xaxis=dict(title='Col1'),
             yaxis=dict(title='Col2'),
             zaxis=dict(title='Col3'))
trace = go.Scatter3d(x=pca_data['col1'], y=pca_data['col2'], z=pca_data['col3'],
                     mode='markers',
                     marker=dict(color='blue', size=10,
                                 line=dict(color='black', width=10)))
layout = go.Layout(margin=dict(l=0, r=0), scene=scene, height=800, width=800)
fig = go.Figure(data=[trace], layout=layout)
fig.show()
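
A useful follow-up, not shown in the original walkthrough, is to check how much of the total variance the three components actually retain:

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained in 3-D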

As I mentioned above, this process is not just for visualization. Imagine you have a 100-dimensional dataset. Reducing it to 10 dimensions will cut the computational cost, but keep in mind that model performance may degrade. In some cases, the lower computational cost is worth the lower accuracy.
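
If you are unsure how many dimensions to keep, scikit-learn's PCA also accepts a fraction for n_components and keeps the smallest number of components that explains at least that share of the variance. A quick sketch, reusing scaled_data from above:

from sklearn.decomposition import PCA

# Keep just enough components to explain at least 95% of the variance.
pca_95 = PCA(n_components=0.95)
reduced = pca_95.fit_transform(scaled_data)
print(pca_95.n_components_)  # how many components were kept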

Are there methods other than PCA?

Linear Discriminant Analysis (LDA)

LDA is similar to PCA in many respects. PCA, in its first step, deals only with the data points and ignores class differences; it does not matter which class a data point belongs to. In this sense, PCA is an unsupervised algorithm.

LDA, on the other hand, starts from the opposite premise. It is based on class distinctions and tries to find the projection that best separates the classes from each other. In this sense, LDA is a supervised algorithm: it maximizes the separation between classes, and the data must be labeled to use it.

(Image source: https://sebastianraschka.com/Articles/2014_python_lda.html#step-3-solving-the-generalized-eigenvalue-problem-for-the-matrix-s_w-1s_b)

The only difference from PCA in terms of usage is that you pass the class labels during training. Therefore, I only show the training step instead of a detailed walkthrough.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X_train holds the features, y_train the class labels.
lda = LinearDiscriminantAnalysis(n_components=3)
lda_data = lda.fit_transform(X_train, y_train)
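
One caveat worth knowing: scikit-learn caps LDA's n_components at min(n_classes - 1, n_features), so reducing to 3 components as above requires a target with at least four classes.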

Thank you for reading!
