Autoencoder for Dimensionality Reduction

Sanket Barhate
May 11, 2020

Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data.

An autoencoder is a neural network that learns to copy its input to its output. It has an internal (hidden) layer that describes a code used to represent the input, and it consists of two main parts: an encoder that maps the input into the code, and a decoder that maps the code to a reconstruction of the original input.

Import the libraries

from sklearn.datasets import make_blobs
import pandas as pd
import matplotlib.pyplot as plt
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler

Create sample data

We will create sample data using sklearn’s built-in function make_blobs.

data = make_blobs(n_samples=2000, n_features=20, centers=5)

This creates a dataset of 2000 samples and 20 features (columns), grouped into 5 clusters.

print(data[0])

[[ 5.5438103 -8.05147034 -3.57636095 … 9.34667082 1.66723373
0.94848143]
[ 4.21084033 -0.06437965 -9.08081011 … -8.1160412 1.14398545
0.21960114]
[ 2.67190904 4.1588543 4.62791588 … -10.38424341 -7.69808132
-0.05888503]

[ 2.53794745 1.13510935 -11.61939792 … -8.57644854 0.71590736
-0.03426078]
[ 4.28830608 1.42536499 -10.41514261 … -7.98307924 -0.41028617
-0.2511404 ]
[ 2.45934372 -4.85194789 8.55294372 … 2.62053799 4.06873245
-2.48642709]]

print(data[1])

[1 4 0 … 4 4 2]

data[0] contains the feature matrix and data[1] contains the cluster labels.
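As a quick sanity check, we can inspect the shapes (the values in the comments assume the make_blobs call above):

print(data[0].shape)   # (2000, 20) -> feature matrix
print(data[1].shape)   # (2000,)    -> cluster labels from 0 to 4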

Scaling the data

df = pd.DataFrame(data[0])   # features only; the labels in data[1] are left untouched

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)

This scales every value to the range between 0 and 1.

(Do not apply this to the labels!)
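A quick way to confirm the scaling worked (a simple check, not part of the original walkthrough):

print(scaled_data.min(), scaled_data.max())   # expected: 0.0 1.0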

Finally!

Creating the autoencoder

We will reduce the dimensionality from 20 to 2 and plot the encoded data.

m = Sequential()
m.add(Dense(20, activation='elu', input_shape=(20,)))    # encoder
m.add(Dense(10, activation='elu'))                        # encoder
m.add(Dense(2, activation='linear', name="bottleneck"))   # 2-D code
m.add(Dense(10, activation='elu'))                        # decoder
m.add(Dense(20, activation='sigmoid'))                    # reconstruction (data is scaled to [0, 1])
m.compile(loss='mean_squared_error', optimizer=Adam())
history = m.fit(scaled_data, scaled_data, batch_size=128, epochs=20, verbose=1)
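To see whether the model has converged, we can plot the reconstruction loss per epoch from the history object returned by fit (a minimal sketch):

plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('MSE loss')
plt.show()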

Get the encoded and decoded data

encoder = Model(m.input, m.get_layer('bottleneck').output)
data_enc = encoder.predict(scaled_data) # bottleneck representation
data_dec = m.predict(scaled_data) # reconstruction
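To put a number on how much information survives the bottleneck, we can compute the mean squared reconstruction error ourselves (a rough check, assuming the variables defined above):

import numpy as np

mse = np.mean((scaled_data - data_dec) ** 2)
print(mse)   # average reconstruction error over the scaled data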

Plotting the data

plt.scatter(data_enc[:, 0], data_enc[:, 1], c=data[1], s=8, cmap='tab10')
plt.tight_layout()
plt.show()

As we can see, the autoencoder has retained the cluster structure even after reducing the dimensions from 20 to 2.
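Since the information survives the 2-D bottleneck, we can also build a standalone decoder by reusing the trained layers and map any 2-D code back to 20 dimensions. This is a sketch of one way to do it, not something from the original post:

from keras.layers import Input

code_input = Input(shape=(2,))          # a 2-D code
x = m.layers[-2](code_input)            # reuse the trained Dense(10, 'elu') layer
x = m.layers[-1](x)                     # reuse the trained Dense(20, 'sigmoid') layer
decoder = Model(code_input, x)

reconstructed = decoder.predict(data_enc)   # 20-D reconstructions from the 2-D codes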

Conclusion

Autoencoders perform very well at this task and retain most of the structure of the original data set. They do cost more in computation and tuning than simpler methods, but the trade-off is a more faithful low-dimensional representation. They are great for visualizing data, since much of the information can be preserved in just 2 or 3 dimensions.
