Autoencoder for Dimensionality Reduction
Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data.
An autoencoder is a neural network that learns to copy its input to its output. It has an internal (hidden) layer that describes a code used to represent the input, and it consists of two main parts: an encoder that maps the input into the code, and a decoder that maps the code to a reconstruction of the original input.
Import the libraries
from sklearn.datasets import make_blobs
import pandas as pd
import matplotlib.pyplot as plt
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler
Create sample data
We will create sample data using sklearn's built-in function make_blobs.
data = make_blobs(n_samples=2000, n_features=20, centers=5)
This creates data with 2000 samples and 20 features (columns), grouped into 5 clusters.
print(data[0])
[[ 5.5438103 -8.05147034 -3.57636095 … 9.34667082 1.66723373
0.94848143]
[ 4.21084033 -0.06437965 -9.08081011 … -8.1160412 1.14398545
0.21960114]
[ 2.67190904 4.1588543 4.62791588 … -10.38424341 -7.69808132
-0.05888503]
…
[ 2.53794745 1.13510935 -11.61939792 … -8.57644854 0.71590736
-0.03426078]
[ 4.28830608 1.42536499 -10.41514261 … -7.98307924 -0.41028617
-0.2511404 ]
[ 2.45934372 -4.85194789 8.55294372 … 2.62053799 4.06873245
-2.48642709]]
print(data[1])
[1 4 0 … 4 4 2]
data[0] contains the data and data[1] contains labels.
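It can help to confirm the shapes before going further. A quick check (regenerating the data with the same make_blobs call, so the exact values will differ from the output above):

```python
from sklearn.datasets import make_blobs

data = make_blobs(n_samples=2000, n_features=20, centers=5)
print(data[0].shape)  # feature matrix: (2000, 20)
print(data[1].shape)  # cluster labels: (2000,)
```

The labels in data[1] are integers 0 through 4, one per cluster.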
Scaling the data
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data[0])
This scales every feature to the range 0 to 1.
(Do not apply this to the labels!)
Finally!
Creating the autoencoder
We will reduce the dimensions from 20 to 2 and will try to plot the encoded data.
m = Sequential()
m.add(Dense(20, activation='elu', input_shape=(20,)))
m.add(Dense(10, activation='elu'))
m.add(Dense(2, activation='linear', name="bottleneck"))
m.add(Dense(10, activation='elu'))
m.add(Dense(20, activation='sigmoid'))
m.compile(loss='mean_squared_error', optimizer=Adam())
history = m.fit(scaled_data, scaled_data, batch_size=128, epochs=20, verbose=1)
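It is worth plotting the training loss to confirm the reconstruction error actually decreases over the epochs. A minimal sketch of the idiom, using a small stand-in network and random data (assuming the same Keras API as above; the real curve comes from the history object returned by m.fit):

```python
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

# Small stand-in autoencoder on random data, just to show the loss-curve idiom
x = np.random.rand(256, 20)
m = Sequential()
m.add(Dense(10, activation='elu'))
m.add(Dense(20, activation='sigmoid'))
m.compile(loss='mean_squared_error', optimizer='adam')
history = m.fit(x, x, batch_size=64, epochs=5, verbose=0)

# history.history['loss'] holds one MSE value per epoch
plt.plot(history.history['loss'])
plt.xlabel('epoch')
plt.ylabel('reconstruction MSE')
plt.show()
```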
Get the encoded and decoded data
encoder = Model(m.input, m.get_layer('bottleneck').output)
data_enc = encoder.predict(scaled_data) # bottleneck representation
data_dec = m.predict(scaled_data) # reconstruction
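A simple way to quantify reconstruction quality is the mean squared error between scaled_data and data_dec. The snippet below sketches the computation on hypothetical stand-in arrays (in the article's pipeline, the real arrays come from the model above):

```python
import numpy as np

# Stand-ins for scaled_data and its reconstruction data_dec
rng = np.random.default_rng(0)
scaled_data = rng.random((2000, 20))
data_dec = np.clip(scaled_data + rng.normal(0, 0.01, (2000, 20)), 0, 1)

# Overall and per-sample reconstruction MSE
mse = np.mean((scaled_data - data_dec) ** 2)
per_sample = np.mean((scaled_data - data_dec) ** 2, axis=1)
print(mse, per_sample.shape)
```

A low overall MSE means the decoder can rebuild the 20 features from just the 2 bottleneck values.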
Plotting the data
plt.scatter(data_enc[:, 0], data_enc[:, 1], c=data[1], s=8, cmap='tab10')
plt.tight_layout()
plt.show()
As we can see, the autoencoder has retained the cluster structure even after reducing the dimensions from 20 to 2.
Conclusion
Autoencoders perform very well and retain much of the information of the original data set. They do have drawbacks in computation cost and hyperparameter tuning, but the trade-off is a more faithful low-dimensional representation. They are great for visualizing data, since most of the structure is preserved in 2 or 3 dimensions.