Compressing 4 variables into 3 using Autoencoders — Data_compression[0]

Rushabh Vasani · Published in Analytics Vidhya · May 11, 2020

In this article, we will learn how to train a model that encodes the information of 4 variables into 3, stores the encoded values on disk, and then decodes all 4 variables back whenever required.

source: https://pixabay.com/illustrations/analytics-information-innovation-3088958/

Introduction

Compressing data has always been a major concern for optimizing data transfer and storage, especially given how quickly the volume of data keeps growing. In this article I will explain my approach to compressing (encoding) the data of 4 variables into 3, storing the encoded values on disk, and then decompressing (decoding) them to recover the original data whenever required. I will also share the minor tweaks I made to get the encoder to perform better. This can be considered a smaller version of image compression, since images are simply matrices of numbers arranged in a meaningful order. In fact, the main motivation is to try and apply similar approaches to image compression in the future.

Spoiler: I will also write on Image Compression soon. 😜

Sub-problems

  1. Preprocessing: To extract useful information from the given data to make the model learn better.
  2. Model definition: Deciding and defining the perfect neural network architecture for the task.
  3. Training: Training the model with the gradient descent approach.

1. Preprocessing

This step refers to manipulating the input data and extracting useful information to feed to the neural network so that it can learn better. We will do the preprocessing in 2 steps.

A. Normalization

(This is a familiar concept for anyone with a deep learning background. If you already know what normalization is and why we use it, feel free to skip directly to the next step.)

What is Normalization?

Normalization means rescaling the inputs to have a mean of 0 and a standard deviation of 1. This puts the data on a common scale, reduces the impact of outliers, and makes the learning process much easier, smoother, and faster. It can be done as in the following code snippet:

# train contains the raw (initial) data
mean = train.mean()
std = train.std()
train_data = (train - mean) / std
test_data = (test - mean) / std

Normalizing the data doesn't destroy any information or distort the predictions. When your model later predicts the original data from the 3 encoded variables, you can simply multiply the model's output by the std and add back the mean to recover the data at its original scale.
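For instance, using the mean and std computed in the snippet above, undoing the normalization is a one-line operation (model_output here is just a placeholder name for the decoder's output):

# Reverse the normalization to get back to the original scale
original_scale = model_output * std + mean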

B. Adding singular values

Singular value decomposition (SVD) is a mathematical process (basically a set of matrix operations) that factorizes a matrix into 3 smaller matrices; by keeping only the leading components, it can express the matrix in a lower dimension of our choice. One of those matrices contains the singular values of the original matrix. I tried concatenating different combinations of those 3 matrices with the original data, but the singular values alone worked better than every other combination. If you don't know anything about SVD, it's really worth having in your toolbox. You can learn more about singular values in this Jupyter notebook, and you can run the same notebook yourself on this Google Colab.

So what we will do is first construct a (4 x 4) diagonal matrix from the 4 variables (the 4 numbers sit on the diagonal of the matrix). Then we will apply SVD to express that matrix with 3 components (3 because we want to compress our variables into 3), which returns 3 matrices. I tried concatenating different combinations of those matrices with the original data, but using only the singular values (the data from just one of the matrices, i.e. only 3 numbers, its diagonal elements) turned out to perform the best for me, as the sketch below illustrates.
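Here is a minimal sketch of how the 7 input features could be assembled for one data row, assuming NumPy; the variable values are made up for illustration:

import numpy as np

x = np.array([0.8, -1.2, 0.3, 2.1])  # one normalized data row (hypothetical values)

# Build a 4 x 4 diagonal matrix with the 4 variables on its diagonal
diag = np.diag(x)

# SVD factorizes the matrix into U, the singular values, and V^T;
# we keep only the 3 largest singular values
_, singular_values, _ = np.linalg.svd(diag)
top3 = singular_values[:3]

# Concatenate the 3 singular values with the 4 original variables -> 7 features
features = np.concatenate([x, top3])  # shape: (7,)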

“SVD is not nearly as famous as it should be.” — Gilbert Strang

Note: Initially I used Principal Component Analysis (PCA) instead of SVD because I thought it would perform better, but surprisingly SVD turned out to be slightly better in the comparison. SVD is also what is used to implement PCA in practice, so using SVD directly saves a bit of computation as well. You can learn more about PCA in this notebook, and you can run the same notebook yourself on this Google Colab.

Why did I think PCA or SVD might help?

Because the whole idea behind SVD and PCA is to compress the dimensions of the data while keeping as much information as possible, and our task is similar. But 4 separate variables can't be compressed into 3 using SVD or PCA directly, because their outputs are three different matrices, which cannot by themselves reduce the data to just 3 variables. (So, my takeaway is that compression using SVD/PCA is more effective on larger matrices.) So I thought: why not first turn the variables into a 2D matrix (the diagonal matrix I described earlier), compress that matrix using SVD or PCA, and concatenate some of the compressed data with the original data to see whether it helps the model compress further. The singular values (one of the three output matrices of SVD) turned out to help.

2. Model Definition

This step refers to trying to find the best possible neural network architecture to make the outcomes optimal.

So, as the title says, we will use an Autoencoder. An Autoencoder consists of 2 internal neural network components called the Encoder and the Decoder.

Source: https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_structure.png
  1. Encoder: The set of sequential layers of the neural network that encodes the data.
  2. Decoder: The layers that reconstruct the original data (the decoded/decompressed data) from the code.

A simple Autoencoder for our task can be defined in PyTorch as:

import torch.nn as nn

class Autoencoder(nn.Module):
    # 'in_features' is set to 7 because we also concatenate the
    # 3 singular values with the 4 original variables.
    def __init__(self, in_features=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_features, 128),
            nn.BatchNorm1d(128),
            nn.Tanh(),
            nn.Linear(128, 3),
            nn.Tanh()
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 128),
            nn.BatchNorm1d(128),
            nn.Tanh(),
            nn.Linear(128, 4),
            nn.Tanh()
        )

    # In PyTorch, the forward() method is the actual network
    # transformation: it maps an input tensor to a prediction output tensor.
    def forward(self, x):
        encoded = self.encoder(x)
        # 'encoded' (the 3 compressed values) is what gets saved to disk
        decoded = self.decoder(encoded)
        return decoded

Note: Here we have added only a few layers for simplicity; in practice there might be more, i.e. a bunch of nn.Linear, nn.BatchNorm1d, and nn.Tanh layers. nn.Tanh is there to add non-linearity to the network.
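To make the end-to-end idea concrete, here is a minimal sketch of the encode → store on disk → load → decode workflow, assuming a trained Autoencoder as defined above; the file name and the input row are made up for illustration:

import torch

model = Autoencoder(in_features=7)
model.eval()

with torch.no_grad():
    row = torch.randn(1, 7)                # one preprocessed row (4 variables + 3 singular values)
    encoded = model.encoder(row)           # shape (1, 3): the compressed representation
    torch.save(encoded, "encoded_row.pt")  # store only the 3 encoded values on disk

    # Later, whenever the original data is needed again:
    loaded = torch.load("encoded_row.pt")
    decoded = model.decoder(loaded)        # shape (1, 4): reconstruction of the 4 variables
    # Finally, multiply by std and add mean to return to the original scale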

Why non-linearity?

Because without non-linearity the model would just be a stack of linear functions (the layers), and a composition of linear functions is itself a linear function.

There are other non-linear activations such as nn.ReLU as well, but nn.Tanh turned out to work better in this case in our tests.

Why Batchnorm?

Just as we normalize the input data to make learning easier for the model, Batchnorm does the same thing for the hidden layers: it normalizes the activations to make the learning process more stable. During training some activations can explode, and Batchnorm rescales the values in the hidden layers back to a reasonable range.

3. Training

This step refers to finally training the model and fine-tuning the parameters to get the best out of the model.

I used the Cyclical Learning Rates (CLR) approach. This method lets the learning rate vary cyclically between reasonable boundaries in order to optimize training. You can learn more about CLR here.

I used mean squared error as the loss function. After some minor tweaks to parameters such as the learning rate, number of epochs, and weight decay, I started getting pretty good results.
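For reference, a minimal training-loop sketch with a cyclical learning rate might look like the following; train_loader and all hyperparameter values here are assumptions for illustration, not the exact ones used in the experiments:

import torch
import torch.nn as nn

model = Autoencoder(in_features=7)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-5)
# Cyclical learning rate: the LR oscillates between base_lr and max_lr
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=200
)

for epoch in range(50):
    for batch in train_loader:               # batch shape: (batch_size, 7)
        optimizer.zero_grad()
        reconstructed = model(batch)         # shape: (batch_size, 4)
        # Compare against the 4 original variables (assumed to be the first 4 columns)
        loss = criterion(reconstructed, batch[:, :4])
        loss.backward()
        optimizer.step()
        scheduler.step()                     # update the cyclical learning rate every batch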

Results

A. Performance of the different approaches

screenshot from my white paper

B. Validation Losses

screenshot from the jupyter notebook

The graph of the row-wise average Mean Squared Error loss for each data entry in the test set. The average loss is 0.0073.

screenshot from the jupyter notebook

The graph plots the row-wise [(expected − result) / expected] values for each data entry in the test set, for every column separately.

Note: In the last graph, the range goes up to 10³ because [expected − result] is divided by very small expected values.
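As a rough sketch, the two quantities behind these plots could be computed as follows, assuming expected and result are (n_rows, 4) NumPy arrays holding the original and reconstructed test data:

import numpy as np

# Row-wise mean squared error: one value per test entry
row_mse = ((expected - result) ** 2).mean(axis=1)
print("average loss:", row_mse.mean())

# Row-wise relative error, per column; this blows up when 'expected' is tiny,
# which is why the plot's range reaches the order of 10^3
relative_error = (expected - result) / expected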

Conclusion

Minor tweaks like these to the data and the architecture can noticeably improve a model's quality.

  • You can also read the white paper I wrote on this topic below.
  • Find the code in the following GitHub repository.

Play around with the code on Google Colab here.

References

  1. L. N. Smith, Cyclical Learning Rates for Training Neural Networks (April 2017)
  2. https://en.wikipedia.org/wiki/Singular_value_decomposition

P.S. Data_compression[1], on Image Compression, will be published soon 🤓

“Torture the data, and it will confess to anything.”
