Paper summary on AET vs AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data

Rohit Arora
3 min read · Jun 11, 2020


Supervised learning is mature but expensive, and unsupervised learning is its natural counterpart. The main limitation of supervised methods, and the motivation for research in the opposite direction, is data: for many problems it is simply not feasible to curate a properly structured, labeled dataset, so the otherwise reliable supervised path does not suffice. The long-standing weakness of the unsupervised path, in turn, had been its results — until this paper. The paper addresses exactly that gap with Auto-Encoding Transformations (AETs). By learning deep feature representations without restriction on the transformations used, AETs deliver reliability and results on par with state-of-the-art supervised models.

Throughout the paper, the authors briefly but repeatedly revisit prior work: unsupervised methods, via Zhang et al.'s work on contrastive auto-encoders and J. Donahue et al.'s GAN-based adversarial feature learning, and self-supervised representation learning, via Doersch et al.'s and Dosovitskiy et al.'s work on extracting a self-supervised signal by selecting random patches from images and applying various transformations to create surrogate classes for training. Although parts of the paper are redundant, the authors build a tightly knit narrative that leads up to their proposal of AETs.

The key feature of AETs is that they reveal not only the static visual structures in an image but also how those structures change under transformation. They do this by sampling transformations from a distribution, applying them to the data samples, and training the encoder to extract representations from both the original and the transformed images, from which a decoder reconstructs the applied transformation. Crucially, there are no restrictions on the nature of the transformations an AET can use: the more flexibility there is in exploring transformations, the more reliable the model becomes and the better the learned features.
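The training signal above can be sketched in a few lines. This is a toy, dependency-light illustration with numpy, not the paper's actual network: the linear `encoder` and `decoder` stand in for convolutional networks, and 90-degree rotations stand in for the general transformation distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate90(img, k):
    """Sampled transformation: one of four 90-degree rotations, parameter k in {0,1,2,3}."""
    return np.rot90(img, k)

def encoder(img, W):
    """Toy linear encoder (stand-in for a conv net): flatten and project to features."""
    return W @ img.ravel()

def decoder(f_orig, f_trans, V):
    """Toy decoder: predict the transformation parameter from both feature vectors."""
    return V @ np.concatenate([f_orig, f_trans])

# Toy image, sampled transformation, and random weights.
x = rng.standard_normal((8, 8))
k = int(rng.integers(0, 4))            # ground-truth transformation parameter
W = rng.standard_normal((16, 64)) * 0.1
V = rng.standard_normal((1, 32)) * 0.1

tx = rotate90(x, k)                    # transformed view of the same image
f_x, f_tx = encoder(x, W), encoder(tx, W)
k_hat = decoder(f_x, f_tx, V)[0]

# AET loss: distance between the predicted and true transformation parameters.
# Minimizing this (over W, V) is what forces the encoder to capture structure.
loss = (k_hat - k) ** 2
```

Note that no image label is used anywhere: the supervision comes entirely from the transformation the model sampled itself.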

The authors explore three genres of transformation to instantiate AET models: parameterized, GAN-induced, and non-parameterized. In a parameterized transformation, each transformation is represented by its parameters, and the loss between two transformations is the difference between their parameter vectors. In a GAN-induced transformation, randomly sampled noise parameterizes the transformation, and the loss is the L2 distance between the sampled noise vector and the noise vector the decoder recovers from the encoder's features. Non-parameterized transformations are hard to parameterize directly, so a parameterized transformation is used to estimate the actual one. This may seem like a coarse approximation, but the goal is not an accurate estimate of the input transformation; it is to learn a good feature representation.
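The parameterized case can be made concrete with affine transformations, one of the families the paper considers. The sketch below (my own illustration, with made-up parameter values) shows how an affine transform reduces to a 2×3 matrix of parameters, so the loss between a predicted and a true transformation is simply a distance in parameter space:

```python
import numpy as np

def affine_params(angle, tx, ty, scale):
    """Flatten a rotation+scale+translation affine transform into its 2x3 matrix parameters."""
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty]]).ravel()

# Ground-truth parameters of the sampled transformation, and the decoder's
# (hypothetical) prediction of them.
theta     = affine_params(angle=0.30, tx=2.0, ty=-1.0, scale=1.10)
theta_hat = affine_params(angle=0.28, tx=1.9, ty=-0.9, scale=1.05)

# Parameterized-transformation loss: mean squared error between parameter vectors.
loss = np.mean((theta_hat - theta) ** 2)
```

The same pattern applies to any transformation family with an explicit parameterization (projective transforms, for instance): only `affine_params` changes.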

For simplicity, and keeping in mind the shortcomings of existing GANs, all AET experiments were carried out with parameterized transformations. After the encoder-decoder architecture produces the representations, a separate classifier is trained on the learned features to judge their quality. In conclusion, thorough experimentation with various transformations shows that AET enables the encoder to learn very useful representations, putting it on par with state-of-the-art fully supervised models.
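The evaluation protocol — freeze the encoder, then fit a simple classifier on its features — can be sketched as below. This is a minimal stand-in, not the paper's setup: the random arrays replace real AET features and labels, and a nearest-class-centroid classifier replaces the trained classifier, purely to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for frozen AET features and labels of a small labelled evaluation set.
feats  = rng.standard_normal((100, 16))
labels = rng.integers(0, 5, size=100)

# Fit a simple classifier on the frozen features: here, one centroid per class.
centroids = np.stack([feats[labels == c].mean(axis=0) for c in range(5)])

def predict(f):
    """Assign each feature vector to the class with the nearest centroid."""
    d = np.linalg.norm(f[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

# Accuracy of the probe is the proxy for feature quality; the encoder itself
# is never updated during this evaluation.
train_acc = (predict(feats) == labels).mean()
```

The important design choice is that the classifier sees only the frozen representations: any accuracy it achieves is attributable to what the unsupervised encoder learned, not to label supervision leaking into the features.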

Link for the paper and code.
