Inverting Discriminative Representations with HyperNets

Exploring the utility of HyperNets in overcoming traditional models’ limitations when inverting discriminative representations.

Alexey Potapov
SingularityNET
Jan 8, 2019


Introduction

Generative models are essential for general vision systems. They can be used to represent visual concepts, and the relations between them, by describing conditional distributions of the images of objects with corresponding attributes, subparts, etc. Learning visual concepts by fitting these distributions to samples of real images can be done with little to no supervision. A number of publications have addressed the problem of visual concept learning on the basis of generative models.

Unfortunately, training deep and expressive generative models has proven to be very challenging. Using discriminatively trained image representations is still the favored approach in downstream computer vision tasks, including applied tasks such as object detection and recognition as well as semantic vision tasks like image caption generation and visual question answering.

That is why inverting discriminatively trained, powerful representations is of practical interest as a tool for facilitating the construction of better generative models for semantic vision (although one might argue that unsupervised or reinforcement learning of representations is more AGI-ish).

With high-level discriminative features at hand, a decoder or a conditional generative adversarial network can be trained. To further ease the process of training the generative model, intermediate discriminative features can be used to guide the reconstruction or sampling, as in stacked GANs.

In this article, we present an additional option to improve models for inverting discriminative representations: HyperNets. These appear to be beneficial as decoders at least for inverting some discriminative features. Herein we will use the term HyperNets according to the definition set out in our previous post: referring to artificial neural networks that have higher-order connections, i.e., connections on connections.

Task

HyperNets for approximating families of functions

The purpose of discriminative models is to predict target variables (e.g. class labels) given observable variables (e.g. images). The values of these target variables are usually invariant to certain transformations of the input patterns. Thus, discriminative models naturally benefit from learning representations that discard irrelevant information. That is why some additional features n are needed in order to generate or reconstruct a concrete image x=g(n|z) in addition to its high-level discriminative features z=f(x). These additional features can be considered as "noise" sampled from a certain prior distribution, e.g. n~N(0, I), in purely generative settings (inducing a conditional distribution p(x|z)), or as an extension of the latent code constructed by an additional encoder n=e(x) in the case of autoencoders.

The question, however, is how to combine these features, z and n, in the generator. The simplest way is just to take their concatenation and train the generator as (an approximation of) a rendering function of two arguments. Alas, these arguments are quite different in nature. While the features z describe high-level image content, the vector n can be put into correspondence with the parameters of some image transformation. This situation is more accurately modeled by a parametric family of functions than by a single function of two arguments.
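For illustration, here is a minimal sketch of the concatenation approach (assuming PyTorch; the class name, layer sizes, and the 2048-dimensional feature vector are hypothetical, not taken from our code):

import torch
import torch.nn as nn

class ConcatGenerator(nn.Module):
    """Generator that treats (z, n) as one flat input vector."""
    def __init__(self, z_dim=2048, n_dim=8, out_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, out_pixels),
            nn.Sigmoid(),
        )

    def forward(self, z, n):
        # The content code z and the "transformation" code n are simply
        # concatenated; the flat network must discover on its own that
        # they play very different roles.
        return self.net(torch.cat([z, n], dim=1))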

As we discussed in one of our previous posts, traditional neural networks are not well suited for approximating families of functions. Specifically, if n contains spatial transformation parameters, then ordinary first-order networks fail to generalize the transformation itself, disentangled from the image content. In particular, reconstruction fails for those values of the transformation parameters for which similar images were not present in the training set.

Conversely, HyperNets successfully deal with this task due to their ability to approximate classes of functions, mapping the function parameters to the weights of the approximating neural network, and naturally disentangling this mapping from the image content.
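Schematically, this can be sketched as a second-order layer in which a control network maps the transformation code n to the weights that are then applied to the content code z (a minimal PyTorch sketch with hypothetical names; see our previous post for the actual architectures):

import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """Second-order layer: the weights applied to z are generated from n."""
    def __init__(self, z_dim, out_dim, n_dim):
        super().__init__()
        # Control (hyper) network: maps transformation parameters n
        # to a weight matrix and bias for the main layer.
        self.weight_gen = nn.Linear(n_dim, out_dim * z_dim)
        self.bias_gen = nn.Linear(n_dim, out_dim)
        self.z_dim, self.out_dim = z_dim, out_dim

    def forward(self, z, n):
        w = self.weight_gen(n).view(-1, self.out_dim, self.z_dim)
        b = self.bias_gen(n)
        # Batched matrix-vector product: y_i = W(n_i) z_i + b(n_i)
        return torch.bmm(w, z.unsqueeze(-1)).squeeze(-1) + b

The transformation parameters thus determine the computation performed on the content code, rather than being mixed into it as just another input.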

When images are subjected to known classes of spatial transformations, specialized models can be used efficiently. Inverting discriminative representations, however, is not a contrived example but a practical case of an unknown transformation that has to be learned, which makes this task an interesting test application for HyperNets.

Why invert Faster R-CNN?

Faster R-CNN is a widely used method for object detection and recognition. Since it is trained to recognize a limited set of classes, it is not the labels assigned to the extracted bounding boxes around salient objects, but their high-level features that are of main utility. For example, a number of visual question answering (VQA) models are built on top of Faster R-CNN (e.g. here). However, the VQA task requires not just the ability to recognize an extended set of object classes, but also to identify their attributes (both general and class-specific), performed actions, relations, and more, which requires going down to the visual features.

It is common practice to fine-tune relatively general features trained on a broad dataset (e.g. ResNet features trained on ImageNet) on a dataset from the target domain. Otherwise, these features will not deliver state-of-the-art performance (see, e.g., this paper). Thus, fine-tuning Faster R-CNN features is desirable for VQA, although it is not usually carried out. The problem, however, is that labels for bounding boxes are absent in VQA datasets, so fine-tuning cannot be done with supervision.

In such situations, generative models can help, because they describe the density distribution of the input patterns, and this distribution can be fine-tuned to fit a new dataset without supervision. In other words, one first needs to train a generative model conditioned on the discriminative features, using the same dataset that was used to train the discriminative model.

Prior to attempting to build a fully-fledged generative model, we would like to ask what kind of decoder will be capable of reconstructing the content of bounding boxes from Faster R-CNN features (and such a decoder alone can be useful for unsupervised fine-tuning).

Below, we describe an attempt to create a decoder able to reconstruct the images inside bounding boxes from their Faster R-CNN features.

Experiments

Reconstructing from Faster R-CNN features only

Let us start with a naïve model that tries to reconstruct the images in bounding boxes solely from the Faster R-CNN features (pre-trained on ImageNet and fine-tuned on Visual Genome). We used a model with the following architecture:

Simple convolutional decoder from Faster R-CNN features
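A rough sketch of such a decoder (PyTorch, with illustrative layer sizes and a hypothetical 2048-dimensional feature vector; the figure above shows the actual architecture):

import torch.nn as nn

class SimpleDecoder(nn.Module):
    """Deconvolutional decoder fed only with Faster R-CNN features."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 4x4 -> 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),    # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),     # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, feat):
        x = self.fc(feat).view(-1, 256, 4, 4)
        return self.deconv(x)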

We trained the model on different subsets of Visual Genome.

One might expect the Faster R-CNN features to be rich enough for the original images to be reconstructed from them; however, this is not really the case. The reconstructed images appear rather blurry:

Reconstruction of images from the training set based on Faster R-CNN features only
Reconstruction of images from the test set based on Faster R-CNN features only

Deepening the model can help to improve the results, but we will consider different models of the same small depth to make the difference in the reconstruction quality more visible.

Encoding lost information

As noted above, discriminative models generally benefit from discarding some factors of variation (e.g. the concrete position of objects) to which the results of recognition should be invariant. Thus, it is natural to extend the Faster R-CNN features with some latent features constructed by an additional encoder in such a way that the original images can be successfully reconstructed. These additional features are supposed to be parameters of some transformation, and their number should be kept small to avoid directly encoding the image content in them.

Basically, one should train an autoencoder whose encoder contains a pre-trained, fixed part. We used the following architecture for this:

Additional encoder for recovering lost information

The decoder is the same as before, but it takes the Faster R-CNN features concatenated with the extra learned features as input:

The same decoder as above, but from concatenated features
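Putting the two parts together, the training setup is roughly the following (a sketch with hypothetical names; frozen_features stands for the fixed Faster R-CNN branch, and decoder is the concatenation-based decoder from above, assumed to output images of the same size as its targets):

import torch
import torch.nn as nn

class ExtraEncoder(nn.Module):
    """Small trainable encoder producing the extra latent code n."""
    def __init__(self, n_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, n_dim)

    def forward(self, img):
        return self.fc(self.conv(img).flatten(1))

def reconstruction_step(img, frozen_features, encoder, decoder, optimizer):
    """One training step: only the encoder and decoder are updated."""
    with torch.no_grad():
        z = frozen_features(img)          # fixed discriminative features
    n = encoder(img)                      # learned "lost information"
    recon = decoder(torch.cat([z, n], dim=1))
    loss = nn.functional.mse_loss(recon, img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()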

Not surprisingly, this model produces better reconstructions. However, while its results on the training set are reasonable, the reconstructions of test images are still poor:

Reconstruction of images from the training set with the use of additional features
Reconstruction of images from the test set with the use of additional features

Although the decoder uses the Faster R-CNN features, it fails to generalize beyond the training set in how it combines them with the additional latent features.

Utilizing HyperNets

As we discussed in one of our previous posts, traditional “flat” decoders fail to learn image transformations disentangled from the image content, even when given explicit transformation parameters as input. HyperNets, on the other hand, can do this.

Here, we do not know what transformation should be applied or what parameters it has, but we can hope that an additional encoder will learn such a latent code, and that a HyperNet decoder will learn the corresponding transformation, so that the images can be reconstructed.

We used the following architecture for the decoder; the additional encoder is identical to the one above, but its features are fed to the control network instead of being concatenated with the Faster R-CNN features (the code can be found here):

HyperNet decoder with the control network
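The idea can be sketched as follows (a minimal PyTorch sketch with hypothetical names and illustrative sizes; the linked code contains the actual architecture): the control network maps the extra latent code to the weights of a convolution that is then applied to feature maps produced from the Faster R-CNN features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConvDecoder(nn.Module):
    """Decoder whose convolution weights are generated from the extra code n."""
    def __init__(self, feat_dim=2048, n_dim=8, ch=64, k=3):
        super().__init__()
        self.fc = nn.Linear(feat_dim, ch * 8 * 8)
        # Control network: n -> kernel of a convolution over the feature maps.
        self.ctrl = nn.Linear(n_dim, ch * ch * k * k)
        self.to_rgb = nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1)
        self.ch, self.k = ch, k

    def forward(self, z, n):
        maps = self.fc(z).view(-1, self.ch, 8, 8)
        out = []
        # Each sample gets its own convolution kernel generated from n.
        for m, ni in zip(maps, n):
            w = self.ctrl(ni).view(self.ch, self.ch, self.k, self.k)
            out.append(F.conv2d(m.unsqueeze(0), w, padding=self.k // 2))
        x = F.relu(torch.cat(out, dim=0))
        return torch.sigmoid(self.to_rgb(x))

The Faster R-CNN features thus determine what is rendered, while the generated weights determine how it is rendered.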

HyperNet achieves considerably better reconstruction:

Reconstruction of images (top) from the train set by HyperNet (bottom) in comparison with traditional models (middle) of the same depth
Reconstruction of images (top) from the test set by HyperNet (bottom) in comparison with traditional models (middle) of the same depth

It should be highlighted that the model really uses the Faster R-CNN features and does not ignore them by relying only on the additional encoder. Indeed, if we train the same model but replace the Faster R-CNN features with zeros, the result, even on the training set, is the following:

Results of reconstruction with ablated Faster R-CNN features
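In terms of the hypothetical names from the sketch above, this ablation simply zeroes out the discriminative features before decoding:

import torch

def ablated_forward(img, frozen_features, extra_encoder, hyper_decoder):
    """Reconstruction with the Faster R-CNN features replaced by zeros."""
    with torch.no_grad():
        z = torch.zeros_like(frozen_features(img))  # ablated content features
    n = extra_encoder(img)
    return hyper_decoder(z, n)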

At the same time, the reconstruction results can be improved (both for HyperNets and the baseline autoencoders) either by making the model deeper or by eliminating upsampling (unfolding the dense latent code directly into feature maps of the same size as the image to be reconstructed). In the first case, the results are slightly more blurry but less noisy, while in the second case the reconstruction is sharper but noisier:

Reconstruction by the deeper decoder (top — images from the training set, bottom — novel images)
Reconstruction by the decoder without upsampling (top — images from the training set, bottom — novel images)

Interestingly, for test images, the reconstruction is worse for images whose content is semantically less clear.

To conclude, the important point here is that separating the latent code into features that describe the image content and parameters of its transformation eases the task of learning to invert the representation. These parts of the latent code have different computational effects. If one tries to learn a representation in which all latent variables induce the same kind of computations, it is more difficult to disentangle these two types of variables. Hence, considering the aforementioned limitations of more traditional methods, it would be worthwhile to use HyperNets for the task of learning disentangled representations.

How Can You Get Involved?

If you would like to learn more about SingularityNET, we have a passionate and talented community which you can connect with by visiting our Community Forum. Feel free to say hello and introduce yourself here. We are proud of our developers and researchers who are actively publishing their research for the benefit of the community; you can read the research here.

For any additional information, please refer to our roadmaps and subscribe to our newsletter to stay informed about all of our developments.
