A new kind of deep neural networks

by Alfredo Canziani, Abishek Chaurasia and Eugenio Culurciello

There is a new wave of deep neural networks coming. They are the evolution of the feed-forward models that we previously analyzed in detail.

The new kind of neural network is an evolution of the initial feed-forward model of LeNet5 / AlexNet and derivatives, and includes more sophisticated by-pass schemes than ResNet / Inception. Feed-forward neural networks are also called encoders, as they compress and encode images into smaller representation vectors.

The new wave of neural networks has two important new features:

  • generative branches: also called decoders, as they project a representation vector back into the input space
  • recurrent layers: that combine representations from previous time steps with the inputs and representations of the current time step

Great! But what can this added complexity do for us?

It turns out conventional feed-forward neural networks have a lot of limitations:

1- cannot localize precisely: due to downsampling and loss of spatial resolution in higher layers, localization of features / objects / classes is impaired

2- cannot reason about the scene: because they compress an image into a short representation code, they lose information on how the image is composed, and how parts of the image or scene are spatially arranged

3- have temporal instabilities: since they are trained on still images, they have not learned smooth spatio-temporal transformations of object motion in space. They can recognize categories of objects in some images but not in others, and are very sensitive to adversarial noise and perturbations

4- cannot predict: since they do not use temporal information, feed-forward neural networks provide a new representation code at each frame, only based on the current input, but cannot predict what will happen in the next few frames (note: with exceptions, not trained on videos)
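The first limitation, loss of localization through downsampling, can be illustrated with a toy example. This is a minimal sketch in plain Python, not taken from any of the networks discussed here: `max_pool_1d` and `encode` are hypothetical helpers, and a 1-D list stands in for an image.

```python
def max_pool_1d(signal, window=2):
    """Non-overlapping 1-D max pooling (any trailing partial window is dropped)."""
    return [max(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]


def encode(signal, layers=2):
    """Stack pooling layers, as the downsampling stages of an encoder would."""
    for _ in range(layers):
        signal = max_pool_1d(signal)
    return signal


# The same feature (value 9) placed at two different positions.
feature_at_1 = [0, 9, 0, 0, 0, 0, 0, 0]
feature_at_3 = [0, 0, 0, 9, 0, 0, 0, 0]

code_a = encode(feature_at_1)
code_b = encode(feature_at_3)
print(code_a, code_b)  # [9, 0] [9, 0]
```

After two pooling stages both inputs map to the same code, so the encoder output alone can no longer tell where the feature was within the first half of the signal.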

To overcome these limitations, we need a new kind of network that can project a learned representation back into the input image space, and that can also be trained on temporally coherent sequences of images: we need to train on videos.

This is a list of advanced features that these new networks can provide:

  • unsupervised learning: they can be pre-trained on videos to predict future frames or representations, thus needing much less labeled data (expensive on videos!) to learn a given task
  • segmentation: segment different objects in an image
  • scene-parsing: follows from segmentation, if dataset has per-pixel object labels, used in autonomous driving and augmented reality
  • localization: follows from segmentation and perfect object boundaries, all scene-parsing and segmentation networks can do this!
  • spatio-temporal representation: use video for training, not just still images, know concept of time and temporal relationships
  • video prediction: some networks are designed to predict future frames in a video
  • representation prediction: some networks can predict the representation of future frames in a video
  • ability to perform online learning by monitoring the error signal between predictions and actual future frames or representations

Let us now examine the details and implementations of these new networks, as follows.

Generative ladder networks

These models use an encoder and a decoder pair to segment images into parts and objects. Examples are: ENet, SegNet, Unet, DenseNets, ladder networks, and many more.

Here is a representative 3-layer model:

D modules are standard feed-forward layers. G modules are generative modules, similar to standard feed-forward layers but with de-convolution and upsampling. They also use residual-like connections “res” to connect the representation at each encoder layer to that of the corresponding decoder layer. This forces the representation of the generative layers to be modulated by the feed-forward representation, and thus gives them a stronger ability to localize and parse the scene into objects and parts. “x” is the input image and “y” is the output segmentation at the same time step.
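The data flow through D modules, G modules and “res” connections can be sketched with a toy 1-D example. This is only an illustration under strong assumptions: there are no learned weights, `D` is plain pair-averaging, `G` is nearest-neighbour upsampling, the bypass is a simple addition, and the exact wiring of the “res” links is a guess rather than the layout of any specific network named above.

```python
def D(x):
    """Encoder module stand-in: halve the resolution by averaging pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def G(z, res=None):
    """Generative module stand-in: double the resolution by repetition,
    then add the encoder representation of matching size, if given."""
    up = [v for v in z for _ in range(2)]
    if res is not None:
        up = [u + r for u, r in zip(up, res)]
    return up


def ladder(x, use_res=True):
    """3-layer encoder followed by a 3-layer decoder."""
    d1 = D(x)
    d2 = D(d1)
    d3 = D(d2)
    g3 = G(d3, d2 if use_res else None)  # bypass from encoder layer 2
    g2 = G(g3, d1 if use_res else None)  # bypass from encoder layer 1
    return G(g2)                         # back to input resolution


x = [0, 0, 0, 0, 0, 8, 0, 0]  # one sharp feature at index 5
print(ladder(x, use_res=False))
print(ladder(x, use_res=True))
```

Without the bypass, the decoder only sees the single value left at the bottleneck and produces a flat output. With the bypass, the output peaks at indices 4 and 5, so the position of the feature is recovered to within one pooling cell.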

These networks can perform segmentation, scene-parsing and precise localization, but do not operate in the temporal domain, and have no memory of past frames.

The encoder-to-decoder by-pass at each layer recently helped these networks achieve state-of-the-art performance.

Recursive and generative ladder networks

One of the newest deep neural network architectures adds recursion to generative ladder networks. These are Recursive and Generative ladder networks (REGEL, we call them CortexNet models), and they are one of the most complex deep neural network models to date, at least for image analysis.

Here is a 3-layer model of one network we currently use:

D and G modules are practically identical to the ones in the Generative ladder networks described above. These networks add a recurrent path “t-1” from each G module to the respective D module in the same layer.

These networks take a sequence of frames from a video as input x[t], and at each time-step they predict the next frame in the video y[t+1], which is close to x[t+1], if the prediction is accurate.

Since this network can measure the error between the prediction and the actual next frame, it knows when it is able to predict an input or not. If not, it can activate incremental learning, something feed-forward networks are not able to do. It is thus able to perform inherent online learning.

We think this is a very important feature of machine learning, and one that is a prerogative of predictive neural networks. Without this feature, networks are not able to provide a true predictive confidence signal, and are not able to perform effective incremental learning.
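The idea of using the prediction error to decide when to learn can be sketched as follows. This is a deliberately tiny stand-in, not CortexNet: a single scalar weight `w` predicts the next "frame" from the current one, and `run_online`, `lr` and `threshold` are hypothetical names chosen for the illustration.

```python
def run_online(frames, lr=0.5, threshold=0.1):
    """Predict frames[t + 1] from frames[t]; learn only when surprised."""
    w = 0.0
    surprised_log = []
    for t in range(len(frames) - 1):
        prediction = w * frames[t]            # y[t+1]
        error = frames[t + 1] - prediction    # x[t+1] - y[t+1]
        surprised = abs(error) > threshold
        if surprised:
            # One gradient step on the squared prediction error.
            w += lr * error * frames[t]
        surprised_log.append(surprised)
    return w, surprised_log


# First the sequence alternates sign (x[t+1] = -x[t]),
# then its dynamics change and it becomes constant (x[t+1] = x[t]).
frames = [1.0, -1.0] * 5 + [-1.0] * 10
w, surprised_log = run_online(frames)
print(w)
print(surprised_log)
```

In this run the predictor updates during the first four steps, stays silent while the alternating pattern is predictable, is surprised again when the dynamics change at step 9, and settles once the new pattern is learned. The error signal therefore doubles as a confidence signal and as a trigger for incremental learning.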

These networks are still under study. Our advice: keep an eye on them!

Predictive coding networks — part 1

Recursive generative networks are one possible predictive model. Alternatively, the predictive coding computational neuroscience model can provide prediction capabilities and be arranged as hierarchical deep neural networks.

Here is an example of a 2-layer model:

The Rao and Ballard model and the Friston implementation compute an error “e” at each layer between “A” modules (similar to the D modules of the ladder networks above) and “R/Ay” modules (similar to the G modules of the ladder networks above). This error “e” represents, at each layer, the ability of the network to predict the representation. Error “e” is then forwarded as input to the next layer. “R” is a convolutional RNN/LSTM module, and “Ay” is similar to the “A” module. “R” and “Ay” can also be combined into a single recurrent module. In the first layer “x” is the input frame.
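The error-forwarding scheme can be sketched with a toy scalar model. This is an assumption-heavy illustration, not PredNet or the Rao and Ballard model: `ErrorForwardingLayer` is a hypothetical class, a running average stands in for the “R/Ay” module, and top-down connections are left out.

```python
class ErrorForwardingLayer:
    """Keeps a running prediction of its input and outputs the error "e"."""

    def __init__(self, rate=0.5):
        self.prediction = 0.0  # stand-in for the output of "R/Ay"
        self.rate = rate

    def step(self, a):
        e = a - self.prediction          # error between "A" and the prediction
        self.prediction += self.rate * e  # improve the prediction for next time
        return e                          # this is what the next layer receives


def run(frames, n_layers=2):
    layers = [ErrorForwardingLayer() for _ in range(n_layers)]
    history = []
    for x in frames:
        signal = x
        errors = []
        for layer in layers:
            signal = layer.step(signal)  # layer l+1 sees only the error of layer l
            errors.append(signal)
        history.append(errors)
    return history


history = run([4.0] * 30)  # a perfectly predictable input
print(history[0])   # errors at the first frame
print(history[-1])  # errors once the input has become predictable
```

With a constant input, the errors at both layers shrink towards zero, so the upper layer ends up receiving almost no signal even though the input is still present.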

The problem with this model is that the network is very different from standard feed-forward neural networks. It does not create a hierarchical representation in which higher layers combine the features of lower layers; rather, these predictive networks compute a representation of the residual errors of previous layers.

As such, they are a bit reminiscent of residual feed-forward networks, but in practice forcing these networks to forward errors does not lead them to learn an effective hierarchical representation at higher layers. Consequently, they cannot effectively perform other tasks based on upper-layer representations, such as categorization, segmentation, and action recognition. More experiments are needed to demonstrate these limitations.

This model has been implemented in PredNet by Bill Lotter and David Cox. A similar model is also from Brain Corporation.

Predictive coding networks — part 2

The Spratling predictive coding model projects the representation “y” to upper layers, not the error “e”, as was done in the Friston models above. This makes this network model more compatible with hierarchical feed-forward deep neural networks, and avoids learning moments of errors in the upper layers.

Here is an example of a 2-layer model:

This model can essentially be rewritten and simplified into the recursive generative ladder model we have seen above. This is because “R” and “Ay” can be combined into a single recurrent module.
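The effect of forwarding the representation rather than the error can be sketched with a toy scalar model. This is not Spratling's actual formulation: `RepresentationForwardingLayer` is a hypothetical class, and a running average stands in for the combined recurrent module.

```python
class RepresentationForwardingLayer:
    """Tracks its input with a representation "y" and passes "y" upward."""

    def __init__(self, rate=0.5):
        self.y = 0.0
        self.rate = rate

    def step(self, a):
        e = a - self.y            # local error, used only inside this layer
        self.y += self.rate * e   # update the representation
        return self.y, e


def run(frames, n_layers=2):
    layers = [RepresentationForwardingLayer() for _ in range(n_layers)]
    ys, es = [], []
    for x in frames:
        signal = x
        ys, es = [], []
        for layer in layers:
            signal, e = layer.step(signal)  # the next layer receives "y", not "e"
            ys.append(signal)
            es.append(e)
    return ys, es  # values after the last frame


ys, es = run([4.0] * 40)  # a perfectly predictable input
print(ys)  # representations at each layer
print(es)  # local errors at each layer
```

With a constant input, the local errors vanish at every layer while each layer still holds a stable, non-zero representation of the input, which is the property that keeps this scheme compatible with hierarchical feed-forward networks.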

Relation to generative adversarial networks

Generative Adversarial Networks (GAN) are a very popular model that is able to learn to generate samples from a distribution of data. The new network models presented here are superior to GAN because:

  • instead of being trained in a minimax game, they are directly trained for a useful task, so both the discriminator and the generator are directly useful
  • they learn to create a useful input representation while also being able to generate new inputs
  • they can learn to generate targeted data based on inputs
  • generator and discriminator networks are tied together, removing convergence problems
  • the generator provides almost perfect photorealistic samples (see below), as compared to the unsightly results provided by GAN

Example of CortexNet predictive capabilities. Left: current frame, Center: next frame ground truth, Right: predicted next frame

Note on other models

Models like CortexNet are reminiscent of Pixel recurrent networks and their various implementations (PixelCNN, PixelCNN++, WaveNet, etc). These models aim at modeling the distribution of input data: (“Our aim is to estimate a distribution over natural images that can be used to tractably compute the likelihood of [data] and to generate new ones.”). They focus only on generating new realistic data samples, and have not been shown to learn representations for real-life tasks. These models are also very slow in inference.


These new networks are still under study and evaluation. For example, the most recent PredNet paper presents a comparison of predictive coding and ladder networks, where PredNet wins on some tasks. PredNet was used to perform orientation-invariant face classification, using higher-layer representations. It can also predict steering angles in a dataset, but mostly using simple motion filters from the first layer of the network. This task does not require a hierarchical decomposition of features.