Notes on “Task-Driven Convolutional Recurrent Models of the Visual System”

This recent paper (it’s a pre-print / work in progress) explores how to incorporate useful recurrence into convolutional neural networks (CNNs) in order to boost object recognition accuracy on natural images.


The authors argue that CNNs, although “quantitatively accurate models of temporally-averaged responses of neurons in the primate brain’s visual system”, do not exhibit the following features found in biological systems:

  1. Local recurrent connections within cortical areas
  2. Long-range feedback from downstream to upstream areas

The ImageNet dataset is used to test the methods in the paper. The argument is that it “… contains many images with properties that might make use of recurrent processing … (e.g. heavy occlusion, the presence of multiple foreground objects, etc).” Furthermore, “… some of the most effective recent solutions to ImageNet … repeat the same structural motif across many layers, which suggests that they might be approximable by the temporal unrolling of shallower recurrent networks …”.

The authors’ attempts at incorporating standard forms of recurrence (e.g., vanilla RNN or LSTM cells) into CNNs didn’t show a significant performance improvement relative to the typical strictly feed-forward structure. This diverges from previous work, in which simple forms of recurrence were able to boost performance on simpler object recognition tasks (e.g., the CIFAR-10 dataset). To overcome this shortcoming, new recurrent cell types (e.g., the “Reciprocal Gated Cell”) were hand-designed to integrate into a baseline CNN architecture. Architecture search was then used to choose among thousands of local recurrent cell and long-range feedback configurations. The resulting “ConvRNN” network architecture schematic is shown in the figure below (taken from their paper).

Ultimately, these efforts resulted in a recurrent CNN model that matched the performance of the much deeper ResNet-34 network while using only 75% as many parameters. The ImageNet-driven ConvRNN provides “… a quantitatively accurate model of neural dynamics at a 10 millisecond resolution across intermediate and higher visual cortical areas.”


I really enjoyed this paper because it unifies mature ideas from both machine learning and neuroscience and develops some truly great ones of its own. And, the writing is extremely clear and understandable! I’m excited by the hybridization of convolution and recurrence; although it’s been done before in different capacities, the authors hypothesize good reasons for why their method seems to work so well. One argument is that standard recurrent cells appear to lack the combination of the gating (“… in which the value of a hidden state determines how much of the bottom-up input is passed through, retained, or discarded at the next time step …”) and bypassing (“… where a zero-initialized hidden state allows feedforward input to pass on to the next layer unaltered, as in the identity shortcuts of ResNet-class architectures …”) properties hypothesized for recurrent connections in cortical areas. Both features are also thought to alleviate the problem of vanishing gradients as they are back-propagated to earlier layers.
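To make the gating-and-bypassing distinction concrete, here is a minimal NumPy sketch of a single recurrent step that has both properties. This is my own toy construction, not the paper’s actual Reciprocal Gated Cell (which is convolutional and wired differently); `W_g` and `W_h` are hypothetical stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # toy feature dimension (the real cells operate on conv feature maps)

# Stand-in "learned" parameters (hypothetical; not from the paper).
W_g = rng.normal(scale=0.1, size=(n, n))
W_h = rng.normal(scale=0.1, size=(n, n))

def recurrent_step(x, h):
    """One recurrent step illustrating gating + bypassing.

    Gating: the hidden state h determines how much of the bottom-up
    input x is passed through or discarded at this time step.
    Bypassing: a zero hidden state leaves x unaltered, like a
    ResNet identity shortcut.
    """
    # suppress lies in [0, 1) and is exactly 0 when h == 0
    suppress = np.tanh(np.maximum(W_g @ h, 0.0))
    return (1.0 - suppress) * x + np.tanh(W_h @ h)

x = rng.normal(size=n)

# Bypass: with a zero-initialized hidden state, x passes unchanged.
out0 = recurrent_step(x, np.zeros(n))
assert np.allclose(out0, x)

# Gating: a nonzero hidden state modulates how much of x gets through.
out1 = recurrent_step(x, rng.normal(size=n))
```

The `suppress` term vanishes whenever `h` is zero, so the first time step is an exact identity shortcut; thereafter the hidden state decides how much of the bottom-up input survives.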

Clever baseline results are established in order to convince the reader that the work advances the state of the art, since the addition of recurrent layers adds parameters to the model, which “… could improve task performance for reasons unrelated to recurrent computation”:

  1. “Feedforward models with more convolution filters (“wider”) or more layers (“deeper”) to approximately match the number of parameters in a recurrent model …”
  2. “Replicas of each ConvRNN model unrolled for a minimal number of time steps, defined as the number that allows all model parameters to be used at least once.”
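The first control is easy to sketch: given a recurrent cell that adds, say, one extra hidden-to-hidden convolution per layer, we can compute how many filters a purely feedforward layer would need to match the parameter count. The numbers below are made up for illustration, not taken from the paper.

```python
def conv_params(in_ch, out_ch, k):
    """Parameter count of a k x k convolution layer (weights + biases)."""
    return in_ch * out_ch * k * k + out_ch

# Hypothetical toy layer sizes, just to illustrate the control:
feedforward = conv_params(128, 128, 3)       # baseline feedforward conv
recurrent_extra = conv_params(128, 128, 3)   # added hidden-to-hidden conv

# Widen the feedforward layer until its parameter count matches the
# recurrent layer's total, as in the paper's "wider" control models.
target = feedforward + recurrent_extra
out_ch = 128
while conv_params(128, out_ch, 3) < target:
    out_ch += 1
```

Here doubling the per-layer parameters is matched by doubling the output channels (`out_ch` ends at 256), which is roughly how a “wider” parameter-matched baseline would be constructed.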

An enormous amount of computation is needed for the architecture search: the authors use “… hundreds of second generation Google Cloud Tensor Processing Units (TPUv2s) …” to search over both learning hyperparameters and network architecture choices. However, the search may be worth the effort: it demonstrated that the aforementioned Reciprocal Gated Cell was a better fit for an ImageNet-classifying CNN than the LSTM cell, increasing classification accuracy while reducing the number of learned parameters.
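Very loosely, the search can be pictured as sampling configurations from a discrete space and keeping the best scorer. The real system trains each candidate on TPU pods; this toy sketch fakes the scoring function with made-up accuracies, and the search space below is hypothetical.

```python
import random

random.seed(0)

# Hypothetical toy search space over local-recurrence and long-range
# feedback choices; the paper searches thousands of configurations.
search_space = {
    "cell_type": ["none", "LSTM", "ReciprocalGated"],
    "feedback_from": [None, "layer8", "layer9"],
    "unroll_steps": [5, 10, 16],
    "learning_rate": [0.1, 0.01, 0.001],
}

def sample_config():
    return {k: random.choice(v) for k, v in search_space.items()}

def score(config):
    # Stand-in for training the candidate ConvRNN and measuring
    # ImageNet validation accuracy (made-up numbers for illustration).
    base = {"none": 0.70, "LSTM": 0.71, "ReciprocalGated": 0.73}[config["cell_type"]]
    bonus = 0.01 if config["feedback_from"] else 0.0
    return base + bonus

best = max((sample_config() for _ in range(100)), key=score)
```

With the fake scoring above, the search settles on a Reciprocal Gated Cell with long-range feedback, mirroring (by construction) the qualitative outcome the authors report.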


The following figure from the paper is my favorite:

“… most units from V4 are best fit by layer 6 features; pIT by layer 7; and cIT/aIT by layers 8/9.” The analysis of the artificial network’s dynamics in conjunction with observed neural responses is genius, and I’d really like to see more of this. Given that these networks can accurately capture real neural dynamics, it should be possible to use them as predictive models of said dynamics, making for a cheap way to test hypotheses about the primate visual system.