Understanding the letter ‘a’

Grigory Sapunov
Published in BuzzRobot
Nov 1, 2017 · 8 min read

“The central problem of AI is the question: What is the letter ‘a’?”

“For any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence.”

— Douglas Hofstadter,
Metamagical Themas: Questing for the Essence of Mind and Pattern

Two research papers that seem to target the same thing appeared recently: Hinton’s work on capsules and Vicarious’ paper on the Recursive Cortical Network.

Both propose a new kind of architecture aimed at better representing our neural machinery, intended to replace the current industry leaders: convolutional neural networks (CNNs).

Both works propose their own path towards general intelligence.

Recursive Cortical Network (RCN)

The Vicarious paper (published in Science, 26 Oct 2017) targets the common-sense problem with a new kind of computer vision model called the RCN. Vicarious aims to build models that are compositional, factorized, hierarchical, and flexibly queryable.

RCN does not start from a “tabula rasa”; it begins with “scaffolding”, a prior structure that facilitates model building.

According to the blog:

“RCN is an object-based model that assumes factorization of contours and surfaces, and objects and background. RCN also represents shape explicitly, and the presence of lateral connections allows it to pool across large transformations without losing specificity, thereby increasing its invariance. Compositionality allows RCN to represent scenes with multiple objects while only requiring explicit training on individual objects. All of these features of RCN derive from our assumption that evolution has endowed the neocortex with similar scaffolding that makes it easy to learn representations in our world compared to starting from a totally blank slate.”

RCN is a hierarchical probabilistic generative model for vision in which message-passing based inference handles recognition, segmentation and reasoning in a unified way.

In RCN, objects are modeled as a combination of contours and surfaces. Contours appear at the boundaries of surfaces, both at the outline of objects and at the border between the surfaces that compose the object. Surfaces are modeled using a Conditional Random Field (CRF) which captures the smoothness of variations of surface properties. Contours are modeled using a compositional hierarchy of features. Factored representation of contours (shape) and surfaces (appearance) enables the model to recognize object shapes with dramatically different appearances without training exhaustively on every possible shape and appearance combination.

In order to parse a scene, RCN maintains hierarchical graphs for multiple object instances at multiple locations tiling the scene. The parse of a scene can be obtained via maximum a posteriori (MAP) inference on this complex graph, which recovers the best joint configuration including object identities and their segmentations.
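RCN’s actual inference runs message passing on a large loopy hierarchical graph. As a toy illustration of what “MAP inference by message passing” means, here is max-product (Viterbi-style) inference on a small chain of discrete variables; all potentials are invented for the example:

```python
import numpy as np

# Toy MAP inference by max-product message passing on a chain of
# discrete variables. RCN does this on a far larger (loopy) graph;
# this only illustrates the principle. Potentials are invented.
rng = np.random.default_rng(0)
T, K = 5, 3                      # 5 variables, 3 states each
unary = rng.normal(size=(T, K))  # log unary potentials
pair = rng.normal(size=(K, K))   # shared log pairwise potential

def map_chain(unary, pair):
    T, K = unary.shape
    msg = np.zeros((T, K))              # forward max-messages
    back = np.zeros((T, K), dtype=int)  # backpointers for decoding
    for t in range(1, T):
        scores = msg[t - 1][:, None] + unary[t - 1][:, None] + pair
        msg[t] = scores.max(axis=0)
        back[t] = scores.argmax(axis=0)
    # Backtrack the best joint configuration (the MAP assignment).
    states = np.empty(T, dtype=int)
    states[-1] = (msg[-1] + unary[-1]).argmax()
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

print(map_chain(unary, pair))    # the jointly most probable assignment
```

On a tree (or chain, as here) this recovers the exact MAP configuration; on RCN’s loopy graph, message passing of this kind is approximate.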

See more technical details in the paper.

Structure of the RCN. (A) A hierarchy generates the contours of an object, and a Conditional Random Field (CRF) generates its surface appearance. (B) Two subnetworks at the same level of the contour hierarchy keep separate lateral connections by making parent-specific copies of child features and connecting them with parent-specific laterals; nodes within the green rectangle are copies of the feature marked “e”. (C) A three-level RCN representing the contours of a square. Features at Level 2 represent the four corners, and each corner is represented as a conjunction of four line-segment features. (D) Four-level network representing an “A”.

It seems the right scaffolding helps: RCNs are reported to be 300× more data-efficient than CNNs on a scene text recognition benchmark.

One particular practical example Vicarious focuses on is solving text-based CAPTCHAs.

The team shows the model is good at solving this task and generalizes better than CNNs. On the task of recognizing CAPTCHAs with progressively increasing spacing between letters, CNN performance drops significantly, while RCN performance remains stable (and RCN is much more data-efficient at the same time):

Image from Vicarious post, https://www.vicarious.com/2017/10/26/common-sense-cortex-and-captcha/

RCN outperformed other models on one-shot and few-shot classification tasks on the standard MNIST handwritten digit dataset. The authors compared RCN’s classification performance on MNIST as they varied the number of training examples from 1 to 100 per category. The one-shot recognition accuracy of RCN was 76.6%, vs 68.9% for CPM (the Compositional Patch Model, which recently reported state-of-the-art performance on this task) and 54.2% for VGG-fc6.

MNIST classification accuracy for RCN, CNN, and CPM.

As a generative model, RCN outperformed variational autoencoders (VAEs) and DRAW at reconstructing corrupted MNIST images:

Examples of reconstructions (A) and reconstruction error (B) from RCN, VAE and DRAW on corrupted MNIST.

The authors conclude:

“Our work in the paper is a small step in endowing computers to understand letterforms with the flexibility and fluidity of human perception. Even with our advancements, we are still far from having solved Hofstadter’s seemingly simple challenge of detecting ‘A’s with the same ‘fluidity and dynamism’ of humans. We believe that many of the ideas that we explored in the paper will be important for building systems that can generalize beyond their training distributions like humans do.”

Capsules

The paper by S. Sabour, N. Frosst & G. E. Hinton, “Dynamic Routing Between Capsules”, was accepted at NIPS 2017 and published on arXiv the same day as the previous one (26 Oct 2017).

This is a development of Hinton’s idea of capsules introduced previously in the paper “Transforming Auto-encoders” by G. E. Hinton, A. Krizhevsky & S. D. Wang (2011).

One of the places where Hinton talks about capsules is the “What is wrong with convolutional neural nets?” talk given at MIT in 2014. There is a more recent talk on the same topic given at the Fields Institute in Toronto on August 17, 2017, and one more slide deck relevant to this idea.

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. The length of the activity vector represents the probability that the entity exists and its orientation represents the instantiation parameters.
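In the NIPS paper, this “length encodes probability” idea is implemented with a squashing non-linearity, v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · s_j/‖s_j‖, which keeps the vector’s direction (the instantiation parameters) while compressing its length into [0, 1). A minimal sketch:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """The paper's squashing non-linearity: shrinks short vectors
    toward length 0 and long vectors toward length 1, while keeping
    the direction (the instantiation parameters) unchanged."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))   # input length 5
print(np.linalg.norm(v))           # ≈ 25/26 ≈ 0.96: near-certain presence
```

The `eps` term is a common numerical-stability guard for near-zero inputs, not something specified in the paper.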

The recent paper introduces an iterative routing-by-agreement mechanism, according to which a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule. This dynamic routing mechanism ensures that the output of a capsule gets sent to an appropriate parent in the layer above.
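The routing procedure (Algorithm 1 in the paper) can be sketched in NumPy roughly as follows; the toy predictions at the bottom are invented to show agreement driving up the coupling coefficients:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def routing(u_hat, n_iters=3):
    """Routing-by-agreement (a sketch of Algorithm 1 of the NIPS paper).
    u_hat: predictions from each of I lower capsules for each of
    J upper capsules, shape (I, J, D)."""
    I, J, D = u_hat.shape
    b = np.zeros((I, J))                                      # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over J
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum, (J, D)
        v = squash(s)                                         # upper activities
        b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update
    return v, c

# Toy check: two lower capsules agree on upper capsule 0 but
# disagree about capsule 1, so routing favors capsule 0.
u_hat = np.zeros((2, 2, 4))
u_hat[:, 0] = [1.0, 0.0, 0.0, 0.0]    # both predict the same pose for j=0
u_hat[0, 1] = [0.0, 1.0, 0.0, 0.0]    # opposite poses for j=1
u_hat[1, 1] = [0.0, -1.0, 0.0, 0.0]
v, c = routing(u_hat)
print(c)   # coupling to capsule 0 grows above 0.5 for both inputs
```

Capsules whose predictions agree reinforce each other: their scalar products with the resulting upper-level activity are large, which raises the corresponding logits on the next iteration.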

A simple CapsNet with 3 layers. This model gives results comparable to deep convolutional networks. The length of the activity vector of each capsule in the DigitCaps layer indicates the presence of an instance of each class and is used to calculate the classification loss. Wij is a weight matrix between each ui, i ∈ (1, 32 × 6 × 6), in PrimaryCapsules and vj, j ∈ (1, 10).

Dynamic routing can be viewed as a parallel attention mechanism that allows each capsule at one level to attend to some active capsules at the level below and to ignore others.

The authors got a very good result on MNIST using a single model with no ensembling or drastic data augmentation: a low test error (0.25%) on a 3-layer network, previously achieved only by deeper networks.

The authors then showed that the model (CapsNet) is robust to affine transformations. The models were never trained on affine transformations other than translation and whatever natural variation is present in standard MNIST. An under-trained CapsNet with early stopping that achieved 99.23% accuracy on the expanded MNIST test set achieved 79% accuracy on the affNIST test set (in which each example is an MNIST digit with a random small affine transformation). A traditional convolutional model with a similar number of parameters achieved similar accuracy (99.22%) on the expanded MNIST test set but only 66% on the affNIST test set.

The model can work in a way similar to a variational autoencoder (VAE): after computing the activity vector for the correct digit capsule, we can feed a perturbed version of this activity vector to the decoder network and see how the perturbation affects the reconstruction.
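The procedure can be sketched as follows. Note that the decoder weights here are untrained random placeholders (the real decoder is a small MLP trained with a reconstruction loss), so only the perturbation loop, not the resulting images, is meaningful; the paper perturbs one dimension at a time by values in [-0.25, 0.25] in steps of 0.05:

```python
import numpy as np

# Sketch of the perturbation experiment. The real decoder reconstructs
# the input image from the 16-D activity vector of the correct digit
# capsule; here its weights are random placeholders.
rng = np.random.default_rng(0)
D, H, P = 16, 512, 28 * 28
W1 = rng.normal(scale=0.05, size=(D, H))
W2 = rng.normal(scale=0.05, size=(H, P))

def decode(v):
    # Stand-in MLP decoder: ReLU hidden layer, sigmoid pixel outputs.
    return 1.0 / (1.0 + np.exp(-(np.maximum(v @ W1, 0.0) @ W2)))

v = rng.normal(size=D)                  # activity vector of the correct capsule
for dim in range(D):                    # tweak one dimension at a time
    for delta in np.arange(-0.25, 0.30, 0.05):
        v_pert = v.copy()
        v_pert[dim] += delta
        img = decode(v_pert).reshape(28, 28)   # inspect how `dim` changes it
```

With a trained decoder, individual dimensions turn out to control interpretable properties such as stroke thickness, skew, and width.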


It is interesting to compare this approach with Spatial Transformer Networks (STNs). An STN tries to eliminate viewpoint variation from the activations by giving neural networks the ability to actively spatially transform feature maps, conditioned on the feature map itself. Capsules instead use neural activities that vary as the viewpoint varies. This gives them an advantage over “normalization” methods like STNs: they can deal with multiple different affine transformations of different objects or object parts at the same time.

Similarly to the RCN paper, the dynamic routing mechanism allows the model to recognize multiple objects in the image even if the objects overlap.

On the MultiMNIST dataset (generated by overlaying a digit on top of another digit from the same set but a different class), a 3-layer CapsNet model trained from scratch achieved higher test classification accuracy than a baseline convolutional model.
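Generating such a dataset is straightforward. A rough sketch (the paper shifts each digit by up to 4 pixels in each direction, producing 36×36 images; merging overlapping strokes with a pixel-wise max is an assumption of this sketch):

```python
import numpy as np

def multi_mnist_pair(img_a, img_b, rng, shift=4):
    """Overlay two 28x28 digits of different classes, each randomly
    shifted by up to `shift` pixels, roughly as described for
    MultiMNIST. Overlapping pixels are merged with a pixel-wise max
    (an assumption; the paper only says a pixel can be on in both)."""
    size = 28 + 2 * shift                       # 36x36 canvas
    canvas = np.zeros((2, size, size), dtype=img_a.dtype)
    for k, img in enumerate((img_a, img_b)):
        dy, dx = rng.integers(0, 2 * shift + 1, size=2)
        canvas[k, dy:dy + 28, dx:dx + 28] = img
    return np.maximum(canvas[0], canvas[1])     # merged overlapping digits

rng = np.random.default_rng(0)
a, b = rng.random((28, 28)), rng.random((28, 28))  # stand-ins for MNIST digits
merged = multi_mnist_pair(a, b, rng)
print(merged.shape)    # (36, 36)
```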

The reconstructions show that CapsNet is able to segment the image into the two original digits. Since this segmentation is not at the pixel level, the authors observe that the model deals correctly with the overlaps (a pixel that is on in both digits) while accounting for all the pixels:

Sample reconstructions of a CapsNet with 3 routing iterations on the MultiMNIST test dataset. The two reconstructed digits are overlaid in green and red as the lower image. The upper image shows the input image. L:(l1, l2) represents the label for the two digits in the image and R:(r1, r2) represents the two digits used for reconstruction.

Hinton concludes:

“Research on capsules is now at a similar stage to research on recurrent neural networks for speech recognition at the beginning of this century. There are fundamental representational reasons for believing that it is a better approach but it probably requires a lot more small insights before it can out-perform a highly developed technology. The fact that a simple capsules system already gives unparalleled performance at segmenting overlapping digits is an early indication that capsules are a direction worth exploring.”

It’s interesting to note that there is another relevant paper, “Matrix Capsules with EM Routing”, submitted to the ICLR 2018 conference. While it is anonymized for double-blind review, the paper seems to belong to the same group.

The ICLR 2018 paper highlights several (successfully addressed) architectural deficiencies of the NIPS 2017 paper: using an unprincipled non-linearity for scaling the length of the pose vector; using a cosine distance, which is not good at distinguishing between quite good agreement and very good agreement; and using transformation matrices with n² parameters rather than n. For technical details, see the paper itself.

A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.

On the smallNORB benchmark, the new model reduced the number of test errors by 45% compared to the state of the art.

Slide from the Re-Work DL Summit in Montréal. Different numbers but the same trend:

https://plus.google.com/111502245593180635268/posts/3Q9XFBqgvpw

In addition, it is worth mentioning that this capsules architecture was shown to be far more resistant to adversarial attacks than a baseline CNN.

So, keep watching the field.

The next frontiers

This is not to say that the problem of general intelligence has been solved. So what are the next goals?

Returning to the epigraph, and following Donald Knuth’s amendment:

“The central problem of AI is the question: What are ‘a’ and ‘i’?”

:)

