RCN

Understanding the Inference Mechanism of RCNs

Adapting the generative model for classification to break CAPTCHA

Ahmed Maher
The Startup

--

If you take a look at Stanford’s 2019 AI Index report, you will notice that the performance of models on well-known benchmarks is starting to saturate [1]. For that reason, I believe we need to shed light on new ideas that can push deep learning further than it has reached. One field that should eagerly strive for such progress is computer vision because, as Fei-Fei Li said, understanding vision is really understanding intelligence [2].

The new idea we explore in this series of articles is Recursive Cortical Networks (RCNs) [3], the brainchild of the company Vicarious, which has attracted the attention of investors like Elon Musk and Mark Zuckerberg [4]. In this article, we discuss how RCNs are adapted to perform classification rather than generation.

An RCN model classifying the letter ‘A’ | taken from Vicarious’ Blog

This article assumes an understanding of the structure of RCNs and how they perform generation, which you can gain from reading the supplementary material of [3] or my previous article.

Single Object Detection

Given an image containing a certain object, we want the RCN to tell us which class the object belongs to. RCNs achieve that by posing the question: if we assume that the input image was actually generated by the RCN, which channel in the topmost layer is the most likely to have generated it? In RCNs, answering that question identifies not only the class of the object but also its location. To answer it, we could try to build a joint probability distribution over all the states of all the random variables (the channels) in the RCN, a model that computes the probability of every possible full assignment of states to the channels. Having that, we would condition the model on the input image, find the full assignment with the maximum probability, and then read off which channel in the topmost layer is set to ‘true’.
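
To make the brute-force idea concrete, here is a minimal sketch of it on a made-up three-channel model (one class channel, one feature channel, one observed ‘pixel’ channel). The tables and variables are invented for illustration and are not from the paper:

```python
import itertools

# Toy "network": three binary variables -- a top-level class channel C,
# an intermediate feature channel F, and an observed pixel channel X.
# joint(c, f, x) is a made-up joint distribution for illustration only;
# a real RCN has a huge number of channels, which is why enumerating
# every assignment like this is intractable in practice.
def joint(c, f, x):
    p_c = 0.5                                # prior over the class channel
    p_f_given_c = 0.9 if f == c else 0.1     # feature tends to follow class
    p_x_given_f = 0.8 if x == f else 0.2     # pixel tends to follow feature
    return p_c * p_f_given_c * p_x_given_f

observed_x = 1  # condition on the input image (here, a single "pixel")

# Enumerate every full assignment consistent with the evidence and keep
# the one with maximum probability (the MAP assignment).
best_assignment, best_p = None, -1.0
for c, f in itertools.product([0, 1], repeat=2):
    p = joint(c, f, observed_x)
    if p > best_p:
        best_assignment, best_p = (c, f), p

print("MAP assignment (C, F):", best_assignment, "with p =", best_p)
# The value of C in the MAP assignment is the detected class.
```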

However, that is intractable since we have a huge number of random variables. Luckily, RCNs are inherently graphical models, which means they exploit the conditional independence structure between their random variables to make the task more tractable. To be more concrete, notice that a channel can directly infer the state it should be in (e.g., true or false in the case of feature channels) once it knows the states of the channels directly wired to it; it doesn’t need to know the state of every channel in the RCN. In other words, a channel is conditionally independent of all other channels given the channels it is wired to. This saves us a huge amount of the computation that a full joint distribution model would have needed.
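
A quick back-of-the-envelope comparison shows how much this buys us. The numbers below are hypothetical, assuming a model with 50 binary channels where each channel is wired to at most 3 others:

```python
# Why conditional independence helps: compare the size of one full joint
# table against the sizes of the per-channel local tables. Hypothetical
# model: n binary channels, each wired to at most k other channels.
n, k = 50, 3

# A full joint distribution needs one entry per complete assignment.
full_joint_entries = 2 ** n

# A factored model only needs, per channel, a table over that channel
# and the channels it is wired to.
factored_entries = n * 2 ** (k + 1)

print(f"full joint table: {full_joint_entries:,} entries")  # ~10^15
print(f"factored tables:  {factored_entries:,} entries")    # 800
```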

There are efficient algorithms that exploit exactly this kind of conditional independence, namely belief propagation algorithms, which the authors use. Explaining belief propagation fully is beyond the scope of this article, and it isn’t a new technique, but the gist is that we propagate information (called messages) from the bottom-most layer, where the evidence is, to the top-most layer, layer by layer. The information each channel sends is just a single number for each state its parent can be in. This number doesn’t have much meaning on its own, in my opinion; it falls out of the algebraic manipulations required to get the maximum-probability state of a given random variable in the graphical model. Running a forward (bottom-up) pass of the algorithm gives us the probability of each state of each channel in the top-most layer, and choosing the channel with the highest (state = ‘true’) probability answers our question.
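
Here is a minimal max-product message-passing sketch on a toy chain C → F → X with the observed X at the bottom. The probability tables are hypothetical, not RCN parameters, but the upward message is exactly the “single number per parent state” described above:

```python
import numpy as np

# Max-product belief propagation forward pass on a toy chain C -> F -> X,
# where X is observed. In a real RCN the messages flow up through many
# layers of feature and pool channels.
p_c = np.array([0.5, 0.5])            # prior over class channel C
p_f_given_c = np.array([[0.9, 0.1],   # p(F | C): rows index C, cols index F
                        [0.1, 0.9]])
p_x_given_f = np.array([[0.8, 0.2],   # p(X | F): rows index F, cols index X
                        [0.2, 0.8]])

observed_x = 1

# Message from X up to F: one number per state of the parent F.
msg_x_to_f = p_x_given_f[:, observed_x]

# Message from F up to C: for each state of C, the best (max) score
# over F's states, folding in the message from below.
msg_f_to_c = np.max(p_f_given_c * msg_x_to_f[np.newaxis, :], axis=1)

# "Belief" at the top: pick the class channel with the highest score.
belief_c = p_c * msg_f_to_c
print("scores per class:", belief_c, "-> detected class:", np.argmax(belief_c))
```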

Object Reconstruction

Unlike ConvNets, RCNs can naturally handle multiple objects in a scene and also reconstruct them. For reconstruction, we make a belief propagation backward pass from the channel of the detected class in the top-most layer down to the channels in the bottom-most layer. This time, the backward-pass messages give us the most likely assignment to all the channels in the RCN, an approximate global MAP solution. With that, we know which channels in the bottom-most layer should be in (state = ‘true’) and can construct an edge map from their patch descriptors. This reconstructs the whole object even if part of it is occluded.
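
Continuing the toy chain from the previous sketch, the backward pass just backtracks the argmax choices made on the way up. The patch descriptors here are a made-up stand-in for the real ones:

```python
import numpy as np

# Backward pass on the toy chain C -> F -> X: after the forward pass
# picks a class, backtrack the argmax choices to recover the most likely
# state of every channel (an approximate MAP assignment). Tables are the
# same hypothetical ones as in the previous sketch.
p_f_given_c = np.array([[0.9, 0.1],
                        [0.1, 0.9]])
p_x_given_f = np.array([[0.8, 0.2],
                        [0.2, 0.8]])
observed_x = 1

msg_x_to_f = p_x_given_f[:, observed_x]

detected_c = 1  # winner of the forward pass (see previous sketch)

# Given the winning class, pick the feature state that achieved the max
# in the upward message. In a full RCN this descends layer by layer down
# to the bottom-most channels.
best_f = int(np.argmax(p_f_given_c[detected_c] * msg_x_to_f))

# The channels set to 'true' at the bottom index patch descriptors
# (here just a stand-in dictionary) that are stamped onto an edge map.
patch_descriptors = {0: "vertical edge", 1: "horizontal edge"}  # hypothetical
print("MAP assignment: C =", detected_c, ", F =", best_f)
print("reconstructed edge feature:", patch_descriptors[best_f])
```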

Multi-Object Detection

To detect multiple objects in a scene, instead of just picking the most probable channel in the top-most layer after the forward pass, we want to pick the set of candidate channels that best explains the image. The authors develop a scene scoring function that scores the reconstruction produced by a set of candidates, which lets us compare sets against each other. They obtain the best set of candidates by optimizing this scoring function with an approximate dynamic programming method.
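
The sketch below is a deliberately simplified stand-in: it replaces the paper’s approximate dynamic programming with a greedy loop and invents a toy scoring function that rewards explained evidence and penalizes overlap, just to illustrate what comparing candidate sets looks like:

```python
# Each hypothetical candidate: (label@position, set of edge-map pixels it
# explains, its forward-pass score). All values are made up.
candidates = [
    ("A@(10,4)", {1, 2, 3, 4}, 0.9),
    ("B@(30,4)", {6, 7, 8}, 0.8),
    ("A@(11,5)", {2, 3, 4, 5}, 0.85),  # overlaps the first candidate
]

def scene_score(selected):
    """Toy score for a set of candidates: reward explained evidence once,
    penalize double-explaining the same pixels."""
    explained, total, overlap = set(), 0.0, 0
    for _, pixels, score in selected:
        overlap += len(pixels & explained)
        explained |= pixels
        total += score
    return total + 0.1 * len(explained) - 0.5 * overlap

# Greedily add any candidate that improves the scene score, until no
# addition helps. (The paper optimizes this kind of objective with an
# approximate dynamic programming method instead.)
selected = []
improved = True
while improved:
    improved = False
    for cand in candidates:
        if cand not in selected and scene_score(selected + [cand]) > scene_score(selected):
            selected.append(cand)
            improved = True

print("selected objects:", [name for name, _, _ in selected])
```

Running this selects the non-overlapping ‘A’ and ‘B’ candidates and rejects the duplicate ‘A’ hypothesis, which is the behavior the scene scoring function is meant to encourage.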

That’s all for this article. If you want to learn more about RCNs, you can check the paper [5] and its accompanying supplementary material, or you can read the rest of my articles, which cover how RCNs learn and the results of applying them to different datasets.
