Hinton’s talk pointed out that the largest source of variance among images of a given object is linear and that max pooling does not capture that linearity. As a simple example, consider viewing a complex object, e.g., a human body, rotating in 3-space. Most small patches of the object will undergo linear transformation most of the time. Moments of discontinuous change, e.g., where one limb occludes another, will be relatively rare. But in max pooling only one unit wins in the relevant feature map at any given time. Thus, the smallest unit of change in the output of a max pooled feature map is a discrete change to a different winning unit. Consequently, representing numerous small linear variations requires exponentially many units; otherwise the numerous linear variations are not explicitly represented. In Capsules, the output is the entire activity vector over the capsule’s units; thus, possibly many linearly varying instances of an object can be mapped to different vectors.
Two other properties are of note in the paper and talk. The model’s dynamics (combination of the learning and routing equations) will tend to drive the representation within any one capsule towards localism. That is, the individual underlying dimensions of variation (latent variables, or factors) will tend to become primarily correlated with individual units. In their Figure 4, for example, four of the six units depicted represent single factors. The other two perhaps represent linear combinations of two factors. But that is still localism, just where the representee (the linear combination of factors, e.g., “Scale and thickness”) may not have a symbolic name. Moreover, the capsules themselves are described as representing individual objects (e.g., parts), which is again a localist representation. Thus Capsules appears to operate essentially localistically. Unfortunately, this precludes realizing the exponential efficiency advantage that distributed representation has over localism. Capsules is a localist, compositional model. But as I discuss here, compositionality is not distributedness.
Second, the length of the activity vector over a capsule’s units encodes the probability of presence of the object instance (i.e., the object with its particular setting of factors). This clearly means that at any one moment, exactly one probability can be represented. Formally, exactly one object instance and its probability are communicated to downstream computations, e.g., capsules in the next higher level.
However, there is another model, Sparsey, which solves the problem of being able to simultaneously represent the likelihoods of all hypotheses stored in a coding field and communicate their influences to all downstream computations, including recurrently to the same coding field on the next time step. The key is to represent hypotheses as sparse distributed representations (SDRs), i.e., relatively small subsets of binary units chosen from a much larger field, e.g., 70 out of 1,400. Unlike most other SDR models, Sparsey’s coding field is organized as a set of Q winner-take-all (WTA) competitive modules (CMs), each having K units. Thus, every code is of size Q, with one winner in each CM, and the code space is of size K^Q. Sparsey’s coding field has been proposed as an analog of the cortical macrocolumn, and the CM as an analog of the minicolumn (specifically, of the L2/3 pyramidal population of the minicolumn); hence, biologically realistic values are Q=70 and K=20.
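To make the structure of such a coding field concrete, here is a minimal Python sketch. It only illustrates the layout described above (Q CMs of K units each, one winner per CM); it is not Sparsey's code-selection algorithm. The function name `random_code` and the choice to pick winners at random are my own assumptions for illustration.

```python
import random

Q = 70  # number of winner-take-all competitive modules (CMs)
K = 20  # number of binary units in each CM


def random_code(rng):
    """Return one code as a tuple holding the winning unit index in each CM.

    Winners are chosen at random purely for illustration; this is an
    assumption of the sketch, not how Sparsey selects winners.
    """
    return tuple(rng.randrange(K) for _ in range(Q))


rng = random.Random(0)
code = random_code(rng)

field_size = Q * K        # total binary units in the coding field
active_units = len(code)  # one active unit per CM, so always Q
code_space = K ** Q       # number of distinct codes the field can express

print(field_size, active_units)
print(f"{code_space:.3e}")
```

With Q=70 and K=20, the field has 1,400 units of which exactly 70 are active in any code, matching the "70 out of 1,400" example above.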
In this setting, i.e., when all the codes stored in the coding field are of fixed size Q, the likelihood of any particular stored code can be naturally represented by the fraction of its units that are active, i.e., the fraction of its units that belong to the current fully active code. Moreover, the likelihoods of all stored codes can be simultaneously represented in this way. Thus, the fully active code at any moment simultaneously represents both the single most likely hypothesis (at 100% strength) AND the full likelihood distribution over all stored hypotheses. This is concisely explained in Rinkus 2012. Even if a particular hypothesis has only 3 of its units active, out of Q=70 (i.e., ~4% likelihood), those units are still active and will send outputs to downstream computations. Thus, Sparsey constitutes a simple mechanism whereby non-maximally likely, and even extremely low-likelihood, hypotheses can materially influence the next state of the system, and in the right proportion. The other essential ingredient of this overall concept is a learning mechanism that assigns similar inputs (i.e., object instances) to similar SDR codes, where SDR similarity is measured by size of intersection. Twenty years ago I discovered a method for doing this in fixed time, i.e., in time that remains constant as the number of stored codes (hypotheses) increases (1996 Thesis). Assuming that similarity at least approximately correlates with likelihood, Sparsey then constitutes a model which can update the full distribution over stored hypotheses from one moment to the next in fixed time. This is further elaborated in my recent arXiv paper, “A Radically New Theory of how the Brain Represents and Computes with Probabilities”.
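The overlap-as-likelihood reading can be sketched in a few lines of Python. This is a toy under stated assumptions: the three stored codes are hand-made so that their overlaps with the active code are easy to check (they were not learned by Sparsey), and the helper name `overlap_fraction` is hypothetical. Each code is written as a tuple giving the winning unit index in each of the Q CMs.

```python
Q = 70  # codes have a fixed size: one winning unit in each of Q CMs


def overlap_fraction(stored, active):
    """Fraction of a stored code's Q units that are in the active code.

    Two codes share a unit in a CM exactly when they have the same
    winner index in that CM.
    """
    if len(stored) != len(active):
        raise ValueError("codes must have the same number of CMs")
    shared = sum(s == a for s, a in zip(stored, active))
    return shared / len(stored)


# Hand-made example codes (an assumption of this sketch, not learned codes).
active = tuple([0] * Q)                 # the currently fully active code
stored_codes = {
    "A": active,                        # identical to the active code
    "B": tuple([0] * 35 + [1] * 35),    # shares 35 of 70 units
    "C": tuple([0] * 3 + [2] * 67),     # shares only 3 of 70 units
}

# One pass over the active code yields a graded value for every stored code.
likelihoods = {
    name: overlap_fraction(code, active) for name, code in stored_codes.items()
}

for name, value in likelihoods.items():
    print(name, round(value, 3))
```

Hypothesis C is far from the winner, yet its 3 shared units are genuinely active and so would still contribute to whatever the coding field sends downstream. Note that this sketch loops over the stored codes only to display the fractions; it says nothing about how Sparsey achieves fixed-time updating.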
Sparsey does a very general version of ‘routing’ as well, but that will be another post.