Learned Multidimensional Indexes

Papers and talks on the “learned indexes” concept introduced by Kraska et al. in 2017 often close by suggesting an exciting future research direction: “multidimensional indexes”.

I described a multidimensional index concept about 10 years ago (unpublished) but referred to it as learning representations that were “simultaneously physically ordered on multiple uncorrelated dimensions”. The crucial property that allows informational items, e.g., records of a database consisting of multiple fields (dimensions, features), to be simultaneously ordered on multiple dimensions is that they have the semantics of extended bodies, as opposed to point masses. Formally, this means that items must be represented as sets.

Why is this? Fig. 1 gives the intuition. Let there be a coding field consisting of 12 binary units, and let the representation, or code, of an item, which we denote with the Greek letter φ, be a subset consisting of 5 of the 12 units. First consider Fig. 1a. It shows a case where three inputs, X, Y, and Z, have been stored. To the right of the codes, we show the pairwise intersections of the three codes. In this possible world, PW1, the code of X is more similar to the code of Y than to that of Z. We have not shown you the specific inputs, and we have not described the learning process that mapped these inputs to these codes. We make only one key assumption: the learning process preserves similarity, i.e., it maps more similar inputs to more highly intersecting codes. Given this assumption, we know that:

sim(X,Y) > sim(X,Z), and

sim(Y,Z) > sim(X,Z).

Fig. 1

That is, this particular pattern of code intersections imposes constraints on the statistical (and thus, physical) structure of PW1. Thus, we have some partial ordering information over the items of PW1. We don’t know the nature of the physical dimensions that have led to this pattern of code intersections (since we haven’t shown you the inputs or the input space). We only know that there are physical dimensions on which items in PW1 can vary and that X, Y, and Z have the relative similarities (relative orders) given above.
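To make this concrete, here is a minimal sketch in Python of one possible PW1. The specific 5-of-12 codes below are hypothetical (Fig. 1a’s actual codes are not reproduced here); they are merely hand-picked to satisfy the two constraints above, with intersection size standing in for sim().

```python
# Coding field of 12 binary units; each code is a set of 5 of them.
# These specific codes are hypothetical; they merely satisfy PW1's constraints.
phi = {
    "X": {0, 1, 2, 3, 4},
    "Y": {3, 4, 5, 6, 7},        # |phi(X) ∩ phi(Y)| = 2
    "Z": {4, 5, 6, 8, 9},        # |phi(X) ∩ phi(Z)| = 1, |phi(Y) ∩ phi(Z)| = 3
}

def sim(a, b):
    """Similarity of two stored items = size of the intersection of their codes."""
    return len(phi[a] & phi[b])

assert sim("X", "Y") > sim("X", "Z")   # sim(X,Y) > sim(X,Z)
assert sim("Y", "Z") > sim("X", "Z")   # sim(Y,Z) > sim(X,Z)
```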

[N.b.: In this essay, the word “dimension” is synonymous with “latent variable” or “principal component”, not with the smallest independent element of the input vector. For example, the actual dimensionality of 12x12-pixel images is 144. But the set of objects that might appear in a particular training set of such images might be well characterized by just a few latent variables, e.g., height, width, horizontal centroid, vertical centroid. So the main point of this essay is that when representations are entities that fundamentally have extension (cannot be reduced to a point entity), as sets do (but vectors do not), multiple uncorrelated orderings, i.e., orderings on latent variables, can be simultaneously physically present in the set of such set representations. Also note that in the actual model/technology that spawned this essay, Sparsey, the representations are sets of substantial size chosen from a much larger field, e.g., 100 units chosen from 10,000. These are called sparse distributed representations (SDRs). But the much smaller scale, i.e., 5 units chosen from 12, is sufficient for this essay.]

Returning to the example of Fig. 1a, note that even without being given names for the underlying dimensions (latent variables), this model can in fact attach reasonable names to them. That is, knowing only that the learning process that assigns these set representations to input items preserves similarity, one can see just from looking at the intersections that there is some input-space dimension on which Y is more similar to X than Z is. Thus, we could call this dimension “X-ness”: Y has more X-ness than Z does. Similarly, there is another physical dimension present that we can call “Y-ness”, and Z has more Y-ness than X does. Or, we could label that dimension “Z-ness”, in which case we’d say that Y has more Z-ness than X does. This underscores the foundational importance of unsupervised learning over supervised learning: only a relatively small amount of supervised learning is needed on top of an unsupervised (or unsupervised + reinforcement learning) base. This notion of naming latent variables after individual exemplars seems quite consistent with how young children learn.
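Continuing the hypothetical PW1 sketch above, naming a latent dimension after an exemplar amounts to measuring each item’s overlap with that exemplar’s code; the x_ness and y_ness helpers below are illustrative only.

```python
# "X-ness" of an item is its code's overlap with X's code; "Y-ness" is its
# overlap with Y's code.  These helpers are illustrative, not part of Sparsey.
def x_ness(item):
    return len(phi[item] & phi["X"])

def y_ness(item):
    return len(phi[item] & phi["Y"])

assert x_ness("Y") > x_ness("Z")   # Y has more X-ness than Z does
assert y_ness("Z") > y_ness("X")   # Z has more Y-ness than X does
```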

Now, consider Fig. 1b. It shows an alternative set of codes for X, Y, and Z that would result if the world had a slightly different physical structure. Actually, the only change is that Y has a slightly different code, with a slightly higher intersection with X’s code (purple units). Thus, the only difference between PW2 and PW1 is that in PW2, whatever physical dimension X-ness corresponds to, item Y has more of it than it does in PW1, as reflected by the fact that |φ(X) ∩ φ(Y)| = 3 in PW2, but only 2 in PW1. All other pairwise relations are the same in PW2 as in PW1. Thus, the representations of the items have the internal degrees of freedom to allow similarity on one dimension to change without affecting similarities on other dimensions. Though this is a small example, the principle scales to higher dimensions, more latent variables, and larger numbers of items. Essentially, the principle elaborated here leverages the combinatorial space of set intersections (and intersections of intersections, etc.) to counteract the curse of dimensionality.
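Under the same hypothetical codes, PW2 can be sketched by changing only Y’s code: the X–Y intersection grows from 2 to 3 while every other pairwise intersection is left untouched. The specific replacement code is, again, an assumption consistent with the text rather than the figure.

```python
# PW2: only Y's code changes, gaining one more unit in common with X
# (the "purple" units in Fig. 1b).  The specific code is assumed.
phi_pw2 = dict(phi)
phi_pw2["Y"] = {2, 3, 4, 5, 6}     # was {3, 4, 5, 6, 7} in PW1

def sim2(a, b):
    return len(phi_pw2[a] & phi_pw2[b])

assert sim2("X", "Y") == 3 and sim("X", "Y") == 2   # X-Y intersection grew
assert sim2("Y", "Z") == sim("Y", "Z")              # Y-Z relation unchanged
assert sim2("X", "Z") == sim("X", "Z")              # X-Z relation unchanged
```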

The example of Fig. 1 shows that when items are represented as sets, they have the internal degrees of freedom that allow their degree of physical similarity on one dimension to be varied while maintaining their degree of similarity on another dimension (more generally, on many other dimensions). We actually made a somewhat stronger claim at the outset, i.e., that items represented as sets can simultaneously exist in physical order on multiple uncorrelated dimensions. Fig. 2 shows this directly, for the case where the items are in fact simultaneously ordered on two completely anti-correlated dimensions.

In Fig. 2, the coding field consists of 32 binary units and the convention is that all codes stored in the field will consist of exactly 8 active units. We show the codes of four items (entities), A to D, which we have handpicked to have a particular intersection structure. The dashed line shows that the units can be divided (for purposes of our analysis) into two disjoint subsets, each representing a different “feature” (latent variable) of the input space, e.g., Height (H) and IQ (Q). Thus, as the rest of the figure shows, the pattern of code intersections simultaneously represents both the order, A > B > C > D, for Height and the anti-correlated order, D > C > B > A, for IQ.

Fig. 2
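Since Fig. 2’s actual codes are not reproduced here, the sketch below assumes a simple sliding-window intersection structure: each item contributes 4 active units to each 16-unit subfield, and an ordering is read off as monotonically decreasing overlap with the top-ranked item’s subcode. The codes dictionary and the H/Q split at unit 16 are illustrative assumptions.

```python
# Coding field: 32 binary units, 8 active per code, split into a Height
# subfield H (units 0-15) and an IQ subfield Q (units 16-31).
H = set(range(0, 16))
Q = set(range(16, 32))

codes = {   # hand-picked, sliding-window structure; not Fig. 2's actual codes
    "A": {0, 1, 2, 3} | {19, 20, 21, 22},
    "B": {1, 2, 3, 4} | {18, 19, 20, 21},
    "C": {2, 3, 4, 5} | {17, 18, 19, 20},
    "D": {3, 4, 5, 6} | {16, 17, 18, 19},
}

def overlap(a, b, subfield):
    """Code intersection size restricted to one subfield."""
    return len(codes[a] & codes[b] & subfield)

# Height order A > B > C > D: overlap with A's H-subcode falls off monotonically.
assert [overlap("A", x, H) for x in "ABCD"] == [4, 3, 2, 1]
# IQ order D > C > B > A: overlap with D's Q-subcode falls off monotonically.
assert [overlap("D", x, Q) for x in "DCBA"] == [4, 3, 2, 1]
```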

The units comprising this coding field may generally be connected, via weight matrices, to any number of downstream coding fields, which could “read out” different functions of this source field, e.g., access the ordering information on either or both of the two subfields, H and Q.
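One way such a readout could work, sketched under the assumption of a simple linear downstream unit (this is not a claim about Sparsey’s actual readout mechanism), reusing the Fig. 2 codes above:

```python
# A downstream unit whose weights are 1 on a reference subcode and 0 elsewhere
# recovers an ordering as a weighted sum over the source field's active units.
def readout(weights, item):
    """Activation of a downstream unit for one item's code."""
    return sum(weights.get(u, 0) for u in codes[item])

height_weights = {u: 1 for u in codes["A"] & H}   # reads the H subfield
iq_weights     = {u: 1 for u in codes["D"] & Q}   # reads the Q subfield

assert sorted(codes, key=lambda x: -readout(height_weights, x)) == list("ABCD")
assert sorted(codes, key=lambda x: -readout(iq_weights, x)) == list("DCBA")
```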

The point of these examples is simply to show that a set of extended objects, i.e., sets, can simultaneously be ordered on multiple uncorrelated dimensions. But there are other key points, including the following.

  1. Although we hand-picked the codes for these examples, Sparsey, the model that gave rise to the realization described in this essay and that is founded on a particular format of fixed-size sparse distributed representation (SDR), is a single-trial, unsupervised learning model in which the ordering (similarity) relations on multiple latent variables emerge automatically. Sparsey is described in detail in several publications (1996 thesis, 2010, 2014, 2017 arXiv).
  2. While conventional, localist DBs use external indexes (typically trees, e.g., B-trees, KD-trees) to achieve log-time best-match retrieval, the set-based representational framework described here allows fixed-time (no serial search) approximate best-match retrieval on multiple, uncorrelated dimensions. Equally crucially, the learning algorithm that assigns these codes to input items is also fixed-time. And there are no external indexes: all the “indexing” information is internal to the representations of the items themselves, i.e., in the patterns of intersections over those representations (see the illustration-only sketch after this list). Thus, there is no need for these set objects to exist in an external coordinate system in order for the similarity/ordering relations to be represented and used.
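To make the second point concrete, the illustration-only sketch below retrieves a best-matching item using nothing but the stored codes themselves, reusing the Fig. 2 codes. The explicit loop over stored items is emphatically not the fixed-time mechanism claimed above; it only demonstrates that the needed indexing information is internal to the representations.

```python
# Illustration only: best-match retrieval with no external tree or index.
def best_match(query_code, stored):
    """Return the stored item whose code intersects the query code the most."""
    return max(stored, key=lambda item: len(stored[item] & query_code))

# Query with a corrupted version of B's code: drop one unit, add a spurious one.
noisy_B = (codes["B"] - {4}) | {10}
assert best_match(noisy_B, codes) == "B"
```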

Finally, I underscore two major corollary realizations that bear heavily on understanding the most expedient way forward in developing “learned indexes”, and more to the point, learned multidimensional indexes.

  1. A localist representation cannot be simultaneously ordered on more than one dimension. That’s because localist representations have point mass semantics. All commercial databases (DBs) are localist: the records of a DB are stored physically disjointly. True, records may generally have fields pointing to other records, allowing sharing of lower-level records by multiple higher-level records, resulting in hierarchical composite structures. But any record must have at least some portion that is physically disjoint from all other records. The existence of that portion implies point mass semantics, and (ignoring the trivial case where two or more fields of the records of a DB are completely correlated) a set of points can be simultaneously ordered (arranged) on at most one dimension at a time. This is why a conventional DB generally needs a separate external index for each dimension, or tuple of dimensions, on which the records need to be ordered so as to allow fast best-match retrieval (a toy sketch follows this list).
  2. In fact, dense distributed representations (DDRs), e.g., vectors of reals, such as those present in the internal fields of most mainstream machine learning / deep learning models, also formally have point mass semantics. Intersection is formally undefined for vectors of reals. Thus, any similarity measure between vectors (points) must also have point mass semantics, e.g., a Euclidean distance is a scalar. Consequently, DDR also precludes simultaneous ordering on multiple uncorrelated dimensions.
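Here is the toy sketch referred to in point 1, with hypothetical records whose two fields are anti-correlated as in Fig. 2: a localist table can sit in only one physical order at a time, so each additional ordering has to be maintained as a separate external index.

```python
# Hypothetical localist records with two anti-correlated fields.
# Each additional ordering must live in a separate external index
# (here, a sorted list of keys standing in for a B-tree).
records = {
    "A": {"height": 190, "iq": 100},
    "B": {"height": 180, "iq": 110},
    "C": {"height": 170, "iq": 120},
    "D": {"height": 160, "iq": 130},
}

by_height = sorted(records, key=lambda k: -records[k]["height"])   # A, B, C, D
by_iq     = sorted(records, key=lambda k: -records[k]["iq"])       # D, C, B, A

assert by_height == list("ABCD")
assert by_iq == list("DCBA")
assert by_height == list(reversed(by_iq))   # one physical order cannot serve both
```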

Fig. 3 gives a final illustration of the relationship between viewing items as being represented by points and viewing them as being represented by sets. Here the three stored items are purple, blue, and green. Fig. 3a begins by showing the three items as points with no internal structure, sitting in a vector space and having particular similarity (distance) relationships: purple and blue are close, and both are far from green. In Fig. 3b, we show set representations of the three items consistent with the distance relations of the points in Fig. 3a. There is one coding field here, consisting of 6x7 = 42 binary units, and red units highlight the intersections with the purple item’s representation. Fig. 3c shows that the external coordinate system is no longer needed to represent the similarity (distance) relationships, having been replaced by the “internal coordinate system” (the pattern of intersections), and Fig. 3d reinforces the fact that there is really only one coding field here and that the three codes are just different activation patterns over that single field.

The change from formally representing items as points in an external space to representing them as sets (extended bodies) that require no external space will revolutionize AI / ML.

Fig. 3