# Fluid concepts and creative probabilities

## Concept learning through probabilistic programming.

#### Back to the future: concepts in the time of the great A.I. hype

“Then [the term ‘AI’] slid down the slope that ends up in meaningless buzzwords and empty hype. (…) Luckily, a new term was just then coming into currency — ‘cognitive science’ — and I started to favor that way of describing my research interests, since it clearly stresses the idea of fidelity to what actually goes on in the human mind/brain.” — Douglas Hofstadter (before 2000, just to be clear)

Imagine someone came to you with the following puzzle: given the sequence of integers `0, 1, 2, …`, guess the next item. It does not seem that hard, does it? `3` will be our obvious answer as we would assume the “generator function” could be something as simple as `f(x) = x`:

`f(0) = 0f(1) = 1f(2) = 2f(3) = ??? -> 3...`

As it turns out, this is not the function generating this sequence. The sequence continues as follows:

`f(0) = 0f(1) = 1f(2) = 2f(3) = 720!f(4) = ???...`

So, what is `f(4)`?

We won’t spoil the surprise (see the very end for the solution), but use this nice puzzle to discuss a more general problem: given the universe of integers and functions built out of arithmetical operations (addition, subtraction, multiplications, etc.), how can we learn the “generator function” of a sequence given that the number of possibilities is infinite?

[At Tooso we specialize in formal models of human language — if you think solving integer sequences is silly, you can think of English syntax as the “generator function” of what you are reading right now and how hard it is for computers to do human-like inferences on text. Enough about language for now, let’s get back to integers!]

The sequence game came to my mind, serendipitously, while exploring probabilistic programming (PP), which is often explained as an attempt to unify general purpose programming with probabilistic modeling. Our interest in PP lies in the ability to reason explicitly about structures and deal well with uncertainty even without lots of data: good luck with “machine-learning-your-way” out of this:

`f(1) = 1f(2) = 4f(3) = ???`

The sequence game is part of the amazing Fluid Concepts and Creative Analogies (fun fact: it was the first book sold on Amazon!), a 1995 book by A.I. pioneer, best-selling author and cognitive scientist Douglas Hofstadter (DH). In his unparalleled writing ability, this is how the problem gets introduced:

The mathematical origin of this sequence suggests that there will indeed be pattern or order, but as math itself is full of patterns of the most enormous diversity, the field is still wide open. Still, we have built-in prejudices in favor of simplicity and elegance, whether or not we can define those notions. So, given our expectation that we are likely to find a simple and elegant rule, what would we expect to happen next?

In this brief post we will rediscover the sequence game and discuss concept learning with some simple PP programs. On the one hand, it’s going to be a good exercise in understanding PP, as we shall see how to treat built-in prejudices in a satisfactory and principled way; on the other, we shall use PP to get insights into recent progresses in Artificial Intelligence.

DISCLAIMER: this post is not a tutorial on PP nor an academic work (if anything, because it is unapologetically biased) nor a collection of production-ready models — while runnable code samples are included, our interest today lies almost exclusively in the type of modeling that is hard with standard ML tools and comes instead natural with PP. As a bonus, we will get to discuss cognitively plausible ideas about concept learning that go beyond “curve fitting”.

#### Going forward, looking backward: probabilistic modeling 101

“Million to one odds happen eight times a day in New York.” — Penn Jillette

In our life as A.I. practitioners and startup founders, we found multiple times that a lot of ideas in Probabilistic Programming are still far from being mainstream in the data science community. Convinced by this anecdotal evidence, we decided to include an overview of PP and what makes it special for interesting inference problems, such as the ones you meet while teaching language to machines (the reader familiar with PP/WebPPL can quickly skip through the next two sections).

We will make heavy use of WebPPL examples and take full advantage of the possibility of running code in the browser: all examples are fully runnable from this simple HTML page.

Let’s start with a simple example (with dice, of course). If we have two fair dice, what is the probability of getting 8 when you throw them? This WebPPL program (remember: you can run it here) displays the distribution of outcomes in a nice histogram (if something is not clear, don’t panic: it will be obvious soon):

The cool thing about PP is how easy it is to do the inverse inference: given that we observe that the sum is 8, what is the chance of one die being a 2?

As the reader may have suspected, there is a 20% chance of die #1 being a 2 if the total is 8 (quiz for the lazy reader: why there is no 1 in the histogram above?).

While this may look like an overtly simple example, it is really important to appreciate the “power of models”, i.e. simplified versions of the processes generating our data of interest. Models are a very compact representation of the complexity of the world: within a single model, you can in fact reason from cause to effect (forward, when you plan) and from effect to causes (backward, when you make educated guesses). Models are also at the core of our cognitive abilities: by having a “Tangram model”, kids know how to build a duck from scattered pieces (forward) and from the shape of a duck they can guess which pieces are used to build it (backward).

For many people, model-based reasoning is the key to intelligence, with applicability well beyond Tangram (such as intuitive physics and visual learning):

A generative model describes a process, usually one by which observable data is generated, that represents knowledge about the causal structure of the world. These generative processes are simplified “working models” of a domain (…); many different questions can be answered by interrogating the mental model.

For today’s more modest purpose than understanding human intelligence, it is just important to appreciate how PP can help. As we have seen, these models are often called “generative models”: without bothering too much with technical details, what is important for us are the following two considerations:

• generative models tell a story on how data points are generated: they have a structure that can be inspected and they often embody causal assumptions that humans can understand, question, debate;
• generative models can be used to, well, generate new instances of our target data, which is a feature not shared by every Machine Learning model out there. For example, if we train generative models to classify news articles in sport vs finance, we could use them to create new articles on these topics.

While people have been writing generative models by hand for a long time, the golden age of probabilistic modeling just started. The rise of PP solved two major problems for practitioners: first, PP brings to the table the expressivity of programming languages: by using familiar tools data scientists can now express arbitrary complex models instead of a few distributions; second, PP pushes the burden of inference down to the compiler (and related tools): researchers focus on modeling since the computer will solve the required inferential challenge.

#### Express generative models in Javascript: WebPPL 101

“The limits of my programming language mean the limits of my world.” — (almost) Ludwig Wittgenstein

Now that we have some shared context on why generative models are so freakin’ cool, we can go back to the question of how they are implemented in languages such as WebPPL. If you recall our first example, the model is specified as a simple function, wrapped in the Infer method:

`model() {    var die1 = randomInteger(6) + 1;    var die2 = randomInteger(6) + 1;`
`    return die1 + die2;}`

What happens when it runs? Since `enumerate` is the inference method specified, the program will run the function once for all the possible values of the random variables in the model (i.e. random integers from 1 to 6 for each die) and collect the results in a distribution (that gets visualized through the handy `viz` utility). Since the model is generative, it should be possible to sample from it and create new values (i.e. simulate dice rolling): the `sample` function does exactly that, as you can see by adding the following statement to the cell block (it will print an array of 5 integers sampled from the model):

`print(repeat(5, function() { sample(roll); }));`

The second example introduces a key idea of PP: conditioning. The model is the same as before but with a new command:

`condition(die1 + die2 == 8)`

What happens when it runs? `condition` tells the program to ignore runs that do not meet the required condition: this is why no 1 shows up in the distribution — if one die is 1, there is no scenario where the sum is 8 (as 1+6=7).

Conditioning is important for at least three reasons: it lets you reason from observations to causes; it helps stating, in a principled way, information that may constrain the space of program runs to be considered; finally, it can be used to model a very important concept in contemporary A.I., learning, through an 18th century idea known as Bayes’ theorem: your confidence in an hypothesis H given some data D should be related to a) how much you believe in H in the first place and b) how much H can explain D.

We conclude our outrageously short introduction to WebPPL with a slightly juicier example on this topic. Consider a young field linguist — let’s call him Willard — in an exotic country with a very minimal world (a.k.a. ontology):

This exotic world has seven objects (named with integers 1–7) organized in a tree-like structure, with concepts on three levels of generality (e.g. everything is a mammal, dachshund is a type of dog, #3 is a dachshund). When #2 is present, Willard hears native speakers say bassotto (1000 bonus points to readers expecting the natives to say gavagai). What does bassotto mean: dachshund, dog or mammal? More generally:

given an unknown word W and a set of objects denoted by W, how can Willard learn the appropriate level of generality for W?

Unsurprisingly, probabilistic programming is perfectly suited to model this problem, as exemplified by the following sketch:

`learningModel` is where the magic happens. We can distinguish three main parts:

• first, sampling from the hypotheses space: Willard has some prior ideas on what can be named by W (summarized in the distribution `ontologicalTree`); in particular, Willard believes that “natural kinds” like dog and cat are more salient than the very generic mammal;
• second, the likelihood function: Willard knows how to prioritize hypotheses given observations (a set of objects); what matters in the formula is realizing that 1) lower-level concepts (retriever) will assign greater likelihood to the data than higher-level concepts (mammal); 2) as the number of consistent events increases (`observedData.length`), lower-level concepts get exponentially more likely;
• third, actually observing data, as `mapData` takes care of feeding what Willard observes to the scoring function.

[SMALL TECHNICAL NOTE: the original idea for this example comes from the work of Xu and Tenenbaum — see their paper for a bit more context on the psychological facts behind the model and an in-depth discussion on priors and likelihood.]

So, back to our example with #2 observed for the word bassotto, what is Willard thinking? Running the code gives us the following distribution:

It seems a reasonable guess, doesn’t it? Willard prefers dachshund but he is still keeping his mind open; of course, hypotheses inconsistent with the data (such as retriever, whose extension does not include #2) are not considered at all. The scenario gets really interesting when Willard observes #2 + #3 and hear bassotto:

What happened? If Willard was just excluding hypotheses inconsistent with the data, seeing two dachshunds instead of one would not have added meaningful information: simply put, Willard would have not learned anything new. Our probabilistic model allows instead to capture Willard’s learning: after seeing two dachshunds, he is much more confident that bassotto means dachshund, while he is almost ready to rule out mammal as a plausible interpretation (even if entirely consistent with the data, from a purely logical perspective). We leave it as an exercise to the reader to make sure all the code is clear: try changing priors, likelihood and observations to get a better grasp of what is going on behind the scenes!

If you want to toy around some more before continuing, some effective examples of the interplay between forward and backward inference can be found on probmods. While the field is still relatively niche in the data science community, things are moving fast: we discovered this excellent WIP last week and major players are getting into PP. For the convenience of non-lazy readers, we have included a bunch of additional books/links at the end.

Let’s get back to our sequence game now!

#### To “Seek Whence Cometh a Sequence”

“Don’t keep a man guessing too long — he’s sure to find the answer somewhere else.” — Mae West

Now that we have a better grasp on the philosophy behind PP, we can try and model our sequence game as a learning problem in an hypotheses space that is infinite. Given the sequence:

`f(1) = 1f(2) = 4f(3) = ???`

we will try to learn the sequence generator function conditioning (yes!) on the data we have (`f(1)` and `f(2)`), and then apply the function to the next item in the sequence (i.e. 3). Even restricting ourselves to elementary arithmetic functions, there are many functions that are compatible with our observations, such as:

`x * xx ^ xx ^ 2x ^ (0 + x)x * (x / 1)0 + (x ^ x)...`

In other words, this is a very different scenario than our previous examples: there is no finite set of hypotheses we can list and somehow score from the most to the least plausible. This problem is what DH had in mind when writing:

as math itself is full of patterns of the most enormous diversity, the field is still wide open.

Since we cannot list our hypotheses, we need to turn our attention to a process capable of generating infinite hypotheses: luckily for us, our knowledge of languages and grammars comes to the rescue. Consider the simple language thus defined:

`1. A, B, C ... Z are valid expressions2. If X is a valid expression, -X is also a valid expression3. If X and Y are valid expressions, X&Y is also a valid expression4. If X and Y are valid expressions, X^Y is also a valid expression`

How many valid expressions can be generated according to the grammar? The key to answer is to realize that instructions 2–4 take expressions as input and produce new expressions, so that the output of any instruction can become the input of another instruction (or the same one!):

`A -> -A -> -A&B -> -A&B&A -> --A&B&A -> ...`

So even a simple language like the one above is capable of producing a (countably, for the logically-inclined reader) infinite number of expressions.

Armed with this intuition, can we define the space of generator functions through a grammar? Of course we can! We just need to build a small grammar of arithmetic expressions, one that could potentially generate `x * x`, `x ^ x`, `x ^ 2`, etc. and priors over them (remember that the space is infinite, so we need a “generative” way to assign priors over all the possible functions). Here is a first attempt (this gist contains the grammar, while the full runnable code can be found here):

The code should be pretty easy to understand (the original code is from the excellent Goodman & Tenenbaum): with 50% probability, make up an expression by picking an integer (`1–9`) or a variable (`x`), otherwise combine two expressions into a new one (which will recursively pick randomly an integer and so on). We will just stress the “implicit” prior on the model, which is a prior on simplicity: other things being equal, a simpler function (defined now as grammatically shorter) should be preferred. Since fewer probabilistic choices are required to produce `x * x` than `x * (x / 1)`, while extensionally equivalent (as `x = x/1`), the former is considered a much more plausible hypothesis. Obviously, this is the formal equivalent of enforcing in a model what DH called:

[the] built-in prejudices in favor of simplicity and elegance

If we run the model on the sequence:

`f(1) = 1 f(2) = 4`

the preferred suggestion for `f(3)` is 9:

As noted by Goodman & Tenenbaum, “many people find the high probability assigned (…) to `27` to be unintuitive”. This suggests that as a cognitive model of human sequence discovery our priors could use a little reworking: `x times x` seems a more “natural” idea than `x raised to the power of x` (exercise for the non-lazy reader: change the code to reflect this bias and run the prediction again).

What kind of sequences can this model solve? Below there are some examples that can be tested by changing the `condition` statement:

`1, 4, 9 -> ??? x * x0, 3, 8 -> ??? (x * x) - 11, 3, 5 -> ??? x + (x - 1)`

[Exercise for the non-lazy reader: sometime the easiest way to learn generator functions is by taking higher-order sequences, i.e. realizing that `1, 4, 9, 16, 25, ...` produces a child sequence of `3, 5, 7, ...` as differences between pairs of numbers; you can then use the second sequence (the odd numbers) to make prediction about the original sequence. The reader is encouraged to extend the existing model to second order sequences: how does the prior look like in that case?]

The last thing we may want to do is going full circle: how close is the “PP approach” to the original ideas in Fluid Concepts and Creative Analogies? One of DH’s main points throughout the book is that concept representation is the problem of A.I.. If you start with human-specified concepts (like “symbolic A.I.”), you are begging the question of how those concepts are learned in the first place; on the other hand, complete bottom-up representations (at that time, a very early version of what today’s deep learning systems can do in tasks such as computer vision) work for low-level perceptual tasks, but lack most of the generality associated with human concepts. The architectures presented in Fluid Concepts are therefore a mixture of top-down and bottom-up processes, in which lower-level concepts are stochastically combined to build higher-level concepts, which in turn make some other low-level concepts more likely to be used to solve a given task.

When compared to our PP models, the most striking similarities are 1) the probabilistic exploration of an hypotheses space (which is structured through basic concepts) and 2) the application of compositional principles to re-use low-level concepts to build high-level ones.

We strongly invite the reader to get the unfiltered experience from the book as we would never do it justice with our overview. That said, we want to explore one more “fluid task” before the end: let’s get to the (digital) drawing board!

#### Learning spatial concepts through a “visual” grammar

“There are thousands of ways of drawing the letter ‘A’ , yet we recognize any of them at a glance. For any mental category (…), we immediately recognize thousands of instances, despite vast differences.” — Douglas Hofstadter

The last part of Fluid Concepts and Creative Analogies is devoted to Letter Spirit, a project whose goal is “designing and implementing a computer program that could itself design full-fledged typefaces”.

In the characteristic DH’s methodology of using toy problems to enlighten important cognitive phenomena, Letter Spirit is an experiment in human-like machine creativity that produced some pretty interesting font designs:

As humans know well, visual concepts are both very fluid (e.g. there are countless of ways in which we can draw an A), and yet very distinctive (e.g. humans can recognize instantly that some drawings are varieties on a common theme). It is therefore not surprising that teaching machines how to understand visual concepts is viewed fundamentally as the “quest for the conceptual essence of letters”:

To recognize something — that is, to match an item with the proper Platonic abstraction — there must be room for flexing both the Platonic abstraction and the mental representation of the thing itself, to make them meet. What kinds of distortion are sensible? Readjustment of internal boundaries, reassessment of connection strengths, and rearrangement of parts are at the core of the recognition process.

As we have seen times and times again in this post, the probabilistic approach highlights how understanding a concept C and generating new instances of C are indeed intimately related activities. Without attempting to reproduce the full complexity of Letter Spirit, can we discuss some interesting cases of visual learning within the PP paradigm? We will start with a variation on WebPPL computer vision demo: if we draw with n lines a (digital) polygon (triangle, square, pentagon), can a probabilistic program figure out how it was made in the first place?

The basic ingredients should be straightforward by now. First, we define a generative model — a “visual grammar” if you will — of polygons; a possible generative recipe is the following:

• lines are our primitive elements; lines are drawn based on four integers [x1, y1, x2, y2];
• start by sampling from the distribution [3, 4, 5] to pick the number of lines to be drawn; possibly, bias the sampling towards simpler models, i.e. drawing with as few lines as possible;
• if N is the number of lines to be drawn, pick N 4-tuples of integers to get N lines, and draw them on a canvas.

To become familiar with our visual generative model, you can try to create some lines in a separate code block (see a small gist to get you started). At runtime, we then use the power of PP to make the backward inference: given a drawing with such and such features, what are the instructions that could have created it?

To achieve this, we use WebPPL `factor` expression to weigh the iterations based on how close our polygon (generated with the recipe above) is to the target image we are trying to decode. Once we generated our distribution, we can inspect what the program learned about the visual concept of, say, square by repeatedly sampling from it:

`map(drawSamples, repeat(6, function() {sample(m);}));`

and displaying the results:

Not bad! In other words, we can use the model to generate infinite instances of the visual concept we learned: pretty neat, isn’t it? As usual, the non-lazy reader is encouraged to extend and improve the existing code sample (to go where probably no model has gone before!).

As a final exercise, inspired by one of our favorite companies, we added a bunch of ideas to the above visual grammar and completed our own “run-on-my-laptop-only” prototype to showcase the power of PP: no big data needed for training, no black-box optimization, no costly GPUs — everything happens in a standard web browser in a few seconds. Enjoy the video!

#### A random distribution of concluding remarks

“We (…) hope that the work we have described in this book may help lead, in the distant future, to architectures that go much further than we have toward capturing the genuine fluid mentality that Alan Turing so clearly envisioned when he first proposed his deservedly celebrated Test.” — Douglas Hofstadter

Without the presumption of exhausting the threads opened above, the following are some parting thoughts and more philosophical reflections on our journey so far in concept-driven Artificial Intelligence:

• it may be good to discuss a bit more explicitly where PP stands on hot topics compared to the current A.I. big trend in deep neural nets. In our opinion, probabilistic programming promises to be a principled approach that learn with less data and act in a more human way, i.e. both closer to what we know about human cognition and, even more importantly, easier to understand: while a lot has been done in recent times on the problem of explainability in deep learning models, it’s safe to say PP models have an edge here. Deep learning has gained incredible popularity by promising to eliminate tedious feature engineering in task such as image classification, i.e. by learning automatically useful representations; while that is true generally, in many real world cases, human experts do know a lot of relevant information about the inference at hand and PP has a straightforward way to include those data into the formal model: in a world that is still not ready (nor prepared) to provide you with a fully automated, sci-fi life, being able to collaborate in a shared language with your enterprise software applications is a quantum leap forward; as we have seen, probabilistic modeling is a very natural way to bridge the gap between human-understandable ideas and machine-learned knowledge.
• There are some very interesting attempts to apply PP to topics in computational semantics and pragmatics. At Tooso we are working on PP models for knowledge graphs, i.e. leveraging an implicit ontological structure to do quick and reliable inference for natural language queries. The sky is the limit though, as PP promises to marry two principles of language understanding that have been hard to reconcile up until now: first, language is compositional, as complex expressions are (kind of) predictably made out of simpler ones; second, communication is noisy, so that rigid inferences (e.g. purely deductive) will fail in almost all real-world use cases. Food for thoughts for the really-not-lazy reader with some free time (friends in academia, anyone?): counterfactual statements ask us to reason by conditioning on facts that did not happen (“if the first die was 6, the sum would not have been 5”) — is there some sort of PP (toy) model of communication that can make justice to that (the link between Bayesian modeling and counterfactuals comes up often in the relevant literature, but we are not aware of application specific to counterfactual semantics and we didn’t google much)?
• We mentioned Goodman & Tenenbaum quite a lot through the post, which certainly would have not been possible without their web book Probabilistic Models of Cognition. If you like to play around a bit more with concept learning, you should definitely check out the Rational Rules example; if you found the visual challenge of Letter Spirit particularly interesting, it should not surprise you that a recent work by Tenenbaum et al. addresses a very similar problem from a PP perspective: probabilistic program induction is one of the most exciting paradigms we have seen in recent years — and the best is surely (well, very likely) yet to come.
• We mentioned Douglas Hofstadter quite a lot through the post, which certainly would have not been possible without his book Fluid Concepts and Creative Analogies. As a final nostalgic note, please allow us to state how much we miss DH! He is mostly absent from mainstream media, not heavily involved in Twitter fights (O tempora! O mores!) and not cited much in “contemporary A.I.”. It’s a pity, as Hofstadter’s ideas are still thought-provoking and can provide a much needed original perspective on the field as a whole, in its titanic quest to better understand intelligence.
• The dachshund example (bassotto is the Italian word for dachshund) is in loving memory of my own, Omero.

If you survived this incredibly long blog post (including weird references, nerd papers and clumsy code samples), you deserve our deepest gratitude.

We do hope that, whatever your prior beliefs were, observing the data in this article convinced you to follow more closely what we are building at Tooso.

#### See you, space cowboys

If you have requests, questions, feedback or crazy ideas, please share your A.I. story with jacopo.tagliabue@tooso.ai.

Don’t forget to get the latest from Tooso on Linkedin, Twitter and Instagram.

#### Acknowledgments

Special thanks to Andrea Polonioli for brave and awesome story-telling and thanks to the entire Tooso team for helpful comments on previous drafts of this article; we are also grateful to the organizers of The Future of AI meet-up for letting us share there some of these ideas. All the remaining mistakes are mine and mine only.

Our probabilistic programming intro sucks especially on the “programming” part, as we could just introduce the very minimum of WebPPL needed to drive our points home. Where to go from here? Unfortunately, there is not much out there as a full-fledge, from-scratch, “computer scienc-y” introduction to PP in scripting languages (yes, I do mean Python):

• Pyro has been open-sourced recently, but the community is still very small (which means: no copy+paste from StackOverflow): porting WebPPL models to Pyro is a good (albeit sometimes frustrating) way to get started;
• this “For Hackers” small book is fun and very hands-on but does not get to the more abstract applications for A.I. (e.g. concept learning);
• on the opposite side of the spectrum, Probabilistic Models of Cognition is absolutely fantastic as a science-based introduction to PP, but it’s Javascript based (DISCLAIMER: we love basically everything Goodman & Tenenbaum do, so we may also be a bit biased).

In the end, our suggestion is to just get lost in the nice examples and plenty of research papers here, here and here (this repo is amazing, but most models are not in WebPPL: if you like natural language semantics, start with this). The field is evolving quickly though: if you think we forgot something, please let us know.

#### Solution to the 720! puzzle

Given the sequence:

`f(0) = 0f(1) = 1f(2) = 2f(3) = 720!...`

what is the generator function? This is the solution:

`0 -> 0 (0!)1 -> 1!2 -> 2!!3 -> 3!!! (as 3! = 6 and 6! = 720)4 -> 4!!!!...`

that is, for every x, x-times factorial x (see Fluid Concepts and Creative Analogies, p. 46).

Originally published at towardsdatascience.com on November 5, 2018.