What separates us from AI, Part 4: Priors (if you’re not cheating, you’re not trying)

Let’s quickly review what we’ve been through.

  • Part 1: Machine learning falls into a fairly linear hierarchy of a few levels (F0 through F3, for the most part), built from the bottom-up and inferred from the top-down;
  • Part 2: There are a number of theoretically sound techniques for finding ‘optimal’ models, but in practice, none are nearly as data-efficient as humans on certain real-world data sets;
  • Part 3: Humans are able to meta-model (FΩ) the ML hierarchy in a way that software hasn’t/doesn’t.

My assertion in this post is that our ability to ‘beat’ machine learning in terms of data-efficiency is entirely the result of our ability to encode priors at multiple levels of this hierarchy. In other words, we’re better at taking the test because — in some deeper sense — we’ve already seen the answers.

The role of meta-modeling? Building the higher rungs on the ladder gives us a broader canvas on which to paint our priors. Priors at the lower levels, by themselves, just aren’t enough.

So what’s a prior? In mathematical terms, it’s just a ‘starting’ probability distribution that precedes the processing of a given dataset, something of an initial guess (though you can get into trouble trying to anthropomorphize an equation, as I am doing here). Priors are preferences for some regions of a space over others. As you might have guessed, there are priors at every level of the hierarchy, from likely datasets to likely functions and operators. Priors at the higher levels are needed to constrain the search process at lower levels, helping to make that search more efficient.
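To make that concrete, here is a minimal sketch of a prior doing its job on a tiny dataset. Everything in it is my own illustration rather than anything from the earlier posts: the coin, its "true" bias of 0.7, the three observed flips, and the function name `posterior_mean` are all assumptions, and the Beta-Bernoulli model is simply the textbook case where the update is one line of arithmetic.

```python
from fractions import Fraction

def posterior_mean(prior_a, prior_b, heads, tails):
    """Posterior mean of a coin's heads-probability under a Beta(prior_a, prior_b) prior."""
    return Fraction(prior_a + heads, prior_a + prior_b + heads + tails)

# Hypothetical setup: the coin's real bias is 0.7, and we only get to see three flips.
true_bias = Fraction(7, 10)
heads, tails = 3, 0

# A flat prior, Beta(1, 1): no preference for any region of the space.
flat_estimate = posterior_mean(1, 1, heads, tails)

# An informed prior, Beta(7, 3): we arrive already leaning towards 0.7.
informed_estimate = posterior_mean(7, 3, heads, tails)

print(float(flat_estimate), float(informed_estimate))
```

With the same three flips, the flat prior lands on 4/5 and the informed prior on 10/13, which is closer to the assumed true bias. Nothing about the learner changed except where it started.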

Indeed, Lin and Tegmark argue convincingly that deep learning has been so widely successful because it happens to approximate the “priors of the universe”: the particular Hamiltonians that generate the physical data we see and experience all around us.

Interestingly, the kind of data that’s especially troublesome for machine learning is human-generated data — human speech, human language, human behavioral prediction. In other words, data that was constrained by our priors in the first place! It’s pretty intuitive that we would have the biggest advantage over machines here.

There are at least two ways in which humans receive priors: evolution and culture. We either have priors because the ones hard-coded into our genetic lineage thrived & survived, or because we were directly gifted them by others during our lives (in education, media, interpretive dance, whatever). Of course, we then go on to develop lots of other models through experience, which are posteriors at the time but become priors for future data.
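The "posteriors become priors" step is easy to sketch with a conjugate model. The numbers and names below (`inherited`, the two batches of observations) are hypothetical, chosen only to show that reusing yesterday's posterior as today's prior gives the same answer as seeing all the data at once.

```python
def update(prior, heads, tails):
    """Return the Beta posterior (a, b) after observing heads/tails under a Beta(a, b) prior."""
    a, b = prior
    return (a + heads, b + tails)

# Hypothetical starting point, standing in for whatever we were handed.
inherited = (2, 2)

# First batch of experience: the result is a posterior at the time...
after_first_batch = update(inherited, heads=5, tails=1)

# ...and it is then used as the prior for the next batch.
after_second_batch = update(after_first_batch, heads=2, tails=4)

# Same data processed in one go, for comparison.
all_at_once = update(inherited, heads=7, tails=5)

print(after_second_batch, all_at_once)
```

Both routes end at the same distribution, so nothing is lost by carrying a posterior forward instead of the raw data.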

The idea of evolutionary priors is not at all new. Noam Chomsky famously proposed the Language Acquisition Device (LAD) to explain the poverty of the stimulus problem: why human children are so efficient at linguistic ramp-up, given that languages are underdetermined by the available data (in other words, efficient search!). This LAD is itself a proposed prior. Interpreted one way, it’s at least a prior on F0, over the distribution of valid sentences, though it might also be a prior on F1 or F2, depending on your perspective.

Indeed, “Nature vs. Nurture” might be considered the battle of evolutionary priors vs. cultural priors.

This is pretty much the only way we can beat machines. We can’t match them on the volume of data perceived. We can’t beat them on the speed of data processed. We can’t generate, search, or evaluate models as quickly as they can. We certainly can’t benefit from advances in materials science and electrical engineering to upgrade our thinking velocity year after year. The one advantage we have, and that seems to explain the performance gap where it exists, is our set of priors.

It’s also the one thing we have been extremely bad at encoding into machine learning software.

This is the missing explanation for why the ‘universal’ approaches (AIXI-mc, AIXI-tl, etc.) don’t work as well as advertised: they’re made to be optimal across all possible data sets (weighted by an idealized information-theoretic prior), not tuned and optimized for the special subset of actual data sets we see in reality. FΩ is itself a human prior, guiding us towards higher-level operators that are more promising than they should be, given the unthinkable dimensionality of the search space.
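Here is a toy sketch of that gap, under loud assumptions. The hypothesis space is every 8-bit string, the "data" is the true string revealed one bit at a time, and the flat prior is only a crude stand-in for a prior that spreads itself over all possible data sets (it is not the complexity-weighted prior those universal approaches actually use). The `template`, the 1/16 penalty per mismatched bit, and the function names are all made up for illustration.

```python
from fractions import Fraction
from itertools import product

def distance(a, b):
    """Number of positions where two bit-tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def observations_needed(prior, truth, threshold=Fraction(1, 2)):
    """Count the bits of `truth` that must be revealed before its posterior reaches `threshold`."""
    alive = dict(prior)
    for seen in range(len(truth)):
        if alive[truth] / sum(alive.values()) >= threshold:
            return seen
        # Reveal the next bit and drop every candidate that disagrees with it.
        alive = {h: w for h, w in alive.items() if h[seen] == truth[seen]}
    return len(truth)  # every bit revealed, so only the truth is left

hypotheses = list(product((0, 1), repeat=8))
truth = (1, 0, 1, 1, 0, 0, 1, 0)

# Flat prior: every one of the 256 candidates is equally plausible.
flat = {h: Fraction(1) for h in hypotheses}

# Informed prior: favors candidates near a template that is close to,
# but not identical to, the truth (it is wrong about the first bit).
template = (0, 0, 1, 1, 0, 0, 1, 0)
informed = {h: Fraction(1, 16) ** distance(h, template) for h in hypotheses}

flat_cost = observations_needed(flat, truth)
informed_cost = observations_needed(informed, truth)
print(flat_cost, informed_cost)
```

In this setup the flat prior needs seven of the eight bits before it puts at least half its belief on the right answer, while the informed prior needs one, even though its template started out slightly wrong. Point the template at the wrong region entirely and the advantage disappears, which is the sense in which the prior has to match the data sets we actually see.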

We’re not consciously aware of our priors — sometimes we act on the basis of them without knowing it, and sometimes we tell stories and rationalize over them, masking them forever.

In that sense, I’m sympathetic to Gary Marcus when he states that we need “top-down models” to complement the bottom-up approach that has seemingly hit a wall. But as I’ve tried to argue in this series of posts, the problem isn’t that deep learning systems don’t have higher-level models. The problem is that they don’t know which models to use.

That’s up to us.


Many thanks to Glen Ropella (@gepr) of Tempus Dictum for feedback, comments, and general insight.