AI Hype: What Does Google’s “Underspecification” Bombshell Mean For Machine Learning Credibility?

A practical AI viewpoint on real-world model implementation.

Jonathan Burley
Making AI Make Money
6 min read · Nov 25, 2020

--

Last week, Google released “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, a paper that has been sending shockwaves through the Machine Learning community.

The paper highlights a particularly thorny problem: even when machine learning models pass tests equally well, they don’t perform equally well in the real world. That models can fail to match their testing performance once deployed has long been a known bugbear, but this work is the first to publicly prove and name underspecification as a cause.

We at Foundry.ai build AI businesses around our machine learning models and continually deploy ideas into the real world, so we wanted to highlight how we have been dealing with the (newly named) underspecification problem for the past few years.

However, before we talk about handling underspecification, we need to describe how machine learning models are put together, and what the problem is.

The Underspecification Problem in Machine Learning

Approximately speaking, a machine learning product has three stages from idea to market (sketched in code after the list):

  1. Training on example data to build a model;
  2. Testing that model on data it has never seen before to confirm the model is generally applicable rather than a peculiar fit to the training information; and finally
  3. Real-world use on new data.
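
For readers who like to see those stages concretely, here is a minimal sketch assuming a standard scikit-learn workflow; the dataset and the logistic-regression model are illustrative stand-ins, not anything taken from the Google paper or our own pipelines.

```python
# Minimal illustration of the three stages above (illustrative dataset and model).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1. Training on example data to build a model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# 2. Testing on data the model has never seen before.
print("held-out test accuracy:", round(model.score(X_test, y_test), 3))

# 3. Real-world use: new data arrives without labels; performance only becomes
#    observable later, and only if the real world resembles the test set.
new_observations = X_test[:5]   # stand-in for genuinely new, unlabelled data
print("predictions on 'new' data:", model.predict(new_observations))
```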

This process has a core tenet: good performance on the testing sample means good performance on real-world data, barring systematic changes between testing and the real world (called data shift or bias). For instance, a model forecasting clothing sales after training on three months of winter data is likely to struggle come summertime, having learned a lot about coats but very little about shorts. When bias is avoided, this tenet of good testing = good real-world performance is central to ML development.

The Google researchers have taken a sledgehammer to this tenet by proving that best-in-class testing methodologies are not sufficient predictors of real-world performance. After testing, some models will go on to excel in the real world while others will disappoint, and the tests cannot distinguish the two in advance; models that have trained on years of clothing sales, for instance, may still perform erratically once deployed.

The observed behavior has a simple cause: repeating a training process can generate many different models of identical test performance. Each model differs only in the small, arbitrary learning decisions caused by, say, randomly-set initial values or the input order of the training data. These differences are typically considered inconsequential, but it turns out that even after threading the needle of equivalent testing performance, those seemingly incidental changes can cause significant, unpredictable real-world variation.
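
As a concrete illustration, here is a hedged sketch of that repetition, assuming scikit-learn; the synthetic dataset, the small neural network, and the noisy “shifted” inputs standing in for the real world are all our own toy assumptions. Typically the test scores land very close together across seeds, while the scores on the shifted inputs spread out more.

```python
# Toy demonstration: retrain the same model with only the random seed changed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Stand-in for "real-world" data: the same test points nudged slightly off-distribution.
rng = np.random.default_rng(0)
X_shifted = X_test + rng.normal(scale=0.5, size=X_test.shape)

for seed in range(5):
    # Only the seed differs between runs: initial weights and data shuffling.
    model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=seed)
    model.fit(X_train, y_train)
    print(f"seed={seed}  test acc={model.score(X_test, y_test):.3f}  "
          f"shifted acc={model.score(X_shifted, y_test):.3f}")
```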

The reason for this unpredictability is “underspecification”, and it is a difficulty common to the massive model architectures currently in fashion at tech companies (e.g. neural nets for image recognition, recommendation systems, and deep learning NLP). Underspecification occurs when the available testing data can be equally well-matched by many different configurations of the model’s internal computational circuitry. When models have many different ways to get the same result, we can’t know which approach has skill and which approach was luck. The larger the amount of luck in predicting the test data, the greater the range of variation in subsequent real-world uses.

Underspecification has disturbed the machine learning community because it demonstrates that current testing methods (for large-scale models) do not guarantee predictable, equivalent real-world performance.

To be clear, the observed unpredictability is unpleasant but rarely crippling: the train-and-test cycle was still sufficient to eliminate purely lucky models; it is just that some models that succeeded in testing had more luck than expected. The business and AI community should consider this a warning against hype, but not a refutation of large-scale models.

The Foundry Solution

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk” — Von Neumann, warning against free parameters

Foundry’s core philosophies of practical AI have been circumventing underspecification for years. We believe in the engineering adage “Keep it Simple” and intentionally limit the free parameters and behavior available to our models. The Von Neumann quote above is something of a watchword for modelers and data scientists, and it amounts to “if your equations have enough free parameters to tweak, they will fit anything”, which is of course the entire underspecification problem: too many implicit free parameters let models mix genuine inductive skill with luck.

Where the inputs to our models (and the internal circuitry that does math on those inputs) necessitate higher complexity with many free parameters, we lean heavily on causal inference to constrain model behavior to what is sensible, and we look to reduce the dimensionality of the inputs.
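
As a rough sketch of what limiting free parameters and reducing input dimensionality can look like in code (assuming scikit-learn; the synthetic data, the PCA-plus-ridge pipeline, and the hyperparameters are illustrative choices, not our production recipe):

```python
# Constrain complexity: collapse 100 raw inputs into 10 components, then fit a
# regularised linear model, leaving far fewer free parameters to "fit an elephant" with.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

constrained = make_pipeline(PCA(n_components=10), Ridge(alpha=1.0))
score = cross_val_score(constrained, X, y, cv=5).mean()
print(f"constrained model, cross-validated R^2: {score:.3f}")
```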

Limiting model complexity and behavior requires extra work not always regarded as core to the data science toolkit: understanding the real-world process your model will be embedded in, speaking with domain experts and front-line users, and translating their insights into code. It can sometimes be difficult to identify how such work helps the train-and-test cycle, but it is rarely difficult to identify how it helps the end product.

In short, every stage of our AI process is focused on real-world performance rather than testing performance. Pushing your teams to be the best possible real-world performers rather than the best possible testing performers is a significant change in mindset, but it pays off.

It is an open secret that when AI underperforms in the real world, people lose trust in it, and that trust is hard to regain. Executive teams that have spent large amounts of human and financial capital on an AI that showed every sign of success until it hit the real world are averse to making the same mistake again. This is why pilot projects should be small, precisely scoped, and focused on delivering ROI within a short time frame[1].

For those in conversation with data scientists, we advise asking the following questions (a rough code sketch of checking a few of them follows the list):

  • Why are we using this model architecture?
  • Is there a simpler architecture that should achieve the same result?
  • Can we reduce or combine variables?
  • Can we prove that the applied data reduction has skill?
  • Can the data science team concisely explain the real-world process the model is being applied to?
  • Do the data scientists writing the model have direct and frequent feedback from a domain expert?
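
A few of those questions can be checked empirically rather than debated. The sketch below, assuming scikit-learn with hypothetical stand-in models and data, compares a complex architecture against a simpler one built on a reduced feature set, using cross-validation to ask whether the simpler option and the data reduction still have skill.

```python
# Empirical check: does a simpler architecture on reduced features keep its skill?
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=50, n_informative=8, random_state=0)

candidates = {
    # "Why are we using this model architecture?"
    "complex: gradient boosting on all 50 features": GradientBoostingClassifier(random_state=0),
    # "Is there a simpler architecture?" / "Does the data reduction have skill?"
    "simple: top-10 features + logistic regression": make_pipeline(
        SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000)),
}

for name, model in candidates.items():
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {accuracy:.3f}")
```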

Overall, this new paper is part of business-as-usual in a cutting-edge field: excitement at new techniques, an explosion of use-cases and results over a few years, then a tempering of excitement as the practical caveats and warnings are discovered. The research labs are ahead of the curve on exploring techniques, and the entrepreneurial firms are ahead of the curve on skepticism and practical caveats. As an entrepreneurial AI firm, we wanted to release this blog post covering how the skepticism and practicality we have applied over the past few years** interact with the latest published findings.

Of course, keeping your machine learning models to strictly necessary complexity is only one part of making AI work in the real world. AI systems need to have the buy-in of end users, data ingestion processes that are robust to future changes, and automated evaluation that ensures continual positive progress. Stay tuned or see some of my colleagues’ existing white papers for thoughts on the packaging around the algorithms.

If you have any questions about practical AI, feel free to reach out to the Foundry team.

**The three main suggestions of the paper to “thoroughly test models on application-specific tasks”, “training models with credible inductive biases”, and “incorporate domain expertise […] with application-specific regularisation […] that approximately respect causal structure” are academic translations of the classic Foundry.ai practical AI presentation for the Global 2000.

