Machine Learning Models Are Missing Contracts
Why pretrained machine learning models are often unusable and irreproducible — and what we can do about it
A useful approach to designing software is through contracts. For every function in your codebase, you start by writing its contract: clearly specifying what inputs are expected and valid for that function (the precondition), and what the function will do (the postcondition) when provided an appropriate input. This is often explicitly stated in the docstring of a function. Consider this example from the math module in Python (implemented in C):
The contract in the docstring has two parts:
- Precondition: input should be an integer between 2⁶² and 2⁶⁴
- Postcondition: output is an integer within 1 of the square root of the input
The contract is powerful because when the code is published, other developers do not need to test the function themselves, nor consider its internal implementation. They can read off the range of valid inputs for the function and start using it immediately. Conversely, they operate knowing that if the precondition is not satisfied, then neither is the postcondition guaranteed.
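To make the idea concrete, here is a minimal sketch in Python of a function that enforces the same two-part contract at runtime. It is not the C implementation from the math module: the name `approximate_isqrt` is chosen for illustration, and it simply delegates to `math.isqrt` so that the precondition and postcondition checks stay in focus.

```python
import math


def approximate_isqrt(n: int) -> int:
    """Approximate integer square root of a large 64-bit integer.

    Precondition:  2**62 <= n < 2**64
    Postcondition: the result a satisfies (a - 1)**2 < n < (a + 1)**2
    """
    # Check the precondition explicitly so that misuse fails loudly.
    if not (2**62 <= n < 2**64):
        raise ValueError("precondition violated: need 2**62 <= n < 2**64")

    # Illustrative shortcut: the exact floor square root trivially
    # satisfies the "within 1" postcondition.
    a = math.isqrt(n)

    # Check the postcondition before handing the result back.
    assert (a - 1) ** 2 < n < (a + 1) ** 2, "postcondition violated"
    return a
```

A caller who passes an input outside the stated range gets an immediate error instead of a silently wrong answer, which is exactly the behavior the pretrained models discussed below lack.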
Nowadays, pretrained machine learning models are increasingly being deployed as functions and APIs. They are part of companies’ internal codebases, released externally for use through APIs, and, in research, pretrained models are published as part of the review and reproducibility processes.
A pretrained model is essentially a function: it takes in a particular input sample and makes a prediction as the output. And as users of these models, we need to know what data is valid to feed into the model, and which data can lead to unreliable predictions. It is usually infeasible to test the model ourselves with all possible data, and impossible to inspect the internal implementation, so if we are provided a contract with a clear specification of the valid input data, we can start using the model immediately and confidently. However, specifying “valid data” for machine learning models is much harder than it may sound. Let’s consider a few challenges, illustrated with examples of publicly released models, all taken from this year’s NeurIPS conference:
1. When input data needs to be preprocessed in very specific ways
Paper: SEVIR: A Storm Event Imagery Dataset for Deep Learning
GitHub link: https://github.com/MIT-AI-Accelerator/neurips-2020-sevir
In this paper from MIT, the authors collect a rich dataset of satellite images of storm events recorded over time. Helpfully, the authors also release to the community several pretrained models on this dataset.
I tried to use one of the pretrained nowcasting models (used to predict storm movement patterns over the next hour) by feeding satellite images from the authors’ own dataset into the model, and started getting predictions that didn’t make any sense. Normalizing the images in different ways didn’t seem to help. I tracked down the example Jupyter notebook that the authors had provided and saw these lines of code:
It turns out that some of the channels in the input images were supposed to be rescaled in a very specific way for the model to produce reasonable outputs. This information was not packaged in any way with the pretrained model, even though it is crucial for the model to work properly. In this case (and as is commonly the case), the model did not throw any errors even though the data was not scaled properly. It chugged along, making nonsensical predictions, and I was left to realize on my own that something was wrong.
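One way such scaling information could travel with a model is to wrap the model in an object that owns both the preprocessing and a range check on the raw input. The sketch below is only an illustration of that idea, not the SEVIR authors’ code: `ContractedModel`, its fields, and the constants in the usage note are all hypothetical, and a plain Python list stands in for an image.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ContractedModel:
    """Bundle a model with the preprocessing and input range it assumes."""

    predict_fn: Callable[[List[float]], float]  # stand-in for a pretrained model
    scale: float      # hypothetical rescaling constants the model was trained with
    offset: float
    raw_min: float    # documented valid range of the *raw* input values
    raw_max: float

    def predict(self, raw: Sequence[float]) -> float:
        # Precondition: every raw value must lie in the documented range.
        bad = [v for v in raw if not (self.raw_min <= v <= self.raw_max)]
        if bad:
            raise ValueError(
                f"{len(bad)} value(s) outside [{self.raw_min}, {self.raw_max}]"
            )
        # The rescaling lives with the model instead of in a separate notebook.
        scaled = [(v - self.offset) / self.scale for v in raw]
        return self.predict_fn(scaled)
```

For example, with `ContractedModel(predict_fn=lambda xs: sum(xs) / len(xs), scale=50.0, offset=100.0, raw_min=0.0, raw_max=255.0)`, calling `predict([100.0, 150.0, 200.0])` rescales the input before prediction, while `predict([300.0])` raises a `ValueError` rather than quietly returning a meaningless number. A range check like this cannot catch every mistake (already-normalized values may still fall inside the raw range), but it turns some silent failures into loud ones.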
Information related to preprocessing of the input data should be included in the documentation of pretrained models; otherwise, we are left to figure out for ourselves how to prepare data for inference. But it is at least conceivable that these preprocessing details could be included in technical documentation. Sometimes, it’s not just preprocessing…
2. When input data needs to come from a specific dataset
Paper: Object-Centric Learning with Slot Attention
GitHub link: https://github.com/google-research/google-research/tree/master/slot_attention
Demo link: https://gradio.app/g/slot_attention
In this paper from Google, the authors release a model that decomposes images of objects into a set of images, each of which contains one individual object. They train models on a specific dataset, CLEVR, which contains synthetic images of simple geometric objects, and release the models. Here we run their pretrained model on an example image from the CLEVR dataset:
So far so good. But how general is this model? Can we use it to segment other kinds of objects? Here’s the result when we use an image of furniture: the model does not throw any errors, but it fails horribly:
Okay, so perhaps the model only works with geometric objects. Let’s make a slight adjustment to the original image we used, and increase the brightness of its background. As a user, we would reasonably expect the model to be fairly robust to such simple transformations — certainly our human ability to decompose objects is. However, the model is anything but robust:
I should be clear here — I don’t think that the model is at fault, nor the authors. The model works very well on the task that it was trained and tested on: decomposing objects in the original CLEVR dataset. However, as a user of the pretrained model, I do not know a priori whether the model will handle other kinds of images. There is no contract that tells me the model should only be used with the original images from the CLEVR dataset.
You might be thinking that these models are designed for very specific datasets; they are not meant to be generally used with real-world images. However, the situation can be even worse for models designed to work with natural images…
3. When input data needs to satisfy other hard-to-define constraints
As we shall see, this third example is particularly problematic, in that it is extremely difficult to specify in advance what kind of images the model can accept and where it fails. The authors publish a state-of-the-art super-resolution model that takes in a low-resolution or standard-resolution version of an image, and outputs a higher-resolution version. Here’s an example of the released pretrained model working beautifully:
However, let’s try the same model with a different image:
How strange! I was not expecting this bluish image as the output of a state-of-the-art super-resolution model. This failure mode would certainly not occur with a classical image processing algorithm for sharpening images. But with this machine learning model, we have no insight into the internal implementation, and because there is no contract, we do not know which images we can trust this model with!
After spending hours wrangling with the model, I still couldn’t figure out which images it resolves correctly and which ones produce these strange blue outputs. Because the model is missing a contract, a lot of engineering effort was wasted: I spent considerable time getting the model loaded and running locally, only to realize that it won’t work for some of my data.
Conclusion — is there a solution?
We machine learning developers often assume that if we are able to get good performance on our test sets, our models are ready to deploy and release. We do not concern ourselves enough with how downstream users are using models with data that might look quite different from our training and test sets. However, this is happening more and more: as machine learning models are being released as APIs for general use, or are being deployed internally but data streams are changing over time, we can no longer assume that a model’s test performance is indicative of its performance in the real world.
We need to provide contracts that make it clear to users what input data are valid for our models. Otherwise, machine learning models will work properly until they don’t. Systems built on top of machine learning models will fail.
At the same time, we’ve seen that specifying valid input data through explicit instructions is difficult. What can be done? I have two suggestions:
- We develop better methods for determining whether a sample belongs to our model’s “valid data distribution.” Perhaps contracts can themselves be “contract functions” that judge an input sample valid only if it is “similar” enough to our training distribution. Such methods for detecting in-distribution and out-of-distribution data are being developed, though there are caveats to using them in practice.
- We make it easier to test models so that failures are identified easily and earlier in the model development process. This helps model creators know and communicate the preconditions of their models, and helps end users quickly find out whether a model is suitable for their use. This is the motivation behind our open-source gradio library. In fact, the failure points in the models above were all easily identified using the library.
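To give a feel for what a “contract function” from the first suggestion might look like, here is a deliberately crude sketch. It assumes each input has already been summarized as a short feature vector (for an image, perhaps mean brightness and contrast), and it accepts a sample only if every feature lies within a few standard deviations of the training data. The name `make_contract` and the threshold `max_z` are made up for illustration, and a per-feature z-score is a simple stand-in for the far more capable out-of-distribution detection methods being researched.

```python
import statistics
from typing import Callable, Sequence


def make_contract(
    training_features: Sequence[Sequence[float]], max_z: float = 3.0
) -> Callable[[Sequence[float]], bool]:
    """Build a contract function from per-feature training statistics."""
    # Transpose rows of samples into one column per feature.
    columns = list(zip(*training_features))
    means = [statistics.fmean(col) for col in columns]
    # Guard against zero spread so the z-score below is always defined.
    stds = [statistics.pstdev(col) or 1e-12 for col in columns]

    def is_valid(sample: Sequence[float]) -> bool:
        # A sample with the wrong number of features is never valid.
        if len(sample) != len(means):
            return False
        # Valid only if every feature is within max_z standard
        # deviations of its training mean.
        return all(
            abs(x - m) / s <= max_z for x, m, s in zip(sample, means, stds)
        )

    return is_valid
```

With training features `[[0.40, 0.10], [0.45, 0.12], [0.42, 0.11], [0.38, 0.09]]`, the resulting contract accepts `[0.41, 0.10]` and rejects `[0.95, 0.10]`, roughly the situation of the brightened CLEVR background above. A model could call such a function before predicting and refuse, or at least warn, when it returns `False`.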
And of course, we can pursue both solutions at the same time — we make it easier to identify model failure points, but also invest in developing contract functions for machine learning models going forward.