Understanding Model Uncertainty
Every model has uncertainty. A model learns from imperfect or incomplete information, and that uncertainty carries into our decisions about the “best” algorithm, hyperparameters, and features.
This fundamental uncertainty has three sources:
- Assumptions inherent to the algorithm. No algorithm can perfectly model a real-world problem; there is always some error because the model learns generalizations.
- Noise in the data. Every real-world data set has some degree of randomness. This leads to what we call aleatoric uncertainty, or statistical uncertainty.
- Gaps in the data. Some things are knowable but may not be represented in the training data due to incomplete coverage of the problem domain. This leads to epistemic uncertainty.
While the assumptions made by a model are algorithm-specific, the latter two sources of uncertainty relate to the data. We can look at some examples to gain a better understanding of them.
Aleatoric Uncertainty
Aleatoric uncertainty stems from noise in the data. There is always some amount of randomness in any real-world data set.
Even in a carefully controlled scientific experiment, data is collected through indirect measurement: the measurement is taken using equipment, that equipment has some level of imprecision as well as potential sources of error, and the thing being measured often cannot be completely isolated. If you’ve ever seen Breaking Bad, think of how proud Walter White is of 99.1% purity. Even he can’t achieve 100% purity; it’s impossible. Similarly, physics gives us the wave-particle duality of light: whether light behaves like a particle or a wave depends on how it is being observed. These indirect measurements introduce noise.
Outside of scientific experiments, an easy example to consider is the photo above depicting a sidewalk-lined street partially covered by a building’s shadow. The shadow creates a distinct straight edge in the road while reducing the contrast at the boundary between sidewalk and street. This adds noise to the image. If a model were attempting to segment the photo, it may detect the shadow as an edge and bound the street incorrectly, even though a human eye can easily see the actual boundary.
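To make this concrete, here is a minimal 1D sketch of the shadow effect using hypothetical pixel intensities (the brightness values and positions are made up for illustration). A naive gradient-based edge detector responds most strongly to the shadow edge, while the true sidewalk/street boundary produces a weaker response because the shadow reduced its contrast:

```python
import numpy as np

# Hypothetical brightness values along one scan line: 50 sidewalk pixels
# followed by 50 street pixels, with a shadow darkening everything from
# index 40 onward (so the shadow covers the true boundary at index 50).
sidewalk, street, shadow_factor = 200.0, 120.0, 0.5
line = np.concatenate([np.full(50, sidewalk), np.full(50, street)])
line[40:] *= shadow_factor

# A naive edge detector: the largest brightness jump between neighbors.
gradient = np.abs(np.diff(line))
print("strongest edge at index:", np.argmax(gradient))  # 39: the shadow edge
print("jump at shadow edge:   ", gradient[39])          # 100.0
print("jump at true boundary: ", gradient[49])          # 40.0, contrast reduced
```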
For any data set, there is noise from errors in measurement as well as from inherent randomness. In something as seemingly straightforward as a house sale price, although a bank will use a specific formula for calculating the value, there are human factors external to the physical properties of the house that can add noise to the final negotiated price. For an NLP problem, we may have noise from the errors people make when speaking and writing, such as using the wrong word or making a grammar mistake, but there will also be noise stemming from the fact that two people saying the same thing will likely phrase it slightly differently.
Although a sufficiently large data set should provide coverage of the range of values possible within the domain, the inherent noise will always add uncertainty to the model.
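One way to see this noise floor in action is to fit a model to data generated from a known relationship plus random noise. Even when the model recovers the true relationship almost exactly, its errors never fall below the noise level. This is a minimal sketch with made-up coefficients, not drawn from any of the examples above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical ground truth: y = 3x + 5, plus irreducible random noise.
noise_std = 2.0
X = rng.uniform(0, 10, size=(10_000, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0.0, noise_std, size=10_000)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# No amount of extra data drives the errors below the noise floor.
print(f"residual std: {residuals.std():.2f} (true noise std: {noise_std})")
```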
Epistemic Uncertainty
Beyond the noise in the data, we have uncertainty from the fact that training data is always a sample. It will never be all of the data (if it were, you wouldn’t have new data to make predictions for). This gives us epistemic uncertainty, the uncertainty derived from what we don’t know but could learn.
As you may have guessed, using more data reduces epistemic uncertainty. It can never be completely eliminated, since there will always be more data to observe, but we account for it in model evaluation by using holdout test data or cross-validation, and model retraining or reinforcement learning can further reduce it over time.
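A simple way to watch epistemic uncertainty shrink with more data is to refit the same model on many independently drawn training sets and measure how much the fitted models disagree. This sketch reuses the hypothetical y = 3x + 5 setup from above; the spread of the learned slope across resamples falls as the sample size grows, even though the aleatoric noise level stays fixed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def fitted_slope(n):
    """Fit the model on a fresh sample of size n; return the learned slope."""
    X = rng.uniform(0, 10, size=(n, 1))
    y = 3.0 * X[:, 0] + 5.0 + rng.normal(0.0, 2.0, size=n)
    return LinearRegression().fit(X, y).coef_[0]

# Disagreement across retrainings is a proxy for epistemic uncertainty.
for n in (20, 200, 2000):
    slopes = [fitted_slope(n) for _ in range(200)]
    print(f"n={n:>4}: std of learned slope across 200 resamples = {np.std(slopes):.3f}")
```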
Conclusion
Model uncertainty can be a challenging concept to grasp, especially when you’re first starting out in data science. We want there to be a correct answer, a correct model, but because of this fundamental uncertainty, the correct model doesn’t exist. The sources of uncertainty are thus important to consider both when we evaluate performance and when we decide how to use a model’s outputs.