Predicting used car price distributions with Monte Carlo Dropout

What if, instead of a pointwise prediction, you got a whole distribution of prices?

Willian Werner
Neuronio
6 min read · Sep 2, 2019


Photo by Carlo D'Agnolo on Unsplash

Introduction

Uncertainty estimation is a rarely explored field of machine learning. There are lots of methods for model interpretability [check this post], but few for error estimation.
Of course, most ML applications require just a pointwise prediction. But I'm here to show an example where variation matters.
Suppose, for example, that you trade used cars and would like a more data-driven price estimate. You could use a regression model to predict car prices from their features. But a single-value prediction may be misleading: the car may sell within a whole range of prices, and fixing a single price may lead to bad deals. In these cases it's useful to predict deviations too.

Risk vs. Uncertainty

There are multiple definitions for the distinction between risk and uncertainty. The most famous comes from Frank Knight, who states:

Uncertainty must be taken in a sense radically distinct from the familiar notion of Risk, from which it has never been properly separated. The term "risk" as loosely used in everyday speech and in economic discussion, really covers two things which, functionally at least, in their causal relations to the phenomena of economic organization, are categorically different. […] The essential fact is that "risk" means in some cases a quantity susceptible of measurement, while at other times it is something distinctly not of this character; and there are far-reaching and crucial differences in the bearings of the phenomenon depending on which of the two is really present and operating. […] It will appear that a measurable uncertainty, or "risk" proper, as we shall use the term, is so far different from an unmeasurable one that it is not in effect an uncertainty at all. We […] accordingly restrict the term "uncertainty" to cases of the non-quantitive type.

Here we’ll use a variant from Ian Osband:

[…] We identify risk as inherent stochasticity in a model and uncertainty as the confusion over which model parameters apply. For example, a coin may have a fixed p = 0.5 of heads and so the outcome of any single flip holds some risk; a learning agent may also be uncertain of p.

According to that definition, we may call risk aleatoric uncertainty, while uncertainty proper may be called epistemic uncertainty. The main point of this distinction is that epistemic uncertainty can be reduced by gathering more data, whereas aleatoric uncertainty cannot. Since our data here is either simulated or gathered from large databases, we assume all uncertainty to be aleatoric.

Random variable y as a function of ε and x

Sample Data

We'll use a simulated example of a heteroskedastic random variable Y; here H stands for the Heaviside step function. The figure below shows a sample from Y:

Sample of size 1000 from Y
Distribution graph for x, ε and y.
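
The original data-generation embed isn't shown here. As a minimal sketch, assuming a form where the noise scale jumps at x = 0 through the Heaviside function H (the exact coefficients are my assumptions, not the post's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Inputs and base noise
x = rng.uniform(-3.0, 3.0, size=n)
eps = rng.normal(0.0, 1.0, size=n)

# np.heaviside(z, 1.0) is 0 for z < 0 and 1 for z >= 0
# Heteroskedastic target: the noise scale changes where H(x) switches
y = x + (0.3 + 1.5 * np.heaviside(x, 1.0)) * eps
```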

Simple Model

We'll use the following Keras model as the base for the uncertainty approaches.

Example model
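
The original Keras embed is not shown; a small fully connected regressor along these lines would do (layer sizes and training settings are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(1,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # single output: the predicted y for a given x
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=100, batch_size=32, verbose=0)
```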

Just for illustration, the following plot shows the results of the model's regression. It's important to note that, at the extreme regions of X, the model doesn't know much about the function's behaviour.

Regression for data

Great job! But these results show nothing about uncertainty. Let's explore some strategies for error estimation.

Strategies for output distribution

In all the following cases, there are some parameters θ (model parameters, such as the neural network weights) and some data D = {x, y}. To obtain an output distribution, we need to know P(θ|D), i.e., the probability of having such weights given the data. By Bayes' theorem, we have

P(θ|D) = P(D|θ)P(θ) / P(D) = P(D|θ)P(θ) / ∫ P(D|θ)P(θ) dθ

Generally, the integral in the denominator is difficult or expensive to compute, so the following methods approximate the posterior distribution instead.

#1. Markov chain Monte Carlo — Metropolis-Hastings

Markov chain Monte Carlo (MCMC) is a family of methods for drawing samples from an unknown distribution. They combine Monte Carlo sampling with a Markov chain constructed so that, in the long run, the chain's states are distributed according to the target distribution.

Among MCMC methods, the most famous is Metropolis-Hastings, represented by the following algorithm (a minimal sketch in code follows the list):

  1. Choose a target distribution f and a proposal distribution g. Generally f is the product P(D|θ)P(θ), which we can evaluate, while g is a simple distribution (such as a Gaussian). As we iterate, the accumulated samples become distributed more and more like P(θ|D).
  2. For each iteration t:
    2.0. let x_t be the previous accepted sample
    2.1. draw a candidate x′ from g(x′|x_t)
    2.2. calculate the acceptance ratio α = f(x′)/f(x_t) (valid when g is symmetric). It measures how plausible x′ is relative to the current sample.
    2.3. with probability min(1, α), add x′ to the gathered samples; otherwise re-add the previous sample x_t.
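
Here is a minimal NumPy sketch of the loop above (assuming a symmetric Gaussian proposal, so the g terms cancel in the acceptance ratio):

```python
import numpy as np

def metropolis_hastings(f, x_init, n_samples=5000, step_size=0.5, seed=0):
    """Draw samples from an unnormalized density f with a Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x_t = np.asarray(x_init, dtype=float)
    samples = []
    for _ in range(n_samples):
        # 2.1: propose a candidate around the current sample
        x_new = x_t + step_size * rng.normal(size=x_t.shape)
        # 2.2: acceptance ratio; the symmetric g(x'|x_t) terms cancel
        alpha = f(x_new) / f(x_t)
        # 2.3: accept with probability min(1, alpha), else keep x_t
        if rng.random() < alpha:
            x_t = x_new
        samples.append(x_t)
    return np.array(samples)
```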

This algorithm provably approximates P(θ|D) asymptotically. However, it is still computationally expensive, sometimes to the point of being infeasible. For that reason, I wasn't able to run this algorithm on a DNN (it has too many parameters), so I changed the model to a degree-3 polynomial. I also used PyMC3, a framework for applying Monte Carlo methods. Note: since we use Metropolis-Hastings, the sample function doesn't just draw samples, it also updates the parameter values (analogous to a fit method).
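
In PyMC3, the polynomial model and the Metropolis-Hastings run might look like this (the priors and sampler settings below are my assumptions, not the post's exact code):

```python
import pymc3 as pm

with pm.Model() as poly_model:
    # Priors over the four coefficients of a degree-3 polynomial
    coefs = pm.Normal("coefs", mu=0.0, sigma=10.0, shape=4)
    noise = pm.HalfNormal("noise", sigma=5.0)

    mu = coefs[0] + coefs[1] * x + coefs[2] * x ** 2 + coefs[3] * x ** 3
    pm.Normal("y_obs", mu=mu, sigma=noise, observed=y)

    # sample() with a Metropolis step both explores and stores the posterior,
    # which is why it behaves like a fit method
    trace = pm.sample(5000, step=pm.Metropolis(), tune=1000, cores=1)
```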

Estimated values for each parameter of polynomial regression
Values for regression

#2. Variational Inference

Here we'll approximate P(θ|D) with a simpler distribution Q(θ), chosen to minimize the Kullback-Leibler (KL) divergence between the two:

KL(Q ‖ P(θ|D)) = ∫ Q(θ) log [ Q(θ) / P(θ|D) ] dθ

Since P(θ|D) itself is intractable, in practice this is done by maximizing an equivalent objective, the evidence lower bound (ELBO).
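
The post applies VI to the DNN; as a simpler illustration of the API, here is ADVI (PyMC3's automatic variational inference) applied to the polynomial model from method #1:

```python
import pymc3 as pm

with poly_model:
    # Fit a factorized Gaussian Q(θ) by maximizing the ELBO,
    # which is equivalent to minimizing KL(Q || P(θ|D))
    approx = pm.fit(n=30000, method="advi")
    trace_vi = approx.sample(5000)  # draw samples from Q(θ)
```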

Results for VI regression

Pretty bad, huh? That's because (1) the DNN isn't well optimized and would take many more training steps to become accurate, and (2) variational inference is susceptible to getting stuck in local optima.

#3. Monte Carlo Dropout

Methods #1 and #2, although they don't scale well to deep learning, are well established for small models and datasets (classical statistics). In 2016, this paper showed that keeping dropout active in every layer at prediction time is equivalent to a Monte Carlo sampling process over an approximate posterior, giving a principled way to estimate uncertainty.
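
In Keras this amounts to calling each Dropout layer with training=True, so dropout stays active at prediction time. A sketch of the simple model rebuilt this way (the rates and widths are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(1,))
h = layers.Dense(64, activation="relu")(inputs)
h = layers.Dropout(0.5)(h, training=True)   # stays on at inference
h = layers.Dense(64, activation="relu")(h)
h = layers.Dropout(0.5)(h, training=True)
outputs = layers.Dense(1)(h)

mc_model = tf.keras.Model(inputs, outputs)
mc_model.compile(optimizer="adam", loss="mse")
mc_model.fit(x, y, epochs=100, batch_size=32, verbose=0)
```

Each forward pass now samples a different sub-network, so repeated predictions for the same x form a distribution.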

Sample of 10 predictions with dropout
Prediction w/error based on sampling from dropout model

Application — Predicting Car Price Distributions

I've used a Kaggle dataset that contains a listing of used cars with their prices and features such as power, how many times each car was sold, mileage, etc. Here is some preprocessing:
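
The preprocessing embed is missing; a rough sketch of what it might contain (the file name and column names below are illustrative guesses at the dataset's schema, not confirmed):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names; adjust to the actual Kaggle dataset
df = pd.read_csv("autos.csv")
df = df.dropna(subset=["price", "powerPS", "kilometer", "yearOfRegistration"])

# Drop implausible prices that would distort the regression
df = df[(df["price"] > 100) & (df["price"] < 100_000)]

features = df[["powerPS", "kilometer", "yearOfRegistration"]].to_numpy(dtype=float)
prices = df["price"].to_numpy(dtype=float)

# Standardize features so the network trains stably
X = StandardScaler().fit_transform(features)
```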

And then we train using the following model:

Training model for used car price
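
The training embed is also missing; reusing the MC-Dropout pattern from method #3, sized for the car features (widths, dropout rate, and epochs are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(X.shape[1],))
h = layers.Dense(128, activation="relu")(inputs)
h = layers.Dropout(0.3)(h, training=True)   # kept active at inference
h = layers.Dense(128, activation="relu")(h)
h = layers.Dropout(0.3)(h, training=True)
outputs = layers.Dense(1)(h)

price_model = tf.keras.Model(inputs, outputs)
price_model.compile(optimizer="adam", loss="mse")
price_model.fit(X, prices, epochs=50, batch_size=64, validation_split=0.1)
```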

Results

Here are some predicted price distributions. For each car we draw a sample of price predictions from the model, then compare the resulting distribution with the real price.
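
Concretely, building a per-car price distribution is just repeated stochastic forward passes (X_test here is an assumed held-out slice of X):

```python
import numpy as np

n_passes = 200  # number of stochastic forward passes per car
samples = np.stack([
    price_model(X_test).numpy().ravel()  # dropout stays active, so each pass differs
    for _ in range(n_passes)
])

mean_price = samples.mean(axis=0)  # point estimate per car
std_price = samples.std(axis=0)    # spread of the predicted price distribution
```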

Conclusion

There are several other approaches (not all of them good) for estimating uncertainty that weren't discussed here, some of them useful for deep learning too.

You can check the complete code in the link below!
