TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

All deep learning is statistical model building

Tom Charnock
Published in TDS Archive · 16 min read · Aug 3, 2020

Image by Author

Deep learning is often used to make predictions in data-driven analysis. But what do these predictions actually mean?

This post explains how neural networks used in deep learning provide the parameters of a statistical model describing the probability of the occurrence of events.

The occurrence of events and aleatoric uncertainty

Data, observables, events, or whatever else we call the things we can see and/or collect, are absolute: we roll two sixes on a pair of six-sided dice or we get some other combination of outcomes; we toss a coin 10 times and we get heads each time or we get some other mixture of heads and tails; our universe evolves some way and we observe it, or it doesn’t — and we don’t. We do not know, a priori, whether we will get two sixes with our dice roll, or heads on every coin toss, or which possible universes could exist for us to come into being and observe them. We describe the uncertainty due to this lack of knowledge as aleatoric. It is due to fundamental missing information about the generation of such data — we can never know exactly what outcome we will obtain. We can think of aleatoric uncertainty as not being able to know the random seed of some random number generating process.

We describe the probability of the occurrence of events using a function, P : d ∈ E ↦ P(d) ∈ [0, 1], i.e. the probability distribution function, P, assigns a value between 0 and 1 to any event, d, in the space of all possible events, E. If an event is impossible then P(d) = 0, whilst a certain outcome has a probability P(d) = 1. This probability is additive such that the union of all possible events d ∈ E is certain, i.e. P(E) = 1.

Using a slight abuse of notation we can write d ~ P, which means that some event, d, is drawn from the space of all possible events, E, with a probability P(d). This means that there is a 100×P(d)% chance that event d is observed. d could be any observation, event or outcome of a process; for example, when rolling n = 2 six-sided dice and obtaining a six with both, d = (d¹ = 0, d² = 0, d³ = 0, d⁴ = 0, d⁵ = 0, d⁶ = 2). We do not know, beforehand, exactly what result we will obtain by rolling these two dice, but we know there is a certain probability that any particular outcome will be obtained. Under many repetitions of the dice roll experiment (with perfectly balanced dice and identical conditions) we should see that the probability of d occurring is P(d) ≈ ¹/₃₆. Even without performing many repetitions of the dice roll we could provide our believed estimate of the distribution of how likely we are to see particular outcomes.
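
As a quick sanity check, here is a minimal simulation of the dice roll experiment (a sketch assuming only numpy is available): the empirical frequency of rolling two sixes should approach ¹/₃₆ ≈ 0.028 as the number of repetitions grows.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixing the seed removes the "unknowable" randomness
n_repeats = 1_000_000            # number of times we repeat the two-dice experiment

# roll two fair six-sided dice n_repeats times
rolls = rng.integers(1, 7, size=(n_repeats, 2))

# the event d = (0, 0, 0, 0, 0, 2) corresponds to both dice showing a six
two_sixes = np.all(rolls == 6, axis=1)

print("empirical P(two sixes):", two_sixes.mean())   # ≈ 1/36 ≈ 0.028
```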

Statistical models

To make statistical predictions we model the distribution of data using parameterisable distributions, Pₐ. We can think of a as defining a statistical model which contains a description of the distribution of data and any possible unobservable parameters, v ∈ Eᵥ, of the model. The distribution function then attributes values of probability to the occurrence of observable/unobservable events, Pₐ : (d, v) ∈ (E, Eᵥ) ↦ Pₐ(d, v) ∈ [0, 1]. It is useful to note that we can write this joint probability distribution as a conditional statement, Pₐ = Lₐ · pₐ = ρₐ · eₐ. These probability distribution functions are:

  • The likelihood — Lₐ : (d, v) ∈ (E, Eᵥ) ↦ Lₐ(d|v) ∈ [0, 1]
  • The prior — pₐ : v ∈ Eᵥ ↦ pₐ(v) ∈ [0, 1]
  • The posterior — ρₐ : (d, v) ∈ (E, Eᵥ) ↦ ρₐ(v|d) ∈ [0, 1]
  • The evidence — eₐ : d ∈ E ↦ eₐ(d) ∈ [0, 1]

The introduction of these functions allows us to interpret the probability of observing d and v as being equal to the probability of observing d given the value, v, of the model parameters, multiplied by how likely these model parameter values are — likewise, it is equal to the probability of the value, v, of the model parameters given that d is observed, multiplied by how likely d is to be observed in the model.

For the dice roll experiment we could (and do) model the distribution of data using a multinomial distribution, Pₐ = n! ∏ᵢ pᵢᵈⁱ/dⁱ!, where the fixed parameters of the multinomial model are v = {p₁, p₂, p₃, p₄, p₅, p₆, n} = {pᵢ, n | i ∈ [1, 6]} with pᵢ as the probabilities of obtaining value i ∈ [1, 6] from a die and n as the number of rolls. If we are considering completely unbiased dice then p₁ = p₂ = p₃ = p₄ = p₅ = p₆ = ¹/₆. The probability of observing two sixes, d = (d¹ = 0, d² = 0, d³ = 0, d⁴ = 0, d⁵ = 0, d⁶ = 2), in our multinomial model with n = 2 dice rolls can therefore be estimated as Pₐ(d) = ¹/₃₆. Since the model parameters, v, are fixed this is equivalent to setting the prior to pₐ = δ(pᵢ − ¹/₆, n − 2) for i ∈ [1, 6] such that Lₐ = 2! ∏ᵢ (¹/₆)ᵈⁱ/dⁱ! or 0.
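
A minimal sketch of this model, assuming scipy is available: evaluating the multinomial pmf at the "two sixes" event with fair dice and n = 2 recovers the ¹/₃₆ quoted above.

```python
from scipy.stats import multinomial

n = 2                    # number of dice rolls
p = [1 / 6] * 6          # fair-dice probabilities p_1, ..., p_6
d = [0, 0, 0, 0, 0, 2]   # the event "two sixes": counts of each face

# probability assigned to d by the multinomial model P_a
print(multinomial.pmf(d, n=n, p=p))   # 0.02777... = 1/36
```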

Of course we could build a more complex model where the values of the pᵢ depended on other factors, such as the number of surfaces which the dice could bounce off, the strength with which they were thrown, the speed of each molecule of air hitting the dice at the exact moment they left our hands, or an infinite number of other effects. In this case the distribution Pₐ would assign probabilities to the occurrence of data, d ~ P, dependent on interactions between unobservable parameters, v, describing such physical effects, i.e. in the multinomial model, the values, v, of the model parameters would change the values of the pᵢ describing how likely d is. However, we might not know exactly what values these unobservable parameters have. Therefore Pₐ describes not only an estimation of the true distribution of data, but also its dependence on the unobservable model parameters. We call the conditional distribution function describing the probability of observing data given unobservable parameters the likelihood, Lₐ. Since the model a describes the entire statistical model, the prior probability distribution, pₐ, of the model parameters, v, is an intrinsic property of the model.

The fact that we do not know the values, v, of the parameters in a model a (and that there is even a lack of knowledge about the choice of model itself) introduces a source of uncertainty which we call epistemic — the uncertainty due to things that we could, in principle, learn about through the support provided by observed events. So whilst there is an aleatoric uncertainty due to the truly random nature of the occurrence of events d from the distribution of data, P, there is also an epistemic uncertainty which comes from modelling this distribution with Pₐ. The prior distribution, pₐ, should not be confused with the epistemic uncertainty, though, since the prior is a choice made in defining a particular model a. An ill-informed choice of prior distribution (arising from the definition of the statistical model) could prevent the model from being supported by the data.

For example, for the dice rolling problem, when we build a model, we could decide that our model was certain and that the prior distribution of the possible values of the model parameters is pₐ = δ(pᵢ − ¹/₆, n − 2) for i ∈ [1, 6]. In this case the epistemic uncertainty would not be taken into account because there is assumed to be nothing that we can learn about. However, if the dice were weighted such that p₁ = 1 and p₂ = p₃ = p₄ = p₅ = p₆ = 0, then we would never get two sixes and there would be no support for our model from the data. Instead, if we choose a different model a′ which is a multinomial distribution but where the prior distribution on the possible values of the pᵢ is such that they can take any value from 0 to 1 under the condition that ∑ᵢ pᵢ = 1, then there is no assumed knowledge (within this particular model). There is, therefore, a very large epistemic uncertainty due to our lack of knowledge, but this uncertainty can be reduced via inference of the possible parameter values when observing the available data.

Subjective inference

We can learn about what values of the model parameters are supported by the observed data using subjective inference (often called Bayesian inference) and hence reduce the epistemic uncertainty within our choice of model. Using the two equalities of the conditional expansion of the joint distribution, Pₐ, we can calculate the posterior probability that, in a model a, the parameters have a value v when some d ~ P has been observed as
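
ρₐ(v|d) = Lₐ(d|v) pₐ(v) / eₐ(d)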

This posterior distribution could now be used as the basis of a new model, a′, with joint probability distribution Pₐ′, where pₐ′ = ρₐ, i.e. Pₐ′ = Lₐ · pₐ′. Note that the form of the model hasn’t changed, just our certainty in the model parameters due to the support by data — we can use this new model to make more informed predictions about the distribution of data.

We can use Markov chain Monte Carlo (MCMC) techniques, amongst a host of other methods, to characterise the posterior distribution, allowing us to reduce the epistemic uncertainty in this assumed model. However, rightly or wrongly, people are often interested in just the best-fit distribution to the data, i.e. finding the set of v for which Pₐ is most similar to P.

Maximum likelihood and maximum a posteriori estimation

To fit the model to the true distribution of data we need a measure of distance between the two distributions. The measure most commonly used is the relative entropy (also known as the Kullback-Leibler (KL) divergence)
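
D(P ‖ Pₐ) = ∫ dd P(d) log [P(d)/Pₐ(d, v)]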

The relative entropy describes the information lost due to approximating P(d) with Pₐ(d, v). There are some interesting properties of the relative entropy which prevent it from being ideal as a distance measure. For one, it isn’t symmetric, D(P ‖ Pₐ) ≠ D(Pₐ ‖ P), and thus it cannot be used as a metric. We can take the symmetric combination of D(Pₐ ‖ P) and D(P ‖ Pₐ), but problems still remain, such as the fact that P and Pₐ have to be defined over the same domain, E. Other measures, such as the earth mover distance, may have the edge here since it is symmetric and can be defined on different coordinate systems (and can now be well approximated using neural networks when they are used as arbitrary functions rather than for predicting model parameters). However, we still most often consider the relative entropy. Rewriting the relative entropy, we see that we can express the measure of similarity as two terms
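
D(P ‖ Pₐ) = ∫ dd P(d) log P(d) − ∫ dd P(d) log Pₐ(d, v) = −H(P) + H(P, Pₐ)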

The first term is the negative entropy of the distribution of data, i.e. the expected amount of information which could be obtained by observing an outcome, directly analogous to entropy in statistical thermodynamics. The second term is the cross entropy, H(P, Pₐ), which quantifies the amount of information needed to distinguish the distribution Pₐ from the distribution P, i.e. how many draws of d ~ P would be needed to tell that d was drawn from P and not from Pₐ. Noticing that there is only one set of free parameters, v, in this form of the relative entropy, we can attempt to bring Pₐ as close to P as possible by minimising the relative entropy with respect to these parameters
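
argminᵥ D(P ‖ Pₐ) = argminᵥ H(P, Pₐ), since the entropy term does not depend on v.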

How do we actually do this though? We might not have access to the entire distribution of possible data to do the integral. Instead, consider a sampling distribution, s : d ∈ S ↦ s(d) ∈ [0, 1], where s(d) is the normalised frequency of events from a sampling space, S ⊆ E, containing N conditionally independent values of d. In this case the integral becomes a sum and H(P, Pₐ) ≊ − ∑ s(d) log Pₐ(d, v). Using the conditional relations we then write Pₐ = Lₐ · pₐ as before, and as such the cross entropy is H(P, Pₐ) ≊ − ∑ s(d) log Lₐ(d|v) − ∑ s(d) log pₐ(v). Since the prior is independent of the data it just adds an additive constant to the cross entropy.
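
As a rough numerical sketch of this approximation (assuming numpy and scipy, and reusing the fair-dice multinomial from above as Pₐ), the cross entropy can be estimated from the normalised frequencies s(d) of a finite sample:

```python
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(1)
N = 10_000

# draw N events d ~ P by rolling two fair dice and counting the faces shown
rolls = rng.integers(0, 6, size=(N, 2))
events = np.array([np.bincount(r, minlength=6) for r in rolls])

# normalised frequencies s(d) over the distinct events that were sampled
unique_events, counts = np.unique(events, axis=0, return_counts=True)
s = counts / N

# model probabilities P_a(d, v) for those events, with v fixed to fair dice and n = 2
p_model = np.array([multinomial.pmf(d, n=2, p=[1 / 6] * 6) for d in unique_events])

# H(P, P_a) ≈ −Σ s(d) log P_a(d, v)
print(-np.sum(s * np.log(p_model)))
```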

On a different note, we can write the likelihood as the product of probabilities given the frequency of occurrences of data from the sampling distribution
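
Lₐ(S|v) = ∏ Lₐ(d|v)^{N s(d)},  so that  log Lₐ(S|v) = N ∑ s(d) log Lₐ(d|v), where the product and sum run over the events d ∈ S.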

So, besides the additive constant due to the prior, the cross entropy is directly proportional to the negative logarithm of the likelihood of the data in the model. This means that maximising the logarithm of the likelihood of the data with respect to the model parameters, assuming a uniform prior for all v (or ignoring the prior), is equivalent to minimising the cross entropy, which can be interpreted as minimising the relative entropy, thus bringing Pₐ as close as possible to P. Ignoring the second term in the cross entropy provides a non-subjective maximum likelihood estimate of the parameter values (non-subjective means that we neglect any prior knowledge of the parameter values). If we take the prior into account, however, we recover the simplest form of subjective inference, maximum a posteriori (MAP) estimation
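
v_MAP = argmaxᵥ [ ∑ s(d) log Lₐ(d|v) + log pₐ(v) ]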

which describes the set of parameter values, v, which provide a Pₐ as close as possible to P. A word of caution must be emphasised here — since we use a sampling distribution, s, maximising the likelihood (or posterior) actually provides us with the distribution Pₐ which is closest to the sampling distribution, s. If s is not representative of P, then the model will not necessarily be a good fit to P. A second word of caution — although Pₐ may be as close as possible to P (or actually s) with this set of v, the mode of the likelihood (or posterior) could actually be very far from the high density regions of the distribution and therefore not be representative at all of the more likely model parameter values. This is avoided when considering the entire posterior distribution using MCMC techniques or similar. Essentially, using maximum likelihood or maximum a posteriori estimation, the epistemic error will be massively underestimated without taking into account the bulk of the prior (or posterior) probability density.

Model comparison

Note that there is no statement so far saying whether a model a is actually any good. We can measure how good the fit of the model is to data d by calculating the evidence, which is equivalent to integrating over all possible values of the model parameters, i.e.
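
eₐ(d) = ∫ dv Lₐ(d|v) pₐ(v)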

By choosing a different model a′ with its own set of parameters u ∈ Eᵤ we could then come up with a criterion which describes whether model a or a′ better fits the data, d. Note that this criterion isn’t necessarily well defined. Do we prefer a model which fits the data exactly but has a semi-infinite number of parameters, or do we prefer an elegant model with few parameters but with a less good fit? Until neural networks came about we normally chose the model with the fewest parameters which fit the data well and generalised to make consistent predictions for future events — but this is still up for debate.

Everything described so far is none other than the scientific method. We observe some data and want to model how likely any future observations are. So we build a parameterised model describing the observation and how likely it is to occur, learn about the possible values of the parameters of that model and then improve the model based on some criterion like Occam’s razor, or whatever.

Neural networks as statistical models

Deep learning is a way to build models of the distribution of data

No matter the objective — supervised learning, classification, regression, generation, etc. — deep learning is just building models for the distribution of data. For supervised learning and other predictive methods we consider our data, d, as a pair of inputs and targets, d = (x, y). For example, our inputs could be pictures of cats and dogs, x, accompanied by labels, y. We might then want to make a prediction of the label y for a previously unseen image x′ — this is equivalent to making a prediction of the pair of corresponding inputs and targets, d, given that part of d is known.

So, we want to model the distribution, P, using a neural network, f : (x, v) ∈ (E, Eᵥ) ↦ g = f(x, v) ∈ G, where f is a function parameterised by weights, v, that takes an input x and outputs some values g from a space of possible network outputs, G. The form of the function, f, is described by the hyperparameterisation, a, and includes the architecture, the initialisation, the optimisation routine and, most importantly, the loss or cost function, Λₐ : (d, v) ∈ (E, Eᵥ) ↦ Λₐ(y|x, v) ∈ K · [0, 1]. The loss function describes an unnormalised measure for the probability of the occurrence of data, d = (x, y), with unobservable parameters, v. That is, using a neural network to make predictions for y when given x is equivalent to modelling the probability, P, of the occurrence of data, d, where the shape of the distribution is defined by the form and properties of the network a and the values of its parameters, v. We often distinguish classical neural networks (which make predictions of targets) from neural density estimators (which estimate the probability of inputs, i.e. the space G = [0, 1]) — these are, however, performing the same job, but the distribution from the classical neural network can only be evaluated using the loss function (and is normally not normalised to integrate to 1 like a true probability). This illuminates the meaning of the outputs or predictions of a classical neural network — they are the values of the parameters controlling the shape of the probability distribution of data within our model (defined by the choice of hyperparameters).

As an example, when performing regression using the mean squared error as a loss function, the output of the neural network, g = f(x, v), is equivalent to the mean of a generalised normal distribution with unit variance. This means that, when fed with some input x, the network provides an estimate of the mean of the possible values of y via the values, v, of the parameters, where the possible values of y are drawn from a generalised normal with unit variance, i.e. y ~ N(g, I). Note that this model for the values of y may not be a good choice in any way. As another example, when performing classification using a softmax output, we can interpret the outputs directly as the pᵢ of a multinomial distribution, where the unobservable parameters of the model, v, affect the value of these outputs in a similar way to how parameters in a physical model affect the probability of the occurrence of different data.
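
To make this correspondence concrete, here is a small numerical sketch (assuming only numpy; the numbers are hypothetical): up to constants that do not depend on the network output, the squared-error loss is the negative log-density of a unit-variance normal centred on the network output g, and the categorical cross entropy of a softmax output is the negative log-probability of the observed label under the corresponding multinomial.

```python
import numpy as np

# --- regression: squared-error loss vs unit-variance normal log-density ---
g = np.array([1.2, -0.3])   # hypothetical network outputs f(x, v)
y = np.array([1.0, 0.1])    # observed targets

sq_loss = 0.5 * np.sum((y - g) ** 2)                                 # summed squared error (×0.5)
normal_nll = -np.sum(-0.5 * (y - g) ** 2 - 0.5 * np.log(2 * np.pi))  # −log N(y; g, I)

# the two differ only by a constant, 0.5*log(2*pi) per target dimension
print(sq_loss, normal_nll - 0.5 * y.size * np.log(2 * np.pi))

# --- classification: softmax outputs as the p_i of a multinomial ---
logits = np.array([2.0, 0.5, -1.0])          # hypothetical pre-softmax network outputs
p = np.exp(logits) / np.exp(logits).sum()    # softmax: probabilities of each class
label = 0                                    # observed class for this input

cross_entropy = -np.log(p[label])            # negative log-likelihood of one categorical draw
print(cross_entropy)
```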

With this knowledge at hand, we can then understand the optimisation of the network parameters (known as training) as modelling the distribution of data, P. Usually, when training a network classically, our choice of model allows any values of v ~ pₐ = Uniform[−∞, ∞] (although we tend to draw their initial values from some normal distribution). In essence, we ignore any prior information about the values of the weights because we do not have any prior knowledge. In this case, all the information about the distribution of data comes from the likelihood, Lₐ. So, to train, we perform maximum likelihood estimation of the network parameters, which minimises the cross entropy between the distribution of data and the estimated distribution from the neural network, and hence minimises the relative entropy. To actually evaluate the logarithm of the likelihood for classical neural networks with parameter values v, we can expand the likelihood of some observed data, d, as Lₐ(d|v) ∝ Λₐ(y|x, v) s(x), where s(x) is the sampling distribution of x, equivalent to assigning normalised frequencies to the number of times x appears in S. Evaluating Λₐ(y|x, v) at every y in the sampling distribution when given the corresponding x, taking the logarithm of this probability and summing the result therefore gives us the logarithm of the likelihood of the sampling distribution. Maximising this with respect to the network parameters, v, therefore gives us the distribution, Pₐ, which is closest to s (which should hopefully be close to P).
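
To make the link between training and maximum likelihood estimation explicit, here is a deliberately tiny sketch (assuming only numpy, with a hypothetical one-parameter "network" f(x, v) = v·x and a squared-error loss, i.e. a unit-variance normal model for y): gradient descent on the summed loss is the same update as gradient ascent on the log-likelihood of the sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# a sampling distribution S of (x, y) pairs, with y generated around 3*x
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)   # aleatoric scatter with unit variance

v = 0.0      # single network parameter; prior ignored (improper uniform)
lr = 1e-3    # learning rate

for _ in range(500):
    g = v * x                    # network output: the mean of N(g, 1) for y
    grad = -np.sum((y - g) * x)  # d/dv of the summed squared-error loss (×0.5)
    v -= lr * grad               # one gradient-descent / maximum-likelihood step

print(v)   # ≈ 3: the maximum-likelihood estimate of the parameter
```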

And that’s it. Once trained, a neural network provides the parameters of a statistical model which can be evaluated to find the most likely values of predictions. If the loss function gives an unnormalised likelihood, methods like MCMC can be used to obtain samples which characterise the distribution of data.

A few cautions must be considered. First, the choice of loss function defines the statistical model — if there is no way that the loss function describes the distribution of data, then the statistical model for the distribution of data will be wrong. One way of avoiding this assumption about the distribution is by considering loss functions beyond mean squared error, categorical cross entropy or absolute error — one ideal choice would be the earth mover distance, which can be well approximated by specific types of neural networks and provides an objective which simulates the optimal transport plan between the distribution of data and the statistical model, thus providing an unassumed form for Pₐ. Another thing to note is that a statistical model using a neural network is overparameterised. A model of the evolution of the universe needs only 6 parameters (on a good day) — whilst a neural network would use millions of unidentifiable parameters for much simpler tasks. When doing model selection where model elegance is sought after, neural networks will almost always lose. Finally, networks are made to fit data, based on data — if there is any bias in the data, d ∈ S, i.e. if s is not similar to P, that bias will be prominent. Physical models can avoid these biases by building in intuition. In fact, the same can be done with neural networks too, but at the expense of a lot more brain power and a lot more time spent writing code than picking something blindly.

So deep learning, using a neural network and a loss function, is equivalent to building a parameterised statistical model describing the distribution of data.

Tom Charnock is an expert in statistics and machine learning. He is currently based in Paris and working on solving outstanding issues in the statistical interpretability of models for machine learning and artificial intelligence. As an international freelance consultant, he provides practical solutions for problems related to complex data analysis, data modelling and next-generation methods for computer science. His roles include one-to-one support, global collaboration and outreach via lectures, seminars, tutorials and articles.
