Exponential Families II: Samples and sufficiency

In our last installment, we defined exponential families and gave examples from a range of familiar distributions. Now I would like to spend some time looking at a few reasons why we should care about these families in the first place.

Sufficiency

One major reason why exponential families are nice is that they deliver sufficient statistics for us basically for free. So what are sufficient statistics?

Definition

A statistic T(y) is sufficient for a parameter 𝜃 if and only if the likelihood function can be factored as

    p(y | 𝜃) = h(y) · g(T(y), 𝜃)

In other words, T is sufficient if we can separate out the influence of the data itself, so that the data interacts with the parameter only through the sufficient statistic.

In exponential families

Returning to our exponential families, with the generic form

    p(y | 𝜃) = B(y) · exp(𝜂(𝜃) · T(y) − A(𝜃))

we see clearly that the only interaction between y and 𝜃 is through the function T(y). Hence the T we chose in order to specify the family in the first place is a sufficient statistic.
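As a quick worked example, take the Poisson distribution with rate λ (which reappears in the list below): its pmf can be rewritten as

    p(y | λ) = (1/y!) · exp(y · log(λ) − λ)

so that B(y) = 1/y!, 𝜂(λ) = log(λ), T(y) = y and A(λ) = λ. The data enters the exponent only through T(y) = y, which is exactly the sufficient statistic we will list for the Poisson.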

Samples

Given a set y[1], …, y[n] of iid samples from a distribution in an exponential family, their joint likelihood is also of exponential-family form. Why? Well, because

    p(y[1], …, y[n] | 𝜃) = ∏ B(y[i]) · exp(𝜂(𝜃) · ∑ T(y[i]) − n · A(𝜃))

The final expression has exactly the decomposition the exponential families require: a data-only factor ∏ B(y[i]) in front, and an exponent in which the data enters only through ∑ T(y[i]).

It follows that for our iid sample, mean(T(y)) (or equivalently ∑ T(y[i])) is a sufficient statistic. For most purposes, we can throw away everything else about the data and still carry out all the inferences about 𝜃 we are interested in.
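
To see what this buys us in practice, here is a minimal numerical sketch (my own illustration, in Python with numpy and scipy; the two toy samples are made up): two Poisson samples with the same sum produce log-likelihood curves in λ that differ only by a constant, so any likelihood-based conclusion about λ depends on the data only through ∑ y[i].

    import numpy as np
    from scipy.stats import poisson

    # Two samples of the same size with the same sum, but different values.
    y1 = np.array([2, 5, 1, 4, 3])   # sum = 15
    y2 = np.array([3, 3, 3, 3, 3])   # sum = 15

    lambdas = np.linspace(0.5, 10, 50)
    ll1 = np.array([poisson.logpmf(y1, lam).sum() for lam in lambdas])
    ll2 = np.array([poisson.logpmf(y2, lam).sum() for lam in lambdas])

    # The difference is log(prod B(y1)) - log(prod B(y2)), which does not involve lambda:
    diff = ll1 - ll2
    print(np.allclose(diff, diff[0]))   # True: both samples support identical inference about lambda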

Sufficient Statistics for Distributions

So what are the sufficient statistics for the distributions we looked at?

  • Normal Distribution, known 𝜎²: mean(y)
  • Normal Distribution, known µ, unknown 𝜎²: mean(y² − 2µy)
  • Poisson: mean(y)
  • Binomial, known n: mean(y)
  • Gamma, unknown 𝛽: mean(y)
  • Gamma, unknown 𝛼: mean(log(y))
  • Gamma, both unknown: the pair mean(y) and mean(log(y)), as the sketch below checks numerically
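
In the same spirit, here is a small sketch for the last entry (again Python with scipy, with hand-picked toy numbers): two samples with the same mean(y) and the same mean(log(y)) give exactly the same Gamma likelihood at every choice of 𝛼 and 𝛽.

    import numpy as np
    from scipy.stats import gamma

    # Two samples with the same sum (13) and the same product (36), hence the same
    # mean(y) and mean(log(y)), even though the individual values differ.
    y1 = np.array([1.0, 6.0, 6.0])
    y2 = np.array([2.0, 2.0, 9.0])

    # scipy parameterizes the Gamma by shape and scale = 1/rate.
    for shape, rate in [(0.5, 0.3), (2.0, 1.0), (5.0, 2.5)]:
        ll1 = gamma.logpdf(y1, shape, scale=1/rate).sum()
        ll2 = gamma.logpdf(y2, shape, scale=1/rate).sum()
        print(np.isclose(ll1, ll2))   # True for every (shape, rate) pair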

Bayesian updates

Since exponential families have such a strictly prescribed shape, they are natural candidates for conjugate priors. As we saw above, mean(T(y)), or for that matter ∑ T(y[i]), is a sufficient statistic for an iid sample.

When we are doing Bayesian updates, the contribution from B(y) is absorbed into the proportionality constant. This means that if our prior is in an exponential family of the form

    p(𝜃) ∝ exp(𝜂(𝜃) · 𝜏 − 𝜈 · A(𝜃))

then after seeing n iid samples y[1], …, y[n], the posterior will be

    p(𝜃 | y[1], …, y[n]) ∝ exp(𝜂(𝜃) · (𝜏 + ∑ T(y[i])) − (𝜈 + n) · A(𝜃))

So with a sufficiently wide definition of our A term (allowing it to take the n into account), we get an automatic conjugate prior: the posterior stays in the same family, with the hyperparameters updated by the sufficient statistic ∑ T(y[i]) and the sample size n.
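
To see the automatic conjugate prior in action, here is a minimal sketch for the Poisson family (Python with numpy and scipy; the prior hyperparameters and the simulated data are illustrative choices of mine). With 𝜂(λ) = log(λ), T(y) = y and A(λ) = λ, a prior of the form exp(𝜏 · log(λ) − 𝜈 · λ) is a Gamma(shape 𝜏 + 1, rate 𝜈) density, and the update is simply 𝜏 → 𝜏 + ∑ y[i] and 𝜈 → 𝜈 + n.

    import numpy as np
    from scipy.stats import gamma, poisson

    rng = np.random.default_rng(0)
    y = rng.poisson(3.0, size=50)        # simulated Poisson data, true lambda = 3

    tau, nu = 2.0, 1.0                   # prior hyperparameters: Gamma(shape tau + 1, rate nu)
    tau_post, nu_post = tau + y.sum(), nu + len(y)   # the conjugate update

    # Cross-check against a brute-force grid posterior:
    grid = np.linspace(0.01, 10, 2000)
    log_post = gamma.logpdf(grid, tau + 1, scale=1/nu) + \
               np.array([poisson.logpmf(y, lam).sum() for lam in grid])
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * (grid[1] - grid[0])          # normalize numerically on the grid

    analytic = gamma.pdf(grid, tau_post + 1, scale=1/nu_post)
    print(np.allclose(post, analytic, atol=1e-3))     # True, up to grid error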

--

Mikael Vejdemo-Johansson
CUNY CSI MTH594 Bayesian Data Analysis
