Exponential Families I: Definitions and Examples

Many of the distributions we keep meeting in statistics have density functions that can be written as exp[…] for something useful inside the brackets. And many of the things one might want to do (say, compute the likelihood of an entire sample, or update a prior, …) boil down to multiplying density functions, which turns into adding the contents of the brackets.

It turns out we can get a lot of mileage out of working out parts of the theory for such families in the abstract, and then applying it in specific cases as and when needed.

Definition

We call a family of probability distributions an exponential family parametrized by some parameter 𝜃 if there are functions 𝜂, T, A and B such that each density in the family can be written as:

p(y|𝜃) = exp[𝜂(𝜃)·T(y) + A(𝜃) + B(y)]
Generic form of an exponential family density

Here, both 𝜂 and T can be vector valued as long as they are of the same size.

The idea is to separate out mixed terms (coming from the dot product) from pure terms (A for the parameter and B for the data).
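
The abstract form is easy to make concrete in code. Below is a minimal sketch in Python (the function name expfam_density is my own, not from any library) that evaluates a density from its 𝜂, T, A, B pieces, instantiated with the exponential distribution p(y|𝜆) = 𝜆e^(−𝜆y), for which 𝜂(𝜆) = (−𝜆), T(y) = (y), A(𝜆) = log 𝜆 and B(y) = 0:

```python
import math

def expfam_density(eta, T, A, B, theta, y):
    """Evaluate exp[eta(theta) . T(y) + A(theta) + B(y)].

    eta and T return equal-length tuples, paired by a dot product;
    A and B return scalars.
    """
    dot = sum(e * t for e, t in zip(eta(theta), T(y)))
    return math.exp(dot + A(theta) + B(y))

# Exponential distribution p(y | lam) = lam * exp(-lam * y), written as
# eta(lam) = (-lam,), T(y) = (y,), A(lam) = log(lam), B(y) = 0.
density = expfam_density(
    eta=lambda lam: (-lam,),
    T=lambda y: (y,),
    A=math.log,
    B=lambda y: 0.0,
    theta=2.0,
    y=1.5,
)
```

For 𝜆 = 2 and y = 1.5 this evaluates exp[−3 + log 2], which is exactly 2e^(−3), matching the density directly.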

Examples

A definition is of little use unless we can produce a few examples to show what’s actually going on. So here are some very common probability densities:

Normal

The normal distribution has density function

p(y|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[\frac{-(y-\mu)^2}{2\sigma^2}\right]
Normal distribution density function

Once we recognize that 1/√(2π𝜎²) = exp[−½ log(2π𝜎²)], and after expanding the square, we can rewrite all of this as

p(y|\mu,\sigma^2) = \exp\left[\frac{-1}{2}\left(\frac{y^2-2y\mu+\mu^2}{\sigma^2}+\log(2\pi) + \log(\sigma^2)\right)\right]

Now, depending on which parameters are considered known, the normal distributions are exponential in a few different ways:

𝜎² known, µ unknown:

The mixed term is yµ/𝜎², so we can take 𝜂(µ) = µ/𝜎² and T(y) = y. Everything else is either pure parameter or pure data.

𝜎² unknown, µ known:

Now both the yµ and the y² terms pair data with the unknown 𝜎², so we need the vector-valued 𝜂(𝜎²) = (µ/𝜎², −1/(2𝜎²)) and T(y) = (y, y²).

By looking carefully at all the terms, we can conclude that the decomposition with 𝜂(µ,𝜎²) = (µ/𝜎², −1/(2𝜎²)) and T(y) = (y, y²) also covers the case when both parameters are unknown.
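
As a numerical sanity check (a quick sketch; function names are mine), we can verify that the decomposition with 𝜂(µ,𝜎²) = (µ/𝜎², −1/(2𝜎²)) and T(y) = (y, y²) reproduces the textbook normal density:

```python
import math

def normal_pdf(y, mu, sigma2):
    # textbook form: exp[-(y - mu)^2 / (2 sigma^2)] / sqrt(2 pi sigma^2)
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_expfam(y, mu, sigma2):
    # mixed terms: eta = (mu/sigma2, -1/(2 sigma2)) dotted with T(y) = (y, y^2)
    dot = (mu / sigma2) * y - y ** 2 / (2 * sigma2)
    # pure parameter terms collected into A; B(y) = 0
    A = -mu ** 2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
    return math.exp(dot + A)
```

The two functions agree to floating-point precision at every point, since expanding the square turns one expression into the other.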

Poisson

Recall the Poisson probability mass function:

p(k|\lambda) = \frac{\lambda^k e^{-\lambda}}{k!} = \exp\left[k\log\lambda - \lambda - \log(k!)\right]

From this we can now pick out mixed and pure terms quite easily: 𝜂(𝜆) = log 𝜆 and T(k) = k give the mixed term, A(𝜆) = −𝜆 is the pure parameter term, and B(k) = −log(k!) is the pure data term.
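
A quick numerical check of this decomposition (a sketch; function names are mine), using 𝜂(𝜆) = log 𝜆, T(k) = k, A(𝜆) = −𝜆 and B(k) = −log(k!):

```python
import math

def poisson_pmf(k, lam):
    # textbook form: lam^k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

def poisson_expfam(k, lam):
    # exp[k log(lam) - lam - log(k!)]; lgamma(k + 1) = log(k!)
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
```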

Binomial and Bernoulli

Recall the Binomial probability mass function (Bernoulli is the special case of n=1):

p(k|n,p) = \binom{n}{k} p^k (1-p)^{n-k}

This can actually be simplified, and the T made more intuitive, by rewriting the mass function a little bit. By extracting the -k in the exponent into an actual division we are left with

p(k|n,p) = \binom{n}{k}\left(\frac{p}{1-p}\right)^k (1-p)^n = \binom{n}{k} o^k (1-p)^n

where we write o for the odds: o = p/(1-p). A little algebra shows us that then p = o/(o+1) and (1-p) = 1/(o+1). Taking logarithms, this can be rewritten as

p(k|n,o) = \exp\left[k\log o - n\log(1+o) + \log\binom{n}{k}\right]

We could imagine not knowing n. It’s weird, but not unfathomable. We get three cases:

n known, p unknown:

Here, k log o is our mixed term. We get the exponential family defined through 𝜂(p) = log o = log(p/(1−p)), T(k) = k, A(p) = −n log(1+o) = n log(1−p), and B(k) the logarithm of the binomial coefficient.
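
Again we can sanity-check the decomposition numerically (a quick sketch; function names are mine), comparing the textbook pmf against the odds-based exponential-family form:

```python
import math

def binom_pmf(k, n, p):
    # textbook form: C(n,k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_expfam(k, n, p):
    o = p / (1 - p)  # the odds
    # exp[k log(o) - n log(1 + o) + log C(n,k)]
    return math.exp(k * math.log(o) - n * math.log(1 + o) + math.log(math.comb(n, k)))
```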

n unknown:

Now our mixed term includes the binomial coefficient. It is not clear to me that this mixed term can be neatly written as a dot product between vectors of fixed sizes.

Beta distribution

The Beta distribution is very closely related to the Binomial. We leave it as an exercise to derive the exponential family form for the Beta distribution from this.

Gamma

The Gamma distributions include as special cases both the exponential and the chi-squared distributions.

Gamma comes with three different commonly used parameter sets:

  1. Shape k and scale 𝜃
  2. Shape k and rate 𝛽; k is also often called 𝛼.
  3. Shape k and mean µ

These are connected through: 𝛽 = 1/𝜃, and µ = k𝜃 = k/𝛽

Their probability density functions are:

p(y|k,\theta) = \frac{1}{\Gamma(k)\theta^k} y^{k-1} e^{-y/\theta}

p(y|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} y^{\alpha-1} e^{-\beta y}

p(y|k,\mu) = \frac{(k/\mu)^k}{\Gamma(k)} y^{k-1} e^{-ky/\mu}

We get the exponential distribution as the case k=1, and chi-squared with 𝜈 degrees of freedom as the case k=𝜈/2, 𝜃=2. Recall also that 𝛤 generalizes the factorial: 𝛤(n) = (n−1)! on positive integers.

The shape/rate case is an exponential family as follows:

𝛼 known, 𝛽 unknown:

The mixed term is -𝛽y, so 𝜂(𝛽) = −𝛽 and T(y) = y; the pure parameter part is 𝛼 log 𝛽 − log 𝛤(𝛼), and the pure data part is (𝛼−1) log y.

𝛼 unknown, 𝛽 known:

The mixed term is (𝛼−1) log y, so 𝜂(𝛼) = 𝛼−1 and T(y) = log y.

both unknown:

The mixed term is the dot product 𝜂(𝛼,𝛽)·T(y) with 𝜂(𝛼,𝛽) = (𝛼−1, −𝛽) and T(y) = (log y, y), capturing both mixed terms at once.
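
A quick numerical check of the shape/rate decomposition (a sketch; function names are mine), with 𝜂(𝛼,𝛽) = (𝛼−1, −𝛽) and T(y) = (log y, y):

```python
import math

def gamma_pdf(y, alpha, beta):
    # textbook shape/rate form: beta^alpha y^(alpha-1) exp(-beta y) / Gamma(alpha)
    return beta ** alpha * y ** (alpha - 1) * math.exp(-beta * y) / math.gamma(alpha)

def gamma_expfam(y, alpha, beta):
    # mixed terms: eta = (alpha - 1, -beta) dotted with T(y) = (log y, y)
    dot = (alpha - 1) * math.log(y) - beta * y
    # pure parameter terms: alpha log(beta) - log Gamma(alpha); B(y) = 0
    A = alpha * math.log(beta) - math.lgamma(alpha)
    return math.exp(dot + A)
```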

The two remaining forms (shape/scale and shape/mean) we leave as an exercise. 😉

For our next post…

Next post will discuss sufficient statistics and how to read them directly off of an exponential family, as well as the sampling distributions for combining iid samples.


Mikael Vejdemo-Johansson
CUNY CSI MTH594 Bayesian Data Analysis
