Probability Preliminaries

There are a fair number of things that the first chapter of BDA3 (Bayesian Data Analysis, 3rd ed) skips over — considering them to be expected prerequisites. I will spend a few lines of text unpacking some of these, to make sure we all are on the same starting level.

What is probability?

A probability space is a triple of things:

  1. A set Ω of outcomes.
  2. A set ∑ of possible events — an event is a subset of Ω, but we are allowed to disallow some events. This way we can avoid some pathological behaviors.
  3. A measure — i.e. a function ℙ: ∑→[0,1]
    (note: measures can take values in [0,∞) — but probability measures are restricted to [0,1])

We need for these to fulfill certain conditions. Some of the most important ones are:

  1. ℙ(∅) = 0
  2. ℙ(Ω) = 1
  3. ℙ(X ∪ Y) = ℙ(X) + ℙ(Y) if X is disjoint from Y

It’s useful to distinguish between the cases where Ω is discrete or continuous.

Most if not all cases we will be dealing with are well behaved in a number of nice ways. One of them is that the probability will very often be given by a probability density (or probability mass in the discrete case) — in other words a function p such that

ℙ(E) = ∫_E p(x) dx
i.e. the probability of an event E is the integral over E of the density function.

Throughout, a lot of probabilities and probability manipulations show up as integrals. The same statements hold for discrete cases — swapping each ∫ for a ∑ and summing instead of integrating.

To give us a couple of running examples to check everything against:

  1. With Ω=ℝ and the density p(x) = e^(−x²/2)/√(2π), we get the ubiquitous standard normal distribution.
  2. With Ω={1,2,3,4,5,6} and each outcome sharing the same probability 1/6, we get the probability space of a fair D6. This is an example of a discrete uniform distribution.
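
To make the density definitions concrete, here is a minimal numerical check of both running examples. This is a sketch in Python, assuming numpy and scipy are available; neither the code nor the library choice comes from BDA3.

    import numpy as np
    from scipy.integrate import quad

    # Standard normal density on Omega = R
    def p(x):
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    # P(Omega) = 1: the density integrates to 1 over the whole real line
    total, _ = quad(p, -np.inf, np.inf)
    print(total)                                # ~1.0

    # P(E) for the event E = (-inf, 0]: integrate the density over E
    prob_E, _ = quad(p, -np.inf, 0)
    print(prob_E)                               # ~0.5

    # Fair D6: discrete uniform probability mass on {1, ..., 6}
    pmf = {k: 1/6 for k in range(1, 7)}
    print(sum(pmf.values()))                    # 1.0, so P(Omega) = 1
    print(sum(pmf[k] for k in (2, 4, 6)))       # 0.5: the event "roll is even"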

Joint and Marginal Probability

Given two probability spaces with outcome sets Ω₁ and Ω₂, we can form a new probability space with outcome set Ω₁ × Ω₂ — the Cartesian product of the two original spaces. We will skip the construction of the corresponding event set and probability measure.

One useful way to visualize this: the probability density of either component can be drawn as a function graph over a line.

The joint probability of two probability spaces is a function over the product of the two — so something closer to a function on a square or a plane. We can visualize that as a square with colors representing the function values, like a heat map.

When writing in terms of density functions, the density of a joint probability at a particular pair of outcomes (x,y) is written p(x,y). Similarly we can write p(x,y,z,…) for the joint density of several variables.

We can recover the component probabilities through integration: the marginal probability density of x is given by the integral p(x) = ∫ p(x,y) dy over all possible values for the other component.

This amounts to integrating down (or across) a single line cutting through the density map.
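
As a small worked example of marginalization, here is a sketch in Python with a made-up 2×3 joint probability table (the numbers are illustrative only, not from BDA3). Summing the joint over one index recovers the marginal of the other component; in the continuous case the sum would be the integral above.

    import numpy as np

    # A made-up joint probability mass p(x, y) on a 2 x 3 grid of outcomes;
    # the entries are nonnegative and sum to 1. Rows index x, columns index y.
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.05, 0.30]])

    # Marginal of x: "integrate out" y by summing along the y axis
    p_x = joint.sum(axis=1)         # [0.40, 0.60]
    # Marginal of y: sum along the x axis
    p_y = joint.sum(axis=0)         # [0.35, 0.25, 0.40]

    print(p_x, p_x.sum())           # each marginal again sums to 1
    print(p_y, p_y.sum())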

Conditional Probability

In Bayesian data analysis, almost everything focuses on conditional probabilities: given that we know some new data y, what happens to our knowledge about some unknown quantity 𝜃?

These conditional probabilities can also be defined in terms of the same kind of lines cutting through the joint probability square as in the last section. However, instead of integrating along the line and taking the result to be the interesting quantity, conditional probability restricts the function to the line and then rescales to create a new probability density function.

The rescaling is such that integrating along the line should now produce the total probability of 1 to fulfill the axioms we mentioned earlier. But integrating along a line produces the marginal probability from above. So by dividing by the marginal probability we get the resulting probability density.

Based on this we define the conditional probability of x given y to be
p(x | y) = p(x, y)/p(y).
We can easily work with more than two densities here:
p(x | y, z) = p(x,y,z)/p(y,z)
p(x, y | z) = p(x,y,z)/p(z)

The definition here rewrites into p(x,y) = p(x|y)p(y). This can unpack into a whole chain of conditional probabilities:
p(x,y,z) = p(x|y,z)p(y,z) = p(x|y,z)p(y|z)p(z)
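
Using the same kind of made-up discrete table as in the marginalization sketch (illustrative numbers only, not from BDA3), the conditional density is a renormalized slice of the joint, and the chain rule reassembles the joint from the pieces:

    import numpy as np

    # Made-up joint probability mass p(x, y); rows index x, columns index y.
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.05, 0.30]])
    p_y = joint.sum(axis=0)                 # marginal of y

    # p(x | y): fix a column (a "line" through the square) and rescale by p(y)
    p_x_given_y = joint / p_y               # broadcasting divides each column by p(y)
    print(p_x_given_y.sum(axis=0))          # each column now sums to 1

    # Chain rule: p(x, y) = p(x | y) p(y)
    print(np.allclose(p_x_given_y * p_y, joint))    # True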

This points us to an important set of rules to keep in mind: how things in a conditional expression can move around — and how they can be added and removed. This starts with the notion of independence.

Independent Probabilities

Two events are independent if conditioning does not influence the probability:
p(x|y) = p(x)

This definition looks asymmetric — what about p(y|x) — but we’ll cover that in a moment.

If the two events are independent, then p(x,y) = p(x|y)p(y) = p(x)p(y). This is often taken as the actual definition — two events are independent if their joint probability (or density, or mass) is the product of the components.

The density depicted earlier is not the joint density of two independent components. In such pictures, independence shows up as a kind of stripiness: every horizontal (or vertical) slice is a rescaled copy of every other.
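
A quick way to check this numerically is to compare the joint table against the outer product of its marginals. A sketch with the same made-up table as before, plus an independent table built from its marginals:

    import numpy as np

    # The made-up joint from earlier: not independent
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.05, 0.30]])
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    # Independence test: is p(x, y) = p(x) p(y) everywhere?
    print(np.allclose(joint, np.outer(p_x, p_y)))       # False

    # An independent joint with the same marginals: the outer product itself.
    # Its rows are all proportional to each other: the "stripiness" in the picture.
    indep = np.outer(p_x, p_y)
    print(np.allclose(indep, np.outer(indep.sum(axis=1), indep.sum(axis=0))))   # True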

Clearly when defining in terms of the joint probability, we see the symmetry. We can also recover this directly:

Theorem If p(x|y) = p(x) then p(y|x) = p(y).

Proof We know p(y|x) = p(x,y)/p(x) by definition. Furthermore, p(x,y) = p(x|y)p(y), giving us p(y|x) = p(x|y)p(y)/p(x). Since p(x|y) = p(x), this becomes p(y|x) = p(x)p(y)/p(x), and (as long as p(x) is nonzero on our outcome space) we can cancel p(x) to get p(y|x) = p(y). □

We have — sort of accidentally along the way here — also proven the most fundamental building block in Bayesian statistics: Bayes theorem.

Bayes Theorem p(y|x) = p(x|y)p(y)/p(x).
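
As a numerical sanity check of Bayes theorem (again on a made-up discrete table, a sketch rather than anything from the book): computing p(y|x) directly from the joint agrees with computing it through p(x|y)p(y)/p(x).

    import numpy as np

    # Made-up joint p(x, y); rows index x, columns index y.
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.05, 0.30]])
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    # Direct definition: p(y | x) = p(x, y) / p(x)
    p_y_given_x = joint / p_x[:, None]

    # Bayes theorem: p(y | x) = p(x | y) p(y) / p(x)
    p_x_given_y = joint / p_y
    via_bayes = p_x_given_y * p_y / p_x[:, None]

    print(np.allclose(p_y_given_x, via_bayes))      # True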

Conditional Independence

One way of viewing conditioning is that we are restricting our probability space Ω to some subset Ω’ in ∑. But the restriction is also a probability space — so everything we have done holds true inside the restriction! In particular, we can define conditional independence as the independence within this restriction:

x and y are conditionally independent given z if p(x,y|z) = p(x|z)p(y|z).

Proving equivalence between this factoring definition and the conditioning cancellation definition above is straightforward.
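
To see the factoring definition in action, here is a sketch of a three-variable example with made-up numbers: the joint p(x,y,z) is built so that x and y are conditionally independent given z, and the code checks that p(x,y|z) = p(x|z)p(y|z) for every z, even though x and y are not independent marginally.

    import numpy as np

    # Made-up ingredients: p(z), p(x | z), p(y | z)
    p_z = np.array([0.4, 0.6])
    p_x_given_z = np.array([[0.9, 0.1],     # row z: distribution of x given that z
                            [0.2, 0.8]])
    p_y_given_z = np.array([[0.7, 0.3],     # row z: distribution of y given that z
                            [0.1, 0.9]])

    # Build the joint p(x, y, z) = p(x | z) p(y | z) p(z); axes ordered (x, y, z)
    joint = np.einsum('zx,zy,z->xyz', p_x_given_z, p_y_given_z, p_z)

    # Conditional independence: p(x, y | z) factors as p(x | z) p(y | z)
    p_xy_given_z = joint / p_z                      # divide each z-slice by p(z)
    factored = np.einsum('zx,zy->xyz', p_x_given_z, p_y_given_z)
    print(np.allclose(p_xy_given_z, factored))      # True

    # ...but x and y are not independent once z is integrated out
    p_xy = joint.sum(axis=2)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    print(np.allclose(p_xy, np.outer(p_x, p_y)))    # False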

Rules for manipulating conditionals

Pulling all of this together, we get that

  1. p(x,y) = p(x|y)p(y)
  2. p(x,y|z) = p(x|y,z)p(y|z)
    if you are already conditioning, the condition just carries along with you.
  3. If x, y are conditionally independent given z, then
    p(x|y,z) = p(x|z)
  4. It is also often useful to know that you can get the marginal by integrating over all possible conditionings. This is like using points on vertical stripes in the joint probability square to compute the integral along a horizontal stripe:
    p(x) = ∫ p(x|y)p(y) dy — continuous case
    p(x) = ∑ p(x|y)p(y) — discrete case
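
Rule 4 in particular is easy to check numerically. A sketch with made-up numbers (the same table as in the earlier sketches, now given as p(y) and a table of p(x|y)):

    import numpy as np

    # Made-up marginal p(y) and conditional table p(x | y); columns index y.
    p_y = np.array([0.35, 0.25, 0.40])
    p_x_given_y = np.array([[0.10/0.35, 0.20/0.25, 0.10/0.40],
                            [0.25/0.35, 0.05/0.25, 0.30/0.40]])

    # Rule 4, discrete case: p(x) = sum over y of p(x | y) p(y)
    p_x = p_x_given_y @ p_y
    print(p_x)                                   # [0.4, 0.6]

    # Same answer as marginalizing the joint p(x, y) = p(x | y) p(y) directly
    joint = p_x_given_y * p_y
    print(np.allclose(p_x, joint.sum(axis=1)))   # True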

More useful probability notation and results

Here are some more of the things BDA3 brings up in Chapter 1.

  1. Expectation (that I like to write 𝔼[x]) is defined as 𝔼[f(x)] = ∫ f(x)p(x)dx.
    Expectation is linear: 𝔼[ax+by] = a𝔼[x]+b𝔼[y] — because integrals are linear.
  2. Variance (that I like as 𝕍[x] — but I’m possibly too fond of blackboard bold) is the expectation of the squared deviation from the mean.
    𝕍[x] = 𝔼[(x-𝔼[x])²] = 𝔼[x²]-𝔼[x]².
  3. More importantly we get covariance — also for random vectors (yes, we skimmed past the definition of random variables here. sorry.) — as the expected product of deviations from the expectation:
    cov(x,y) = 𝔼[(x-𝔼[x])(y-𝔼[y])ᵀ]
  4. The expected value when conditioning changes with what you condition on — but if you take the expected value over all possible conditions, you recover the expectation under the marginal distribution:
    𝔼[𝔼[x|y]] = 𝔼[x]
  5. For variance, we get a combination of expectations and variances:
    𝔼[var(x|y)] + var(𝔼[x|y]) = var(x)
  6. 4. and 5. hold in a vector setting as well.
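
All of these can be verified on a small discrete example. The sketch below uses made-up outcome values and the same kind of joint table as before; nothing in it comes from BDA3.

    import numpy as np

    # Made-up joint p(x, y) over outcome values xs and ys; rows index x, columns index y.
    xs = np.array([0.0, 1.0])
    ys = np.array([-1.0, 0.0, 2.0])
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.05, 0.30]])
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

    # 1. Expectation: E[x] = sum of x p(x)
    E_x = xs @ p_x

    # 2. Variance: V[x] = E[x^2] - E[x]^2
    V_x = (xs**2) @ p_x - E_x**2

    # 3. Covariance: cov(x, y) = E[(x - E[x])(y - E[y])]
    E_y = ys @ p_y
    cov_xy = np.sum(joint * np.outer(xs - E_x, ys - E_y))
    print(E_x, V_x, cov_xy)

    # 4. Tower property: E[E[x | y]] = E[x]
    p_x_given_y = joint / p_y                 # columns are p(x | y)
    E_x_given_y = xs @ p_x_given_y            # one conditional expectation per value of y
    print(np.isclose(E_x_given_y @ p_y, E_x))               # True

    # 5. Law of total variance: E[var(x | y)] + var(E[x | y]) = var(x)
    var_x_given_y = (xs**2) @ p_x_given_y - E_x_given_y**2
    E_var = var_x_given_y @ p_y
    var_E = (E_x_given_y**2) @ p_y - (E_x_given_y @ p_y)**2
    print(np.isclose(E_var + var_E, V_x))                   # True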

Now, this should set our stage pretty thoroughly I hope.

--

Mikael Vejdemo-Johansson
CUNY CSI MTH594 Bayesian Data Analysis