ℙ or Pr or p or … ?

There are a LOT of different p's floating around anything probability-related, and it's not always clear how they fit together, nor which manipulations are valid where.

So let’s talk about it.

I will be using the following convention here:

  • ℙ is the probability measure. ℙ(E) is the probability of the event E, which in turn is a subset of the outcome space Ω (check out our previous post for what I mean by this)
  • I’m personally not all that fond of Pr, so I will be skipping it. We may reintroduce it later on to encode priors. Maybe.
  • p is the probability density function, and also the probability mass function. They boil down to the same thing if you do a bit of futzing around with point masses and linear combinations of Dirac measures.
    The difference lies in whether Ω is discrete or continuous, and in whether we use ∑ or ∫ to combine them. (A minimal sketch of the conventions follows this list.)
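
To see the conventions in one place, here is a minimal Python sketch. The fair six-sided die is a toy example of my own, not something from the text above: Ω is a finite outcome space, p is the PMF, and ℙ eats events, i.e. subsets of Ω.

```python
# A minimal sketch of the notation, using a fair six-sided die as a toy example.
omega = {1, 2, 3, 4, 5, 6}            # the outcome space Ω
p = {w: 1 / 6 for w in omega}         # the probability mass function p

def P(event):
    """The probability measure ℙ: it takes an event, a subset of Ω."""
    return sum(p[w] for w in event)

E = {2, 4, 6}                         # the event "the roll is even"
print(P(E))                           # 0.5
```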

Random Variables

We haven’t been as diligent as maybe we should have been about random variables, largely because the outcome space itself can be read off through a random variable (the identity map on Ω), and it’s tempting to just blur the distinction.

A real-valued random variable X is a function X:Ω→ℝ. We will also often deal with a vector-valued random variable, in which case instead of ℝ as a target space, we’ll have ℝᵈ for some dimension d.

We can create a new probability space out of a random variable by using ℝ as our outcomes, and picking a collection of nice subsets of ℝ as our events. Unions of open subsets of ℝ are a common candidate here. The probability measure 𝕏 (the distribution, or pushforward, of X) is defined through
𝕏(U) = ℙ(X⁻¹(U))
or in other words, the probability of X∈U is the probability of the event consisting of everything in Ω that gets mapped to U.
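
Here is a small sketch of that definition, reusing the toy die from the sketch above; the particular random variable (squaring the roll) is an arbitrary choice of mine.

```python
# A toy pushforward: 𝕏(U) = ℙ(X⁻¹(U)) over a finite outcome space.
omega = {1, 2, 3, 4, 5, 6}
p = {w: 1 / 6 for w in omega}

def P(event):
    return sum(p[w] for w in event)

def X(w):
    """A random variable Ω → ℝ: the square of the die roll."""
    return w ** 2

def pushforward(U):
    """Pull U back along X to a subset of Ω, then measure that."""
    preimage = {w for w in omega if X(w) in U}
    return P(preimage)

print(pushforward({1, 4, 9}))   # ℙ(X ∈ {1, 4, 9}) = ℙ({1, 2, 3}) = 0.5
```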

For random variables, we define two (three) important functions:

  • CDF — the Cumulative Distribution Function is defined as
    CDF(x) = 𝕏((-∞,x]) = ℙ(X ≤ x)
  • PDF — the Probability Density Function is the derivative of the CDF. This means that the PDF (which we have tended to denote as p) is the function such that
    𝕏(E) = ∫ p(x)dx, where the integral goes over E. (Unicode and Medium fail me in writing arbitrary subscripts 😞)
  • PMF — the Probability Mass Function is the discrete version of the PDF. In the discrete case, we can still talk about the CDF and about probability measures, but the CDF will be discontinuous and its derivative ill-behaved: zero wherever it exists, and undefined at the jumps. A solution is to introduce something that works almost the same way, the PMF: a function p such that
    𝕏(E) = ∑ p(x), where the sum goes over elements of E. (A numeric check of both bullets follows this list.)
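
As a quick numeric sanity check of both bullets, here is a sketch using scipy.stats; the specific distributions are my own picks, nothing canonical about them.

```python
# Checking the CDF/PDF and CDF/PMF relationships numerically.
from scipy import stats
from scipy.integrate import quad

# Continuous: for X ~ Normal(0, 1), ℙ(a < X ≤ b) = ∫ p(x) dx over (a, b]
a, b = -1.0, 2.0
integral, _ = quad(stats.norm.pdf, a, b)
print(integral, stats.norm.cdf(b) - stats.norm.cdf(a))   # both ≈ 0.8186

# Discrete: for X ~ Binomial(10, 0.3), ℙ(X ≤ 4) = ∑ p(k) for k = 0, ..., 4
print(sum(stats.binom.pmf(k, 10, 0.3) for k in range(5)),
      stats.binom.cdf(4, 10, 0.3))                        # both ≈ 0.8497
```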

If we construct the density as a sum of Dirac distributions (gadgets that integrate to step functions), then the integral becomes identical to the corresponding PMF sum — this is why people can get quite sloppy in distinguishing between when to use ∫ and when to use ∑.
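
If you want to see that trick in action numerically, here is a heuristic sketch: each point mass becomes a very narrow Gaussian spike (the width σ, the grid, and the little three-point distribution are all arbitrary choices of mine), and the Riemann-sum "integral" lands on the same number as the plain PMF sum.

```python
# Replacing point masses with narrow Gaussian spikes: ∫ agrees with ∑.
import numpy as np

values = np.array([1, 2, 3])          # support of a toy discrete distribution
masses = np.array([0.2, 0.5, 0.3])    # its PMF

sigma = 0.01                          # spike width: each mass ≈ Normal(v, σ)
x = np.arange(-1.0, 5.0, 0.001)
density = sum(m * np.exp(-(x - v) ** 2 / (2 * sigma ** 2))
              / (sigma * np.sqrt(2 * np.pi))
              for v, m in zip(values, masses))

dx = x[1] - x[0]
print((x * density * dx).sum())       # ≈ 2.1, the Riemann-sum "integral" of x p(x)
print((values * masses).sum())        # 2.1, the plain PMF sum ∑ x p(x)
```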

Nested Expectations

Another topic that may require a bit of attention is the nested expectation equalities we saw when discussing conditional probabilities earlier.

We know for a fact that

𝔼[ 𝔼[X|Y] ] = 𝔼[X]

But how do we get there? And what do we mean by the different expectations?

If we had had proper subscripts (Medium is less and less suited for mathematical text the more I try to use it for that), then we could have made it clearer which random variable and which probability space each expectation operates over.

Let’s be clear in our definitions. By 𝔼[X|Y] we mean the conditional expectation, defined as

𝔼[X|Y=y] = ∫ x p(x|y) dx

We need to pick a single value y for the random variable Y to condition on; and then we could think of it as partial evaluation of the joint probability. In the discrete case, as expected, the integral turns into a sum:

𝔼[X|Y=y] = ∑ x p(x|y), where the sum goes over the values x can take.

Now, as we vary y, this conditional expectation varies too: it is a function of y. So we can take the expectation across different values of y. This gets us

𝔼[𝔼[X|Y]] = ∫ 𝔼[X|Y=y] p(y) dy = ∫∫ x p(x|y) p(y) dx dy = ∫∫ x p(x,y) dx dy

The last step is because p(x,y) = p(x|y)p(y) defines the conditional density. From there, integrating out y leaves ∫ p(x,y) dy = p(x), so the whole thing collapses to ∫ x p(x) dx = 𝔼[X].

To write it all out in the discrete case, it ends up looking like this:

𝔼[𝔼[X|Y]] = ∑ 𝔼[X|Y=y] p(y) = ∑∑ x p(x|y) p(y) = ∑∑ x p(x,y) = ∑ x p(x) = 𝔼[X]

where the outer sums go over y and the inner sums over x.
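
To convince ourselves numerically, here is a small check of the whole chain on a joint PMF table; the table itself is an arbitrary example of mine.

```python
# Verifying 𝔼[𝔼[X|Y]] = 𝔼[X] for a small joint PMF table p(x, y).
import numpy as np

xs = np.array([0.0, 1.0, 2.0])            # values of X (rows)
ys = np.array([10.0, 20.0])               # values of Y (columns)
p = np.array([[0.10, 0.15],
              [0.20, 0.25],
              [0.05, 0.25]])              # p[i, j] = ℙ(X = xs[i], Y = ys[j])
assert np.isclose(p.sum(), 1.0)

p_y = p.sum(axis=0)                       # marginal p(y) = ∑ p(x, y) over x
p_x_given_y = p / p_y                     # conditional p(x|y) = p(x, y) / p(y)

E_X_given_Y = xs @ p_x_given_y            # 𝔼[X|Y=y] = ∑ x p(x|y), one per y
lhs = E_X_given_Y @ p_y                   # 𝔼[𝔼[X|Y]] = ∑ 𝔼[X|Y=y] p(y)
rhs = xs @ p.sum(axis=1)                  # 𝔼[X] = ∑ x p(x)
print(lhs, rhs)                           # both 1.05
```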

Wait, wait, conditional density?

The details are finicky, but it turns out that the following relations hold:

p(x,y) = p(x|y) p(y) = p(y|x) p(x)
p(x|y) = p(x,y) / p(y)
p(x) = ∫ p(x,y) dy, or in the discrete case p(x) = ∑ p(x,y), where the sum goes over y
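
As a quick check that these relations cohere, here is a numeric verification of p(x) = ∫ p(x|y) p(y) dy for a model of my own choosing: Y ~ Normal(0, 1) and X|Y=y ~ Normal(y, 1), in which case the true marginal of X is Normal(0, 2).

```python
# Checking the marginalization p(x) = ∫ p(x, y) dy = ∫ p(x|y) p(y) dy.
from scipy import stats
from scipy.integrate import quad

def joint(x, y):
    # p(x, y) = p(x|y) p(y): the conditional density relation
    return stats.norm.pdf(x, loc=y, scale=1.0) * stats.norm.pdf(y)

x = 0.7
marginal, _ = quad(lambda y: joint(x, y), -10, 10)   # ∫ p(x, y) dy
print(marginal)                                       # ≈ 0.2496
print(stats.norm.pdf(x, scale=2 ** 0.5))              # Normal(0, 2) density: same
```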
