Probability Theory Ideas and Concepts

From Definitions to Expectation, Variance and Covariance

Jake Batsuuri
Computronium Blog
5 min read · Jan 1, 2020

Motivation

Probability theory allows us to make uncertain statements and to reason in the presence of uncertainty. Logic helps us reason about deterministic things; probability theory extends that kind of reasoning to uncertain things.

Information theory allows us to quantify the amount of uncertainty in a probability distribution.

Under the classic division of topics that fall under the umbrella of AI and ML, one way to divide the methods is into supervised and unsupervised algorithms. Unsupervised learning involves more uncertainty, but even supervised algorithms remain mathematically uncertain.

Uncertainty can occur in several places:

  • There is inherent stochasticity in the system being modeled
  • Incomplete observability of the system
  • Incomplete modelling: the model fails to capture the situation completely, which is almost always the case

In most cases, it is better to use a simple but uncertain model than a more complicated model that is more accurate.

Frequentist Probability

Describes situations where an event can be repeated or simulated over and over, like drawing a card from a deck.

Bayesian Probability

Describes situations that are hard to repeat or simulate, where probability instead expresses a degree of belief; despite this, we treat it with the same mathematical rules as frequentist probability.

Random Variables

A random variable represents an unknown quantity. By itself, a random variable just represents a set of possible states, so it must be coupled with a probability distribution describing how likely each of those states is.

Probability Distributions

The probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.

Discrete Variables

  • Probability over discrete variables is described using a PMF (Probability Mass Function).
  • You can have P(x), P(y), P(x, y)
  • The first two denote probability distributions over single random variables
  • The last is a joint probability distribution over both variables
  • Some requirements for a PMF are (summarized in the formulas below):
  • The domain of P must be the set of all possible states of x
  • For all x, P(x) must be between 0 and 1 inclusive
  • The sum of P(x) over all possible x must be 1 (normalization)
  • A uniform distribution over k states assigns probability 1/k to each state
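
In symbols, for a discrete random variable x with k possible states:

0 ≤ P(x) ≤ 1 for every state x

Σ over all x of P(x) = 1

Uniform distribution: P(x = x_i) = 1/k for each of the k states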

Continuous Variables

  • Probability over continuous variables is described using a PDF (Probability Density Function)
  • Some requirements for a PDF are (summarized in the formulas below):
  • The domain of p must be the set of all possible states of x
  • For all x, p(x) must be greater than or equal to 0; note that p(x) may exceed 1
  • The integral of p(x) over all x must equal 1
  • The probability that x lies in the interval [a, b] is given by the integral of p(x) dx over that interval
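
In symbols, for a continuous random variable x:

p(x) ≥ 0 for every state x

∫ p(x) dx = 1, integrating over the whole domain

P(x ∈ [a, b]) = ∫ from a to b of p(x) dx

For example, the uniform density on [a, b] is p(x) = 1/(b − a) inside the interval and 0 elsewhere, and its integral over [a, b] is exactly 1.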

Marginal Probability

  • The probability distribution over a subset of the variables
  • For example, if we know P(x, y) and we want P(x), we sum P(x, y) over all values of y (a small sketch follows below)
  • For continuous distributions, we integrate over y instead of summing
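
Here is a minimal sketch of discrete marginalization in NumPy; the joint table is made up purely for illustration:

```python
import numpy as np

# Hypothetical joint PMF P(x, y): rows index the 2 states of x,
# columns index the 3 states of y.
P_xy = np.array([
    [0.10, 0.25, 0.15],
    [0.20, 0.05, 0.25],
])

assert np.isclose(P_xy.sum(), 1.0)  # a valid joint PMF sums to 1

# Marginal P(x): sum the joint over all values of y
P_x = P_xy.sum(axis=1)   # -> [0.5, 0.5]

# Marginal P(y): sum the joint over all values of x
P_y = P_xy.sum(axis=0)   # -> [0.3, 0.3, 0.4]

print(P_x, P_y)
```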

Conditional Probability

The probability of some event happening, given that another event has happened, is called conditional probability.

  • P(y | x) = P(x, y) / P(x)
  • Where P(x) must be greater than 0, since we cannot condition on an event that never happens
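
As a quick worked example with made-up numbers: if P(x = 1, y = 0) = 0.2 and P(x = 1) = 0.5, then P(y = 0 | x = 1) = 0.2 / 0.5 = 0.4.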

The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables can be decomposed into conditional distributions over only one variable:
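
P(x_1, x_2, …, x_n) = P(x_1) · P(x_2 | x_1) · P(x_3 | x_1, x_2) · … · P(x_n | x_1, …, x_(n−1))

For example, with three variables: P(a, b, c) = P(a) · P(b | a) · P(c | a, b)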

It’s also called the Chain Rule or Product Rule of probability.

Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:
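
P(x, y) = P(x) · P(y), for all values of x and y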

Two random variables x and y are conditionally independent given a random variable z, if the conditional probability over x and y factorizes in this way for every value of z:
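
P(x, y | z) = P(x | z) · P(y | z), for all values of x, y and z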

Expectation, Variance, Covariance

The Expected Value, or Expectation, of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P:
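
E(f(x)) = Σ over all x of P(x) · f(x) for discrete variables

E(f(x)) = ∫ p(x) · f(x) dx for continuous variables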

Expectations have some convenient algebraic properties; in particular, they are linear:
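
E(α f(x) + β g(x)) = α E(f(x)) + β E(g(x)), for any constants α and β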

Variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

  • Low variance: the values of f(x) cluster around E(f(x))
  • High variance: the distribution is more spread out
  • The square root of the variance is called the Standard Deviation
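
Formally:

Var(f(x)) = E( (f(x) − E(f(x)))² )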

Covariance gives some sense of how much two variables are linearly related to each other, as well as of the scale of these variables:

  • High absolute values: the values change a lot and are both far from their respective expectations at the same time
  • Positive sign: both variables tend to take on relatively high values simultaneously
  • Negative sign: one tends to take on a high value when the other takes on a low value
  • Independent variables must have zero covariance
  • But zero covariance does not necessarily mean independence, since covariance only captures linear relationships
  • The covariance matrix of a random vector x ∈ R^n is an n×n matrix

The diagonal elements of the covariance matrix give the variances of the individual elements.
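
Formally:

Cov(f(x), g(y)) = E( (f(x) − E(f(x))) · (g(y) − E(g(y))) )

Cov(x)_(i, j) = Cov(x_i, x_j), with diagonal entries Cov(x_i, x_i) = Var(x_i)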

Up Next…

In the next article, I’ll talk about different distributions, useful filter functions and some useful theories from different disciplines. If you would like me to write another article explaining a topic in depth, please leave a comment.

For the table of contents and more content, click here.
