To begin your journey into probabilistic deep learning, we must first revisit some fundamentals of statistics and probability theory. Most of the definitions are taken from the Model-Based Machine Learning book, but elaborated in the context of simple examples. If you do not already know some of these definitions, write them down on flashcards and learn their definitions and formulas as if they were vocabulary.
Context of our first examples: you have breakfast with your grandma once a week, but she's a bit forgetful and never shows up on the same day; on top of that, she always brings a delicious cake that you are not skilled enough to bake yourself on the other days of the week.
Probability: a measure of how (un)certain your decision is. It lies between 0 and 1, where 0 means impossible and 1 means certain. Probabilities are abbreviated as P and are often expressed as percentages (such as 0%, 50%, and 100%) or as fractions (such as 1/5, 3/7, and 7/8).
Example: every morning you ask yourself what’s the chance you’ll have cake for breakfast? It’s P(breakfast = grandma’s cake) = 1/7.
Random variable: a named quantity (= variable) whose value is uncertain. Consequently, we can only give a probability that this quantity has a given value.
Example: the random variable is breakfast, and its value is grandma’s cake. The probability that breakfast will be grandma’s cake is 1/7.
Normalization constraint: the law of probability theory that the probabilities of all possible values of a random variable must add up to one.
Example: on the mornings your grandma doesn't come, you eat cereals for breakfast.
P(breakfast = cereals) = 6/7
P(breakfast = grandma’s cake) + P(breakfast = cereals) = 1
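The normalization constraint is easy to check in code. A minimal Python sketch (the dict and variable names are mine, for illustration):

```python
import math

# Probabilities of every possible value of the random variable `breakfast`.
breakfast_probs = {"grandma's cake": 1 / 7, "cereals": 6 / 7}

# Normalization constraint: the probabilities must add up to one
# (isclose absorbs floating-point rounding).
total = sum(breakfast_probs.values())
assert math.isclose(total, 1.0)
```

If the assertion fails, the numbers you wrote down are not a valid probability distribution.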
Probability distribution: a function which gives the probability for every possible value of a random variable.
Probability mass function: a function of a discrete random variable that gives the probability that the variable is exactly equal to a given value; summing it over a set of values gives the probability that the variable takes a value in that set. We must define it for each possible value. It's abbreviated as PMF.
Example: f(grandma's cake) = P(breakfast = grandma's cake) = 1/7
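In code, a PMF for the breakfast example is just a function from values to probabilities. A sketch (the function name is mine, not from the text):

```python
def breakfast_pmf(value: str) -> float:
    """PMF of the random variable `breakfast`: one probability per possible value."""
    probs = {"grandma's cake": 1 / 7, "cereals": 6 / 7}
    # Values outside the distribution are impossible, i.e. probability 0.
    return probs.get(value, 0.0)

print(breakfast_pmf("grandma's cake"))  # 0.14285714285714285 (= 1/7)
```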
Probability density function: a function of a continuous random variable, whose integral across an interval gives the probability that the value of the variable lies within the same interval. It’s abbreviated as PDF.
Example: our breakfast example has only a PMF, because breakfast is a discrete variable; the probability distribution of the day temperature, a continuous variable, has a PDF. The PDF of a normal (= "bell-shaped") distribution with mean m and standard deviation σ is f(x) = 1/(σ·√(2π)) · exp(−(x − m)² / (2σ²)).
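The normal PDF is short enough to write out by hand. A sketch assuming a mean of 20 °C and a standard deviation of 5 °C (both numbers are illustrative, not from the text):

```python
import math

def normal_pdf(x: float, m: float = 20.0, sigma: float = 5.0) -> float:
    """Density of a normal distribution with mean m and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - m) ** 2) / (2.0 * sigma ** 2))

# The density is highest at the mean and symmetric around it.
print(normal_pdf(20.0) > normal_pdf(25.0))   # True
print(normal_pdf(15.0) == normal_pdf(25.0))  # True
```

Note that a density value is not a probability; only its integral over an interval is.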
Expected value: the long-run average value over many repetitions of the experiment it represents. We weigh each possible value x with its probability p(x): E[X] = Σ x · p(x).
If you don't know the probability of each observation, calculate the mean instead: m = (x_1 + x_2 + … + x_N) / N. You'll see it's exactly the same. The mean is abbreviated as m.
Example: You can calculate
E[breakfast] = grandma's cake · 1/7 + cereals · 6/7,
but it doesn't give you much insight, because breakfast is not a numerical variable. In general, the expected value is much more often used with continuous variables. The mean of our day temperature example is approx. 20 °C.
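To see that the probability-weighted sum and the plain mean agree, here is a sketch with made-up hourly temperatures (the numbers are mine, chosen so the mean comes out at 20 °C):

```python
from collections import Counter

# Made-up temperature observations; N = 10 for brevity.
temperatures = [10, 15, 15, 20, 20, 20, 20, 25, 25, 30]
N = len(temperatures)

# Plain mean: sum of the observations divided by their number.
m = sum(temperatures) / N

# Expected value: weigh each distinct value x with its probability p(x),
# estimated here as its relative frequency in the sample.
expected = sum(x * count / N for x, count in Counter(temperatures).items())

print(m)  # 20.0
```

Both routes give the same number; the weighted sum just groups equal observations first.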
Median: the value separating the higher half of a data sample, a population, or a probability distribution, from the lower half.
Mode: the value that appears most often.
Example: Mode of breakfast = cereals, because it appears more often than grandma’s cake.
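Python's standard library computes both summaries directly. A sketch with made-up temperature data (illustrative numbers) and the breakfast example:

```python
from statistics import median, mode

temperatures = [10, 15, 15, 20, 20, 20, 20, 25, 25, 30]
print(median(temperatures))  # 20.0 — separates the lower half from the upper half
print(mode(temperatures))    # 20 — the value that appears most often

# One week of breakfasts: six mornings of cereals, one of grandma's cake.
breakfasts = ["cereals"] * 6 + ["grandma's cake"]
print(mode(breakfasts))      # cereals
```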
Variance: a measure of how much the values vary around the mean. It is not the average absolute deviation of the values from the mean (that is called the mean absolute deviation). We calculate the squared distance between each observation x and the mean m of all x's, sum these squared distances, and divide by the number of observations N: σ² = (1/N) · Σ (x − m)². It's abbreviated as σ².
Example: picture the temperature distribution above as a histogram. The x's are the individual observations, e.g. 10 °C, 15 °C, or 30 °C, and m is their mean, 20 °C. If we measured the temperature every hour, we would have N = 24.
Mean absolute deviation: the average of the absolute deviations from the mean m: (1/N) · Σ |x − m|.
Standard deviation: the square root of the variance. Abbreviated as σ.
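The three dispersion measures can be computed side by side, which also shows that variance and mean absolute deviation are different quantities. A sketch with made-up temperature observations (illustrative numbers):

```python
import math

temperatures = [10, 15, 15, 20, 20, 20, 20, 25, 25, 30]
N = len(temperatures)
m = sum(temperatures) / N  # mean

# Variance: average *squared* distance from the mean.
variance = sum((x - m) ** 2 for x in temperatures) / N

# Mean absolute deviation: average *absolute* distance from the mean.
mad = sum(abs(x - m) for x in temperatures) / N

# Standard deviation: square root of the variance, back in the units of x.
sigma = math.sqrt(variance)

print(variance, mad)     # 30.0 4.0 — note: variance ≠ mean absolute deviation
print(round(sigma, 2))   # 5.48
```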