Understanding the moments of a random variable

Massimo Pierini
Oct 21, 2022


Given a random variable X and its distribution, the moments are summary indices that can be used to describe the distribution.

How are they defined and calculated? What do they mean?

Given a random variable X, its distribution is described by the probability mass function (PMF) pₓ for discrete variables or by the probability density function (PDF) f(x) for continuous variables.

The n-th moment of a random variable X is defined as the Expected Value of its n-th power

μₙ = E[Xⁿ] = Σ_{x∈S} xⁿ pₓ (discrete case) or μₙ = E[Xⁿ] = ∫_S xⁿ f(x) dx (continuous case)

Definition of the n-th moment of a random variable.

where S is the Support of the distribution.

A brief digression: what is the difference between the Domain and the Support of a function? Consider a random variable X whose PMF is defined as p₁ = 1 when you’re right and p₀ = 0 when you’re wrong: in this case Domain and Support are the same, {0,1}. But now take a random variable Y whose PMF is p₁ = 1 when “it itself” is right and p₀ = 0 when “it itself” is wrong. For p₁ there’s no problem, but for p₀ something weird happens: if “it itself” is wrong, it can’t be true that p₀ = 0; and if this is false, it must be true that p₀ = 1, but this is false because p₀ = 0, so it itself is wrong, and so on… This is the paradox of an undecidable sentence. In this case the Domain is {0,1}, because the function is defined there, but the Support is only {1}, because 0 is not… “supported”. Many functions in Statistics have a wider Domain but a narrower Support, such as the Continuous Uniform distribution U(a, b), where f(x) = 1/(b−a): its Domain is (−∞, +∞) but its Support is [a, b] only, and outside the Support f(x) is defined as 0.
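
As a quick check of the Uniform example, here is a minimal sketch (using scipy.stats, with a and b chosen arbitrarily) showing that the PDF is defined everywhere but non-zero only on the Support:

```python
from scipy.stats import uniform

a, b = 2.0, 5.0                  # arbitrary bounds for U(a, b)
U = uniform(loc=a, scale=b - a)  # scipy parameterizes U as [loc, loc + scale]

print(U.pdf(3.0))  # inside [a, b]: 1 / (b - a) ≈ 0.333
print(U.pdf(0.0))  # outside the Support: defined, but 0.0
print(U.pdf(9.0))  # outside the Support: defined, but 0.0
```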

Let us focus on each moment, with an example to understand what it means. We will use discrete variables in the examples, since they are easier to work with.

The 1st moment: mean

The first moment of a random variable distribution is the mean μ. It is mathematically defined as

μ = E[X]

First moment of a random variable distribution.

Let us take an example. Say we’ve got an urn with 100 marbles, numbered from 1 to 3: there are 30 marbles ①, 60 marbles ② and 10 marbles ③. We want to compute the arithmetic mean of the values. We can simply sum up all the values and then divide by the total number of marbles:

μ = (30·1 + 60·2 + 10·3) / 100 = 180/100 = 1.8

Arithmetic mean of the urn.

But, since we’ve a got sum at the numerator, we can split the fraction and simplify

μ = (30/100)·1 + (60/100)·2 + (10/100)·3 = 0.3·1 + 0.6·2 + 0.1·3 = 1.8

Split the sum and simplify.

What are 0.3, 0.6, and 0.1? Well, they are the probabilities of drawing each marble!

The probability of drawing a ① marble is 30/100 = 0.3, the probability of drawing a ② marble is 60/100 = 0.6, and the probability of drawing a ③ marble is 10/100 = 0.1.

So, we are summing up each marble value (x) multiplied by its probability (pₓ). We can then also write it as a summation

μ = Σ x·pₓ = 1·0.3 + 2·0.6 + 3·0.1 = 1.8

Arithmetic mean as a summation.

Generalizing to a random variable X with support S, we can say that

μ = E[X] = Σ_{x∈S} x pₓ

Arithmetic mean of a generalized random variable X with support S.

and, to generalize to a continuous random variable, knowing that a summation becomes an integral, we can write

μ = E[X] = ∫_S x f(x) dx

Mean of a generalized continuous variable X with support S.
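
A minimal sketch of this computation in Python (the urn distribution is encoded as a value → probability mapping; the names are my own):

```python
# Urn distribution: marble value -> probability of drawing it
pmf = {1: 0.3, 2: 0.6, 3: 0.1}

# First moment: mu = sum over the support of x * p_x
mu = sum(x * p for x, p in pmf.items())
print(mu)  # 1.8
```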

The mean μ is a measure of the location of a distribution: where the data “concentrate”.

Let us now visualize our example because it will be useful for the higher moments:

Distribution and mean of the marbles in the urn.

The 2nd moment: variance

The second moment of a random variable distribution is the variance σ². It is not exactly the 2nd moment of the variable X itself, but the 2nd moment of the error (X − μ) of the variable X, which is why it is also called the 2nd central moment

σ² = E[(X − μ)²] = Σ_{x∈S} (x − μ)² pₓ

Variance of a random variable X.

What are we doing? We’re taking the squared error between each value of our variable and the mean, multiplying it by the probability, and summing everything up. We’re calculating the Mean Squared Error (MSE) of the distribution, which is an alias for variance: Error because we’re subtracting the mean, Squared because we’re raising the error to the power of 2, Mean because (as earlier) we’re multiplying each value by its probability.

The standard deviation σ is defined as the square root of the variance, and is indeed also called the Root Mean Squared Error (RMSE).

We’re taking the error (X-μ) because we want an index of the dispersion about the mean; we’re raising it to the power of 2 (X-μ)² because we’re not interested in the sign; we’re taking the average of the squared errors (X-μ)²pₓ because we want a synthetic index of the entire distribution.

So the variance of our urn will be

σ² = (1 − 1.8)²·0.3 + (2 − 1.8)²·0.6 + (3 − 1.8)²·0.1 = 0.192 + 0.024 + 0.144 = 0.36

Variance of the urn.

and the standard deviation will be its square root

σ = √0.36 = 0.6

Standard deviation of the urn.
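
Continuing the sketch from the previous section, the variance and standard deviation of the urn (reusing the pmf dictionary from above):

```python
pmf = {1: 0.3, 2: 0.6, 3: 0.1}
mu = sum(x * p for x, p in pmf.items())  # 1.8

# Second central moment: variance = sum of (x - mu)^2 * p_x
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = var ** 0.5

print(var)    # 0.36 (up to floating-point rounding)
print(sigma)  # 0.6
```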

Let us visualize the standard deviation of the distribution

Standard deviation of the marbles in the urn.

So, the standard deviation, derived from the variance, is a measure of the dispersion of data about the mean: how far, on average, the data are from the mean. Usually, data that are more than 3 (or 4) standard deviations from the mean are called outliers: the most extreme values of the distribution. In our example we have no outliers, because all values are within 2 standard deviations from the mean: |② − 1.8| = 0.2, which is less than one σ; |① − 1.8| = 0.8, which is less than 2σ; |③ − 1.8| = 1.2, which is exactly 2σ. Note that here we didn’t take the “square root of the squared error” but the “absolute value of the error”; that’s exactly the same thing, because |x| = √x².

The 3rd moment: asymmetry

The 3rd moment of a random variable is the asymmetry (or skewness) index γ₁. As with the variance, it is not exactly the 3rd moment of the variable X but the 3rd moment of Zₓ, the “standardized version” of X.

To standardize a variable means to take its error from the mean and divide it by the standard deviation: this leads to a transformed variable with mean μ = 0 and standard deviation σ = 1. Why do we do this? Because with a standardized variable we can focus on measures that do not depend on the location or on the dispersion.

The “standardized version” of a variable X is usually indicated as Zₓ

Zₓ = (X − μ) / σ

Standardization of a random variable X.

Let us first standardize and visualize our urn

Standardized urn.

Note that on the x-axis we no longer read the values of the marbles but the number of standard deviations from the mean (which we indeed calculated earlier). Note also that we’ve got more positive values “to the right of zero” than negative values.
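
As a quick check, here is a minimal sketch computing the standardized values of the urn and verifying that Zₓ has mean 0 and variance 1 (reusing the names from the earlier snippets):

```python
pmf = {1: 0.3, 2: 0.6, 3: 0.1}
mu, sigma = 1.8, 0.6  # computed earlier

# Standardized values: z = (x - mu) / sigma
z = {x: (x - mu) / sigma for x in pmf}
print(z)  # {1: -1.33..., 2: 0.33..., 3: 2.0}

# Sanity check: the standardized variable has mean 0 and variance 1
z_mean = sum(z[x] * p for x, p in pmf.items())
z_var = sum(z[x] ** 2 * p for x, p in pmf.items())
print(round(z_mean, 10), round(z_var, 10))  # 0.0 1.0
```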

So, given a standardized random variable Zₓ, the 3rd moment is defined as

γ₁ = E[Zₓ³] = Σ_{x∈S} zₓ³ pₓ

The third moment of a standardized random variable X.

Why should this be a measure of asymmetry and why are we taking Zₓ to the power of 3?

We take the power of 3 because we want to (see the short sketch after this list):

  1. keep the sign of Zₓ: if we raised it to an even power, we would lose the sign;
  2. shift towards 0 the values within one standard deviation, and push away from the mean the values at more than one standard deviation from it (the extreme values); in fact |Zₓ³| < |Zₓ| if |Zₓ| < 1 and |Zₓ³| > |Zₓ| if |Zₓ| > 1;
  3. avoid the power of 1, because Σ zₓ pₓ = 0! Why? Simply because it is the 1st moment of the standardized variable Zₓ, i.e. its mean, and we already know that the mean of Zₓ is zero.
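
A minimal numeric sketch of point 2, using the standardized urn values we computed above:

```python
# Cubing pushes values inside one sigma towards zero and
# amplifies the extreme ones, while keeping the sign
for z in (0.33, -1.33, 2.0):
    print(z, z ** 3)
# 0.33  ->  0.0359...  (|z| < 1: shrunk towards 0)
# -1.33 -> -2.352...   (|z| > 1: pushed away, sign kept)
# 2.0   ->  8.0
```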

Let us calculate and visualize our “cubic standardized” variable

Cubic standardized urn values.
Cubic standardized urn distribution.

We see that, as expected: ② is shifted towards zero, being within one σ; ① is shifted away from zero, but not as much as ③, which is the most extreme value of our distribution.

Let us now calculate the asymmetry index γ₁

γ₁ = (−4/3)³·0.3 + (1/3)³·0.6 + 2³·0.1 ≈ −0.711 + 0.022 + 0.8 ≈ 0.1

Asymmetry index of the urn distribution.
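
The same computation in Python, as a one-line continuation of the earlier sketch:

```python
pmf = {1: 0.3, 2: 0.6, 3: 0.1}
mu, sigma = 1.8, 0.6

# Third moment of the standardized variable: gamma_1 = sum of z^3 * p_x
gamma1 = sum(((x - mu) / sigma) ** 3 * p for x, p in pmf.items())
print(gamma1)  # ≈ 0.111
```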

So, our asymmetry index is γ₁ ≈ 0.1, but what does it mean? The asymmetry index is null (γ₁ = 0) when the distribution is symmetric. When γ₁ > 0 we say that there is a positive asymmetry, or a right asymmetry: the distribution has a longer tail to the right, i.e. more extreme values to the right of the mean than to the left. When γ₁ < 0 we say that there is a negative asymmetry, or a left asymmetry: the distribution has a longer tail to the left, i.e. more extreme values to the left of the mean than to the right.

Let us visualize it with a Binomial distribution (which we’ll talk about in detail in a further story)

Examples of asymmetry with a Binomial distribution.
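
To see the sign of γ₁ change, here is a minimal sketch using scipy.stats.binom (n and the values of p are chosen arbitrarily): the Binomial distribution is right-skewed for p < 0.5, symmetric for p = 0.5, and left-skewed for p > 0.5:

```python
from scipy.stats import binom

n = 10
for p in (0.2, 0.5, 0.8):
    skew = binom.stats(n, p, moments='s')  # 's' asks for the skewness
    print(p, float(skew))
# p = 0.2 -> positive gamma_1 (longer right tail)
# p = 0.5 -> 0 (symmetric)
# p = 0.8 -> negative gamma_1 (longer left tail)
```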

Why is γ₁ an index of asymmetry? Because it is zero when the data are balanced around the mean, which is the definition of symmetry. (Strictly speaking, symmetry implies γ₁ = 0 but not vice versa: a distribution can have γ₁ = 0 without being symmetric.)

So, since the asymmetry index of the urn distribution is γ₁ = 0.1, we can say that our distribution is not symmetric and has got a positive (right) asymmetry, i.e. a longer tail to the right of the mean.

The 4th moment: kurtosis

Let’s finally focus on the last moment: the kurtosis γ₂. As before, it’s not exactly the 4th moment of a variable X but the 4th moment of the standardized variable Zₓ.

Thus, given a standardized variable Zₓ, the kurtosis is defined as

γ₂ = E[Zₓ⁴] = Σ_{x∈S} zₓ⁴ pₓ

The fourth moment of a standardized random variable X.

Since we’re taking Zₓ to an even power, we can immagine that this index has something to do with the data dispersion, because we’re ignoring the sign. Why didn’t we take it to the power of 2? Well, because it is always equal to the variance, that is 1…

E[Zₓ²] = Var(Zₓ) = 1

The 2nd moment of a standardized random variable is always equal to its variance, i.e. 1.

In this case too, when we take Zₓ to the power of 4, the values of Zₓ² less than one are shifted towards zero and those greater than 1 are pushed towards +∞.

Kurtosis, then, tells us something about the dispersion, but it is not “how much average dispersion” the data show (that’s the standard deviation); rather, it is “how the data disperse from the mean”. There is no natural “zero” value of kurtosis, so the reference value is the kurtosis of the Normal distribution (we’ll talk about this important distribution in detail in further stories), which is γ₂ = 3. Indeed, kurtosis is often reported as excess kurtosis γ₂ − 3, so that when γ₂ − 3 = 0 the distribution is said to be mesokurtic (i.e. as kurtic as the Normal distribution). When γ₂ − 3 < 0 the distribution is platykurtic (less kurtic than the Normal distribution) and when γ₂ − 3 > 0 it is leptokurtic (more kurtic than the Normal distribution).

Kurtosis is a more difficult concept to understand, so let’s take a look at an example.

Example of kurtosis. The black line is a Normal distribution that is, by definition, mesokurtic i.e. γ₂-3 = 0.
  1. leptokurtic distributions have got heavier (“fatter”) tails with respect to the Normal distribution
  2. platykurtic distributions have got lighter (“thinner”) tails with respect to the Normal distribution

So, let us finally calculate the fourth power of the standardized variable, Zₓ⁴, and the kurtosis of our urn distribution

γ₂ = (−4/3)⁴·0.3 + (1/3)⁴·0.6 + 2⁴·0.1 ≈ 0.948 + 0.007 + 1.6 ≈ 2.56
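
And in Python, completing the sketch from the previous sections:

```python
pmf = {1: 0.3, 2: 0.6, 3: 0.1}
mu, sigma = 1.8, 0.6

# Fourth moment of the standardized variable: gamma_2 = sum of z^4 * p_x
gamma2 = sum(((x - mu) / sigma) ** 4 * p for x, p in pmf.items())
print(gamma2)      # ≈ 2.56
print(gamma2 - 3)  # excess kurtosis ≈ -0.44 -> platykurtic
```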

Our urn distribution is then platykurtic (γ₂ ≈ 2.56 < 3), and we can verify this by comparing it with a Normal distribution with the same mean and variance

Comparison between the urn distribution and the Normal distribution with same mean and variance.

We see that the urn distribution “falls a little bit more softly” than the Normal distribution: the probability masses of ① and ③ are indeed greater than the corresponding values of the Normal probability density function.
