To expect or not to expect?

One of the most important formulas in statistics

Adam Lewandowski
10 min read · Oct 6, 2023

Introduction

Uncertainty is an unavoidable part of our lives. To solve problems, we have to deal with it in some way. There are two approaches: maximize certainty or minimize uncertainty. The former can be achieved by improving the quality of our knowledge or the consistency of our data. The latter requires us to look at random things differently, at how samples relate to each other. For this we can use the Expected Value, a statistical function that is sometimes misunderstood because of the everyday meaning of its name.

In this article we will discover why that happens, what the reasoning behind the concept is, and when we can use this tool in our work.

General definition

Expected Value (written as E(X), E[X], or EX) "is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a large number of independently selected outcomes of a random variable. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would "expect" to get in reality." [1]

Let’s focus on the interpretation of the highlighted phrases:

  • "weighted average" – we use the mean to get the most representative value, one that can stand in for all of the arguments in some context; it is the value closest to all of the arguments (closest in squared distance; with absolute distance, the median is the closest value). The weighted average generalizes the ordinary mean by giving each argument its own weight,
  • "large number of independently selected outcomes" – a large number of samples is needed to approximate the Expected Value; without meeting this condition, the number we get may land close to or far from the real expectation,
  • "it is not the value you would "expect" to get in reality" – since it is an aggregation of the samples, it describes the dataset as a whole; we should not mix these two separate abstraction levels, even if the mean value happens to appear in the dataset. It is like comparing an apple with a box of apples – they look related, but they are not the same concept.

What we know now is that the Expected Value is a (weighted) mean of the numbers in a large dataset, and we cannot even expect it to appear in the dataset.

If we look at the formula, we can clearly see how easy it is to calculate:
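For a discrete random variable X that takes the unique values x_i with proportions (probabilities) p_i, the standard discrete form is:

$$E[X] = \sum_{i} x_i \, p_i$$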

We simply sum, over the unique values in the dataset, each value multiplied by its proportion. Each value belongs to the dataset; it is just a unique value that may occur many times. The proportion carries the higher-level information about the value: how often it is repeated.
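As a quick illustration (a minimal sketch with made-up numbers, not code from the article), the same calculation in Python:

```python
from collections import Counter

# Hypothetical dataset: unique values, some repeated several times.
data = [1, 2, 2, 3, 3, 3, 6]

counts = Counter(data)   # how many times each unique value occurs
n = len(data)

# Expected value as the sum of value * proportion over the unique values.
expected_value = sum(value * (count / n) for value, count in counts.items())

# It coincides with the plain arithmetic mean of the whole dataset.
print(expected_value, sum(data) / n)
```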

Why are we talking about the average of the values in a dataset? It is so simple to calculate, so why is it so popular? When do we get the real Expected Value? Can we even think about it if we have a small dataset? Why would we want to calculate it at all? 🤨

To get the answers we must go back to the historical roots, to the year 1494…

Problem of points

The Renaissance was a flowering period for accounting in Italy, especially in the Florentine banking system. Pacioli, a father of modern accounting, collected best practices such as double-entry bookkeeping, future rewards and many other accounting-related ideas in his book “Summa de arithmetica, geometria, proportioni et proportionalità”. In it he also introduced a new problem related to the area of Game Theory. [2]

We have a fair, two-sided coin and two players (P1 and P2) who bet a total stake of K. If the coin lands on the reverse (tails), player 1 wins the round; if it lands on the obverse (heads), player 2 wins the round. The first player to win Q rounds takes the whole stake. At some point during the game, however, they are forced to stop and divide the stake fairly between them. How should the stake be divided if the first player has won X rounds (X < Q) and the second has won Y rounds (Y < Q)?

Several solutions to this problem were proposed. [3 – Katz, Victor J. (1993). A History of Mathematics. HarperCollins College Publishers, pp. 445–446]

Pacioli’s solution

Based on the history of the game so far, we can make a division of the money. The author suggests dividing the stake in proportion to the rounds already won. We can write it as the odds:
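In the notation of the problem above, this amounts to splitting the stake in the ratio of rounds already won:

$$\text{share}_{P1} = K \cdot \frac{X}{X + Y}, \qquad \text{share}_{P2} = K \cdot \frac{Y}{X + Y}$$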

Ignoring any information about how the game might have continued leads to strange results. If the players stop after a single round, even if they had planned to play to 100 wins, the winner of that single round gets the whole reward. That sounds very unfair to the second player. 😕

Cardan’s solution

Cardan suggested dividing the total based on the number of rounds each player still needs to win. This looks fairer, since we are now using all of the data we have, about both the past and the future.
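One possible reading of this rule, in the notation above, is to split the stake in proportion to the rounds the opponent still needs to win:

$$\text{share}_{P1} : \text{share}_{P2} = (Q - Y) : (Q - X)$$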

It gives a fairer division for both players when they stop after a single round (at 1:0 in a game up to 10 wins, the money is divided in the proportion 10:9). On the other hand, when the difference between the players is small but the game is nearly over, the division no longer looks perfectly fair, especially for the losing side (at 9:8 in a game up to 10 wins, the money is divided in the proportion 3:1). 😒

Tartaglia’s solution

To mitigate this issue, we have to take the difference between the players into account. In Tartaglia’s formula, the leading player receives half of the stake plus half of the ratio of his advantage over the other player to the number of rounds required to win the game.
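In the notation above, with P1 as the leading player (X ≥ Y), this gives:

$$\text{share}_{P1} = K \left( \frac{1}{2} + \frac{X - Y}{2Q} \right), \qquad \text{share}_{P2} = K - \text{share}_{P1}$$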

Looking at the formula, most of the common-sense requirements seem to be captured: the lead of one player is taken into account, it works for any number of played rounds, and when both players have won the same number of rounds the stake is divided equally. Yet again it fails in some cases: at 9:5 in a game up to 10 wins, it favours the losing side too much; it is better for him to stop the game.

The Pascal, Fermat and Huygens solution

Finally, we reach the fairest solution, created by the fathers of modern probability theory. 😀 Instead of relying only on the lead or on the remaining rounds, they based their calculation on the possible outcomes if the game were to continue. In a sense, it combines the ideas of the previous solutions.
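A minimal sketch of this idea in Python (the function names and parameters are my own, not from the article): enumerate the at most (Q − X) + (Q − Y) − 1 remaining rounds and sum the probabilities of the outcomes in which player 1 reaches Q wins first.

```python
from math import comb

def p1_win_probability(needed_p1: int, needed_p2: int, p: float = 0.5) -> float:
    """Probability that player 1 wins the match, given how many rounds
    each player still needs and the chance p that player 1 wins a single round."""
    rounds_left = needed_p1 + needed_p2 - 1   # the match is decided within this many rounds
    return sum(
        comb(rounds_left, k) * p**k * (1 - p) ** (rounds_left - k)
        for k in range(needed_p1, rounds_left + 1)
    )

def fair_split(stake: float, q: int, x: int, y: int, p: float = 0.5) -> tuple[float, float]:
    """Split the stake in proportion to each player's chance of winning if play continued."""
    p1 = p1_win_probability(q - x, q - y, p)
    return stake * p1, stake * (1 - p1)

# Example: 9:5 in a game up to 10 wins with a fair coin.
print(fair_split(100, q=10, x=9, y=5))
```

At 9:5 in a game up to 10 wins, this gives player 1 about 97% of the stake, compared with the 70% he would receive under Tartaglia’s rule.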

So the Expected Value comes from the solution of this simple problem (simple for us, knowing modern elementary maths 🤭). One last question about the problem remains: if the players decided to play indefinitely, how would this value differ from our currently calculated results?

Law of large numbers [4]

What would happen if the players decided to keep playing indefinitely?

  • the game would get longer,
  • the players would have fewer won rounds than remaining rounds,
  • at some point the difference between them would become insignificant,
  • the Expected Value of the game for each of them would tend to half of the reward,
  • we would have less knowledge, so to be fair we would have to divide the reward equally between them.

When both of them have an equal chance, we know that EX = 0.5 (each player expects half of the reward). But if the coin is biased in favour of player 1, what would change?

Suppose the current status of the game is 9:1. The game has been played many times before with similar results: 8:2, 9:3, 8:1, 9:0. We can conclude that player 1 has a greater chance of winning a round. To estimate that chance, we can calculate the mean frequency of his wins across those games (i.e. the empirical probability). Now we can continue calculating the Expected Value, but we have to include the probability of obtaining the reward, i.e. the probability of the chain of events needed to win, each with its own chance of happening.

When we calculate it, we will see that the expected reward tends towards a value determined by the probability of winning a single round. Moreover, if we recalculate the frequency after a few new rounds, it will stay close to our base probability, and it too tends to a fixed value as more rounds are played.
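Continuing the sketch above (it reuses the hypothetical fair_split function, and the past results are the invented ones from the example): estimate player 1's per-round win probability from the earlier games and plug it into the same calculation.

```python
# Wins of player 1 and player 2 in the earlier, similar games from the example.
past_results = [(8, 2), (9, 3), (8, 1), (9, 0)]

wins_p1 = sum(x for x, _ in past_results)
rounds_played = sum(x + y for x, y in past_results)
p_hat = wins_p1 / rounds_played   # empirical chance that player 1 wins a single round

# Fair division at the current 9:1 score in a game up to 10 wins, using the estimated bias.
print(round(p_hat, 2), fair_split(100, q=10, x=9, y=1, p=p_hat))
```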

So both the empirical probability and the mean reward tend to fixed values as we gather more data! 😮

This is called the Law of Large Numbers, and there are two versions of it, depending on our assumptions.

Weak law (Khinchin’s law)

The weak law of large numbers states that as we collect more and more samples, the sample mean (or statistics based on the mean) tends to the Expected Value. However, it is only required to be close to this value, within a range EX ± ε, with probability approaching one. This law is widely used, because it is not always possible to go below a given threshold.
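Formally, writing $\bar{X}_n$ for the mean of n independent, identically distributed samples, the weak law says that for every ε > 0:

$$\lim_{n \to \infty} P\big(\,|\bar{X}_n - E[X]| > \varepsilon\,\big) = 0$$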

Strong law (Kolmogorov’s law)

In comparison, the strong law of large numbers requires the sample mean to converge to exactly the same value as the number of samples grows. What does “grows” mean? We have to collect a large dataset to ensure the stability of the statistic. As the number of samples grows, we also collect more information about the population. We can conclude that the more samples we get, the closer our statistic will be to the population value. If we could collect every sample from the population, our statistic would equal the Expected Value. In theory this is possible; in practice, not exactly. 🙁
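In the same notation, the strong law states that the sample mean converges to the Expected Value with probability one:

$$P\Big(\lim_{n \to \infty} \bar{X}_n = E[X]\Big) = 1$$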

Experiments [5]

Now we will do some experiments to demonstrate the above theory in practice.

Dice rolls (6 sides)

Suppose we have a fair 6-sided die (uniform distribution). We roll it ~1000 times and record which side comes up. The table below shows the results of each roll, including the cumulative sum, the running mean, the expected mean, and a check of the weak law of large numbers condition.

As we can see from the table, the more samples we have, the closer the sample mean gets to the expected value (the population mean, 3.5 for a fair 6-sided die). After some number of rolls, the difference between the sample mean and the expected value drops below a low threshold.

Although some sides appear more often than others during the rolling process, this does not really matter for the mean: the sample mean is an unbiased estimator of the expected value, so it still converges towards it.

If we look at a single run of the experiment, we can see that the sample mean gets closer and closer to the population mean. The more samples are included in the calculation, the more stable the average becomes. This builds our trust that the law of large numbers works in practice.
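A minimal sketch of this experiment in Python (the function name, seed and threshold ε are my own choices; the original table and plots are not reproduced here): roll a fair die, track the running mean, and report from which roll onward the weak-law condition |mean − EX| < ε holds.

```python
import random

def running_mean_experiment(sides: int = 6, n_rolls: int = 1000,
                            epsilon: float = 0.05, seed: int = 42):
    """Roll a fair `sides`-sided die `n_rolls` times and track the running mean."""
    rng = random.Random(seed)
    expected = (1 + sides) / 2            # population mean of a fair die
    total = 0
    means = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, sides)
        means.append(total / i)
    # First index after which the running mean stays within epsilon of EX.
    stable_from = next(
        (i + 1 for i in range(n_rolls)
         if all(abs(m - expected) < epsilon for m in means[i:])),
        None,
    )
    return expected, means[-1], stable_from

print(running_mean_experiment(sides=6))
```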

Dice rolls (36 sides)

Let’s see what would change in the results if we used a fair 36-sided die instead.

We use the same stopping criteria as in the previous experiment. At sample number 10 we can compare the difference between the sample mean and the expected value across the two experiments: they look similar, although for the 36-sided die it is a little higher due to the higher variance.

As mentioned before, the variance is higher, which is confirmed by the histogram above. To achieve the same level of stability as with the 6-sided die, we would have to roll far more than 1000 times, e.g. 1E6.

The sample mean is not as close to the population mean as it was for the 6-sided die. Comparable closeness is reached only after ~400 rolls, about 2 times more than for the 6-sided die.
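With the same sketch as before, only the parameters change; a looser threshold here is my own choice, reflecting the higher variance:

```python
# 36-sided die: expected value (1 + 36) / 2 = 18.5, but much higher variance.
print(running_mean_experiment(sides=36, n_rolls=1000, epsilon=0.5))
```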

Stability point

Interestingly, we can continue these experiments, repeat them many times, and draw conclusions at a much higher level.

Over 30 repetitions of the experiment, we record the index of the sample from which onwards the running mean stays very close to the expected value.

Then we calculate standard statistics, such as the mean and standard deviation, of this series of indices.

The resulting distribution can be approximated by a normal distribution, so we will use this rule. Next, recall that in the range μ ± 2σ there is a ~95% probability of obtaining a value from a normal distribution, and the probability of a value falling in the right tail is ~2.5%. To sum up, if we choose μ + 2σ as the stability point (the point from which stable, very small deviations of the sample mean are practically guaranteed), there is still a ~2.5% risk of obtaining unstable sample means when we repeat the experiment, e.g. in pressure measurement where we deal with measurement uncertainty.
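A sketch of this higher-level experiment, reusing the hypothetical running_mean_experiment function from the earlier sketch (the threshold and the number of repetitions are again my own choices):

```python
from statistics import mean, stdev

# Repeat the 6-sided experiment 30 times with different seeds and collect
# the index from which the running mean stays within epsilon of EX.
indices = [
    running_mean_experiment(sides=6, n_rolls=1000, epsilon=0.1, seed=s)[2]
    for s in range(30)
]
indices = [i for i in indices if i is not None]   # drop runs that never stabilised

mu, sigma = mean(indices), stdev(indices)
stability_point = mu + 2 * sigma    # ~2.5% risk that a new run stabilises later than this
print(round(mu), round(sigma), round(stability_point))
```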

Normal distribution data probability. Image from Wikipedia.

Conclusions

To sum up, we covered the statistical definition of the term, its historical roots, and how it can be used in analytical thinking at different abstraction levels. It is a very useful concept, giving us more predictive power when we are dealing with random processes.
