Heads or Tails: Parameter Estimation for the Coin Toss in Cricket

Pankaj Agarwal
Published in Analytics Vidhya · 11 min read · May 11, 2020
Source — rediff.com: Toss between India (Virat Kohli) and South Africa (Faf du Plessis)

Objective

The goal of this post is to introduce readers to a few popular statistical parameter estimation techniques. It uses the coin toss in the game of cricket as a practical application, so readers can relate to the examples easily. The article is written in dialogue format, with the statistical concepts explained in between; you can read or skip the statistics sections depending on your interest.

Situation

South African cricket team captain Faf du Plessis has lost the toss 9 times in a row so far. His next match is against India, and he is planning his strategy for the 10th toss; he doesn’t want to lose another one. These are the 3 options going through his mind —

  • Should he call Heads?
  • Should he call Tails?
  • Should he ask someone else to call the toss since he is out of luck?

He consulted Mark Boucher, the coach of the team, about his dilemma.

Conversation I: between Faf du Plessis and Mark Boucher

Faf du Plessis: Mark, I have lost 9 tosses in a row. I don’t want to lose the 10th one and become a subject of mockery. Please help me.

Mark: Yeah, I know. It’s pretty awkward to be in this state. By the way, what did you call in those 9 tosses? [Players usually stick to either heads or tails depending on their belief.]

Faf du Plessis: I am a believer in heads and call the same every single time.

Mark: Then in my opinion you should stick to heads in the next toss.

Faf du Plessis: I would like that, since that’s my natural choice. But are you sure about this?

Mark: It’s simple! By the “law of averages”, the chances of heads are higher after so many tails. However, let me check with our team’s analyst and take their opinion too, but I recommend heads.

Law of Averages: the supposed principle that future events are likely to turn out so that they balance any past deviation from a presumed average.

Conversation II: between Mark Boucher and Analyst

Now suppose you are the analyst in the team. Mark approaches you for help and informs you of his opinion about calling heads in the next toss.

You: Mark, your belief that heads will be the likely result in the next toss after 9 consecutive tails is a classic example of what we call Gambler’s Fallacy in Statistics. Since the coin tosses are independent, the probability of heads or tails in the 10th toss doesn’t depend on the results of previous tosses. It will still be 1/2 for an unbiased coin. The “law of averages” is a myth. A more appropriate rule is called “the law of large numbers”.

Gambler's Fallacy: The erroneous belief that if a particular event has occurred more frequently than normal in the past, it is less likely to happen in the future (or vice versa), when it has otherwise been established that the probability of such events does not depend on what has happened in the past.
Law of Large Numbers: According to the law of large numbers, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
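
For readers who want to see the law of large numbers in action, here is a quick simulation sketch (my addition; plain Python, nothing beyond the standard library) that tracks the running fraction of heads for a fair coin:

import random

random.seed(7)
heads, tosses = 0, 0
for target in (10, 100, 1_000, 10_000, 100_000):
    while tosses < target:
        heads += random.random() < 0.5  # True (1) counts as a head
        tosses += 1
    # The running fraction drifts toward the expected value 0.5
    print(f"{tosses:>6} tosses: fraction of heads = {heads / tosses:.4f}")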

Mark: So you are saying that Faf still has an equal chance of getting either heads or tails in the 10th toss? What rubbish is this? How on earth did he get 9 consecutive tails then?

You: I understand your concern, Mark. The probability of 9 consecutive tails (assuming the coin is unbiased) is 1/2⁹ ≈ 0.00195. This is a very small number, but still not zero. Hence, in my opinion, Faf’s luck has been really bad.
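
A quick simulation (my addition, standard library only) confirms that number:

import random

random.seed(42)
trials = 1_000_000
# Count how many length-9 sequences of fair tosses come up all tails
all_tails = sum(
    all(random.random() < 0.5 for _ in range(9))
    for _ in range(trials)
)
print(f"Analytical 1/2^9: {1 / 2**9:.5f}")
print(f"Simulated       : {all_tails / trials:.5f}")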

Mark: Well, I respect your opinion. However, I’ll tell you a secret: I got to know that the same coin was used in the previous 9 matches, and the umpires will likely use that same coin in this match too. Is there a possibility that the coin is loaded (biased)?

You: Interesting viewpoint, Mark. Let me get back to you on this.

Frequentist Inference: Estimate based only on observed data

You assume that the coin is loaded, with an unknown probability of heads θ. Our task is to estimate θ given the evidence, i.e. the past results.

Suppose you toss the coin n times and get heads h times.

Now the task is to find the most appropriate value of θ given the observations. One way to do that is called Maximum Likelihood Estimation (MLE).

Each coin toss can be modeled as a Bernoulli random variable with parameter θ. The likelihood of the observed results is —

L(θ) = θ^h · (1 − θ)^(n − h)

Likelihood of a sequence of observed Bernoulli random variables

We need to find the parameter θ̂ that maximizes the above likelihood. For ease of calculation, we can instead maximize log L(θ); we can do that because the logarithm is a monotonically increasing function. Following the standard calculus trick, we take the derivative of the log-likelihood and equate it to zero:

d/dθ [ h·log θ + (n − h)·log(1 − θ) ] = h/θ − (n − h)/(1 − θ) = 0  ⟹  θ̂ = h/n

MLE of the true but unknown parameter (the true probability of heads in our case)

Thus, the true parameter θ, the probability of heads, can be estimated as the fraction of heads observed across all the coin tosses. This is fairly intuitive, but we have now proved it using a popular technique called MLE. This is the frequentist way of point estimation.
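
As a sanity check, we can recover the same answer numerically. The sketch below (my addition, assuming scipy is available) maximizes the log-likelihood directly; h = 3 is an illustrative value, since with du Plessis’ actual data (h = 0) the optimum sits at the boundary θ̂ = 0:

import numpy as np
from scipy.optimize import minimize_scalar

n, h = 9, 3  # 9 tosses, 3 heads (illustrative; du Plessis had h = 0)

def neg_log_likelihood(theta):
    # -log L(theta) = -(h*log(theta) + (n - h)*log(1 - theta))
    return -(h * np.log(theta) + (n - h) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(f"Numerical MLE  : {result.x:.4f}")
print(f"Closed form h/n: {h / n:.4f}")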

The MLE point estimate θ̂ alone does not tell us much about θ. In particular, without additional information, we do not know how close θ̂ is to the real θ. As statisticians, can we provide more than a point estimate?

We can actually provide a confidence interval (CI) estimate along with θ̂ using the “Central Limit Theorem”.

Central Limit Theorem: The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution. This fact holds especially true for sample sizes over 30. 
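
We can sanity-check the CLT claim for coin tosses. The sketch below (my addition, assuming numpy is available) draws many samples of fair tosses and confirms that the spread of the sample means matches the √(θ(1 − θ)/n) term used in the interval that follows:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000         # tosses per sample
samples = 10_000  # number of repeated samples
# Mean of n Bernoulli(0.5) tosses = Binomial(n, 0.5) count divided by n
means = rng.binomial(n, 0.5, size=samples) / n
print(f"Std of sample means        : {means.std():.4f}")
print(f"CLT prediction sqrt(0.25/n): {(0.25 / n) ** 0.5:.4f}")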

Assuming the Central Limit Theorem is valid in our case (even though we have only 9 data points), we can compute a 95% confidence interval as —

θ̂ ± z_(α/2) · √( θ(1 − θ) / n )

Formula for the confidence interval of a Bernoulli parameter

In the above formula, we set the unknown true parameter θ to 1/2 so that we get the widest possible confidence interval. For 95% confidence, α = 0.05, and referring to a standard normal table gives z = 1.96.

In the case of du Plessis, all 9 coin tosses resulted in tails, so θ̂ = 0 and the interval is

0 ± 1.96 · √( 0.5 × 0.5 / 9 ) = ±0.327

A negative probability of heads is meaningless here, hence we report the probability of heads, given the observed streak of 9 tails, as [0, 0.327].
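
The same computation as a short Python sketch (my addition, standard library only):

import math

n = 9     # observed tosses
h = 0     # observed heads (all 9 tosses were tails)
z = 1.96  # z-value for 95% confidence

theta_hat = h / n
# Widest interval: plug theta = 0.5 into the standard error, as in the text
margin = z * math.sqrt(0.5 * 0.5 / n)
lower = max(theta_hat - margin, 0.0)  # clip: a negative probability is meaningless
upper = theta_hat + margin
print(f"95% CI for probability of heads: [{lower:.3f}, {upper:.3f}]")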

Conversation III: between Mark Boucher and Analyst

You: Mark, I did some calculations considering your hypothesis that the coin might be loaded. I found that the true probability of heads should lie between 0 and 32.7%.

Mark: Wow, looks like my guess is right. Are you sure about this?

You: Nothing is set in stone in statistics; there is always some uncertainty. However, I can say with 95% confidence that the probability of heads is between 0 and 32.7%. Hence, there is enough evidence for your hypothesis that the coin is biased. You should probably raise this with the ICC. Also, Faf should call tails in the next toss, as that outcome is more likely.

Hypothesis Testing and Confidence Interval: One can perform a hypothesis test by computing the confidence interval of the unknown parameter. In this case:
Null hypothesis (H0): Coin is unbiased (probability of heads = 0.5)
Alternative hypothesis (H1): Coin is biased (probability of heads ≠ 0.5)
Outcome: We reject the null hypothesis because the 95% confidence interval [0, 0.327] does not contain 0.5, i.e. p-value < 0.05.

Mark: Interesting. Then I will ask Faf to go for tails, using your expert opinion. I’ll think about raising this issue with the ICC. However, you need to be absolutely sure before I make that allegation; both of our jobs could be on the line.

You: Please note that the confidence interval of [0, 32.7%] for the probability of heads was computed considering only the current evidence. Can you try to share with me the past results of tosses conducted with that same coin? I can reinforce my belief if you can provide some historical data.

Mark: Sure, I checked this with the team. I could only fetch the results of 100 previous tosses conducted with the same coin. To my surprise, that coin produced 60 heads and 40 tails.

You: Great, Mark, this changes everything for us. Give me some time and I’ll be back with my final judgement.

Updating belief using Bayesian Inference

We now have some additional information about the coin. Information about the unknown parameter θ that is available before seeing the current data is called the Prior in statistical parlance. In our case, the prior belief is that the probability of heads is 60/100 = 0.6.

We need to somehow combine the prior information with the likelihood (the observed results of the 9 coin tosses). The likelihood alone says that the probability of heads is 0.

As you might have guessed, the actual answer should lie somewhere between the prior and the likelihood, but we need a mathematical tool to compute it.

Bayes’ theorem comes to the rescue, as it provides a very elegant way to combine the prior with the likelihood. It is expressed in terms of probability distributions as:

f(θ | data) = f(data | θ) · f(θ) / f(data)

where f(θ|data) is the posterior distribution of the parameter θ, f(data|θ) is the likelihood function, f(θ) is the prior distribution of the parameter, and f(data) is the marginal probability of the data.

For practical purposes, the takeaway from the above is: posterior ∝ Likelihood × Prior.

In our case, we can model the true probability of heads with a Beta distribution and the likelihood with a Binomial distribution. Beta and Binomial form a conjugate pair, which lets us write the posterior for our unknown parameter in closed form. This modeling choice is made purely for mathematical convenience; one can go with other choices too.

Beta Distribution: The beta distribution is a family of continuous probability distributions defined on the interval [0, 1]. It is parametrized by two positive shape parameters, denoted α and β. It can be used to model an unknown parameter that lies between 0 and 1; hence it is colloquially known as the distribution of distributions.

Mathematical proof for the Beta-Bernoulli conjugate pair (Source: Wikipedia)

Bayesian model:
Prior ∼ Beta(α, β)
Likelihood ∼ Binomial(n, θ)
Posterior ∼ Beta(s + α, f + β)

Notation:
s = number of heads in the 9 tosses = 0
f = number of tails in the 9 tosses = 9
α = number of heads in the prior information = 60
β = number of tails in the prior information = 40

After plugging in the values, the resulting posterior in our case is —

Posterior ∼ Beta(0 + 60, 9 + 40) = Beta(60, 49)
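
As a numerical check that posterior ∝ Likelihood × Prior really yields this result, here is a grid-approximation sketch (my addition, assuming scipy is available): evaluate prior × likelihood over a grid of θ values, normalize, and compare with the closed-form Beta(60, 49):

import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.001, 0.999, 999)  # grid of candidate values for θ
dx = theta[1] - theta[0]

prior = beta.pdf(theta, 60, 40)      # Beta(60, 40): 60 heads, 40 tails in past data
likelihood = binom.pmf(0, 9, theta)  # probability of 0 heads in 9 tosses
posterior = prior * likelihood
posterior /= posterior.sum() * dx    # normalize into a proper density

grid_mean = (theta * posterior).sum() * dx
print(f"Grid posterior mean          : {grid_mean:.4f}")
print(f"Closed-form Beta(60, 49) mean: {60 / 109:.4f}")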

In the Bayesian framework, a point estimate can be computed as the expectation (mean) of the posterior distribution. Note that in Maximum a Posteriori (MAP) estimation, we instead choose the mode of the posterior as the point estimate, because it has the highest probability density.

The credible set (CS) is the Bayesian analogue of the confidence interval (CI) in the frequentist framework. They look similar but are interpreted differently. The snippet below computes the point and interval estimates for us and plots the posterior —

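A minimal sketch of such a snippet (my reconstruction, assuming scipy and matplotlib; variable names are my own):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

a, b = 60, 49  # posterior Beta(s + α, f + β) = Beta(0 + 60, 9 + 40)

# Point estimates: posterior mean, and MAP (the mode of the Beta density)
post_mean = a / (a + b)
post_mode = (a - 1) / (a + b - 2)

# 95% equi-tailed credible set: the 2.5th and 97.5th percentiles
lower, upper = beta.ppf([0.025, 0.975], a, b)
print(f"Mean: {post_mean:.4f}, MAP: {post_mode:.4f}")
print(f"95% credible set: [{lower:.4f}, {upper:.4f}]")

# Plot the posterior and shade the credible set in green
theta = np.linspace(0, 1, 500)
plt.plot(theta, beta.pdf(theta, a, b))
shade = np.linspace(lower, upper, 200)
plt.fill_between(shade, beta.pdf(shade, a, b), color="green", alpha=0.4)
plt.xlabel("θ (probability of heads)")
plt.ylabel("Posterior density")
plt.show()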

The expected value of θ, i.e. the mean of the Beta distribution:

E(θ) = 60 / (60 + 49) ≈ 0.55

The 95% equi-tailed credible set of this beta distribution (shaded in green in the plot):

CS for θ ≈ [0.4568, 0.6423]

Conversation IV: between Mark Boucher and Analyst

You: Hey Mark, based on the additional information you provided, the true probability of heads lies between 45.68% and 64.23%, with an expected (average) value of 55%. As usual, I am 95% certain about these results.

Mark: Okay. So you got the average probability of heads as 55%. Are you now reversing your judgement and saying that the coin is instead biased towards heads? What the hell is going on?

You: Based on the additional data you provided, the true probability of heads should lie between 45.68% and 64.23%. This range includes a 50% probability of heads, which implies that the coin can still be unbiased. In essence, there isn’t enough evidence that the coin is biased towards either outcome, so we can’t approach the ICC with this.

Null hypothesis (H0): Coin is unbiased (probability of heads = 0.5)
Alternative hypothesis (H1): Coin is biased (probability of heads ≠ 0.5)
We fail to reject the null hypothesis since our 95% credible set contains 0.5; in the frequentist world, this is the equivalent of stating that the p-value > 0.05.

Mark: Sorry, most of that went over my head. What’s your final recommendation?

You: Since we could not prove that the coin is biased, I would say that Faf has simply been unlucky so far. Hence, he should probably ask another player to call the toss. That player can call anything he wants; both outcomes are equally likely.

Mark: Okay sure. I’ll convey this to Faf. Thank you very much.

Summary

After 9 consecutive tails, here are the options for du Plessis in the 10th toss against Virat Kohli:

Advice of the team’s analyst: use a proxy captain.

I hope you realize that all the above conversations were hypothetical.

However, Faf du Plessis did in fact take a proxy captain (Temba Bavuma) with him for the 10th toss, and guess what — he still lost the toss.

This just means that du Plessis was unlucky, and nothing else. “It shows that it isn’t meant to be (win the toss with a proxy captain),” du Plessis said after losing his 10th consecutive toss in Asia.

Faf du Plessis joined the ranks of a few other players who have recorded long streaks of toss losses. The list includes the likes of Graeme Smith and Nasser Hussain, with 8 and 10 toss losses respectively.

Conclusion

The cricket analogy was just to help you connect the above statistical concepts with a real use case. The techniques covered here apply to any problem where we have some observed data, optionally some prior information, and the task is to estimate an unknown parameter.

Maximum likelihood estimation is a frequentist technique that depends only on the observed data. Bayesian estimation takes the prior into account and is thus, in my opinion, a more robust technique. Note that while the unknown parameter is treated as a fixed constant in the frequentist framework, it is considered a random variable in the Bayesian framework. We can perform both point estimation and interval estimation in either framework and interpret the results effectively for our business folks.
