# Is this coin, die or whatever even fair?

Tell me if this has happened to you. You notice that the coin you and your partner flip to decide who washes the dishes seems to land on heads a conspicuous number of times compared to tails. You notice that the dice in the Monopoly set at grandma’s house land on a conspicuous number of snake eyes, double 1’s. More broadly, you notice that some kind of random event that should be fair seems instead to be biased towards certain outcomes. But how can you test whether your hunch is correct and the coin, die or whatever is in fact unfair?

Well, if you ask a statistician how to check if a coin is fair, for example, she’ll probably say something along the lines of “flip it a bunch.” But how many is “a bunch” and how can you use that to say empirically whether something is fair or not? In this blogpost, I want to break down exactly how you can use “a bunch” of flips, rolls or whatever to determine whether a discrete random variable is truly probabilistically balanced across all of its outcomes.

Let’s start off with the coin and talk about what we want our empirical statistical test to do at a high level before we get into any real math. I’m using a coin as the example because having only two possible results makes it much easier to get the idea of how it works. Of course, what I discuss here will apply to dice, 6-sided or otherwise, as well as any other random variable with a finite set of possible results. For the coin, though, what we’re trying to do is test if the counts of heads and tails are _significantly_ different than those we would expect for a coin we know to be perfectly balanced for 50% heads and 50% tails.

What do I mean? Imagine you take a coin, check to make sure it does in fact have a heads and tails side and then flip it 1,000 times, getting 500 heads and 500 tails. It seems clear that this is a fair, balanced coin. But what if you got 501 heads and 499 tails? What about 502 heads and 498 tails? 525 heads and 475 tails? Even if the coin is perfectly balanced where it has exactly a 50% chance of heads and 50% chance of tails, you’d expect some wiggle, some small deviation from perfectly equivalent results. In very technical terms, there’s always going to be some wiggle room and variation and the question is how much variation should you expect for a truly fair coin?
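If you’d like to see that wiggle as actual numbers, here is a minimal Python sketch that counts, for a perfectly fair coin, how often 1,000 flips would land at least as far from a 500/500 split as the examples above. The function name `prob_at_least_this_far` is my own invention for illustration, and the calculation is an exact count of flip sequences rather than the chi-squared method discussed later.

```python
from math import comb

def prob_at_least_this_far(n_flips: int, distance: int) -> float:
    """Chance that a fair coin's head count lands `distance` or more away from n_flips / 2."""
    half = n_flips / 2
    # Count the equally likely flip sequences whose number of heads
    # is far enough from an even split, in either direction.
    far = sum(comb(n_flips, k) for k in range(n_flips + 1) if abs(k - half) >= distance)
    return far / 2 ** n_flips

# The splits mentioned above: 501/499, 502/498 and 525/475.
for heads in (501, 502, 525):
    print(heads, prob_at_least_this_far(1000, heads - 500))
```

The closer the printed value is to 1, the more ordinary that split is for a fair coin.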

Now, let’s take another checked-for-both-heads-and-tails coin and flip it. Heads. Flip it, again. Heads, again. Once more. What do you know? Heads for a third time. Okay, one more time. Heads? Oh, my!

A gut reaction would tell you that something’s fishy here. Four flips and four heads? That does not seem likely for a perfectly balanced coin. However, as mentioned previously, even a perfectly balanced coin will not always yield a set of flips which are exactly matched. Are four flips and four heads within the expected wiggle for a fair coin?

Let’s figure out exactly how likely it would be to get four heads if the coin was indeed fair. There are 16 (2⁴) possible unique sequences of results when flipping a coin four times, and for a fair coin each one is equally likely. Only one of those has heads occurring four times, meaning that the probability of four flips of a fair coin yielding four heads is p = 1/16 = .0625. This is the so-called p-value of the observed data (for the record, p=0 means 0% chance and p=1 means 100% chance).

Figure: A probability tree (http://ic50.org/probabilitree) showing the potential outcomes of 4 flips. The bars on the right indicate the overall number of outcomes that yield the specified proportion of heads and tails.
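The count of 16 outcomes is small enough to check by brute force. Here is a minimal Python sketch that lists every sequence of four flips and picks out the all-heads one; the variable names are just mine.

```python
from itertools import product

# Every equally likely sequence of four flips, e.g. ('H', 'T', 'H', 'H').
outcomes = list(product("HT", repeat=4))

# Keep only the sequences in which every flip is heads.
all_heads = [seq for seq in outcomes if all(flip == "H" for flip in seq)]

p_four_heads = len(all_heads) / len(outcomes)
print(len(outcomes), len(all_heads), p_four_heads)  # 16 1 0.0625
```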

What is this p-value? Time for a bit of a philosophical tangent. P-values have long been the gold standard of experiments to determine whether some effect is “statistically significant” or not, so it’s worth taking a second to talk a little bit more about this. A p-value is the probability of the observed data (strictly speaking, of data at least as extreme as what was observed), assuming the “null hypothesis”, which in very technical terms means the hypothesis that nothing interesting is happening. In this case, the null hypothesis is that the coin is boringly fair. Here, our p-value=.0625 but that does not mean that the coin has only a 6.25% chance of being fair. Perhaps someone else flipped the coin 1000 times before your four flips and got 500 heads and 500 tails. Your new experiment of four flips does not overwrite the results of that previous experiment. Rather, the p-value indicates the likelihood of just these observations, i.e., coin flips, and sequences of coin flips or other results of experiments always need to be interpreted with related experiments in mind. If you want to read more about this, I suggest checking out this quick post or this longer (but a lot more tongue-in-cheek) paper by a well-known statistician. They make for a great Friday night.

Dismounting my soapbox, let’s get back to coins and assume that these are the only four flips ever done with this coin. In that case, it seems like the data suggest the coin is biased towards heads. However, are these data enough to make us confident enough to say it’s significantly likely to be unfair? In fact, no. At least, if we follow the commonly used threshold of p<=.05 being what determines statistical significance. In other words, because there is a greater than 5% chance the four heads came from a fair coin, we cannot say that it is significantly likely to be unfair. Granted, there is a lot of debate out there about this threshold, but let’s just use it for the time being. If we were to flip the coin once more and it were to land on heads once again, then we _could_ say the coin is not fair because the probability of a fair coin getting five out of five heads is p=1/32=.03125, which is less than .05. In this event, we can conclude that the coin is significantly likely to be unfair.
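The four-versus-five comparison is easy to reproduce. This is a minimal Python sketch, assuming the p≤.05 convention from above; `ALPHA` and `p_all_heads` are names I made up for illustration.

```python
ALPHA = 0.05  # the conventional significance threshold discussed above

def p_all_heads(n_flips: int) -> float:
    """Probability that a fair coin lands heads on every one of n_flips flips."""
    return 0.5 ** n_flips

for n in (4, 5):
    p = p_all_heads(n)
    verdict = "significant" if p <= ALPHA else "not significant"
    print(n, p, verdict)
```

Four straight heads stays above the threshold, while the fifth pushes the probability below it.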

This is a good general, abstract idea of how to determine if a coin, die or any variable is fair. If the observed results have less than a 5% chance of coming from a fair variant of the coin, die or whatever, we can say there is a significant chance the random variable is somehow biased. This methodology does open a lot of questions, however. What if we get a mix of heads and tails when we flip? When do we stop to check if we have a significant result or not? How do we tell what the actual probability of heads or the probability of tails is for an unfair coin? The answer to the first question is coming up, but I’ll save the other two for future blogposts (though check out this on “p-hacking” for a teaser on the second).

All right. I wrote earlier about the general idea behind tests to evaluate the fairness of a coin or die. Now let’s look at a relatively straightforward way to apply this, an algorithm of simple equations that you can use at home. Though my example above may seem to those already knowledgeable in the statistical arts to touch more on the binomial or multinomial distribution, I’ll show a chi-squared test here because it’s a little easier to compute overall (and plenty of stats textbooks use coins and dice as examples for chi-squared anyway). I won’t touch too much on the inner, math-y workings of it, but if you’re curious, I highly recommend this video which goes into more details.

In short, a chi-squared test works by taking two different “bunches” of observations and checking to see how likely it is that both sets of results came from the same coin, die or whatever. In theory, if you flip a coin 1000 times and then flip it 1000 times more, the second set of flips should be very similar to the first, in terms of total heads and tails. A chi-squared test gives you an explicit probability value that the two bunches came from the same coin. Now, if we _knew_ a coin was fair, we could flip that coin 1000 times and then take another coin we’re not sure about and flip it 1000 times and compare the proportions of heads and tails in both sets. One cool trick is that we can forego finding a truly balanced coin and rather just assume exactly half heads and exactly half tails. Now, how many flips do you need? Generally, more is better, but as a rule of thumb the expected count for each possible result should be at least 5.
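To make the “assume exactly half and half” trick and the at-least-5 rule of thumb concrete, here is a small Python sketch. Both helper names are hypothetical, and the sketch assumes every outcome is supposed to be equally likely.

```python
def expected_counts(n_trials: int, n_outcomes: int) -> list[float]:
    """Expected count per outcome if every outcome is equally likely."""
    return [n_trials / n_outcomes] * n_outcomes

def enough_trials(n_trials: int, n_outcomes: int, minimum: float = 5.0) -> bool:
    """Rule of thumb from the text: each expected count should be at least 5."""
    return min(expected_counts(n_trials, n_outcomes)) >= minimum

print(expected_counts(1000, 2))  # [500.0, 500.0]
print(enough_trials(4, 2))       # False: only 2 expected per side
print(enough_trials(30, 6))      # True: 5 expected per face of a die
```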

All that’s left to do is flip the coin in question and compare. Let’s walk through that now.

Though it may have a scary Greek letter, calculating the values of a chi-squared test is easy. For each different possible result of the random variable in question (in this case, heads or tails), we subtract the expected count for that result from what we observe, square that difference and then divide by the expected count. We take the square of the difference to make sure it will always be positive and to penalize bigger differences more than smaller ones. We divide the squared difference by the expected count to control for the number of total coin flips, die rolls or whatever. After doing this for each possible result, the next step is simple. Sum up all of these and that’s your chi-squared value. We can then go to a table where we can look up a p-value using the chi-squared value and the degrees of freedom. It’s easy-as-pie to find one of these tables online, but here and here are some examples. Since we’re just looking at a single variable, the degrees of freedom will be N-1, where N is the number of possible distinct results, e.g., for a coin, N=2 so there is 1 degree of freedom (2–1=1), and for a 6-sided die, N=6 so there are 5 (6–1=5).
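The recipe above fits in a few lines of Python. This is a minimal sketch with function names of my own choosing, and the 36/24 tally is a made-up example rather than real data.

```python
def chi_squared(observed: list[int], expected: list[float]) -> float:
    """Sum of (observed - expected)^2 / expected over every possible result."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def degrees_of_freedom(n_outcomes: int) -> int:
    """For a single variable: number of distinct results minus one."""
    return n_outcomes - 1

# Hypothetical tally: 60 flips, 36 heads and 24 tails, against a 30/30 expectation.
stat = chi_squared([36, 24], [30.0, 30.0])
print(stat, degrees_of_freedom(2))
```

Each side contributes 6² / 30 = 1.2 here, so the statistic comes to 2.4 with 1 degree of freedom.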

Let’s walk through a specific example. Imagine we flip a coin 1000 times and there are 474 heads and 526 tails. We compare this against the expectations of 500 heads and 500 tails and calculate the chi-squared value as such:

Heads: (474 − 500)² / 500 = 676 / 500 = 1.352

Tails: (526 − 500)² / 500 = 676 / 500 = 1.352

Adding those together, we get 2.704 and since there is 1 degree of freedom, the estimated probability that our set of flips and the expected 50/50 split came from the same coin is p≈.10. Like the four-headed example at the beginning of the blogpost, this is uncanny but not quite what we’d like, as good statisticians, to determine the coin is in fact not fair.

Figure: Closest p-value for our chi-squared value, given 1 degree of freedom (df).

On the other hand, if our bunch of flips yielded 434 heads and 566 tails, our chi-squared value would be 17.424 (that is, 2 × 66² / 500), which with a single degree of freedom gives p<.0001. A fair coin would produce a split that lopsided less than once in 10,000 tries, so in that case we can say that it’s most likely the case that our coin is unfair.
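If you don’t have a table handy, both coin examples can be checked with Python’s standard library. This is a sketch under one assumption worth stating loudly: the shortcut in `p_value_df1` (a name I made up) only works for 1 degree of freedom, where the chi-squared statistic is the square of a standard normal variable. For more degrees of freedom you would need a table or a statistics library.

```python
from math import erfc, sqrt

def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over every possible result."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(stat: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    Valid for df=1 only: the tail area reduces to erfc(sqrt(stat / 2)).
    """
    return erfc(sqrt(stat / 2))

# The two coin examples from the text, each against a 500/500 expectation.
for heads, tails in ((474, 526), (434, 566)):
    stat = chi_squared([heads, tails], [500, 500])
    print(heads, tails, round(stat, 3), p_value_df1(stat))
```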

It’s trivial to adapt this to a 6-sided die. It’s the same basic steps, just a few more of them. We roll our die 1000 times and compare the observed count for each side against the expected count per side. Since 1000 is not perfectly divisible by six, the expected count per side will be a non-integer (1000/6 ≈ 166.67), but that’s fine; the chi-squared test only requires the observed counts to be integers.
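Here is what the die version might look like as a Python sketch. The roll tallies are made-up numbers, not real data, and the function and constant names are mine. Instead of computing an exact p-value, the sketch mimics the table lookup by comparing against the standard chi-squared critical values for p=.05.

```python
# Chi-squared critical values at p = .05, keyed by degrees of freedom
# (standard table values).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def chi_squared_uniform(observed: list[int]) -> float:
    """Chi-squared statistic against the assumption that every face is equally likely."""
    expected = sum(observed) / len(observed)  # 1000 / 6 is about 166.67, a non-integer
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical tallies for faces 1 through 6 after 1000 rolls (made-up numbers).
rolls = [160, 170, 165, 168, 172, 165]
stat = chi_squared_uniform(rolls)
df = len(rolls) - 1  # six faces, so 5 degrees of freedom

print(round(stat, 3), df, stat > CRITICAL_05[df])
```

A statistic above the critical value would mean p<.05 and a die we should be suspicious of; these made-up tallies fall well below it.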