If you haven’t heard through the news, social media, or your math teacher, today is the pi day of the century. At 9:26:53 (AM or PM, your preference), the world will rejoice with fruit-filled desserts as the date hits 3/14/15 9:26:53 (and 589 milliseconds if you’re quick enough). All over Twitter, people from across the world have expressed their love for the never-ending constant, and even the President showed his appreciation of the Greek letter.
All this celebration got me thinking about pi, and the more I read, the more I thought about a certain question:
“Are the digits of pi randomly distributed?”
Does the constant that we know and love favor certain digits? If so, is there a reason why? I've recently learned about Chi-Square (pronounced “Kai”) tests in my Statistics class, so I decided to perform a χ² Goodness-of-Fit test to find out.
A Chi-Square Goodness-of-Fit test is defined as:
A test applied when you have one categorical variable from a single population. It is used to determine whether sample data are consistent with a hypothesized distribution.
The formula for a Chi-Square statistic is:
$$\chi^2 = \sum \frac{(o - e)^2}{e}$$
where o is the number you got and e is the number you expected.
When you add these all up for each item in your data set, you get the Chi-Square statistic. This number can be used to calculate the probability of getting results at least as far from your expectations as the ones you observed, assuming your expectations were accurate.
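To make the arithmetic concrete, here is a minimal sketch of that sum in plain Python. The function name `chi_square_statistic` and the coin-flip numbers are made up for illustration; they are not part of the analysis in this post.

```python
def chi_square_statistic(observed, expected):
    """Add up (o - e)^2 / e over every category."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Invented example: 100 coin flips, expecting a fair 50/50 split.
observed = [58, 42]  # heads, tails (made-up counts)
expected = [50, 50]

# (58 - 50)^2 / 50 + (42 - 50)^2 / 50 = 1.28 + 1.28 = 2.56
statistic = chi_square_statistic(observed, expected)
print(statistic)
```

The bigger the gap between what you got and what you expected, the bigger the statistic. Turning it into a probability takes the Chi-Square distribution, which is where a textbook table or a library comes in.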
So in basic English, a Chi-Square test tells you if a certain set of numbers differs from an expected outcome.
So what’s the use of it?
Let’s say you own a small supermarket, and you buy pies from the local bakery every weekend to sell at your store. The baker says he can’t promise exactly how many pies of each kind he’ll give you, but for a discount, he’ll randomly throw pies in a box for you and sell it at a cheaper rate. He claims that he makes pies in the following proportions:
Bob’s Pie Distribution:
Apple — 0.5
Blueberry — 0.25
Cherry — 0.15
Blackberry — 0.1
Now one day you walk into the store and see that your shipment of pies has come in. You open the box, and out of the 100 pies you ordered, you received:
Apple — 44
Blueberry — 26
Cherry — 12
Blackberry — 18
Now you think the baker might have been lying to you, but you don’t want to confront him and ruin your good relationship if he actually told the truth. What do you do?
You use a Chi-Square test to find out how likely a box like this would be if he’s telling the truth.
You break out your textbook, find the equation for a Chi-Square test, plug in the numbers, and start calculating. You get a Chi-Square statistic of 7.76, which works out to roughly a 1-in-20 chance (p ≈ 0.051) of getting a box at least this far off if he’s telling the truth. That is just above the usual 0.05 cutoff, so you can’t completely count him out. This is enough for you to brush it off, and call him up to thank him for his wonderful service.
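If you'd rather not do the textbook calculation by hand, here is a rough sketch of the same check in Python, assuming SciPy is installed. The variable names are my own for illustration; `scipy.stats.chisquare` is the real SciPy function, and it expects counts rather than proportions, so the baker's proportions are scaled by the 100 pies first.

```python
from scipy.stats import chisquare

total_pies = 100

# The baker's claimed proportions and what actually arrived in the box.
claimed_proportions = {"Apple": 0.50, "Blueberry": 0.25, "Cherry": 0.15, "Blackberry": 0.10}
received = {"Apple": 44, "Blueberry": 26, "Cherry": 12, "Blackberry": 18}

flavors = list(claimed_proportions)
observed = [received[flavor] for flavor in flavors]
# Convert proportions to expected counts: 50, 25, 15, 10.
expected = [claimed_proportions[flavor] * total_pies for flavor in flavors]

# Statistic by hand: 36/50 + 1/25 + 9/15 + 64/10 = 0.72 + 0.04 + 0.60 + 6.40 = 7.76
statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.3f}")
```

With four kinds of pie, the test uses 4 − 1 = 3 degrees of freedom, which `chisquare` works out on its own from the number of categories.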
Chi-Square Tests and Pi
Now that we have a basic understanding of how the test works, we can use it to see if the digits of pi are truly randomly distributed.
If they are, we can expect each digit to have a 10% chance of appearing (one digit out of the ten digits 0–9 = 1/10 = 10%).
Tallying each digit of pi would be very, very tedious, so I decided to use a computer program to count for me. For you programmers:
I decided to use Python because of its quick scripting ability. Writing my own function to calculate the Chi-Square statistic would be reinventing the wheel, so I decided to use the NumPy and SciPy packages to find it. This is my first real Python data-science project, so if I should improve anything, please let me know!
I managed to whip up this code to test:
For those of you who aren't programmers (though I still think you can understand what’s going on), here’s what I did:
- Open up a file that has the value of pi to 100 digits
- Go through each digit of pi and tally it in a list
- Compare the tallied list with the expected list
- Return the p-value, or the probability of getting counts at least as uneven as mine if the digits of pi were truly random
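The steps above can be sketched roughly as follows. Treat this as a reconstruction under assumptions, not the exact script: the names `PI_TEXT`, `count_digits`, and `uniformity_p_value` are made up for illustration, and the digits are pasted in as a string so the example runs without a file. `scipy.stats.chisquare` is the real SciPy function.

```python
from scipy.stats import chisquare

# Pi to 100 decimal places (stands in for the contents of the digits file).
PI_TEXT = (
    "3."
    "14159265358979323846264338327950288419716939937510"
    "58209749445923078164062862089986280348253421170679"
)


def count_digits(text, n_digits):
    """Tally how often each digit 0-9 appears among the first n_digits digits."""
    digits = [int(ch) for ch in text if ch.isdigit()][:n_digits]
    if len(digits) < n_digits:
        raise ValueError("not enough digits in the input")
    counts = [0] * 10
    for digit in digits:
        counts[digit] += 1
    return counts


def uniformity_p_value(text, n_digits):
    """Chi-Square goodness-of-fit p-value against an even 10% per digit."""
    observed = count_digits(text, n_digits)
    expected = [n_digits / 10] * 10
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value


print(round(uniformity_p_value(PI_TEXT, 100), 3))
```

To run it on a longer expansion, read the text from a file instead (for example `open("pi_digits.txt").read()`, where the file name is hypothetical) and raise `n_digits`. Note that this sketch counts the leading 3 as the first digit; on the first 100 digits it gives a p-value of about 0.924.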
In statistics, we have evidence against an assumption (the “null hypothesis”) when a p-value is less than our “alpha-level”. The usual alpha-level used is 0.05, so unless our result has a 5% or lower chance of happening under that assumption, we can’t claim that it’s anything other than what we’d normally expect.
After running the test a few times with various lengths of pi, I got the following p-values:
For 100 digits of pi, I received a p-value of 0.924
For 1,000 digits of pi, I received a p-value of 0.861
For 10,000 digits of pi, I received a p-value of 0.403
For 100,000 digits of pi, I received a p-value of 0.905
If I were in class, my teacher would tell me to write:
“Because our p-value of 0.905 is greater than our alpha-level of 0.05, we fail to reject the null hypothesis. We don’t have evidence to show that the digits of pi differ from an equal distribution.”
But here, it’s simpler to write:
All of our p-values are well above 0.05, so we have no evidence that pi favors any digit over the others, at least in its first 100,000 digits.
Yay for Equality!
I hope you've had a great pi day, see you in another 100 years!