Benford’s Law

What is Benford’s Law?

Most people (if they have ever thought about it) assume that a number, such as 1,000 or 57 or 999 or 23,486,171,840,111,538, is just as likely to start with a 1 as with a 9.

Benford’s Law is an observation that the frequency of leading digits in many real-life sets of numerical data is not evenly distributed. In fact it looks like this…

About 30% of the time, the leading digit is a 1. About 5% of the time it is a 9.
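Those percentages come from a simple formula: under Benford’s Law the probability that the leading digit is d equals log10(1 + 1/d). Here is a minimal sketch that prints the expected share for each digit. The function name benford_probability is my own choice for illustration, not part of any library.

```python
import math


def benford_probability(digit):
    """Return the Benford's Law probability that a number starts with `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


# Print the expected share of each leading digit as a percentage
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

The nine probabilities add up to 1, with roughly 30% for a leading 1 and a little under 5% for a leading 9.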

How can I create a Benford’s Law Distributed Dataset?

I wanted to test this, so I generated a big batch of random numbers between 1 and 10,000 using Python, as follows:

import random

# a list to store the generated random numbers
number_set = []
# generate 100,000 random numbers
for x in range(100000):
    # pick numbers between 1 and 10,000 (randint includes both endpoints)
    number_set.append(random.randint(1, 10000))

Now extract all the leading digits

# a list to store the leading digits
first_digit_set = []

# a function to get the leading digit
def get_leading_digit(number):
    # convert the number to a string,
    # take the first character,
    # convert back to an integer and return the value
    return int(str(number)[:1])

for d in number_set:
    first_digit_set.append(get_leading_digit(d))

Now show the results

for i in range(1, 10):
    print("There are " + str(first_digit_set.count(i)) + " leading " + str(i) + "'s")

There are 33513 leading 1's
There are 33181 leading 2's
There are 33140 leading 3's
There are 33707 leading 4's
There are 33461 leading 5's
There are 33133 leading 6's
There are 33286 leading 7's
There are 33419 leading 8's
There are 33170 leading 9's

D’oh!!

The numbers are evenly distributed!!

One of 2 things has happened.

1: A genius mathematician defined a law that is wrong (hint: it’s not this one)

OR

2: I have done something wrong

A: It turns out that the random module in Python’s standard library picks numbers with an even (uniform) distribution, so every number in the range is equally likely. Remember, Benford’s Law is an observation that the frequency of leading digits in many real-life sets of numerical data is not evenly distributed.
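You can see why without any random numbers at all. The sketch below (exact_counts is just an illustrative name) tallies the leading digit of every integer from 1 to 10,000. Each digit leads almost exactly one ninth of the range, so drawing evenly from that range can never produce Benford’s curve.

```python
from collections import Counter

# Tally the leading digit of every integer from 1 to 10,000 (no randomness involved)
exact_counts = Counter(int(str(n)[0]) for n in range(1, 10001))

for digit in range(1, 10):
    print(digit, exact_counts[digit])
```

Every digit comes out at 1,111, apart from 1, which gets one extra because 10,000 itself starts with a 1.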

So…

How can you generate data with a pre-defined distribution (using Python 3)?

How can you generate data with a Benford’s Law distribution?

Well, since Python 3.6 the random module has had a function called random.choices, which lets you specify weights and the number of items to generate…

from random import choices
from collections import Counter

# specify the list of values to generate occurrences of
# (these are the digits we want as leading digits)
population = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# specify the weights
# (these are the approximate Benford's Law weights)
weights = [0.301, 0.176, 0.124, 0.096, 0.079, 0.066, 0.057, 0.054, 0.047]
# generate a sample set of first digits with a Benford distribution
# k = 10**6 generates 1 million values
first_digits = choices(population, weights, k=10**6)
# use the standard library's Counter class to show the result
Counter(first_digits).most_common()

[(1, 301193),
 (2, 175999),
 (3, 123747),
 (4, 95958),
 (5, 79342),
 (6, 65449),
 (7, 57246),
 (8, 53951),
 (9, 47115)]

Woo Hoo!
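The weights I typed in by hand are rounded, so as a sanity check here is a sketch that derives them from the formula log10(1 + 1/d) instead and compares the sampled shares with the expected ones. The seed, the sample size, and the variable names are arbitrary choices of mine for illustration.

```python
import math
import random
from collections import Counter

# Seeded generator so the run is repeatable (the seed value is arbitrary)
rng = random.Random(42)

digits = list(range(1, 10))
# Exact Benford probabilities, log10(1 + 1/d), used as the weights
expected = [math.log10(1 + 1 / d) for d in digits]

sample_size = 100_000
sample = rng.choices(digits, weights=expected, k=sample_size)
observed = Counter(sample)

# Show the observed share next to the expected share for each digit
for d, p in zip(digits, expected):
    print(f"{d}: observed {observed[d] / sample_size:.3f}, expected {p:.3f}")
```

With a sample this size the observed and expected shares should agree to about two decimal places.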

And there you go. A list of one million leading digits displaying a Benford’s Law distribution. Let’s plot it on a chart to validate.

import numpy as np
import matplotlib.pyplot as plt

# collect the count for each digit, in the same order as population
digit_counts = Counter(first_digits)
count = [digit_counts[d] for d in population]

# positions along the x axis for the digit labels
y_pos = np.arange(len(population))
# set the size of the whole chart
plt.figure(figsize=(10, 10))
# label each bar with its digit
plt.xticks(y_pos, population)
plt.xlabel('Digit')
plt.ylabel('Leading Digit Count')

# create bars and choose color
plt.bar(y_pos, count, color='pink')

# limits for the Y axis
plt.ylim(0, int(max(count) * 1.1))

plt.show()

Benford’s Law Distribution of Leading Digits
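The list above holds leading digits only. If you want whole numbers whose leading digits follow Benford’s Law, one option (an assumption on my part, not the only way) is to spread the numbers evenly on a logarithmic scale: pick the exponent uniformly and raise 10 to it. A rough sketch, with an arbitrary seed and sample size:

```python
import random
from collections import Counter

rng = random.Random(2024)  # arbitrary seed, only for repeatability

# Assumption: draw the exponent uniformly between 0 and 4, so the numbers
# are spread evenly on a log scale between 1 and 10,000
numbers = [int(10 ** rng.uniform(0, 4)) for _ in range(100_000)]

# Tally the leading digit of each generated number
leading = Counter(int(str(n)[0]) for n in numbers)
for digit in range(1, 10):
    print(digit, f"{leading[digit] / len(numbers):.3f}")
```

Numbers generated this way cover the same 1 to 10,000 range as my first attempt, but the share of leading 1s should land near 30% rather than 11%.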
Written by Alex Freeman.