What is Benford’s Law?
Most people (if they have ever thought about it) assume that numbers (e.g. 1,000 or 57 or 999 or 23,486,171,840,111,538) are just as likely to start with a 1 as with a 9.
Benford’s Law is an observation that the frequency of leading digits in many real-life sets of numerical data is not evenly distributed. In fact it looks like this…
About 30% of the time, the leading digit is a 1. About 5% of the time it is a 9.
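Those percentages follow a simple formula: under Benford’s Law the probability that the leading digit is d is log10(1 + 1/d). The text above does not state the formula, so treat the snippet below as a small illustrative sketch; the name benford_probabilities is just made up for this example.

```python
import math

# Expected share of each leading digit d under Benford's Law: log10(1 + 1/d)
benford_probabilities = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the expected share of each leading digit as a percentage
for digit, probability in benford_probabilities.items():
    print(f"{digit}: {probability:.1%}")
```

The nine probabilities add up to 1, with roughly 30% for a leading 1 and a little under 5% for a leading 9.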
How can I create a Benford’s Law Distributed Dataset?
I wanted to test this, so I generated a set of random numbers (between 1 and 10,000) using Python, as follows:
import random

# a list to store the generated random numbers
number_set = []
# generate 100,000 random numbers
for x in range(100000):
    # pick numbers between 1 and 10,000
    number_set.append(random.randint(1, 10000))
Now extract all the leading digits
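# a list to store the leading digits
first_digit_set = []

# a function to get the leading digit
def first_digit(number):
    # convert the number to a string
    number_string = str(number)
    # take the first character
    first_character = number_string[0]
    # convert back to an integer and return the value
    return int(first_character)

for d in number_set:
    first_digit_set.append(first_digit(d))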
Now show the results
for i in range(1, 10):
    print("There are " + str(first_digit_set.count(i)) + " leading " + str(i) + "'s")
There are 33513 leading 1's
There are 33181 leading 2's
There are 33140 leading 3's
There are 33707 leading 4's
There are 33461 leading 5's
There are 33133 leading 6's
There are 33286 leading 7's
There are 33419 leading 8's
There are 33170 leading 9's
The leading digits are evenly distributed!!
One of 2 things has happened.
1: A genius mathematician defined a law that is wrong (hint: it’s not this one)
2: I have done something wrong
A: It turns out that the random module in Python’s standard library generates numbers with an even (uniform) distribution. Remember, Benford’s Law is an observation that the frequency of leading digits in many real-life sets of numerical data is not evenly distributed.
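To see the difference, here is a rough sketch (not the approach used in this post) that spreads the numbers evenly across orders of magnitude instead of evenly across the range 1 to 10,000. It assumes that log-uniform sampling is a fair stand-in for "real-life" data spanning several orders of magnitude; the seed and the variable names are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter

random.seed(42)  # fixed seed so the run is repeatable (an arbitrary choice)

# Draw the exponent uniformly, so the values are spread evenly across the
# four orders of magnitude between 1 and 10,000 (log-uniform sampling).
log_uniform_numbers = [10 ** random.uniform(0, 4) for _ in range(100000)]

# The leading digit is the first character of the integer part.
leading_digit_counts = Counter(int(str(int(n))[0]) for n in log_uniform_numbers)

# Compare each observed share with the Benford share log10(1 + 1/d)
for digit in range(1, 10):
    observed = leading_digit_counts[digit] / len(log_uniform_numbers)
    expected = math.log10(1 + 1 / digit)
    print(f"{digit}: observed {observed:.3f}, expected {expected:.3f}")
```

With this kind of sampling the observed shares should land close to the Benford shares, whereas random.randint over a fixed range gives every leading digit roughly the same count.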
How can you generate data with a pre-defined distribution (using Python 3)?
How can you generate data with a Benford’s Law distribution?
Well, since Python 3.6 the random module has had a function called random.choices, which allows you to specify weights and the number of items to generate…
from random import choices
# specify a list of values to generate occurrences of
# these are the digits we want as leading digits
population = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# specify the weights
# (these are the Benford's Law weights, log10(1 + 1/d) rounded to 3 places)
weights = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
# generate a sample first_digit set with a Benford distribution
# k = 10**6 generates 1 million values
first_digits = choices(population, weights, k=10**6)
from collections import Counter
# use the standard library's Counter class to show the result
print(Counter(first_digits))
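As a quick numeric cross-check, the observed share of each digit can be compared with its Benford share. This is only a sketch: it rebuilds the sample so that it runs on its own, computes the weights from log10(1 + 1/d) instead of typing them in, and uses names (digit_counts, largest_gap) invented for this example.

```python
import math
from collections import Counter
from random import choices

population = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Weights computed from log10(1 + 1/d) rather than typed in by hand
weights = [math.log10(1 + 1 / d) for d in population]
first_digits = choices(population, weights, k=10**6)

digit_counts = Counter(first_digits)
# Largest gap between an observed share and its Benford share
largest_gap = max(
    abs(digit_counts[d] / len(first_digits) - w)
    for d, w in zip(population, weights)
)
print(f"Largest deviation from Benford: {largest_gap:.4f}")
```

With a million samples the largest deviation should be a small fraction of a percentage point.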
And there you go. A list of one million numbers displaying a Benford’s Law distribution. Let’s plot it on a chart to validate.
import numpy as np
import matplotlib.pyplot as plt
# count how often each leading digit occurs, in digit order
digit_counts = Counter(first_digits)
count = [digit_counts[digit] for digit in population]
# set the positions to put the digit labels at
y_pos = np.arange(len(population))
#set size of the whole chart
# Create names
plt.xticks(y_pos, population)
plt.ylabel('Leading Digit Count')
# Create bars and choose color
plt.bar(y_pos, count, color='pink')
# Limits for the Y axis