Benford’s (and Newcomb’s) Law!

Several years ago, I listened to a Radiolab story that really stuck with me. The theme of the episode was numbers; more specifically, a forensic accountant on the program discussed how he could show that someone had committed fraud using something called Benford’s Law.

Before there were calculators, mathematicians had to do complicated math with the aid of logarithmic tables (an example is linked below). In 1881, an astronomer named Simon Newcomb was carrying out calculations using a book of these tables when he noticed that the earlier pages of the book (i.e., the tables for numbers beginning with 1, 2, and 3) were more worn than the later pages. This led Newcomb to form the principle that, in any list of “natural” numbers, more numbers will tend to begin with “1” than with any other digit. By “natural” numbers, Newcomb meant the numbers you would run into in the course of everyday life.

An example of such a book of tables: https://en.wikipedia.org/wiki/Abramowitz_and_Stegun

Then, in 1938, Frank Benford, a physicist working for General Electric, rediscovered Newcomb’s observation and came up with his eponymous Law.

“Benford’s Law is the mathematical theory of leading digits. Specifically, in data sets, the leading digit(s) is (are) distributed in a specific, nonuniform way. While one might think that the number 1 would appear as the first digit 11 percent of the time (i.e., one of nine possible numbers), it actually appears about 30 percent of the time... Nine, on the other hand, is the first digit less than 5 percent of the time.” — Journal of Information Systems Audit and Control Association (ISACA)
https://www.isaca.org/Journal/archives/2011/Volume-3/Pages/Understanding-and-Applying-Benfords-Law.aspx
http://www.datagenetics.com/blog/march52012/

Benford’s formula states that the probability of the leading digit being a certain value d can be described by the function P(d) = log₁₀(1 + 1/d), for d = 1 through 9.

In Python, a minimal sketch of Benford’s formula might look like the code below, which reproduces the familiar frequencies by digit (about 30.1% for a leading 1 down to about 4.6% for a leading 9).
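```python
import math

def benford_probability(d):
    """Probability under Benford's Law that a number's leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
# 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%,
# 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%
```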

Benford was able to validate his Law with 20 different sets of data, including the surface areas of 335 rivers, the sizes of 3259 U.S. populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an actual issue of Reader’s Digest, the street addresses of the first 342 persons listed in American Men of Science, and 418 death rates.

As a (very new) data science student, I wanted to demonstrate Benford’s Law using my new Python skills. Back in the 1930s, Benford pored over 20 different data sets across countless days, maybe even months or years. I could look at a pretty sizeable data set in an afternoon and see if it followed Benford’s Law.

These are the steps I took:

1. Get some data!

The type of data set itself didn’t matter too much. I opted for a large amount of data: Labor Force numbers from the World Bank. The numbers went back a couple of decades and they were available by country, all in convenient CSV format. I used the pandas library to read the file and return numbers for 1990, 2000, 2010, and 2017.
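Here is a minimal sketch of that step. The file name is hypothetical, and the skiprows value and “Country Name” column reflect how World Bank CSV exports are typically laid out:

```python
import pandas as pd

# Hypothetical file name; World Bank exports usually have a few
# metadata rows before the real header.
df = pd.read_csv("labor_force.csv", skiprows=4)

# Keep the country name plus the four years of interest.
years = ["1990", "2000", "2010", "2017"]
labor = df[["Country Name"] + years]
```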

2. Clean the data!

After converting the table into a list of dictionaries, I noticed there were some pesky NaN results. What even is a NaN? It stands for “Not a Number”, the placeholder pandas uses for missing values. I played around with some boolean functions and found the math.isnan function, which helped me filter out any countries that had NaN results.
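Roughly how that filtering might look, building on the labor table from step 1:

```python
import math

# Convert the table into a list of dictionaries, one per country.
records = labor.to_dict("records")

# Keep only the countries that have a value for every year.
clean = [r for r in records if not any(math.isnan(r[y]) for y in years)]

print(len(records), "->", len(clean))
```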

I went from a data set of 264 countries to one with 232. Not a terrible loss.

3. Get those first digits and make tuples!

I extracted all the first digits from every year’s data set and put them in a tuple per country. I didn’t track which country each tuple came from because I only cared about the digits themselves.
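A sketch of that extraction, assuming the clean list from step 2:

```python
def first_digit(value):
    """Leading digit of a positive number, e.g. 1523400.0 -> 1."""
    return int(str(int(value))[0])

# One tuple of first digits per country: (1990, 2000, 2010, 2017).
digit_tuples = [tuple(first_digit(r[y]) for y in years) for r in clean]
```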

4. Put each year’s data into its own list!
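One way to do that is to transpose the per-country tuples with zip:

```python
# zip(*...) turns per-country tuples into per-year sequences.
digits_1990, digits_2000, digits_2010, digits_2017 = (
    list(year_digits) for year_digits in zip(*digit_tuples)
)
```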

5. Calculate the frequency at which each number (1 through 9) occurs!
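A small helper built on collections.Counter covers this:

```python
from collections import Counter

def digit_frequencies(digits):
    """Fraction of values beginning with each digit 1-9."""
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

freq_1990 = digit_frequencies(digits_1990)  # and likewise for the other years
```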

And since I now had a combined data set of 928 first digits (232 countries × 4 years), I calculated the total frequencies as well:
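```python
# Pool all four years: 232 countries x 4 years = 928 first digits.
all_digits = [d for t in digit_tuples for d in t]
total_freq = digit_frequencies(all_digits)
```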

6. Compare to Benford’s Law!

Did my data match up to Benford’s Law? It was prrrretty close:
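A quick side-by-side print, reusing benford_probability from earlier, makes the comparison easy to eyeball:

```python
for d in range(1, 10):
    print(f"{d}: observed {total_freq[d]:.1%} vs. Benford {benford_probability(d):.1%}")
```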

Benford’s Law is now being used more frequently. Lately, it has been applied in accounting fraud detection, admitted as legal evidence in U.S. criminal cases, and even used to look for evidence of election fraud. For a data scientist, it is a simple and handy formula for checking the reliability of your data.