Unveiling the Power of Benford’s Law and How It’s Revolutionizing Data Science

Published in

Data Science Student Society @ UC San Diego

10 min read · May 8, 2023

Discover how Benford’s Law, a natural statistical phenomenon, drives cutting-edge advancements in image processing, fraud detection, and the detection of manipulated financial figures.

The Origins of a Statistical Phenomenon

Think about a time you sat down with a reference book. Which pages did you open most often? In 1881, the Canadian-born American mathematician and astronomer Simon Newcomb noticed that the first pages of a book of logarithm tables were far more worn and soiled than the later ones. Those early pages listed numbers beginning with the digit “1”, which suggested that people looked up numbers starting with “1” more often than numbers starting with larger digits. This observation became the crux of a mathematical law that was only proven rigorously more than a century later.

There’s a plethora of mathematical theories that serve as the foundation of scientific and natural laws. The theory Newcomb stumbled upon is known today as Benford’s Law, and it’s a coveted treasure in business analytics and data science. It describes the phenomenon in which the leading digit of the values in many datasets is most often “1”, then “2”, and so on, with “9” the least common.

More than 50 years after Newcomb, Frank Benford named the statistical phenomenon the “Law of Anomalous Numbers” after observing a logarithmically decreasing distribution of leading digits in a collection of more than 20,000 observations. Benford, an American engineer and physicist, also noted that how clearly the pattern shows up depends on the dataset: when the values are numerous and span several orders of magnitude (ones, tens, hundreds, and beyond), the trend holds, whereas a dataset of only about 10 observations is too small for the trend to be observable.

A Deeper Dive

In 1995, American mathematician Theodore Hill finally gave a rigorous proof of Benford’s Law, which is also known as the “First-Digit Law” and the “Significant-Digit Law.” Hill’s analysis explored the contexts in which the law is most prominent, pointing to sectors where it could be applied. For example, widely varied datasets such as stock prices, grocery bills, and city populations exhibit the phenomenon. In general, a dataset can be expected to follow Benford’s Law when it satisfies these conditions:

It has a large sample size, generally more than 500 numbers.
All data is recorded in the same unit.
It does not have predetermined minimum or maximum limits that exclude certain numbers (e.g., recording only cash payments made with bills larger than $20, which would exclude smaller denominations).
It is not made up of assigned identifiers (such as ID or phone numbers) or of measurements confined to a narrow range, such as human heights in meters, where certain leading digits are far more likely than others (most adult heights fall between 1.5 and 2 meters).
It represents magnitudes of events, such as populations or river flows.
It has a mean that is greater than the median (a right-skewed distribution), and the data is not concentrated around the mean.

It is not surprising, therefore, that a lot of empirical evidence within the scientific community supports Benford’s Law. Ranging from constants used in physics to the half-lives of radioactive chemicals, and from inventory accounts to sale flows, the leading digits within such numerical data tend to be smaller.

From a mathematical standpoint, the formula behind the Law in base 10 is P(d) = log10(1 + 1/d), where d is a leading digit from 1 to 9. For d = 1, the probability of “1” being the leading digit is approximately 30.1%, the highest among the nine digits.

**Figure 1:** Benford’s Curve, visualizing the Law, shows the approximate distribution as a percentage of leading digits numbered “1” through “9” (Collins 2).
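To make the formula concrete, here is a minimal Python sketch that tabulates the expected share of each leading digit in base 10. The function name `benford_first_digit_probs` is our own choice for illustration, not something taken from a library.

```python
import math

def benford_first_digit_probs():
    """Return {d: P(d)} for leading digits 1..9, using P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to 1, with “1” at roughly 30.1% and “9” at roughly 4.6%.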

Benford’s Law has additional properties that should be considered:

Scale invariance: if all data points are multiplied by a nonzero constant, the transformed dataset still follows the law. This is important when converting from one unit to another.
Sum invariance: the sum of the significands (the values rescaled to lie between 1 and 10) of all entries sharing a leading digit is approximately the same for every leading digit.
Base invariance: the law is not specific to base 10; when the numbers are written in another base b, the leading digits follow P(d) = log_b(1 + 1/d) for d from 1 to b - 1.
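Scale invariance is easy to check numerically. The sketch below is an illustration under our own assumptions: it uses the powers of 2 as a stand-in dataset (a sequence well known to track Benford’s Law closely) and an arbitrary multiplier of 7 to play the role of a unit conversion.

```python
import math
from collections import Counter

def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

def first_digit_freqs(values):
    """Observed share of each leading digit 1..9, as a list."""
    counts = Counter(leading_digit(v) for v in values)
    total = len(values)
    return [counts.get(d, 0) / total for d in range(1, 10)]

# Stand-in dataset: 2^1 .. 2^1000 (an assumption made for this demo).
data = [2 ** n for n in range(1, 1001)]
# Hypothetical "unit conversion": multiply every value by 7.
scaled = [7 * v for v in data]

expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
orig_freqs = first_digit_freqs(data)
scaled_freqs = first_digit_freqs(scaled)

print("digit  expected  original  scaled")
for d in range(1, 10):
    print(f"{d:>5}  {expected[d - 1]:.3f}     {orig_freqs[d - 1]:.3f}     {scaled_freqs[d - 1]:.3f}")
```

Both the original and the rescaled columns should sit close to the expected column, which is what scale invariance predicts.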

Visualizing Benford’s Law

Another way to visualize the statistical phenomenon is to consider a large-scale raffle with 200,000 tickets (numbered 1 to 200,000). Among tickets 1 to 10, the probability of a ticket starting with “1” is 2/10, or 20%. If you add 10 more tickets (numbered 1 through 20), the probability of a ticket starting with “1” rises to 11/20, or 55%. If you keep adding tickets, say numbered 1 through 50, the probability of the leading digit being “1” falls back to 11/50, or 22%. This decline continues until you reach the 100s. Between 100 and 199 the probability climbs again, and it then declines until you reach the 1,000s, where the cycle repeats. Seen this way, “1” is the most frequent leading digit because in every order of magnitude the numbers starting with “1” arrive first, ahead of those starting with sixes or nines.
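The raffle arithmetic can be reproduced with a few lines of Python. This is only a counting sketch; the helper name `share_starting_with_one` is ours.

```python
def share_starting_with_one(n_tickets: int) -> float:
    """Fraction of tickets numbered 1..n_tickets whose first digit is 1."""
    count = sum(1 for t in range(1, n_tickets + 1) if str(t)[0] == "1")
    return count / n_tickets

for n in (10, 20, 50, 99, 199, 999, 1999, 200_000):
    print(f"tickets 1..{n}: {share_starting_with_one(n):.1%}")
```

The share swings up each time a new block of tickets starting with “1” is added and drifts down afterwards, so across many raffle sizes “1” stays ahead of every other leading digit.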

Real-World Applications

Beyond simple, standalone digit tallies, which unearth very little information on their own, Benford’s Law has been applied in a multitude of industries to draw insightful conclusions. Fields such as computer engineering, statistical modeling, and fraud detection have recently seen its applications yield significant information. Within data science, Benford’s Law can flag anomalies that point to irregular patterns and manipulated datasets.

For example, when conducting a regression analysis on a given dataset, you may obtain a disappointing R² coefficient. If you alter the data to fit the regression curve more closely, you’d produce a higher R². At first glance, the distribution of the first digit may still look plausible, but a closer inspection of the subsequent digits can reveal that the dataset has been doctored, because their distribution no longer follows Benford’s Law. Fortunately, machine learning models nowadays can detect improper digit distributions and thereby help identify manipulated regression analyses and datasets. Let’s explore some of these real-world applications.

1. Financial Statements

Auditors, with the help of modern software, can detect falsified financial statements and overstated or understated profits by comparing the distribution of first digits with Benford’s Law. Researchers in countries such as South Korea, New Zealand, the U.K., and the U.S. have conducted in-depth research into potential profit manipulation. It is not uncommon for company executives to round profits up in an attempt to report more appealing, “fuller” numbers, e.g., from 19 million to 20 million.

As in other sectors where Benford’s Law can be applied, the intrinsic characteristics of the statistical phenomenon make it highly useful here. The figures in audits and financial statements are the kind of numerical data whose first digits tend to follow the law, so if the statements are authentic, the reported profits are expected to follow it too. It’s also worth mentioning that the Law relates to the notion of quantitative audit materiality, defined as “the threshold above which incorrect information in financial statements is considered to have an impact on the decision making by users” (Tammaru and Alver 470). The more research that’s done on how profits can be manipulated, the more relevant it becomes for auditors, who reach specific conclusions based on the data presented to them. The impacts of such falsification are huge, as they can affect future budget allocations, loans, employee compensation, and other aspects of a business.
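As a rough illustration of how such a first-digit comparison might look, the sketch below computes a chi-squared statistic against Benford’s expected counts. Everything in it is an assumption made for the demo: the “ledger” is a synthetic geometric series rather than real accounting data, the tampering rule (pushing any amount that starts with 8 or 9 up to the next power of ten) is a crude stand-in for rounding figures up, and the cutoff is the standard 5% critical value for 8 degrees of freedom. It is not the procedure used in the studies cited above.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
CHI2_CRITICAL = 15.507  # chi-squared critical value, 8 degrees of freedom, alpha = 0.05

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_chi2(amounts) -> float:
    """Chi-squared statistic of observed first-digit counts vs. Benford's expectation."""
    amounts = [a for a in amounts if a != 0]
    n = len(amounts)
    counts = Counter(leading_digit(a) for a in amounts)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

def round_up_if_high(a: float) -> float:
    """Hypothetical tampering: bump amounts starting with 8 or 9 to the next power of 10."""
    if leading_digit(a) >= 8:
        return 10 ** math.ceil(math.log10(a))
    return a

# Synthetic ledger spanning many orders of magnitude (illustrative only).
clean = [100 * 1.05 ** k for k in range(1, 601)]
tampered = [round_up_if_high(a) for a in clean]

for label, amounts in (("clean", clean), ("tampered", tampered)):
    stat = benford_chi2(amounts)
    verdict = "flag for review" if stat > CHI2_CRITICAL else "consistent with Benford"
    print(f"{label}: chi2 = {stat:.1f} -> {verdict}")
```

A statistic above the cutoff does not prove manipulation; it only suggests the figures deserve a closer look.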

2. COVID-19

During the COVID-19 pandemic, fake news, doctored data, and misleading headlines proliferated across the Internet. Researchers therefore tested whether reported datasets were authentic by checking whether they followed Benford’s Law. Using the COVID Tracking Project, which collects publicly available data about the coronavirus in the U.S., researchers Chase Marchand and Dalton Maahs were able to determine how different datasets compared to the distributions predicted by Benford’s Law. The duo used COVID Tracking Project data from all U.S. states and territories, updated between January 22, 2020, and October 7, 2020. They also analyzed World Health Organization data that tallied COVID statistics for most countries between January 3, 2020, and October 7, 2020.

The following datasets fit Benford’s Law well:

**Figures 2–4**: In order, the following distributions are shown: cumulative confirmed cases, the sum of cumulative confirmed cases and probable cases, and the new daily cases in the U.S. All three categories fit the Law well (Marchand and Maahs 4–5).

**Figure 5:** The distribution of new daily global deaths deviates the most from Benford’s Law. The biggest gap between predicted and actual values occurs for the digit “1”, which may be due to greater errors in calculating the number of deaths, relative to confirmed and probable cases, across the world (Marchand and Maahs 7).

As shown, 3 out of the 4 categories for the U.S. align with the expected distribution, and this holds for the world data too; this is expected as world outcomes are simply an aggregation of individual countries’ data and these categories for any given country, on average, follow the expected distribution.

The social benefits of utilizing Benford’s Law in this context are clear. While the pandemic was ongoing, it was important to ensure the validity and legitimacy of the datasets used to draw conclusions about the efficacy of vaccines, the benefits of masking protocols, and the advantages of other public health measures. If data had been purposefully manipulated, it would be difficult, and illogical, to draw conclusions about that region’s initiatives in combating COVID-19.

3. Image Processing

In forensic image analysis, natural images typically contain pixel gradients that follow Benford’s Law. To check whether an image is unaltered, the first step is to load the image into an R workspace. You then apply a discrete cosine transform to the pixel data and check whether the result fits Benford’s Law using a chi-squared test and a Mantissa Arc Test. This procedure outputs the Mean Absolute Deviation (MAD), the MAD conformity score, and a “distortion factor” that indicates how much the pixels deviate from a Benford distribution. This relates to the preferential attachment process, in which the more “popular,” or heavily linked, a node is, the more likely it is to gain new links. Throughout the iterative transformation process, then, pixels that share gradient codes are more likely to be linked to other similar pixels. For example, a cluster of darkly shaded pixels has a higher chance of being closely linked with other dark pixels than a lightly shaded pixel does. If an image is unaltered, the digits of its pixel data would naturally follow Benford’s Law, because they track the gradient of brightness and color, with lighter and darker shades being numbered differently.
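The MAD score mentioned above is simple to compute once you have a list of numbers. The sketch below is written in Python rather than R and deliberately skips the image loading and cosine transform steps; the two input lists are made-up stand-ins for transform coefficients, not values from a real image, and the helper names are ours.

```python
import math
from collections import Counter

def first_digit(x: float) -> int:
    """First significant digit of a nonzero number (the sign is ignored)."""
    return int(f"{abs(x):.15e}"[0])

def benford_mad(values) -> float:
    """Mean absolute deviation between observed and Benford first-digit shares."""
    values = [v for v in values if v != 0]
    n = len(values)
    counts = Counter(first_digit(v) for v in values)
    deviations = (
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(1, 10)
    )
    return sum(deviations) / 9

# Stand-in for coefficients of an untouched image: values of both signs
# spread smoothly over many orders of magnitude (an assumption for the demo).
natural_like = [(-1) ** k * 0.5 * 1.07 ** k for k in range(1, 501)]
# Stand-in for altered data: every leading digit is equally common.
flat = list(range(100, 1000))

print(f"natural-like MAD: {benford_mad(natural_like):.4f}")
print(f"flat MAD:         {benford_mad(flat):.4f}")
```

A smaller MAD means closer agreement with Benford’s Law. Where exactly to draw the line between “conforming” and “suspicious” is a judgment call that this sketch does not settle.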

**Figure 6:** The Stanford AI Lab depicts how an image can be interpreted and processed by a computer as a matrix of pixels. This is a grayscale image, so its pixels are numbered 0 to 255, with 0 being black and 255 being white, and it’s evident that nodes (pixels) typically sit adjacent to other nodes with similar values (Stanford Artificial Intelligence Lab 2).

4. Political Elections

While a Benford’s Law test alone cannot serve as proof of voter fraud, discrepancies from the Benford distribution can call for further investigation into voter turnout and election results. It’s important to note that deviations cannot be directly interpreted as fraud unless the assumptions of the Law hold. In many elections, two of the aforementioned assumptions are commonly violated: the vote counts must be large and span a wide range of magnitudes, and no numbers should be systematically excluded or favored. Deviations from the expected distribution are therefore not uncommon, since votes are often counted in groups of similar, allocated sizes, and different counties, regions, and states have different numbers of registered voters.

**Figure 7**: Distributions of votes such as these are not at all evidence of election fraud. During the 2020 U.S. election, many right-wing media channels claimed that Biden had somehow manipulated voter turnout, but multiple researchers and academics have stated that election fraud cannot be inferred from such distributions unless all assumptions of the statistical phenomenon are met (Bak Coleman et al. 3).

Nonetheless, in contexts where all the conditions of Benford’s Law hold, it can be used as supporting evidence of fraud. This happened in Iran’s 2009 presidential election, where the second digit of the vote counts for the winner, Mahmoud Ahmadinejad, deviated slightly from the expected distribution. It was determined that electoral fraud had occurred after third-party analysis found that several ballots corresponding to certain candidates were neglected more than what was statistically feasible. Also, hypothesis testing conducted by researchers at Columbia University found that the deviation in the latter digits, combined with the neglect of particular votes, was statistically significant: the probability of a fair election producing such outcomes was concluded to be less than 4.20%.
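Because the Iranian analysis looked at the second digit rather than the first, it helps to see what Benford’s Law expects there. The standard second-digit formula sums the two-digit probabilities over every possible first digit; the sketch below tabulates it, with a function name of our own choosing.

```python
import math

def benford_second_digit_probs():
    """Return {d2: P(second digit = d2)} for d2 in 0..9.

    P(d2) = sum over d1 = 1..9 of log10(1 + 1 / (10 * d1 + d2)).
    """
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }

second = benford_second_digit_probs()
for digit, p in second.items():
    print(f"{digit}: {p:.2%}")
```

The second-digit distribution is much flatter than the first-digit one, running from about 12.0% for “0” down to about 8.5% for “9”, which is why second-digit tests need large samples to detect anything.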

Significance of Benford’s Law

The explosive growth of Big Data has led to the proliferation of both fraudulent information and authentic truth. As such, it’s imperative that reported data is constantly compared against naturally occurring mathematical and scientific laws to ensure its legitimacy and veracity. In this article, we discussed Benford’s Law, a statistical phenomenon in which the frequency of leading digits follows a decreasing logarithmic pattern as you go up from “1” to “9”. We saw several applications of the law: how it is used in the corporate sector to detect profit manipulation in financial statements, how it was used to verify COVID-19 datasets on a national and global scale, how it helps in image processing procedures, and how it can flag possible election fraud if all the conditions of the law are met. By comparing the distribution of leading digits in a dataset to the expected distribution under Benford’s Law, analysts in all fields can detect anomalies and determine if further investigations are warranted. In a world where data is constantly being generated and used to reach important conclusions, Benford’s Law has proven itself to be an integral and transformative tool for assessing the integrity and accuracy of quantitative data.