Does MTA Turnstile Data Follow Benford’s Law?

Published in The Startup, Sep 27, 2020

I recently learned about Benford’s Law through the Netflix series Connected, with Latif Nasser. Episode 4: ‘Digits’ explores how this phenomenon applies to music, social media, tax fraud, and more. It’s a fun show and I’d highly recommend checking it out if you’ve got the time.

A quick primer on Benford’s Law from Wikipedia:

Benford’s law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small. For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford’s law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.

This frequency distribution is visualized below:

Benford’s Law describes the frequency distribution of leading digits in observed data
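The distribution above comes from a simple formula: the expected share of leading digit d is log10(1 + 1/d). Here is a minimal sketch that computes those percentages using only the standard library. The helper name `benford_expected` is hypothetical and is not taken from the notebook.

```python
import math

def benford_expected(digits=range(1, 10)):
    """Return Benford's expected leading-digit frequencies as percentages."""
    # P(d) = log10(1 + 1/d) for each leading digit d
    return {d: 100 * math.log10(1 + 1 / d) for d in digits}

expected = benford_expected()
for digit, pct in expected.items():
    print(digit, round(pct, 1))
```

The nine percentages sum to 100, with 1 at roughly 30.1% and 9 at roughly 4.6%, which matches the figures in the Wikipedia primer.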

Since watching the episode, I’ve had it in the back of my mind that I’d like to explore a dataset’s Benfordness. Last week I started Metis’ Data Science Bootcamp and our first project was to conduct an exploratory analysis on MTA turnstile data. With the project behind me, I decided to dig a little more into the data and see how well NYC’s subway traffic patterns fit Benford’s Law.

In this post, I’ll outline my process for data collection and preparation and share some of my findings. I’ll also provide some snippets of the code used along the way. You can find the full details in my Jupyter Notebook on GitHub.

Data Cleaning & Preparation

I decided to import the last four weeks of available data which, at the time, were the weeks of Sept 5, Sept 12, Sept 19, and Sept 26, 2020.

import pandas as pd

def get_data(week_nums):
    # Weekly files are named by date in YYMMDD form
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)

import_weeks = [200926, 200919, 200912, 200905]

mta_raw = get_data(import_weeks)
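If you want to sample a different window, the week codes can be generated rather than typed by hand. This is a small sketch that assumes the files keep being published weekly on the same weekday with the same YYMMDD naming; `week_codes` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta

def week_codes(last_file_date, n_weeks):
    """Build YYMMDD codes for n_weeks of weekly files, newest first."""
    codes = []
    for i in range(n_weeks):
        day = last_file_date - timedelta(weeks=i)
        codes.append(int(day.strftime("%y%m%d")))
    return codes

import_weeks = week_codes(date(2020, 9, 26), 4)
print(import_weeks)  # [200926, 200919, 200912, 200905]
```

The result can be passed straight to `get_data` in place of the hard-coded list.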

Below you can see a view of the raw data set in a Pandas data frame:

The MTA turnstile counters provide cumulative figures for each time period provided (usually 4 hours). For example, we might see 15,470,879 at 4am and then 15,470,894 at 8am. To get the actual number of entries or exits in a given time period, I needed to take the difference between rows.

Kaggle provided a simple approach in their MTA Turnstile Data Analysis article, which I used here. I also added the ‘abs()’ method to account for the few instances where turnstiles actually count backwards.

mta_sorted = mta_raw.sort_values(['turnstile', 'timestamp'])
mta_sorted = mta_sorted.reset_index(drop=True)

turnstile_grouped = mta_sorted.groupby(['turnstile'])

# Difference consecutive cumulative readings within each turnstile;
# abs() handles counters that run backwards
mta_sorted['entries'] = (
    turnstile_grouped['entries_cum']
    .transform(pd.Series.diff).abs())

mta_sorted['exits'] = (
    turnstile_grouped['exits_cum']
    .transform(pd.Series.diff).abs())
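To see what the differencing step does, here is a toy example with two made-up turnstiles. The data is invented for illustration, reusing the 15,470,879 and 15,470,894 readings mentioned above. It uses the equivalent `groupby(...).diff()` shorthand rather than `transform`.

```python
import pandas as pd

# Toy data: two hypothetical turnstiles with cumulative entry counts
toy = pd.DataFrame({
    'turnstile': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime([
        '2020-09-26 04:00', '2020-09-26 08:00', '2020-09-26 12:00',
        '2020-09-26 04:00', '2020-09-26 08:00']),
    'entries_cum': [15470879, 15470894, 15470950, 500, 480],
})

toy = toy.sort_values(['turnstile', 'timestamp']).reset_index(drop=True)

# diff() within each turnstile; the first row of each group has no
# previous reading, so it comes out as NaN
toy['entries'] = toy.groupby('turnstile')['entries_cum'].diff().abs()
print(toy[['turnstile', 'entries_cum', 'entries']])
```

Turnstile B counts backwards from 500 to 480, and `abs()` turns that into 20 entries. The first reading of each turnstile has nothing to subtract from, so those rows need to be dropped or ignored later.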

When counters reset, it creates some pretty big outliers. I decided to toss out anything above 10,000 entries or exits per turnstile per time period. This works out to about 40 people per minute, which isn’t unreasonable for some stations.
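A cutoff like this can be applied with a boolean mask. The sketch below runs on toy values and makes two assumptions: that rows are dropped when either count exceeds the cutoff, and that the `total` column used later is simply entries plus exits. The notebook may do either differently.

```python
import pandas as pd

MAX_PER_PERIOD = 10_000  # cutoff from the text above

# 10,000 riders spread over a 4-hour (240-minute) period
print(round(MAX_PER_PERIOD / 240, 1))  # about 41.7 per minute

# Toy stand-in for the differenced data
toy = pd.DataFrame({
    'entries': [15.0, 9_999.0, 2_500_000.0, None],
    'exits':   [8.0, 120.0, 30.0, 12.0],
})

# Keep rows where both counts are at or below the cutoff;
# comparisons against NaN are False, so rows with missing diffs drop out too
mask = (toy['entries'] <= MAX_PER_PERIOD) & (toy['exits'] <= MAX_PER_PERIOD)
mta = toy[mask].copy()

# Assumption: 'total' is entries plus exits for the period
mta['total'] = mta['entries'] + mta['exits']
print(mta)
```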

After scrubbing the data, I set up a few data frames for Benford’s testing.

Finding the leading digit for each entry/exit

Here I set up new columns that would hold the value of the leading digit for each entry, exit, and total value. Values were first converted to strings, sliced to return the first character, and then converted back to integers.

mta['ent_dig'] = mta['entries'].astype(str)
mta['ent_dig'] = (mta['ent_dig']
                  .apply(lambda x: x[0])).astype(int)

mta['ex_dig'] = mta['exits'].astype(str)
mta['ex_dig'] = (mta['ex_dig']
                 .apply(lambda x: x[0])).astype(int)

mta['total_dig'] = mta['total'].astype(str)
mta['total_dig'] = (mta['total_dig']
                    .apply(lambda x: x[0])).astype(int)

Some turnstiles had zero entries or exits for some time periods. Because we’re only interested in getting the leading digits for 1 through 9, I excluded those rows from subsequent data frames by applying a simple filter.

benford_ents = mta[mta['ent_dig'] > 0] 
benford_exits = mta[mta['ex_dig'] > 0] 
benford_total = mta[mta['total_dig'] > 0]
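The same string round trip can be written a little more compactly with the pandas `.str` accessor. This is only a sketch on toy values; `leading_digit` is a hypothetical helper, and it assumes missing values have already been removed and that counts are non-negative whole numbers.

```python
import pandas as pd

def leading_digit(series):
    """Return the first digit of each non-negative whole-number value."""
    # Cast to int first so 15.0 becomes '15', then take the first character
    return series.astype(int).astype(str).str[0].astype(int)

toy = pd.DataFrame({'entries': [15.0, 0.0, 9.0, 2048.0]})
toy['ent_dig'] = leading_digit(toy['entries'])

# Drop zeros, since Benford's Law only covers leading digits 1 through 9
benford_ents = toy[toy['ent_dig'] > 0]
print(benford_ents)
```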

Below you can see a cleaned up data frame with leading digits in the last three columns:

Cleaned up dataframe with leading digit columns

From here, I set up slimmer data frames to check Benfordness.

Setting up the Entries DataFrame

With the zeroes gone, the following trims the entries data down to the columns needed and creates a DataFrame with the value counts of each leading digit.

benford_ents = benford_ents[[
    'station', 'date', 'time',
    'timestamp', 'weekday',
    'weekday_num', 'ent_dig']]

nums = list(range(1, 10))
# value_counts() orders by frequency, so sort by digit to line up with nums
benford_ents_count = list(benford_ents.ent_dig.value_counts().sort_index())
benford_ents_df = pd.DataFrame({'digit': nums,
                                'counts': benford_ents_count},
                               index=nums)

Next, I calculated the total number of counts so I could take the percent frequency of each digit and compare it to Benford’s Law.

total_entries = benford_ents_df.counts.sum()

# Creating % Frequency Column
benford_ents_df['percent'] = (
    benford_ents_df['counts']
    .apply(lambda x: (x / total_entries) * 100)
    .round(decimals=1))

# Adding Benford's Law %s,
# and some calculation columns to see the difference
# (benfords is assumed to be a DataFrame of expected percentages
# indexed by digit, defined earlier in the notebook)
benford_ents_df['benfords'] = benfords['percent']

benford_ents_df['diff_abs'] = (
    benford_ents_df['percent'] - benford_ents_df['benfords']).abs()

benford_ents_df['diff_perc'] = (
    benford_ents_df['diff_abs'] / benford_ents_df['percent']) * 100

This provided a simple view of the digit counts, percent frequency, and how well it compares to Benford’s distribution:

From the table view above, we can already see that total entries is pretty Benfordy. Plotting the results side by side gives a better view.
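Since the same table is needed for entries, exits, and totals, the steps above could be wrapped in one helper. The sketch below is an assumption about how that might look, not the notebook's code: `benford_table` is a hypothetical name, and it uses `reindex` so that a digit that never appears still gets a row with 0%.

```python
import math
import pandas as pd

def benford_table(digits):
    """Compare observed leading-digit shares with Benford's expected shares."""
    idx = range(1, 10)
    # normalize=True gives proportions; reindex puts digits in 1..9 order
    # and fills any digit that never occurs with 0
    observed = (digits.value_counts(normalize=True)
                      .reindex(idx, fill_value=0) * 100)
    expected = pd.Series([100 * math.log10(1 + 1 / d) for d in idx], index=idx)
    table = pd.DataFrame({'percent': observed, 'benfords': expected})
    table['diff_abs'] = (table['percent'] - table['benfords']).abs()
    return table.round(1)

# Toy digits; with the real data this would be a column such as ent_dig
sample = pd.Series([1, 1, 1, 2, 2, 3, 4, 5, 9, 1])
table = benford_table(sample)
print(table)
```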

Plotting the results

Here we can see Benford’s distribution next to total entries:

Total entries for this time period lines up nicely with Benford’s distribution

I’d say entries lines up quite nicely with Benford’s distribution!

Let’s take a look at exits:

Total exits compared with Benford’s distribution

Pretty spot on.

Curious to see if high traffic stations in Manhattan were somehow biasing the data, I decided to compare a high traffic station in Manhattan with a more residential station in a different borough: Times Square 42nd Street and Prospect Av in Brooklyn.

Comparing Benford’s law with TmSq-42nd St and Prospect Av stations

These both seem to fit Benford’s to a T! We can see a few instances where Times Square fits better than Prospect Av, and vice versa. But just eyeballing the differences, I’d call it a wash as to which station fits best.

Most importantly, the overall principle holds for each station: smaller leading digits occur more frequently than larger ones. In no instance is the frequency of a 4, for example, lower than that of any larger digit.

Final Thoughts and Suggestions for Further Analysis

Every view I took of the MTA turnstile data seemed to fit Benford’s Law. This raises a beautiful idea that could be better supported with further analysis: the entire MTA subway system shares the same properties as Fibonacci numbers, stock market values, income tax data, and other data observed in the natural sciences.

It makes me think of the NYC subway system as this living organism that obeys certain natural laws, just like any other, because the data ultimately reflects our humanness.

Suggested further analysis

Please feel free to use my notebook as a starting point!

Compare other stations for Benfordness
Sample other time periods, both longer and shorter than 4 weeks
Test for Benfordness at the turnstile level
Are there any views of the MTA turnstile data that disobey Benford’s Law?
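As a possible starting point for the station-level and turnstile-level ideas, here is a rough sketch that scores each group by its mean absolute deviation from Benford's proportions. This scoring choice is an assumption made for illustration, not something used in the analysis above, and the data is a toy stand-in for the `station` and `ent_dig` columns.

```python
import math
import pandas as pd

# Benford's expected proportions for digits 1 through 9
BENFORD = pd.Series({d: math.log10(1 + 1 / d) for d in range(1, 10)})

def benford_mad(digits):
    """Mean absolute deviation between observed and Benford proportions."""
    observed = (digits.value_counts(normalize=True)
                      .reindex(BENFORD.index, fill_value=0))
    return (observed - BENFORD).abs().mean()

# Toy stand-in: station X looks roughly Benford-like, station Y does not
toy = pd.DataFrame({
    'station': ['X'] * 9 + ['Y'] * 9,
    'ent_dig': [1, 1, 1, 2, 2, 3, 4, 5, 7] + [9] * 9,
})

# One score per station; smaller means closer to Benford's distribution
scores = toy.groupby('station')['ent_dig'].apply(benford_mad).sort_values()
print(scores)
```

Swapping `'station'` for a turnstile identifier would give the turnstile-level view, and sorting the scores in descending order would surface the groups that fit Benford's Law least well.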