Exploratory Data Analysis on NJ Transit Rail Performance in 2019

Simon Miller
INST414: Data Science Techniques
9 min read · Feb 10, 2024

Introduction

Trains are an integral part of transportation in and out of New York City, and NJ Transit (Transit) is a critical part of this network. Transit operates several heavy rail commuter services across New Jersey, with a significant focus on serving New York City through two terminal stations, New York Penn Station and Hoboken. This data analysis exercise focuses on the lines that serve these two stations. The subset of data used in this analysis comes from Kaggle users Pranav Badami and mzhang13’s “NJ Transit + Amtrak (NEC) Rail Performance” dataset, which covers the rail performance of Transit and Amtrak trains across Transit’s heavy rail network from March 2018 to May 2020. Given this context, the exercise answers the question: What time-trends are there for NJ Transit rail performance in 2019?

The Question

Two significant stakeholders could ask this question: NJ Transit and the commuter. Transit may ask it to learn which time periods need improved rail performance. For example, if trains are typically delayed around November and December for the holidays, Transit could add extra trains in anticipation of more passengers. Commuters may ask it to learn when they should anticipate delays. For example, they may try to catch an earlier train if they know that trains are typically late during a particular period.

Data Subset

The “NJ Transit + Amtrak (NEC) Rail Performance” subset of data comes from Kaggle users Pranav Badami and mzhang13. As per the Kaggle page, the data was gathered by a web scraper from Transit’s “DepartureVision” service, which provides the live status of services on Transit’s network. The dataset covers the rail performance of Transit and Amtrak trains from March 2018 to May 2020. It contains 29 CSV files: one for each month of data, plus two additional files that contain invalid trains. Each file, except for the invalid train files, contains the following columns:

· date

· train_id

· stop_sequence

· from

· from_id

· to

· to_id

· scheduled_time

· actual_time

· delay_minutes

· status

· line

· type

Data Cleaning

For this exercise, most of the columns are irrelevant. All columns except for line and delay_minutes were removed, since these two columns hold the only information needed to create the graphs. Because neither 2018 nor 2020 has data for the entire year, only the data from 2019 was used. Taken together, the monthly files are a large amount of data, so limiting the analysis to one year also made the program faster. However, the program can handle the additional monthly files provided in the Kaggle dataset.
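As an illustration of the column selection described above, the following is a minimal sketch rather than the program's actual code. It swaps in a different technique: loading only the two needed columns with the usecols parameter of pd.read_csv instead of dropping the others afterwards. The two sample rows are made-up placeholders in the dataset's column layout, not real records.

```python
import io

import pandas as pd

# Made-up placeholder rows in the dataset's column layout (not real records).
# The second row mimics a record with no delay_minutes value.
sample_csv = io.StringIO(
    "date,train_id,stop_sequence,from,from_id,to,to_id,"
    "scheduled_time,actual_time,delay_minutes,status,line,type\n"
    "2019-01-01,1001,1,A,1,B,2,2019-01-01 08:00:00,2019-01-01 08:02:00,"
    "2.0,departed,Main Line,NJ Transit\n"
    "2019-01-01,A55,1,A,1,B,2,,,,departed,Amtrak,Amtrak\n"
)

# Read only the two columns the analysis needs.
df = pd.read_csv(sample_csv, usecols=["line", "delay_minutes"])

# Drop rows that have no recorded delay.
df = df.dropna(subset=["delay_minutes"])

print(df)
```

Reading fewer columns may also reduce memory use on the larger monthly files, although that was not measured for this exercise.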

The records for Amtrak trains were missing several pieces of data, including the critical delay_minutes column. This led to the decision to remove all Amtrak trains from the scope of this analysis. Similarly, data for the Princeton Shuttle and the Atlantic City Line were incomplete, so both were removed from the scope entirely. With these lines and the Amtrak trains removed, only the following lines were left in the data:

· Bergen County Line

· Gladstone Branch

· Main Line

· Montclair-Boonton Line

· Morristown Line

· North Jersey Coast Line

· Pascack Valley Line

· Northeast Corridor Line

· Raritan Valley Line

Although the additional monthly datasets, the Princeton Shuttle, the Atlantic City Line, and the Amtrak trains would have added value to the analysis, the reduced scope allowed for a dedicated focus on the Transit rail lines that serve New York City commuters in 2019.
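The line filtering described above can also be expressed as a single boolean mask. This is a minimal sketch with placeholder values; the two excluded names are the spellings that appear in the program's code, while EXCLUDED_LINES, monthly and kept are hypothetical names used only for illustration.

```python
import pandas as pd

# Line names as they are spelled in the program's code.
EXCLUDED_LINES = ["Atl. City Line", "Princeton Shuttle"]

# Placeholder averages indexed by line name (values are not real data).
monthly = pd.DataFrame(
    {"avg_delay": [1.0, 2.0, 3.0]},
    index=pd.Index(["Atl. City Line", "Main Line", "Princeton Shuttle"], name="line"),
)

# Keep every row whose index label is not in the excluded list.
kept = monthly[~monthly.index.isin(EXCLUDED_LINES)]

print(kept)
```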

The following map showcases the chosen lines and how they serve New York City via NY Penn Station and Hoboken.

Source: https://www.njtransit.com/

Data Analysis

What time-trends are present for NJ Transit rail performance in 2019?

To answer the question, the monthly average delay was calculated for each line so that the averages could be used as points on a graph. The program was written in Python, primarily using Pandas to store information in DataFrames and Matplotlib to graph the results. The first function handles cleaning the data and finding the mean for each line:

import statistics as stat #aliases match the calls used in the functions below

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def cleanAndMean(filename):
    """
    This function takes in a .csv file that contains train performance data and turns it into a data frame containing average
    train delays for the month.
    The following columns are removed: stop_sequence, date, train_id, type, from, from_id, to, to_id, scheduled_time, actual_time and status.
    The following lines are removed: Atl. City Line and Princeton Shuttle
    Note that each file contains one month's worth of data.

    Args:
        filename(str): A .csv file
    Returns:
        dfNew(DataFrame): A pandas DataFrame containing only the line name and average delay in minutes for the month.
    """

    #Clean the columns up
    dfOriginal = pd.read_csv(filename) #open the .csv file
    dfOriginal.drop(columns=['stop_sequence', 'date', 'train_id', 'type', 'from', 'from_id', 'to', 'to_id', 'scheduled_time',
                             'actual_time', 'status'], inplace=True) #drop unwanted columns
    dfOriginal.dropna(inplace=True) #drop anything that is NA

    #Group by line and find each line's mean delay
    dfNew = dfOriginal.groupby(['line']).mean() #one row per line, holding its mean late time
    yearMonth = filename[-11:-4] #the seven characters before ".csv" hold the year and month
    dfNew = dfNew.rename(columns={"delay_minutes": f"avg_delay_{yearMonth}"}) #add year and month to column name

    #Remove unwanted lines (Atl. City Line and Princeton Shuttle) if they are present
    dfNew = dfNew.drop(index=["Atl. City Line", "Princeton Shuttle"], errors="ignore")

    return dfNew #return the data frame

The “cleanAndMean” function removes the unnecessary columns, groups the rows by rail line, finds the mean for each rail line, records the year and month in the column name, removes the two unwanted rail lines and then returns a Pandas DataFrame. The DataFrame contains the mean for each line over one month. The main function of the program combines DataFrames into lists of averages for each line:

def main(filenames):
    """
    The function sorts through files, finds averages using the cleanAndMean function and uses matplotlib to display two graphs,
    with each containing monthly average values.
    The first graph will show average data for the following NJ Transit Lines:
        Bergen County Line
        Gladstone Branch
        Main Line
        Montclair-Boonton Line
        Morristown Line
        North Jersey Coast Line
        Northeast Corridor Line
        Pascack Valley Line
        Raritan Valley Line
    The second graph will contain a monthly average for all the above lines and the following two targeted lines that are of interest:
        North Jersey Coast Line
        Pascack Valley Line

    Args:
        filenames(list of str): A list of .csv files
    """

    #create empty lists for x and y values
    dates = []
    yBergenValues = []
    yGladstoneValues = []
    yMainValues = []
    yMontclairValues = []
    yMorristownValues = []
    yNoValues = []
    yNortheastValues = []
    yPascackValues = []
    yRaritanValues = []
    yAvgValues = []


    for file in filenames: #for each file...

        avgList = [] #create empty avg list

        df = cleanAndMean(file) #create the dataframe
        columnName = df.columns[0]
        date = columnName[10:] #pull the date from the column name
        dates.append(date) #add to the dates list

        #Rows are read by position, which relies on groupby sorting the line names alphabetically

        #Pull and append data for Bergen Co. Line
        avgDelay = df.iloc[0][columnName]
        yBergenValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Gladstone Branch
        avgDelay = df.iloc[1][columnName]
        yGladstoneValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Main Line
        avgDelay = df.iloc[2][columnName]
        yMainValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Montclair-Boonton
        avgDelay = df.iloc[3][columnName]
        yMontclairValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Morristown Line
        avgDelay = df.iloc[4][columnName]
        yMorristownValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for No Jersey Coast
        avgDelay = df.iloc[5][columnName]
        yNoValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Northeast Corrdr
        avgDelay = df.iloc[6][columnName]
        yNortheastValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Pascack Valley
        avgDelay = df.iloc[7][columnName]
        yPascackValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Pull and append data for Raritan Valley
        avgDelay = df.iloc[8][columnName]
        yRaritanValues.append(avgDelay)
        avgList.append(avgDelay) #append to avg list

        #Find the average for the month and append it to the list
        yAvgValues.append(stat.mean(avgList))

        #print(f"{df.iloc[1]} with avg delay of {avgDelay}")

    #print(dates)

    #Create arrays out of the values
    xDates = np.array(dates)
    yBergenPoints = np.array(yBergenValues)
    yGladstonePoints = np.array(yGladstoneValues)
    yMainPoints = np.array(yMainValues)
    yMontclairPoints = np.array(yMontclairValues)
    yMorristownPoints = np.array(yMorristownValues)
    yNoPoints = np.array(yNoValues)
    yPascackPoints = np.array(yPascackValues)
    yNortheastPoints = np.array(yNortheastValues)
    yRaritanPoints = np.array(yRaritanValues)
    yAvgPoints = np.array(yAvgValues)

    #Create Full Graph
    plt.plot(xDates, yBergenPoints, c = '#BBCBE2') #plot Bergen points in blue-silver.
    plt.plot(xDates, yGladstonePoints, c = '#A1D5AE') #plot Gladstone points in mint.
    plt.plot(xDates, yMainPoints, c = '#FFD006') #plot Main points in yellow.
    plt.plot(xDates, yMontclairPoints, c = '#E66D5C') #plot Montclair points in salmon.
    plt.plot(xDates, yMorristownPoints, c = '#00A850') #plot Morristown points in green.
    plt.plot(xDates, yNoPoints, c = '#00A3E4') #plot North Jersey Coast points in blue.
    plt.plot(xDates, yNortheastPoints, c = '#EE3A43') #plot Northeast points in red.
    plt.plot(xDates, yPascackPoints, c = '#A0218C') #plot Pascack points in purple.
    plt.plot(xDates, yRaritanPoints, c = '#FBA536') #plot Raritan points in orange.
    plt.show() #show the full graph

    #Create Focused graph
    plt.plot(xDates, yAvgPoints, c = 'black', linewidth = 4) #plot average values in black.
    plt.plot(xDates, yNoPoints, c = '#00A3E4') #plot North Jersey Coast points in blue.
    plt.plot(xDates, yPascackPoints, c = '#A0218C') #plot Pascack points in purple.
    plt.show() #show the focused graph

The main function receives a list of CSV files from the “if __name__ == '__main__'” statement. These filenames are iterated over and a DataFrame is created for each one using the “cleanAndMean” function. The date for each DataFrame and the average for each line are added to lists. Using Matplotlib, the data is graphed with the year and month on the x-axis and the average time late in minutes on the y-axis. The function creates two graphs. Graph 1 shows all the lines and Graph 2 focuses on two lines and the overall average, with both graphs showing significant trends.

Above: Graph 1
Above: Graph 2
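The entry point that passes filenames to the main function is not shown in this post. The following is a minimal sketch of how such a list could be built, assuming the monthly files are named like 2019_01.csv and sit in a folder called data. Both the file naming and the folder are assumptions, and build_2019_filenames is a hypothetical helper that is not part of the program.

```python
from pathlib import Path


def build_2019_filenames(data_dir="data"):
    """Return twelve monthly CSV paths for 2019 in calendar order.

    The "2019_MM.csv" naming pattern and the default folder are assumptions.
    """
    return [str(Path(data_dir) / f"2019_{month:02d}.csv") for month in range(1, 13)]


if __name__ == "__main__":
    filenames = build_2019_filenames()
    # In the full program this list would be handed to main(filenames).
    print(filenames)
```

With this naming, the slice filename[-11:-4] used in “cleanAndMean” yields values such as 2019_01, and calendar order keeps the x-axis of both graphs chronological.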

One of the most significant characteristics of the two graphs is the peak in delays in July. According to the average in Graph 2, delays are approximately a minute greater in July than in April. All lines in Graph 1 show this increase in delays, and the average in Graph 2 follows. The average resembles a bell curve, with one tail starting in April, the peak three months later in July, and the other tail ending two months later in September. Graph 1 and the average in Graph 2 show that the increase in delays was felt system-wide. Because the North Jersey Coast Line is an outlier that dramatically affects the average, it is difficult to judge the standard deviation of the bell curve graphically.
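The comparisons read off the graphs above could also be computed directly. The following is a rough sketch using made-up placeholder numbers and hypothetical line names (Line A, Line B, Line C), so none of its output reflects the real 2019 data. It finds the peak month of the system average, the rise between April and July, and the average with one outlier line left out.

```python
import pandas as pd

# Placeholder monthly averages: rows are months, columns are hypothetical lines.
# These numbers are invented for illustration and are not taken from the dataset.
wide = pd.DataFrame(
    {
        "Line A": [3.0, 4.0, 3.0],
        "Line B": [3.0, 4.0, 3.0],
        "Line C": [3.0, 7.0, 3.0],  # plays the role of an outlier line
    },
    index=["2019_04", "2019_07", "2019_09"],
)

system_avg = wide.mean(axis=1)                           # average across lines per month
peak_month = system_avg.idxmax()                         # month with the highest average
rise = system_avg["2019_07"] - system_avg["2019_04"]     # April-to-July increase
avg_without_c = wide.drop(columns="Line C").mean(axis=1) # average excluding the outlier

print(peak_month, rise)
print(avg_without_c)
```

Comparing system_avg with avg_without_c shows how much a single line pulls the overall average, which is hard to judge from the graph alone.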

The two lines of focus in Graph 2 are the North Jersey Coast Line and the Pascack Valley Line. The North Jersey Coast Line shows significantly greater average delays during the delay event surrounding July, by upwards of one and a half minutes. The Pascack Valley Line shows the opposite: it had the smallest rise in delays during July. Additionally, the slope of its plotted line between May and June is greater than between June and July, which is unlike any other line on Graph 1.

Conclusion

Overall, the following conclusions on time-trends can be made: there is a significant peak in delays in July compared to any other month in 2019; the North Jersey Coast Line had the greatest delays in July; and the Pascack Valley Line had the smallest delays in July.

This exercise was limited and there are many areas that can be improved. Plotting the daily average of each rail line instead of the monthly average would allow for a more detailed analysis, especially of the July delay increases and how long they lasted. Researching why delays are so prevalent in July would also add to the analysis.
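A daily average, as suggested above, could be sketched by grouping on both the date and the line. This is only an outline of the idea and is not part of the program. It assumes the date column holds calendar dates, and the four rows are made-up placeholders.

```python
import io

import pandas as pd

# Made-up placeholder rows (not real records), keeping only the needed columns.
sample_csv = io.StringIO(
    "date,line,delay_minutes\n"
    "2019-07-01,Main Line,2.0\n"
    "2019-07-01,Main Line,4.0\n"
    "2019-07-02,Main Line,1.0\n"
    "2019-07-01,Pascack Valley,0.0\n"
)

df = pd.read_csv(sample_csv, parse_dates=["date"])

# Mean delay per day and line, reshaped so each line becomes its own column.
daily = df.groupby(["date", "line"])["delay_minutes"].mean().unstack("line")

print(daily)
```

Each column of daily could then be passed to plt.plot in the same way as the monthly arrays. Days on which a line has no records appear as missing values.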

Furthermore, there may be bias in how the data is measured and gathered. The accuracy of the times reported by Transit’s “DepartureVision” and the accuracy of the web scraper used by Pranav Badami and mzhang13 could both affect the data. Regardless, a further and more detailed analysis of the July 2019 peak in delay times should be conducted.

Resources

Dataset: https://www.kaggle.com/datasets/pranavbadami/nj-transit-amtrak-nec-performance

DepartureVision: https://www.njtransit.com/dv-to

GitHub Repository: https://github.com/smiller1551/RailPerformanceAnalysis

This Medium post was created by Simon Miller at the University of Maryland — College Park for INST414: Data Science Techniques under Professor Cody Buntain.
