The Keys to Building Durable Public Transit Infrastructure

Aadil Saif
INST414: Data Science Techniques
Dec 17, 2023

Overview

Quality public transit is a key part of what makes a city livable and desirable. While the mid-to-late 20th century saw a huge boom in car infrastructure and the creation of the American suburbs, recent trends show more and more young people wanting to live car-free lives. The infrastructure and systems for this exist en masse in most major metropolitan areas outside of North America, but with a few outliers, most US cities don’t have a well-regarded public transit system.

The goal of this study is twofold. First, it aims to determine the key factors and habits of cities that have a “good” public transit system, where “good” is defined through this study using the metrics outlined below. This will allow potential stakeholders to determine which policies and changes to enact in order to improve the public transit systems of their respective cities.

The second half of this study looks at the major metropolitan areas (population > 500,000) that have high-ranking public transit systems and identifies smaller metropolitan areas that are similar to them. Potential stakeholders can use this to determine which small cities have the potential to develop excellent transit systems. It can also help those who prefer smaller-city life find smaller cities with many of the same qualities as highly desired large ones.

The Data

The data for this study was primarily sourced from three databases with publicly available APIs. The first was the National Transit Database (NTD), hosted by the Department of Transportation. This database supplied the bulk of the data, as it contains complete reported statistics from every transit agency in the US, including operating expenses, fare revenues, and passenger miles traveled. To supplement this with demographic information about each Urbanized Area (UZA) considered, the US Census was a natural choice due to its breadth of data and coverage of the US population. Figures such as population, area GDP, household income, and commuting percentages were drawn from the Census database. The last data source was the American Public Transportation Association (APTA) database. The APTA is a non-profit focused on developing public transit infrastructure in the US. While this database doesn’t have the same coverage as the other two sources, it covers the major metropolitan areas well enough to supply some of the metrics behind the Transit Score, the custom metric created to grade the existing public transit systems in the major metros. The APTA database was used to determine Congestion Hours, Household Transit Expenditure, and Transit Use Percentage.

Both the NTD and APTA databases offered data ranging from the early 2000s to as recently as 2022. While I was initially inclined to simply take the most recent data from 2022 for both, after some research I decided to use the 2019 data instead. This was for two main reasons. First, the Census is only conducted every 10 years and the most recent Census data was from April 2020, making it closest to the 2019 calendar year. Second, data from 2020 onward had many outliers and abnormalities due to COVID and the reduced operations it caused. Thus, it made the most sense to use the 2019 data.

Course Concepts

Data Cleaning and Manipulation
All the data from the various datasets had to go through a rigorous cleaning and manipulation process before apt comparisons could be drawn. Agencies had to be assigned to their appropriate UZAs, and data often had to be aggregated across multiple agencies within the same UZA. Moreover, a Transit Score statistic had to be created from factors deemed important indicators of how good a transit system is. These factors were normalized for scale differences by calculating each one’s z-score, and the total score was the sum of the z-scores across all factors. Those factors were:

  • Percent of transit that is public transit
  • Transit Passenger Miles per Capita
  • Congestion Delay Hours per Capita
  • Household expenditures on Transit as a percentage of Total Household Income
  • Percentage of workforce that commutes using public transit
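The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration and not the author's actual code: the area names, factor names, and values are invented, and only three of the five factors are shown. The article states that the score is the sum of the z-scores; it does not say whether any factor (for example congestion delay, where lower is arguably better) was sign-flipped, so the optional `signs` argument is purely an assumption and defaults to a plain sum.

```python
from statistics import mean, pstdev

# Hypothetical factor values per urbanized area (UZA); invented for illustration.
components = {
    "Area A": {"transit_share": 0.30, "pmt_per_capita": 900.0, "commute_share": 0.25},
    "Area B": {"transit_share": 0.10, "pmt_per_capita": 300.0, "commute_share": 0.08},
    "Area C": {"transit_share": 0.05, "pmt_per_capita": 150.0, "commute_share": 0.03},
}


def z_scores(values):
    """Standardize a list of numbers to mean 0 and (population) std dev 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant factor carries no information; score it as 0 everywhere.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def transit_scores(table, signs=None):
    """Sum the per-factor z-scores for each area.

    signs is an assumed extension: a dict mapping a factor name to +1 or -1,
    for factors where a lower raw value should raise the score.
    """
    signs = signs or {}
    areas = list(table)
    factors = list(next(iter(table.values())))
    totals = {area: 0.0 for area in areas}
    for factor in factors:
        zs = z_scores([table[area][factor] for area in areas])
        sign = signs.get(factor, 1)
        for area, z in zip(areas, zs):
            totals[area] += sign * z
    return totals


scores = transit_scores(components)
for area, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{area}: {score:+.2f}")
```

Because each factor is standardized before summing, a factor measured in hundreds of miles does not drown out one measured as a fraction, and the scores across all areas sum to roughly zero.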

Linear Regression
Another course concept that was applied was multiple linear regression, which was used to model the features against the target (Transit Score). A multiple linear regression was chosen over a series of single linear regressions because it is better suited to account for overlap between correlated features. The fitted regression also served as a predictive model for identifying low-population cities with the potential to develop strong public transit systems.
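To make the idea concrete, here is a small standard-library sketch of a multiple linear regression fitted by ordinary least squares through the normal equations, followed by a prediction for an unseen city. It is an illustration under stated assumptions rather than the study's implementation (which likely used a library): the function names, the two features, and the data are all invented, and the target is constructed to be exactly linear so the recovered coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Apply fitted coefficients to one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Invented training data: (cost_recovery_ratio, rail_share_of_pmt) per large city.
features = [(0.2, 0.0), (0.3, 0.1), (0.4, 0.5), (0.5, 0.3), (0.6, 0.8)]
# Toy target built as 1 + 2*x1 + 3*x2 so the fit can be verified by eye.
targets = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in features]

coefs = fit_ols(features, targets)
print([round(c, 6) for c in coefs])

# Score a hypothetical small city that was not part of the training data.
print(round(predict(coefs, (0.1, 0.2)), 6))
```

Fitting all features at once is what lets the model split credit between correlated features, which a series of single-feature regressions cannot do.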

Similarity
Similarity was determined using a pairwise distances function to find the low-population cities most similar to high-population cities with good transit scores. This allows stakeholders to find smaller cities that resemble well-scoring big cities.
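A rough sketch of this matching step is shown below. The article does not name the distance metric or the library function used, so Euclidean distance via the standard library's `math.dist` is an assumption here, as are the city labels and feature vectors, which are invented and presumed to be already standardized.

```python
from math import dist

# Hypothetical, already-standardized feature vectors; values are invented.
large_cities = {
    "Large A": (1.2, 0.8, 0.5),
    "Large B": (0.9, 1.5, -0.2),
}
small_cities = {
    "Small X": (1.0, 0.9, 0.4),
    "Small Y": (-0.8, -1.0, 0.1),
    "Small Z": (0.7, 1.4, 0.0),
}


def nearest_small_cities(large, small, top_n=1):
    """For each large city, list the top_n closest small cities (Euclidean)."""
    result = {}
    for big_name, big_vec in large.items():
        ranked = sorted(small, key=lambda name: dist(big_vec, small[name]))
        result[big_name] = ranked[:top_n]
    return result


matches = nearest_small_cities(large_cities, small_cities)
for big_name, names in matches.items():
    print(big_name, "->", names)
```

Standardizing the features before measuring distance matters: otherwise whichever feature has the largest raw scale would dominate the comparison.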

Analysis

Determining Which Cities Have Good Transit
Before any predictions or insights can be drawn, it is important to identify what constitutes a good public transit system. While the obvious “known” answers like NYC, Washington DC, and Chicago are a good starting point, the standard for good transit needs to be identified quantitatively. To help visualize this, we can plot the distribution of Transit Scores as a line histogram, which can then be used to find cities that perform well relative to the rest. The chart below visualizes the distribution of Transit Scores.

Using Linear Regression to Determine Important Factors
In order to determine which variables have the greatest effect on the Transit Score, a multiple linear regression was employed. The first step was to identify the factors that would serve as the features, or independent variables. It was important to use normalized or per capita data in order to best account for the massive population differences between urban areas. The independent variables used as features were:

  • Cost Recovery Ratio
  • Percentage of PMT that is Rail
  • Per Capita Transit Operating Expenses/yr
  • Per Capita GDP
  • Transport Spending/GDP per Capita
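As an illustration of how features like these could be derived, the sketch below rolls agency-level records up to their UZA and then computes three of the listed ratios. The records, field names, and populations are invented and do not reflect the NTD's actual schema; the Cost Recovery Ratio follows the article's definition of fare revenue divided by operating expenses.

```python
from collections import defaultdict

# Invented agency-level records; field names are hypothetical.
agencies = [
    {"uza": "Area A", "fare_revenue": 40.0, "operating_expenses": 100.0,
     "rail_pmt": 300.0, "total_pmt": 500.0},
    {"uza": "Area A", "fare_revenue": 10.0, "operating_expenses": 100.0,
     "rail_pmt": 0.0, "total_pmt": 500.0},
    {"uza": "Area B", "fare_revenue": 5.0, "operating_expenses": 50.0,
     "rail_pmt": 0.0, "total_pmt": 200.0},
]
population = {"Area A": 100.0, "Area B": 25.0}  # invented populations


def uza_features(records, pop):
    """Aggregate agencies by UZA, then derive normalized features."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        for field, value in rec.items():
            if field != "uza":
                totals[rec["uza"]][field] += value
    features = {}
    for uza, t in totals.items():
        features[uza] = {
            # Fare revenue divided by operating expenses.
            "cost_recovery_ratio": t["fare_revenue"] / t["operating_expenses"],
            # Share of passenger miles traveled (PMT) carried by rail.
            "rail_share_of_pmt": t["rail_pmt"] / t["total_pmt"],
            # Annual operating expenses per resident.
            "opex_per_capita": t["operating_expenses"] / pop[uza],
        }
    return features


features = uza_features(agencies, population)
for uza, values in features.items():
    print(uza, values)
```

Summing the raw totals before dividing, rather than averaging each agency's own ratio, weights large agencies appropriately within a UZA.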

The table below shows the coefficients for each of these variables.

Based on the coefficient data, it’s easy to see that the factor most strongly associated with Transit Score was the Percentage of Passenger Miles that were Rail. This makes sense for several reasons. As stated above, the Transit Score took into account congestion, passenger miles, and workforce commuting, all areas where rail performs much better than roadway-based transit. Trains run on dedicated tracks that they don’t share with general traffic, which avoids many of the congestion hours that would otherwise be created. Furthermore, the reliability of trains means that people are more likely to use them as a regular way to get to work than a less reliable bus, since people value being at work on time. As for passenger miles, rail-based transit tends to cover longer distances than buses do, so systems with more rail would likely log more miles.

Another factor that showed a promising association with Transit Score was the Cost Recovery Ratio, the ratio of fare revenue to operating expenses of a public transit system. While it might initially seem that injecting more money into a public transit system would yield better results, there are several reasons this finding makes sense. A higher Cost Recovery Ratio means that the system has a more consistent source of funding and can budget more easily and appropriately ahead of time, whereas additional government spending depends heavily on the political situation at the time, as well as on changing government priorities and agendas. For metrics like congestion and commuter transit, it is incredibly important that a system is reliable and can be planned ahead of time: train schedules can be pre-planned to account for changes in ridership, and riders place more trust in a system they know will be consistent. Thus, a higher Cost Recovery Ratio might translate to a more robust, more reliable transit system.

Both the Transit Operating Expenses and GDP had little to no correlation, which isn’t surprising considering both figures were already adjusted per capita. The big outlier in the coefficient data is Transport Spending/GDP per Capita. The raw data makes clear that this abnormal coefficient is largely due to the data being unavailable for many of the UZAs, which left a large number of null values. Thus, for the sake of this study, this variable will be ignored.

Using the Model to Determine Promising Small Cities
In order to determine which small-population cities are most promising, the low-population cities were fed into the previously created model to predict what their Transit Score would be if data such as congestion and transit expenditure were available to calculate it directly. The table below shows the cities that the model predicted would have a positive Transit Score.

The table above shows the 6 cities that were predicted to have the highest Transit Scores. Looking at the other data for these cities makes the reasons for their positive predicted scores more obvious. As mentioned above, rail transit had a very strong association with high Transit Scores. Of the 48 cities classified as low population, almost none had any rail infrastructure due to their size and age. The two exceptions were Stockton, CA and Anchorage, AK, both of which appear in the table above. Both cities, being military base hubs, have a small but present rail infrastructure, part of which is used by their public transit systems. Moreover, given the density and increased funding that cities with large military bases tend to have, it is no surprise that these two were predicted to have a positive Transit Score.

The other 4 cities are all “college towns,” in the sense that a significant percentage of both the population and the land use in these cities is dedicated to the college or colleges within them. College towns tend to be very dense and to have developed transit systems as a result of needing to move students without car access across their campuses. These cities therefore probably had a higher percentage of people commuting by public transit as well as higher passenger miles per capita, both of which were major contributors to the Transit Score calculation earlier. It is thus no surprise that these 4 cities had a positive predicted Transit Score.

Findings for Stakeholders

Key Factors
Based on the findings of this study, it would make sense for policy makers to focus their efforts on developing more rail infrastructure when trying to develop a public transit system. While building rail carries a higher upfront cost, rail-based transit was the single strongest predictor found of how good a public transit system is. Moreover, transit systems with more rail were able to recuperate more of their costs through fares, so over time rail should result in a lower net cost, with the added benefit of creating important infrastructure and the economic benefits that brings.

Promising Cities
Stakeholders looking for a new city to move to, one that offers the connected benefits of a large metropolitan area without the size, will probably find what they are looking for in one of Durham, NC; Anchorage, AK; Spokane, WA; Reno, NV; Madison, WI; and Stockton, CA. More generally, they will probably find the closest experience in college towns or in cities that have a large military base.

Limitations

The largest limitation of this study was the number of factors considered for the Transit Score itself, as well as for the features of the regression model. Most of the findings and analysis in this study are centered on the derived Transit Score. This score is an amalgamation of factors that were deemed important for evaluating whether a transit system is good. However, it is important to recognize that this score and the factors used to determine it are inherently subjective and may not accurately or completely reflect the realities of these systems. Furthermore, the data used to calculate the scores was closely linked to the data used as the features for the regression. This creates an inherent bias in the association between certain variables and can make certain correlations appear inflated.

Moreover, this study was limited by the fact that data could not be found for many factors that play a large role in public transit success. These include political affiliations, voting patterns, weather patterns, the culture of the city, the industries the city is a hub for, the education level of the populace, and age data. These factors were not accounted for due to constraints on time and scope and the limited availability of such data, especially for metro areas outside the largest 10 or 20. A preview of this limitation was seen in the abnormal Transport Spending/GDP per Capita coefficient: the underlying data was less accurate and less available for UZAs outside the largest few, so the linear regression model could not model it properly.

Source Code: https://github.com/aadilsaif/INST414Final
