Amtrak’s network

Published in

INST414: Data Science Techniques

6 min read · Mar 5, 2024

The question I set out to answer for this assignment was: how are goods moved throughout the United States? This turned out to be much too vague to answer effectively, as there are too many ways that goods move. For simplicity I chose to focus on freight rail. Rail is simple because it requires physical track, which serves as a physical connection between fixed stations. That makes it a natural fit for network analysis, with stations as the nodes and the rails between them as the edges. In my search for data I had to pivot away from freight: the US freight network is split between a number of different private companies, and the association that collects this type of data across all of them wants to be paid for it.

Instead I turned to passenger rail, which is majority publicly owned in the US. Amtrak releases live data on its train network to the public, and that data can be accessed through an API. Given this data, a question that an Amtrak executive might ask is: which stations see the most travel? The answer could be used to determine which stations get upgrades for passengers, which stretches of rail need additional upkeep due to increased usage, or how to staff stations.

The data that I used for this analysis comes from the Amtraker website/API, which pulls data directly from Amtrak’s train tracking website. This API gives an amazing amount of information for free: live data is provided for every train currently running, including each station that the train has stopped at or will stop at. Because the data is live, recreating this analysis might give you a slightly different graph, especially outside Amtrak’s Northeast Corridor, as the westward trains run differently depending on the daily schedule.

#Collecting the data

The API made this part of the task quite simple. One line from the requests package could have done the job, but I broke it out into a few lines for readability.

import requests

url = 'https://api-v3.amtraker.com/v3/trains'
response = requests.get(url)
train_data = response.json()  # parse the JSON body

The API returned all the data I needed as nicely formatted JSON, so I was ready to move on to cleaning.

#Cleaning the data

To build my network I just needed the stations and the connections between them, not all of the additional information. I created two lists to do this:

Train lines: one list per train, containing all of the stations that the train passes through on one trip.
Train stations: a simple list of all of the unique stations, to serve as my nodes.

train_lines = []
train_stations = []

for train in train_data:  # loop through each train
    line_stations = []
    stations = train_data[train][0]['stations']  # the stations this train stops at
    for station in stations:  # loop through stations
        line_stations.append(station['name'])
        # also build a list of unique stations
        if station['name'] not in train_stations:
            train_stations.append(station['name'])
    train_lines.append(line_stations)
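As a compact alternative, the same two lists could be built by a small helper. This is only a minimal sketch: `sample_data` is a made-up stand-in that mirrors just the parts of the response the loop above reads (a mapping from each train to a list of records, each with a `stations` list of dicts carrying a `name`), and both the station names and the function name `extract_lines` are illustrative, not real API output.

```python
def extract_lines(train_data):
    """Return (train_lines, train_stations) from an Amtraker-style response."""
    train_lines = []
    seen = {}  # dicts keep insertion order, so stations stay in first-seen order
    for train, records in train_data.items():
        # like the loop above, only the first record for each train is used
        names = [stop["name"] for stop in records[0]["stations"]]
        train_lines.append(names)
        seen.update(dict.fromkeys(names))
    return train_lines, list(seen)


# Hypothetical sample with the assumed shape (not real API output)
sample_data = {
    "1": [{"stations": [{"name": "A"}, {"name": "B"}, {"name": "C"}]}],
    "2": [{"stations": [{"name": "B"}, {"name": "C"}, {"name": "D"}]}],
}

train_lines, train_stations = extract_lines(sample_data)
print(train_lines)
print(train_stations)
```

Using a dict for the uniqueness check avoids scanning the whole station list for every stop, which matters little at this size but keeps the intent clear.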

Now that I had the lists I needed, I could start building my network, beginning with the nodes. The next block has some additional code beyond creating a node for each station, which I will go into further when discussing my visualizations, but the main idea is that each unique train station I had found previously became a node.

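# Assumed setup, not shown in this excerpt: `import time`, a networkx graph `g`,
# and a geopy Nominatim geocoder named `geolocator`.
# create a node for each station
for station in train_stations:
    try:
        location = geolocator.geocode(station, country_codes="US")
        g.add_node(station, long=location.longitude, lat=location.latitude)
    except Exception:
        # lookup failed or returned None: this station gets no coordinates here
        pass
    time.sleep(1.1)  # API limit: 1 geocoding request per second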
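Because the loop above skips failed lookups silently, it can be useful to record which stations failed so they can be fixed by hand later. The following is a sketch under stated assumptions, not the code used for this post: `geocode_stations` is a hypothetical helper, `geocode` is any callable that returns an object with `latitude` and `longitude` (or `None` when nothing is found), and `fake_geocode` with its coordinates is made up so the example runs without a network connection.

```python
import time
from types import SimpleNamespace


def geocode_stations(stations, geocode, delay=1.1):
    """Return (coords, failed): coordinates per station and the names that failed."""
    coords, failed = {}, []
    for station in stations:
        try:
            location = geocode(station)
        except Exception:
            location = None  # treat a lookup error like "not found"
        if location is None:
            failed.append(station)
        else:
            coords[station] = (location.latitude, location.longitude)
        time.sleep(delay)  # stay under the geocoder's rate limit
    return coords, failed


# Hypothetical stand-in for a real geocoder; the coordinates are made up.
def fake_geocode(name):
    known = {"A": SimpleNamespace(latitude=40.0, longitude=-75.0)}
    return known.get(name)  # None for unknown names


coords, failed = geocode_stations(["A", "B"], fake_geocode, delay=0)
print(coords)
print(failed)
```

With the real geocoder, the callable would be something like `functools.partial(geolocator.geocode, country_codes="US")`, and the `failed` list would name the stations that need manual coordinates.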

Next I needed to create the edges, which represent the track connecting each pair of stations in Amtrak’s network. I chose to have each train that runs between two stations add 1 to the weight of the edge between them, so an edge with a higher weight means that more trains ran on that track.


for line in train_lines:
    for left_station, right_station in zip(line, line[1:]):
        # current weight of this edge, or 0 if the edge does not exist yet
        if g.has_edge(left_station, right_station):
            current_weight = g[left_station][right_station].get('weight', 0)
        else:
            current_weight = 0

        # each train that runs between the two stations adds 1 to the weight
        g.add_edge(left_station, right_station, weight=current_weight + 1)
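The weighting logic can also be checked on its own, without a graph library. Here is a minimal sketch using `collections.Counter`; it assumes the track is undirected (A to B and B to A count as the same edge), which may or may not match the graph type used in the post, and the station names and the helper name `count_edge_weights` are hypothetical.

```python
from collections import Counter


def count_edge_weights(train_lines):
    """Count how many trains run between each pair of consecutive stations."""
    weights = Counter()
    for line in train_lines:
        for left, right in zip(line, line[1:]):
            # sort the pair so A -> B and B -> A count as the same track
            weights[tuple(sorted((left, right)))] += 1
    return weights


# Hypothetical lines: two trains share the B-C segment, in opposite directions
lines = [["A", "B", "C"], ["D", "C", "B"]]
weights = count_edge_weights(lines)
for (u, v), w in weights.most_common():
    print(f"{u} - {v}: {w}")
```

A counter like this could then be loaded into a NetworkX graph in one call, for example with `g.add_weighted_edges_from((u, v, w) for (u, v), w in weights.items())`.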

With that, my graph was done. I could export my network to make some visualizations and start determining which nodes, or stations, were the most important.

#Analysis

To determine which stations are most important, I started with the stations with the highest degree, meaning the number of other stations each one is directly connected to. Philadelphia’s 30th Street Station and Connecticut’s New Haven Union Station were each connected to 9 other stations, making them the two most important by degree centrality. After that came 5 stations with 7 connections each: the Penn Stations in Baltimore, Newark, and NYC, as well as Alexandria, VA and Chicago Union Station. By degree centrality, these are the most important stations.

# the 10 stations with the highest degree (number of connected stations)
most_used_stations = sorted(dict(g.degree).items(), key=lambda item: item[1], reverse=True)[:10]
for station, degree in most_used_stations:
    print(f"{station}, {degree}")
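To make the metric concrete, degree can be computed by hand on a toy network. This is an illustrative sketch with hypothetical stations and a hypothetical helper name, `degree_table`; it also shows the normalised form, since NetworkX's `degree_centrality` divides each degree by n - 1, which changes the values but not the ranking.

```python
from collections import defaultdict


def degree_table(edges):
    """Map each station to its number of distinct neighbours."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {node: len(adj) for node, adj in neighbours.items()}


# Hypothetical toy network: B is the hub
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]
degrees = degree_table(edges)
n = len(degrees)
# normalised degree centrality: share of the other n - 1 stations a station touches
centrality = {node: d / (n - 1) for node, d in degrees.items()}
top = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
print(top[0])
```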

However, I think it is also important to consider the edge weight, which in this network is how many trains run on a stretch of track. This is arguably the more impressive metric: 51 trains ran on the track between Newark, NJ and Penn Station in NYC on the date of writing.

# the 5 edges with the highest weight
sorted_edge_list = sorted(g.edges(data=True), key=lambda x: x[2]['weight'], reverse=True)[:5]  # This code borrowed from github

for s1, s2, data in sorted_edge_list:
    print(f"From: {s1} to {s2}: {data['weight']} trips")

Moving on to some visualizations, the image below is a snapshot of Amtrak’s rail network.

The shape of this network surprised me in a few ways. A number of nodes were connected to only one other node, so there was only one path to reach them. I would have thought that the end of one train line would mark the beginning of another, making the network a more connected, circular graph.

The next thing I wanted to try was visualizing the network based on the physical locations of the stations. I was able to do this with the Geo Layout plugin for Gephi, but it needed the longitude and latitude coordinates of each station. When creating each node I also used the name of the station (which usually represented a town) to get its coordinates.

location = geolocator.geocode(station,country_codes="US")

Using the Geopy library and the Nominatim API I was able to turn the station names into coordinates. After importing the data into Gephi, I was able to create this visualization (if you are recreating this, make sure to use the US country codes or many of the stations will appear in Europe).

This looks like the United States and overall looks pretty good, but there are a few issues, with stations appearing in Mexico and the middle of the ocean. In total, 4 stations could not be geocoded automatically, and I had to fill in their values manually using Google Maps. After the corrections the graph looked like this.

While the edges don’t represent the physical locations of the tracks connecting the stations, each station looks to be in its proper place, and based on my small amount of subject knowledge the visualization looks great!
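Misplaced stations like the ones described above can also be flagged before importing into Gephi. The following is a coarse sketch, not part of the original workflow: the bounding box is an approximate rectangle around the contiguous United States, the helper name `suspicious_stations` is hypothetical, and the sample coordinates are for illustration only. A check like this only catches gross errors; a station placed in the wrong US town, or just across the border, would pass.

```python
# Rough bounding box for the contiguous United States (approximate values)
LAT_MIN, LAT_MAX = 24.0, 50.0
LON_MIN, LON_MAX = -125.0, -66.0


def suspicious_stations(coords):
    """Return station names whose (lat, lon) fall outside the rough box."""
    return [
        name
        for name, (lat, lon) in coords.items()
        if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX)
    ]


# Hypothetical coordinates for illustration only
coords = {
    "Inland station": (41.9, -87.6),  # roughly Chicago
    "Ocean station": (0.0, 0.0),      # Gulf of Guinea, clearly wrong
    "Too far south": (19.4, -99.1),   # roughly Mexico City
}
flagged = suspicious_stations(coords)
print(flagged)
```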

#Limitations

There are a few limitations to the implementation that I used. First, the data from the API is live, meaning that it only shows trains that are currently on the rails and routes currently in service. Since I ran this during business hours on a weekday, I feel confident that I captured a large part of the Amtrak network, but a single snapshot will likely never capture 100% of its stations and routes.

In addition, the edges in my network only exist if a train is carrying passengers and the route has been advertised. There is likely other track that trains use when they are not in service, which would not be reflected in this process.

It is also possible that, for the map visualization, the automatic geocoding put some stations in the wrong location, but the small selection I checked were all correct.

Github Repository: https://github.com/not-senate/module2_assignment.git

Written by Jdavitz