How to Find a Freight Train Problem in a Pile of Amtrak Tweets

Patrick Martin
6 min read · Dec 21, 2021


The world of natural language processing is diverse, ranging from approaches as simple as word counts to complex deep neural networks. In this article, we’ll use some basic NLP methods to analyze Amtrak tweets. We’ll be looking at regular expressions, n-grams, and methods of n-gram selection — let’s check it out!

The first step to dealing with a pile of Amtrak tweets is, of course, to acquire a pile of Amtrak tweets. There are a variety of ways to do this, but I found snscrape to be the simplest:

import snscrape.modules.twitter as sntwitter

# Grab the 10,000 most recent tweets from @AmtrakAlerts
tweets = [
    (t.date, t.id, t.content)
    for _, t in zip(range(10000),
                    sntwitter.TwitterSearchScraper('from:AmtrakAlerts').get_items())
]

In a few lines, we have the last 10,000 tweets from @AmtrakAlerts, the Twitter account in charge of announcing delays and other information about Amtrak trains outside the Northeast Corridor. The question, then, is how do we extract the delay information? 10,000 tweets sits in an awkward sweet spot: far too many to analyze by hand, but not enough (especially given the short length of tweets) to train any fancy natural language processing tool. There are some less-fancy methods we could use to identify and extract patterns in the Amtrak tweets, but in this case they prove unnecessary. The tweets are clearly at least partially automated, with many tweets sharing the same format:

Lakeshore Limited Train 448 which departed Chicago (CHI) on 12/19 is currently operating approx. 2hr late due to earlier rail congestion and commuter train interference along the route.

UPDATE: Palmetto Train 90 is still stopped south of Alexandria (ALX) due to ongoing mechanical issues. We will provide updates as more information becomes available.

Illini Train 392 is currently operating approx. 40min late due to earlier rail congestion south of Champaign (CHM).

We can thus use a regular expression to extract the information we want:

  • The name of the train: A few capitalized words, followed by “Train”, followed by a number
  • An optional specifier: “which departed” followed eventually by a date MM/DD
  • A delay reason: text that follows “due to”
  • A delay length: a mix of numbers and hour/hr/minute/min, followed by “late”
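The rules above can be sketched as a single regular expression. This is an illustrative reconstruction, not the exact pattern used for the analysis, and the group names are my own:

```python
import re

# A sketch of the extraction rules described above. Each piece after the
# train name and number is optional, since many tweets omit it.
TWEET_RE = re.compile(
    r"(?P<name>(?:[A-Z][\w']*\s+)+)Train\s+(?P<number>\d+)"           # name + number
    r"(?:.*?which departed\s+.*?\s+on\s+(?P<date>\d{1,2}/\d{1,2}))?"  # departure date
    r"(?:.*?approx\w*\.?\s*(?P<delay>\d[\w\s]*?(?:hr|hour|min|minute)s?)\s+late)?"
    r"(?:.*?due to\s+(?P<reason>[^.]+))?",                            # delay reason
    re.DOTALL,
)

m = TWEET_RE.search(
    "Illini Train 392 is currently operating approx. 40min late "
    "due to earlier rail congestion south of Champaign (CHM)."
)
# m.group("name").strip() -> "Illini", m.group("delay") -> "40min"
```

A real pattern would need more care around edge cases (multi-part delays, truncated tweets), but this captures the shape of the approach.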

Despite this admittedly crude analysis, only 6% of tweets lack both an identified delay reason and delay length — pretty good! As always when doing any sort of natural language processing, we should spot check what we’ve missed:

Silver Meteor Train 97 which departed New York (NYP) on 11/22 is currently operating approximately 2 hour

UPDATE: Capitol Limited Train 29, will resume service from Washington (WAS) after having mechanical issues resolved. Updates to follow.

Amtrak Cascades Train 503 is currently stopped at Portland (PDX). Updates to follow as more information becomes available.

The 6% attrition looks about right; this encompasses malformed tweets and tweets without delay information.

We now have two issues to resolve before we can analyze the reasons and severity of Amtrak delays: only 32% of the tweets list both a delay reason and length, and the same event might be tweeted about multiple times. Let’s see if we can’t group the tweets to solve both of these problems.

Certainly, the first thing we want to make sure of is that both tweets reference the same train, but train numbers only identify a train on a specific day. When Amtrak includes the “which departed” message, that helps us narrow it down, but that identifier isn’t always present. We’ll use the following criteria to determine if two tweets refer to the same train:

  • If the train names or numbers don’t match, the tweets don’t match
  • If the tweets occurred more than two weeks apart, they don’t match
  • Otherwise, if the “which departed” message matches (and the tweets are within two weeks of each other), the tweets match
  • Otherwise, if neither has a “which departed” message but the tweets occurred within 10 hours of each other, the tweets match
  • Otherwise, they don’t match

(the 10-hour threshold was determined by fiddling with the threshold, looking at grouped tweets with long separations, and picking a number that worked well)
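The criteria above translate directly into a comparison function. A minimal sketch, where the tweet records and their field names ('name', 'number', 'departed', 'time') are illustrative, not the actual data structures used:

```python
from datetime import datetime, timedelta

def same_train(a, b):
    """Apply the grouping rules above. `a` and `b` are dicts with keys
    'name', 'number', 'departed' (the "which departed" date string, or
    None if absent), and 'time' (the tweet's timestamp as a datetime)."""
    # Names or numbers differ: not the same train
    if a["name"] != b["name"] or a["number"] != b["number"]:
        return False
    gap = abs(a["time"] - b["time"])
    # More than two weeks apart: not the same train
    if gap > timedelta(weeks=2):
        return False
    # Matching "which departed" messages within two weeks: a match
    if a["departed"] is not None and a["departed"] == b["departed"]:
        return True
    # Neither has a "which departed" message: match if within 10 hours
    if a["departed"] is None and b["departed"] is None:
        return gap <= timedelta(hours=10)
    return False
```

Running this pairwise comparison over the tweets then gives the groups described below.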

With this, we have 6,252 groups of tweets, of which 4,679 (comprising 8,152 tweets) include both a reason and a time. The downside is that these tweets represent 2,849 unique reasons, with 2,319 occurring exactly once. We need some way to condense the reason information!

due to earlier freight train interference and speed restrictions west of Warrensburg (WAR)

due to freight train interference between Roseville (RSV) and Colfax (COX) and additional freight train interference between Truckee (TRU) and Reno (RNO)

due to switch issues

due to an earlier mechanical issue between Chicago (CHI) and Glenview (GLN)

due to signal issues, freight train interference and rail congestion along the route

What we’re looking for are common, meaningful groups of words — the “freight train interference”, “speed restrictions”, “switch issues”, etc. in the above reasons. In the NLP world, we refer to strings of consecutive words (or other tokens) as n-grams, letting n be the length of the strings being considered. This has an immediate shortcoming, though: how do we pick n?

As always, the best thing is to just try some and see what happens! Here are the top n-grams for n=1, 2, and 3 (with the initial “due to” stripped):

1: freight, train, and, interference, earlier, of, issues, the, a, mechanical
2: freight train, train interference, rail congestion, mechanical issues, speed restrictions, the route, signal issues, along the
3: freight train interference, along the route, earlier freight train, disabled freight train, a disabled freight, along its route
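Counting n-grams like this takes only a few lines. A minimal sketch, using made-up reason strings rather than the real dataset:

```python
from collections import Counter

def ngrams(tokens, n):
    """Every run of n consecutive tokens, joined back into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative reason strings, not the actual Amtrak data
reasons = [
    "freight train interference along the route",
    "earlier freight train interference",
    "signal issues and rail congestion",
]

counts = Counter()
for reason in reasons:
    counts.update(ngrams(reason.split(), 2))

# counts.most_common() now ranks the 2-grams by frequency
```

In practice you would tokenize more carefully (lowercasing, stripping punctuation and station codes), but frequency ranking is all there is to it.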

Here we see the freight train problem immediately, as well as some other train issues. However, beyond those common issues, it’s not until the 34th most common 2-gram that we find a new issue (“trespasser incident”), as the earlier 2-grams are clogged up with nonsense like “and freight” and “along its”.

It’d be nice if we could identify n-grams that are somehow special — not common just because their individual words are common, and not common just because they’re part of a common (n+1)-gram.

This process is called feature selection, and one way of doing this for n-grams is to compute the “glue” of an n-gram, as done in Houvardas & Stamatatos (2006). The basic idea is to construct a directed graph representing the n-grams, and compare the probabilities of each n-gram to the sub-grams it contains and the super-grams that contain it.

A very small section of the n-gram graph

To be precise, the glue is computed by squaring the frequency of the n-gram and dividing by the largest product of frequencies over ways to split the n-gram in two.

The glue formula, modified from Houvardas & Stamatatos (2006)

We can then consider only the n-grams with larger “glue” than any of their neighboring n-grams, and sort them by how much bigger the glue is!
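The glue computation itself is short. Here is a sketch under the description above, where `prob` maps n-gram strings to their probabilities (frequency over total count); handling splits with zero probability is left out:

```python
def glue(prob, gram):
    """Glue of an n-gram, after Houvardas & Stamatatos (2006): its
    squared probability divided by the largest product of probabilities
    over the ways of splitting the n-gram in two."""
    words = gram.split()
    # Try every split point: (words[:i], words[i:]) for i = 1..n-1
    best_split = max(
        prob[" ".join(words[:i])] * prob[" ".join(words[i:])]
        for i in range(1, len(words))
    )
    return prob[gram] ** 2 / best_split
```

For a 2-gram there is only one split, so the glue reduces to P(w1 w2)² / (P(w1)·P(w2)) — essentially how much more often the pair occurs together than its word frequencies would predict.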

gate crossings
passenger on board
Norfolk Southern
speed restrictions
single tracking
maritime traffic

Ironically, because “disabled freight train” and “freight train interference” are so common, their shared 2-gram prevents either 3-gram from being picked. We could alternatively sort by the glue value itself, or a modified version — the important thing is that we now have a list of n-grams we can work with.

Running this feature selection produces 396 candidate features, a number small enough that they can be filtered and categorized by hand in a few minutes. We can then repeat this process, removing the reasons we’ve already categorized and re-generating the list of candidate features until we feel confident we’ve captured everything. A list of 78 reasons — n-grams found via this method — represents 96% of the tweet groups, which is pretty good coverage.

N-grams are surprisingly powerful tools for language analysis, given their simplicity. Through these approaches, we’ve sifted through 10,000 tweets, analyzed 81% of them, and found 78 keywords that describe the reasons for the delay in 96% of those cases. What’s left is the process of matching delay reasons with delay times in these groups, which is tricky at first as trains can experience multiple independent delays on their trips. However, once that issue is resolved, we can look at what is causing Amtrak trains to be delayed.

The answer? Freight trains.


Patrick Martin

I’m a mathematician and strategy gamer who enjoys looking for patterns in data and investigating what those patterns mean.