Getting off the Struggle Bus: Using Data to Better Plan My Public Transit Commutes

[This is Part 1 in a 3-part series on my experiences as a mentee in the Chicago Python User Group (ChiPy) mentorship program.]

The Humble Beginnings

When I was a student at the University of Chicago in the South Side neighborhood of Hyde Park, CTA bus Route 55: Garfield was my lifeline to the rest of the city. Hyde Park doesn’t have any L lines running directly through the neighborhood, so to travel to and from the Loop, one must take the Metra, the Route 6: Jackson Park Express bus, or the 55 bus followed by a transfer to the Green or Red Line. Unfortunately, the Metra neither runs frequently nor is integrated with Ventra, while taking the Jackson Park Express could be a gamble with Lake Shore Drive traffic. Taking the 55 to the Red Line was the most realistic and popular option for many students.

Even after graduating from UChicago this past June and moving to the North Side, the 55 bus still plays a major role in my life, as I frequently travel to Hyde Park to visit my friends still attending school. In particular, this past autumn I was commuting multiple times per week from Lakeview to Hyde Park. According to Google Maps, the trip should take around an hour.

I usually left for Hyde Park between 2pm and 5pm.

There were, however, multiple instances when my trips met or exceeded 90 minutes. The culprit was not delays on the Red Line, but rather waits of over 20 minutes at the Garfield Red Line station for the eastbound 55 bus. According to the CTA’s official bus schedule, during the time of day I typically commuted, a 55 bus leaves every 9 to 18 minutes from Midway and every 9 to 10 minutes from Ashland. In a perfect world, an eastbound bus should arrive at the Garfield Red Line station at least every 10 minutes. More often than not, this did not hold true.

Finally, one afternoon in late October, after arriving over 30 minutes late to an event on UChicago’s campus, soaked in rain, because the 55 had let me down when I needed it the most, I had had it. I realized there had to be another way!

Out of my personal struggle, an idea was born: what if I could use a data-driven approach to better time my trips?

For example, what if I could find data detailing which buses arrive the most or least on time? Maybe I would discover the 55 frequently runs late, and I could alter my commute by taking a different bus, say the Jackson Park Express from downtown. Or, what if I could find data logs of when buses arrived at certain stops? I could then analyze the data and find the average wait times for the 55 at different times during the day. Perhaps I could find a sweet spot between the afternoon slump, when the number of active buses and the volume of traffic are low, and five o’clock rush hour, when the number of active buses and the volume of traffic are high.

After doing a little research, I came to the following conclusions:

  • There is no publicly available data on which buses tend to be the most/least on time, which buses experience the worst bus bunching, or even just logs of the times at which buses arrived at particular stops. (The CTA does have other publicly available data sets which you can view here.)
  • I could, however, collect live bus location data by accessing the CTA’s Bus Tracker API and derive the information I wanted.
  • Python is a great language for scripting and data analysis. Although I had minimal experience with Python, it seemed like an easy language to pick up and get started with quickly. Plus, the language has a number of great libraries and tools for analyzing and playing with data, particularly pandas, NumPy, and Jupyter Notebooks.
  • At the time I didn’t have a job, so why not start a project? The only thing I was really doing was binge-watching episodes of Seinfeld. I might as well do something useful with my time.

After teaching myself a modest amount of Python, I wrote a Python script that was able to interact with the Bus Tracker API. I ran my script 24/7 during February and March, requesting the location data of all active 55 buses every 30 seconds. By the end of March, I had collected 40 MB of data, giving me almost 1,000,000 data points to work with. So, now that I have two months’ worth of bus location data on my hard drive, what’s next?
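A collection loop along these lines might look like the following sketch. It is not the original script: the endpoint URL, the query parameter names, and the "bustime-response" / "vehicle" layout of the reply are assumptions about the Bus Tracker API that should be checked against the CTA's developer documentation, and the function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint for the Bus Tracker API's vehicle query; verify against
# the CTA developer documentation before relying on it.
BASE_URL = "http://www.ctabustracker.com/bustime/api/v2/getvehicles"


def build_url(api_key, route="55"):
    """Build the request URL for all active buses on one route."""
    query = urllib.parse.urlencode({"key": api_key, "rt": route, "format": "json"})
    return f"{BASE_URL}?{query}"


def extract_vehicles(payload):
    """Return the list of vehicle records from a decoded JSON reply.

    Assumes the records sit under "bustime-response" -> "vehicle"; returns
    an empty list when that key is absent (for example, in an error reply).
    """
    return payload.get("bustime-response", {}).get("vehicle", [])


def poll(api_key, handle, interval_s=30, max_polls=None):
    """Request vehicle data every `interval_s` seconds; pass each record to `handle`."""
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            with urllib.request.urlopen(build_url(api_key), timeout=10) as resp:
                payload = json.load(resp)
            for vehicle in extract_vehicles(payload):
                handle(vehicle)
        except (OSError, ValueError) as exc:
            # Network errors and malformed replies should not stop a
            # collector that is meant to run for weeks.
            print(f"poll failed: {exc}")
        polls += 1
        time.sleep(interval_s)
```

Calling `poll("YOUR_API_KEY", print)` would print each record as it arrives; a real collector would pass a function that appends to a file instead.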

The Mentorship

During the time that I was collecting data, I learned about the ChiPy mentorship program, a one-on-one mentorship program that helps individuals of all Python skill levels grow through the guidance of an experienced Python developer. I applied thinking this would be the perfect opportunity to get some help on my project idea as a beginning Pythonista. Needless to say, here I am. And I am happy to say that I am working on my project under the guidance of my mentor Matt Hall, VP of Software Engineering at Mintel. I look forward to the next couple months of working together.

As part of the mentorship program, I plan to build a full-fledged project centered around my bus location data. I have a number of goals for my project and for my experience in the program. There are two main pieces of information I wish to discover from the data:

  • The distribution of wait times between buses at any given major stop at a particular time of day
  • The distribution of trip lengths between any two major stops at a particular time of day

From the distributions, I can calculate median wait and trip times and also find a range of wait and trip times that are typical given a certain time of day. Furthermore, I don’t want to hoard all of this data for myself. I want to share it! As part of my project, I also want to create a website with interactive data visualizations where users can plan their trips by selecting the two stops they intend to travel between. If all goes according to plan, I hope to collect data from even more CTA bus routes and add them to the website. I want to allow people to make smarter decisions about how they plan their trips on public transportation, and I hope my project encourages more people to ride the bus! Beyond creating a website that is useful for others, I hope to become a more skilled and independent Pythonista and also become more involved in the Chicago Python community by the end of the mentorship.
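To make the median and "typical range" idea concrete, here is a minimal sketch using only the standard library. The arrival times are hypothetical values for illustration, the function names are made up, and treating the middle 50% of observations as the "typical" range is just one possible choice.

```python
from datetime import datetime
from statistics import median, quantiles


def wait_times_minutes(arrivals):
    """Gaps in minutes between consecutive bus arrivals at one stop."""
    ordered = sorted(arrivals)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


def summarize(waits):
    """Median plus the middle 50% (first to third quartile) of the waits."""
    q1, _, q3 = quantiles(waits, n=4)
    return {"median": median(waits), "typical_range": (q1, q3)}


# Hypothetical arrival times at a single stop, for illustration only.
arrivals = [
    datetime(2017, 3, 1, 15, 0),
    datetime(2017, 3, 1, 15, 8),
    datetime(2017, 3, 1, 15, 20),
    datetime(2017, 3, 1, 15, 29),
    datetime(2017, 3, 1, 15, 50),
]
waits = wait_times_minutes(arrivals)  # gaps of 8, 12, 9 and 21 minutes
summary = summarize(waits)
```

Grouping the arrivals by hour of day before calling `summarize` would give one such summary per time of day.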

The Data

Before I wrap up, let’s take a quick look at the data I will be working with in my project. When I run my data collection script, the JSON response from the Bus Tracker API looks a little something like this:

The API returns the statuses of all active buses at the time the script was run. In this case, there were seven active buses, though I’ve only shown the data for two of them. Let’s take a look at the meaning of some of the data fields I will be referring to frequently during future blog posts:

  • “tmstmp” (timestamp): This is the date and time when the bus last updated its position. For example, the first bus shown last updated its location at 22:57:52 on April 18, 2017.
  • “pid” (pattern ID): For each trip the bus takes, it executes a certain “pattern,” where the bus visits a set sequence of stops. On Route 55, there are a number of patterns the bus can execute, but the main patterns are 5424 (Midway to Museum of Science and Industry, eastbound), 5425 (MSI to Midway, westbound), 1293 (MSI to St. Louis, westbound), and 1300 (St. Louis to MSI, eastbound).
  • “pdist” (pattern distance): This is the distance in feet that the bus has traveled along the pattern currently being executed. The distance is not necessarily the same for patterns with the same terminal stops but that head in opposite directions. For example, the pattern distance for trips from MSI to Midway is 50,403 feet, whereas the pattern distance for the trips from Midway to MSI is only 49,110 feet.
  • “tatripid” (Transit Authority Trip ID): This is the CTA’s scheduled trip identifier for the vehicle’s current trip. Each trip is designated a certain ID for that day. As far as I’m aware, no two trips in a single day may have the same ID number.
  • “dly” (delay): If this field is True, the bus has been delayed.
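As a small illustration of how these fields might be turned into typed Python values, consider the sketch below. The "YYYYMMDD HH:MM:SS" timestamp layout is an assumption about the API's format, `parse_vehicle` is a hypothetical helper, and apart from the timestamp and pid mentioned above, the values in the sample record are made up.

```python
from datetime import datetime


def parse_vehicle(record):
    """Convert the fields described above into typed Python values."""
    return {
        # Assumed timestamp layout: "YYYYMMDD HH:MM:SS".
        "tmstmp": datetime.strptime(record["tmstmp"], "%Y%m%d %H:%M:%S"),
        "pid": int(record["pid"]),
        "pdist": int(record["pdist"]),
        "tatripid": str(record["tatripid"]),
        # A missing "dly" field is treated as "not delayed" (an assumption).
        "dly": bool(record.get("dly", False)),
    }


# Illustrative record: the timestamp and pid come from the text above,
# every other value is invented for the example.
sample = {
    "tmstmp": "20170418 22:57:52",
    "pid": 5424,
    "pdist": 12345,
    "tatripid": "1001",
    "dly": False,
    "lat": "41.79",
    "lon": "-87.62",
}
parsed = parse_vehicle(sample)
```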

You may have noticed that the buses also report their latitude and longitude. This data is not necessary for the project, since the pdist tells us how far the bus has traveled along its route. This saves us the hassle of determining the distance between GPS coordinates! Finally, the data that I am using in my project is not quite in the same form as the JSON response from the Bus Tracker API. My script extracted only the necessary fields and saved each data point as a line in a CSV:
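A minimal sketch of such an extraction step follows. The column selection and order are assumptions chosen to match the fields described above, so the original file layout may well differ, and `append_rows` is a hypothetical helper.

```python
import csv
import io

# Assumed column selection and order; the original script's layout may differ.
FIELDS = ["tmstmp", "pid", "pdist", "tatripid", "dly"]


def append_rows(out, vehicles):
    """Write one CSV line per vehicle record, keeping only FIELDS."""
    # extrasaction="ignore" silently drops unused keys such as lat and lon.
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writerows(vehicles)


# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
append_rows(buffer, [{
    "tmstmp": "20170418 22:57:52",  # timestamp from the text above
    "pid": 5424,
    "pdist": 12345,                 # invented value
    "tatripid": "1001",             # invented value
    "dly": False,
    "lat": "41.79",                 # invented value, dropped on write
    "lon": "-87.62",                # invented value, dropped on write
}])
```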

The Problems

Analyzing and making sense of the data seems like it should be easy, right? Wrong! Well, this is only partially true. Finding the median or standard deviation of a set of numbers is a fairly straightforward task. But the bus location data needs to be cleaned and transformed into a more meaningful form before it can actually be analyzed. There are a number of quirks that need to be ironed out:

  • The data doesn’t tell me when the buses have arrived at a particular stop. In order to determine this, I need to download a list of bus stops and their pdist along the bus routes from the Bus Tracker API. Once I know the locations of all of the stops, given a particular tatripid and a particular bus stop, I can find the point immediately before the bus reaches the stop and the point immediately after and then interpolate the time the bus arrived at the stop.
  • For any given bus, I don’t always know the exact time that a trip started or ended. When buses reach their final stop, they sometimes idle for 5 or 10 minutes before turning around and beginning their next trip. Not accounting for this will impact the accuracy of trip time calculations.
  • In some cases, the buses don’t completely “finish” the route pattern. For example, a bus executing pid 5425 (MSI to Midway) might end its trip 500 feet short of the official end point, since the bus terminal at Midway is rather large. This causes problems when interpolating the data, as I illustrate below.
Westbound 55 buses on March 1st, 2017. It appears that most trips take 45–60 minutes. On the x-axis is elapsed time in HH:MM:SS. On the y-axis is pdist in feet.

Above is a quick plot I made of a day’s worth of interpolated data points for westbound 55 buses; a plot of the raw data is a bit messier. You might notice that some bus trips stop around 43,000 feet. This is the location of Kostner Ave, the major stop immediately before the final stop at Midway Airport. The full length of pid 5425 (MSI to Midway) is 50,403 feet. If the bus ends its trip even one foot short of the terminal stop, I can’t interpolate the time it arrives there! In this case, I will need to think of another method to estimate the times that these buses arrived at Midway.
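The interpolation idea, and the reason it breaks down for trips that end early, can be sketched as follows. This assumes simple linear interpolation between the two reports that bracket the stop; `interpolate_arrival` is a hypothetical helper, and the sample points are invented for illustration (only the 43,000 and 50,403 foot figures come from the text above).

```python
from datetime import datetime, timedelta


def interpolate_arrival(points, stop_pdist):
    """Estimate when a trip reached `stop_pdist` (feet along the pattern).

    `points` is a time-ordered list of (timestamp, pdist) pairs for one trip.
    Returns None when no two consecutive points bracket the stop, which is
    what happens when a bus ends its trip short of the terminal.
    """
    for (t0, d0), (t1, d1) in zip(points, points[1:]):
        if d0 <= stop_pdist <= d1:
            if d1 == d0:  # the bus did not move between the two reports
                return t0
            fraction = (stop_pdist - d0) / (d1 - d0)
            return t0 + (t1 - t0) * fraction
    return None


# Invented reports 30 seconds apart for a trip that ends at 43,600 feet.
start = datetime(2017, 3, 1, 15, 0, 0)
points = [
    (start, 42000),
    (start + timedelta(seconds=30), 42800),
    (start + timedelta(seconds=60), 43600),
]
kostner = interpolate_arrival(points, 43000)   # bracketed: an estimate exists
terminal = interpolate_arrival(points, 50403)  # never reached: None
```

Here the stop at 43,000 feet lies a quarter of the way between the second and third reports, so the estimate falls 7.5 seconds after the second report, while the terminal at 50,403 feet has no bracketing pair at all.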

There is so much more work ahead of me. I haven’t even touched on the details of building a website and creating interactive visualizations of the data! I am very excited and fortunate to be a part of this spring’s ChiPy mentorship program. If you enjoy Python and public transportation, be sure to stay tuned for future blog posts.

This is the full extent of Route 55: Garfield. Many trips start and end at St. Louis Ave, which I’ve circled in red.