Getting off the Struggle Bus, Part 3: Analyzing and Learning From Transit Data

[This is Part 3 in a 3-part series on my experiences as a mentee in the Chicago Python User Group (ChiPy) mentorship program. Read Part 1 and Part 2 here.]

Prelude

There are only three weeks remaining in the ChiPy mentorship program, but the work on my project feels far from over. Currently, I am in the process of building a web app using Flask and Bokeh that allows users to interact with the transit data. By providing a departure time, origin and destination stops, and a day of the week, the app returns a useful summary (including graphs!) describing the trip and wait times of the 55 bus under the given conditions in both the average case and in the worst possible case. I am also working to include visualizations and tables summarizing the daily operations of the route. I think a bonus blog post number four will be in order to show off the finished product.

That said, from the work I have accomplished so far, I have made significant strides as a Pythonista and budding data analyst. I feel very comfortable working in a Jupyter notebook with Pandas to manipulate data into whatever form I need. Using Jupyter has made me embrace an iterative and exploratory approach to data analysis, as its notebooks allow me to quickly inspect my data at each step of the cleansing and analysis process. If I’m being completely honest, I can hardly wait for the mentorship to end, but not because I am experiencing frustration or boredom. Through working on my project, I have thought of many more Chicago transit-related data analysis and visualization projects that I cannot wait to get started on! I am excited by the prospect of continuing to work with Python, Jupyter, and Pandas in the future, and I feel empowered by the technical skills I developed during the mentorship as well as the support of the welcoming Chicago Python community. But before I get ahead of myself, I have to complete my first project!

Learning From The Data

I want to spend this blog post reflecting on the insights I gained from the data, finally answering the question that sparked this project in the first place, and discussing future avenues for exploration. There are a number of qualitative observations we can make about the data that give a glimpse into how the CTA schedules its buses and how time of day impacts the consistency of scheduling.

Looking at any of the scatter plots of Time vs Trip Time or Time vs Wait Time, you will notice that the data points appear stacked one on top of the other at certain times of the day. Note that the x-axis represents the decimal time (hour + minutes/60) that a bus leaves (or arrives at) stop A headed toward stop B. For any pair of stops, the points are more vertically aligned late at night and in the early morning, becoming less aligned as the day progresses, finally realigning as night starts. Intuitively, this makes sense: it is easier for the buses to maintain a consistent schedule at night when there is less traffic.

Travel Times for buses departing from St. Louis toward Woodlawn. Times when the points are more vertically aligned indicate that buses departed from St. Louis at a consistent time on different days of the week.
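
For readers who want to reproduce the x-axis, here is a minimal sketch of the decimal-time conversion described above (hour + minutes/60). The helper name to_decimal_time and the example timestamp are made up for illustration; they are not taken from my project code.

```python
from datetime import datetime


def to_decimal_time(timestamp):
    """Convert a datetime to decimal hours, e.g. 15:30 -> 15.5."""
    return timestamp.hour + timestamp.minute / 60.0


# Hypothetical departure: Monday, January 30, 2017 at 19:05.
departure = datetime(2017, 1, 30, 19, 5)

decimal_time = round(to_decimal_time(departure), 2)
day_of_week = departure.weekday()  # 0 = Monday, ..., 6 = Sunday

print(decimal_time, day_of_week)
```

Rounding to two decimals keeps the values easy to read on a plot, at the cost of a resolution of a bit over half a minute.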

Now, consider the scatter plot of travel times for eastbound trips from Midway to MSI. (Note: we could just as well look at the scatter plot of wait times and observe the same thing.) Next, examine the scatter plot generated by moving the origin stop one stop immediately to the east (Kostner) while keeping the final stop (MSI) fixed. Repeat this step until you reach Lake Park, the major stop immediately before MSI. Notice that the points start off vertically aligned when the origin is at Midway and become much less so by the time we examine the plots with origin at Lake Park. You will notice the same pattern if you perform a similar procedure for westbound trips. This observation makes sense, as it is easier for the buses to depart at consistent times from the terminal stops than it is for them to depart from later stops on schedule. Later in this post, I will discuss a potential future analysis inspired by this observation.

The points are not as well aligned in the right plot at any point during the day. They are especially unaligned during morning and evening rush hours. This combines the previously discussed phenomenon with the current one.

We can also see features of the CTA’s scheduling in the plots. Below is the plot of weekday wait times for buses traveling from Midway to MSI. You will notice five peaks in wait times between 13:00 and 15:00. The official bus schedule shows why: for every two (or so) eastbound buses that leave Midway, an eastbound bus leaves from Ashland. Though less well defined, if you look at the plot of wait times from St. Louis to MSI you still see the peaks. Finally, looking at the plot of wait times from Ashland to MSI, the peaks have entirely disappeared, as there are both MSI-bound buses leaving from Ashland and MSI-bound buses arriving at Ashland from Midway and St. Louis.

We can see another prominent feature of the bus scheduling if we examine the weekend wait time scatter plots. Below is the scatter plot of wait times for buses traveling from Midway to MSI on Sundays. Over the course of Sunday morning, bus service from Midway becomes more frequent. Suddenly, around 12:30-13:00 the wait times jump up to around 25 minutes and remain that long until after 17:00. According to the CTA’s schedule, between those hours, a bus leaves every 24 minutes from Midway and then every 12–13 minutes from St. Louis. If we examine the plot of wait times for buses from St. Louis to MSI, the jump disappears.

Solving The Original Problem

Now, let’s try to answer the question I originally posed at the start of the project: can I use a data-driven approach to better time my trips? Specifically, between 15:00 and 17:00 on a weekday, how long should I expect to wait for an eastbound bus at the Garfield Red Line station? Is there an optimal time within this window to catch the bus? We will examine the scatter plot of wait times for trips from the Red Line to Woodlawn, since that is near where I usually got off the bus.

There is about a 25–35 minute spread in possible wait times during this time interval. Also, notice there seems to be a brief spike in wait times right before 16:00. Now, let’s try to quantify our observations. We want to get an idea of the typical wait time during this interval. Since I am keeping outliers in the analysis, the median is a better measure of the “center” of the data than the mean, which a few very long waits can pull upward. Next, I will calculate the median over appropriately small sub-intervals rather than the entire two-hour interval, since traffic volume and bus scheduling change over time. I collected enough data (nearly two months) that sorting it into 15-minute bins should be okay. We can find the median over each interval by running the following code:

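import numpy as np
import pandas as pd

df = pd.read_csv("GarfieldRed_eb.csv")
# Keep weekday trips (0 = Monday, ..., 6 = Sunday) to Woodlawn that start
# between 15:00 and 17:00. By default, between() is inclusive on both ends.
garfield_woodlawn = df[(df.stop == "Woodlawn") &
                       (df.day_of_week.between(0, 4)) &
                       (df.decimal_time.between(15, 16.99))]
# Bin edges every 15 minutes: 15, 15.25, ..., 17.
bins = np.arange(15, 17.25, 0.25)
# Group the rows by 15-minute bin (labeled by its left edge) and take medians.
time_bins = pd.cut(garfield_woodlawn.decimal_time, bins, labels=bins[:-1], right=False)
garfield_woodlawn.groupby(time_bins)[['travel_time', 'wait_time']].median()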

The first five lines of GarfieldRed_eb.csv (the header plus four rows of data) are given below:

tripid,start,stop,day_of_week,decimal_time,travel_time,wait_time
20170130_5424_204181,GarfieldRed,State,0,19.08,0.77,5.25
20170130_5424_204182,GarfieldRed,State,0,19.17,1.42,11.17
20170130_5424_204183,GarfieldRed,State,0,19.35,1.08,9.42
20170130_5424_204184,GarfieldRed,State,0,19.52,3.48,20.87

The first argument to pandas.cut is the array of numbers we wish to sort into bins. We want to bin the decimal_time column into 15-minute intervals from 15:00 to 15:15, 15:15 to 15:30, …, 16:45 to 17:00. The second argument takes a sequence of edges that defines the bins. With right set to False, the bins include the left edge, but not the right edge. In our case, the bins are [15,15.25), [15.25,15.5), … [16.75, 17). The optional argument labels allows us to give a convenient name to the bins. I label the bins according to their left-most value. Thus, bin “[15,15.25)” is labeled 15, bin “[15.25,15.5)” is labeled 15.25, and so on. Groupby then takes the DataFrame garfield_woodlawn and groups the rows according to the bin in which each decimal_time value falls, and .median() calculates the median over each bin. The median wait times and travel times over each 15-minute interval are summarized as follows:

decimal_time, median_travel_time, median_wait_time
15.00, 10.830, 8.670
15.25, 10.720, 9.120
15.50, 11.250, 12.030
15.75, 10.520, 12.330
16.00, 11.400, 6.920
16.25, 11.290, 10.680
16.50, 11.365, 9.450
16.75, 11.570, 11.400
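
To check the binning behavior described above, here is a small sketch that runs pandas.cut on a handful of made-up decimal times (they are illustrative values, not rows from my data set). It uses the same bin edges and left-edge labels as the analysis.

```python
import numpy as np
import pandas as pd

# Made-up decimal times, chosen to sit on and around the bin edges.
times = pd.Series([15.0, 15.1, 15.25, 15.49, 15.5, 16.99, 17.0])

# Same edges as in the analysis: 15, 15.25, ..., 17.
bins = np.arange(15, 17.25, 0.25)

# right=False makes each bin closed on the left and open on the right.
binned = pd.cut(times, bins, labels=bins[:-1], right=False)

print(pd.DataFrame({"decimal_time": times, "bin_label": binned}))
```

A time of exactly 15.25 lands in the bin labeled 15.25 rather than 15, and 17.0 falls outside the last bin [16.75, 17), so it comes back as NaN.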

I remember waiting over 20 minutes for the bus on a couple of occasions during this time period. From the data, it seems that I was really unlucky! Depending on the 15-minute bin, half of all wait times were no longer than about 7 to 12 minutes, so my 20-minute waits ran roughly 8 to 13 minutes over the median. But just how unlucky was I? Using scipy.stats.percentileofscore, I can find the percentile rank of any given wait time relative to the other wait times in that interval.

from scipy import stats

# Percentile rank of a 20-minute wait within each 15-minute bin.
for i in np.arange(0, 2, 0.25):
    in_bin = garfield_woodlawn[
        (garfield_woodlawn.decimal_time >= 15 + i) &
        (garfield_woodlawn.decimal_time < 15 + i + 0.25)
    ]
    print("{}: {}".format(15 + i, stats.percentileofscore(in_bin.wait_time, 20)))

The first argument to percentileofscore is an array of scores to which another score is compared — I use the list of wait times for each 15-minute sub-interval. The second argument is the score that is compared to the elements in the list. Here, I pass “20” to see how a wait time of 20 minutes compares to the distribution of observed wait times. The output is given below:

15.0: 93.3333333333
15.25: 89.3333333333
15.5: 74.6268656716
15.75: 86.4864864865
16.0: 94.8453608247
16.25: 93.0555555556
16.5: 95.0
16.75: 98.5915492958

Yikes! So, depending on the 15-minute bin, anywhere from 74.6–98.6% of the waits for the next bus were shorter than 20 minutes at this time of the day! Based on the median times calculated before, it seems that trying to arrive at the station right around 15:00 or 16:00 yields the best chance at waiting the shortest amount of time for the next bus. Between 15:00 and 15:15, half of the wait times are 8.67 minutes or less, and between 16:00 and 16:15, half of the wait times are 6.92 minutes or less. If it’s any comfort, the trip times from the Red Line to Woodlawn over this interval appear fairly uniformly distributed within a 12±3 minute band. Consequently, there is not much need to factor trip times into determining the optimal time to catch the eastbound 55.
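
To make the percentile ranks above a bit more concrete, here is a hand-rolled sketch of the idea: count how many observed waits are at or below a given score. The wait times are made up, and percent_at_or_below is a hypothetical helper, not part of SciPy. When no observation ties with the score, I would expect this to agree with scipy.stats.percentileofscore; with ties, SciPy's default handling can give a somewhat different number.

```python
def percent_at_or_below(observations, score):
    """Percentage of observations that are less than or equal to score."""
    count = sum(1 for value in observations if value <= score)
    return 100.0 * count / len(observations)


# Made-up wait times (minutes) for one 15-minute bin; not real CTA data.
waits = [4, 6, 8, 9, 10, 12, 15, 18, 22, 25]

rank_of_20 = percent_at_or_below(waits, 20)
print(rank_of_20)  # 8 of the 10 waits are 20 minutes or less
```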

Future Analysis

Even after the program finishes, I plan on continuing to explore the location data from the 55 as well as data from other bus routes. In particular, I am interested in seeing if I can use my data to learn where bus bunching tends to occur along a route and to measure how consistently buses stick to their official schedule. Regarding bus bunching, I have a couple of ideas for exploring this phenomenon. One thought is to examine how the wait times between two buses change over the course of the route. If the wait times start to converge toward zero near a particular stop, this indicates that the buses are bunching together. My mentor suggested I could analyze bus bunching by plotting the variance in wait times during 60-minute intervals. High variance over an interval suggests bus bunching, as opposed to just short wait times due to more frequent service at that time of day.
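
Here is a rough sketch of what the variance idea could look like in Pandas. I have not run this analysis yet; the numbers below are invented so that the two hours have the same average wait but very different spreads, and the hour column is just a helper for this example.

```python
import pandas as pd

# Invented wait times (minutes) at a single stop; not real CTA data.
waits = pd.DataFrame({
    "decimal_time": [15.1, 15.3, 15.6, 15.9, 16.1, 16.3, 16.6, 16.9],
    "wait_time":    [10.0, 10.0, 10.0, 10.0,  1.0, 19.0,  1.0, 19.0],
})

# Assign each observation to the 60-minute interval it falls in.
waits["hour"] = waits["decimal_time"].astype(int)

# Mean and sample variance of the wait times within each hour.
summary = waits.groupby("hour")["wait_time"].agg(["mean", "var"])
print(summary)
```

Both hours average a 10-minute wait, but in the 16:00 hour the waits alternate between 1 and 19 minutes, which is the signature of buses arriving in pairs. Only the variance column separates the two cases.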

Determining how well buses stick to their schedule might be a bit more challenging. The tripids assigned to each bus trip don’t seem to have an easily predictable or identifiable pattern, so it would be difficult to use the tripids alone to determine if two trips from different days were scheduled during the same time slot. For example, according to the CTA’s official timetable, a 55 bus is scheduled to leave from Midway toward MSI at 15:14 every Monday through Friday, but it’s not clear what tripids these buses are assigned. Earlier in this post, I discussed how one could visually determine when the CTA scheduled buses to leave from/arrive at a particular stop based on the alignment of data points in the wait and travel time scatter plots. Using machine learning techniques (e.g. k-means clustering), we could determine which bus trips (i.e. tripids) are scheduled for the same time slot based on how well their departure times are aligned at a given origin stop. Once we determine which time slot each tripid belongs to, we could then cross-reference the CTA’s official schedule to analyze how much the buses tend to deviate from their scheduled time. This analysis could be applied to other bus routes in the system to determine which routes tend to be the most/least on time, the most consistent, and so on.
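
As a first sketch of the clustering idea, here is a bare-bones one-dimensional k-means written with NumPy. Everything in it is an assumption for illustration: the departure times are invented, kmeans_1d is a hypothetical helper with a simple evenly spaced initialization, and the number of time slots k is assumed to be known in advance. A real analysis would more likely lean on an existing library implementation.

```python
import numpy as np


def kmeans_1d(values, k, n_iter=50):
    """Plain k-means on 1-D data with evenly spaced starting centers."""
    values = np.asarray(values, dtype=float)
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assign every departure to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Invented decimal departure times from one stop, pooled over several days.
departures = [15.22, 15.24, 15.27, 15.45, 15.48, 15.51, 15.70, 15.73, 15.76]

labels, centers = kmeans_1d(departures, k=3)
print(labels)
print(centers.round(2))
```

Each cluster would correspond to one candidate time slot, and its center could then be compared against the nearest scheduled departure in the official timetable.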

I am currently in the process of collecting data from the following bus routes: 6 Jackson Park Express, 9 Ashland, 15 Jeffery Local, 31 31st, 47 47th, 66 Chicago, 72 North, 73 Armitage, 77 Belmont, and 82 Kimball-Homan. In a couple of weeks, I should have a bunch of new data to explore and a good starting point for my future analyses.

I’ve always thought the CTA purposely designated route number 66 to the Chicago Avenue bus, since U.S. Route 66 terminated in Chicago.

Finally…

Thank you to all of the organizers and sponsors who helped make the ChiPy mentorship a success. A special shout-out to Ray Berg, the program director, and Matt Hall, my mentor, for their continuing support and advice over the course of the program. One of the most important things I’ve learned during the program is that the Chicago Python community is friendly and welcoming. I encourage anyone who is reading this post and is interested to get involved in the community. If you are interested in becoming a ChiPy mentor or mentee, applications for the fall cycle open in a couple of weeks.

I hope you have enjoyed reading these posts, and I hope they have sparked an interest in Python, data science, and public transportation. Stay tuned for a bonus blog post and future musings on exploring public transit data.