Getting off the Struggle Bus, Part 2: Cleaning and Plotting Public Transit Data

[This is Part 2 in a 3-part series on my experiences as a mentee in the Chicago Python User Group (ChiPy) mentorship program. Read Part 1 here.]

Prelude

Since starting the ChiPy mentorship, I haven’t ridden on the 55 bus once. Lately when traveling to Hyde Park, I have opted to get off the Red Line at 47th Street and transfer to the southbound Route 15: Jeffery Local. This is more of a function of where in Hyde Park I have been visiting recently (the 15 traverses the northern border and eastern edge of Hyde Park, whereas the 55 cuts through the middle of the neighborhood) rather than a desire to avoid the 55 due to any lingering trauma from my past experiences. Besides, taking the 15 puts me in the perfect position to visit one of my favorite small grocery stores in the city, Hyde Park Produce. If I have time before the end of the mentorship, I am interested in collecting location data from the 15 and comparing its trip and wait times with those of the 55. One of my friends describes the 15 as “disappointing” and “unreliable”; however, my experiences waiting for and riding it have been nothing but pleasant. I suppose data will have to settle this disagreement.

For now, I am concerned with making sure the data I collected from the 55 is clean and accurate and that I am able to build interactive visualizations so that others can explore the data. In this post, I will discuss how I transformed the raw bus location data into usable data and also show off some of the graphing tools in Python that I’ve tried.

Preparing the data

Most data that we wish to use — whether it’s the raw data I collected from the Bus Tracker API, data scraped from a website, or data downloaded from a public database — is not in the form that we want to use. That is, the data we obtain is often not in a form that is immediately convenient for us and our project, but rather in a form that was convenient for whoever originally collected it. There are a number of ways this problem can manifest itself (a short pandas sketch of these transformations follows this list):

  • The data files are not in the right format. Perhaps the data we obtained is stored in a JSON file, but it would be easier for us to manipulate if it were a CSV.
  • The data entries are not the right type. For example, if you load a data file with a date field into Pandas, the dates will probably be read in as strings. We often want to convert these strings to datetime objects so that they are more convenient to use.
  • The data does not contain a field we need. In this case, we are hopefully able to derive the desired field from the existing fields. Perhaps our data has a date field, but we are only interested in the particular year. Or maybe we want a field that is the sum or difference of two other fields.
  • The data set contains more fields than we need. Sometimes it is convenient to drop the unwanted fields to produce smaller output files.
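
To make this concrete, here is a minimal pandas sketch of those four transformations. The file names and fields here are hypothetical, made up purely for illustration; they are not my actual Bus Tracker data:

import pandas as pd

# Hypothetical input file and fields, for illustration only
df = pd.read_json("records.json")                   # wrong format: read the JSON we were given
df["timestamp"] = pd.to_datetime(df["timestamp"])   # wrong type: strings -> datetime objects
df["year"] = df["timestamp"].dt.year                # missing field: derive it from an existing one
df = df.drop("unused_field", axis=1)                # extra fields: drop what we don't need
df.to_csv("records.csv", index=False)               # write out the CSV we actually wanted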

Moreover, many raw data sets contain various quirks:

  • Missing entries. In especially heinous cases, these entries are filled in with junk values like 0 or -9999 to signify an unknown or missing value (see the sketch after this list).
  • Errors introduced by manual human entry or by the apparatus that collected the data.
  • Inconsistent formatting. For example, a database with an address field might have an entry with the address “4600 S Dr. Martin Luther King Drive,” whereas another entry referring to the same address might read “4600 S King Dr.”
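
For the sentinel-value case, one common pandas pattern is to replace the junk values with NaN before doing anything else. This is a toy sketch with made-up numbers, not data from my project:

import numpy as np
import pandas as pd

# Toy example: -9999 is a sentinel for "unknown"
df = pd.DataFrame({"delay_minutes": [3, -9999, 7, 0]})
df["delay_minutes"] = df["delay_minutes"].replace(-9999, np.nan)
print(df["delay_minutes"].isna().sum())  # how many entries were actually missing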

Because of the above problems, we the programmers and data analysts must perform heavy labor: we must first clean, edit, sort, transform, massage, take out for dinner, shower with gifts, and constantly validate our data before it is in a format that is useful and convenient for analysis. I have encountered many names for these first steps — data processing, data wrangling, data cleansing, data munging—and all of these names have slightly different though overlapping definitions. I personally prefer the terms data cleansing/cleaning, because of the image the words evoke in my mind.

Most of the labor I performed in the cleansing phase involved deriving new fields from the existing data, rather than cleaning up errors and handling inconsistencies. Since the Bus Tracker data is not manually entered into a database by humans, but is instead uploaded to the CTA’s servers by a GPS-equipped device on the bus, the data shouldn’t have any inconsistent formatting or incorrectly entered information. Just because there are no human errors, however, does not mean the quality of the data is perfect. Every now and then a bus does not upload its location to the CTA’s servers for an extended period of time, resulting in “missing” data. In such cases, I either need to deal with the data I have or throw it out.

The above image outlines the steps of my data cleansing process. The technical details of how each step is performed are not particularly interesting, so I will not discuss them in this post. Toward the end of the cleansing process, the raw data has been transformed from a collection of 53 CSV files (one for each day of collection), where each row contains the location of one bus at one time, into two CSV files formatted like timetables. Each row in the timetables contains a single bus trip which includes a unique trip ID, the trip’s pattern ID (pid), the direction the bus is headed, and the times at which the bus arrived at each stop. If a bus does not reach a particular stop, the time for that stop is entered as nan (not a number). Below is a row for a bus executing pid 1293 (MSI to St. Louis, westbound). As you can see, the bus ends its trip at St. Louis, so there is no data for when the bus reaches Pulaski, Kostner, or Midway.

tripid,pid,direction,MSI,LakePark,Woodlawn,CottageGrove,GarfieldGreen,State,GarfieldRed,Halsted,Racine,Ashland,Damen,Western,California,Kedzie,StLouis,Pulaski,Kostner,MidwayOrange
...
20170131_1293_204342,1293,W,2017-01-31 01:43:08,2017-01-31 01:45:33,2017-01-31 01:47:35,2017-01-31 01:49:56,2017-01-31 01:52:40,2017-01-31 01:54:19,2017-01-31 01:56:11,2017-01-31 01:59:25,2017-01-31 02:02:22,2017-01-31 02:02:22,2017-01-31 02:05:40,2017-01-31 02:07:08,2017-01-31 02:08:40,2017-01-31 02:10:34,2017-01-31 02:11:35,nan,nan,nan

Formatting the data as a timetable allowed me to easily calculate travel times between two stops for a given trip and wait times between two successive buses. To do so, I first loaded the timetables into a Pandas DataFrame and then iterated over all of the bus stops and all of the possible destination bus stops. I subtracted the origin stop column from the destination stop column to calculate trip times. To calculate the wait times, I first sorted the DataFrame in chronological order with respect to the origin stop, then dropped all rows where the destination stop has a null time (a simplification that causes some undesirable consequences), and finally subtracted adjacent rows from each other.

import numpy as np
import pandas as pd

MIN_PER_DAY = 1440
df = pd.read_csv("timetable_eb.csv")
stop_list = list(df)[3:]  # first 3 columns are tripid, pid, direction
df[stop_list] = df[stop_list].apply(pd.to_datetime)  # timetable entries are read in as strings
for start_stop in stop_list[:-1]:
    destinations = stop_list[stop_list.index(start_stop) + 1:]  # only stops after the origin
    for end_stop in destinations:
        sorted_df = df.sort_values(by=start_stop)  # chronological order at the origin stop
        sorted_df.dropna(subset=[start_stop, end_stop], inplace=True)  # keep trips that reach both stops
        sorted_df["start"] = start_stop
        sorted_df["stop"] = end_stop
        sorted_df["day_of_week"] = (sorted_df[start_stop] - pd.DateOffset(hours=3)).dt.dayofweek  # new days start at 3AM
        sorted_df["decimal_time"] = sorted_df[start_stop].apply(lambda x: round(x.hour + x.minute / 60.0, 2))
        sorted_df["travel_time"] = sorted_df[end_stop] - sorted_df[start_stop]
        sorted_df.travel_time = sorted_df.travel_time.apply(lambda x: round(abs((x / np.timedelta64(1, 'D')) * MIN_PER_DAY), 2))
        sorted_df["wait_time"] = sorted_df[start_stop].diff(periods=-1)
        sorted_df.wait_time = sorted_df.wait_time.apply(lambda x: round(abs((x / np.timedelta64(1, 'D')) * MIN_PER_DAY), 2))

I saved the output as a collection of CSV files split up by starting bus stop and trip direction, the code for which I omitted in the snippet above. The first five lines of Ashland_eb.csv are listed below:

tripid,start,stop,day_of_week,decimal_time,travel_time,wait_time
20170130_5424_204183,Ashland,Racine,0,19.23,1.92,8.9
20170130_5424_204184,Ashland,Racine,0,19.38,1.97,16.93
20170130_5424_204185,Ashland,Racine,0,19.67,2.17,11.53
20170130_5424_204186,Ashland,Racine,0,19.85,1.92,16.12

You may be wondering why it’s important to consider a destination stop when discussing the wait time between two buses. As I’ve mentioned, buses can execute several different patterns, i.e., the sequences of stops they visit. The full extent of Route 55 runs along 55th Street/Garfield Boulevard between the Museum of Science and Industry (MSI) and Midway Airport. When executing certain patterns, the 55 only travels a subset of the full route. A number of buses only travel between MSI and St. Louis Avenue, and there are other minor trips between Ashland and MSI, Kedzie and MSI, and so forth. Consequently, it is important to know where the bus is going to end its trip. For example, if I want to travel from MSI to Midway, and the next bus departs in 5 minutes but is only going as far as St. Louis, then the wait time for the next bus is not 5 minutes. Rather, the wait time is however long until the next bus to Midway departs.
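
This is also why the wait-time calculation above drops trips that never reach the destination stop before taking differences between adjacent departures. A toy sketch of the idea, with made-up departure times rather than my real data:

import pandas as pd

# Three departures from MSI; only the first and last actually reach Midway
departures = pd.DataFrame({
    "leaves_msi": pd.to_datetime(["12:00", "12:05", "12:20"]),
    "reaches_midway": [True, False, True],
})
# The wait for the next bus that actually goes to Midway is 20 minutes, not 5
to_midway = departures[departures.reaches_midway]
print(to_midway.leaves_msi.diff().dropna())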


Plotting in Python

Now that I have files containing the trip and wait times between stops, it is time to create plots of the data to see if my scripts did what I intended them to do and to explore the libraries I might use on the project website.

Matplotlib

Of the plotting libraries I tried, matplotlib is the simplest to use if you need to quickly plot your data. Best of all, if you are using a Jupyter notebook, you can display the graphs inline if you include the first line of the code below:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
# the way I read the data into Pandas is more complicated in my actual code
df = pd.read_csv("data.csv")
stop_selections = df[(df.start == "Ashland") & (df.stop == "MidwayOrange")]
plt.scatter(stop_selections.decimal_time, stop_selections.travel_time, s=1, c='navy')
plt.xlim(0, 24)
plt.ylim(0, 80)
plt.show()
This is actually the graph produced by pandas. The axes are labelled automatically from the names of the DataFrame columns that are used.

Note that in all of the following graphs, the x-axis plots the time that the bus left the origin stop. The (decimal) time is given in hours since midnight, and the travel time is given in minutes. It is worth noting that Pandas has built-in plotting capabilities that use matplotlib as their foundation. The following code produces the same graph:

stop_selections.plot.scatter(x='decimal_time', y='travel_time', s=1, c='navy', xlim=(0,24), ylim=(0,80), figsize=(8,8))
plt.show()

Unfortunately, matplotlib has no quick way to create many related plots at once. For example, if I wanted to plot the travel times from Ashland to Midway broken down by day of the week, the fastest way to do so with plain matplotlib is a for loop over subplots, as in the sketch below. Moreover, matplotlib’s ability to create interactive visualizations is fairly limited without the help of other libraries like Plotly.
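
Here is a rough sketch of that for-loop approach, assuming the same columns as the files above:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")
stop_selections = df[(df.start == "Ashland") & (df.stop == "MidwayOrange")]
fig, axes = plt.subplots(1, 7, figsize=(21, 3), sharey=True)
for day, ax in enumerate(axes):  # 0 = Monday, ..., 6 = Sunday
    subset = stop_selections[stop_selections.day_of_week == day]
    ax.scatter(subset.decimal_time, subset.travel_time, s=1, c='navy')
    ax.set_title("day " + str(day))
    ax.set_xlim(0, 24)
    ax.set_ylim(0, 80)
plt.show()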

Seaborn

Enter Seaborn, at least to solve the first problem. Seaborn is also built on top of matplotlib, and it was built with statisticians in mind. Besides having a variety of color palettes to make prettier plots, Seaborn makes it easy to plot many variables at once with conditional plots — a plot of two variables X and Y conditional on the value of any number of other variables. Using the Seaborn object FacetGrid, I created plots of the travel times from Ashland to Midway broken down by the day of the week:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")
stop_selections = df[(df.start == "Ashland") & (df.stop == "MidwayOrange")]
g = sns.FacetGrid(stop_selections, col='day_of_week')
g.map(plt.scatter, 'decimal_time', 'travel_time')
g.set(xlim=(0, 24), ylim=(0, 80))
0 corresponds to Monday, 1 to Tuesday, and so on. There are fewer data points for Saturdays (5) and Sundays (6) since fewer buses are in service on weekends. The weekend plots also show no big peaks in travel times around 8am and 5pm, since fewer people commute to work on those days, which means less traffic congestion.

If you want to be more adventurous, you can graph two variables conditional on the value of multiple variables. Below I’ve graphed time vs travel time broken down by the starting stop and destination stop.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")
g2 = sns.FacetGrid(df, col='start', row='stop')  # facet on the full data set, not a single stop pair
g2.map(plt.scatter, 'decimal_time', 'travel_time')
g2.set(xlim=(0, 24), ylim=(0, 80))
Wow! Lots of plots!

Admittedly, an 18x18 grid of plots is a bit overwhelming! But it gives you a taste of what Seaborn can do. You might notice some of the plots are empty. Obviously, if the starting and destination stops are the same, then the bus has not traveled at all, so there is no data to plot. The FacetGrid object is particularly useful for me, because I can quickly create many plots without having to write much code. The plots provide a sanity check that my scripts are working correctly. For example, most plots exhibit two peaks in travel time during morning and evening rush hours. Travel times also appear to be the quickest late at night or very early in the morning when there is little traffic and few people are riding the bus. This gives me confidence that the scripts cleaned and transformed the data correctly.

Bokeh

Unlike the previous two examples, Bokeh creates interactive visualizations aimed at being displayed in web browsers. You can also configure Bokeh to display plots inline in a Jupyter notebook. It’s not the simplest library to use if you want to quickly plot and examine your data. If you want to explore your data through interactive visualizations, however, Bokeh is the clear choice over matplotlib and Seaborn. Bokeh supports zooming and panning without the need to write any extra lines of code. By adding widgets to your application, you can create even more complex interactions. The following code is for a simple interactive scatter plot of time vs travel times between two bus stops. Two selection widgets allow the user to choose which two stops they are interested in.

import numpy as np
import pandas as pd
from bokeh.client import push_session
from bokeh.io import curdoc, reset_output
from bokeh.layouts import column
from bokeh.models.widgets import Select
from bokeh.plotting import figure

reset_output()
# keeps jupyter notebook in sync with bokeh server
session = push_session(curdoc())

df = pd.read_csv("data.csv")
p = figure(title="Route 55 travel times", x_range=(0, 24), y_range=(0, 80))
scatter_plot = p.circle([], [], color="navy", radius=0.01)
line_graph = p.line([], [], color="red")
# R55_BUS_STOPS is a dict defined elsewhere in my code, mapping stop names to timetable columns
select_start = Select(title="Start", options=list(R55_BUS_STOPS.keys()))
select_stop = Select(title="Stop", options=list(R55_BUS_STOPS.keys()))

def update(attr, old, new):
    p.title.text = "Route 55 travel times: " + select_start.value + " -> " + select_stop.value
    stop_selections = df[(df.start == select_start.value) & (df.stop == select_stop.value)]

    # median travel time within each 15-minute window of the day
    bins = np.arange(0, 24.25, 0.25)
    grouped = stop_selections.groupby([pd.cut(stop_selections.decimal_time, bins, labels=np.arange(0, 24, 0.25), right=False)]).travel_time.median()

    scatter_plot.data_source.data["x"] = stop_selections.decimal_time
    scatter_plot.data_source.data["y"] = stop_selections.travel_time
    line_graph.data_source.data["x"] = np.arange(0, 24, 0.25)
    line_graph.data_source.data["y"] = grouped

select_start.on_change('value', update)
select_stop.on_change('value', update)

session.show(column(p, select_start, select_stop))
session.loop_until_closed()
The solid red line indicates the median travel times.

Next Steps

I plan on continuing to make more sophisticated interactive visualizations in Bokeh. I’m also interested in investigating other Python data visualization libraries like Pygal and Plotly. In my final post, I hope to discuss my experiences setting up a website that allows others to interact with the data, and I also hope to provide a final analysis of the data. Stay tuned for the final post!