Python for Transit: Get the Most Out of Your GTFS With This Python Package
What this article is about
This article is meant to be a light introduction to the Python package gtfs_functions. This package was specifically designed to speed up some frequent GTFS spatial analyses I do at Remix and it should be easily accessible for anyone with a little Python knowledge.
If you feel that this content does not explain a function thoroughly enough, I recommend you visit the article that dives into the specific function you are interested in.
Also, note that the outputs of these functions are GeoDataFrames. They are mostly intended to be visualized on maps, but they can also be analyzed in non-spatial plots.
For the article, I downloaded the GTFS from SFMTA (San Francisco, CA).
Using the functions of this package we will:
- Read the GTFS files into Pandas DataFrames and GeoPandas GeoDataFrames.
- Calculate the frequency per line, direction, and time of day and put it in a GeoDataFrame with LineStrings.
- Calculate the frequency per stop, direction, and time of day and put it in a GeoDataFrame with Points.
- Cut the route shapes into segments that go from stop to stop.
- Calculate the scheduled speeds per bus segment (from stop to stop), route, direction, and time of day and put it in a GeoDataFrame with LineStrings.
- Calculate the frequency per bus segment (from stop to stop), direction, and time of day and put it in a GeoDataFrame with LineStrings.
- Show each of these on a map, styled by a variable.
- Export our work to a spatial file (ESRI Shapefile or GeoJSON).
- Check out other possible plots (with Plotly).
About GTFS (skip if you already know about it)
If you are slightly related to the transit industry, you have most probably heard about the GTFS standard.
If you haven’t, let me put it simply for you: GTFS is the data standard transit agencies across the world use to publish their transit service to Google Maps. It is implemented by (almost) every Public Transit Authority in the US, it is widely implemented in Europe, and many cities in Latin America are getting on board with it as well.
It has many advantages, but I will only name a couple that are specifically relevant to this article:
- It has both timetable and geometry information about the bus routes
- It is widely adopted by PTAs across the world
In other words, GTFS has enough information to visualize things on a map, and if you learn to work with it, you’ll be able to analyze transit in many different cities across the world! Excited yet? I know! Now, let's get to it.
Note: If you are not familiar with the files of a GTFS zip file and how they relate to each other, I recommend you read the official documentation about it.
GTFS functions, the package
This package groups a series of functions that I usually use in my workflow as a Data Scientist at Remix. You can find the repository and official documentation on GitHub.
Note: In the following sections I assume the reader is already familiar with the GTFS text files and how they relate to each other.
Package installation and import
To install the package and import it to your notebook run the following:
1. Read the GTFS
To parse a GTFS into the datasets we run the following:
This will generate three DataFrames (routes, stop_times, and trips) and two GeoDataFrames (stops and shapes). As we will see below, these dataframes are slightly different from what you would get by reading each file independently with the normal pd.read_csv() function. I will get into more details about it later, but for now, all you need to know is that these modifications are made to better fit our workflow.
It is also worth noting the parameter busiest_date=True. It means that only the information for the busiest date of the GTFS will be imported. If you want to import the information for all the dates in the calendar, you can set busiest_date=False.
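To picture what "busiest date" means, here is a minimal plain-Python sketch. It assumes the busiest date is the earliest service date with the most scheduled trips, which is my assumption about the criterion and not something taken from the package. The toy data and the function name busiest_date are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: the service_ids active on each date,
# and the service_id each trip belongs to.
services_by_date = {
    "2020-03-02": {"weekday"},
    "2020-03-07": {"saturday"},
    "2020-03-08": {"sunday"},
}
trip_service = {
    "t1": "weekday", "t2": "weekday", "t3": "weekday",
    "t4": "saturday", "t5": "saturday",
    "t6": "sunday",
}


def busiest_date(services_by_date, trip_service):
    """Return the earliest date with the largest number of scheduled trips."""
    trips_per_service = Counter(trip_service.values())
    trips_per_date = {
        date: sum(trips_per_service[s] for s in services)
        for date, services in services_by_date.items()
    }
    most_trips = max(trips_per_date.values())
    # ISO dates sort chronologically as strings, so min() picks the earliest.
    return min(d for d, n in trips_per_date.items() if n == most_trips)


print(busiest_date(services_by_date, trip_service))
```

Filtering to a single date like this keeps the tables small, which matters because every later step works on stop_times.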
Let’s take a look at what we have.
The routes DataFrame is pretty much the same as the original “routes.txt” from the GTFS, with the added column route_name, which is the concatenation of route_short_name and route_long_name.
The main difference between stops and the text file from the GTFS is that it is a GeoDataFrame with a Point geometry that can be mapped.
In stop_times, the arrival and departure times are expressed as seconds from midnight, and the table has more information than the original text file, such as route_id, shape_id, stop_name, runtime_h, and the Point geometry.
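As a small illustration of the "seconds from midnight" representation, the sketch below converts a GTFS time string into seconds. The function name to_seconds is hypothetical and this is not the package's code. Note that GTFS allows times past 24:00:00 for trips that run after midnight on the same service day, which is one reason a plain integer is easier to work with than a clock time.

```python
def to_seconds(gtfs_time):
    """Convert a GTFS 'HH:MM:SS' string to seconds from midnight."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(to_seconds("08:15:30"))  # 29730
print(to_seconds("25:10:00"))  # 90600, a trip running past midnight
```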
In shapes, each shape_id has a LineString associated with it. This LineString was created from the list of lat and lon values for each shape in the original text file.
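The idea of turning the rows of “shapes.txt” into one line per shape_id can be sketched with the standard library alone. The sample rows and the function name shape_coordinates are invented for the example. The resulting coordinate lists are in (lon, lat) order, which is the (x, y) order a shapely LineString expects.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a shapes.txt file (rows deliberately out of order).
shapes_txt = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
A,37.7750,-122.4190,2
A,37.7740,-122.4200,1
A,37.7760,-122.4180,10
B,37.8000,-122.4000,1
B,37.8010,-122.4010,2
"""


def shape_coordinates(text):
    """Return {shape_id: [(lon, lat), ...]} ordered by shape_pt_sequence."""
    points = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        points[row["shape_id"]].append((
            int(row["shape_pt_sequence"]),  # numeric sort, so 10 comes after 2
            float(row["shape_pt_lon"]),
            float(row["shape_pt_lat"]),
        ))
    return {
        shape_id: [(lon, lat) for _, lon, lat in sorted(pts)]
        for shape_id, pts in points.items()
    }


coords = shape_coordinates(shapes_txt)
print(coords["A"])
```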
2. Calculate stop frequencies
With the output of the previous step:
The output for one specific stop shows:
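Conceptually, a stop frequency comes from counting the trips that serve a stop within each time window and dividing the window length by that count. The sketch below is a plain-Python illustration of that idea, not the package's implementation. The window edges in cutoffs, the function name stop_frequency, and the choice to express the result as an average headway in minutes are all assumptions.

```python
from bisect import bisect_right

cutoffs = [0, 6, 9, 15, 19, 22, 24]  # hypothetical window edges, in hours


def stop_frequency(departures, cutoffs):
    """Count trips per time window and derive the average headway in minutes.

    departures: seconds from midnight at which trips leave one stop
    in one direction.
    """
    counts = [0] * (len(cutoffs) - 1)
    for seconds in departures:
        hour = (seconds / 3600) % 24  # wrap times past midnight
        counts[bisect_right(cutoffs, hour) - 1] += 1

    rows = []
    for start, end, ntrips in zip(cutoffs, cutoffs[1:], counts):
        if ntrips:  # skip windows with no service
            rows.append({
                "window": f"{start}:00-{end}:00",
                "ntrips": ntrips,
                "headway_min": (end - start) * 60 / ntrips,
            })
    return rows


# One trip every 20 minutes from 6:00 to 8:40, then trips at 10:00 and 12:00.
departures = [6 * 3600 + k * 1200 for k in range(9)] + [10 * 3600, 12 * 3600]
for row in stop_frequency(departures, cutoffs):
    print(row)
```

In this toy case the 6:00-9:00 window has 9 trips and a 20-minute headway, while the 9:00-15:00 window has 2 trips and a 180-minute headway.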
3. Calculate line frequencies
With the parsed GTFS from step 1:
The output for one specific line shows:
4. Cut the shapes into segments
The output shows:
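To picture what the cut does, here is a minimal sketch on flat (x, y) coordinates. It assumes each stop has already been located as a distance along the shape (shapely's project method can produce such distances), and it is not the package's implementation. All function names are hypothetical.

```python
import math


def cumulative_lengths(line):
    """Distance from the start of the line to each of its vertices."""
    lengths = [0.0]
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        lengths.append(lengths[-1] + math.hypot(x1 - x0, y1 - y0))
    return lengths


def point_at(line, lengths, distance):
    """Interpolate the point located `distance` along the line."""
    for i in range(1, len(line)):
        if distance <= lengths[i]:
            span = lengths[i] - lengths[i - 1]
            t = 0.0 if span == 0 else (distance - lengths[i - 1]) / span
            (x0, y0), (x1, y1) = line[i - 1], line[i]
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return line[-1]


def cut_at_distances(line, stop_distances):
    """Split the line into stop-to-stop pieces; distances must be sorted."""
    lengths = cumulative_lengths(line)
    segments = []
    for start, end in zip(stop_distances, stop_distances[1:]):
        # Keep the shape vertices that fall strictly between the two stops.
        inner = [p for p, d in zip(line, lengths) if start < d < end]
        segments.append(
            [point_at(line, lengths, start), *inner, point_at(line, lengths, end)]
        )
    return segments


shape = [(0, 0), (4, 0), (4, 4)]   # an L-shaped route, 8 units long
stops_along_shape = [0, 2, 6, 8]   # four stops, so three segments
for segment in cut_at_distances(shape, stops_along_shape):
    print(segment)
```

The middle segment keeps the corner vertex (4, 0), so each piece follows the route geometry instead of drawing a straight line between stops.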
5. Calculate segments speeds
The output for one specific segment, direction, and time of day shows:
Note that in the example above, the chosen segment 3114–3144 appears four times: once for each of the three routes that serve that segment and a fourth time for the route “All lines”. This route is created by the function, and it aggregates the weighted average speed in that segment, taking into account all the routes that stop at both its starting and ending stops.
Also, notice that the aggregated value for “All lines” takes into account the three segments, ignoring the direction the lines had in the GTFS. This makes sense since the segment always starts and ends in the same stops, even if the assigned direction is different in the GTFS.
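The aggregation can be pictured with a few lines of plain Python. This is only a sketch under two assumptions that do not come from the package: speeds are expressed in km/h, and the weights of the average are the number of trips each route runs on the segment. The rows and function names are hypothetical.

```python
# Hypothetical rows for one segment in one time window, one row per route.
segment_rows = [
    {"route": "1", "distance_m": 400.0, "runtime_s": 80.0, "ntrips": 10},
    {"route": "2", "distance_m": 400.0, "runtime_s": 100.0, "ntrips": 5},
    {"route": "3", "distance_m": 400.0, "runtime_s": 160.0, "ntrips": 5},
]


def speed_kmh(distance_m, runtime_s):
    """Scheduled speed: segment length divided by scheduled runtime."""
    return distance_m / runtime_s * 3.6


def all_lines_speed(rows):
    """Average speed over all routes, weighted by trips (an assumption)."""
    total_trips = sum(r["ntrips"] for r in rows)
    weighted_sum = sum(
        speed_kmh(r["distance_m"], r["runtime_s"]) * r["ntrips"] for r in rows
    )
    return weighted_sum / total_trips


for r in segment_rows:
    print(r["route"], round(speed_kmh(r["distance_m"], r["runtime_s"]), 2))
print("All lines", round(all_lines_speed(segment_rows), 2))
```

With these numbers the three routes run at 18.0, 14.4, and 9.0 km/h, and the weighted value is 14.85 km/h, closer to route 1 because it runs the most trips.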
6. Calculate segment frequencies
In the same way that we calculated the frequency by line, we can now calculate the frequency by bus segment.
It takes the same three arguments that we had for speeds:
The output for one specific segment, direction, and time of day shows:
Note that the same behavior we saw in “speeds” repeats here. The aggregation of “All lines” for the segment disregards the assigned direction in the GTFS, and only considers the starting and ending stop for the segment.
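A small sketch makes the direction-agnostic aggregation concrete: the grouping key is only the pair of stops, so rows with different direction_id values end up in the same bucket. The rows, the 180-minute window, and the function name are made up for the example, and the headway unit (minutes) is an assumption.

```python
from collections import defaultdict

# Hypothetical per-route rows for a single 180-minute time window.
freq_rows = [
    {"route": "1", "direction_id": 0, "start_stop": "3114", "end_stop": "3144", "ntrips": 12},
    {"route": "2", "direction_id": 1, "start_stop": "3114", "end_stop": "3144", "ntrips": 6},
    {"route": "2", "direction_id": 0, "start_stop": "3144", "end_stop": "3114", "ntrips": 6},
]
window_minutes = 180


def all_lines_frequency(rows, window_minutes):
    """Aggregate trips per (start_stop, end_stop), ignoring direction_id."""
    trips = defaultdict(int)
    for r in rows:
        # direction_id is deliberately left out of the key.
        trips[(r["start_stop"], r["end_stop"])] += r["ntrips"]
    return {
        segment: {"ntrips": n, "headway_min": window_minutes / n}
        for segment, n in trips.items()
    }


for segment, values in all_lines_frequency(freq_rows, window_minutes).items():
    print(segment, values)
```

The first two rows have opposite direction_id values but share the same start and end stops, so they are added together (18 trips, one every 10 minutes). The third row runs between the same stops in the opposite order, so it stays a separate segment.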
7. Show results on a map
You can always export the GeoDataFrames we saw and open them in your favorite GIS software, but I added a function to allow the user to quickly take a look from the notebook before going into that workflow. It is not meant to be presentation-ready or fully customizable, just to take a quick look.
The function map_gdf() is built on top of the folium library and allows you to quickly visualize and style the data on a map.
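Styling a layer by a variable usually comes down to assigning each value to a class defined by break points and giving each class a color. The sketch below shows that classification step only. It is a generic illustration with made-up breaks and colors, not a description of how map_gdf() works internally.

```python
from bisect import bisect_left

# Hypothetical upper class limits and one color per class
# (there is always one more color than there are breaks).
breaks = [10, 20, 30, 40]
colors = ["#d13870", "#e895b3", "#55d992", "#3ab071", "#066a40"]


def color_for(value, breaks, colors):
    """Pick the color of the class a value falls into.

    A value equal to a break belongs to the class that ends at that break.
    """
    if len(colors) != len(breaks) + 1:
        raise ValueError("need exactly one more color than breaks")
    return colors[bisect_left(breaks, value)]


for value in (5, 10, 25, 100):
    print(value, color_for(value, breaks, colors))
```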
It takes 6 arguments as shown below. For example, to visualize line frequencies:
If we want to see the data for stop frequencies:
If you are looking to visualize data at the segment level for all lines I recommend you go with something more powerful like kepler.gl (AKA my favorite data viz library). For example, to check the scheduled speeds per segment:
You will need to manually style the colors and filters but you will have complete control over the visual. Or you can always learn to do it programmatically (which I haven’t yet).
8. Export your work
You can save your GeoDataFrames as a shapefile or GeoJSON with geopandas, as with any other GeoDataFrame.
Nonetheless, since I found it repetitive to specify all the parameters geopandas needs to create a shapefile or GeoJSON, I built the function save_gdf() on top of it to save myself some time.
To save a file as both a shapefile and a GeoJSON:
If the arguments shapefile and geojson are not specified, the function will save it as a shapefile by default.
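For readers curious about what ends up in a GeoJSON file, the format is plain JSON, so a stdlib-only sketch can write one. In practice, the to_file method of a GeoDataFrame with driver="GeoJSON" does this for you. The rows, the file name, and the helper to_feature_collection below are hypothetical, and this is not how save_gdf() is implemented.

```python
import json
import os
import tempfile

# Hypothetical segment rows: coordinates in (lon, lat) order plus attributes.
export_rows = [
    {
        "segment_id": "3114-3144",
        "speed_kmh": 14.85,
        "coords": [(-122.4200, 37.7740), (-122.4190, 37.7750)],
    },
]


def to_feature_collection(rows):
    """Build a GeoJSON FeatureCollection of LineStrings from plain dicts."""
    features = []
    for row in rows:
        properties = {k: v for k, v in row.items() if k != "coords"}
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [list(c) for c in row["coords"]],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Write to a temporary folder so the example leaves no files behind in place.
path = os.path.join(tempfile.mkdtemp(), "segment_speeds.geojson")
with open(path, "w") as f:
    json.dump(to_feature_collection(export_rows), f)
print(path)
```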
9. Some other possible interactive plots
In this section, I show a few examples of how the data could be visualized in Plotly, but it is by no means an exhaustive list of the things one could do.
The most frequent plot I use is a histogram to identify the best cutoffs to style the layer:
This code creates a Heatmap with the scheduled speeds per segment for one specific route and direction.
And this is the interactive output:
Line charts: Speed on a segment by hour of the day
I’ll admit this one is a little more advanced but in all truth, it is just a few annotations one has to add to make the plot much better.
The interactive output shows:
Acknowledgments & References
Far from taking credit for others’ work, I want to acknowledge that some functions of this package were built on top of great, more generic packages and were only slightly modified to better serve this specific workflow.
For example, the function import_gtfs() heavily relies on partridge, a powerful Python library created by Remix founders that makes parsing a GTFS very easy. Similarly, map_gdf() and save_gdf() are built on top of folium and geopandas respectively.