Building a Poster out of Bus Route Maps, pt 1

Tristan Crockett
6 min read · Jan 3, 2018

I use Python image libraries and open data to pay tribute to the humble city bus.

I love mass transit. I really do. My family took a trip to London when I was eight, and I obsessed over figuring out how to navigate us on the Tube. Nowadays, sometimes this love takes the form of taking buses and trains pretty much everywhere. Sometimes it takes the form of reading about transportation and housing policy as it relates to ensuring mass transit is a viable option for as many people as possible. Sometimes it takes the form of building a poster for a wall using bus maps, which is what I’m writing about today.

The Chicago Transit Authority releases bus brochures on its website. Each one features a mini map of the entire route without much detail: mostly just the route drawn as a solid line, with major intersections, transfers, and points of interest marked along the way.

A sample bus brochure. There is a great taqueria right by Halsted off of this route, but sadly restaurants are not considered ‘points of interest’ by the CTA.

The goal of the project is simple: Make a big poster of all of the little maps. Individually, they are all compact and similar in dimension; as you can see from the example above, they are oriented in different ways to make the map fit into the tall shape required by the brochure. This works out because it means the maps should line up reasonably well if I stick them in a grid.

The trickiest task is to extract the actual map from each PDF. I don’t even want the black bars, just the route #/name and the map itself. This is easy enough to do manually in an image program, but the CTA has well north of 100 bus routes, so I need something more automatic. Sure, maybe I could email the CTA and see if they have the originals and would send them to me. But I’m not going to do that. I’m going to write a script.

Downloading all the images

There has to be a list of all these brochures, right? Of course there is. It’s here: http://www.transitchicago.com/travel_information/bus_schedules.aspx

This page actually looks pretty easy to scrape. All the links we want are in neat lists, with very similar link texts. Here’s an example: http://www.transitchicago.com/asset.aspx?AssetId=856

All of the PDF links have this format (differing only by AssetId). Furthermore, all of the ones that link to schedules have text that starts with ‘Schedule’.

To scrape it, we’re going to use Beautiful Soup, a handy Python library for pulling data out of HTML.

Grab each link from the CTA’s bus brochure page and download it

As mentioned above, the regular structure of the links on this page helps us out; we can filter the links on the page down to the ones we want to download by specifying prefixes for both the link target and the link text. Once all the schedule links have been collected, each one is downloaded with a standard call to ‘urlretrieve’, which saves it to a local file. Of course, this code snippet by itself doesn’t do anything too useful yet.
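A minimal sketch of this filtering and download step, not necessarily the exact script used here. It assumes the `bs4` package is installed and that the PDF links are relative and start with `/asset.aspx?AssetId=`; the function names are made up for illustration.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

PAGE_URL = "http://www.transitchicago.com/travel_information/bus_schedules.aspx"
HREF_PREFIX = "/asset.aspx?AssetId="  # assumed: relative form of the PDF links
TEXT_PREFIX = "Schedule"              # schedule links start with this text


def filter_schedule_links(pairs):
    """Keep the (href, text) pairs that look like schedule PDFs."""
    return [
        (href, text)
        for href, text in pairs
        if href.startswith(HREF_PREFIX) and text.startswith(TEXT_PREFIX)
    ]


def schedule_links(html):
    """Parse the page HTML and return the schedule (href, text) pairs."""
    # Imported here so the pure filter above works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    pairs = [
        (a["href"], a.get_text(strip=True))
        for a in soup.find_all("a", href=True)
    ]
    return filter_schedule_links(pairs)


def download(links, page_url=PAGE_URL):
    """Download each linked PDF and return the local file paths."""
    paths = []
    for href, _text in links:
        # With no filename given, urlretrieve writes to a temporary file.
        path, _headers = urlretrieve(urljoin(page_url, href))
        paths.append(path)
    return paths
```

Keeping the prefix check in its own function makes it easy to try against a handful of hand-written links before hitting the real page.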

Cropping the route maps out of each PDF

This next step is a bit tougher. There are a bunch of other things in each brochure, and we want to extract just the map. Thankfully, the brochures have some structure, but they aren’t totally uniform. The first page of the Pershing bus brochure above is divided into three equal-width sections, with the map in the rightmost section. Extracting from this layout shouldn’t be too bad.

Another bus brochure layout

How about other maps? Another layout, like the 28 above, is four equal-width sections, with the map in the second section. With a little luck, there is some simple pattern to tease out here, maybe based on the image width.

To work with these images in code, I’ve decided to use Wand, a Python library based on ImageMagick that can work with PDF files. I haven’t used it before, but it seems to offer the necessary features and has good documentation. To start, I had a hunch that the page width might be important for distinguishing between different layouts, so I decided to print out the page sizes.

Printing out the size of the first page at each link.

The output from this looked something like

<a href="/asset.aspx?AssetId=885">Schedule - Route 165 - West 65th</a>
(792, 612)
<a href="/asset.aspx?AssetId=828">Schedule - Route 169 - 69th/UPS Express</a>
(504, 612)

Every printout had a height of 612, and most of the widths were 504, 792, or 1008; further inspection turned up a couple more. This seems to imply 2-, 3-, and 4-section layouts. Cropping in ImageMagick works by specifying ‘top’ and ‘left’ values to denote the top left corner of the cropped region, and ‘width’ and ‘height’ to denote how far to the right and down it extends. So I put together some cropping rules based on the page width.

Combining the cropping snippet with the earlier retrieval code, and some boilerplate code to save the results, I came up with this.

Within these cropping rules, the ‘top’ and ‘height’ definitions are all extracting the full vertical span, but there is some variation in the ‘left’ and ‘width’. Each variation is divided into a certain number of equal-width sections, and some of them start in different sections. For example, for a width of 792, the left starts at `(page_width / 3) * 2` to target the rightmost section of the page.
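Here is a rough sketch of what width-based cropping rules like these could look like. Only the 792 rule is spelled out above; the 1008 entry assumes that width is the four-section layout with the map in the second section, and other widths (such as 504) would need their own entries once checked by eye. `LAYOUTS`, `crop_box`, and `crop_map` are illustrative names, and the Wand part assumes the `wand` package and ImageMagick are installed.

```python
# page width -> (number of equal-width sections, zero-based index of the map)
LAYOUTS = {
    792: (3, 2),   # three sections, map in the rightmost one
    1008: (4, 1),  # assumed: four sections, map in the second one
}


def crop_box(page_width, page_height):
    """Return (left, top, width, height) for the map section of a page."""
    sections, index = LAYOUTS[page_width]
    section_width = page_width // sections
    # Full vertical span; only 'left' and 'width' vary between layouts.
    return (section_width * index, 0, section_width, page_height)


def crop_map(pdf_path, out_path):
    """Crop the map out of the first page of a brochure PDF and save it."""
    # Imported here so crop_box can be used without Wand installed.
    from wand.image import Image

    # The "[0]" suffix asks ImageMagick for the first page only.
    with Image(filename=pdf_path + "[0]") as page:
        left, top, width, height = crop_box(*page.size)
        page.crop(left, top, width=width, height=height)
        page.save(filename=out_path)
```

For a 792 by 612 page, `crop_box` returns a left edge of 528, which matches `(page_width / 3) * 2`.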

The results:

One maybe-problem, as can be seen by the marble background I artificially added: some of the brochure maps were against a white background and some were against a transparent background! This may not be a problem when overlaid on white, but I’d still like to give them a consistent background, which can be done with one line in Wand:

page.alpha_channel = False

There’s still one other thing I’d like to do: Remove those black bars. I didn’t do anything fancy to achieve this, just played around with ‘top’ and ‘height’ values in the cropping rules until the results looked good. Here’s the full script at this point.

The 15 bus map, after removing the top/bottom black bars.

There are still occasionally pieces of text I would rather exclude like in the 15 bus above, but I can live with those. I consider the bus maps properly extracted.

Assembling the maps into one big image

So, how do I want to sort these? By route number? By route name? By northernmost point? By ridership? There are lots of ways to order the routes in the grid, but I’m going to keep it simple and order the routes how the CTA presents them, numerically sorted.

I still have to figure out a width and height for the grid. Ideally it would be a full grid, but the script produced 127 images, and 127 is prime, so no grid divides evenly. Close is still good, and I’d rather have a few missing spaces at the end of the last line than a few extra images on a new line. The code below creates an image by taking in the number of columns in the grid and filling in images from left to right, top to bottom.
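As a sketch of that left-to-right, top-to-bottom fill (not necessarily the original code): the layout math lives in a plain function, and a second function pastes the tiles with Wand. The function names are hypothetical, the cell size is passed in because the cropped maps’ final dimensions depend on the cropping rules, and the Wand part assumes the `wand` package is installed.

```python
import math


def grid_layout(count, columns, cell_width, cell_height):
    """Return the canvas size and each cell's top-left corner, row by row."""
    rows = math.ceil(count / columns)
    positions = [
        ((i % columns) * cell_width, (i // columns) * cell_height)
        for i in range(count)
    ]
    return (columns * cell_width, rows * cell_height), positions


def build_grid(image_paths, columns, cell_width, cell_height, out_path):
    """Paste the images onto a white canvas in grid order and save it."""
    # Imported here so grid_layout can be used without Wand installed.
    from wand.color import Color
    from wand.image import Image

    size, positions = grid_layout(
        len(image_paths), columns, cell_width, cell_height
    )
    width, height = size
    with Image(width=width, height=height, background=Color("white")) as canvas:
        for path, (left, top) in zip(image_paths, positions):
            with Image(filename=path) as tile:
                canvas.composite(tile, left=left, top=top)
        canvas.save(filename=out_path)
```

Any cells left over at the end of the last row simply stay white.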

I tried out a few different values for the number of columns, mostly in the 20s to approximate a golden rectangle shape. I ended up liking 22, because of the shape and minimal remainder.
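To compare column counts, a few lines of arithmetic are enough. This sketch (with a hypothetical helper name) counts how many cells stay empty in the last row for each candidate; it says nothing about the overall shape, which still has to be judged by eye.

```python
import math


def empty_cells(count, columns):
    """Number of unused cells in the last row of the grid."""
    rows = math.ceil(count / columns)
    return rows * columns - count


# Print columns, rows, and empty cells for a range of candidates.
for columns in range(20, 30):
    print(columns, math.ceil(127 / columns), empty_cells(127, columns))
```

With 127 images, 22 columns gives 6 rows with 5 empty cells.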

Version 1 of the bus map grid

It works! That’s as far as this post will go. I can see two remaining problems that I’ll want to fix before sending this off to get printed:

  1. Not quite high-res enough. The width of each of these is 252 pixels, which is not quite an inch at 300 dpi (about 0.84 inches). I would like to resample this at a higher resolution, but want to play around with different filters to keep sharpness as high as possible.
  2. The route names don’t quite align perfectly. You can even see it at thumbnail size: check out 56 and 57, for example. This might be kind of complicated to take care of.
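The resolution problem is easy to quantify. A small sketch, assuming 252-pixel-wide maps and the 22-column grid; the helper name is made up for illustration:

```python
def print_size_inches(pixels, dpi=300):
    """Physical size of a pixel dimension when printed at the given dpi."""
    return pixels / dpi


one_map = print_size_inches(252)          # a single map
full_grid = print_size_inches(22 * 252)   # the whole 22-column grid
print(f"{one_map:.2f} in per map, {full_grid:.2f} in for the grid")
```

At 300 dpi the whole grid comes out under 19 inches wide, which is small for a wall poster.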

I will cover both of these in a followup post.

The code I used to generate the poster is also at the repository below (with some minor reorganization). I’ll add any updates I make there, whether or not they are covered by a blog post.

Tristan Crockett

Software Engineer, Center for Data Science and Public Policy at UChicago