Optimizing Travel Time with Google Maps: Part 1

Experiment Design

Marco Sanchez-Ayala
Apr 12, 2020 · 5 min read

My girlfriend and I live in two different NYC boroughs, making travel between our apartments a real pain. Although Uber/Lyft are not too expensive at off-peak times, my method of choice is public transit since I already pay for a monthly unlimited pass.

It usually takes around an hour to get between our apartments. Since I typically only make the trip on weekends, I generally come and go at random times and had never given much thought to which times of day would be quickest for travel. Eager to put the ETL skills I gained through Udacity’s Data Engineering nanodegree program into practice, I set out to study when we can get between our apartments the quickest via public transportation, namely subway and bus.

Hypothesis

I predict the quickest travel times between our apartments will be during rush hour, when train frequency is highest. However, that would be super inconvenient for us: it would never make sense for us to travel to each other’s place at 8 am on a Tuesday, for instance. So I hope this hypothesis is wrong, even though it currently seems the most plausible.

First Thoughts

Initially, in designing this experiment, I thought I would want to access the MTA’s API, since they run NYC’s public transit and would probably have the most accurate bus and train information. However, the trip between our apartments usually requires multiple transfers and a bit of walking. Since I’m interested in capturing all of the transfers as well as the door-to-door time, it makes more sense to use a service like Google Maps, which handles all of this easily. Its API also supports visualizing trips, along with other useful functionality, so I went with it instead.

I quickly found a handy Python driver for the Google Maps API called googlemaps. After a little exploring, I found that all of the information I need for this study can indeed be pulled from Google Maps data. I just have to make a couple of assumptions about what I’m gathering.
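To show what that exploration looks like, here is roughly how a single transit-directions request goes through the googlemaps client. The API key and addresses below are placeholders, not the real ones I use:

```python
from datetime import datetime

import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

# Request public transit directions between two placeholder addresses,
# departing right now.
routes = gmaps.directions(
    "ORIGIN ADDRESS, NYC",
    "DESTINATION ADDRESS, NYC",
    mode="transit",
    departure_time=datetime.now(),
)

# Each route contains legs with an overall duration plus the individual
# walking/transit steps -- exactly the door-to-door detail I care about.
print(routes[0]["legs"][0]["duration"]["text"])
```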

Experimental Method

It makes the most sense to me to set up an experiment that follows this sequence of steps:

  1. Call the Google Maps API for public transit directions between the two locations every 5 minutes. Each response is parsed and saved locally as a JSON file (a minimal sketch of this step follows the list). I can afford to store the data locally for now since there won’t be much of it and I don’t want to pay for online storage. Eventually I could batch upload everything to S3 or Cloud Storage to minimize the number of transactions.
  2. After a week, once all the data is collected, run an ETL script to push the information from the JSON files into a PostgreSQL database, which I plan to host in a Docker container.
  3. Display sample queries from the database in a simple dashboard using Dash by Plotly.
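A minimal sketch of step 1, using the same placeholder client as above; the file layout and function name are illustrative, not the project’s actual code:

```python
import json
from datetime import datetime
from pathlib import Path

import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key
OUT_DIR = Path("data")
OUT_DIR.mkdir(exist_ok=True)


def collect_once(origin: str, destination: str) -> None:
    """Request transit directions and dump the raw response to a timestamped JSON file."""
    now = datetime.now()
    routes = gmaps.directions(origin, destination, mode="transit", departure_time=now)
    out_path = OUT_DIR / f"directions_{now:%Y%m%d_%H%M%S}.json"
    with open(out_path, "w") as f:
        json.dump(routes, f)


if __name__ == "__main__":
    # The 5-minute cadence is handled by an external scheduler (see Next Steps).
    collect_once("ORIGIN ADDRESS", "DESTINATION ADDRESS")
```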

The visualizations in the dashboard should let us answer pretty much any question we have about the data, most importantly: when is it fastest to travel between our apartments?
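As a preview of step 3, a bare-bones Dash app could look something like the sketch below. The trips table, its columns, and the connection details are placeholders for whatever the final schema ends up exposing:

```python
import pandas as pd
import plotly.express as px
import psycopg2
from dash import Dash, dcc, html

# Hypothetical query -- assumes a trips table with departure_time and
# duration_seconds columns (the real schema may differ).
QUERY = """
    SELECT EXTRACT(HOUR FROM departure_time) AS hour,
           AVG(duration_seconds) / 60.0 AS avg_minutes
    FROM trips
    GROUP BY hour
    ORDER BY hour;
"""

conn = psycopg2.connect("host=localhost dbname=travel user=postgres")  # assumed connection
df = pd.read_sql(QUERY, conn)

fig = px.bar(df, x="hour", y="avg_minutes",
             title="Average trip duration by departure hour")

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Apartment-to-apartment travel times"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    app.run_server(debug=True)
```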

Assumptions

  1. The data is accurate. While I believe Google has a great service and gathers the most up-to-date information it can, the NYC subway is unpredictable; you never know when delays will strike. I use Google Maps every single day to get from point A to point B in NYC via public transit, and the trip duration estimates are usually correct to within a minute or two on reliable train lines. For simplicity’s sake, this study assumes that the trip durations are always accurate.
  2. Sufficient “best trips” can be captured with a 5-minute API call interval. There are two motivations for this. First, the trains near me rarely run more frequently than every 3 minutes or so, so missing a few possible trips shouldn’t make a big difference. Second, I have a finite amount of Google Cloud credits, so I chose not to call the API at a smaller interval, such as every minute.

Implementation

I’m developing this project using Python and you can see it on GitHub! Here’s what’s inside.

  1. The first module I wrote was directions.py, which defines the Directions class that handles connecting to the Google Maps API, parsing the response, and exporting it to JSON.
  2. Next, data_collection.py instantiates the Directions object for trips between the two locations and stores the JSON files in a subdirectory.
  3. sql_queries.py contains the DROP, CREATE, and INSERT statements for all of the tables that will be stored in PostgreSQL (a rough sketch of one such table and its load step follows this list).
  4. create_tables.py calls sql_queries.py to make all the tables we’ll need.
  5. etl.py extracts the JSON data and transforms it into the form that is then loaded into PostgreSQL.
  6. app.py will become the Dash app that will display the dashboard.
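For a concrete picture of items 3 and 5, here is a rough sketch of what one table in sql_queries.py and the matching load step in etl.py could look like. The trips schema, column names, and connection string are illustrative assumptions rather than the project’s actual code:

```python
import json
from datetime import datetime
from pathlib import Path

import psycopg2

# sql_queries.py-style constants (hypothetical schema)
trip_table_drop = "DROP TABLE IF EXISTS trips;"
trip_table_create = """
    CREATE TABLE IF NOT EXISTS trips (
        trip_id          SERIAL PRIMARY KEY,
        requested_at     TIMESTAMP NOT NULL,
        departure_time   TIMESTAMP,
        arrival_time     TIMESTAMP,
        duration_seconds INTEGER,
        num_transit_legs INTEGER
    );
"""
trip_insert = """
    INSERT INTO trips (requested_at, departure_time, arrival_time,
                       duration_seconds, num_transit_legs)
    VALUES (%s, %s, %s, %s, %s);
"""

# etl.py-style load: walk the saved JSON files and insert one row per best route
conn = psycopg2.connect("host=localhost dbname=travel user=postgres")  # assumed Dockerized Postgres
cur = conn.cursor()
cur.execute(trip_table_create)

for path in Path("data").glob("directions_*.json"):
    leg = json.loads(path.read_text())[0]["legs"][0]
    requested_at = datetime.strptime(path.stem.split("_", 1)[1], "%Y%m%d_%H%M%S")
    cur.execute(trip_insert, (
        requested_at,
        datetime.fromtimestamp(leg["departure_time"]["value"]),
        datetime.fromtimestamp(leg["arrival_time"]["value"]),
        leg["duration"]["value"],
        sum(step["travel_mode"] == "TRANSIT" for step in leg["steps"]),
    ))

conn.commit()
conn.close()
```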

Next Steps

Many things need to happen!

  • I need to keep developing directions.py, which isn’t in its most modular form yet. It works fine for now, but it will also need tests.
  • I haven’t yet set up scheduling to run data_collection.py every 5 minutes. I’ll likely use cron or another scheduler (a sample crontab entry follows this list).
  • etl.py needs to be rewritten to do what is stated above.
  • app.py needs to be fully implemented.
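For the scheduler, something like the following crontab entry should do it; the interpreter and project paths are placeholders for wherever things actually live:

```
*/5 * * * * /usr/bin/python3 /path/to/project/data_collection.py >> /path/to/project/cron.log 2>&1
```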

Extensions

I’m really excited to finish the above next steps in the coming days! Once they’re done, I’ll be able to extend this project beyond public transit. Google Maps also provides walking, driving, and biking directions, so I could easily collect those as well. With all of that data, someone could perform all sorts of predictive analytics on travel times in the future.

I plan to write some follow-up posts once everything is up and running to show off the product and discuss how it works and its limitations. Let me know if you have any ideas or want to contribute!
