Using data to generate tailored walks

Published in

DataExplorations

7 min readDec 18, 2018

It’s been a long week of hunching over your computer… maybe you’d like to get outside this weekend and go for a nice walk. But where to go? Somehow you always end up following the same routes or wandering through the same neighbourhoods… but wouldn’t be great to explore somewhere new and discover some new sites around the city? If you live in Toronto, then the TorontoWalks custom walk generator may be able to help!

Introduction

TorontoWalks was my capstone project for the Data Science Immersive course at Bitmaker General Assembly here in Toronto. A beta version of this application can be found here: http://toronto-walks.com:8000/

This application takes information on your interests, your desired walk starting point and how long you want to spend and generates a custom walking route just for you!

The main stages involved in generating a custom walk are:

Gathering the data
Preparing the data
Measuring similarity between POIs and user interests
Finding “best matched” stops within reasonable distance of the starting point
Plotting optimal route between stops (Travelling Salesman Problem)

Overview of each stage in generating a walk

Gathering Data

Most of the data was gathered by web-scraping websites with information about Toronto landmarks, including:

ACOToronto.ca
Toronto Plaques
Cabbagetown People
Toronto Public Art (City of Toronto Open Data)

I used BeautifulSoup for most of the web-scraping, but the ACOToronto.ca website had a continuous scroll-to-load-more functionality that required the use of Selenium. In addition, information on public art works was loaded from Toronto Open Data.

Once the data was gathered, I pushed it through a cleanup pipeline to remove duplicates, fix addresses (many of which were incomplete), use GeoCoder to lookup latitude and longitude for each POI and to engineer some new features (i.e. build decade / century and simple POI type (plaque, art or building)).

I then uploaded the data to a PostgreSQL database hosted on a docker image.

Preparing the Data

Each Point of Interest was fed through a DataFrameMapper and Pipeline (containing a CountVectorizer) to create a Vector Space Model with information about the stop type, build century and the words found in the stop name, architectural style and category (a simple bag of words approach). When information is gathered about the user’s interests (from the website), it is fed through the same fitted DataFrameMapper and Pipeline to produce a vector with the same columns.

Measuring Similarity between Points of Interest and User Preferences

So now that we know some things about our Points of Interest (POI) and have gathered some information on what the user is interested in, how do we go about finding the most relevant POIs for that user?

The Walk Generator is somewhat similar to a content-based recommender, but, at this point, I have no access to user ratings of POIs around Toronto. I would like to develop that history over time by saving user generated walks and providing a way for users to like/dislike each stop. But, for now, the application uses a simplified version of content filtering by generating a vector based on the user’s stated preferences and comparing that to each POI vector using Cosine Similarity.

from http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

Find “best matched” stops within reasonable radius of starting point

Now we know which POIs are the best match to the user’s interests, we need to find the best matched stops within a reasonable distance of the walk starting point. The calculation is based on the assumptions that a user could comfortably visit 12 stops in an hour within a radius of 1 KM (1000 meters) from the starting point. This approach worked OK much of the time, but sometimes it produced irrational walks like this one, where stop 1 is complete isolated from all the other stops.

Example walk before Clustering was added

To address this issue, I added an extra step to use a geographically-aware clustering tool on the candidate stops to find a more concentrated subset. I used HDBSCAN, which is a density-based clustering algorithm (meaning that it tries to find areas of the dataset that have a higher density of points than the rest of the dataset.) Most importantly, it supports a variety of distance metrics, including Haversine distance, which properly handles distance between lat/long coordinates (converted to radians). I tuned HDBScan so it produces only 1 output cluster with the desired number of stops (all other stops are labelled as outliers)

clusterer = HDBSCAN(min_cluster_size=int(num_points), metric='haversine',allow_single_cluster=True)
  cluster_labels = clusterer.fit_predict(points)

Adding clustering made a statistically significant reduction in the total distance of the walks. The following screenshot shows an example of clustering applied to a generated walk — in this example, the green dots are the excluded stops, while the blue dots are the clustered stops included in the walk.

Example walk after clustering (blue dots are stops included in the cluster, while green dotes are stops excluded from the cluster and walk)

Plot Optimal Route

Now that we have a set of stops to include in our walk, we need to plot the best route between those stops. This is a type of optimization problem known as the Travelling Salesman Problem (TSP). Here is a good overview of the problem from https://developers.google.com/optimization/routing/tsp:

Back in the days when salesmen traveled door-to-door hawking vacuums and encyclopedias, they had to plan their routes, from house to house or city to city. The shorter the route, the better. Finding the shortest route that visits a set of locations is an exponentially difficult problem: finding the shortest path for 20 cities is much more than twice as hard as 10 cities.
An exhaustive search of all possible paths would be guaranteed to find the shortest, but is computationally intractable for all but small sets of locations. For larger problems, optimization techniques are needed to intelligently search the solution space and find near-optimal solutions.

To solve this routing problem, I tried two approaches

Approach 1: Genetic Algorithm

This project by ZW Miller has a very clear implementation of the Genetic Algorithm, which I borrowed and adapted for TorontoWalks

The high-level idea is that you

generate a set of random guesses (possible order of stops)
for each guess, calculate the travel time between the stops (using geopy.distance.geodesic)
then find the best performing guess and select those to “breed” the next generation of guesses
Combine elements of two “parent” guesses to create a new child guess
include some amount of randomness for genetic diversity
loop a set number of times
track the best guess (lowest travel time)

This generated largely reasonable routes, but was proving to be too slow for an on-demand web application. So I did some more research and came across Google OR Tools

Approach 2: Google OR Tools

OR-Tools is an open source set of tools to solve optimization problems. It uses combinatorial optimization to find the best solution to a problem out of a very large set of possible solutions.

It was easy to implement and very fast at returning routes. Generally all the generated routes have made sense.

Web Site Creation

I packaged the application into a flask application with routes and templates for each html page to be displayed. I used the Google Maps API to drop the points on the map and the Google Maps polyline function to plot the route between the stops (more details on how I did this can be found on the GitHub ReadMe)

Project Deployment

This application is running on 3 Docker Images hosted on DigitalOcean

Database
Web / Flask App
Nginx

I used Docker Compose to get the different images to talk to each other (largely following the template outlined in this blog post)

Taken From : http://www.patricksoftwareblog.com/wp-content/uploads/2017/06/Docker-Application-Architecture-2.png

Application Example

On the main page of the application, users can indicate their preferences. The most important setting is whether they are interested in buildings, historical plaques or public art. Optionally, they can also enter free text in the interests field — this could be an architectural style, historical figure in Toronto history or a category of interest (i.e. sports, explorers etc). Finally, they can choose the desired walk duration and click on the map to set their starting point (or choose “surprise me!” to find the ‘best’ matched walk for their interests)

A custom-designed walk is then displayed, with colored markers for the different stop types (teal for buildings, blue for art and purple for historical plaques.)

Clicking on a stop brings up an InfoWindow with addition details about that stop — what appears varies depending on what information is stored for that stop in the database