Building CaliforniaTrailFinder

A Web-Based Recommender System for Finding Similar Hiking Trails

Published in The Startup, Jan 5, 2021

In this post, I describe how I built a web-based recommender system that finds California hiking trails similar to a trail the user has previously liked. The process included data collection, database management, exploratory data analysis, model development, and web development/deployment. I also describe the tools used at each step.

Overview

California is as varied as it is vast. No other state has as many options for hiking trails as California. As much as I like to go out and experience the great outdoors on a beautiful California hiking trail, I like to share these experiences with my family even more. I’m a father with a 4-year-old daughter. And as any parent can imagine, keeping a child engaged and interested isn’t easy. My hope is that my daughter will enjoy going out on a hike as much as I do. I want her to enjoy nature, get healthy exercise, and explore interesting areas together with my wife and me.

We’ve gone on some trail walks before, ones that I would consider kid friendly. And I’m pretty sure that my daughter enjoyed herself — she stayed engaged and interested! With successful outings like those, how can I keep it up?! I want to be able to find trails that are similar to the kid friendly trails that we have hiked (and liked) before. Also, when thinking back to the days when we were pushing my daughter around in a stroller, I can definitely imagine that I would have been interested in finding additional trails that were similar to a stroller friendly trail that we have pushed (and enjoyed) before.

I built CaliforniaTrailFinder, a hiking trail recommender designed to find trails that are “friendly” for certain users, such as people with kids or dogs, or those who depend on strollers or wheelchairs. There are a number of hiking websites out there. Some list the top/best trails for a chosen location and allow filtering by trail characteristics or tags, with the resulting list ordered by an internal rating system. However, none allow you to generate a list of recommended trails based on a previously liked trail. Additionally, those top/best lists are limited to the chosen location. A further motivation for building CaliforniaTrailFinder was to recommend similar trails in a different region/location from the one where the liked trail is located. Similar trail recommendations in the same region/location can also be provided.

Data Collection and Database Management

The data used for building the recommender system were collected from one of the websites that provides information on the top/best trails by state. I focused on the state where I live, California, which had the greatest number of trails — more than 10,000!

The following trail attributes, a mix of numeric and categorical features, were collected and inserted into a MySQL database:

Trail ID
State
City
Trail Name
Trail Difficulty (Easy, Moderate, Hard)
Stars (Given by Reviewers)
Number of Reviews
Trail Region
Distance (Miles)
Elevation Gain (Feet)
Duration (Minutes)
Route Type (Loop, Out & Back, Point to Point)
Trail Tags (56 tags)

After gathering the trail information into the database, I read the data into a Python Pandas DataFrame to begin data processing/cleaning and exploratory data analysis. Initial examination showed that only a few trail features had missing values: City (n=3) and Duration (n=409; approximately 4% missing).
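The missing-value check described above can be sketched in Pandas as follows. This is a minimal illustration on a toy table; the column names and values are assumptions, not the actual schema or data.

```python
import pandas as pd

# Toy stand-in for the trail table; column names and values are illustrative only.
trails = pd.DataFrame({
    "trail_id": [1, 2, 3, 4],
    "city": ["Oakland", None, "Fremont", "Berkeley"],
    "distance_miles": [2.1, 5.0, 0.5, 8.3],
    "duration_minutes": [60.0, None, None, 240.0],
})

# Count missing values per column and express them as a share of all rows.
missing = trails.isna().sum()
missing_pct = (missing / len(trails) * 100).round(1)

# Show only the columns that actually have missing values.
summary = pd.DataFrame({"n_missing": missing, "pct_missing": missing_pct})
print(summary[summary["n_missing"] > 0])
```

On the real data, a summary like this would surface City and Duration as the only features with gaps.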

I performed some data cleaning such as ensuring that only California trails were included and also populated missing city names when possible. Additionally, I performed measurement conversions for both Distance (Miles) and Elevation Gain (Feet) metrics when measurement inconsistencies were found. Missing Duration (Minutes) turned out to be a systematic issue within the website and could not be cleaned. Finally, Trail Tags were converted to dummy variables to be used in the analysis.
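As a rough sketch of the dummy-variable step, the snippet below turns a tag field into one 0/1 column per tag. It assumes the tags for each trail are stored as a single comma-separated string in a hypothetical trail_tags column; the actual storage format may differ.

```python
import pandas as pd

# Hypothetical layout: one comma-separated string of tags per trail.
trails = pd.DataFrame({
    "trail_id": [1, 2, 3],
    "trail_tags": ["KidFriendly,Paved", "DogFriendly", "KidFriendly,Views,Paved"],
})

# Split on commas and build one indicator column (0 = No, 1 = Yes) per tag.
tag_dummies = trails["trail_tags"].str.get_dummies(sep=",")

# Replace the raw tag string with the indicator columns.
trails_wide = pd.concat([trails.drop(columns="trail_tags"), tag_dummies], axis=1)
print(trails_wide)
```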

To provide trail recommendations by region or location, I needed a location measurement that would be useful to the user. I thought that county would be a good compromise, avoiding both too few and too many recommendations. However, the data I collected did not include county. But I did have City, so I could create a crosswalk between California cities and counties. I found a website called SimpleMaps.com which offers a United States Cities Database with city and county names as fields. I read a CSV file of the database into Pandas and successfully mapped 78% of the unique cities in the trails data that I collected. The remaining cities were not Census-recognized cities/towns (they were unincorporated populated places). So, I Google searched the remaining 22% of cities in order to find the county.
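The city-to-county crosswalk can be sketched as a left merge that flags unmatched cities for manual lookup. The column names, the place name "Example Place", and the county "Example County" are placeholders for illustration, not values from the actual data or from the SimpleMaps file.

```python
import pandas as pd

trails = pd.DataFrame({
    "trail_id": [1, 2, 3],
    "city": ["Oakland", "Pasadena", "Example Place"],  # last one is a placeholder
})

# Stand-in for the city database CSV (hypothetical column names).
city_county = pd.DataFrame({
    "city": ["Oakland", "Pasadena"],
    "county_name": ["Alameda", "Los Angeles"],
})

# Left merge keeps every trail; indicator=True records which rows found a match.
merged = trails.merge(city_county, on="city", how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
unmatched = merged.loc[merged["_merge"] == "left_only", "city"].unique()

# Counties found by hand for the unmatched places (placeholder value).
manual = {"Example Place": "Example County"}
merged["county_name"] = merged["county_name"].fillna(merged["city"].map(manual))
print(merged[["trail_id", "city", "county_name"]])
```

Keeping the list of unmatched cities explicit makes it easy to see exactly which places still need a manual lookup.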

After performing the data cleaning and joining the county information, I created a final trail information table in the MySQL database which included 10,144 trails and 69 trail features.

Exploratory Data Analysis

Exploring the data revealed that 31% of the California trails were rated as Easy, 50% were Moderate, and 19% were Hard.

Star ratings given by reviewers ranged from 0 to 5 stars in 0.5 increments, with 55% of trails having 4.5 stars and 27% having 4.0 stars.

The Number of Reviews ranged from 0 to over 5,000, with 22% of trails having between 1 and 10 reviews.

Trail Route Type consisted of 48% that were a Loop, 46% that were Out & Back, and 6% that were Point to Point.

The distributions of Distance and Elevation Gain were both heavily right skewed. The average trail Distance was 8.1 miles with a median of 5.1 miles. There were 96 trails whose Distance was greater than 50 miles.

The average trail Elevation Gain was 1,514.6 feet with a median of 862 feet. There were 85 trails whose Elevation Gain was greater than 10,000 feet.
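A mean well above the median, as seen for both Distance and Elevation Gain, is the usual sign of a right-skewed distribution: a few very large values pull the mean up while the median barely moves. A small sketch with made-up numbers (not the trail data) shows the effect:

```python
import statistics

# Made-up distances in miles: mostly short trails plus one very long one.
distances = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 60.0]

mean_distance = statistics.mean(distances)
median_distance = statistics.median(distances)

# The single long trail drags the mean far above the median.
print(f"mean={mean_distance:.1f}, median={median_distance:.1f}")
```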

Elevation Gain was plotted against Distance, with each data point marked by its Trail Difficulty for comparison. While the data points clearly cluster by Trail Difficulty, there is also some overlap between the difficulty levels.

Model Development

I chose to implement the model using the Turi Create recommender system toolkit, an Apple open-source machine learning library in Python.

Using the trail information data extracted from the source website, I chose a content-based recommender model to provide personalized recommendations to users. Specifically, I used an Item Content Recommender because the data collected described each item (i.e., trail), not the users. In other words, the similarity between items is determined by the content of those items rather than learned from user interaction patterns. A similarity score between any two items can then be calculated.

The similarity metric that was used in the item content recommender was cosine similarity which measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between vectors.
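To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. It is for illustration only and is not the Turi Create implementation; the example vectors are arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, regardless of length.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# Vectors at 90 degrees to each other score 0.
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])

print(same_direction, orthogonal)
```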

To ensure that the recommender model provides, for each specified trail, a similarity score for all other trails, I set ‘max_item_neighborhood_size’ equal to the total number of trail items.

The trail features that were included in the recommender model were the following:

Distance (Miles)
Elevation Gain (Feet)
Trail Difficulty_Easy
Trail Difficulty_Moderate
Trail Difficulty_Hard
Route Type_Loop
Route Type_Out & Back
Route Type_Point to Point
Trail Tags (0 for “No”, 1 for “Yes”): Backpacking, Beach, BirdWatching, Blowdown, Bugs, Camping, Cave, CityWalk, Closed, DogFriendly, DogsOnLeash, Fee, Fishing, Forest, Hiking, HistoricSite, HorsebackRiding, KidFriendly, Lake, MountainBiking, Muddy, NatureTrips, NoDogs, NoShade, OffTrail, OhvOffRoadDriving, OverGrown, PartiallyPaved, Paved, PrivateProperty, River, RoadBiking, RockClimbing, Rocky, Running, ScenicDriving, Scramble, Snow, Snowshoeing, StrollerFriendly, Views, Walking, WashedOut, Waterfall, WheelchairFriendly, WildFlowers, Wildlife

Using the recommender modeling results, I then iterated through each of the California trails to recommend the ‘k’ highest scored items for each trail. I set ‘k’, the number of recommendations to generate, equal to the total number of trail items for the same reason described above. The recommendation information for each trail was then inserted into a table in the MySQL database so that it can easily be accessed.

As an example, the following shows the Turi Create recommendation output for a single trail_id where k=10. There are 3 columns returned with the trail_id, score, and rank. Below, I will show an example that takes this information and incorporates it into the recommender.

+----------+--------------------+------+
| trail_id |       score        | rank |
+----------+--------------------+------+
| 10240053 | 0.9630441665649414 |  1   |
| 10339008 | 0.9614306092262268 |  2   |
| 10018798 | 0.9604242444038391 |  3   |
| 10337040 | 0.9596757292747498 |  4   |
| 10004470 | 0.9526501893997192 |  5   |
| 10318075 | 0.9453310966491699 |  6   |
| 10456397 | 0.9452914595603943 |  7   |
| 10004242 | 0.9452039003372192 |  8   |
| 10740648 | 0.9451684355735779 |  9   |
| 10683024 | 0.9451571702957153 |  10  |
+----------+--------------------+------+
[10 rows x 3 columns]
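The per-trail ranking step can be approximated without Turi Create. The sketch below is a plain-Python stand-in, not the library's code: it scores every other trail against a chosen trail with cosine similarity and returns rows shaped like the output above (trail_id, score, rank). The trail IDs and feature vectors are made up.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_similar(features, trail_id, k):
    """Return up to k (trail_id, score, rank) rows, most similar first."""
    target = features[trail_id]
    scored = [
        (other_id, cosine_similarity(target, vec))
        for other_id, vec in features.items()
        if other_id != trail_id  # a trail is not recommended to itself
    ]
    scored.sort(key=lambda row: row[1], reverse=True)
    return [
        (other_id, score, rank)
        for rank, (other_id, score) in enumerate(scored[:k], start=1)
    ]

# Made-up feature vectors: [distance, easy, loop, kid_friendly]
features = {
    101: [2.0, 1, 0, 1],
    102: [2.2, 1, 0, 1],
    103: [9.0, 0, 1, 0],
    104: [2.0, 1, 1, 0],
}

# Setting k to the number of trails keeps a score for every other trail.
recommendations = {
    tid: rank_similar(features, tid, k=len(features)) for tid in features
}
for row in recommendations[101]:
    print(row)
```

Storing the full ranking for every trail up front means the web app only has to look up and filter rows at request time rather than score trails on the fly.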

Web Development/Deployment

To develop the trail recommender web app, I chose Flask, a Python web framework. The next question was how to deploy the web app. Among the available options, I chose Amazon Web Services (AWS) because I had prior experience with the platform. One of the AWS services I used was Amazon Elastic Compute Cloud (EC2), which keeps the web app running continuously and reachable over the internet. On EC2, I chose an Ubuntu server on a t2.micro instance and set up the Flask application with Gunicorn and Nginx. The web application also uses HTML and CSS files. Finally, I connected my MySQL database to the server to pull and return the trail information from the tables.

I implemented my web app using a series of drop down lists which reflected choices by the user.

Similar to the couple of use cases that I described above, I decided to focus on recommending trails that are “friendly” for certain users. I categorized selections into Kid Friendly, Dog Friendly, Stroller Friendly, and Wheelchair Friendly using tags from the data.

The user would then navigate through drop downs to find the county and then the name of the trail that they have enjoyed in the past.

Next, the user can choose whether they want to find the similar trails in the same county or in a different county.

The recommender then gives the user a ranked list of similar California trails, with trail details shown alongside those of the trail they liked before! The ranking orders the Similar Trails within the selected county by their calculated similarity score to the Selected Trail.
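The selection logic behind the drop downs can be sketched as a filter over the stored rankings: keep only trails that carry the chosen “friendly” tag and sit in the wanted county, then re-rank what is left. This is a hypothetical illustration; the function name, data layout, and trail names are assumptions rather than the app's actual code.

```python
def similar_trails(selected_id, ranked, trail_info, tag,
                   same_county=True, target_county=None, top_n=10):
    """Filter a trail's stored ranking by tag and county, then re-rank."""
    if same_county:
        county = trail_info[selected_id]["county"]
    else:
        county = target_county
    results = []
    # ranked holds (trail_id, score, rank) rows; walk them in rank order.
    for trail_id, score, _ in sorted(ranked, key=lambda row: row[2]):
        info = trail_info[trail_id]
        if trail_id == selected_id:
            continue
        if info["county"] == county and tag in info["tags"]:
            results.append({
                "rank": len(results) + 1,
                "trail_id": trail_id,
                "name": info["name"],
                "score": score,
            })
        if len(results) == top_n:
            break
    return results

# Made-up trail details and a made-up stored ranking for trail 1.
trail_info = {
    1: {"name": "Trail A", "county": "Alameda", "tags": {"KidFriendly"}},
    2: {"name": "Trail B", "county": "Alameda", "tags": {"KidFriendly", "Paved"}},
    3: {"name": "Trail C", "county": "Los Angeles", "tags": {"KidFriendly"}},
    4: {"name": "Trail D", "county": "Los Angeles", "tags": {"DogFriendly"}},
    5: {"name": "Trail E", "county": "Los Angeles", "tags": {"KidFriendly"}},
}
ranked_for_1 = [(4, 0.99, 1), (3, 0.95, 2), (2, 0.90, 3), (5, 0.80, 4)]

same = similar_trails(1, ranked_for_1, trail_info, "KidFriendly")
other = similar_trails(1, ranked_for_1, trail_info, "KidFriendly",
                       same_county=False, target_county="Los Angeles")
print(same)
print(other)
```

Note that the rank shown to the user is recomputed after filtering, so it reflects the order within the selected county rather than across all of California.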

Let’s Take a Look at a Few Examples

I want to find a stroller friendly trail to go on a hike with my family. If I’ve been on a trail that was suitable for strollers and that I liked, then we can use CaliforniaTrailFinder to recommend similar ones. For example, I liked “Alamo Creek Park Trail” in “Alameda” county — an easy, out & back trail that is 2.1 miles long with a 65 foot elevation gain. First, I will select “Stroller Friendly” in the initial dropdown of preferences. Then, since I know where that trail was located, I will select the county in the next dropdown. After that, I will select the name of the trail that I have enjoyed in the past. The trail details of the selected trail will then populate.

Finally, I must decide: do I want to find similar trails in the same county or in a different county? I think I would like to hike nearby, so I choose the same county, “Alameda”. Now the recommender system has all the information it needs to find similar trails, and the next table shows the highest ranked trails in “Alameda” county!

Let’s take a look at the recommended trails to get an idea of the quality of the recommendations. Of the top 10 similar trails, all but one were “easy” trails and all but one were “Out & Back” trails. Distances and Elevation Gains were also relatively close compared to the selected “Alamo Creek Park Trail”.

Ok, my daughter’s out of a stroller now so let’s find some “Kid Friendly” trails. Suppose I liked “Big Bear Trail” in “Alameda” county — an easy, loop trail with a short 0.5 mile distance with 160 foot elevation gain. But this time, I want to take a road trip to “Los Angeles” county and find similar trails there. Let’s take a look at the recommended similar trails in “Los Angeles” county. All of the top 10 recommended trails were easy, loop trails. Distances were short too and the elevation gains were also relatively close compared to “Big Bear Trail”.

Final Thoughts

Recommendations are the basis for so many decisions in our daily lives. Whether it be food, restaurants, movies, or retail products, the list goes on and on. It’s also important to have high confidence that what is recommended to us is something we will actually like or enjoy. I built CaliforniaTrailFinder to find user-friendly trails and address these issues. I hope you can find some that will be suitable for you!

If I had more time to improve CaliforniaTrailFinder, some additional data could be incorporated into the build that would have potentially been helpful. User ratings of trails could have been included as part of the data collection process. Having user ratings would have allowed for exploring additional types of recommender models such as an Item Similarity Recommender. Additionally, weather data points could have been collected for each trail and included as additional model features.

The web app can be found live at http://CaliforniaTrailFinder.com/.

Project GitHub: https://github.com/tweichle/California-Trail-Finder