Love hiking? Move to this city…

Photo Copyright Jacob Paul.

Living in Colorado for most of my life and attending college in Boulder has turned me into someone who appreciates living in a place where it’s easy to head to a local trail for a short hike, or to mix it up and tackle something a bit more difficult and rewarding. I’m also nearing the end of my time at the University of Colorado. So, if I want to stay close to some good hiking, should I stay put or head somewhere else?

Boulder is known for its great access to hiking and biking trails, and it might be hard to beat, but let’s see what we can find.

Introduction

First, we’ll be using a Python Jupyter Notebook and the Hiking Project API to answer our questions. The Hiking Project is a website that is similar to AllTrails in that it is a site where users can go review trails, post their experiences, and rate a trail’s difficulty. To get an API access key, you need to sign up for an account, which is free. Access to the API is also free and generally unrestricted. To start, we’ll be using the following libraries to do our analysis:

import json
import requests
import pandas as pd
import numpy as np
import seaborn as sb
import geocoder
import time
from scipy import stats
import matplotlib.pyplot as plt

I pulled a list of top hiking cities from a National Geographic article and put them into a list. (Denver and Anchorage aren’t on the list, but in the opinion of myself and some classmates, they deserve to be included.)

cities = ['Salt Lake City, UT', 'San Francisco, CA', 'Portland, OR', 'Las Vegas, NV', 'Seattle, WA',
          'Phoenix, AZ', 'Washington, D.C.', 'Philadelphia, PA', 'New York, NY', 'Austin, TX',
          'Chicago, IL', 'Miami, FL', 'Boston, MA', 'Los Angeles, CA', 'Milwaukee, WI',
          'Denver, CO', 'Anchorage, AK']

Generally, a request to the API will be formed like this:

payload = {'lat': 40.0274, 'lon': -105.2519, 'key': private_key, 'maxResults': 50, 'maxDistance': 50}
json_response = requests.get(url = 'https://www.hikingproject.com/data/get-trails', params=payload).json()

The above code requests the 50 top-rated trails within 50 miles of the given lat/long coordinates, which happen to be in Boulder. (Here, private_key holds the API key from your account.)
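Before looping over every city, it’s worth a quick look at the shape of what comes back. Here’s a minimal sketch using a hand-built response in the same shape as the API’s (the field names are drawn from the column list later in this post; a real call needs a network connection and your key):

```python
# A hand-built response in the shape the get-trails endpoint returns.
# (Real responses carry many more fields per trail; see the column list below.)
sample_response = {
    "trails": [
        {"name": "Royal Arch", "length": 3.4, "stars": 4.6, "difficulty": "blueBlack"},
        {"name": "Mount Sanitas", "length": 3.1, "stars": 4.4, "difficulty": "blue"},
    ],
}

# Each trail is a plain dict, so iterating is straightforward
for trail in sample_response["trails"]:
    print(f"{trail['name']}: {trail['length']} mi, {trail['stars']} stars")
```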

A bit of geocoding

To get a list of trails for each city, we’ll loop through our list of cities and make a request to the API for each. However, in order to feed the cities into the API, they need to be lat/long coordinate pairs. To get the coordinates for each city, we can use the geocoder library:

cities_lat_long = {}
for city in cities:
    g = geocoder.arcgis(city)  # geocode the city name with the ArcGIS provider
    cities_lat_long[city] = g.latlng  # [latitude, longitude]

These are the columns of information that are returned by the API and will be of use to us in the next step:

cols = ['ascent', 'conditionDate', 'conditionDetails', 'conditionStatus', 'descent', 'difficulty', 'high', 'id', 'imgMedium', 'imgSmall', 'imgSmallMed', 'imgSqSmall', 'latitude', 'length', 'location', 'longitude', 'low', 'name', 'starVotes', 'stars', 'summary', 'type', 'url','city']

Creating the dataset

Now, let’s create a pandas DataFrame to hold our data and populate it with the data the API gives us:

main_df = pd.DataFrame(columns=cols)  # create the DataFrame
for city in cities_lat_long:
    time.sleep(2)  # wait 2 seconds between requests
    # get the data
    payload = {'lat': cities_lat_long[city][0], 'lon': cities_lat_long[city][1],
               'key': private_key, 'maxResults': 500, 'maxDistance': 50}
    json_response = requests.get(url='https://www.hikingproject.com/data/get-trails', params=payload).json()
    # append that data to our main DataFrame
    df = pd.DataFrame.from_dict(json_response['trails'])
    df['city'] = city
    main_df = pd.concat([main_df, df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

Perfect. Now that we have the data pulled down, we should have a DataFrame roughly 6,200 rows long (as of this writing). This is the DataFrame we’ll use for the rest of our analysis.

Analysis

One of the most useful features in this dataset is the average star review (out of 5 stars) that is provided with each trail. Using this metric, we can get a quick look at the average trail rating in each city. This should give us an overall idea of where the hiking is best out of the cities in our list. To utilize this feature, let’s create a chart that displays this:

sb.barplot(data=main_df, x='city', y='stars')
plt.xticks(rotation=90)

From the looks of it, San Francisco, Los Angeles, Denver, and Washington D.C. take the cake for the best hiking. It’s questionable whether t-tests are the right tool for star-rating data, but running tests like the one below quickly shows that none of these city averages are statistically significantly different from one another.

stats.ttest_ind(main_df[main_df['city'] == 'Denver, CO']['stars'],main_df[main_df['city'] == 'Phoenix, AZ']['stars'])
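To illustrate how to read that result, here’s a toy example with hypothetical ratings (not real trail data): two samples drawn from the same distribution, where the large p-value tells us we can’t call the means different.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two hypothetical sets of star ratings with the same underlying mean,
# clipped to the 0-5 star range
city_a = rng.normal(loc=4.0, scale=0.5, size=200).clip(0, 5)
city_b = rng.normal(loc=4.0, scale=0.5, size=200).clip(0, 5)

t_stat, p_value = stats.ttest_ind(city_a, city_b)
# A p-value above the usual 0.05 threshold means the difference in
# means is not statistically significant
print(t_stat, p_value)
```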

But people hike for different reasons and look for varying levels of difficulty. Chances are, not everyone wants to summit a 14er, and not everyone wants an easy stroll in the park. Luckily, our data also includes features such as elevation gain and difficulty ratings. To dive deeper, we can filter our data by difficulty rating and then find how many miles of trail in a given city carry that rating. After filtering the data, we perform a group-by aggregation on city. When we aggregate, we give the ‘length’ column a sum aggregation function, which adds up the mileage of trails in each city. We can then plot this data:

#filter the data by easy and easy-moderate trails
easy_df = main_df[(main_df['difficulty'] == 'green') | (main_df['difficulty'] == 'greenBlue')]
#perform the groupby aggregation
easy_df_city_gb = easy_df.groupby('city').agg({'length':'sum'}).reset_index()
#plot the data
sb.barplot(data=easy_df_city_gb,x='city',y='length')
plt.xticks(rotation=90)

In this case, it looks like Chicago and Washington D.C. are the places to be for easy trails, with over 1000 miles of easy trails available. Similarly, we can plot which cities have the most moderately difficult trails:
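That plot uses the same filter-then-aggregate pattern. Here’s a sketch on a toy stand-in for main_df, assuming the harder trails use the ‘blue’, ‘blueBlack’, and ‘black’ difficulty codes (an extrapolation from the ‘green’/‘greenBlue’ values used above):

```python
import pandas as pd

# Toy stand-in for main_df; the real frame comes from the API loop above
demo_df = pd.DataFrame({
    'city': ['Denver, CO', 'Denver, CO', 'Los Angeles, CA', 'Chicago, IL'],
    'difficulty': ['blueBlack', 'black', 'blue', 'green'],
    'length': [7.2, 9.5, 5.1, 2.0],
})

# Keep only the harder ratings, then sum trail mileage per city
hard_df = demo_df[demo_df['difficulty'].isin(['blue', 'blueBlack', 'black'])]
hard_by_city = hard_df.groupby('city').agg({'length': 'sum'}).reset_index()
print(hard_by_city)
# Plotting mirrors the earlier snippet:
# sb.barplot(data=hard_by_city, x='city', y='length'); plt.xticks(rotation=90)
```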

Here, Denver and LA take the top spots for more difficult hiking. Meanwhile, some hikers may be looking to get up to higher altitudes, while others would rather keep things relatively flat:

# In order to plot this data, we have to convert the elevations to floats
main_df['high'] = main_df['high'].astype(float)
sb.boxplot(data=main_df, x='city', y='high')
plt.xticks(rotation=90)

If you’re looking for the largest range of trails, Los Angeles or Las Vegas will give you the most variety of elevation. On the other hand, Denver and Salt Lake City give you the most opportunity to get some high elevation hiking under your belt.
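One way to quantify that variety is the spread between each city’s highest and lowest trail points. A sketch on a toy frame, using the ‘high’ and ‘low’ columns from the column list above (elevations here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for main_df with hypothetical elevations in feet
demo_df = pd.DataFrame({
    'city': ['Denver, CO', 'Denver, CO', 'Las Vegas, NV', 'Las Vegas, NV'],
    'high': [14265.0, 8500.0, 11916.0, 2200.0],
    'low':  [9000.0, 7200.0, 2000.0, 1800.0],
})

# Elevation spread per city: highest trail point minus lowest trail point
spread = demo_df.groupby('city').agg(high=('high', 'max'), low=('low', 'min'))
spread['range'] = spread['high'] - spread['low']
print(spread['range'].sort_values(ascending=False))
```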

Conclusion

Of course, there is no single city that is the best place to move if all you want is great hiking; each city offers its own unique set of trails and experiences. However, this dataset is rich enough to help you narrow down where to live if trails top your priority list. My next steps for this analysis will be to dive deeper into the other features available in the dataset and to look into the actual reviews and comments left on these trails’ pages. Analyzing those comments could provide some insight into the actual character of trails in a given area.
