Where the pets are: using KMeans to cluster Toronto Neighbourhoods by Pet Ownership

A Gordon
Published in DataExplorations
6 min read · Oct 16, 2018

Walk by any park in Toronto and you may be forgiven for thinking the city has almost as many dogs as people. This inspired me to take a look at pet ownership in Toronto for a recent project. For this project, we had to use the Foursquare API to solve a problem, and I wondered: where in Toronto would be a good place to open a pet services store? For this analysis, I assumed that good target areas would have a large population, a large number of pets registered in the area and very few existing pet services/shops.

The full source code and supporting reports for this project can be found on GitHub; I’ll just walk through the highlights here.

Data Sources

As a proxy for the number of pets in an area, I used the Toronto Open Data Licensed Dogs and Cats Reports for 2013 and 2017. Since this data is organized by FSA (Forward Sortation Area, the first three characters of a Canadian postal code), I chose to do my analysis at that level. I was able to get population data for each FSA from the Canada 2016 Census.

To make the FSAs more meaningful, I wanted to assign neighbourhood names and used Beautiful Soup to scrape a Wikipedia page containing this information. I condensed the information so each FSA was associated with one or more neighbourhoods, resulting in this data frame:
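The condensing step can be sketched with a pandas groupby. This is only an illustration and not the notebook's code: the column names are assumptions, and the sample rows reuse the FSA and neighbourhood pairings mentioned later in this post rather than the scraped table.

```python
import pandas as pd

# Illustrative rows only: one row per (FSA, neighbourhood) pair,
# as you might get after scraping a postal-code table.
rows = pd.DataFrame({
    "FSA": ["M4E", "M6R", "M6R", "M8W", "M8W"],
    "Neighbourhood": ["The Beaches", "Parkdale", "Roncesvalles",
                      "Alderwood", "Long Branch"],
})

# Collapse to one row per FSA, joining its neighbourhood names with commas.
condensed = (
    rows.groupby("FSA", sort=True)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)
print(condensed)
```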

The next step was to use the geocoder library (with its Bing provider) to retrieve the latitude and longitude for each FSA.

import geocoder

def get_lats_longs(df):
    '''
    This function uses geocoder to retrieve the latitude and longitude for neighbourhoods.
    Inputs:
        df: dataframe to loop through (postal code in the first column).
            Adds new Latitude and Longitude columns to this dataframe
    '''
    # create lists to store our new lats and longs
    lats = []
    longs = []
    # loop through our dataframe and look up the lat/long of each postal code
    for index, row in df.iterrows():
        postal_code = row.iloc[0]
        # loop until you get the coordinates (retries indefinitely if the lookup keeps failing)
        lat_lng_coords = None
        while lat_lng_coords is None:
            # bing_key is your Bing Maps API key, defined elsewhere
            g = geocoder.bing('{}, Toronto, Ontario'.format(postal_code), key=bing_key)
            lat_lng_coords = g.latlng
        lats.append(lat_lng_coords[0])
        longs.append(lat_lng_coords[1])
    df['Latitude'] = lats
    df['Longitude'] = longs
    return df

Finally, I used the Foursquare API to retrieve the existing pet stores and services in each area. With Foursquare, you can pass in latitude/longitude coordinates, tell it how far to search around those coordinates (radius) and optionally specify category IDs to target your search (e.g. restaurants or, in this case, pet stores). Foursquare will then return a list of the venues that match your criteria.

https://api.foursquare.com/v2/venues/explore?client_id=<id>&client_secret=<secret>&v=X&ll=43.642960,-79.371613&radius=400&limit=100&categoryId=5032897c91d4c4b30a586d69,4bf58dd8d48988d100951735
# categoryIds:
# 5032897c91d4c4b30a586d69 = pet services
# 4bf58dd8d48988d100951735 = pet stores
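As a rough sketch, a URL like the one above can be assembled from its parts with the standard library. The helper name `build_explore_url` and the placeholder credentials and version string are assumptions for illustration; only the endpoint, parameter names and category IDs come from the example URL.

```python
from urllib.parse import urlencode

# Category IDs quoted in the example URL above.
PET_SERVICES = "5032897c91d4c4b30a586d69"
PET_STORES = "4bf58dd8d48988d100951735"


def build_explore_url(lat, lng, radius, client_id, client_secret, version,
                      category_ids=(PET_SERVICES, PET_STORES), limit=100):
    """Hypothetical helper: assemble an explore URL for one coordinate pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{:.6f},{:.6f}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "categoryId": ",".join(category_ids),
    }
    # safe="," keeps the commas in ll and categoryId unescaped
    return ("https://api.foursquare.com/v2/venues/explore?"
            + urlencode(params, safe=","))


# Placeholder credentials and version; substitute your own values.
url = build_explore_url(43.642960, -79.371613, 400, "MY_ID", "MY_SECRET", "X")
print(url)
```

The resulting URL can then be fetched with any HTTP client, once per FSA centre.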

Data Clean up

  • Minimal data clean up was necessary on these datasets, beyond removing empty CSV rows and converting the cat and dog license totals from string to float.
  • However, one outlier did have to be removed: FSA M5W has a population of only 15 people. This is significantly lower than the population of any other FSA (the next lowest had a population of 2000), so I dropped this row to avoid skewing the results.
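These two clean-up steps might look roughly like this in pandas. The column names, the placeholder FSAs and their numbers, the comma-formatted strings and the population threshold are all assumptions for illustration; only M5W's population of 15 comes from the data described above.

```python
import pandas as pd

# Illustrative data: FSA_A and FSA_B are placeholders with invented values.
df = pd.DataFrame({
    "FSA": ["FSA_A", "M5W", "FSA_B"],
    "Population": [46000, 15, 2000],
    "CAT": ["1,200", "0", "950"],
    "DOG": ["2,100", "1", "1,800"],
})

# License totals arrive as strings; strip thousands separators
# (an assumption about the raw format) and cast to float.
for col in ["CAT", "DOG"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(float)

# Drop FSAs with an implausibly small population (threshold is an assumption).
MIN_POPULATION = 1000
cleaned = df[df["Population"] >= MIN_POPULATION].reset_index(drop=True)
print(cleaned)
```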

Exploratory Data Analysis

During the EDA, I combined the various datasets into one dataframe and calculated a few additional fields, such as Per Capita Pets (pets divided by population). All the steps I went through are in the notebook on GitHub, but this is the final dataframe on which I did most of my analysis:
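The per capita calculation is a merge followed by a division. A minimal sketch, with assumed column names and invented values for two placeholder FSAs:

```python
import pandas as pd

# Invented values for two placeholder FSAs.
licenses = pd.DataFrame({
    "FSA": ["FSA_A", "FSA_B"],
    "Total_Pets_2017": [1000.0, 1200.0],
})
population = pd.DataFrame({
    "FSA": ["FSA_A", "FSA_B"],
    "Population": [25000, 20000],
})

# Join the two datasets on FSA, then divide pets by population.
combined = licenses.merge(population, on="FSA", how="inner")
combined["Per_Capita_Pets"] = combined["Total_Pets_2017"] / combined["Population"]
print(combined)
```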

Here are some plots highlighting the results of the EDA:

Total Pet Licenses Issued per FSA in 2017

  • Largest number of licenses were issued to FSAs outside downtown core: East York (M4C) and Etobicoke (M8V) had the highest numbers
  • Lowest numbers of licenses were issued to FSAs in the downtown core: M5C, M5H and M5W had the lowest numbers
Darker red areas have the largest number of pet licenses issued in 2017

Changes in Pet Licensing over the past 5 years (2013–2017)

  • This plot looks at the difference in the number of licenses issued between 2013 and 2017 (the 2017 total minus the 2013 total) to isolate the areas experiencing the largest growth in pet licensing
  • Downtown Toronto (M5A and M5V) experienced the largest pet growth
  • Scarborough and Etobicoke (M1C and M9B) experienced the biggest decrease in pet licensing
Darker red areas showed the biggest increase in pet licenses issued between 2013 and 2017

Per Capita Pet Licensing

  • The highest per capita licensing of pets is in The Beaches (M4E), Parkdale/ Roncesvalles (M6R) and Alderwood/Long Branch(M8W)
Darker red areas have the largest ratio of pets licensed in 2017 to population (i.e. per capita pets)

Existing Pet Services/Stores

  • Using the foursquare information, this plot shows the estimated number of pet services/stores in each FSA (within 1000m of the FSA center)
Darker red areas have the largest number of existing pet services/stores (according to FourSquare)

Cluster Analysis

In order to help identify areas of the city that are good candidates for new pet stores, I used the KMeans algorithm to attempt to cluster FSA areas based on three factors:

  • Population
  • Total Pets registered in 2017
  • Number of Existing Pet store/services

Since population, total pets and number of venues are all on very different scales, I used StandardScaler from scikit-learn to rescale each feature to standard deviations from its mean.
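To show what this scaling does, here is a small sketch on invented numbers (the three columns stand in for population, total pets and venue count). After the transform, each column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation in KMeans.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: population, total pets in 2017, existing venues (invented values).
X = np.array([
    [46000.0, 3300.0, 2.0],
    [38000.0, 2750.0, 5.0],
    [20000.0, 1200.0, 9.0],
    [12000.0,  400.0, 1.0],
])

# Subtract each column's mean and divide by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```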

To find the appropriate number of clusters, I used the visual Elbow method to find the point at which the score levels off, meaning that adding more clusters doesn’t dramatically reduce the error any further. I found that 7 clusters looked to be correct for this data.
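A minimal sketch of the Elbow method follows. It runs on synthetic data rather than the real FSA table, and it assumes the "score" is KMeans inertia (the within-cluster sum of squared distances); the notebook may use a different measure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for (population, pets, venues): three loose groups.
centers = np.array([[20000, 800, 1],
                    [35000, 2000, 4],
                    [50000, 3200, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=[1500, 150, 1], size=(30, 3))
               for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for a range of k and record the inertia for each.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))

# Once k is chosen (7 in this post), fit the final model and keep the labels.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X_scaled)
```

Plotting the inertia against k and looking for the bend in the curve gives the visual elbow.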

This map gives a good idea of where the clusters fall geographically

The following scatter-plot shows the characteristics of each cluster in terms of Population, number of existing venues and total pet licenses issued in 2017

The interactive version of this chart can be found here: https://ag2816.github.io/Toronto_Pet_Clusters_interactive.html

Interpretation

Examining this chart, Clusters 1 and 4 appear to be good candidates for new pet stores: they have low numbers of existing services, medium-high population and a medium-high number of licensed pets.
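The same reading can be approximated from a table by averaging each feature per cluster and comparing against the overall averages. The sketch below uses an entirely invented clustered table (placeholder FSAs, values and cluster assignments), and the "below average venues, above average population and pets" rule is just one possible way to encode the criteria above.

```python
import pandas as pd

# Invented clustered table for illustration only.
clustered = pd.DataFrame({
    "FSA": ["A", "B", "C", "D", "E", "F"],
    "Cluster": [1, 1, 4, 4, 2, 2],
    "Population": [40000, 36000, 52000, 48000, 15000, 17000],
    "Total_Pets_2017": [2200, 2000, 2600, 2400, 900, 1100],
    "Venues": [1, 0, 2, 1, 12, 10],
})
features = ["Population", "Total_Pets_2017", "Venues"]

# Average profile of each cluster, and the overall averages for comparison.
profile = clustered.groupby("Cluster")[features].mean()
overall = clustered[features].mean()

# Keep clusters with few venues but above-average population and pets.
candidates = profile[
    (profile["Venues"] < overall["Venues"])
    & (profile["Population"] > overall["Population"])
    & (profile["Total_Pets_2017"] > overall["Total_Pets_2017"])
]
print(candidates)
```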

In particular, the following FSAs within those 2 clusters jump out from the chart as being particularly good candidates:

  • within Cluster 4: M2J, M9V, M1W, M6M
  • within Cluster 1: M6E, M9B

Of these FSAs, M2J is the only one with an increase in pets licensed in the last five years and may be an especially good candidate for our new pet service/store.

An interactive version of this chart can be found here: https://ag2816.github.io/Toronto_Pet_Clusters_joint_charts.html

So, if you’re planning to open a new pet store, perhaps these are the areas you might want to target!

The full source code for this project can be found in GitHub

Of course, these results should be taken with caution! Since we don’t have a way to accurately measure the actual number of pets living in Toronto, I’m using newly issued licenses as a proxy. This is not a perfect measure because 1) not every owner registers their pet, 2) not all pet types are registered (e.g. there could be a cluster of rabbits living in one FSA that we have no idea about!) and 3) we’re really only looking at licenses issued in 2017, which doesn’t cover pets registered in the past 10–20 years.
