City Recommender System with Python [Part 1/3]: Finding my Schitt’s Creek

Elias Melul
4 min read · May 5, 2020


Introduction and Problem Statement

Have you ever wondered whether you should move somewhere for a specific job? Whether you’d actually be happy there if you did?

As I search, apply, and interview for jobs, I constantly wonder this myself.

The US has over nineteen thousand cities, towns, and villages. About 300 of them count as at least medium-sized, with populations above 100,000. Deciding to move is overwhelming enough on its own, and especially so when you’re not sure what to expect from the place you might be moving to.

Most people are familiar, in one way or another, with cities like New York, Boston, San Francisco, Chicago, or Miami, but are otherwise unaware of the multitude of incredible cities across the US. People may be considering a job in a city they know little about, with no sense of how similar it is to the cities they know they like, or how different it is from the ones they know they dislike.

As an example, I love Boston and really like New York. I also really like Raleigh and Durham in North Carolina, but had I not attended Duke University, I would never have known. So now that I am graduating and considering job offers… where should I consider taking a job, given my preferences?

My objective for this project is exactly that: to build a recommendation system that, based on my input (or yours!), recommends cities that are similar in socioeconomic, weather, and cultural terms.

Data Requirements and Collection

Cities and Weather Data

I found this weather website that contains the monthly average high temperature, average low temperature, average precipitation (inches), and average snowfall (inches) for ~5800 cities and towns in the US.

Using BeautifulSoup, I first scraped the names of the cities in each state, along with the URL to the weather information for each city.
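The site itself isn’t reproduced here, so the sketch below uses a placeholder URL and CSS selector; the pattern (fetch a state page, collect each city link) is the same regardless of the markup:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Placeholder state page; the real site's URL and markup will differ.
state_url = "https://example-weather-site.com/new-york"

soup = BeautifulSoup(requests.get(state_url).text, "html.parser")

# Collect (city name, absolute URL) pairs from each city link on the page.
city_links = [
    (a.get_text(strip=True), urljoin(state_url, a["href"]))
    for a in soup.select("a.city-link")  # hypothetical selector
]
```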

Once I had that, I wrote a function leveraging pandas’ read_html that collects the temperature, precipitation, and snowfall information for a city. I looped through all the URLs and… voilà!
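A minimal sketch of that loop, assuming the monthly climate table is the first &lt;table&gt; on each city page:

```python
import pandas as pd

def get_city_weather(city, url):
    # read_html parses every <table> on the page into a DataFrame;
    # here we assume the first one holds the monthly averages.
    df = pd.read_html(url)[0]
    df["City"] = city
    return df

# city_links comes from the BeautifulSoup step above.
weather = pd.concat(
    [get_city_weather(city, url) for city, url in city_links],
    ignore_index=True,
)
```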

Figure 1: Weather Data for 5 cities in NY state and for January, February and March

Direct download of Weather data (.csv): Click Here
Scraping code for Weather data (.ipynb): Click Here

General Socioeconomic Data

Awesome! Using the list of cities scraped in the weather section, we can generate URLs for datausa.io following these conventions:

https://datausa.io/profile/geo

  • + /city-name-state_abbreviation (or)
  • + /city-name-state_abbreviation-metro-area

However, larger cities are defined as combinations of areas, which we had to scrape from the directory as metropolitan statistical areas (MSAs).

Ex. Miami-Fort Lauderdale-West Palm Beach, FL — https://datausa.io/profile/geo/miami-fort-lauderdale-pompano-beach-fl-metro-area
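Building those slugs is straightforward. A sketch (the helper below is my illustration, not code from the notebook):

```python
def datausa_url(city, state_abbr, metro=False):
    # "West Palm Beach", "FL" -> "west-palm-beach-fl";
    # MSAs get the "-metro-area" suffix instead.
    slug = f"{city.lower().replace(' ', '-')}-{state_abbr.lower()}"
    if metro:
        slug += "-metro-area"
    return "https://datausa.io/profile/geo/" + slug

print(datausa_url("Durham", "NC"))
# https://datausa.io/profile/geo/durham-nc
```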

From these pages, we scraped the following variables for as many of the generated URLs as possible (a schematic scraping sketch follows the list):

  • Population and Population Change (Year to Year)
  • Poverty Rate
  • Median Age
  • Median Household Income and Median Household Income Change (Year to Year)
  • Number of Employees and Number of Employees Change (Year to Year)
  • Median Property Value and Median Property Value Change (Year to Year)
  • Average Male and Female Salary, and a ratio of Average Male to Female Salary
  • Gini coefficient in 2017 and 2018, as well as its change (Year to Year)
  • Ratio of Patients to Clinicians (county-wise)
  • Foreign-born population percentage
  • Citizen population percentage
  • Total degrees awarded in 2018 (higher education)
  • Male to Female ratio of awarded degrees
  • Number of degrees per capita
  • Number of households in city
  • Population per household (people per household)
  • Homeownership Percentage (Rent vs Own)
  • Average Commute Time (minutes)
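I won’t reproduce the full notebook here, but the shape of the scrape is ordinary requests + BeautifulSoup. The selectors below are placeholders (datausa.io renders much of its content with JavaScript, so in practice you may need its JSON API or a headless browser):

```python
import requests
from bs4 import BeautifulSoup

def scrape_profile(url):
    # Schematic only: fetch a datausa.io profile page and collect
    # label -> value pairs from its stat blocks.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    stats = {}
    for block in soup.select("div.stat"):        # placeholder selector
        label = block.select_one(".stat-title")  # placeholder selector
        value = block.select_one(".stat-value")  # placeholder selector
        if label and value:
            stats[label.get_text(strip=True)] = value.get_text(strip=True)
    return stats
```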

We also scraped safety, health care, and climate indices, among other quality-of-life indices, from Numbeo.

Figure 2: Numbeo metrics scraped
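Since Numbeo’s index pages are plain HTML tables, pandas can often read them directly. A minimal sketch, assuming the indices sit in the first table of the city’s quality-of-life page (verify the table position on the live page; if the site blocks pandas’ default client, fetch the HTML with requests first and pass it to read_html):

```python
import pandas as pd

# Numbeo's per-city quality-of-life page; adjust the table index if needed.
url = "https://www.numbeo.com/quality-of-life/in/Raleigh"
indices = pd.read_html(url)[0]
```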

Incredible! With the weather and socioeconomic data, I could potentially build a good recommendation system, but it seemed “too objective”. So I decided to add a little flair by bringing in top venue data from Foursquare.

Direct download of General Socioeconomic data (.csv): Click Here
Scraping code for General Socioeconomic data (.ipynb): Click Here

Venue Data

Using Foursquare’s API, I collected the 100 top-rated venues in each city and built a frequency count of how many times each venue type appears per city. This adds some subjectivity to the recommendation system and serves as a rough proxy for a city’s cultural character.
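A minimal sketch against the (then-current) Foursquare v2 explore endpoint; CLIENT_ID and CLIENT_SECRET are your own credentials, and the parsing assumes the standard explore payload:

```python
import requests
import pandas as pd

CLIENT_ID, CLIENT_SECRET = "your-id", "your-secret"  # your own Foursquare keys

def top_venue_categories(city, limit=100):
    # Query the v2 "explore" endpoint for recommended venues near a city.
    # Note: the API may cap results per request; page with "offset" if needed.
    resp = requests.get(
        "https://api.foursquare.com/v2/venues/explore",
        params={
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "v": "20200505",  # API version date
            "near": city,
            "limit": limit,
        },
    ).json()
    items = resp["response"]["groups"][0]["items"]
    # Keep each venue's primary category name.
    return [
        item["venue"]["categories"][0]["name"]
        for item in items
        if item["venue"]["categories"]
    ]

# Frequency count of venue types for one city (Figure 4 aggregates this per city).
counts = pd.Series(top_venue_categories("Raleigh, NC")).value_counts()
```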

Figure 3: Venues in each city
Figure 4: Count of venues types per city

Direct download of Venue data (.csv): Click Here
Scraping code for Venue data (.ipynb): Click Here

In Part 2 of this story, we will explore and transform the data to feed our recommendation system! Until next time (:

https://github.com/eliasmelul/finding_schitts
