SF Reported Crimes and Rental Prices


SF rents have risen steadily since 2000. Between 2000 and 2010, 28,500 newcomers arrived in the city; from 2010 to 2012, another 20,600 people moved in, in just three years. Many attribute this population growth to the booming technology sector.

While the demand for housing increased, supply stayed relatively fixed due to zoning laws. A basic rule of economics says that prices rise when demand is high and supply is low. I was interested in seeing whether there is a relationship between crime and rental prices in particular neighborhoods.

Hypothesis: Rental prices in San Francisco districts can be predicted from crime counts.

Data: The data came from two sources: Eric Fischer’s GitHub repo of aggregated Craigslist rental data and the SF Open Data crime dataset. The features extracted from the postings include price, bedrooms, year, and district.

Distribution of listings per year

Limitations: The rental data is severely limited because years and districts are not represented evenly. Craigslist data also contains a sizable number of fake posts. Because of these limitations the results should be treated with caution; the project is meant to demonstrate what this kind of analysis makes possible.

Extraction: Another issue with the rental data is that its structure is inconsistent, so parsing it with split methods isn’t practical. Instead I used regular expressions. In addition, district names are written in many different ways; to counter this I scored each name against a dictionary of San Francisco districts and used the best match to standardize the data.
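To make the extraction step concrete, here is a minimal sketch of the idea, not the code used in the project. It assumes a posting title shaped like "$2,500 / 2br - ... (neighborhood)", uses a made-up district list, and swaps in the similarity score from Python's difflib for whatever scoring algorithm was actually used. The names parse_listing, standardize_district, and DISTRICTS are hypothetical.

```python
import re
from difflib import get_close_matches

# Hypothetical dictionary of canonical district names (an assumption,
# not the list used in the original analysis).
DISTRICTS = ["mission district", "soma", "noe valley", "pacific heights", "sunset"]

# Assumed title shape: "$2,500 / 2br - some text (neighborhood)"
PRICE_RE = re.compile(r"\$(\d[\d,]*)")
BEDS_RE = re.compile(r"(\d+)\s*br\b", re.IGNORECASE)
HOOD_RE = re.compile(r"\(([^()]*)\)\s*$")


def standardize_district(raw, cutoff=0.6):
    """Return the canonical district most similar to `raw`, or None."""
    matches = get_close_matches(raw.strip().lower(), DISTRICTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_listing(title):
    """Pull price, bedrooms, and district out of one posting title."""
    price = PRICE_RE.search(title)
    beds = BEDS_RE.search(title)
    hood = HOOD_RE.search(title)
    return {
        "price": int(price.group(1).replace(",", "")) if price else None,
        "bedrooms": int(beds.group(1)) if beds else None,
        "district": standardize_district(hood.group(1)) if hood else None,
    }
```

Fields that cannot be found come back as None rather than raising, which makes it easy to drop or inspect malformed postings afterwards. The cutoff value is a tuning choice: a lower cutoff matches more spellings but risks assigning a posting to the wrong district.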

Original Craigslist Data
REGEX and Transformations
Parsed Craigslist Data
SF crime data set from opendata

Merging: The crime dataset divides San Francisco into different districts (police districts) than the rental data does.

scraped longitude latitudes to merge

This created a problem since there were no common columns to merge on. To solve this, I scraped the coordinates of each rental district's center and wrote a function that assigns each crime to the district whose center is closest to the crime's coordinates.
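The nearest-center assignment can be sketched as follows. This is an illustration under assumptions, not the original function: the district centers are rough, hand-picked coordinates, distance is measured with the haversine formula (the original may have used a different distance), and the names DISTRICT_CENTERS, haversine_km, and nearest_district are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Rough, illustrative district centers as (latitude, longitude).
# These are assumptions, not the scraped values from the project.
DISTRICT_CENTERS = {
    "mission district": (37.7599, -122.4148),
    "soma": (37.7785, -122.4056),
    "sunset": (37.7527, -122.4940),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_district(lat, lon, centers=DISTRICT_CENTERS):
    """Name of the district whose center is closest to (lat, lon)."""
    return min(centers, key=lambda name: haversine_km(lat, lon, *centers[name]))
```

Applying nearest_district to every row of the crime data yields a district column that matches the rental data, so the two tables can then be merged on district and year. Note that nearest-center assignment only approximates real district boundaries; a crime near a border may land in the neighboring district.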

code to determine which district a crime data point is in

EDA: To complete the analysis I explored trends of reported crime and rent in different neighborhoods and displayed the results with exploratory graphs.

crime by category counts

Predictive modeling: I used a random forest regressor to predict rental prices in San Francisco from district, year, bedrooms, and aggregate crime counts per district per year. The graph below shows the overall feature importances, in other words, which variables have the most predictive power. The most important features are bedrooms, year, and district, indicating that much of the predictive power comes from these features rather than from the crimes happening in neighborhoods.
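As a rough sketch of how the "aggregate crime counts per district per year" feature could be built before fitting a model, the snippet below counts crimes per (district, year) and attaches the count to each listing. It assumes both datasets are already lists of dictionaries with district and year keys; the function names crime_counts and build_features are hypothetical, and the model fitting itself is left out.

```python
from collections import Counter


def crime_counts(crimes):
    """Count crimes per (district, year) pair."""
    return Counter((c["district"], c["year"]) for c in crimes)


def build_features(listings, counts):
    """Attach the matching crime count to each rental listing.

    Listings with no recorded crimes for their district and year
    get a count of 0.
    """
    rows = []
    for listing in listings:
        key = (listing["district"], listing["year"])
        rows.append({**listing, "crime_count": counts.get(key, 0)})
    return rows
```

Each resulting row then carries district, year, bedrooms, and crime_count as inputs and price as the target. The same pattern extends to per-category counts by adding the crime category to the key.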

Feature importances for Random forest

The graph below visualizes the performance of the model. The x-axis is simply the index of each sample. The red line is the predicted rental price for each sample, and the blue dots are the actual rental prices for those data points. The green lines are the residuals, the differences between the actual and predicted prices.

Predictive Power Graph

Final thoughts: I grouped the data by district and ran the rental price model on each subset. The result was that different crime categories had predictive power in different districts. This suggests that something more complex is going on with rental prices. One possible explanation is that the types of crime occurring in a neighborhood are affected by prices in other neighborhoods. To confirm this hypothesis, a more in-depth analysis would need to be performed with a higher-quality dataset.

Highest feature importances: Drug/Narcotic

Mission District Feature importances

Highest feature importances: Stolen Property, Weapon Laws, Larceny/Theft

SOMA Feature Importances

Mapping: I made a heat-map of prices in SF and clustered crimes.