Exploratory statistics: SF crime data and eviction notices

Samuel McCormick
Published in Data Divas
Nov 16, 2017 · 6 min read

SF crime data

I began my exploratory analysis by reading in the San Francisco crime report data and examining the features it contained, listed below:

  • Date/time
  • Category of crime (e.g. “Larceny/Theft”, “Burglary”)
  • Description
  • Day of week
  • Police department district
  • Resolution
  • Address
  • Latitude/Longitude of where the crime occurred

The data contains 878,049 individual crime reports spanning the period January 2003-May 2015. I next created some new features based on the timestamp that I thought would be helpful, such as extracting the day, month, year, and time from this variable and storing them as pandas datetime variables. I then plotted the relative frequencies of the categorical variables to get a feel for the distribution of the data:

Categories of crime, 2003–2015
Crimes by district, 2003–2015
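
As a rough illustration of the timestamp features and relative frequencies described above, here is a minimal pandas sketch. The column names `Dates` and `Category` and the two toy rows are assumptions for the example, not the actual schema or data.

```python
import pandas as pd

# Toy rows standing in for the crime reports.
# The column names "Dates" and "Category" are assumptions.
crimes = pd.DataFrame({
    "Dates": ["2003-01-06 00:01:00", "2015-05-13 23:53:00"],
    "Category": ["BURGLARY", "LARCENY/THEFT"],
})

def add_time_features(df, timestamp_col="Dates"):
    """Return a copy of df with Year, Month, Day and Hour columns."""
    out = df.copy()
    ts = pd.to_datetime(out[timestamp_col])  # parse strings into datetimes
    out[timestamp_col] = ts
    out["Year"] = ts.dt.year
    out["Month"] = ts.dt.month
    out["Day"] = ts.dt.day
    out["Hour"] = ts.dt.hour
    return out

crimes = add_time_features(crimes)

# Relative frequency of each category (the shares sum to 1).
category_share = crimes["Category"].value_counts(normalize=True)
```

Calling `category_share.plot(kind="bar")` would then give a bar chart in the spirit of the ones above, assuming matplotlib is installed.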

Next, I delved into the online literature on GeoPandas, which is a powerful open-source package for working with and visualizing geospatial data. After some trial and error, I was able to load the San Francisco neighborhood boundaries data (https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h) into a “GeoDataFrame” and plot it:

Pretty cool! What if I wanted to see a plot of where some subset of crimes occurred overlaid on this map of the city? I arbitrarily decided to pick burglaries that occurred in 2003 and plot their locations on the map:
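
For readers who want to try this kind of overlay, here is a hedged sketch using GeoPandas. The helper names, the `X`/`Y` column names, the file name in the comment, and the toy square standing in for a neighborhood are all assumptions made for illustration; this is not necessarily how the plots in this post were produced.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# In practice the boundaries would come from a downloaded file, e.g.
#   neighborhoods = gpd.read_file("analysis_neighborhoods.geojson")
# (hypothetical local file name). A toy square stands in for it here.
toy_neighborhoods = gpd.GeoDataFrame(
    {"nhood": ["Toy"]},
    geometry=[Polygon([(-122.5, 37.7), (-122.3, 37.7),
                       (-122.3, 37.8), (-122.5, 37.8)])],
    crs="EPSG:4326",
)

def points_to_gdf(df, lon_col="X", lat_col="Y"):
    """Turn longitude/latitude columns into a GeoDataFrame of points."""
    geometry = gpd.points_from_xy(df[lon_col], df[lat_col])
    return gpd.GeoDataFrame(df.copy(), geometry=geometry, crs="EPSG:4326")

def plot_overlay(boundaries, points, color="red"):
    """Draw the boundaries, then scatter the points on the same axes."""
    ax = boundaries.plot(color="white", edgecolor="black", figsize=(8, 8))
    points.plot(ax=ax, color=color, markersize=2)
    return ax

# Two made-up burglary locations (column names are assumptions).
toy_burglaries = pd.DataFrame({"X": [-122.42, -122.41], "Y": [37.76, 37.78]})
burglary_points = points_to_gdf(toy_burglaries)
```

`plot_overlay(toy_neighborhoods, burglary_points)` would then draw the points over the boundary, assuming matplotlib is available.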

Excellent! I could already see how invaluable GeoPandas was going to be for defining spatial boundaries and producing compelling visualizations as the project progressed. Now that I had some sort of handle on how the basics of this package worked, I decided to move on to my next data source, eviction records (again available on the OpenDataSF website). The data contained numerous indicator variables corresponding to specific reasons for and steps taken during the eviction; to simplify the initial analysis, I decided to restrict the data to the following features:

  • Zipcode
  • File Date
  • Non-payment indicator (as reason for why tenant was evicted)
  • Illegal use indicator (same as above)
  • Nuisance indicator (same as above)
  • Neighborhood
  • Latitude/Longitude

After this trimming, the data contained 38,117 eviction records and 8 features. There were 1382 rows for which the lat/long information was missing, and I decided to drop these rows for now; depending on future work, we still might be able to use these records if we decide we only need records to be at the neighborhood level. As with the crime data, I plotted the relative frequencies of the categorical variables in the dataset (one example below):

SF evictions by neighborhood, 2003–2015
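
The trimming and row-dropping steps above could look roughly like this in pandas. Every column name and value below is a placeholder chosen for the sketch, since the post does not give the exact schema.

```python
import pandas as pd

# Toy eviction rows; all column names and values are placeholders.
evictions = pd.DataFrame({
    "File Date": ["2003-02-01", "2003-03-15", "2004-07-09"],
    "Neighborhood": ["Mission", "Mission", "Tenderloin"],
    "Location": ["(37.76, -122.42)", None, "(37.78, -122.41)"],
    "Some Other Indicator": [0, 1, 0],
})

# Keep only the features of interest.
keep = ["File Date", "Neighborhood", "Location"]
trimmed = evictions[keep]

# Count, then drop, the rows with no coordinate information.
n_missing = trimmed["Location"].isna().sum()
located = trimmed.dropna(subset=["Location"]).reset_index(drop=True)
```

Keeping `trimmed` around alongside `located` leaves the door open to reusing the dropped rows later for neighborhood-level work.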

I wanted to plot some of these evictions as I had done with the crime data, but realized it would be a little bit tougher than for the crime data: the lat/long coordinates were in the form “(x, y)”, so I had to write a function to extract the coordinates using regular expressions and convert them to floats. Finally, I decided to plot 2003 evictions on the same map as 2003 burglaries (not because I suspected burglary was especially correlated with evictions, but more to see how overlaying two different geospatial data sources would look in practice):

Evictions (blue) and burglaries (red), 2003
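
The coordinate extraction mentioned above can be sketched with the standard `re` module. The function name `parse_coords` is made up for this example, and the sketch makes no assumption about which of the two numbers is latitude and which is longitude.

```python
import re

# Matches "(x, y)" where x and y are optionally signed decimal numbers.
_COORD_RE = re.compile(
    r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)

def parse_coords(text):
    """Return the two numbers in an "(x, y)" string as a tuple of floats.

    Returns None when the input is not a string or does not match.
    """
    if not isinstance(text, str):
        return None
    match = _COORD_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

Applied to a pandas column, something like `df["Location"].apply(parse_coords)` (with `Location` as a hypothetical column name) would yield one tuple per row.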

While this plot is interesting to look at, it doesn’t tell us whether evictions and crime are correlated in any sense. To investigate this relationship, I decided to restrict my analysis to the subset of both datasets (evictions and crime) that took place in the Mission district of San Francisco. What would be the simplest model possible to test this hypothesis? I decided to create two variables, crimes per month and evictions per month in the Mission from January 2003 to May 2015, and run a linear regression of Y (crimes per month) on X (evictions per month) to measure the strength of the relationship:
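
To make the setup concrete, here is a minimal sketch of counting records per month and fitting a straight line with NumPy. The dates are made up purely for illustration and are not the Mission figures; the helper name `monthly_counts` is also an assumption.

```python
import numpy as np
import pandas as pd

def monthly_counts(dates):
    """Count records per calendar month from an iterable of date strings."""
    months = pd.to_datetime(pd.Series(list(dates))).dt.to_period("M")
    return months.value_counts().sort_index()

# Made-up dates for illustration only.
crime_dates = ["2003-01-03", "2003-01-17", "2003-01-20",
               "2003-02-02", "2003-02-11",
               "2003-03-05", "2003-03-09", "2003-03-21", "2003-03-30"]
eviction_dates = ["2003-01-10", "2003-01-12",
                  "2003-02-08",
                  "2003-03-01", "2003-03-02", "2003-03-15"]

# One row per month; months missing from either series count as 0.
monthly = pd.DataFrame({
    "crimes": monthly_counts(crime_dates),
    "evictions": monthly_counts(eviction_dates),
}).fillna(0)

# Least-squares line: crimes = slope * evictions + intercept.
slope, intercept = np.polyfit(monthly["evictions"], monthly["crimes"], deg=1)

# Pearson correlation between the two monthly series.
r = np.corrcoef(monthly["evictions"], monthly["crimes"])[0, 1]
```

NumPy alone does not report a p-value for the coefficient. `scipy.stats.linregress` returns one alongside the slope and intercept, if SciPy is available.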

What about a graphical representation of this relationship?

IMPORTANT DISCLAIMER: All we can conclude from this is that there appears to be a statistically significant relationship (the p-value of the linear regression coefficient is essentially zero) between the number of evictions and the number of reported crimes in a given month. It would be dangerous to claim that “evictions predict crime” based on this very simple analysis, since there are likely other underlying variables (unemployment? poverty?) that are responsible for both higher crime and higher eviction rates. This is nevertheless encouraging because there is clearly a relationship between evictions and crime, and therefore eviction rates could be a useful proxy for the underlying causal variables behind high crime in a given district and time period (especially since district-level unemployment and poverty data would likely be difficult to obtain).

With a lot of exploratory work done to get a better feel for the data we have and the models that suit the data, I decided to think of future steps. What can we do if our model works (wish I could say “when” instead of “if” but that still sounds a little optimistic currently)? Who can make optimal use of our model given the inputs it takes, besides the police departments?

While police departments can take preemptive action to prevent crime incidents from occurring, state and federal agencies can affect crime at a broader level. They can plan and implement policies that directly reduce crime by identifying the most prominent types of crime incidents and how those incidents are resolved.

Hence, I decided to explore the “Resolution” data, starting by counting the unique values, and got these results:

On plotting the histograms, we can observe the following:

Intuitively, certain crime incident resolutions seemed like they could be more useful to agencies developing policies. These resolutions were:

  1. NONE
  2. ARREST, BOOKED
  3. ARREST, CITED
  4. PSYCHOPATHIC CASE
  5. JUVENILE CASE

I decided to look at the share of reported crime incidents with each of these resolutions. Together they account for 95.15% of the total resolutions (joined_ratio), and I obtained the following:
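
A computation along these lines can be sketched with `value_counts`. The ten toy values below are invented, so the resulting numbers have nothing to do with the real 95.15% figure; only the name `joined_ratio` comes from the analysis itself.

```python
import pandas as pd

# Toy values only; the real column has far more rows and categories.
resolution = pd.Series(
    ["NONE", "NONE", "NONE", "ARREST, BOOKED", "ARREST, CITED",
     "NONE", "JUVENILE CASE", "NONE", "ARREST, BOOKED", "OTHER"],
    name="Resolution",
)

selected = ["NONE", "ARREST, BOOKED", "ARREST, CITED",
            "PSYCHOPATHIC CASE", "JUVENILE CASE"]

# Raw counts and shares of each unique resolution.
counts = resolution.value_counts()
ratios = resolution.value_counts(normalize=True)

# Shares for the selected resolutions (0 if one never appears),
# and their combined share of all incidents.
selected_ratios = ratios.reindex(selected, fill_value=0.0)
joined_ratio = selected_ratios.sum()
```

Using `reindex` with `fill_value=0.0` keeps a selected resolution in the result even when it is absent from the data, so the table always has the same five rows.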

Why do 60% of reported incidents have no resolution? Why is 1 in every 100 reported crime incidents committed by a juvenile? These are just 2 of the many questions that arise, pointing to areas worth examining if we want to curb crime by targeting different segments of society.

I also came across a pedestrian volume database that provides the average number of pedestrians at every intersection in SF (this was quite creepy tbh). This dataset consists of 8135 rows of street intersections, each with the estimated number of pedestrians using those streets on a yearly basis. Plotting this with GeoPandas gives a good visualization of the street grid, with each intersection represented by the number of pedestrians walking there.

Layering this on top of the SF map that we used earlier and combining it with the burglaries data, we can see an interesting visualization.

This tells us that burglaries are more common in the denser areas, toward the north-east of SF, which could be due to one or more of these underlying factors:

  1. There are simply more burglaries in that region because there are more people there. This would imply that the number of burglaries is proportional to the population of a region.
  2. The denser areas are more urbanized and generally wealthier than average. This would imply that the number of burglaries is proportional to the wealth of a region.
