Machine Learning

Samarth Goenka
Published in Data Divas · Dec 4, 2017 · 5 min read

Developing a good predictive machine learning model is arguably the most important part of our project. This model will generate class probabilities that can be interpreted as relative probabilities of each type of crime occurring in each of the San Francisco census tract zones; we will then experiment with different threshold values to tune our model's trade-off between sensitivity and specificity (the true positive and true negative rates).
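As a rough sketch of what that thresholding step could look like once the model outputs class probabilities (the probability and label arrays below are made-up placeholders, not our results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder predicted probabilities of "crime in the next hour" and the true 0/1 labels.
probs = np.array([0.12, 0.55, 0.83, 0.40, 0.71])
y_true = np.array([0, 1, 1, 0, 1])

for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    print(f"threshold={threshold}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Lowering the threshold catches more crimes (higher sensitivity) at the cost of more false alarms (lower specificity), which is exactly the knob we want to expose.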

Based on our research, we have identified certain factors that we expect to be correlated with future crime. The features we will include in our machine learning model are the following:

- Census tract number (one hot encoded)

- Day of week, month, year, and hour (one hot encoded)

- Prior crime information: binary variables indicating whether each type of crime occurred in that census tract in the last 1–5 hours, 1, 2, 3, 7, and/or 14 days

The crime categories are:

Property crime (burglary, larceny/theft, vehicle theft, arson, vandalism, stolen property, embezzlement)

Violent crime (assault, sex offenses, kidnapping)

Robbery (its own category)

Other crime (everything else)

- Eviction information: binary variables indicating whether an eviction occurred in that census tract in the last 1–5 hours, 1, 2, 3, 7, and/or 14 days. We expect this to be a good proxy for people struggling financially, which might in turn be a good predictor of crime.

- 311 information: binary variables indicating whether a public complaint was made in that census tract in the last 1–5 hours, 1, 2, 3, 7, and/or 14 days
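Here is a minimal sketch of how lagged indicator features like these could be built with pandas. The `crimes` DataFrame, its column names, and the example rows are hypothetical, and only one window (3 hours) is shown:

```python
import pandas as pd

# Hypothetical incident-level data: one row per crime, with a timestamp, census tract, and category.
crimes = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-11-01 01:15", "2017-11-01 02:40", "2017-11-01 03:05"]),
    "tract": ["0101", "0101", "0102"],
    "category": ["property", "violent", "property"],
})

# Hourly counts per (tract, category), on a complete hourly grid.
hourly = (crimes
          .assign(hour=crimes["timestamp"].dt.floor("H"))
          .pivot_table(index="hour", columns=["tract", "category"], aggfunc="size", fill_value=0)
          .asfreq("H", fill_value=0))

# Binary indicator: did this (tract, category) see any crime in the previous 3 hours?
# shift(1) excludes the current hour so the feature only looks backwards.
last_3h = hourly.rolling(3, min_periods=1).sum().shift(1).gt(0).astype(int)

# The same pattern gives the other hour/day windows and the eviction and 311 indicators.
```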

Once our base classifier is working correctly, we'll experiment with bagging (training multiple classifiers on random samples and then aggregating their predictions, which should reduce variance/overfitting) and with voting classifiers, which should have a similar effect.
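As a rough sketch of what those two ensembling strategies look like in scikit-learn (the base estimators and hyperparameters here are illustrative, not our final choices):

```python
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Bagging: fit many copies of one base classifier on bootstrap samples and average their votes,
# which tends to reduce variance/overfitting.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Voting: combine several different classifiers and average their predicted probabilities.
voted = VotingClassifier(
    estimators=[("log", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier(max_depth=10))],
    voting="soft",
)

# Both are used like any other classifier:
# bagged.fit(X_train, y_train); voted.fit(X_train, y_train)
```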

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Update

Given the time and computing power constraints we face as master's students, we decided to stay away from neural nets and instead experiment with other types of ML classifiers. For each record in our original dataset, the labeled target is a binary indicator of whether a crime was committed (1) or not (0) within the next hour in the same census tract, broken out into one column per crime type. Our feature data consists of one-hot encoded representations of the hour, day of week, month, year, and census tract, as well as counts of crimes in the last 1, 2, 3, 4, and 5 hours and in the last 1, 2, 3, 7, and 14 days (a sketch of the target construction appears after the list below). Our objectives were to build classifiers to predict:

1. Whether any type of crime would occur in a given census tract in the next one hour

2. Whether a property crime, violent crime, robbery, or other crime (none of the first three) would occur in a given census tract in the next one hour
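A rough sketch of how those next-hour targets could be derived, assuming an hourly count table like the one in the earlier feature sketch (all names and example values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly count table: rows are hours, columns are (tract, category) pairs,
# values are the number of incidents of that category in that tract during that hour.
hours = pd.date_range("2017-11-01", periods=6, freq="H")
cols = pd.MultiIndex.from_product([["0101", "0102"],
                                   ["property", "violent", "robbery", "other"]],
                                  names=["tract", "category"])
hourly = pd.DataFrame(np.random.poisson(0.2, size=(len(hours), len(cols))),
                      index=hours, columns=cols)

# Shift the counts back one hour so each row is labeled with what happens in the *next* hour.
next_hour = hourly.shift(-1)

# Objective 1: will *any* crime occur in each tract in the next hour? (sum over categories per tract)
y_any = next_hour.T.groupby(level="tract").sum().T.gt(0).astype(int)

# Objective 2: will each *specific* crime type occur in each tract in the next hour?
y_by_type = next_hour.gt(0).astype(int)

# The final hour has no "next hour" label and would be dropped before training.
```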

We split our dataset into training and testing data such that the earliest crime in the testing data happened chronologically after the last crime in the training data. This seemed like the most realistic scenario: in the real world our classifier will only ever have past data available when predicting future crimes, so letting it train on events that happen after the ones it is tested on would be cheating (we won't have any future data to lean on!).
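A minimal sketch of that chronological split, with a made-up `data` table standing in for our real feature/target DataFrame and an illustrative 80/20 proportion:

```python
import numpy as np
import pandas as pd

# Hypothetical feature/target table indexed by hour (stand-in for our real dataset).
data = pd.DataFrame({"feature": np.arange(10), "target": np.random.randint(0, 2, 10)},
                    index=pd.date_range("2017-01-01", periods=10, freq="H"))

# Sort by time and cut once: everything before the cut trains, everything after tests,
# so the earliest test example comes strictly after the latest training example.
data = data.sort_index()
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]
```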

Like most machine learning engineers these days, we started with Random Forest (RF) and Logistic Regression (LOG) classifiers, then moved on to Bagging (BAG) and Adaptive Boosting (AB). In the end, Gradient Boosting (GB) worked best for predicting whether any crime would occur, while a voting classifier combining RF and LOG worked best for predicting specific crime types. All of these were tuned using GridSearchCV and RandomizedSearchCV.
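For illustration, the tuning step might look roughly like this; the parameter grids and values below are placeholders, not our actual search space:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Grid search over a few Gradient Boosting hyperparameters, scored by ROC AUC.
gb_search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [2, 3],
                "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=3,
)

# Soft-voting ensemble of Random Forest and Logistic Regression for the per-type classifiers.
vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200)),
                ("log", LogisticRegression(max_iter=1000))],
    voting="soft",
)

# gb_search.fit(X_train, y_train) picks the best GB configuration;
# vote.fit(X_train, y_train) trains the RF + LOG ensemble.
```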

The accuracy and AUC scores for predicting future crimes are:

Any crime:

RF: accuracy 0.8443, AUC 0.7452
LOG: accuracy 0.8472, AUC 0.7499
BAG: accuracy 0.8467, AUC 0.7439
AB: accuracy 0.8479, AUC 0.7510
GB: accuracy 0.8468, AUC 0.7545

Property crimes:

RF: accuracy 0.9349, AUC 0.7109
LOG: accuracy 0.9345, AUC 0.7503
BAG: accuracy 0.9347, AUC 0.7402
AB: accuracy 0.9337, AUC 0.7516
GB: accuracy 0.9336, AUC 0.7592

Violent crimes:

RF: accuracy 0.9825, AUC 0.7135
LOG: accuracy 0.9822, AUC 0.7219
BAG: accuracy 0.9825, AUC 0.7161
AB: accuracy 0.9825, AUC 0.6999
GB: accuracy 0.9812, AUC 0.7280

Robberies:

RF: accuracy 0.9972, AUC 0.7050
LOG: accuracy 0.9972, AUC 0.6414
BAG: accuracy 0.9972, AUC 0.6985
AB: accuracy 0.9972, AUC 0.6553
GB: accuracy 0.9950, AUC 0.7113

Other crimes:

RF: accuracy 0.9041, AUC 0.7435
LOG: accuracy 0.9025, AUC 0.7628
BAG: accuracy 0.9035, AUC 0.7441
AB: accuracy 0.9021, AUC 0.7589
GB: accuracy 0.9037, AUC 0.7655
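For reference, scores like the ones above can be computed with scikit-learn along these lines (the tiny random dataset is just a placeholder so the snippet runs end to end):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up data: fit a classifier, then score it the same way the numbers above are scored.
rng = np.random.RandomState(0)
X_train, X_test = rng.rand(200, 5), rng.rand(50, 5)
y_train, y_test = rng.randint(0, 2, 200), rng.randint(0, 2, 50)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy uses hard 0/1 predictions; AUC uses the predicted probability of the positive class.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```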

Gradient Boosting ROC predicting any crime in the next one hour

The ROC curves for predicting specific crime types one hour in the future are shown below; these were obtained with the voting classifier built from RF and LOG.

Voting Classifier ROC predicting property crime one hour in the future
Voting Classifier ROC predicting violent crime one hour in the future
Voting Classifier ROC predicting robbery crime one hour in the future
Voting Classifier ROC predicting other crime one hour in the future

Although our classifiers' AUC scores are not quite in the 0.9–1 range, we're very happy with how our model is working. We're in the process of increasing the size of our training/testing set, and we're confident that with more data and more time spent tuning our model we will be able to push the AUC score above 0.8.
