Predicting Arrests: Looking into Chicago’s Crime through Machine Learning
Analyzing crime in Chicago from 2012–2017 with Decision Trees, Logistic Models, and Random Forest Classifiers.
By: Carley Williams
For this project, I used Kaggle’s dataset: Chicago Crimes 2012–2017. Before applying any models, I first needed to clean and explore my data.
Step 1: Exploring Data
To begin the exploration process, I followed many of the same steps I took in my previous publication from March 7th, “Analyzing Chicago’s Crime Rates”.
This dataset has 1,456,714 rows and 23 columns. Each row is a unique crime, and the columns describe everything about it: from latitude and longitude to FBI code to whether an arrest was made, each of the 23 columns adds more detail about the crime.
For my project, I narrowed things down to 16 columns (a short loading and column-selection sketch follows the list):
Date: listed date of the crime
Block: block where the crime occurred
IUCR: four-digit Illinois Uniform Crime Reporting (IUCR) code
Primary Type: primary category of the crime (e.g., THEFT, BATTERY)
Description: short description of the type of crime
Location Description: description of where the crime occurred
Arrest: boolean value (T/F) of whether or not an arrest was made
Domestic: boolean value (T/F) of whether or not the crime was domestic
District: police district where the crime occurred
Community Area: numeric value indicating the community area where the crime occurred
FBI Code: numeric code indicating the FBI crime categorization
X & Y Coordinate: exact location where the crime occurred
Year: year the crime occurred
Latitude & Longitude: latitude and longitude of the crime
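As a minimal sketch of how this loading and column selection might look in pandas: the filename and exact column headers below are assumptions based on the Kaggle export, so adjust them to match your copy of the data.

```python
import pandas as pd

# Assumed local filename for the Kaggle export; adjust the path as needed.
df = pd.read_csv("Chicago_Crimes_2012_to_2017.csv")
print(df.shape)  # roughly (1456714, 23) for this dataset

# Keep only the columns used in this project (header names assumed to
# match the Kaggle CSV).
cols = ["Date", "Block", "IUCR", "Primary Type", "Description",
        "Location Description", "Arrest", "Domestic", "District",
        "Community Area", "FBI Code", "X Coordinate", "Y Coordinate",
        "Year", "Latitude", "Longitude"]
df = df[cols]
```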
Step 2: Cleaning Data
Part 1: Checking for nulls
To begin the cleaning process and prepare my data for analysis, I first checked which columns contained null values. Several did: Location Description, District, Community Area, X & Y Coordinates, and both Latitude & Longitude. To decide how best to handle these nulls, I looked at what percentage of each column was null.
To do this, I printed the percentage of null values in each column. Nearly every column was under 1%, but X & Y Coordinates and Latitude & Longitude were each nearly 20% null. I therefore dropped those four columns entirely and dropped the individual rows containing nulls in the remaining columns. After checking again, I had no more null values and was ready to continue analysis.
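A quick sketch of that null check and cleanup, assuming the column names from the Kaggle headers:

```python
# Percentage of null values per column.
null_pct = df.isnull().mean() * 100
print(null_pct[null_pct > 0].sort_values(ascending=False))

# Drop the heavily-null coordinate columns, then drop the remaining rows with nulls.
df = df.drop(columns=["X Coordinate", "Y Coordinate", "Latitude", "Longitude"])
df = df.dropna()
print(df.isnull().sum().sum())  # should now be 0
```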
Part 2: Checking for categorical data
Next, I needed to change some column data types to analyze them correctly. To do this, I identified which columns were categorical rather than purely numeric. By printing each column's unique values, I could see that IUCR, Community Area, FBI Code, and Primary Type were categorical.
I switched these columns' data types to categories, checked they were converted, and then was able to move forward with my analysis.
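A minimal sketch of that conversion:

```python
# Convert the categorical columns to pandas' category dtype.
for col in ["IUCR", "Community Area", "FBI Code", "Primary Type"]:
    df[col] = df[col].astype("category")

print(df.dtypes)  # confirm the conversions took effect
```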
Part 3: Checking for duplicate rows
Next, I wanted to see whether I had any duplicate rows and remove them. I found 3,238 duplicates and removed them, keeping the first occurrence of each.
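In pandas, that check and removal might look like this:

```python
# Count exact duplicate rows, then drop them while keeping the first occurrence.
print(df.duplicated().sum())           # 3,238 in my run
df = df.drop_duplicates(keep="first")
```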
Part 4: Setting date column to DateTimeIndex
To be able to analyze by date, I converted the Date column to a DatetimeIndex and confirmed the change. This allows me to use pandas' date-dependent features later on.
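A sketch of that conversion; the date format string is an assumption based on how this Kaggle export typically stores dates:

```python
# Parse the Date column (assumed format, e.g. "01/01/2012 12:00:00 AM")
# and use it as the DataFrame's DatetimeIndex.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
print(df.index)  # confirm we now have a DatetimeIndex
```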
Step 3: Beginning Analysis
Part 1: Creating a more succinct data frame
After exploring my data, I decided I wanted to look at the relationship between different variables and arrests. I knew Arrest was a boolean value, and I thought it could be interesting to look into how things like primary type, domestic status, community area, and year impacted the arrest rate.
To begin, I narrowed my dataframe into a few columns I was interested to look into.
Selected Columns: Arrest, Primary Type, Domestic, Community Area, Year
Part 2: Creating dummy variables
Next, I created dummy variables for my non-numeric data to be able to analyze it. I created dummy variables for Primary Type, Arrest, and Domestic.
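A sketch of this narrowing and encoding; the underscore renames are an assumption so the resulting dummy column names match the feature names used below:

```python
import pandas as pd

# Narrow to the columns of interest and rename with underscores so the
# dummy column names match the feature names used later (e.g. Primary_Type_THEFT).
model_df = df[["Arrest", "Primary Type", "Domestic", "Community Area", "Year"]].rename(
    columns={"Primary Type": "Primary_Type", "Community Area": "Community_Area"})

# One-hot encode the non-numeric columns.
model_df = pd.get_dummies(model_df, columns=["Primary_Type", "Arrest", "Domestic"])
```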
Part 3: Setting target and feature variables
As I wanted to explore the relationship different variables had on arrest rate, I set my target and feature columns to the following:
Target Column: ['Arrest_True']
Feature Columns: ['Community_Area', 'Year', 'Domestic_True', 'Domestic_False', 'Primary_Type_THEFT', 'Primary_Type_BATTERY', 'Primary_Type_CRIMINAL DAMAGE', 'Primary_Type_NARCOTICS', 'Primary_Type_ASSAULT', 'Primary_Type_DECEPTIVE PRACTICE', 'Primary_Type_OTHER OFFENSE', 'Primary_Type_BURGLARY', 'Primary_Type_MOTOR VEHICLE THEFT', 'Primary_Type_ROBBERY']
These target and feature columns would later allow me to predict and analyze models’ capabilities to predict whether or not a crime would be an arrest.
Part 4: Splitting X and y into training and testing data
Finally, I could split my data into training and testing sets. I used a random state of 0 and a test size of 0.25 for this analysis.
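Putting the target, features, and split together, a minimal sketch looks like this:

```python
from sklearn.model_selection import train_test_split

# Feature list and target column as defined above.
feature_cols = [
    "Community_Area", "Year", "Domestic_True", "Domestic_False",
    "Primary_Type_THEFT", "Primary_Type_BATTERY", "Primary_Type_CRIMINAL DAMAGE",
    "Primary_Type_NARCOTICS", "Primary_Type_ASSAULT",
    "Primary_Type_DECEPTIVE PRACTICE", "Primary_Type_OTHER OFFENSE",
    "Primary_Type_BURGLARY", "Primary_Type_MOTOR VEHICLE THEFT",
    "Primary_Type_ROBBERY",
]
X = model_df[feature_cols]
y = model_df["Arrest_True"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```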
Step 4: Exploring Relationships
Before applying any models, I wanted to see what the correlation was between variables. I first looked at the correlation between all variables, but there was too much data to be able to clearly see any relationships. Thus, I narrowed it down to see the correlation between a true arrest and each of the other variables outlined in my data frame.
To visualize this, I created a heatmap. However, there were too many variables that made analysis difficult.
To be able to simplify this, I wanted to narrow primary type into the most common crimes. To do this, I plotted a seaborn graph of the most frequent crimes.
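One way that frequency plot might be sketched, assuming the original Primary Type column is still available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Bar chart of the ten most frequent primary crime types.
sns.countplot(y=df["Primary Type"],
              order=df["Primary Type"].value_counts().head(10).index)
plt.show()
```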
Now that I knew the most common crimes, I could narrow my heatmap down. I made a new dataframe with Arrest, Community Area, Year, Domestic status, and the four most common crimes: Theft, Battery, Narcotics, and Criminal Damage. Now, my heatmap was much easier to analyze and understand. It looked like the strongest correlations with arrest were for theft, criminal damage, and narcotics, but I wanted to print some numbers to double-check.
To do this, I created a new correlations variable and printed the correlations between these variables and Arrest_True. Theft, criminal damage, and narcotics did indeed show the strongest correlations with arrest, confirming my read of the heatmap.
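A sketch of the narrowed heatmap and the printed correlations, assuming the dummy-encoded frame from above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Narrowed set: arrest, community area, year, domestic status, and the
# four most frequent crime types.
heat_cols = ["Arrest_True", "Community_Area", "Year", "Domestic_True",
             "Primary_Type_THEFT", "Primary_Type_BATTERY",
             "Primary_Type_NARCOTICS", "Primary_Type_CRIMINAL DAMAGE"]
# Cast to float so corr() treats the dummy columns numerically.
corr = model_df[heat_cols].astype(float).corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Correlations with Arrest_True only, sorted by absolute strength.
print(corr["Arrest_True"].drop("Arrest_True").sort_values(key=abs, ascending=False))
```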
Step 5: Applying a Dummy Classifier Model
The first model I chose to look at was a Dummy Classifier Model to provide myself a baseline to judge future models on.
Part 1: Importing & fitting the model
To begin, I imported and fit the DummyClassifier using the strategy “most_frequent”, which in this case means always predicting “no arrest”.
Part 2: Evaluating Dummy Classifier
After fitting my model, I calculated an accuracy score of around 74.11%. This gives me a baseline against which to judge whether the other models improve.
Part 3: Creating Visualizations
To visualize the dummy model, I made a confusion matrix showing the predictions it made.
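A sketch of the baseline model and its confusion matrix (ConfusionMatrixDisplay assumes scikit-learn 1.0 or later):

```python
import matplotlib.pyplot as plt
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

# Baseline: always predict the most frequent class ("no arrest").
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

y_pred_dummy = dummy.predict(X_test)
print(accuracy_score(y_test, y_pred_dummy))   # ~0.7411 in my run

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_dummy)
plt.show()
```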
Step 6: Applying a Logistic Model
Part 1: Importing & fitting the model
As my predictor variables are categorical and my output variable (arrest vs. no arrest) is categorical rather than continuous, I decided logistic regression was better suited than linear regression for the analysis I wanted to do. Thus, for my next model, I chose logistic regression.
To begin, I first imported the LogisticRegression model and then fit it to my data.
Part 2: Evaluating logistic model
To evaluate this model, I calculated its accuracy for predicting arrest and found 85.66%! This was an improvement over the 74.11% baseline from my dummy classifier.
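A minimal sketch of the fit and accuracy check; max_iter is raised here only to help convergence on a dataset of this size, and is my own assumption rather than a setting from the write-up:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)  # higher max_iter to ensure convergence
logreg.fit(X_train, y_train)

y_pred_lr = logreg.predict(X_test)
print(accuracy_score(y_test, y_pred_lr))   # ~0.8566 in my run
```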
Part 3: Creating Visualizations
To look further into this model, I created a confusion matrix of its arrest predictions. The largest value is the true negatives: correctly predicted non-arrests (260,522). The second-largest is the true positives: correctly predicted arrests (51,086). There was also a significant number of false negatives (43,094). I argue that false positives are worse than false negatives in this business case: if an analytics firm or the Chicago PD predicted someone would be arrested when they otherwise would not be, it could push them toward arresting someone who perhaps does not deserve it. For this model, the false-positive rate is 3.36% of all cases that were not arrests.
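As a worked example of how that false-positive rate comes out of the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# FPR = FP / (FP + TN): the share of actual non-arrests predicted as arrests.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
fpr = fp / (fp + tn)
print(f"False-positive rate: {fpr:.2%}")   # ~3.36% for the numbers reported above
```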
Beyond the model's accuracy, I wanted more visualizations of how the arrest rate changes across different variables. To do this, I created contingency tables and mosaic plots for the variables most strongly correlated with arrest: primary type theft, primary type narcotics, and primary type criminal damage (a crosstab/mosaic sketch follows the list of tables below).
Contingency Table 1: Theft Crimes vs Arrest
Contingency Table 2: Narcotics Crimes vs Arrest
Contingency Table 3: Criminal Damage Crimes vs Arrest
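A sketch for the theft table; narcotics and criminal damage follow the same pattern. The mosaic plot assumes statsmodels is installed:

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Contingency table of theft vs. arrest.
ct_theft = pd.crosstab(model_df["Primary_Type_THEFT"], model_df["Arrest_True"])
print(ct_theft)

# Mosaic plot of the same relationship.
mosaic(model_df, ["Primary_Type_THEFT", "Arrest_True"])
plt.show()
```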
Step 7: Applying a Decision Tree Model
Part 1: Importing & fitting the model
The next model I wanted to look into was a decision tree. First, I imported DecisionTreeClassifier from sklearn.tree, set the criterion to Gini, and fit it to my data.
Part 2: Evaluating decision tree model
To evaluate the decision tree model, I found its accuracy score for predicting arrest. I found a slightly higher accuracy (85.763%) than with my logistic model (85.66%). This was also an improvement from the baseline dummy classifier accuracy of 74.11%.
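A sketch of that fit and its accuracy; the fixed random_state is my own addition for reproducibility rather than a setting from the write-up:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(criterion="gini", random_state=0)  # random_state assumed
tree.fit(X_train, y_train)

y_pred_tree = tree.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))   # ~0.8576 in my run
```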
Part 3: Creating Visualizations
To visualize the decision tree model, I started with a confusion matrix. The most populated cells were the true negatives (261,334) and true positives (50,633); however, there was again a significant number of false negatives (43,547), similar to the logistic model. Concerning false positives, the rate is slightly lower than the logistic model's: 3.06% of all non-arrests, versus 3.36% for the logistic model.
The decision tree I initially made was very large and not practical to read, which eliminated one of a decision tree's main benefits: being able to see the splits and nodes.
Part 4: Pruning the Tree
To make this tree more usable, I needed to prune it down. To do this, I limited the max depth. I tried several depths and found that a max depth of 4 still gave good accuracy (83.30%, slightly less than the 85.76% of the full tree) while being far more readable. Because the accuracy stayed close to that of the much larger tree, I did not believe it was under-fitted.
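A sketch of the pruned tree and how it could be plotted:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Cap the depth at 4 so the tree stays readable.
pruned = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
pruned.fit(X_train, y_train)
print(pruned.score(X_test, y_test))   # ~0.833 in my run

plt.figure(figsize=(20, 10))
plot_tree(pruned, feature_names=feature_cols,
          class_names=["No Arrest", "Arrest"], filled=True)
plt.show()
```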
Part 5: Finding the most important features
Finally, I wanted to look at the most important features of this model. I created a dataframe of features and their importances and printed the top 3.
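A sketch of that ranking, using the pruned tree from above (the full tree works the same way):

```python
import pandas as pd

importances = pd.DataFrame({
    "feature": feature_cols,
    "importance": pruned.feature_importances_,
}).sort_values("importance", ascending=False)
print(importances.head(3))  # top 3 most important features
```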
Step 8: Applying a Random Forest Model
Part 1: Importing & fitting model
To begin with the Random Forest Classifier, I imported it along with accuracy score, classification report, recall score, and precision score. I then fit the model with 100 estimators and a random state of 0.
Part 2: Assessing Accuracy
With the model fit, I could generate predictions (predict_rf) and calculate recall_rf, precision_rf, and the accuracy score.
I found an accuracy score of predicting arrest of 85.762%, which was just slightly lower than the Decision Tree accuracy but higher than the Logistic Model. Further, it was higher than the baseline accuracy of the dummy classifier model (74.11%).
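A sketch of the forest and its evaluation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

predict_rf = rf.predict(X_test)
print(accuracy_score(y_test, predict_rf))     # ~0.8576 in my run
print(recall_score(y_test, predict_rf))
print(precision_score(y_test, predict_rf))
print(classification_report(y_test, predict_rf))
```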
Part 3: Creating Visualizations
To visualize the random forest model, I created a confusion matrix to show its performance. It shows most predictions as true negatives (261,254), followed by true positives (50,709), and then a significant number of false negatives (43,471). This is similar to the results from the logistic and decision tree models. Concerning false positives, this model had a slightly higher rate (3.08%) than the decision tree but a lower one than the logistic model.
Part 4: Adjusting Features in Random Forest Model
Finally, I wanted to look at the impact of adjusting the number of features in the random forest model.
First, I looked at a large number of max features: 10. With this, I got an accuracy score of 85.762%, essentially the same as the previous forest model.
Next, I looked at a smaller number of max features to see how accuracy would change. With max features set to 2, the accuracy did not change significantly from the 10-feature run, suggesting that anywhere between 2 and 10 features per split makes little difference to this model's accuracy.
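A sketch of that comparison:

```python
from sklearn.ensemble import RandomForestClassifier

# Compare accuracy for a large vs. small max_features setting.
for max_feats in (10, 2):
    rf_mf = RandomForestClassifier(n_estimators=100, max_features=max_feats,
                                   random_state=0)
    rf_mf.fit(X_train, y_train)
    print(max_feats, rf_mf.score(X_test, y_test))
```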
Step 9: Selecting a final model
To select the final model, I wanted to look at two main things: accuracy and the false-positive rate. As a false positive could mean a bias toward wrongful arrest, I wanted the model with the lowest false-positive rate and the highest accuracy. To help me decide, I wrote out some summary information.
Model 1: Dummy Classifier Model
Accuracy: 74.11%
False Positive Rate: 0% (classifies everything as a non-arrest)
Model 2: Logistic Regression Model
Accuracy: 85.66%
False Positive Rate: 3.36%
Model 3: Decision Tree Model
Accuracy: 85.763%
False Positive Rate: 3.06%
Model 4: Random Forest Model
Accuracy: 85.762%
False Positive Rate: 3.08%
Overall, I believe the Decision Tree model is the best model for this case. It pairs the lowest false-positive rate (aside from the Dummy Classifier) with the highest accuracy. Further, the decision tree's ability to visualize how nodes split based on feature importance helps a user understand what drives the likelihood of arrest. I believe this model and its visualizations are the most powerful and useful for this case, and I would recommend its use.