Predicting Arrests: Looking into Chicago’s Crime through Machine Learning

Analyzing crime in Chicago from 2012–2017 with Decision Trees, Logistic Models, and Random Forest Classifiers.

Carley Williams
Analytics Vidhya
11 min read · May 8, 2021


Chicago Skyline

For this project, I used Kaggle’s dataset: Chicago Crimes 2012–2017. Before applying any models, I first needed to clean and explore my data.

Step 1: Exploring Data

To begin the exploration process, I followed many of the same steps I took in my previous publication from March 7th, “Analyzing Chicago’s Crime Rates”.

This dataset has 1,456,714 rows and 23 columns. The columns describe everything about each unique crime, with one crime per row. From latitude and longitude to FBI code to arrest status, each of the 23 columns adds more detail about the crime.

For my project, I narrowed things down to 16 columns (a loading sketch follows this list):

Date: listed date of crime

Block: block where the crime occurred

IUCR: four-digit Illinois Uniform Crime Reporting (IUCR) code

Description: Short description of the type of crime

Location description: Description of where the crime occurred

Arrest: boolean value (T/F) of whether or not an arrest was made

Domestic: boolean value (T/F) of whether or not the crime was domestic

Community Area: numeric value indicating the community area where the crime occurred

FBI Code: numeric code indicating FBI crime categorization

X & Y Coordinate: exact location where the crime occurred

Year: year the crime occurred

Latitude & Longitude: latitude and longitude information of crime
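Here is a minimal loading sketch of this step. The filename is an assumption (it may differ depending on how you download or combine the Kaggle files), and I also keep Primary Type and District, which come up later in the cleaning and modeling.

```python
import pandas as pd

# Assumed filename for the Kaggle "Crimes in Chicago" 2012-2017 data
crimes = pd.read_csv("Chicago_Crimes_2012_to_2017.csv")
print(crimes.shape)  # roughly (1456714, 23)

# Narrow to the 16 columns used in this project
cols = ["Date", "Block", "IUCR", "Primary Type", "Description",
        "Location Description", "Arrest", "Domestic", "District",
        "Community Area", "FBI Code", "X Coordinate", "Y Coordinate",
        "Year", "Latitude", "Longitude"]
crimes = crimes[cols]
```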

Step 2: Cleaning Data

Part 1: Checking for nulls

To begin the cleaning process and prepare my data for analysis, I first wanted to check whether I had any nulls. I checked which columns contained null values and found that several did: Location Description, District, Community Area, X & Y Coordinates, and both Latitude & Longitude. To determine the best way to deal with these nulls, I wanted to see what percentage of each column was null.

Shows the existence of null values in 7 columns

To find the proportion of null values in each column, I printed the percentage of nulls in each column. Nearly every column was under 1% null, but the X & Y Coordinate and Latitude & Longitude columns were each nearly 20% null. Thus, I decided to drop those columns entirely and drop the rows containing nulls in the remaining columns.

After removing and cleaning data, I had no more null values and could continue with the analysis.
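A minimal pandas sketch of this null check and clean-up, assuming the crimes DataFrame from the loading step above:

```python
# Percentage of null values per column
null_pct = crimes.isnull().mean() * 100
print(null_pct[null_pct > 0].sort_values(ascending=False))

# Drop the heavily-null coordinate columns entirely...
crimes = crimes.drop(columns=["X Coordinate", "Y Coordinate", "Latitude", "Longitude"])
# ...then drop the remaining rows that still contain any null
crimes = crimes.dropna()
print(crimes.isnull().sum().sum())  # 0 -- no nulls remain
```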

Part 2: Checking for categorical data

Next, I needed to change some column data types to be able to analyze them correctly. To do this, I checked which columns were categorical rather than purely numeric. By printing each column's unique values, I could see that IUCR, Community Area, FBI Code, and Primary Type were categorical.

This shows the unique values of each column. By scrolling over to the right, I could tell which had a finite number of categorical values and which did not.

I switched these columns' data types to categories, checked they were converted, and then was able to move forward with my analysis.
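A short sketch of the dtype conversion, under the same assumptions as above:

```python
# Convert the finite-valued columns to pandas' category dtype
for col in ["IUCR", "Community Area", "FBI Code", "Primary Type"]:
    crimes[col] = crimes[col].astype("category")
print(crimes.dtypes)  # confirm the conversion
```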

Part 3: Checking for duplicate rows

Next, I wanted to see if I had any duplicate rows and remove them. I found 3,238 duplicate rows and deleted them, keeping the first instance of each.

This shows the number of duplicate rows found: 3,238
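A two-line sketch of the duplicate check and removal:

```python
print(crimes.duplicated().sum())              # number of fully duplicated rows (3,238 here)
crimes = crimes.drop_duplicates(keep="first")
```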

Part 4: Setting date column to DateTimeIndex

To be able to analyze by date, I wanted to change the Date column to a DateTimeIndex. I made this change and checked that it was complete. This allows me to use pandas' date-based functionality.

This shows the data types, with Date converted from an object (string) to a datetime
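A sketch of the conversion; the format string is an assumption based on the dataset's MM/DD/YYYY hh:mm:ss AM/PM timestamps:

```python
# Parse the Date strings and use them as the index
crimes["Date"] = pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p")
crimes = crimes.set_index(pd.DatetimeIndex(crimes["Date"]))
```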

Step 3: Beginning Analysis

Part 1: Creating a more succinct data frame

After exploring my data, I decided I wanted to look at the relationship between different variables and the arrest rate. I knew this was a boolean value, and I thought it could be interesting to look into how things like primary type, domestic status, community area, and year impacted the arrest rate.

To begin, I narrowed my dataframe into a few columns I was interested to look into.

Selected Columns: Arrest, Primary Type, Domestic, Community Area, Year

Part 2: Creating dummy variables

Next, I created dummy variables for my non-numeric data to be able to analyze it. I created dummy variables for Primary Type, Arrest, and Domestic.
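A sketch of the narrowed frame and the dummy encoding. Renaming the base columns before get_dummies is my own assumption, made so the resulting names match the ones listed below (e.g. Primary_Type_THEFT, Arrest_True, Domestic_True):

```python
# Narrow to the columns of interest and rename to underscore style
small = crimes[["Arrest", "Primary Type", "Domestic", "Community Area", "Year"]].copy()
small.columns = ["Arrest", "Primary_Type", "Domestic", "Community_Area", "Year"]

# One-hot encode the non-numeric columns
model_df = pd.get_dummies(small, columns=["Primary_Type", "Arrest", "Domestic"])
print(model_df.columns.tolist())
```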

Part 3: Setting target and feature variables

As I wanted to explore the relationship different variables had on arrest rate, I set my target and feature columns to the following:

Target Column: ['Arrest_True']

Feature Columns: ['Community_Area', 'Year', 'Domestic_True', 'Domestic_False', 'Primary_Type_THEFT', 'Primary_Type_BATTERY', 'Primary_Type_CRIMINAL DAMAGE', 'Primary_Type_NARCOTICS', 'Primary_Type_ASSAULT', 'Primary_Type_DECEPTIVE PRACTICE', 'Primary_Type_OTHER OFFENSE', 'Primary_Type_BURGLARY', 'Primary_Type_MOTOR VEHICLE THEFT', 'Primary_Type_ROBBERY']

These target and feature columns would later allow me to build models and assess their ability to predict whether or not a crime would result in an arrest.

Part 4: Splitting X and y into training and testing data

Finally, I could split my data into training and testing data. I decided to use a random state of 0 and a test size of .25 for this analysis.
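A sketch of the target/feature setup and split, assuming the model_df built above; the cast to float is my own addition so scikit-learn receives plain numeric arrays:

```python
from sklearn.model_selection import train_test_split

target_col = "Arrest_True"
feature_cols = ["Community_Area", "Year", "Domestic_True", "Domestic_False",
                "Primary_Type_THEFT", "Primary_Type_BATTERY",
                "Primary_Type_CRIMINAL DAMAGE", "Primary_Type_NARCOTICS",
                "Primary_Type_ASSAULT", "Primary_Type_DECEPTIVE PRACTICE",
                "Primary_Type_OTHER OFFENSE", "Primary_Type_BURGLARY",
                "Primary_Type_MOTOR VEHICLE THEFT", "Primary_Type_ROBBERY"]

X = model_df[feature_cols].astype(float)   # cast categories/booleans to numeric
y = model_df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```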

Step 4: Exploring Relationships

Before applying any models, I wanted to see what the correlation was between variables. I first looked at the correlation between all variables, but there was too much data to be able to clearly see any relationships. Thus, I narrowed it down to see the correlation between a true arrest and each of the other variables outlined in my data frame.

This shows the correlation between a true arrest and the other variables. I can see that the strongest correlation, by far, is with Narcotics. Other strong correlations were Criminal Damage, Deceptive Practice, Criminal Trespass, Theft, Weapons Violation, and Prostitution.

To visualize this, I created a heatmap. However, there were too many variables, which made analysis difficult.

As you can see, with so many variables, it is difficult to see where stronger correlations are.
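A sketch of how the correlations and the first (overcrowded) heatmap can be produced, continuing from the frames above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_df = model_df.astype(float)

# Correlation of every column with a true arrest, strongest first
print(num_df.corr()["Arrest_True"].sort_values(ascending=False).head(10))

# Full heatmap -- too many variables to read comfortably
sns.heatmap(num_df.corr(), cmap="coolwarm", center=0)
plt.show()
```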

To simplify this, I wanted to narrow primary type down to the most common crimes, so I plotted a seaborn graph of the most frequent crimes.

Here you can see the most common crimes. The 10 most common are: Theft, Battery, Narcotics, Criminal Damage, Assault, Other Offense, Burglary, Deceptive Practice, Motor Vehicle Theft, and Robbery.
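A sketch of the frequency plot, assuming seaborn and the cleaned crimes frame:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Ten most frequent primary crime types
top10 = crimes["Primary Type"].value_counts().head(10).index
sns.countplot(y="Primary Type", data=crimes, order=top10)
plt.title("Most common primary crime types, 2012-2017")
plt.show()
```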

Now that I knew the most common crimes, I could narrow my heatmap down. I made a new dataframe with Arrest, Community Area, Year, Domestic status, and the four most common crimes: Theft, Battery, Narcotics, and Criminal Damage. Now, my heatmap was much easier to analyze and understand. It looked like strong correlations between arrest and other variables existed with theft, criminal damage, and narcotics, but I wanted to print some numbers to double-check.

This is a heatmap showing correlations between the four most common crimes, arrest, domestic, and year.

To do this, I created a new correlations variable and printed its values for these variables against Arrest True. It showed the strongest correlations with theft, criminal damage, and narcotics, confirming my reading of the heatmap.

This shows the correlation between arrest true and the four most common crimes, year, and domestic.

Step 5: Applying a Dummy Classifier Model

The first model I chose to look at was a Dummy Classifier Model to provide myself a baseline to judge future models on.

Part 1: Importing & fitting the model

To begin, I imported and fit the DummyClassifier using the strategy "most_frequent", which always predicts the most common class: no arrest.

Part 2: Evaluating Dummy Classifier

After fitting my model, I calculated an accuracy score of around 74.11%. This gives me a baseline accuracy against which to judge my other models.

This shows the baseline accuracy for the Dummy Classifier model

Part 3: Creating Visualizations

To create visualizations for the dummy classifier model, I made a confusion matrix to show the predictions the model made.

This shows the confusion matrix for the dummy classifier model, which predicts no arrest for every case because that was the most frequent class
These are the numbers behind the confusion matrix above for the Dummy Classifier Model
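A sketch of the baseline model, its accuracy, and its confusion matrix:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

dummy = DummyClassifier(strategy="most_frequent")  # always predicts "no arrest"
dummy.fit(X_train, y_train)

dummy_pred = dummy.predict(X_test)
print(accuracy_score(y_test, dummy_pred))      # ~0.7411 baseline
print(confusion_matrix(y_test, dummy_pred))    # every prediction lands in the "no arrest" column
```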

Step 6: Applying a Logistic Model

Part 1: Importing & fitting the model

As my predictor variables are categorical and my output variable (arrest vs. no arrest) is categorical rather than continuous, I decided logistic regression was a better fit than linear regression for the analysis I wanted to do. Thus, for my next model, I chose a logistic regression.

To begin, I first imported the LogisticRegression model and then fit it to my data.

This shows the results of fitting my data to a logistic model

Part 2: Evaluating logistic model

To evaluate this model, I created an accuracy score for predicting arrest. I found an accuracy of 85.66%! This was an improvement from the baseline 74.11% gathered from my dummy classifier.

This shows the accuracy of the Logistic Model in predicting arrest

Part 3: Creating Visualizations

To look more into this model, I created a confusion matrix showing the model's predictions of arrest. You can see the largest value is a true negative: a correct no-arrest prediction (260522). The second-largest value was a true positive: a correct arrest prediction (51086). There was also a significant number of false negatives (43094). I argue that false positives are worse than false negatives in this business case: if a crime-analytics firm or the Chicago PD were to predict someone would get arrested, but they did not, it could persuade them to arrest someone who perhaps does not deserve it. In this model, the false-positive rate is 3.36% of all cases that were not arrests.

This is the confusion matrix showing predicted arrest and true arrest to show the accuracy of the logistic model
These are the values behind the above confusion matrix
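A sketch of the logistic model, its accuracy, and the false-positive rate calculation; max_iter is raised from the default as a precaution (my assumption, not from the article):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

log_pred = logreg.predict(X_test)
print(accuracy_score(y_test, log_pred))    # ~0.8566

# False-positive rate among true non-arrests (the article reports 3.36%)
tn, fp, fn, tp = confusion_matrix(y_test, log_pred).ravel()
print(fp / (fp + tn))
```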

Beyond my model's accuracy, I wanted to look at some more visualizations of how the arrest rate changes across different variables. To do this, I created contingency tables and mosaics for the variables most strongly correlated with arrest rate: primary type theft, primary type narcotics, and primary type criminal damage (a crosstab sketch follows the three tables below).

Contingency Table 1: Theft Crimes vs Arrest

This shows the contingency table for theft arrest rates. You can see far more non-arrests than arrests for theft.

Contingency Table 2: Narcotics Crimes vs Arrest

This is the contingency table for Narcotics crime with arrest. This shows that for most narcotics crimes, there is an arrest.

Contingency Table 3: Criminal Damage Crimes vs Arrest

This is the contingency table for criminal damage crimes. This shows that for most criminal damage crimes, there is not an arrest.
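A sketch of one contingency table and mosaic; pandas' crosstab and statsmodels' mosaic are my tool assumptions, since the article does not name the functions it used. The same pattern applies to theft and criminal damage.

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Contingency table: narcotics crimes vs arrest outcome
print(pd.crosstab(model_df["Primary_Type_NARCOTICS"], model_df["Arrest_True"]))

# Mosaic plot of the same relationship
mosaic(model_df, ["Primary_Type_NARCOTICS", "Arrest_True"])
plt.show()
```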

Step 7: Applying a Decision Tree Model

Part 1: Importing & fitting the model

The next model I wanted to look into was a decision tree. First, I imported DecisionTreeClassifier from sklearn.tree and then set the criterion to Gini and fit my data.

Part 2: Evaluating decision tree model

To evaluate the decision tree model, I found its accuracy score for predicting arrest. I found a slightly higher accuracy (85.763%) than with my logistic model (85.66%). This was also an improvement from the baseline dummy classifier accuracy of 74.11%.

This shows the accuracy found with the decision tree model, slightly higher than logistic
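A sketch of the decision tree fit and its accuracy; the random_state is my own addition for reproducibility:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree_clf = DecisionTreeClassifier(criterion="gini", random_state=0)
tree_clf.fit(X_train, y_train)

tree_pred = tree_clf.predict(X_test)
print(accuracy_score(y_test, tree_pred))   # ~0.8576
```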

Part 3: Creating Visualizations

To create visualizations for the decision tree model, I started with a confusion matrix. The most populated cells were true negatives (261334) and true positives (50633); however, there was a significant number of false negatives (43547). This was similar to the logistic model. Concerning false positives, this model has a slightly lower rate than the logistic model: 3.06% of all non-arrests, versus 3.36% for the logistic model.

This shows the confusion matrix for the decision tree model, showing the most TP & TN followed by FN.
This shows the numbers behind the above confusion matrix.

The decision tree I initially made was very large and not realistic for analysis, which eliminated one of the decision tree's main benefits: being able to see the splits and nodes.

This is the original tree, which is too complex, large, and potentially overfit.

Part 4: Pruning the Tree

To make this tree more usable, I needed to prune it down. To do this, I decided to limit the max depth. I tried different depths and found that a max depth of 4 still had good accuracy (83.30%, slightly less than the 85.76% accuracy of the larger tree) while being realistic for analysis. This led me to believe it was not under-fitted, as the accuracy was still close to that of the much larger tree.

This shows the decision tree with a max depth of 4. The most important feature was whether or not it was a narcotics crime, which I suspected from finding it to be the most correlated variable with arrest.
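A sketch of the pruned tree and its plot, assuming scikit-learn's plot_tree:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

pruned = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
pruned.fit(X_train, y_train)
print(pruned.score(X_test, y_test))        # ~0.833, close to the full tree

plt.figure(figsize=(20, 10))
plot_tree(pruned, feature_names=feature_cols,
          class_names=["No Arrest", "Arrest"], filled=True, fontsize=8)
plt.show()
```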

Part 5: Finding the most important features

Finally, I wanted to look at the most important features of this model. I created a dataframe of features and their importance and printed the top 3.

This shows the top three features by importance: primary type narcotics, assault, and community area.
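A sketch of how the feature-importance table can be built from the fitted tree:

```python
import pandas as pd

# Swap in `pruned` instead of `tree_clf` to inspect the pruned tree
importances = pd.DataFrame({
    "feature": feature_cols,
    "importance": tree_clf.feature_importances_,
}).sort_values("importance", ascending=False)
print(importances.head(3))
```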

Step 8: Applying a Random Forest Model

Part 1: Importing & fitting model

To begin with the Random Forest Classifier, I imported it along with accuracy_score, classification_report, recall_score, and precision_score. I then fit my model with 100 estimators and a random state of 0.

Part 2: Assessing Accuracy

With the model fit, I could calculate predictions (predict_rf) along with recall_rf, precision_rf, and the accuracy score.

I found an accuracy score for predicting arrest of 85.762%, just slightly lower than the Decision Tree's accuracy but higher than the Logistic Model's. It was also higher than the baseline accuracy of the dummy classifier model (74.11%).

This shows the accuracy of the random forest model of around 85.762%
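A sketch of the random forest fit and its metrics:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

predict_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, predict_rf)    # ~0.8576
recall_rf = recall_score(y_test, predict_rf)
precision_rf = precision_score(y_test, predict_rf)
print(accuracy_rf, recall_rf, precision_rf)
print(classification_report(y_test, predict_rf))
```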

Part 3: Creating Visualizations

To visualize the random forest model, I created a confusion matrix to show its accuracy. It shows most predictions as true negatives (261254), followed by true positives (50709), and then a significant number of false negatives (43471). This was similar to the results from the logistic and decision tree models. Concerning false positives, this model had a slightly higher rate (3.08%) than the decision tree but a lower rate than the logistic model.

This shows a confusion matrix of the random forest model, showing the most TN, then TP, then FN.
These are the numbers behind the above confusion matrix for the random forest model.

Part 4: Adjusting Features in Random Forest Model

Finally, I wanted to look at the impact of adjusting the number of features in the random forest model.

First, I looked at a large number of max features: 10. With this, I got an accuracy score of 85.762%, essentially the same as the previous forest model.

This shows the accuracy score with 10 max features

Next, I looked at a smaller number of max features to see how accuracy would change. I changed max features to 2. This did not change accuracy significantly from the 10-feature run, suggesting that moving from 10 features down to 2 has little effect on this model's accuracy.
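A sketch of the max_features comparison:

```python
from sklearn.ensemble import RandomForestClassifier

# Compare a high and a low max_features setting
for m in (10, 2):
    rf_m = RandomForestClassifier(n_estimators=100, max_features=m, random_state=0)
    rf_m.fit(X_train, y_train)
    print(m, rf_m.score(X_test, y_test))   # ~85.76% for both settings in my write-up
```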

Step 9: Selecting a final model

To select the final model, I wanted to look at two main things: accuracy and false-positive rate. As a false positive could mean a bias toward false arrest, I want to select the model with the lowest false-positive rate and the highest accuracy. To help me make this decision, I summarized the key numbers.

Model 1: Dummy Classifier Model

Accuracy: 74.11%

False Positive Rate: 0% (classifies every case as a non-arrest)

Model 2: Logistic Regression Model

Accuracy: 85.66%

False Positive Rate: 3.36%

Model 3: Decision Tree Model

Accuracy: 85.763%

False Positive Rate: 3.06%

Model 4: Random Forest Model

Accuracy: 85.762%

False Positive Rate: 3.08%

Overall, I believe the Decision Tree model is the best model for this case. It combines the lowest false-positive rate (aside from the Dummy Classifier) with the highest accuracy. Further, the decision tree's ability to visualize how nodes split on the most important features helps a user understand the likelihood of arrest. In short, I believe this model and its visualizations are the most powerful and useful in this case, and I would recommend its use.
