Modeling Chicago Crime Data Set

Johana Luna
Oct 3, 2019 · 7 min read

This blog post intends to explore crime data in Chicago and showcase the implementation of a predictive model for arrests in the city. This could help public institutions in three main ways:

  1. Create better public policy for correctional agencies
  2. Help focus the countermeasures on negatively impacted crime categories according to the prediction
  3. Guide the resource allocation by crime categories

Understanding Crime in Chicago

Chicago, the nation’s third-biggest city, accounted for 22% of the nationwide increase in murders in 2016, with 749 murders (see right chart below). That is more than the number of murders in the largest city, New York (334), and the second-largest, Los Angeles (294), combined for the same year. The estimated number of homicides in Chicago increased by 52% in 2016.

By comparison, the number of homicides rose by 8.6% in the United States as a whole (1), making Chicago an outlier and an interesting case to analyze. The vast majority of these killings happened in five mostly black and Latino neighborhoods on the South and West Sides, where only 9% of the city’s 2.7 million residents live (2).



  • The data was extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
  • For the analysis I selected 2015–2017 data.
  • This data set contains 539,814 observations and 23 features.


I focused on 12 key features for my analyses:

  1. Case Number — The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
  2. Date — Date when the incident occurred.
  3. Primary Type — The primary description of the IUCR code.
  4. Arrest — Indicates whether an arrest was made.
  5. Domestic — Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  6. Beat — Indicates the beat where the incident occurred. A beat is the smallest police geographic area — each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at(3)
  7. District — Indicates the police district where the incident occurred. See the districts at(4)
  8. Ward — The ward (City Council district) where the incident occurred. See the wards at(5)
  9. FBI Code — Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at(6)
  10. Year — Year the incident occurred.
  11. Latitude — The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  12. Longitude — The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.


Feature Engineering:

I started by cleaning and feature engineering the data set. The clean-up approach had seven steps:

  1. Check how many missing values there are in each feature. Where missing values made up less than 10% of a feature’s values, they were dropped
  2. Check the feature type and correct it if necessary
  3. Drop duplicate rows
  4. Check for location outliers and eliminate them if needed
  5. Replace False and True by zeros and ones
  6. Create new features by extracting the month, day and hour from the ‘Date’ column
  7. Drop features such as ‘ID’ and ‘Updated On’ because they don’t carry any relevant information for the analysis, and drop ‘Date’ to avoid duplication (see step 6)
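
The seven steps above could look roughly like this in pandas. This is a minimal sketch and not the notebook's actual code: the function name `clean_crimes`, the timestamp format, the bounding box around Chicago, and the toy rows are all assumptions made for illustration, and step 1 is read here as dropping the rows that contain the missing values.

```python
import pandas as pd

# Assumed timestamp layout of the 'Date' column (month/day/year, 12-hour clock).
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def clean_crimes(raw: pd.DataFrame, max_missing: float = 0.10) -> pd.DataFrame:
    """Apply the seven clean-up steps to a raw crimes data frame (hypothetical helper)."""
    df = raw.copy()

    # Step 1: measure the share of missing values per feature and drop the
    # affected rows when that share is below the threshold.
    missing_share = df.isna().mean()
    sparse = missing_share[(missing_share > 0) & (missing_share < max_missing)]
    if not sparse.empty:
        df = df.dropna(subset=list(sparse.index))

    # Step 2: correct feature types (here: parse the timestamp).
    df["Date"] = pd.to_datetime(df["Date"], format=DATE_FORMAT)

    # Step 3: drop duplicate rows.
    df = df.drop_duplicates()

    # Step 4: drop location outliers with a rough bounding box around Chicago
    # (the box limits are an assumption for illustration).
    in_city = df["Latitude"].between(41.6, 42.1) & df["Longitude"].between(-88.0, -87.5)
    df = df[in_city].copy()

    # Step 5: replace False/True with 0/1.
    for col in ("Arrest", "Domestic"):
        df[col] = df[col].astype(int)

    # Step 6: extract month, day and hour from 'Date'.
    df["Month"] = df["Date"].dt.month
    df["Day"] = df["Date"].dt.day
    df["Hour"] = df["Date"].dt.hour

    # Step 7: drop features that carry no information, plus 'Date' itself.
    df = df.drop(columns=["ID", "Updated On", "Date"], errors="ignore")
    return df.reset_index(drop=True)


# Toy rows (not real records): one exact duplicate and one location outlier.
raw = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4],
    "Date": ["01/05/2016 01:30:00 PM", "02/10/2016 09:00:00 AM",
             "02/10/2016 09:00:00 AM", "03/15/2017 11:45:00 PM",
             "04/01/2015 12:00:00 AM"],
    "Arrest": [True, False, False, True, False],
    "Domestic": [False, False, False, True, False],
    "Latitude": [41.88, 41.75, 41.75, 36.60, 41.90],
    "Longitude": [-87.63, -87.60, -87.60, -91.70, -87.70],
    "Updated On": ["02/10/2018"] * 5,
})
cleaned = clean_crimes(raw)
print(cleaned)
```

In the toy frame the duplicate row and the row far outside the city are removed, so three rows remain.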

Target selection:

I defined three things that would be interesting to predict with this data:

  1. The ward where a crime will happen
  2. The type of crime (Column ‘Primary Type’)
  3. If a crime will end up in an arrest

Due to the high cardinality of Ward and Primary Type (see table below), I decided to use the ‘Arrest’ feature as the target.
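
One way to compare the cardinality of the candidate targets is to count the distinct values per column. The sketch below uses a handful of made-up rows, so the counts are not those of the real data set.

```python
import pandas as pd

# Toy rows standing in for the real data set.
df = pd.DataFrame({
    "Ward": [1, 2, 3, 4, 1, 2],
    "Primary Type": ["THEFT", "BATTERY", "NARCOTICS", "THEFT", "ASSAULT", "BURGLARY"],
    "Arrest": [0, 1, 1, 0, 0, 1],
})

# Number of distinct values in each candidate target column.
cardinality = df[["Ward", "Primary Type", "Arrest"]].nunique()
print(cardinality)
```

A binary column such as ‘Arrest’ always has a cardinality of two, which makes it the simplest of the three to model.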

Data Split:

The process of splitting the data set was done in two steps:

  1. Extract the X_features and the y_target from my data frame
  2. Split the data using train_test_split from scikit-learn
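
The two steps can be sketched as follows. Because the post reports both validation and test accuracy, the sketch splits twice to get three sets. The split ratios, the `stratify` choice, the random seed, and the toy columns are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data frame (synthetic values) with a binary 'Arrest' target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Beat": rng.integers(100, 2600, size=100),
    "Hour": rng.integers(0, 24, size=100),
    "Arrest": [1] * 25 + [0] * 75,
})

# Step 1: separate the features from the target.
X = df.drop(columns=["Arrest"])
y = df["Arrest"]

# Step 2: hold out a test set, then carve a validation set out of the rest.
# Stratifying keeps the arrest rate similar in every split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```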

Baseline definition:

The objective of a baseline is to create an initial prediction and calculate an accuracy percentage. This will be the benchmark to beat with the future predictive model. In this case, I used the mode as the prediction because my target is categorical.

Results: The accuracy of the baseline is 77.19%.

The ROC curve tells me this baseline has no capacity to discriminate between the positive class and the negative class.
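
A mode baseline can be sketched with the standard library alone. The labels below are synthetic, and the helper functions `accuracy` and `roc_auc` are written for this illustration. The example also shows why a constant prediction has no discrimination capacity: every positive ties with every negative, so the area under the ROC curve is 0.5.

```python
from statistics import mode


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels: 1 = arrest, 0 = no arrest (not the real data).
y_train = [0] * 77 + [1] * 23
y_val = [0] * 15 + [1] * 5

majority = mode(y_train)            # most frequent class in the training labels
y_pred = [majority] * len(y_val)    # the baseline predicts it for every row

baseline_accuracy = accuracy(y_val, y_pred)
baseline_auc = roc_auc(y_val, y_pred)
print(baseline_accuracy, baseline_auc)
```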

Model Selection:

I tried three methods and compared their results to choose the one that predicts with the greatest accuracy:

  1. Logistic Regression: Used when the dependent variable (target) is categorical

  • Accuracy for the validation data = 0.7722
  • Accuracy for the test data = 0.7708

  2. XGBoost: A decision-tree-based machine learning algorithm. Gradient-boosted trees (GBT) are built one at a time, where each new tree helps to correct errors made by the previously trained trees

  • Accuracy for the validation data = 0.8833
  • Accuracy for the test data = 0.8831

  3. RandomForestClassifier: Random forests (RFs) train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit on the training data *5

  • Accuracy for the validation data = 0.8891
  • Accuracy for the test data = 0.8904
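
A comparison loop along these lines could produce such a table. This is a hedged sketch on synthetic data, so its accuracies say nothing about the crime data. Note one deliberate swap: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the example free of extra dependencies. The hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the crime features and the 'Arrest' target.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.77, 0.23], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # Stand-in for XGBoost: scikit-learn's own gradient-boosted trees.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training split and score it on the validation split.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = accuracy_score(y_val, model.predict(X_val))

best = max(val_accuracy, key=val_accuracy.get)
for name, score in val_accuracy.items():
    print(f"{name}: {score:.4f}")
print("Best on validation:", best)
```

The test split should only be scored once, after the model has been chosen on the validation split.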

Selection: The Random Forest Classifier had the best performance, with 89% accuracy.

Confusion Matrix for Random Forest Classifier

The numbers of correct and incorrect predictions are summarized as counts and broken down by class.

As the table below shows, the model predicted 72,840 (the sum of the diagonal) of the 81,660 validation-set observations correctly, resulting in 89% accuracy.
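
The link between the confusion matrix and accuracy can be sketched in a few lines. The labels below are a toy example because the individual cells of the real matrix are not reproduced here. Only the two totals quoted above (72,840 and 81,660) come from the post.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy_from_matrix(matrix):
    """Correct predictions sit on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels (not the real validation set).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
matrix = confusion_matrix(y_true, y_pred)
toy_accuracy = accuracy_from_matrix(matrix)
print(matrix, toy_accuracy)

# The same ratio with the totals reported for the validation set.
reported_accuracy = 72_840 / 81_660
print(round(reported_accuracy, 3))
```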

Feature Importances

I checked the feature importances using the following methods:

  • Dropping manually: The tree-based feature importance ranks the features
  • Using Eli5: Provides a way to compute feature importance for any black-box estimator by measuring how the score decreases when a feature is not available *6
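
The idea behind the second method, permutation importance, can be sketched with scikit-learn's `permutation_importance`, used here as a stand-in for Eli5. The data is synthetic and built so that only the first feature carries signal. The feature names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, drop in ranking:
    print(f"{name}: {drop:.3f}")
```

Shuffling a feature the model relies on hurts the score, while shuffling a noise feature leaves it roughly unchanged.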

Sample Explanation:

SHAP (SHapley Additive exPlanations) is a method to explain individual predictions.

“SHAP values attribute to each feature the change in the expected model prediction when conditioning on that feature. They explain how to get from the base value that would be predicted if we did not know any features to the current output. This diagram shows a single ordering. When the model is non-linear or the input features are not independent, however, the order in which features are added to the expectation matters, and the SHAP values arise from averaging the values across all possible orderings”(7)
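
The averaging over orderings described in the quote can be made concrete with a brute-force sketch. It is a simplification rather than the SHAP library's algorithm: "not knowing" a feature is modeled by substituting a single baseline value instead of an expectation over a background data set, and the toy model and inputs are made up.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with every feature "unknown"
        previous = predict(current)
        for i in order:
            current[i] = instance[i]      # reveal feature i
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def toy_model(features):
    """Non-linear toy model: b and c only matter together."""
    a, b, c = features
    return 2 * a + b * c


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The interaction term `b * c` is split evenly between `b` and `c`, and the values add up to the difference between the prediction for the instance and the prediction for the baseline. The loop visits every ordering, so this only scales to a handful of features.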

Top 3 reasons for prediction:

  1. IUCR is 1320
  2. Description is ‘TO VEHICLE’
  3. FBI CODE is 14

Partial Dependence:

Primary Type codes:

1) Assaults
2) Battery
3) Criminal Damage
5) Theft
6) Burglary
8) Narcotics
10) Deceptive Practice
30) Non-Criminal

The PDP shows how the Primary Type and Domestic features interact in the predicted probability of an arrest. The plot shows that the arrest probability increases when the Primary Type is Narcotics. On the other hand, crimes such as criminal damage, theft and burglary have a lower predicted arrest probability.

To clarify, the PDP shows a correlation between the features and the target and does not claim to explain causality.
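
What a partial dependence value measures can be sketched by hand: force one feature to a fixed value for every row, predict, and average. The example below is a one-feature simplification of the two-feature plot discussed above. The data is synthetic and built so that a hypothetical `is_narcotics` flag raises the arrest rate, and `partial_dependence_1d` is a helper written for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def partial_dependence_1d(model, X, feature, grid):
    """Average predicted probability of class 1 with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_forced = X.copy()
        X_forced[:, feature] = value      # overwrite the feature for every row
        averages.append(model.predict_proba(X_forced)[:, 1].mean())
    return averages


# Synthetic data: column 0 is a hypothetical narcotics flag, column 1 a domestic flag.
rng = np.random.default_rng(0)
is_narcotics = rng.integers(0, 2, size=400)
domestic = rng.integers(0, 2, size=400)
X = np.column_stack([is_narcotics, domestic]).astype(float)

# Made-up arrest rates: high for narcotics rows, low otherwise.
arrest_prob = np.where(is_narcotics == 1, 0.9, 0.1)
y = (rng.random(400) < arrest_prob).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pdp = partial_dependence_1d(model, X, feature=0, grid=[0.0, 1.0])
print(pdp)
```

The averages describe how the model's predictions move with the feature, which is an association learned from the data and not a causal effect.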

Documentation and NoteBook here

Analytics Vidhya
