Dissecting Crime in Chicago

Corey Scamman
Bucknell AI & CogSci
6 min read · Nov 17, 2019


By Corey Scamman, Dempsey Wade, and Sam Zimmerman

For our midterm project in Intro to Artificial Intelligence (CSCI 379), our group chose to explore how AI can help analyze and predict crime and crime rates in Chicago based on years of the city’s police records. We also wanted to determine whether there are correlations between variables in the data set; for example, how location affects the likelihood that an incident is domestic and the probability that an arrest is made. We use a combination of association rule mining, statistical analysis, and neural networks to accomplish this.

Our data set for this project is the Chicago PD police records from 2001 to 2017, which contain information about each recorded incident. Some of the relevant fields in these records are the case number, date, type of crime, community area, whether an arrest was made, and whether the crime was domestic. The entire data set is around 2 GB and has hundreds of thousands of entries. Since the data has some irrelevant variables and missing information, we first had to clean the data set so that it would be compatible with the different types of software we were using. Once that was finished, we were able to move forward with our implementation.
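As a rough illustration of that cleaning step, here is a minimal pandas sketch. The file name and the exact columns kept are assumptions based on the Kaggle export, not our exact pipeline.

```python
import pandas as pd

# Load the Kaggle export of the Chicago police records (file name assumed).
crimes = pd.read_csv("Chicago_Crimes_2001_to_2017.csv")

# Keep only the fields our analyses use and drop rows with missing values.
columns = ["Case Number", "Date", "Primary Type", "Community Area",
           "Arrest", "Domestic", "Year"]
crimes = crimes[columns].dropna()

# Parse the timestamp so records can be grouped by date later.
crimes["Date"] = pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p")
```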

Our first approach was to use frequent pattern mining. Using TensorFlow and apriori, we were able to dive deeper into the data set. We were curious to find which attributes were closely related to each other. In our analysis, we grouped the data set by different variables to find the support for each attribute within the group. For example, when we grouped the data set by Domestic = True, we found that there was a 20% arrest rate and that 61% of domestic incidents involved battery. Similarly, when we grouped by Arrest = True, the three attributes with the highest support were theft (25%), battery (23%), and criminal damage (19%). We then used frequent pattern mining to find the attributes with the highest arrest rates. We found that arson had an arrest rate of 90%, which makes sense: a fire is unlikely to be classified as arson unless there is evidence of a motive. We also found that narcotics had a 37% arrest rate, almost double the average arrest rate.
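The listing below is a sketch of how such support values can be computed, using pandas group-bys plus the apriori implementation from mlxtend (referenced at the end of this post) rather than our exact code; it continues from the cleaned `crimes` DataFrame above and assumes pandas parsed the Arrest and Domestic columns as booleans.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Support of attributes among domestic incidents via a simple filter.
domestic = crimes[crimes["Domestic"] == True]
print((domestic["Primary Type"] == "BATTERY").mean())  # share of domestic incidents that are battery
print(domestic["Arrest"].mean())                       # arrest rate for domestic incidents

# Frequent itemsets over one-hot encoded attributes.
items = pd.get_dummies(crimes["Primary Type"]).astype(bool)
items["Arrest"] = crimes["Arrest"].astype(bool)
items["Domestic"] = crimes["Domestic"].astype(bool)
frequent = apriori(items, min_support=0.05, use_colnames=True)
print(frequent.sort_values("support", ascending=False).head(10))
```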

The visual below shows one of the results from our frequent pattern mining, for the grouping “Arrest = False”. Here 33 is the numerical label for theft, and its support indicates that 25% of incidents without an arrest were thefts; 2 refers to battery and 6 to criminal damage.

Our second approach was to create a neural network with Keras and a TensorFlow backend. The goal was to use the total number of crimes in each year to predict the number of crimes in 2020. To make a prediction, we decided to use the totals from the previous five years to estimate the number of crimes in the next year. To model this decision, there are five nodes in the input layer of our neural network. This propagates down to one output node, which gives the predicted number of crimes in the next year. One issue with our data here is that the record for 2019 is incomplete; since that year’s total is understated, it could have a significant impact on the prediction for 2020.
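A minimal sketch of the kind of network we mean is shown below. The layer sizes and training settings are illustrative choices rather than our tuned model, and it again assumes the cleaned `crimes` DataFrame from earlier.

```python
import numpy as np
from tensorflow import keras

# Total number of recorded crimes in each year, in chronological order.
yearly_totals = crimes.groupby("Year").size().sort_index().to_numpy(dtype=float)

# Sliding windows: five consecutive yearly totals in, the following year's total out.
X = np.array([yearly_totals[i:i + 5] for i in range(len(yearly_totals) - 5)])
y = yearly_totals[5:]

# Scale the counts so the network trains on values near 0-1.
scale = yearly_totals.max()
X, y = X / scale, y / scale

# Five input nodes feeding a small hidden layer and a single output node.
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(5,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=500, verbose=0)

# Predict the next year's total from the most recent five years.
next_year = model.predict(yearly_totals[-5:].reshape(1, 5) / scale) * scale
print(next_year)
```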

Our final approach was statistical analysis. We made histograms of crime by location and arrest, calculated relevant statistics, and performed a chi-square association test. To accommodate our statistical software, we only used crimes from 2012–2017 and had to further trim the data set so the variables had matching lengths. We found that significantly more crimes took place in community area 25 than in other areas. Chicago is divided into 77 community areas, each referring to a different part of the city. Area 25 is on Chicago’s West Side, which has long-standing socioeconomic issues as well as gang violence and shootings. The West Side has a low number of domestic cases relative to its total number of crimes, which is consistent with our earlier claim that the area’s crime is driven largely by gang violence. Of all the crimes on the West Side, an arrest is made only a bit over 50% of the time, with an even lower percentage in other communities.

Out of roughly 100,000 records across all of Chicago, an arrest is made only 27.2% of the time. This value varies depending on the crime committed, as some crimes have higher or lower arrest rates. The most common types of crime are theft and battery. Below is a chi-square association test of the relationship between the type of crime and whether an arrest was made. According to these data, an arrest is made in only 10% of thefts and 18% of batteries.
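We ran the test in a separate statistics package, but an equivalent test can be sketched in Python with scipy, again assuming the cleaned `crimes` DataFrame from earlier:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of primary crime type versus whether an arrest was made.
table = pd.crosstab(crimes["Primary Type"], crimes["Arrest"])

# Chi-square test of independence between crime type and arrest outcome.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}, dof = {dof}")

# Per-crime arrest rates, e.g. for theft and battery.
arrest_rates = crimes.groupby("Primary Type")["Arrest"].mean()
print(arrest_rates.loc[["THEFT", "BATTERY"]])
```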

Our results and analysis can be helpful to police departments, but we have to consider the ethical and societal implications of such data and algorithms. Computers and technology have already advanced the capabilities of the police with regard to data collection. Crimes used to go unrecorded or were kept only on paper, whereas today they are logged in computer systems that follow people for their entire lives. This advancement has discouraged crime, as people will likely not be able to remove incidents from their records. With technology advancing and the prospect of algorithms that can predict future crimes, we have to think about the ethical and societal implications that come along with this.

Our data set is inherently biased, as it only tracks reported crimes, whether or not they led to an arrest. An unknowable number of crimes are left out of the data set, which skews it away from reality. Additionally, since algorithms learn from the data we train them on, it is a safe assumption that these algorithms will learn the biases in that data. For example, a large portion of the arrest records involve low-income African American and Hispanic males. There are many pervasive socioeconomic reasons why this is the case, but that isn’t the point here. Any crime prediction algorithm will be fed data full of records of low-income African American and Hispanic males, and given how these algorithms learn, they will begin to predict that African American and Hispanic males will commit crimes in the future. The issue is that technology will then reinforce society’s stereotypes of these groups, furthering racial oppression. Furthermore, because technology is often seen as “correct” since it stems from science and math, people are unlikely to question the output of a biased crime prediction algorithm. These biases in our data, and subsequently in our algorithms, could result in unjust arrests and the oppression of a certain demographic.

Should we work on this in the future, we would like to expand on the techniques we have used and refine and add to the work we have already done. In particular, we would like to improve the accuracy of our neural network and add prediction capabilities, such as using time and location to predict the type of crime.

References

Brownlee, J. (2019, October 3). Your First Deep Learning Project in Python with Keras Step-By-Step. Retrieved November 1, 2019, from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/.

Remanan, S. (2018, November 2). Association Rule Mining. Retrieved October 25, 2019, from https://towardsdatascience.com/association-rule-mining-be4122fc1793.

Raschka, S. (n.d.). Mlxtend (machine learning extensions) documentation. Retrieved November 1, 2019, from http://rasbt.github.io/mlxtend/.

Our Data Set

Currie32. (2017, January 28). Crimes in Chicago. Retrieved from https://www.kaggle.com/currie32/crimes-in-chicago/data.
