Can machines predict and prevent crimes?

Perhaps better than humans!

Contrary to popular Hollywood imagination, machines have yet to emerge as super-humans that can outsmart, or even match, human intelligence. But there are some arenas in which machine intelligence can support, and at times outdo, the limited capacity of the human brain. Predictive policing is one domain where ‘machine learning’ can support humans to make a positive social impact.

Predictive policing refers to gathering large volumes of data and applying algorithms to deduce where and when crimes are most likely to occur. A few such software tools are used by police departments across the USA. The same model can also be applied to Indian cities, given sufficient digitization of crime data.

In particular, PredPol, used by the LAPD and the Atlanta Police Department, uses the place, time and type of crime to create hot-spot maps that help police decide nightly patrol routes. Another ML tool, HunchLab, used by the NYPD and the Miami Police Department, focuses on social and behavioural analysis to generate predictions. According to the LAPD, a significant percentage drop in burglaries was reported after the deployment of machine learning tools. Even Sherlock Holmes would have been impressed by the logic used in such machine-assisted detective work!

Let’s take a quick peek at the science behind such software. In this case study, we consider the county-level crime database of North Carolina, US, and build a simple “linear model” on top of it to predict future crimes. As we go, we will find out whether the assumed linear relationship between the features and the crime rate really exists and, if so, what the properties of that relationship are.

What is a linear model?

A linear model assumes a linear relationship between the dependent and independent variables. The dependent variable is the value we need to predict (here, crime rate). When the dependent variable is a real value, we use a linear regression model; when it is categorical, a logistic regression model. Since crime rate is a real value, we will use linear regression to model the problem.

Linear Regression

Using linear regression, we look for a linear equation, y = ax + b, that captures the relationship in the data, where a is the slope and b is the intercept. Given such a trend line, it is easy to read off the corresponding y-value for a future x-value.
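As a minimal sketch of this idea, the snippet below fits a trend line to a handful of made-up points using NumPy. The numbers are purely illustrative and do not come from the crime dataset.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1 (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the slope a and the intercept b.
a, b = np.polyfit(x, y, deg=1)

# Use the fitted line to predict y for a future x-value.
x_future = 6.0
y_future = a * x_future + b
print(f"a={a:.2f}, b={b:.2f}, prediction at x={x_future}: {y_future:.2f}")
```

With real data the points will not sit exactly on a line, and the fit returns the line that minimizes the squared vertical distances to the points.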

Table Of Contents

b) Data Analysis
    2. Bivariate Analysis
        Box Plot
        Violin Plot
        Linear Regression Fit of Strongly Correlated Features
        Feature-Feature Correlation Analysis
        Zoomed HeatMap
d) Data Cleaning
g) Drawing Conclusions

Applying Linear Regression: Steps

This blog is an attempt to showcase how a linear regression model can be used to predict crime rates. We will walk through the steps below, alongside Python code, for better clarity. Please note that real-world software uses more complex models, and the input dataset could be much larger.

The following steps outline the machine learning approach to the problem. After training, we can persist the model [3]; the persisted model is essentially what gets deployed as ML software and used for prediction later.

b) Data Analysis: To analyse and prepare data for EDA

c) Exploratory Data Analysis (EDA): To visualize distributions and draw correlations between attributes. There are 2 ways of doing EDA:

i) Univariate Analysis: To find out how helpful a single feature in the dataset is in determining the target feature, i.e. crime rate.

ii) Bivariate Analysis: To find the relationship between given attributes and crime rate.

d) Data Cleaning: To cleanse the data, based on Data Analysis and EDA, by removing irrelevant and incomplete information. This is the most important step.

e) Model Building: To develop a suitable Linear Model with crime rate as the dependent variable, based on the findings of EDA.

f) Training & Evaluation: To train the model and evaluate prediction performance.

g) Drawing Conclusions: To appraise and suggest improvements to the model.

Data Description

The dataset contains data for crime rate in the state of North Carolina aggregated by county.

Data Attributes

The column attributes in the dataset are defined here:

In the dataset, there are attributes such as conviction probability, police per capita, population density, region, minority percentage, % of young population etc. which could potentially impact crime rate. But let’s evaluate which attributes are actually helpful in practice before building the final model.

Step-by-step Code Walk-through

Sample Output:

b) Data Analysis

Real-world data is rarely perfect. We will find anomalies in the input data and remove them by following the steps below.

Observation: The last column was read as ‘object’ data instead of ‘float’. This was found to be due to a special symbol in the input .csv file.
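One way to handle such a column is sketched below with pandas. The tiny in-memory CSV, the column name and the stray backtick are all assumptions made for illustration; the actual symbol in the input file may differ.

```python
import io
import pandas as pd

# Tiny stand-in for the input file; the stray "`" mimics a special symbol.
# Column names and values here are hypothetical.
raw_csv = io.StringIO("county,pctymle\n1,0.0998\n3,0.0796\n5,0.0830`\n")
df = pd.read_csv(raw_csv)
print(df["pctymle"].dtype)  # not a float dtype, one cell is not a clean number

# Strip everything except digits, sign, decimal point and exponent, then convert.
cleaned = df["pctymle"].astype(str).str.replace(r"[^0-9eE+\-.]", "", regex=True)
df["pctymle"] = pd.to_numeric(cleaned, errors="coerce")
print(df["pctymle"].dtype)
```

Passing errors="coerce" turns anything that still cannot be parsed into NaN instead of raising, so leftover bad cells show up in a later missing-value check.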

Observation: The maximum values of the probability features, prbarr & prbconv, are found to be > 1, which is a data anomaly since a probability cannot exceed 1. We have removed such rows from further analysis.
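A minimal sketch of this filter is shown below. The rows are hypothetical; only the column names prbarr and prbconv come from the dataset.

```python
import pandas as pd

# Hypothetical rows; prbarr and prbconv are probabilities and must lie in [0, 1].
df = pd.DataFrame({
    "county": [1, 3, 5, 7],
    "prbarr": [0.29, 1.09, 0.44, 0.36],
    "prbconv": [0.52, 0.40, 2.12, 0.47],
})

# Keep only rows where both probability columns are valid.
valid = df["prbarr"].between(0, 1) & df["prbconv"].between(0, 1)
df_clean = df[valid].reset_index(drop=True)
print(df_clean)
```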

Observation: As there are no missing values in the input, data imputation is not required. Data imputation is a method of replacing missing values with substitutes.

c) Exploratory Data Analysis (EDA):

1. Univariate Analysis:

We will compare the distribution of the target variable with those of the remaining features to identify the predictive power of individual variables.

Distribution of Target Variable

Distribution of All Features

Observation: The features density, mix, police per capita, probability of conviction and tax revenue per capita seem to have distributions similar to that of crime rate.

Probability/ Cumulative Distribution Function (CDF)

Observations:

a) Strangely, more than 95% of the weekly wages in the service industry (wser) lie below 400, but the maximum wage is around 2250. Hence, we will remove the outlier, “county 185”, from the input data.

b) Though the maximum value of tax revenue per capita is 120, more than 50% of the values lie below 40.

c) Though the maximum value of police per capita is 0.009, more than 60% of the values lie below 0.001.
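Statements of the form "more than X% of values lie below T" can be read directly off an empirical CDF. The sketch below uses made-up wage values with one extreme entry, not the real wser column.

```python
import numpy as np

# Made-up weekly wages with one extreme value (illustrative only).
wser = np.array([210.0, 230.0, 250.0, 260.0, 275.0,
                 290.0, 310.0, 330.0, 350.0, 2250.0])

# Empirical CDF: sort the values and assign cumulative fractions.
sorted_vals = np.sort(wser)
cdf = np.arange(1, len(sorted_vals) + 1) / len(sorted_vals)

# Fraction of observations below a threshold, computed directly.
share_below_400 = np.mean(wser < 400)
print(f"share below 400: {share_below_400:.0%}, max: {wser.max():.0f}")
```

Plotting sorted_vals against cdf gives the CDF curve; a long flat stretch before the last point is the visual signature of an outlier like the one described in a).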

Let’s examine further using bivariate analysis.

2. Bivariate Analysis:

Bivariate visualization is performed to find the relationship between each variable in the dataset and the target variable of interest, i.e. crime rate.

Observations:

a) Based on the above pair plot, the density feature is the most positively correlated with crime rate.

b) Strangely, the weekly wage features and crime rate are found to be slightly positively correlated. This could signify unequal distribution of income or possibly a high unemployment rate.

Similarly, we can find if there is any correlation among features across location: ‘west’, ‘central’ & ‘urban’.

Tip: As a data scientist, it is always better to know the practical context of the problem under observation. In this case, we should know that North Carolina is bounded by the Appalachian Mountains on the west and the Atlantic coast on the east. Hence, the frequency & type of crimes could differ significantly between west and east.

To draw the differences, let’s do a box plot & violin plot of crime rate against boolean features ‘west’, ‘central’ & ‘urban’.

Box Plot: Location

Violin Plot: Location

Observations:

a) The crime rate in urban areas is found to be significantly higher. Thus, the feature ‘urban’ is useful for prediction.

b) The crime rate is found to be lower in the west and moderate in the central region. But as there is significant overlap, such variations may not be very helpful for prediction.

Linear Regression Fit of Strongly Correlated Features

As there are a lot of features, we will take only the most correlated features to estimate the goodness of linear fit.

The 5 features most strongly correlated with crime rate are listed below; the first row is the correlation of crime rate with itself:

crmrte 1.000000
density 0.728963
urban 0.615602
wfed 0.486156
taxpc 0.450980
wtrd 0.410106
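A ranking like the one above can be produced with a correlation matrix. The sketch below uses synthetic data in which only density drives the target; the column names mirror the dataset, but the values do not.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names mirror the dataset, values do not.
rng = np.random.default_rng(0)
n = 200
density = rng.uniform(0.5, 8.0, n)
df = pd.DataFrame({
    "density": density,
    "taxpc": rng.uniform(25, 120, n),
    "crmrte": 0.01 * density + rng.normal(0, 0.01, n),
})

# Correlation of every numeric column with the target, strongest first.
corr_with_target = df.corr()["crmrte"].sort_values(ascending=False)
print(corr_with_target)
```

The target always tops its own ranking with a correlation of 1, which is why crmrte appears as the first row.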

Feature-Feature Correlation Analysis

Multicollinearity among features can be identified by feature-feature correlation analysis. In linear regression, the input variables shouldn’t be multicollinear, i.e. strongly dependent on each other.
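As a rough sketch of such a check, the snippet below flags feature pairs whose absolute correlation exceeds a threshold. The data is synthetic and the 0.6 cut-off is an assumed value, not one taken from this case study.

```python
import numpy as np

# Synthetic features: "urban" is derived from "density", so the two are related.
rng = np.random.default_rng(1)
density = rng.uniform(0.5, 8.0, 300)
urban = (density > 6.0).astype(float)
taxpc = rng.uniform(25, 120, 300)

names = ["density", "urban", "taxpc"]
corr = np.corrcoef(np.vstack([density, urban, taxpc]))

# List feature pairs whose absolute correlation exceeds a chosen threshold.
threshold = 0.6  # assumed cut-off, tune per problem
pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(pairs)
```

Pairwise correlation only catches two-variable dependence; a feature that is a combination of several others needs a different check, such as the variance inflation factor.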

Zoomed HeatMap

Observations:

a) The density and urban variables seem to be highly correlated, which is expected, because urban areas are densely populated. Hence, there is a high chance of multicollinearity between the density and urban features. We will use linear regression models to settle this question.

b) “Wage features” across domains are positively correlated. This is also intuitive, as a wage increase or decrease in one domain would likely influence the others.

The above observations from EDA are carried forward to help model building.

d) Data Cleaning

Before building the model, it is important to clean the data based on observations from Data Analysis & EDA.

e) Model Building

Let’s evaluate the above observations by building linear regression models. For linear models, it is advisable to standardize the features before building the model.
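Standardization here means rescaling each feature to zero mean and unit standard deviation. A minimal sketch with made-up numbers follows.

```python
import numpy as np

# Made-up feature matrix with columns on very different scales.
X = np.array([
    [2.4, 30.9, 0.0018],
    [1.0, 26.9, 0.0017],
    [0.4, 34.8, 0.0015],
    [0.8, 42.9, 0.0012],
])

# Z-score each column: subtract the column mean, divide by the column std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

When a separate test set is used, the mean and standard deviation should be computed on the training rows only and then applied to the test rows, so that no information leaks from test to train.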

1. Creating Model with Most Correlated Feature

Based on EDA, we know crime rate is most correlated with density.

Interim Observations:

• As the p-value of density is very low, the relationship between crime rate and density is statistically significant.
• R-squared value = 0.525 means 52.5% of the variability of crime rate is explained by the density feature.

2. Creating Model with Top 2 Correlated Features

Interim Observations:

• R-squared increased to 0.527 when ‘urban’ is coupled with ‘density’ as predictor variables. But R-squared never goes down when you add more predictor variables, regardless of whether the added variable helps prediction or not.
• Adjusted R-squared penalizes adding more variables. Thus, it goes down when you add variables that don’t contribute. Note that the adjusted R-squared value has gone down from 0.519 to 0.514. Also, the AIC value has increased from -470 to -469. [See notes below]
• Note that the p-value of the ‘density’ feature also increased slightly from the earlier model. Thus, the model has become less reliable in explaining crime rate.

Notes:

If we add variables that are not useful for prediction, it can cause ‘overfitting’. The prediction model would then perform well on training data but worse on real-world data. By tracking adjusted R-squared, p-values, AIC & BIC, we can carefully include or exclude variables from the model. Another standard method to identify overfitting is to check for divergence of the train and test loss curves. [8]

Akaike information criterion (AIC) estimates the relative information lost by a given model: the less information a model loses, the higher the quality of the model. Thus, the lower the AIC, the better. AIC & BIC (Bayesian information criterion) represent the quality of a model relative to other models, not in absolute terms.
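To make these indicators concrete, the sketch below computes R-squared, adjusted R-squared and AIC by hand with NumPy for two nested models on synthetic data. The helper name ols_summary is hypothetical, and the AIC uses the usual Gaussian log-likelihood form, which is an assumption about the convention rather than a statement about the library used in this case study.

```python
import numpy as np

def ols_summary(X, y):
    """Fit OLS with an intercept; return R-squared, adjusted R-squared, AIC."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ssr = float(resid @ resid)                # residual sum of squares
    sst = float(((y - y.mean()) ** 2).sum())  # total sum of squares
    k = A.shape[1]                            # parameters incl. intercept
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    # Gaussian log-likelihood at the fitted parameters, then AIC = 2k - 2*llf.
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    aic = 2 * k - 2 * llf
    return r2, adj_r2, aic

# Synthetic data: the target depends on density only.
rng = np.random.default_rng(42)
n = 90
density = rng.uniform(0.5, 8.0, n)
noise_feature = rng.normal(size=n)            # unrelated to the target
y = 0.01 * density + rng.normal(0, 0.015, n)

r2_1, adj_1, aic_1 = ols_summary(density.reshape(-1, 1), y)
r2_2, adj_2, aic_2 = ols_summary(np.column_stack([density, noise_feature]), y)
print(f"1 feature : R2={r2_1:.4f} adjR2={adj_1:.4f} AIC={aic_1:.1f}")
print(f"2 features: R2={r2_2:.4f} adjR2={adj_2:.4f} AIC={aic_2:.1f}")
```

R-squared of the larger model can never be lower than that of the smaller one, whereas adjusted R-squared and AIC can move in either direction, which is what makes them useful for deciding whether an extra feature earns its place.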

3. Model with all Features

Having analysed the top 2 correlated features, we will now add all features to the model and systematically remove features to find the best model.

4. Removing Features from All-feature Model

Interim Observations:

a) After removing the 2 features ‘urban’ and ‘county’, adj. R-squared improved from 0.825 in the all-feature model to 0.830.

b) The AIC value decreased from -591.3 in the all-feature model to -595.2 after removal of the 2 features ‘urban’ and ‘county’.

Thus, we have a better model than the all-feature model. We will try to remove more features and analyze model indicators further.

5. Removing more features from all-feature Model:

Interim Observations:

The adj. R-squared & AIC values of the above model are better than those of the earlier model. We will try to remove even more features with p > 0.05 and evaluate using RMSE.

6. Model Evaluation Using Cross Validation & RMSE

We will test the change in RMSE value when the features with p > 0.05 are removed.
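A bare-bones version of this comparison is sketched below: k-fold cross-validation with an ordinary least squares fit, written in NumPy. The helper name cv_rmse, the fold count and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def cv_rmse(X, y, n_splits=5, seed=0):
    """Mean RMSE of an OLS fit over shuffled k-fold splits."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_splits)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        err = y[test_idx] - A_test @ beta
        scores.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(scores))

# Synthetic data: the target depends on the first column only.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = 0.02 * X[:, 0] + rng.normal(0, 0.005, 100)

rmse_all = cv_rmse(X, y)
rmse_reduced = cv_rmse(X[:, [0]], y)   # drop the two uninformative columns
print(f"all features: {rmse_all:.5f}, reduced: {rmse_reduced:.5f}")
```

Because every row is used for testing exactly once, the averaged RMSE is less sensitive to a lucky or unlucky single split than a one-off train/test evaluation.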

Interim Observations:

From the bar chart, RMSE improves when the wtrd & avgsen features are removed, along with the previously removed features.

But R-squared and AIC figures degrade when both wtrd & avgsen are removed. Since wtrd has a higher p-value than avgsen, we will remove only wtrd from our model.

7. Building the Final Model

We have identified 9 features to be removed from the dataset. Let’s build the final model.

f) Training & Evaluation of Model

We will split the input dataset into train and test sets. The model is fitted on the train set, and the test set is used to evaluate model performance.
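The split-then-fit pattern can be sketched as follows. The data is synthetic, and the 80/20 ratio is an assumed choice rather than the one used to produce the figures reported here.

```python
import numpy as np

# Synthetic stand-in for the cleaned dataset (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(90, 4))
y = 0.02 * X[:, 0] + rng.normal(0, 0.005, 90)

# Shuffle row indices, then hold out 20% of the rows for testing.
idx = rng.permutation(len(y))
n_test = int(0.2 * len(y))  # 80/20 split is an assumed ratio
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Fit on the training rows only; predict on the held-out rows.
A_train = np.column_stack([np.ones(len(y_train)), X_train])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([np.ones(len(y_test)), X_test]) @ beta
print(X_train.shape, X_test.shape)
```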

Mean Absolute Error (MAE) = 0.006986915112800922

Mean Squared Error (MSE) = 9.631084350128324e-05

Root Mean Squared Error (RMSE) = 0.009813808817237233

Explained Variance = 0.8203755208633289

Median Absolute Error = 0.004534970460507454
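For reference, each of these metrics can be computed by hand from the actual and predicted values. The arrays below are made up for illustration and are unrelated to the figures reported above.

```python
import numpy as np

# Made-up actual and predicted crime rates (illustrative only).
y_true = np.array([0.030, 0.015, 0.042, 0.025, 0.060])
y_pred = np.array([0.028, 0.018, 0.040, 0.030, 0.052])

err = y_true - y_pred
mae = np.mean(np.abs(err))                 # Mean Absolute Error
mse = np.mean(err ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                        # Root Mean Squared Error
explained_var = 1 - np.var(err) / np.var(y_true)
median_ae = np.median(np.abs(err))         # Median Absolute Error
print(mae, mse, rmse, explained_var, median_ae)
```

RMSE is simply the square root of MSE, which brings the error back to the units of the crime rate itself; the median absolute error is less affected by a few badly predicted counties than the mean-based metrics.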

g) Drawing Conclusions

a) Note that the “Actual Crime Rate” vs “Predicted Crime Rate” plot is close to linear. This means the predicted crime rates are almost the same as the actual crime rates, so the linear model fits the data well.

b) Feature Engineering: Combining location features (west, central and urban) into a categorical feature, binning real-valued features using decision trees, or applying functional transforms such as log could increase prediction accuracy.

c) To improve the model, it is good to perform EDA on the prediction errors to identify where the error is large, e.g. using the distribution and percentiles of the error plot, and to engineer features accordingly.

d) More important crime predictors, such as ‘unemployment rate’, should be incorporated into the input dataset.
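Two of the feature-engineering ideas in b) can be sketched as follows. The rows are hypothetical, and the labels "other" (neither west nor central) and "_rural" (not urban) are assumptions introduced here, not categories defined by the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the dataset's column names.
df = pd.DataFrame({
    "west": [1, 0, 0, 0],
    "central": [0, 1, 0, 1],
    "urban": [0, 0, 1, 1],
    "density": [0.4, 1.2, 8.8, 5.1],
})

# Collapse the two region flags into one label; rows with neither flag set
# are labelled "other" here (an assumption about what they represent).
region = np.select(
    [df["west"].to_numpy() == 1, df["central"].to_numpy() == 1],
    ["west", "central"],
    default="other",
)
# Append the urban flag so the three booleans become one categorical feature.
suffix = np.where(df["urban"].to_numpy() == 1, "_urban", "_rural")
df["location"] = np.char.add(region, suffix)

# Log transform to compress the long right tail of density.
df["log_density"] = np.log(df["density"])
print(df[["location", "log_density"]])
```

Before fitting a linear model, the combined categorical column would still need to be encoded numerically, for example with one-hot encoding.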

The entire code of the above case study can be found here: