Investigating Racial Biases in Seattle Terry Stop Data with Machine Learning
Can the race of a stopped individual be predicted by officer & subject demographic information?
Background
Tension between law enforcement and the public has been rising, and much of the news coverage is heavy on rhetoric. As a new Seattleite, I was interested in determining whether there was a racial bias in policing in the city of Seattle, so I decided to look into Terry Stops performed in the City of Seattle and come to my own conclusion.
Through Seattle’s Open Data Program, I was able to download the Terry Stop data from the past few years and perform statistical and exploratory analysis on it, with the goal of creating a model that would predict the race of a subject stopped by an officer. Can this prediction be made from the demographic information of the officer and subject alone?
Key Findings
- White residents are the most stopped group at 51% of Terry Stops, and are 66% of Seattle’s population
- Black residents are the second most stopped group, accounting for 31% of stops, and are 8% of Seattle’s population
- Black & Multi-Racial residents had the highest percentage of their juvenile (1–17) population stopped, at 8% & 7.7% respectively (p < .05)
- White residents are stopped the most; however, they have the lowest frisk rate of any race, at 18% (p < .05)
- While there are significant differences in Terry Stops & frisks among races, the models were not able to accurately predict the race of a subject (or whether a frisk occurred) from officer demographic data alone.
Overview
What is a Terry Stop?
Under the 1968 Terry v. Ohio ruling, a police officer may stop and detain a person based on reasonable suspicion. And, if the police reasonably suspect the person is armed and dangerous, they may also frisk him or her for weapons. A stop is justified if the suspect is exhibiting any combination of the following behaviors:
- Appears not to fit the time or place.
- Matches the description on a “Wanted” flyer.
- Acts strangely, or is emotional, angry, fearful, or intoxicated.
- Loitering, or looking for something.
- Running away or engaging in furtive movements.
- Present in a crime scene area.
- Present in a high-crime area (not sufficient by itself or with loitering).
What is a Frisk?
A frisk is a type of search that requires a lawful stop. It involves contact or patting of the person’s outer clothing to detect whether a concealed weapon is being carried; a frisk does not necessarily follow every stop. The law of frisk is based on the “experienced police officer” standard, whereby an officer’s experience makes him or her better equipped to read criminal behavior than the average layperson. The purpose of a frisk is to dispel suspicions of danger to the officer and other persons, and it should only be used to detect concealed weapons or contraband. If other evidence, such as a suspected drug container, can be felt under the suspect’s clothing, it can be seized by the officer.
A frisk is justified under the following circumstances:
- Concern for the safety of the officer or of others.
- Suspicion the suspect is armed and dangerous.
- Suspicion the suspect is about to commit a crime where a weapon is commonly used.
- Officer is alone and backup has not arrived.
- Number of suspects and their physical size.
- Behavior, emotional state, and/or look of suspects.
- Suspect gave evasive answers during the initial stop.
- Time of day and/or geographical surroundings (not sufficient by themselves to justify frisk).
A stop requires reasonable suspicion: a set of factual circumstances that would lead a reasonable police officer to believe criminal activity is occurring. A frisk is based on the ‘experienced police officer’ standard. Several of the above behaviors can be viewed as subjective.
Why does it matter?
There is a difference between one police officer stopping one individual, which is a tactical definition, and systematic promotion of this tactic on either the departmental or municipal level, which can damage police–community trust and lead to charges of racial profiling. — Wikipedia, Terry stop
The goal is to investigate Terry Stops in Seattle by the race of the subject and officer, and to gain insight into whether the percentage of stops matches the demographics of the city. A secondary goal is to identify whether certain officers stop subjects of a certain race at a higher rate.
All of the data, notebooks & reports can be found in this github repository.
Data Wrangling
The Terry Stop Data was obtained and imported from Seattle’s Open Data Program and contained 34,521 officer stops from October 1, 2015 to May 5, 2019, capturing 23 features: ‘Subject Age Group’, ‘Subject ID’, ‘GO / SC Num’, ‘Terry Stop ID’, ‘Stop Resolution’, ‘Weapon Type’, ‘Officer ID’, ‘Officer YOB’, ‘Officer Gender’, ‘Officer Race’, ‘Subject Perceived Race’, ‘Subject Perceived Gender’, ‘Reported Date’, ‘Reported Time’, ‘Initial Call Type’, ‘Final Call Type’, ‘Call Type’, ‘Officer Squad’, ‘Arrest Flag’, ‘Frisk Flag’, ‘Precinct’, ‘Sector’, and ‘Beat’.
To transform the data into a usable format, the following steps were taken:
- Drop unnecessary columns: retaining mostly the data pertaining to demographic information of both subject & officer
- Convert column types & rename: most features were categorical; column names were also adjusted for easier data manipulation
- Manage missing values: records missing crucial information (e.g. subject_race) were dropped
- Transform data: additional transformation was needed for easier analysis & use by machine learning algorithms, e.g. converting the ‘Officer YOB’ column to the officer’s age at the time of the stop, or changing a ‘flag’ column to a boolean type
- Check for invalid data: records where officers were >100 years old or had negative Officer IDs were dropped
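As a rough sketch of these wrangling steps in pandas, the column names below mirror the dataset, but the sample rows are invented for illustration:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the Terry Stop columns used here.
df = pd.DataFrame({
    "Officer YOB": [1980, 1990, 1800],           # 1800 is an invalid entry
    "Reported Date": ["2018-06-01", "2017-03-15", "2016-01-01"],
    "Frisk Flag": ["Y", "N", "Y"],
})

# Convert 'Officer YOB' to the officer's age at the time of the stop.
df["reported_date"] = pd.to_datetime(df["Reported Date"])
df["officer_age"] = df["reported_date"].dt.year - df["Officer YOB"]

# Convert the Y/N flag column to a boolean type.
df["frisk_flag"] = df["Frisk Flag"].eq("Y")

# Drop records with implausible officer ages (>100 years old).
df = df[df["officer_age"] <= 100]

print(df[["officer_age", "frisk_flag"]])
```

The invalid 1800 birth-year record is filtered out, leaving two clean rows with numeric ages and boolean frisk flags.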
Exploratory Data Analysis / EDA
Before jumping into creating a model, let’s take a look at the data. Only a portion of the EDA is reviewed here; see the complete information in the github repository.
1 | What is the racial comparison of stops?
We see that between 2015–2019, Black people account for 31% of total Terry Stops. When compared to the 2010 census population data, all groups except ‘Other’ and Black residents are stopped at a lower percentage than their share of the population.
2 | Is there a gender difference among races?
Across all races, most stops were male subjects.
3 | What are the arrest & frisk rates by race?
4 | Is there a difference in the racial composition of stops by race of the officer?
White officers do not show much difference from the mean, since they make up 78% of the officers making Terry Stops. Each officer race largely follows the average percentage of stops for each subject race (see the github notebook for graphs for all races). Asian officers stop White subjects 7% more than average, while Native American officers stop Black subjects 13% more and White subjects 14% less.
The chart below displays the same comparison for White officers vs. non-White officers.
There are additional interactive Bokeh plots in the original notebook when run locally. You can view similar charts via this public Tableau dashboard.
Creating the model
Two types of models were created:
- The multi-class classification of predicting the stopped subject’s race.
- The binary classification of predicting if a stopped subject will be frisked.
Data Wrangling cont.
Some additional data wrangling needed to occur to prepare the data for the machine learning pipeline.
- subject_age was a categorical feature listed as ranges, e.g. ‘26–35’. Each range was split and its endpoints averaged
- initial_call_type was a feature describing how the call was initiated. The data was limited to ‘onview’ observations, i.e. where the officer observed the stop (or incident) directly, rather than responding to dispatch
- categorical features: several features were categorical in nature and needed to be translated into dummy variables
- ‘beat’ feature NaNs were removed. Because these values describe the location of the incident, it did not make sense to keep or impute them
- subject_race, the target variable, was converted to category codes and then to dummy variables for the classifier. Dummy variables were created because of the number of classes; I wanted to ensure the model did not attach ordinal meaning to the numeric category codes
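A minimal sketch of the age-range and dummy-variable transformations above (the column names follow the post, but the sample values are hypothetical, using plain hyphens in the ranges for simplicity):

```python
import pandas as pd

# Hypothetical rows mirroring the features described above.
df = pd.DataFrame({
    "subject_age": ["26-35", "18-25", "36-45"],
    "subject_race": ["White", "Black or African American", "Asian"],
})

# Split each age range and average its endpoints into a numeric feature.
bounds = df["subject_age"].str.split("-", expand=True).astype(int)
df["subject_age_avg"] = bounds.mean(axis=1)

# One-hot encode the target so no ordinal meaning attaches to category codes.
race_dummies = pd.get_dummies(df["subject_race"], prefix="race")

print(df["subject_age_avg"].tolist())
```

Each range becomes its midpoint (e.g. ‘26-35’ becomes 30.5), and each race class becomes its own indicator column.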
Model selection
Because this is a multi-class classification problem, the classifiers chosen, and the reasoning behind them, were:
- Logistic Regression — accuracy & fast training time, probabilistic interpretation
- Random Forest Classifier — multi-class accuracy & fast training time
- KNeighbors Classifier — one hyperparameter, very easy to implement for multi-class
- One Vs. Rest Classifier was chosen because of the multi-class scenario: I wanted a set of probabilities for each race.
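To illustrate why One-vs-Rest fits this goal, here is a toy sketch (synthetic features and labels, not the actual stop data) showing the per-class probabilities it produces:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Tiny synthetic stand-in: 3 demographic-style features, 3 race classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = rng.integers(0, 3, size=120)

# One-vs-rest fits one binary logistic model per class, so predict_proba
# returns a probability for each class for every stopped subject.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba.shape)  # one row per subject, one column per class
```

Each row of `proba` is the set of per-race probabilities the post refers to, normalized to sum to 1.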
Preprocessing pipelines
- Synthetic Minority Over-sampling Technique (SMOTE) was utilized to resample the minority classes, since the data is imbalanced over the 6 classes
- imbalanced-learn’s Pipeline was utilized so that resampling could happen within the pipeline process (only on training data)
- Standard Scaler was utilized to scale the data
- GridSearchCV was utilized to perform a 5 fold cross validation over the selected parameters for each classifier.
- Train-test split was utilized to set aside the test set, with a 20% test size.
Performance metrics
- Log loss was utilized for the multi-class problem; it handles the imbalance seen in the data better than accuracy. Precision, recall & accuracy from the confusion matrix & classification report were also evaluated to help visualize & interpret the model’s performance
- Accuracy score was utilized for the binary model.
Findings
Subject Race Prediction
Log Loss: .426
Log-loss is used as the performance metric for this model. It takes into account the uncertainty of the prediction based on how much the predicted probability varies from the actual label.
The uninformative log loss for a balanced dataset with six classes is −ln(1/6) ≈ 1.79. The log loss for this model is below that value; however, the model does not explain the variance well. The predicted probabilities were well below the threshold needed to confidently predict any one class: a 0 was predicted for every class, confirming the low confidence in each class and resulting in the low accuracy score.
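The 1.79 baseline is simply the log loss of always predicting a uniform 1/6 probability for each of the six classes, which can be verified directly:

```python
import numpy as np
from sklearn.metrics import log_loss

# Uninformative baseline: predict probability 1/6 for every class.
n_classes = 6
baseline = -np.log(1 / n_classes)

# Cross-check with sklearn: one sample of each class, uniform predictions.
y_true = np.arange(n_classes)
y_prob = np.full((n_classes, n_classes), 1 / n_classes)
assert np.isclose(log_loss(y_true, y_prob), baseline)

print(round(baseline, 2))  # 1.79
```

A model scoring below this value is at least more informative than a coin flip over six classes, but as noted above that alone does not make its class predictions confident.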
This leads me to believe that more information than the demographic information & stop location alone is needed to predict the race of a stopped subject.
This model lumps all officers together, deciding only on the demographic information of the officers & subjects. However, one officer can hold a certain bias while another officer of the same race may not.
Frisk Prediction
Recall that a frisk is based on the experienced police officer standard. From the Exploratory Data Analysis, we found that White subjects were frisked at a much lower rate than all other races. Can we predict whether a frisk will happen once an individual is stopped, based on officer demographic data and location?
Model Accuracy: .796
Precision: .44
Recall: .31
These values suggest that the classifier is not accurately predicting a frisk given the data. The majority of predicted values are of the negative class.
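To see why roughly 80% accuracy can coexist with low precision and recall, consider a hypothetical classifier that mostly predicts the negative (“no frisk”) class. The counts below are illustrative, not the model’s actual confusion matrix:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative data: 20 frisks out of 100 stops, and a classifier that
# predicts "no frisk" most of the time.
y_true = np.array([1] * 20 + [0] * 80)
# Catches 6 of 20 frisks (14 missed), with 8 false alarms among the 80.
y_pred = np.array([1] * 6 + [0] * 14 + [1] * 8 + [0] * 72)

print(accuracy_score(y_true, y_pred))   # (6 + 72) / 100 = 0.78
print(precision_score(y_true, y_pred))  # 6 / 14 ≈ 0.43
print(recall_score(y_true, y_pred))     # 6 / 20 = 0.30
```

Because negatives dominate, accuracy stays high even though most actual frisks are missed, which is exactly the pattern the metrics above show.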
The top feature is officer_id, underscoring that officers use their own set of circumstances to determine a frisk, though more information beyond subject demographics is needed.
Even though there is a significant difference among races in Terry Stops and frisks, more analysis is needed to discover whether any bias is present. For instance:
- A cluster analysis can be performed to investigate if any type of ‘groups’ of officers emerge.
- A time series approach, identifying stops or frisks of a class of subjects during certain times of day
- Capturing and analyzing the circumstances used for a stop/frisk
Open to insights on improving this analysis!