THE STANFORD OPEN POLICING PROJECT Part 1

Tanabut Taksinavongskul

Published in

botnoi-classroom

7 min read, Nov 6, 2020

The sequel, which covers building the machine learning model:
“THE STANFORD OPEN POLICING PROJECT Part 2”

Are police biased against Black and Hispanic drivers, searching them more often than white drivers?

Is it true that when police pull a car over in the course of duty, the chance that they go on to search the driver or the vehicle for anything suspicious is biased against Black drivers and against Hispanic drivers (Spanish speakers, or people from Mexico)? We use data from https://openpolicing.stanford.edu/ and work with the Rhode Island State dataset.

https://openpolicing.stanford.edu/img/mp.jpg

If you want to see the full version, follow along on GitHub or Google Colab at the links below. Let's go!

GitHub (prepare stage): https://gist.github.com/TanabutT/83452915139c2e0069811cd745d1f431#file-policedataprepare-ipynb and https://gist.github.com/TanabutT/0a2e334ab8ad27bbfae56abea23c2f1c

Google Colab (prepare stage): https://drive.google.com/file/d/1VwSjFmoqzpWG_X-F8W8lYSrT2dd9I7wS/view?usp=sharing

Google Colab (analysis stage): https://drive.google.com/file/d/1VwSjFmoqzpWG_X-F8W8lYSrT2dd9I7wS/view?usp=sharing

Problem

Are traffic stops prone to racial bias?

Police officers in America interact with people of many races and ethnic groups in the course of their work, and bias can creep in unconsciously while they are on duty. This dataset comes from a Stanford study that aims to improve interactions between police and the public, so let's analyze it and see what we can learn.

Questions we want to answer

What are most people stopped for?
Who gets stopped by the police: women or men, white or Black drivers, which races, and how many in each group?
Are the people who get stopped frisked or have their car searched? Is any gender or race searched more often than usual (looking for bias)?
Is there a link between race and eventually being arrested?
Is there a link between race and contraband being found in a search?

Examining the dataset

Let's analyze police stops in Rhode Island State. Before starting, the important step is to load the data with pandas and look at the top rows and the columns, check the general values, and see whether the raw data has anything missing or any odd values.

import pandas as pd

# Read the raw Rhode Island CSV into a DataFrame named ri_2020
ri_2020 = pd.read_csv('dataset/ri_statewide_2020_04_01.csv', dtype={'frisk_performed': 'float'}, low_memory=False)
# Show the first rows
ri_2020.head()

Raw Data Preparation (Part 1)

The first part of this notebook takes the raw data file and does some basic handling, to prepare the data for the analysis and visualization in the next part.

Details of each column

raw_row_number = row number
date = date of the stop
time = time of the stop
zone = zone where the stop took place
subject_race = race (skin color)
subject_sex = sex
department_id = department ID (drop)
type = type of stop (drop)
arrest_made = whether an arrest was made
citation_issued = whether a citation was issued
warning_issued = whether a warning was issued
outcome = citation / warning / arrest
contraband_found = contraband found
contraband_drugs = illegal drugs found
contraband_weapons = illegal weapons found
contraband_alcohol = illegal alcohol found
contraband_other = other illegal items found
frisk_performed = body search (frisk)
search_conducted = vehicle search
search_basis = basis for the search
reason_for_search = reason for the search
reason_for_stop = reason for the stop
vehicle_make = vehicle make
vehicle_model = vehicle model
raw_BasisForStop = basis for the stop (raw value)
raw_OperatorRace = race (skin color, raw value)
raw_OperatorSex = sex (raw value)
raw_ResultOfStop = result of the stop (raw value)
raw_SearchResultOne = result of search 1
raw_SearchResultTwo = result of search 2
raw_SearchResultThree = result of search 3

Let's check, as a picture, how much of our data is missing.
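
A minimal sketch of the same check in numbers rather than as a picture: the share of missing values per column. The small DataFrame is made-up sample data standing in for ri_2020, and `missing_share` is a name chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for ri_2020 (assumption: not the real data)
sample = pd.DataFrame({
    "subject_sex": ["male", "female", np.nan, "male"],
    "contraband_found": [np.nan, np.nan, np.nan, True],
    "search_conducted": [False, False, False, True],
})

# isna() flags missing cells; the column mean of those flags is the missing share
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns near the top of this list are the ones to look at first when deciding what to drop or fill.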

The data source also gives an overview of what is useful in the data, as follows.

Next, look at which repeated or unique values each column holds. You can follow along in the Colab linked above.

for i in ri_2020.columns:
    # see rough detail within each column
    print(ri_2020[i].value_counts())

Next we keep only the columns that look useful. Once we get into the analysis we will know which columns to add or remove; if a column that was cut turns out to be needed, this is the place to come back and adjust. After that we rename columns so they are easier to understand, both to avoid confusing ourselves and so that other people can read the work without too much trouble, as in the code below.

# cut the table down to the columns of interest
ri_trim = ri_2020[['raw_row_number', 'date', 'time', 'zone', 'subject_race', 'subject_sex',
       'arrest_made', 'citation_issued', 'warning_issued', 'contraband_found', 'contraband_drugs',
       'contraband_weapons', 'contraband_alcohol', 'frisk_performed',
       'search_conducted', 'reason_for_stop']]
# rename columns
ri_trim = ri_trim.rename(columns={"time": "stop_time", "zone": "district"})

Finally, once we have the table we want, we save it as a CSV file.

ri_trim.to_csv('PoliceRI2020.csv')

That is the end of Part 1 of this Colab notebook.

Data Exploration (Part 2)

We read the CSV file from the previous part and analyze it, guided by the questions we set out for exploring the data.

The latest CSV file has the following columns:

 0   raw_row_number      509681 non-null  int64  
 1   date                509671 non-null  object 
 2   stop_time           509671 non-null  object 
 3   district            509671 non-null  object 
 4   subject_race        480608 non-null  object 
 5   subject_sex         480584 non-null  object 
 6   arrest_made         480608 non-null  object 
 7   citation_issued     480608 non-null  object 
 8   warning_issued      480608 non-null  object 
 9   contraband_found    17762 non-null   object 
 10  contraband_drugs    15988 non-null   object 
 11  contraband_weapons  11795 non-null   object 
 12  contraband_alcohol  1217 non-null    object 
 13  frisk_performed     509681 non-null  float64
 14  search_conducted    509681 non-null  bool   
 15  reason_for_stop

Description of each column
raw_row_number : record number of the traffic stop
date : date the police made the stop
stop_time : time the police made the stop
district : area of Rhode Island State where the stop took place
subject_race : race of the driver
subject_sex : sex of the driver
arrest_made : an arrest was made
citation_issued : a citation was issued
warning_issued : a warning was issued
contraband_found : contraband was found
contraband_drugs : illegal drugs were found
contraband_weapons : illegal weapons were found
contraband_alcohol : illegal alcohol was found
frisk_performed : the driver was frisked
search_conducted : the vehicle was searched
reason_for_stop : reason the police made the stop (the traffic violation)

In the Colab we do some initial exploration and clean the missing data by dropping every row in which the driver's sex was not recorded.
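
A minimal sketch of that cleaning step, assuming the column is still called subject_sex at this point. The sample rows are made up for illustration; the real notebook works on the full table.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (assumption: not the real data)
sample = pd.DataFrame({
    "subject_sex": ["male", np.nan, "female", np.nan],
    "subject_race": ["white", np.nan, "black", "hispanic"],
})

# Keep only the rows where the driver's sex was recorded
cleaned = sample.dropna(subset=["subject_sex"])
print(len(sample), "->", len(cleaned))
```

Using `subset` limits the check to one column, so rows are not thrown away just because some other field is empty.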

Next, we found a very large number of NaN values in the contraband columns, so we replaced NaN with 0 to record that no contraband was found in those stops.
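
A minimal sketch of that replacement on made-up data. Treating NaN as "nothing found" is the assumption described above; `contraband_cols` and `filled` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (assumption: not the real data)
sample = pd.DataFrame({
    "contraband_drugs": [True, np.nan, False, np.nan],
    "contraband_weapons": [np.nan, np.nan, True, np.nan],
})

contraband_cols = ["contraband_drugs", "contraband_weapons"]
# Replace NaN with 0 ("nothing found"), then store the result as booleans
filled = sample[contraband_cols].fillna(0).astype(bool)
print(filled)
```

After this step the columns contain only True and False, which is what makes the later rate calculations with `.mean()` possible.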

The next step is to convert each column to a suitable dtype. This makes the analysis easier and also speeds up processing, because the data takes much less memory and storage.

# Columns read as object are converted to better-suited dtypes (for example arrest_made to 'bool').
# subject_race and subject_sex have already been renamed to driver_race and driver_gender.
ri['district'] = ri.district.astype('category')
ri['driver_race'] = ri.driver_race.astype('category')
ri['driver_gender'] = ri.driver_gender.astype('category')
ri['arrest_made'] = ri.arrest_made.astype('bool')
ri['citation_issued'] = ri.citation_issued.astype('bool')
ri['warning_issued'] = ri.warning_issued.astype('bool')
ri['frisk_performed'] = ri.frisk_performed.astype('bool')
ri['reason_for_stop'] = ri.reason_for_stop.astype('category')
print(ri.dtypes)
print('\n')

The result of using suitable types:

Previous memory usage: roughly 40+ MB

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 480584 entries, 2005-11-22 11:15:00 to 2015-10-30 11:09:00
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   district            480584 non-null  category
 1   driver_race         480584 non-null  category
 2   driver_gender       480584 non-null  category
 3   arrest_made         480584 non-null  bool    
 4   citation_issued     480584 non-null  bool    
 5   warning_issued      480584 non-null  bool    
 6   contraband_drugs    480584 non-null  bool    
 7   contraband_weapons  480584 non-null  bool    
 8   contraband_alcohol  480584 non-null  bool    
 9   frisk_performed     480584 non-null  bool    
 10  search_conducted    480584 non-null  bool    
 11  reason_for_stop     480584 non-null  category
dtypes: bool(8), category(4)
memory usage: 9.2 MB

Looking closely at the details, memory usage is down to just 9.2 MB, from more than 40 MB at the start.
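
To see why the category dtype saves so much, here is a rough sketch on a made-up column with a few distinct values repeated many times. The values and sizes are assumptions for illustration, not measurements from the real dataset.

```python
import pandas as pd

# Made-up column: few distinct strings, repeated many times (assumption: not the real data)
as_object = pd.Series(["Speeding", "Equipment", "Speeding", "Other"] * 10_000, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not only the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

A category column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.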

Follow along in the Colab.

Overview for “reason_for_stop”

What are drivers stopped for overall? Let's look at the chart.

Very clearly, the main reason is speeding.
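
The numbers behind a chart like this can be sketched with `value_counts(normalize=True)`. The reasons listed here are made-up sample values, not the real distribution.

```python
import pandas as pd

# Made-up sample of stop reasons (assumption: not the real data)
reasons = pd.Series(
    ["Speeding", "Speeding", "Equipment", "Speeding", "Other"],
    name="reason_for_stop",
)

# normalize=True returns shares that sum to 1 instead of raw counts
reason_share = reasons.value_counts(normalize=True)
print(reason_share)
```

The result is sorted from most to least common, so the first entry is the leading reason for a stop.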

Exploring the relationship between driver gender and policing

Now an overview of “driver_gender”. The column was originally named “subject_sex”; it was renamed to make it easier to understand.

Share of each violation for female and male drivers

# find out how many males and females are on record for further investigation
# (male and female are the per-gender subsets built in the Colab)
print('male on record {} male'.format(male.shape[0]))
print('female on record : {} female'.format(female.shape[0]), '\n')
print('=> male and female proportion is different.')

male on record 349446 male
female on record : 131138 female 

=> male and female proportion is different.

Speeding is still the leading violation. The data in this chart has been normalized, since the sample contains 349,446 male drivers but only 131,138 female drivers. For female drivers, speeding makes up the largest share of all violations; for male drivers the share is lower, which means their stops are spread more across the other violations.
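
One way to get normalized shares like these is `pd.crosstab` with `normalize="index"`. This is a sketch on made-up rows and may differ from how the Colab builds the chart.

```python
import pandas as pd

# Made-up sample: 3 female and 5 male stops (assumption: not the real data)
sample = pd.DataFrame({
    "driver_gender": ["female"] * 3 + ["male"] * 5,
    "reason_for_stop": ["Speeding", "Speeding", "Other",
                        "Speeding", "Speeding", "Other", "Other", "Equipment"],
})

# normalize="index" makes every row (one per gender) sum to 1,
# so groups of different sizes can be compared directly
shares = pd.crosstab(sample["driver_gender"], sample["reason_for_stop"], normalize="index")
print(shares)
```

Without normalization the larger group would dominate every bar simply because it has more rows.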

Does gender affect whose vehicle is searched?

Next, let's see whether gender affects whether the vehicle is searched after the police make the stop.

Overall, without splitting by male and female, the vehicle search rate is 3.7%.

# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))
# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())

False    0.963041
True     0.036959
Name: search_conducted, dtype: float64
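
Both lines above rest on the same fact: the mean of True/False values equals the share of True. A tiny sketch in plain Python, using made-up flags:

```python
# Made-up flags: 2 searches out of 10 stops (assumption: not the real data)
flags = [False, False, True, False, True, False, False, False, False, False]

# Counting the True values and dividing by the total...
rate_by_count = flags.count(True) / len(flags)
# ...gives the same number as the mean, because True counts as 1 and False as 0
rate_by_mean = sum(flags) / len(flags)
print(rate_by_count, rate_by_mean)
```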

Splitting by gender and looking at the search rate again, we see 4.4% for males and 1.9% for females.

# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

driver_gender
female    0.018751
male      0.043792
Name: search_conducted, dtype: float64

Comparing the “search_conducted” rate by “reason_for_stop” and “driver_gender”

Here we look at how often the police go on to search the vehicle, by type of traffic violation or suspicious behavior, split by the driver's gender.

ri.groupby(['reason_for_stop', 'driver_gender']).search_conducted.mean().unstack(level=-1)

Probability that each type of violation leads to a vehicle search

Comparison of the probability that each type of violation leads to a vehicle search

For every type of violation, the search rate is higher for males than for females; in the table above, the maximum search rate for females is about 3.7%. This indicates that gender does affect the search rate.
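
To make the shape of that table concrete, here is a small sketch of grouping by two keys and unstacking the gender level into columns. The rows are made up; `male_higher` is a name chosen for this illustration.

```python
import pandas as pd

# Made-up sample rows (assumption: not the real data)
sample = pd.DataFrame({
    "reason_for_stop": ["Speeding"] * 4 + ["Equipment"] * 4,
    "driver_gender": ["female", "female", "male", "male"] * 2,
    "search_conducted": [False, False, False, True, False, True, True, True],
})

# One row per violation, one column per gender, each cell a search rate
table = (
    sample.groupby(["reason_for_stop", "driver_gender"])["search_conducted"]
    .mean()
    .unstack()
)
# True only if the male rate exceeds the female rate for every violation
male_higher = bool((table["male"] > table["female"]).all())
print(table)
print(male_higher)
```

Checking the comparison row by row like this guards against a difference that exists only in the overall average.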

Additional explanation of police terminology
An all-points bulletin (APB) is a broadcast issued from any American or Canadian law enforcement agency to its personnel, or to other law enforcement agencies. It typically contains information about a wanted suspect who is to be arrested or a person of interest, for whom law enforcement officers are to look. They are usually dangerous or missing persons.
An all-points bulletin can also be known as a BOLO or BOL, which stands for “be on (the) look-out”. Such an alert may also be called a lookout or ATL (“attempt to locate”).

Does gender affect who is frisked during a search?

A frisk is a body search, the kind where the police say:

“Turn around, hands on your head, spread your legs and hold still!”

# Calculate the frisk rate for each gender
# (searched is the subset of stops in which a search was conducted)
print(searched.groupby('driver_gender').frisk_performed.mean())

driver_gender
female    0.437983
male      0.538783

The frisk rate is about 10 percentage points higher for males than for females,
though we can’t conclude that this difference is caused by the driver’s gender.
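
The gap can be expressed either in percentage points or as a ratio. A short worked calculation using the two rates printed above (the variable names are chosen for this illustration):

```python
# Frisk rates among searched drivers, taken from the output above
female_frisk_rate = 0.437983
male_frisk_rate = 0.538783

# Absolute gap in percentage points
point_gap = (male_frisk_rate - female_frisk_rate) * 100
# Relative gap as a ratio of the two rates
ratio = male_frisk_rate / female_frisk_rate
print(f"gap: {point_gap:.1f} percentage points, ratio: {ratio:.2f}x")
```

The two views answer different questions: the point gap shows how many more searched drivers out of 100 are frisked, the ratio shows how much more likely a frisk is.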

Chance of drugs being found when a search was conducted

Each vehicle search has some chance of turning up illegal drugs.

Each search has a 26.8 percent chance of being drug-related.

Chance of contraband being found when a search was conducted, by driver race

Each vehicle search has some chance of turning up various kinds of contraband; here the results are split by “driver_race”.

The chart above shows no significant connection between any specific driver_race and the contraband found.

What violations are caught in each district?

Each district presumably has a different population size, but in every district the most common reason for a stop is “Speeding”.

How does “driver_race” influence police activity?

First, let's look at the stop rate for each race, comparing it against a table of the 2015 population of each race in Rhode Island.

Merged result with the 2015 population table for RI State

In Rhode Island, the stop rate for Black drivers is higher than for any other race, relative to each group's share of the population.
Black drivers are stopped at a rate 3.2 times higher than white drivers. Hispanic drivers are stopped at a rate 1.6 times higher than white drivers.
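
A sketch of how such a per-population comparison can be computed with a merge. All counts and population figures here are hypothetical placeholders, picked only so that the ratios come out at 3.2 and 1.6; they are not the real Rhode Island numbers.

```python
import pandas as pd

# Hypothetical stop counts and population sizes (assumption: illustration only)
stops = pd.DataFrame({
    "driver_race": ["white", "black", "hispanic"],
    "n_stops": [600, 240, 160],
})
population = pd.DataFrame({
    "driver_race": ["white", "black", "hispanic"],
    "population": [6000, 750, 1000],
})

# Join the two tables on race, then divide stops by population
merged = stops.merge(population, on="driver_race")
merged["stop_rate"] = merged["n_stops"] / merged["population"]

# Express every group's rate relative to the rate for white drivers
rates = merged.set_index("driver_race")["stop_rate"]
ratio_to_white = rates / rates["white"]
print(ratio_to_white)
```

Dividing by population first matters: raw stop counts alone would mostly reflect which group is largest.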

Search rates and frisk rate by each race

After the police make a stop, these are the rates at which they go on to search the vehicle and frisk the driver, for each “driver_race”.

Search and frisk rates by driver race

Here we see that among drivers who were stopped, Black drivers were searched at a rate 2.19 times higher than white drivers, and Hispanic drivers were searched at a rate 1.92 times higher than white drivers. Black drivers were frisked at a rate 1.16 times higher than white drivers, and Hispanic drivers were frisked at a rate 1.02 times higher than white drivers.
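
Unlike the per-population stop rate, these ratios are computed only among drivers who were already stopped. A minimal sketch on made-up rows (not the real data), with `relative` as an illustrative name:

```python
import pandas as pd

# Made-up stopped drivers: 1 of 10 white and 2 of 10 black drivers searched
sample = pd.DataFrame({
    "driver_race": ["white"] * 10 + ["black"] * 10,
    "search_conducted": [True] + [False] * 9 + [True] * 2 + [False] * 8,
})

# Search rate among stopped drivers, per race
search_rate = sample.groupby("driver_race")["search_conducted"].mean()
# Each rate divided by the rate for white drivers
relative = search_rate / search_rate["white"]
print(relative)
```

The same pattern works for frisk_performed by swapping the column name.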

In conclusion, Black drivers and Hispanic (Latino, Mexican) drivers are subject to bias in how police carry out their duties, and the findings from each section you have read above answer several of the questions posed at the start.

Happy Analysing!

That's the end. Thank you for reading all the way through!

Next, we will see which machine learning models this data can be used in. See you again soon.

Continue reading in Part 2:

“THE STANFORD OPEN POLICING PROJECT Part 2”