Analyzing Police Activity with pandas, Case Study (Part 2): The Data Wrangling Workflow

學．誌｜Chris Kang

Published in

不止數據｜Not Only Data

14 min read · Nov 15, 2019

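Following on from the first article in this series, this one is closer to a hands-on walkthrough. If you are curious about how to clean data effectively once you have pulled it, read on!

This series aims to explain things in plain rather than specialist language, and to focus on concepts rather than technique, so interested readers are welcome to take some time to read through it.

So let's pick up where the previous article left off. For readers who have not seen it, here is a brief introduction to the project: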

THE STANFORD OPEN POLICING PROJECT

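This analysis continues to use the open data for Rhode Island. If the project interests you, visit the Explore or Publications pages on its website to see its latest research.

This article is organized into the following parts:

Recheck the Hypothesis
Import Data and Inspect
Cleaning Data
Manipulating Data

The concepts are not as hard as you might think, so let's get started!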

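1. Recheck the Hypothesis

This step checks whether the data can actually support our analysis. Continuing from the previous article, we set out two directions:

To be clear up front: these are conjectures, not facts.

Gender affects the rate at which drivers are stopped and ticketed.
Race affects whether drivers are stopped and ticketed.

Starting from these two directions and adding some intuition, we form two hypotheses. Intuitively, women might be treated more leniently by police during enforcement, so the rate at which they are stopped or ticketed might be lower than for men.

The second direction supposes that racial discrimination may exist in the United States, so we would expect Black or Hispanic drivers to be stopped or ticketed at a higher rate than white drivers.

Putting this together gives two hypotheses to test:

Gender hypothesis: women are stopped and ticketed at a lower rate than men.
Race hypothesis: Black and Hispanic drivers are stopped and ticketed at a higher rate than white drivers.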
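To make the word "rate" a little more concrete, here is a minimal sketch of how a per-group rate could be computed with pandas. The rows and the outcome labels ('Citation', 'Warning') are invented for illustration and are not taken from the Rhode Island data.

```python
import pandas as pd

# Hypothetical toy data; every value here is invented for illustration
toy = pd.DataFrame({
    'driver_gender': ['M', 'M', 'M', 'F', 'F', 'F', 'F', 'M'],
    'stop_outcome': ['Citation', 'Citation', 'Warning', 'Citation',
                     'Warning', 'Warning', 'Citation', 'Citation'],
})

# Flag each stop that ended in a citation, then average the flag per gender.
# The mean of a True/False column is the share of True values.
toy['is_citation'] = toy['stop_outcome'] == 'Citation'
citation_rate = toy.groupby('driver_gender')['is_citation'].mean()
print(citation_rate)
```

A difference between groups in a table like this is only a starting point; it does not show on its own why the rates differ.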

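2. Import Data and Inspect

With the basic hypotheses in place, we can import the data and take a first look at it. First we read the data in and store it in the variable ri.

A quick note on the data:

Each row represents one traffic stop
NaN represents a missing value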

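# pandas, the data analysis package
import pandas as pd

# Read the file into the variable ri
ri = pd.read_csv('police.csv')

# Take a quick look at the first five rows (output omitted here because it is large)
ri.head()

# Show the columns and structure of the data
ri.info()

<class 'pandas.core.frame.DataFrame'>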
RangeIndex: 91741 entries, 0 to 91740
Data columns (total 15 columns):
state                 91741 non-null object
stop_date             91741 non-null object
stop_time             91741 non-null object
county_name           0 non-null float64
driver_gender         86536 non-null object
driver_race           86539 non-null object
violation_raw         86539 non-null object
violation             86539 non-null object
search_conducted      91741 non-null bool
search_type           3307 non-null object
stop_outcome          86539 non-null object
is_arrested           86539 non-null object
stop_duration         86539 non-null object
drugs_related_stop    91741 non-null bool
district              91741 non-null object
dtypes: bool(2), float64(1), object(12)
memory usage: 9.3+ MB

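The main purpose of this step is to see what state the data is in. Two columns are clearly incomplete: county_name has no non-null values at all, and search_type has only 3,307 out of 91,741. We need to be careful with both and should not use them directly, or the analysis could be skewed.

The other purpose is to check the variables and confirm that this dataset can test our hypotheses. driver_gender (86,536 non-null) and driver_race (86,539 non-null) both have enough data, and search_conducted is complete. Note that every row is already a traffic stop, so search_conducted records whether police carried out a search during the stop, not whether a stop happened.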
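The check described above can be sketched in a few lines: confirm that the columns the hypotheses depend on exist, and see how complete each one is. The small DataFrame below is a hypothetical stand-in for ri, with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri; the values are invented
toy = pd.DataFrame({
    'driver_gender': ['M', 'F', np.nan, 'M'],
    'driver_race': ['White', 'Black', np.nan, 'Hispanic'],
    'search_conducted': [False, True, False, False],
})

# Columns the hypotheses depend on
needed = ['driver_gender', 'driver_race', 'search_conducted']

# Any needed column that is absent from the data
missing_cols = [col for col in needed if col not in toy.columns]
print(missing_cols)

# Share of non-missing values per column, from 0.0 (empty) to 1.0 (complete)
completeness = toy[needed].notnull().mean()
print(completeness)
```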

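3. Cleaning Data

After the inspection, we check and clean the missing values we spotted at the start. Here we use the isnull method to examine the data.

A. Check for missing values

This method is straightforward: it returns True wherever a value is missing. sum() then counts the missing values in each column of the dataset.

# Check for missing values. Calling isnull() alone prints the whole table,
# with True where a value is missing and False where it is present
ri.isnull().sum()

state                     0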
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

# Check the number of rows and columns in the data
ri.shape
==> (91741, 15)
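To see why chaining isnull() and sum() yields a count, here is a small sketch on invented data. The column names mirror the dataset, but the values (including the 'Frisk' label) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    'driver_gender': ['M', np.nan, 'F'],
    'search_type': [np.nan, np.nan, 'Frisk'],
})

# isnull() returns a table of the same shape filled with True/False
mask = toy.isnull()
print(mask)

# sum() adds up each column, counting True as 1 and False as 0
missing_counts = mask.sum()
print(missing_counts)
```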

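B. Drop columns with missing data

After this quick look at the missing values, we can see that the county_name column mentioned earlier contains nothing but missing values, so we drop it.

We also know that all of the data comes from a single state, so every value in the state column is Rhode Island. The column carries no information, so we drop it as well.

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# After dropping the two columns, the column count goes down from 15 to 13
ri.shape
==> (91741, 13)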
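In this article the two columns were spotted by eye. As a rough sketch, columns that are entirely missing or that hold a single repeated value can also be found programmatically. The toy data below is invented, and the 'RI' value is an assumption about how the state might be encoded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'RI' is an assumed encoding, not taken from the source
toy = pd.DataFrame({
    'state': ['RI', 'RI', 'RI'],
    'county_name': [np.nan, np.nan, np.nan],
    'driver_gender': ['M', 'F', 'M'],
})

# Columns in which every value is missing
all_missing = toy.columns[toy.isnull().all()].tolist()

# Columns holding exactly one distinct non-missing value
constant = toy.columns[toy.nunique() == 1].tolist()

# Without inplace=True, drop() returns a new frame and leaves toy unchanged
trimmed = toy.drop(columns=all_missing + constant)
print(all_missing, constant, trimmed.columns.tolist())
```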

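C. Drop rows with missing values

After dropping the useless columns, the data still contains missing values. We can use dropna() to drop the rows that contain them.

To drop only the rows in which every value is NaN, pass how='all'. To drop rows in which any value is NaN, pass how='any', which is the default.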
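The difference between the two settings is easiest to see on a tiny example. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing
toy = pd.DataFrame({
    'driver_gender': ['M', np.nan, np.nan],
    'driver_race': ['White', 'Black', np.nan],
})

# how='any' (the default) drops a row if at least one value is missing
any_dropped = toy.dropna(how='any')

# how='all' drops a row only if every value in it is missing
all_dropped = toy.dropna(how='all')

print(len(toy), len(any_dropped), len(all_dropped))
```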

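inplace=True makes sure the rows are really removed from ri itself. By default, pandas methods such as dropna() return a new DataFrame and leave the original untouched, which guards against deleting data by accident.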
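A short sketch of that behavior, on invented data: the first call leaves the original untouched, while the second modifies it directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({'driver_gender': ['M', np.nan, 'F']})

# Without inplace, dropna() returns a new DataFrame; toy still has 3 rows
cleaned = toy.dropna(subset=['driver_gender'])
rows_before = len(toy)

# With inplace=True, toy itself is modified
toy.dropna(subset=['driver_gender'], inplace=True)
rows_after = len(toy)

print(rows_before, len(cleaned), rows_after)
```

Reassigning the result, as in toy = toy.dropna(subset=['driver_gender']), has the same effect as inplace=True and is a common alternative.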

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Missing values before the drop
==> print(ri.isnull().sum())
    stop_date                 0
    stop_time                 0
    driver_gender          5205
    driver_race            5202
    violation_raw          5202
    violation              5202
    search_conducted          0
    search_type           88434
    stop_outcome           5202
    is_arrested            5202
    stop_duration          5202
    drugs_related_stop        0
    district                  0
    dtype: int64

# Missing values after the drop
==> print(ri.isnull().sum())
    stop_date                 0
    stop_time                 0
    driver_gender             0
    driver_race               0
    violation_raw             0
    violation                 0
    search_conducted          0
    search_type           83229
    stop_outcome              0
    is_arrested               0
    stop_duration             0
    drugs_related_stop        0
    district                  0
    dtype: int64

==> print(ri.shape)
    (86536, 13)

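4. Manipulating Data

With the missing values removed, the next step is to adjust the data so that it is in the best state to work with.

First, let's look at the current state of the data, shown here as the output of ri.dtypes. Apart from a few columns that were automatically detected as Boolean (bool), most columns default to object.

To speed up the analysis, we first need to convert the data to the correct types before we start working with the columns.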

stop_date             object
stop_time             object
driver_gender         object
driver_race           object
violation_raw         object
violation             object
search_conducted        bool
search_type           object
stop_outcome          object
is_arrested           object
stop_duration         object
drugs_related_stop      bool
district              object
dtype: object

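A. Adjusting data types

As you can see, every column except search_conducted and drugs_related_stop is an object. The data we want to analyze, however, may include integers, floats, and even categories and datetimes (DatetimeIndex).

Here is a brief introduction to the commonly used data types: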

int, float: enables mathematical operations
datetime: enables date-based attributes and methods
category: uses less memory and runs faster
bool: enables logical and mathematical operations
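The claim about category using less memory can be checked with a minimal sketch. The labels below are invented, and the exact byte counts will vary between pandas versions, so the example only compares the two totals.

```python
import pandas as pd

# Hypothetical column: a few distinct labels repeated many times
labels = pd.Series(['White', 'Black', 'Hispanic', 'Asian'] * 5000, dtype=object)

# deep=True includes the memory used by the Python strings themselves
as_object = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print(as_object, as_category)
```

A category column stores each distinct label once and keeps a small integer code per row, which is why the saving grows with the number of repeated values.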

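Now we adjust the data. Here we use astype() to convert the type. One more note: on the left-hand side of the assignment, refer to the column being overwritten with square brackets, as in ri['is_arrested']. Dot notation cannot create a new column, so brackets are the reliable habit for assignments.

# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested'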
print(ri.is_arrested.dtype)
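One caveat is worth a sketch: astype('bool') converts by truthiness. That is safe for is_arrested here because the column has no missing values left after the rows without driver_gender were dropped, but it can mislead on other columns. The Series below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Object column holding real booleans, like is_arrested after the drop
clean = pd.Series([True, False, False], dtype=object)
clean_result = clean.astype('bool').tolist()

# A leftover NaN counts as "truthy", so it silently becomes True
with_nan = pd.Series([True, False, np.nan], dtype=object)
nan_result = with_nan.astype('bool').tolist()

# Any non-empty string is truthy, so the text 'False' also becomes True
as_text = pd.Series(['True', 'False'], dtype=object)
text_result = as_text.astype('bool').tolist()

print(clean_result, nan_result, text_result)
```

Checking isnull().sum() and the actual values of a column before converting it avoids both surprises.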

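B. Working with dates and times

This part deserves its own discussion; handling time series data could easily fill several articles. Whether you are resampling to a different time interval or filtering a time range, there are many handy tools that let us take a bird's-eye view of the data or zoom in on part of it.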

    stop_date stop_time
0  2005-01-04     12:55
1  2005-01-23     23:15
2  2005-02-17     04:15
3  2005-02-20     17:15
4  2005-02-24     01:20

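From the columns above, we can see that the data holds both the date and the time of each stop. Conveniently, pandas provides parsing tools that make working with time ranges more intuitive.

Before we start analyzing, though, we need to combine the two columns and then convert the combined column to a datetime type.

# Concatenate the text of the two columns into a single Series
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert the Series to datetime and store it in a new 'stop_datetime' column
ri['stop_datetime'] = pd.to_datetime(combined)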
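The same two steps can be tried on a couple of rows to see what the conversion makes possible. This sketch reuses the first two sample rows shown above; the DataFrame name toy is hypothetical.

```python
import pandas as pd

# Two sample rows in the same text format as the dataset
toy = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23'],
    'stop_time': ['12:55', '23:15'],
})

# Join the two text columns with a space, e.g. '2005-01-04 12:55'
combined = toy['stop_date'].str.cat(toy['stop_time'], sep=' ')

# Parse the text into datetime values
toy['stop_datetime'] = pd.to_datetime(combined)

# The .dt accessor exposes date-based attributes on a datetime column
hours = toy['stop_datetime'].dt.hour.tolist()
years = toy['stop_datetime'].dt.year.tolist()
print(hours, years)
```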

C. Setting the index

This step moves the datetime data into the index, which makes the later analysis easier. We use set_index to do this.

# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index)

# Examine the columns
print(ri.columns)

INDEX:
DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
               dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)

COLUMNS:
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',         
       'violation_raw', 'violation', 'search_conducted', 
       'search_type','stop_outcome', 'is_arrested', 'stop_duration', 
       'drugs_related_stop', 'district'],
        dtype='object')

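Afterwards, we use ri.index to check that the index was converted correctly. It is now a DatetimeIndex!

Conclusion

Although this article looks a little long, it really only puts three ideas into practice:

Checking the data
Cleaning the data
Organizing the data

Each of these ideas can be applied in many more ways. This article covers only the most basic approaches, along with material you may want to refer to when trying it yourself. If this dataset interests you, you can download it from the link at the start of the article and work through it on your own!

I hope this article helps readers who feel lost when handed a big pile of data (I still do too xD). If you have any thoughts or suggestions, feel free to leave a comment!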
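As a small sketch of why a DatetimeIndex is convenient, the example below selects rows by month and groups by an attribute of the index. The timestamps come from the sample rows shown earlier, while the is_arrested values are invented.

```python
import pandas as pd

# Hypothetical toy data; the is_arrested values are invented
toy = pd.DataFrame(
    {'is_arrested': [False, True, False, False]},
    index=pd.to_datetime([
        '2005-01-04 12:55', '2005-01-23 23:15',
        '2005-02-17 04:15', '2005-02-20 17:15',
    ]),
)
toy.index.name = 'stop_datetime'

# Partial-string indexing: select every row from February 2005
february = toy.loc['2005-02']
print(len(february))

# Group by an attribute of the index, here the calendar month
arrest_rate_by_month = toy.groupby(toy.index.month)['is_arrested'].mean()
print(arrest_rate_by_month)
```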
