Case Study for Data Mining using SQL — Analysis of USA Homicide Reports in USA

Harinath Selvaraj
coding&stuff
Published in
5 min readOct 14, 2018

Hello All! Data Mining is all about finding the tiny details in the data that we’re looking for.

In this article, we’re going to do a case study of Homicide cases that were reported in the years — 1980 to 2014.

Data source can be downloaded at https://www.kaggle.com/murderaccountability/homicide-reports

The Murder Accountability Project is the most complete database of homicides in the United States currently available. This dataset includes murders from the FBI’s Supplementary Homicide Report from 1976 to the present and Freedom of Information Act data on more than 22,000 homicides that were not reported to the Justice Department. This dataset includes the age, race, sex, ethnicity of victims and perpetrators, in addition to the relationship between the victim and perpetrator and weapon used.

The data can be downloaded from the below link —

Step 1 — Download the data and extract the .CSV File

Step 2 — Check the size of each field. It is present in the columns section of the above kaggle link. However it says only as “Numeric” or “String”. We need to check the max length of each values to define the length of each column. We can use python libraries — Numpy and Pandas for the same.

If you have not installed & worked with Python before, here is the link —

Step 3 — Install Numpy and Pandas by executing the below commands from terminal

conda install -c anaconda numpy 
conda install -c anaconda pandas

Step 4 — Read the .CSV file and display first 5 lines of data to check..

import pandas as pd 
ipData = pd.read_csv("database.csv")
ipData.head()

Step 5 — Check the maximum length of the values in each column.

Note: ”ipData” in the below code is the input data variable from above step

measurer = np.vectorize(len)
res = measurer(ipData.values.astype(str)).max(axis=0)
print(res)
#Result
[ 6 7 81 16 31 20 4 9 3 26 3 7 3 29 12 7 2 29 12 20 13 2 2 4]

Step 6 — Now you have obtained the max lengths of each of the fields. The next step is to create the table. Below is the DDL for the same,

CREATE TABLE MURDER_ACCOUNTABILITY(
Record_ID INT(6),
Agency_Code VARCHAR(7),
Agency_Name VARCHAR(81),
Agency_Type VARCHAR(16),
City VARCHAR(31),
State VARCHAR(20),
Year INT(4),
Month VARCHAR(9),
Incident INT(3),
Crime_Type VARCHAR(26),
Crime_Solved VARCHAR(3),
Victim_Sex VARCHAR(7),
Victim_Age INT(3),
Victim_Race VARCHAR(29),
Victim_Ethnicity VARCHAR(12),
Perpetrator_Sex VARCHAR(7),
Perpetrator_Age INT(2),
Perpetrator_Race VARCHAR(29),
Perpetrator_Ethnicity VARCHAR(12),
Relationship VARCHAR(20),
Weapon VARCHAR(13),
Victim_Count INT(2),
Perpetrator_Count INT(2),
Record_Source VARCHAR(4)
)

Step 7 — Load the .CSV data in to table.

LOAD DATA LOCAL INFILE "database.csv"
INTO TABLE MURDER_ACCOUNTABILITY
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

Note: You might receive an error similar to the below. This error occurred because the value for “Perpetrator_Age” for row — 634667 was space instead of NULL

638454 row(s) affected, 1 warning(s): 1366 Incorrect integer value: ‘ ‘ for column ‘Perpetrator_Age’ at row 634667 Records: 638454 Deleted: 0 Skipped: 0 Warnings: 1

You can verify the same by opening the file in terminal and going to line number 634667

vi database.csv
:634667

Step 8 — Update the corresponding record (from <Space> to NULL in the database

Before that, you may need to create Primary Key so that update happens without error.

alter table MURDER_ACCOUNTABILITY add primary key(Record_ID);

Update script —

update 
MURDER_ACCOUNTABILITY
set
Perpetrator_Age = NULL where
Record_ID = 634667;

Data Mining

Now that we have loaded the data, we need to analyse to find answers for the most common questions that we have on the data.

Question 1

Is there an increase in the Percentage of Unsolved Murders in the current year compared to last year?

Answer

Yes, by 0.179 %

-- YEAR 2013
select
(select count(1) from
MURDER_ACCOUNTABILITY
where
Year = 2013 and Crime_Solved = 'No')
/
(select count(1) from
MURDER_ACCOUNTABILITY
where
Year = 2013)
* 100 as unsolved_crime_percentage
from dual;
# 30.3911 %
-- YEAR 2014select
(select count(1) from
MURDER_ACCOUNTABILITY
where
Year = 2014 and Crime_Solved = 'No')
/
(select count(1) from
MURDER_ACCOUNTABILITY
where
Year = 2014)
* 100 as unsolved_crime_percentage
from dual;
# 30.5701 %

Question 2

Is black people targeted more than White in 2011?

Answer

Yes

select Victim_Race, count(1) from MURDER_ACCOUNTABILITY where
year = 2011
group by Victim_Race;
Victim_Race Count
-----------------------------------------
Black 7246
White 6955
Asian/Pacific Islander 263
Unknown 185
Native American/Alaska Native 107

Question 3

Which is the month in which maximum number of crimes committed by the people related to their profession.

Hint — It needs to have both Employee and Employer for “Relationship” field.

Answer

select month, count(*) as cnt from MURDER_ACCOUNTABILITY where Relationship like 'Employe%'
group by month order by count(*) desc limit 1;
month cnt
-------------------------------
July 96

Question 4

Which known relationship of Victim to the Killer has most crimes ?

Answer

select Relationship, count(1) as cnt from MURDER_ACCOUNTABILITY
where year = 2014 and Relationship not in ('Unknown','Stranger')
group by Relationship
order by count(1) desc limit 1;
Relationship cnt
-------------------------------
Acquaintance 1821

Question 5

Which weapons are less used in 2014 compared to last year?

Answer

select weapon, cnt_2014, cnt_2013 from 
(
select weapon, sum(cnt_2014) as cnt_2014, sum(cnt_2013) as cnt_2013 from
(
select weapon, year, count(1) as cnt_2014, null as cnt_2013 from MURDER_ACCOUNTABILITY
where year = 2014
group by year, weapon
union allselect weapon, year, null as cnt_2014, count(1) as cnt_2013 from MURDER_ACCOUNTABILITY
where year = 2013
group by year, weapon
) as tb
group by weapon
) as tb1
where
cnt_2013 > cnt_2014
weapon cnt_2014 cnt_2013
-------------------------------------
Blunt Object 1291 1292
Fire 80 107
Gun 100 127
Handgun 6594 6733
Poison 10 15
Rifle 350 386
Shotgun 305 348

Question 6

Show Top 3 states which has highest murder crimes.

Answer

select City, count(*) from MURDER_ACCOUNTABILITY
group by City
order by count(*) desc limit 3;
Los Angeles
New York
Cook

Question 7

In which age do people most likely commit crime?

Answer

20 years

select Perpetrator_Age, count(*) from MURDER_ACCOUNTABILITY
group by Perpetrator_Age
order by count(*) desc limit 2;
-- 20 years

Question 8

What years were homicide crimes the highest in?

Answer

1993

select year, count(*) from MURDER_ACCOUNTABILITY
group by year
order by count(*) desc limit 1;

Question 9

Murder crimes are highest during what months of the year?

Answer

July

select month, count(*) from MURDER_ACCOUNTABILITY
group by month
order by count(*) desc limit 1;

Question 10

What are the least agencies which reported crime?

Answer

select Agency_Type, count(*) from MURDER_ACCOUNTABILITY
group by Agency_Type
order by count(*) asc limit 3;
Agency Type Count of Crimes
-----------------------------------
Tribal Police 54
Regional Police 235
Special Police 2889

Thanks for reading! Let me know what you think in the comments. Please give me a clap if you like my post! 😃

--

--