Los Angeles Crime Data Analysis Using Pandas
While studying data science with Pandas, I came across discussions about crime analysis in major cities around the world. After some research to find open data, I decided to explore data for the city of Los Angeles. Located in Southern California, it is the second most populous city in the United States (behind only New York) and the center of the film and television industry.
Pandas is one of the most popular Python libraries for data science. It is a powerful data analysis library with many tools and methods for data manipulation.
About the Dataset
The dataset presented below is available on Los Angeles Open Data. For this analysis I used the dataset “Crime Data from 2020 to Present”, which covers crime incidents in Los Angeles in 2020 and 2021. The original file has 28 columns and 220405 rows. The file was pre-processed in a Jupyter Notebook to remove some rows and columns that are not used in the analysis.
Below are all the variables in the dataset, each followed by its description:
DR_NO
- Division of Records Number: Official file number made up of a 2-digit year, area ID, and 5 digits.
DATE OCC
- Date of crime occurrence (YYYY-MM-DD).
AREA
- The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
AREA NAME
- The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for.
Rpt Dist No
- Code that represents a sub-area within a Geographic Area.
Crm Cd
- Indicates the crime committed.
Crm Cd Desc
- Defines the Crime Code provided.
Vict Age
- Indicates the age of the victim.
Vict Sex
- F: Female, M: Male, X: Unknown.
Vict Descent
- Descent Code: A - Other Asian, B - Black, C - Chinese, D - Cambodian, F - Filipino, G - Guamanian, H - Hispanic/Latin/Mexican, I - American Indian/Alaskan Native, J - Japanese, K - Korean, L - Laotian, O - Other, P - Pacific Islander, S - Samoan, U - Hawaiian, V - Vietnamese, W - White, X - Unknown, Z - Asian Indian.
Premis Cd
- The type of structure, vehicle, or location where the crime took place.
Premis Desc
- Defines the Premise Code provided.
Weapon Used Cd
- The type of weapon used in the crime.
Weapon Desc
- Defines the Weapon Used Code provided.
LOCATION
- Street address of crime incident rounded to the nearest hundred block to maintain anonymity.
LAT
- Latitude Coordinate.
LON
- Longitude Coordinate.
Importing Libraries and Creating the Data Frame
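The loading step is not reproduced in this text, so here is a minimal sketch of what it might look like. The helper name `load_crime_data` and the tiny in-memory sample are assumptions for illustration only. In practice the source would be the pre-processed CSV file, whose name is not given here.

```python
import io

import pandas as pd


def load_crime_data(source):
    """Read the crime CSV and parse the occurrence date as a datetime."""
    crime_df = pd.read_csv(source)
    # 'DATE OCC' is described as YYYY-MM-DD, so convert it to a datetime column.
    crime_df['DATE OCC'] = pd.to_datetime(crime_df['DATE OCC'])
    return crime_df


# Tiny in-memory sample standing in for the real (pre-processed) file.
sample_csv = io.StringIO(
    "DR_NO,DATE OCC,AREA NAME,Vict Age\n"
    "1,2020-01-08,Central,36\n"
    "2,2021-03-15,Pacific,25\n"
)
sample_df = load_crime_data(sample_csv)
print(sample_df.dtypes)
```

Passing a file path instead of the `StringIO` object would read the data from disk in the same way.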
Preliminary Analysis
Before diving into the data frame, we need to check its header, shape, variable types, column names, and the percentage of missing values.
Data Frame Volume and Data Types
print('Rows:\t{}'.format(df.shape[0]))
print('Variables:\t{}'.format(df.shape[1]))
df.dtypes

Rows: 220405
Variables: 17

DR_NO int64
DATE OCC datetime64[ns]
AREA int64
AREA NAME object
Rpt Dist No int64
Crm Cd int64
Crm Cd Desc object
Vict Age int64
Vict Sex object
Vict Descent object
Premis Cd float64
Premis Desc object
Weapon Used Cd float64
Weapon Desc object
LOCATION object
LAT float64
LON float64
Missing Values Percentage
(df.isnull().sum()).sort_values(ascending=False) / df.shape[0]

Weapon Desc 0.631528
Weapon Used Cd 0.631528
Vict Descent 0.131826
Vict Sex 0.131807
Premis Desc 0.000368
Premis Cd 0.000014
Crm Cd 0.000000
DATE OCC 0.000000
AREA 0.000000
AREA NAME 0.000000
Rpt Dist No 0.000000
LON 0.000000
Crm Cd Desc 0.000000
Vict Age 0.000000
LAT 0.000000
LOCATION 0.000000
DR_NO 0.000000
Missing Data
As we saw in the preliminary analysis, the data frame contains missing values in the columns Weapon Desc, Weapon Used Cd, Vict Descent, Vict Sex, Premis Desc, and Premis Cd. Rows with empty values in Vict Descent, Vict Sex, and Premis Desc will be dropped from the data frame using the dropna method. The other missing values will be filled with 'N/A'. I chose to fill these values rather than remove them so that crimes lacking complete information are not discarded.
# removing blank values for 'Vict Descent', 'Vict Sex' and 'Premis Desc'
df.dropna(subset=['Vict Descent', 'Vict Sex', 'Premis Desc'], inplace=True)
# adding the "N/A" text on blank values for 'Weapon Desc' and 'Weapon Used Cd'
# (fillna returns a new data frame, so the result is assigned back to df)
df = df.fillna(value='N/A')
Statistical Information About Crime Data in Los Angeles
After removing and cleaning some rows, we can move on to the statistical analysis of the data frame. Using simple functions, it is possible to extract a statistical summary, the most frequent crimes, crimes by month, and even charts of the information.
# checking the statistical data for each column
df.describe()
Looking at the summary above, we can check the count, minimum, maximum, percentiles, mean, and standard deviation. The Vict Age column has an error: the minimum victim age is shown as -1. We can easily remove such values using the drop method with a condition that selects the rows to remove. It is important to note that some location fields with missing data are recorded as (0°, 0°), and address fields are only provided to the nearest hundred block to maintain privacy.
# removing the values below one for the "Vict Age" column
df.drop(df[df['Vict Age'] < 1].index, axis=0, inplace=True)
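The (0°, 0°) placeholder mentioned above can be checked in a similar way. The sketch below is not part of the original notebook. The helper name `flag_zero_coordinates` and the sample coordinates are made up for illustration.

```python
import pandas as pd


def flag_zero_coordinates(frame):
    """Return a boolean Series that is True where both LAT and LON equal 0."""
    return (frame['LAT'] == 0) & (frame['LON'] == 0)


# Hypothetical sample: the second row has the (0, 0) placeholder.
coords = pd.DataFrame({
    'LAT': [34.05, 0.0, 33.99],
    'LON': [-118.25, 0.0, -118.47],
})
missing_location = flag_zero_coordinates(coords)
print(missing_location.sum())  # number of rows without a usable location

# Keep only rows with real coordinates, e.g. before plotting a map.
located = coords[~missing_location]
```

Whether to drop these rows or only exclude them from map-based views depends on the analysis. For counts by crime type or area they can still be useful.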
Now we will split the analysis into two parts. First, we will focus on people, looking at victims by age, sex, and descent. After that, the analysis will focus on broader topics such as the most frequent crimes, the areas with the most incidents, and the weapons used.
Victims Analysis
The results for victims by age were very close. Looking at the top 10 entries, we can see a range between 25 and 35 years. This result is consistent with the age distribution of Los Angeles residents. According to the Census Reporter, people aged 20–29 and 30–39 represent 17% and 16% of the population, respectively.
- 30 Years, about 3.1%
- 29 Years, about 3.0%
- 28 Years, about 2.9%
- 35 Years, about 2.8%
- 31 Years, about 2.8%
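To compare the victim ages with ten-year census brackets such as 20–29 and 30–39, the ages can be grouped with `pd.cut`. This is only a sketch. The function name `share_by_age_bracket`, the bracket edges, and the sample ages are assumptions, not values from the original notebook.

```python
import pandas as pd


def share_by_age_bracket(ages):
    """Return the share of victims in each ten-year age bracket."""
    # Edges 0, 10, ..., 80, inf give nine left-closed intervals: [0, 10), [10, 20), ...
    bins = list(range(0, 90, 10)) + [float('inf')]
    labels = ['0-9', '10-19', '20-29', '30-39', '40-49',
              '50-59', '60-69', '70-79', '80+']
    brackets = pd.cut(ages, bins=bins, labels=labels, right=False)
    return brackets.value_counts(normalize=True).sort_index()


# Hypothetical ages, used only to show the shape of the result.
sample_ages = pd.Series([25, 29, 30, 31, 35, 42, 67, 28])
age_shares = share_by_age_bracket(sample_ages)
print(age_shares)
```

On the real data this would be called with `df['Vict Age']` after the invalid ages have been removed.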
The Vict Sex column presented four different categories in the data frame. According to the column description on Los Angeles Open Data, there are only three types: Female (F), Male (M), and Unknown (X). Therefore, we will not consider the values shown as H in the results.
- Male (M), about 51.9%
- Female (F), about 47.1%
- Unknown (X), about 1.04%
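One possible way to leave out the undocumented H value before computing the shares is to filter with `isin`. This is a sketch under that assumption. The names `VALID_SEX_CODES` and `sex_distribution` and the sample values are hypothetical.

```python
import pandas as pd

# The three codes documented in the dataset description.
VALID_SEX_CODES = ['F', 'M', 'X']


def sex_distribution(frame):
    """Return the share of each documented 'Vict Sex' code, ignoring other values."""
    valid = frame[frame['Vict Sex'].isin(VALID_SEX_CODES)]
    return valid['Vict Sex'].value_counts(normalize=True)


# Hypothetical sample containing one undocumented 'H' entry.
sex_sample = pd.DataFrame({'Vict Sex': ['M', 'F', 'M', 'H', 'X', 'F', 'M', 'F']})
sex_shares = sex_distribution(sex_sample)
print(sex_shares)
```

Because the filter is applied first, the shares are computed over the documented categories only.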
The city of Los Angeles was founded by Spanish settlers and was later part of Mexican territory. After the Treaty of Guadalupe Hidalgo, the city and the rest of California became part of the United States. As a result of this process, along with other events such as the Gold Rush and oil extraction in California, Los Angeles inherited enormous ethnic diversity. This mix is very evident and makes racial conflicts more intense than in other cities.
The results of the graph above illustrate the diversity present in the city. The percentages of crime victims by descent are as follows:
- Hispanic/Latin/Mexican (H), about 40.1%
- White (W), about 26.1%
- Black (B), about 18.7%
- Other (O), about 8.97%
- Other Asian (A), about 2.88%
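Readable labels like the ones in this list can be produced by mapping the single-letter descent codes to the descriptions given in the dataset section. The sketch below assumes that approach. `DESCENT_LABELS`, `descent_distribution`, and the sample codes are illustrative names and values.

```python
import pandas as pd

# Descent codes and descriptions as listed in the dataset documentation.
DESCENT_LABELS = {
    'A': 'Other Asian', 'B': 'Black', 'C': 'Chinese', 'D': 'Cambodian',
    'F': 'Filipino', 'G': 'Guamanian', 'H': 'Hispanic/Latin/Mexican',
    'I': 'American Indian/Alaskan Native', 'J': 'Japanese', 'K': 'Korean',
    'L': 'Laotian', 'O': 'Other', 'P': 'Pacific Islander', 'S': 'Samoan',
    'U': 'Hawaiian', 'V': 'Vietnamese', 'W': 'White', 'X': 'Unknown',
    'Z': 'Asian Indian',
}


def descent_distribution(codes):
    """Map descent codes to labels and return the share of each label."""
    # Codes that are not in the documentation end up under one catch-all label.
    labels = codes.map(DESCENT_LABELS).fillna('Unrecognized code')
    return labels.value_counts(normalize=True)


# Hypothetical sample of codes.
descent_sample = pd.Series(['H', 'W', 'H', 'B', 'O', 'H', 'W', 'A'])
descent_shares = descent_distribution(descent_sample)
print(descent_shares)
```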
Crime Analysis
Regarding the crimes committed in the city of Los Angeles, five types stand out by number of occurrences. Among the crimes committed between 2020 and 2021, the most frequent are listed in the chart below.
The results above, as percentages, are:
- Battery — Simple Assault, about 10.9%
- Burglary from Vehicle, about 8.39%
- Assault With Deadly Weapon, about 7.54%
- Intimate Partner, about 7.28%
- Vandalism, about 6.44%
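A ranking like this can be computed with a small helper built on `value_counts`. This is a sketch, not the original notebook code. The helper name `top_share` and the sample descriptions are assumptions.

```python
import pandas as pd


def top_share(frame, column, n=5):
    """Return the n most frequent values of `column` as percentages of all rows."""
    shares = frame[column].value_counts(normalize=True).head(n) * 100
    return shares.round(2)


# Hypothetical sample with three crime descriptions.
crime_sample = pd.DataFrame({
    'Crm Cd Desc': ['BATTERY - SIMPLE ASSAULT'] * 3
                   + ['BURGLARY FROM VEHICLE'] * 2
                   + ['VANDALISM'],
})
top_crimes = top_share(crime_sample, 'Crm Cd Desc', n=3)
print(top_crimes)
```

The same helper could be applied to other categorical columns, for example `top_share(df, 'AREA NAME')`.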
The Los Angeles Police Department (LAPD) divides the city into 21 geographic areas, each served by a community police station that provides general information and assistance. Keep in mind that the address fields are not exact, because the LAPD records them to the nearest hundred block to maintain privacy. According to the data frame, the areas where the most crimes were committed are:
- 77th Street Area, about 6.76%
- Southwest Area, about 6.28%
- Central Area, about 5.97%
- Pacific Area, about 5.54%
- Southeast Area, about 5.46%
The laws on firearms in Los Angeles allow you to own a gun, but you cannot carry it unless you have a permit. California has some of the strictest gun control laws in the country, and they have become stricter in response to school incidents and mass shootings.
The graph results, as percentages, are:
- Strong-Arm, about 24.9%
- Unknown Weapon, about 4.54%
- Verbal Threat, about 3.32%
- Hand Gun, about 2.43%
- Knives, about 0.98%
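Since about 63% of the rows have no weapon recorded, the percentages depend heavily on whether those rows are counted in the denominator. The sketch below shows both options. It assumes missing weapons are still stored as NaN (if they were filled with 'N/A', that text would simply appear as its own category). Which denominator the figures above use is not stated here, and `weapon_share` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def weapon_share(weapons, include_missing=True):
    """Return the share of each weapon, optionally counting rows with no weapon."""
    return weapons.value_counts(normalize=True, dropna=not include_missing)


# Hypothetical sample: five of eight incidents have no weapon recorded.
weapon_sample = pd.Series([
    'STRONG-ARM', np.nan, np.nan, 'HAND GUN',
    'STRONG-ARM', np.nan, np.nan, np.nan,
])
share_of_all_rows = weapon_share(weapon_sample, include_missing=True)
share_of_known = weapon_share(weapon_sample, include_missing=False)
print(share_of_all_rows)
print(share_of_known)
```

In this sample, Strong-Arm is 25% of all incidents but about 67% of incidents with a recorded weapon, so it is worth stating the denominator next to any figure.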
Conclusion
Data science gives us a broad view of many topics. By analyzing past and present data, it is possible to extract relevant information for decision-making. In the specific case of public safety, it is possible to identify the places with the most crime incidents, strengthen gun control, plan strategies to reduce violence, and improve security for local residents.
Thanks For Reading!
Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. If you want to check the original file, please use the following link.
LinkedIn: Fábio Rodrigues |Github: fabiodotcom