Los Angeles Crime Data Analysis Using Pandas

Fabio Rodrigues
Published in Analytics Vidhya
7 min read · Mar 9, 2021
Photo by Luca Micheli on Unsplash

While studying data science with Pandas, I came across discussions about crime analysis in large cities around the world. After some research to find open data, I decided to explore the data for the city of Los Angeles. Located in Southern California, it is the second most populous city in the United States (behind only New York) and the center of the film and television industry.

Pandas is one of the most popular Python libraries for data science. It is a powerful data analysis library with many tools and methods for data manipulation.

About the Dataset

The dataset presented below is available on Los Angeles Open Data. For this analysis I used the dataset “Crime Data from 2020 to Present”, which covers crime incidents in Los Angeles in 2020 and 2021. The original file has 28 columns and 220405 rows. The file was pre-processed in Jupyter Notebook to remove some row values and the columns that are not used in the analysis.

Below are all the variables in the dataset, each followed by its description:

  • DR_NO - Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits.
  • DATE OCC - Date of crime occurrence (YYYY-MM-DD)
  • AREA - The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
  • AREA NAME - The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for.
  • Rpt Dist No - Code that represents a sub-area within a Geographic Area.
  • Crm Cd - Indicates the crime committed.
  • Crm Cd Desc - Defines the Crime Code provided.
  • Vict Age - Indicates the age of the victim.
  • Vict Sex - F: Female M: Male X: Unknown
  • Vict Descent - Descent Code: A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - White X - Unknown Z - Asian Indian
  • Premis Cd - The type of structure, vehicle, or location where the crime took place.
  • Premis Desc - Defines the Premise Code provided.
  • Weapon Used Cd - The type of weapon used in the crime.
  • Weapon Desc - Defines the Weapon Used Code provided.
  • LOCATION - Street address of crime incident rounded to the nearest hundred block to maintain anonymity.
  • LAT - Latitude Coordinate.
  • LON - Longitude Coordinate.

Importing Libraries and Creating the Data Frame
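The original code for this step is not shown here, so the following is only a minimal sketch of how the file could be loaded. The helper name `load_crime_data`, the column subset, and the in-memory sample are assumptions made for illustration; with the real download you would pass the path of the CSV file instead.

```python
import io

import pandas as pd

# Hypothetical column subset; the names follow the variable list above.
COLUMNS = ["DR_NO", "DATE OCC", "AREA", "AREA NAME", "Vict Age", "Vict Sex"]


def load_crime_data(source):
    """Read the crime CSV from a path or file-like object into a DataFrame."""
    # usecols drops the columns we do not need; parse_dates converts the
    # occurrence date from text to a datetime column.
    return pd.read_csv(source, usecols=COLUMNS, parse_dates=["DATE OCC"])


# Tiny in-memory stand-in for the real file (made-up rows, for the demo only).
sample_csv = io.StringIO(
    "DR_NO,DATE OCC,AREA,AREA NAME,Vict Age,Vict Sex,Extra\n"
    "200100001,2020-01-05,1,Central,30,M,x\n"
    "200200002,2020-02-11,2,Rampart,28,F,y\n"
)

df = load_crime_data(sample_csv)
print(df.head())
```

Selecting the columns at load time is one way to get the reduced 17-column frame without a separate clean-up step.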

Preliminary Analysis

Before diving deep into the data frame, we need to check its header, shape, variable types, column names, and the percentage of missing values.

Data Frame Header

Data Frame Volume and Data Types

print('Rows:\t{}'.format(df.shape[0]))
print('Variables:\t{}'.format(df.shape[1]))
df.dtypes

Rows:	220405
Variables:	17
DR_NO                      int64
DATE OCC          datetime64[ns]
AREA                       int64
AREA NAME                 object
Rpt Dist No                int64
Crm Cd                     int64
Crm Cd Desc               object
Vict Age                   int64
Vict Sex                  object
Vict Descent              object
Premis Cd                float64
Premis Desc               object
Weapon Used Cd           float64
Weapon Desc               object
LOCATION                  object
LAT                      float64
LON                      float64

Missing Values Percentage

(df.isnull().sum()).sort_values(ascending=False) / df.shape[0]

Weapon Desc       0.631528
Weapon Used Cd    0.631528
Vict Descent      0.131826
Vict Sex          0.131807
Premis Desc       0.000368
Premis Cd         0.000014
Crm Cd            0.000000
DATE OCC          0.000000
AREA              0.000000
AREA NAME         0.000000
Rpt Dist No       0.000000
LON               0.000000
Crm Cd Desc       0.000000
Vict Age          0.000000
LAT               0.000000
LOCATION          0.000000
DR_NO             0.000000
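The same fractions can be obtained a little more directly, because the mean of a boolean mask is the share of True values. The sketch below uses a made-up three-column frame (`toy`) rather than the real data, only to show that both expressions agree.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to illustrate the calculation.
toy = pd.DataFrame({
    "Weapon Desc": ["HAND GUN", np.nan, np.nan, np.nan],
    "Vict Sex": ["M", "F", np.nan, "F"],
    "AREA": [1, 2, 3, 4],
})

# isna() gives True/False per cell; mean() per column is the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Equivalent to the expression used in the article.
same_result = toy.isnull().sum().sort_values(ascending=False) / toy.shape[0]

print(missing_share)
```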

Missing Data

As we saw in the preliminary analysis, the data frame contains missing values in the columns Weapon Desc, Weapon Used Cd, Vict Descent, Vict Sex, Premis Desc, and Premis Cd. Rows with empty values in Vict Descent, Vict Sex, and Premis Desc will be dropped from the data frame using the dropna method. The other missing values will be filled with 'N/A'. I chose to fill these values instead of removing them so that crimes lacking some information are not discarded.

# removing blank values for 'Vict Descent', 'Vict Sex' and 'Premis Desc'
df.dropna(subset=['Vict Descent', 'Vict Sex', 'Premis Desc'], inplace=True)
# adding the "N/A" text on blank values for 'Weapon Desc' and 'Weapon Used Cd'
# (inplace=True is needed, otherwise the filled copy is discarded)
df.fillna(value='N/A', inplace=True)
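Filling every remaining gap with a string also turns numeric columns such as Weapon Used Cd into text columns. As an alternative sketch (not the approach used in this article), the fill can be limited to the description column by passing a dictionary to fillna, which leaves the numeric code as NaN. The rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to illustrate the two cleaning steps.
toy = pd.DataFrame({
    "Vict Sex": ["M", np.nan, "F"],
    "Vict Descent": ["H", "W", "B"],
    "Premis Desc": ["STREET", "STREET", "PARKING LOT"],
    "Weapon Used Cd": [400.0, 102.0, np.nan],
    "Weapon Desc": ["STRONG-ARM", "HAND GUN", np.nan],
})

# Drop rows that lack victim or premise information.
cleaned = toy.dropna(subset=["Vict Descent", "Vict Sex", "Premis Desc"])

# Fill only the text column; the numeric code keeps its float dtype.
cleaned = cleaned.fillna({"Weapon Desc": "N/A"})

print(cleaned)
print(cleaned.dtypes)
```

Reassigning the result instead of using inplace=True makes it harder to lose the filled values by accident.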

Statistical Information About Crime Data in Los Angeles

After removing and cleaning some rows, we can move on to the statistical analysis of the data frame. Using simple functions, it is possible to extract the statistical summary, the most frequent crimes committed, crimes by month, and even charts of the information.

# checking the statistical data for each column
df.describe()
Statistical Summary

Looking at the summary above, we can check the count, minimum, maximum, percentiles, mean, and standard deviation. The Vict Age column has an error: the minimum victim age is -1. We can easily remove such values with the drop function by setting a rule that selects the rows to remove. It is important to point out that some location fields with missing data are recorded as (0°, 0°), and address fields are only provided to the nearest hundred block to maintain privacy.

# removing the values below one for the "Vict Age" column
df.drop(df[df['Vict Age'] < 1].index, axis=0, inplace=True)
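An equivalent way to express the same rule is a boolean mask that keeps the rows we want instead of dropping the ones we do not. This is a small sketch on made-up ages; note that, like the drop call above, it removes ages of 0 as well as -1.

```python
import pandas as pd

# Made-up ages, including the invalid values -1 and 0.
toy = pd.DataFrame({"Vict Age": [-1, 0, 1, 30, 45]})

# Keep only rows where the age is at least 1, then renumber the index.
valid = toy[toy["Vict Age"] >= 1].reset_index(drop=True)

print(valid)
```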

Now we will split the analysis into two parts. First, we will focus on people, looking at victims by age, sex, and descent. After that, the analysis will turn to broader topics such as the most frequent crimes, the areas with the most incidents, and the weapons used.

Victims Analysis

The results for victims by age were very close to each other. Looking at the top 10 entries, they fall in a range between 25 and 35 years. This result is consistent with the age distribution of Los Angeles residents: according to the Census Reporter, people aged 20–29 and 30–39 represent 17% and 16% of the population, respectively.

Results of Victims by Age, image by author
  • 30 Years, about 3.1%
  • 29 Years, about 3.0%
  • 28 Years, about 2.9%
  • 35 Years, about 2.8%
  • 31 Years, about 2.8%
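Percentages like the ones above can be computed with value_counts(normalize=True). The sketch below also groups ages into ten-year brackets with pd.cut, which makes the comparison with the census brackets easier. The ages and the bracket edges are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Made-up victim ages, for illustration only.
ages = pd.Series([30, 30, 30, 29, 29, 28, 35, 31, 52, 61], name="Vict Age")

# Share of each individual age, as a percentage of all victims.
age_share = ages.value_counts(normalize=True).mul(100).round(1)

# Hypothetical ten-year brackets; each interval includes its right edge.
bins = [0, 9, 19, 29, 39, 49, 59, 69, 120]
labels = ["1-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
brackets = pd.cut(ages, bins=bins, labels=labels)
bracket_share = brackets.value_counts(normalize=True).mul(100).round(1)

print(age_share.head(5))
print(bracket_share)
```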

The Vict Sex column shows four different categories in the data frame. According to the column description on Los Angeles Open Data, there are only three: Female (F), Male (M), and Unknown (X). Therefore, we will not consider the values shown as H in the results.

Results of Victims by Sex, image by author
  • Male (M), about 51.9%
  • Female (F), about 47.1%
  • Unknown (X), about 1.04%
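One way to leave out the undocumented H value is to keep only the three documented codes with isin before computing the shares. This is a sketch on made-up values, not the exact code behind the chart.

```python
import pandas as pd

# Made-up values, including one undocumented "H" entry.
sex = pd.Series(["M", "F", "M", "X", "H", "F", "M"], name="Vict Sex")

# Keep only the codes documented by Los Angeles Open Data.
documented = sex[sex.isin(["F", "M", "X"])]

# Percentage of each documented category.
sex_share = documented.value_counts(normalize=True).mul(100).round(1)

print(sex_share)
```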

The city of Los Angeles was founded by Spanish settlers and was later part of Mexican territory. After the Treaty of Guadalupe Hidalgo, the city and the whole state of California were incorporated into the United States. As a result of this process, and of other events such as the gold rush and oil extraction in California, Los Angeles inherited enormous ethnic diversity. This mix is very evident and, in my view, makes racial conflicts more intense than in other cities.

Results of Victims by Descent, image by author

The results in the graph above illustrate the diversity present in the city. The percentages of victims by descent are as follows:

  • Hispanic/Latin/Mexican (H), about 40.1%
  • White (W), about 26.1%
  • Black (B), about 18.7%
  • Other (O), about 8.97%
  • Other Asian (A), about 2.88%
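The Vict Descent column stores one-letter codes, so a lookup dictionary makes the result easier to read. The sketch below maps a few codes from the variable list to their labels; the sample values and the "Unmapped code" fallback are assumptions for illustration.

```python
import pandas as pd

# Partial lookup taken from the descent codes listed in the dataset description.
DESCENT_LABELS = {
    "H": "Hispanic/Latin/Mexican",
    "W": "White",
    "B": "Black",
    "O": "Other",
    "A": "Other Asian",
}

# Made-up codes, for illustration only.
codes = pd.Series(["H", "W", "H", "B", "O", "H", "A", "W"], name="Vict Descent")

# map() returns NaN for codes missing from the dictionary; label them explicitly.
labels = codes.map(DESCENT_LABELS).fillna("Unmapped code")

descent_share = labels.value_counts(normalize=True).mul(100).round(1)
print(descent_share)
```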

Crime Analysis

Among the crimes committed in the city of Los Angeles between 2020 and 2021, five types stand out for their number of occurrences. The most frequent are listed in the chart below.

Most Frequent Crimes, image by author

The results above, as percentages, are:

  • Battery — Simple Assault, about 10.9%
  • Burglary from Vehicle, about 8.39%
  • Assault With Deadly Weapon, about 7.54%
  • Intimate Partner - Simple Assault, about 7.28%
  • Vandalism, about 6.44%
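A ranking like this one can be produced with a small reusable helper, which also works for other columns such as AREA NAME. The function name `top_share` and the sample rows are hypothetical; this is a sketch, not the code used to build the chart.

```python
import pandas as pd


def top_share(frame, column, n=5):
    """Return the n most frequent values of `column` as percentages of all rows."""
    return frame[column].value_counts(normalize=True).head(n).mul(100).round(2)


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Crm Cd Desc": (
        ["BATTERY - SIMPLE ASSAULT"] * 3
        + ["BURGLARY FROM VEHICLE"] * 2
        + ["VANDALISM"]
    )
})

top_crimes = top_share(toy, "Crm Cd Desc", n=2)
print(top_crimes)
```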

The Los Angeles Police Department (LAPD) is divided into community police stations, which provide general information and assistance; there are 21 geographic areas citywide. Keep in mind that the address fields are not exact, because the LAPD records them to the nearest hundred block to maintain privacy. According to the data frame, the communities where the most crimes were committed are:

Areas with Most Crimes, image by author
  • 77th Street Area, about 6.76%
  • Southwest Area, about 6.28%
  • Central Area, about 5.97%
  • Pacific Area, about 5.54%
  • Southeast Area, about 5.46%

Gun laws in Los Angeles allow you to own a gun, but not to carry it unless you have a permit. California has some of the strictest gun control laws in the country, and they have been tightened further in response to school incidents and mass shootings.

Weapons Most Used in Crimes, image by author

The graph results, as percentages, are:

  • Strong-Arm, about 24.9%
  • Unknown Weapon, about 4.54%
  • Verbal Threat, about 3.32%
  • Hand Gun, about 2.43%
  • Knives, about 0.98%
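Because most incidents have no weapon recorded, the choice of denominator matters for these percentages. The figures above look like shares of all incidents, but that is an inference rather than something stated in the data. The sketch below shows both options on made-up values: dropna=False counts missing entries in the total, while the default ignores them.

```python
import numpy as np
import pandas as pd

# Made-up values: five of the eight incidents have no weapon recorded.
weapons = pd.Series(
    ["STRONG-ARM", np.nan, np.nan, "HAND GUN", "STRONG-ARM", np.nan, np.nan, np.nan],
    name="Weapon Desc",
)

# Share of all incidents (missing values are part of the total).
share_of_all = weapons.value_counts(normalize=True, dropna=False).mul(100)

# Share of incidents with a recorded weapon only.
share_of_known = weapons.value_counts(normalize=True).mul(100)

print(share_of_all.round(2))
print(share_of_known.round(2))
```

On this sample the same weapon accounts for a quarter of all incidents but two thirds of the incidents with a recorded weapon, so it is worth stating which base a chart uses.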

Conclusion

Data science gives us a broad view of many subjects. By analyzing past and even present data, it is possible to extract relevant information to support decisions. In the specific case of public security, it is possible to identify the places with the most crime incidents, strengthen gun control, plan strategies to decrease violence, and bring more security to local residents.

Thanks For Reading!

Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. If you want to check the original file, please use the following link.

LinkedIn: Fábio Rodrigues | GitHub: fabiodotcom

About the author: Fabio Rodrigues, undergraduate studies in Civil Engineering at PUC Minas, Poços de Caldas campus.