Python Data Storytelling

Ed Dudeiko
Nov 1 · 5 min read

I used an open-source dataset from the Miami-Dade County Open Data Hub.

I chose the Sexual Predator dataset.

import pandas as pd

sp = pd.read_csv('Sexual_Predator.csv')
print(sp.shape)
sp.head()

As you can see from the sp.shape above, the dataset has 1333 rows and 35 columns. Quite a few of the columns contain mostly NaN values.

I dropped the columns that were more than 95% NaN, along with columns holding personal information such as SNAME and MNAME:

sp = sp.drop(columns=['FNAME', 'MNAME', 'LNAME', 'SNAME', 'ADDR_TYPD',
                      'ADDRESS2', 'IMAGE_ID', 'ADDRESS', 'DOC_NBR',
                      'X_COORD', 'Y_COORD', 'ZIP4'])

Another way to drop multiple columns:

TRAN_cols = [col for col in sp.columns if col.startswith('TRAN_')]
sp = sp.drop(columns=TRAN_cols)
sp.shape
(1333, 15)  # Final dataset
sp.sample(3)
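The 95%-NaN rule above can also be applied programmatically instead of listing column names by hand. A sketch, using a toy DataFrame rather than the real dataset:

```python
import numpy as np
import pandas as pd

def drop_mostly_nan(df, threshold=0.95):
    """Drop columns whose fraction of NaN values exceeds `threshold`."""
    nan_frac = df.isna().mean()  # fraction of NaN per column
    return df.loc[:, nan_frac <= threshold]

# Toy example (not the real dataset):
df = pd.DataFrame({
    'all_nan':    [np.nan] * 4,                   # 100% NaN -> dropped
    'mostly_nan': [np.nan, np.nan, np.nan, 1.0],  # 75% NaN -> kept
    'full':       [1, 2, 3, 4],
})
cleaned = drop_mostly_nan(df)
print(list(cleaned.columns))  # → ['mostly_nan', 'full']
```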

Now I can start preparing the data for analysis. I decided to take a closer look at the DOB, HEIGHT and WEIGHT columns, and to see whether there is any difference between Predator and Offender in SUB_TYPD. I started by removing 'lbs' from the WEIGHT column and converting it from string to integer. I also renamed the column to WEIGHT(lbs).

sp.rename(columns={'WEIGHT': 'WEIGHT(lbs)'}, inplace=True)

def remove_lbs_to_int(string):
    return int(string.strip('lbs'))

Let’s check how it works

remove_lbs_to_int('190 lbs')
190
sp['WEIGHT(lbs)'] = sp['WEIGHT(lbs)'].apply(remove_lbs_to_int)
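The same conversion can be done without `apply`, using pandas' vectorized string methods. A sketch on a toy Series, assuming every entry follows the 'NNN lbs' pattern:

```python
import pandas as pd

weights = pd.Series(['190 lbs', '155 lbs', '210 lbs'])
# Drop the unit suffix, trim whitespace, then cast to integer
weights_int = weights.str.replace('lbs', '', regex=False).str.strip().astype(int)
print(weights_int.tolist())  # → [190, 155, 210]
```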

Now the HEIGHT column. I need to remove the `'` separator, replace `:` or space characters with a decimal point, and then convert the result to float for further analysis. I renamed the column to HEIGHT(f) as well.

sp['HEIGHT'] = sp['HEIGHT'].str.replace("'", '', regex=False)
sp['HEIGHT'] = sp['HEIGHT'].str.replace('[: ]', '.', regex=True)
sp['HEIGHT'] = sp['HEIGHT'].astype(float)
sp.rename(columns={'HEIGHT': 'HEIGHT(f)'}, inplace=True)
sp.dtypes
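One caveat: if the HEIGHT strings are feet-and-inches pairs (an assumption), the decimal produced by this string replacement is a sortable label rather than a true length, since 5.9 > 5.10 numerically even though 5'10" is taller than 5'9". If a genuine decimal-feet value were needed, a sketch:

```python
def height_to_feet(s):
    """Convert a feet'inches string like "5'10" to true decimal feet."""
    feet, _, inches = s.partition("'")
    return int(feet) + (int(inches) if inches.strip() else 0) / 12

print(round(height_to_feet("5'10"), 3))  # → 5.833
```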

I converted the DOB column to datetime format and kept only the year, dropping the month and day.

sp['DOB'] = pd.to_datetime(sp['DOB'], infer_datetime_format=True)
sp['DOB'] = sp['DOB'].dt.year
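If any DOB strings failed to parse, `pd.to_datetime` would raise by default; `errors='coerce'` turns bad entries into NaT instead, which then surface as NaN after `.dt.year`. A sketch on a toy Series:

```python
import pandas as pd

dob = pd.Series(['1975-03-02', '1988-11-20', 'not a date'])
# errors='coerce' maps unparseable entries to NaT instead of raising
years = pd.to_datetime(dob, errors='coerce').dt.year
print(years.tolist())  # first two parse; the bad entry becomes NaN
```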

Now I can have a look at the value_counts

sp['DOB'].value_counts()
sp['HEIGHT(f)'].value_counts()
sp['WEIGHT(lbs)'].value_counts()

Next, I separated the data into two datasets:

sp['SUB_TYPD'].value_counts()
sp_offender = sp[sp['SUB_TYPD'] != 'Predator']
sp_predator = sp[sp['SUB_TYPD'] == 'Predator']
sp_offender.shape, sp_predator.shape
((1162, 15), (171, 15))

I used describe to have a look at the two datasets.

sp_predator.describe(exclude='number')
import matplotlib.pyplot as plt
import seaborn as sns

Let’s look at some plots of the combined data (sp) first.

# matplotlib histogram
plt.hist(sp['DOB'], color='blue', edgecolor='black',
         bins=int(180/5))

# seaborn histogram
sns.distplot(sp['DOB'], hist=True, kde=False,
             bins=int(180/5), color='blue',
             hist_kws={'edgecolor': 'black'})

# add labels
plt.title('DOB Predator & Offender')
plt.xlabel('Year')
plt.ylabel('Total')
plt.grid(True)

Next, a density plot of DOB for the two sub-types, followed by t-tests on the two datasets.

# List of the two sub_types to plot
sex_predators = ['Predator', 'Offender']

# Iterate through the two sub_types
for sex_predator in sex_predators:
    # Subset to the SUB_TYPD
    subset = sp[sp['SUB_TYPD'] == sex_predator]
    # Draw the density plot
    sns.distplot(subset['DOB'], hist=False, kde=True,
                 kde_kws={'linewidth': 3},
                 label=sex_predator)

# Plot formatting
plt.legend(prop={'size': 16}, title='DOB')
plt.title('Density Plot with Predator & Offender')
plt.xlabel('Year of DOB')
plt.ylabel('Density')
plt.grid(True)
from scipy.stats import ttest_ind

# We can reject the null hypothesis of equal average DOB (p-value < 0.01)
ttest_ind(sp_offender['DOB'], sp_predator['DOB'])
Ttest_indResult(statistic=2.924801560184627, pvalue=0.003505144185552153)

# We can NOT reject the null hypothesis of equal average weight (p-value > 0.01)
ttest_ind(sp_offender['WEIGHT(lbs)'], sp_predator['WEIGHT(lbs)'])
Ttest_indResult(statistic=-0.10689820042837661, pvalue=0.9148858504526294)

# We can reject the null hypothesis of equal average height (p-value < 0.01)
ttest_ind(sp_offender['HEIGHT(f)'], sp_predator['HEIGHT(f)'])
Ttest_indResult(statistic=2.663515468741881, pvalue=0.007826203833102494)
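`ttest_ind` assumes equal variances by default; with group sizes as unbalanced as these (1162 vs 171 rows), Welch's variant via `equal_var=False` is a safer check. A sketch on synthetic stand-in data, not the real columns:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
group_a = rng.normal(loc=1968, scale=12, size=1162)  # stand-in for offender DOB
group_b = rng.normal(loc=1965, scale=12, size=171)   # stand-in for predator DOB

# equal_var=False selects Welch's t-test (no equal-variance assumption)
result = ttest_ind(group_a, group_b, equal_var=False)
print(result.statistic, result.pvalue)
```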

I used the chi-square test to explore the categorical features of the data.

from scipy.stats import chi2_contingency

categorical_features = [sp.STATUS, sp.RACE_TYPD, sp.SEX, sp.EYE_TYPD, sp.HAIR_TYPD]
crosstabs = [pd.crosstab(sp['SUB_TYPD'], feature) for feature in categorical_features]
crosstabs[0]

for crosstab in crosstabs:
    print(crosstab)
    chi2 = chi2_contingency(crosstab, correction=False)  # no Yates correction
    print('Chi-square statistic: {}'.format(chi2[0]))
    print('p-value: {}'.format(chi2[1]))
    print('\n\n')

The chi-square test confirmed that the HAIR_TYPD column differs significantly between the two SUB_TYPD groups (Offender & Predator).
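To make that kind of conclusion concrete, here is the same test run on a toy contingency table (invented counts, not the real data):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: rows = SUB_TYPD groups, columns = hair colour
crosstab = pd.DataFrame(
    {'Black': [400, 30], 'Blond': [200, 60], 'Brown': [500, 80]},
    index=['Offender', 'Predator'],
)
chi2, p, dof, expected = chi2_contingency(crosstab, correction=False)
print(p < 0.01)  # → True: hair colour and sub-type are not independent here
```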

I can use folium to map the predators' and offenders' last known locations.

import folium

locations_list = list(zip(sp_predator['Y'], sp_predator['X']))
len(locations_list)
171

folium_map = folium.Map(location=[25.761681, -80.191788], zoom_start=12)
for point in locations_list[:170]:
    folium.Marker(point).add_to(folium_map)
folium_map
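Rather than hard-coding the map center, it could be computed from the marker coordinates themselves. A sketch with toy points, assuming the Y column is latitude and X is longitude:

```python
# Toy coordinates standing in for list(zip(sp_predator['Y'], sp_predator['X']))
locations = [(25.76, -80.19), (25.80, -80.20), (25.70, -80.25)]

# Mean latitude/longitude gives a simple map center for folium.Map(location=...)
center_lat = sum(lat for lat, lon in locations) / len(locations)
center_lon = sum(lon for lat, lon in locations) / len(locations)
print(round(center_lat, 4), round(center_lon, 4))  # → 25.7533 -80.2133
```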

In the end, we found three significant differences between the SUB_TYPD groups (Offender & Predator): height, year of birth and hair type.

What would I like to do next?

link to Github
