Python Data Storytelling
In my case I used an open-source dataset from the Miami-Dade County Open Data Hub: the Sexual Predator dataset.
```python
import pandas as pd

sp = pd.read_csv('Sexual_Predator.csv')
print(sp.shape)
sp.head()
```


As you can see from `sp.shape` above, I had 1333 rows and 35 columns. Quite a few of the columns have a lot of NaN values.
I dropped the columns with more than 95% NaN values, along with columns containing personal information such as SNAME and MNAME.
```python
sp = sp.drop(columns=['FNAME', 'MNAME', 'LNAME', 'SNAME', 'ADDR_TYPD', 'ADDRESS2',
                      'IMAGE_ID', 'ADDRESS', 'DOC_NBR', 'X_COORD', 'Y_COORD', 'ZIP4'])
```
Another way to drop multiple columns at once:
```python
TRAN_cols = [col for col in sp.columns if col.startswith('TRAN_')]
sp = sp.drop(columns=TRAN_cols)

sp.shape  # (1333, 15) -- the final dataset
sp.sample(3)
```
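The 95%-NaN threshold can also be applied programmatically instead of listing columns by hand. A minimal sketch on a toy DataFrame (the column names here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'KEEP': range(100),               # fully populated
    'SPARSE': [1.0] + [np.nan] * 99,  # 99% NaN
})
# keep only the columns whose NaN fraction is at most 95%
df = df.loc[:, df.isna().mean() <= 0.95]
print(list(df.columns))  # → ['KEEP']
```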

Now I can start preparing the data for the analysis. I decided to take a closer look at the DOB, HEIGHT and WEIGHT columns, to see whether there is any difference between the Predator and Offender groups in SUB_TYPD. I started by removing 'lbs' from the WEIGHT column and converting it from string to integer. I also renamed the column to WEIGHT(lbs).
```python
sp.rename(columns={'WEIGHT': 'WEIGHT(lbs)'}, inplace=True)

def remove_lbs_to_int(string):
    return int(string.strip('lbs'))
```
Let's check how it works:
```python
remove_lbs_to_int('190 lbs')  # 190

sp['WEIGHT(lbs)'] = sp['WEIGHT(lbs)'].apply(remove_lbs_to_int)
```
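As an alternative, since `strip('lbs')` removes individual characters rather than the substring, a vectorized regex extract is a bit more robust; a sketch on made-up values:

```python
import pandas as pd

weights = pd.Series(['190 lbs', '145 lbs', '210 lbs'])
# pull out the leading digits and cast to int in one vectorized step
weights_int = weights.str.extract(r'(\d+)', expand=False).astype(int)
print(weights_int.tolist())  # → [190, 145, 210]
```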
The HEIGHT column now. I need to remove the ' and " marks and then convert it to float for further analysis. I renamed the column to HEIGHT(f) as well.
```python
sp['HEIGHT'] = sp['HEIGHT'].str.replace("'", '')
sp['HEIGHT'] = sp['HEIGHT'].str.replace('[: ]', '.', regex=True)
sp['HEIGHT'] = sp['HEIGHT'].astype(float)

sp.rename(columns={'HEIGHT': 'HEIGHT(f)'}, inplace=True)
sp.dtypes
```
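One caveat with dotted feet.inches values: as floats, 5.10 (5 ft 10 in) sorts *below* 5.9 (5 ft 9 in), so for arithmetic it can be safer to convert to total inches instead. A sketch, assuming the raw values look like `5'10`:

```python
import pandas as pd

heights = pd.Series(["5'10", "6'01", "5'06"])  # assumed raw format
parts = heights.str.split("'", expand=True).astype(int)
height_in = parts[0] * 12 + parts[1]  # total inches
print(height_in.tolist())  # → [70, 73, 66]
```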

I converted the DOB column to datetime format and kept only the year, dropping the month and day.
```python
sp['DOB'] = pd.to_datetime(sp['DOB'], infer_datetime_format=True)
sp['DOB'] = sp['DOB'].dt.year
```
Now I can have a look at the value counts:
```python
sp['DOB'].value_counts()
sp['HEIGHT(f)'].value_counts()
sp['WEIGHT(lbs)'].value_counts()
```
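The year extraction works like this on a couple of made-up dates (the real CSV's date format may differ):

```python
import pandas as pd

dob = pd.Series(['03/15/1975', '11/02/1988'])
years = pd.to_datetime(dob).dt.year
print(years.tolist())  # → [1975, 1988]
```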
Next, I separated the data into two data sets.
```python
sp['SUB_TYPD'].value_counts()

sp_offender = sp[sp['SUB_TYPD'] != 'Predator']
sp_predator = sp[sp['SUB_TYPD'] == 'Predator']

sp_offender.shape, sp_predator.shape  # ((1162, 15), (171, 15))
```
I used describe to have a look at the two data sets.
```python
sp_predator.describe(exclude='number')
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
Let's have a look at the graphics of the joint (sp) data first.
```python
# matplotlib histogram
plt.hist(sp['DOB'], color='blue', edgecolor='black',
         bins=int(180/5))

# seaborn histogram
sns.distplot(sp['DOB'], hist=True, kde=False,
             bins=int(180/5), color='blue',
             hist_kws={'edgecolor': 'black'})

# add labels
plt.title('DOB Predator & Offender')
plt.xlabel('Year')
plt.ylabel('Total')
plt.grid(True)
```


I can plot the density of the two groups and then run a t-test on them.
```python
# List of the two sub_types to plot
sex_predators = ['Predator', 'Offender']

# Iterate through the two sub_types
for sex_predator in sex_predators:
    # Subset to the SUB_TYPD
    subset = sp[sp['SUB_TYPD'] == sex_predator]
    # Draw the density plot
    sns.distplot(subset['DOB'], hist=False, kde=True,
                 kde_kws={'linewidth': 3},
                 label=sex_predator)

# Plot formatting
plt.legend(prop={'size': 16}, title='DOB')
plt.title('Density Plot with Predator & Offender')
plt.xlabel('Year of DOB')
plt.ylabel('Density')
plt.grid(True)
```

```python
from scipy.stats import ttest_ind

# We can reject the null hypothesis of equal mean DOB (p-value < 0.01)
ttest_ind(sp_offender['DOB'], sp_predator['DOB'])
# Ttest_indResult(statistic=2.924801560184627, pvalue=0.003505144185552153)

# We can NOT reject the null hypothesis of equal mean weight (p-value > 0.01)
ttest_ind(sp_offender['WEIGHT(lbs)'], sp_predator['WEIGHT(lbs)'])
# Ttest_indResult(statistic=-0.10689820042837661, pvalue=0.9148858504526294)

# We can reject the null hypothesis of equal mean height (p-value < 0.01)
ttest_ind(sp_offender['HEIGHT(f)'], sp_predator['HEIGHT(f)'])
# Ttest_indResult(statistic=2.663515468741881, pvalue=0.007826203833102494)
```
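`ttest_ind` returns the t statistic and the two-sided p-value; on toy samples whose means differ by only one unit it (rightly) fails to reject:

```python
from scipy.stats import ttest_ind

a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 6]
stat, p = ttest_ind(a, b)
print(round(stat, 2), round(p, 2))  # → -1.0 0.35
```

With very unequal group sizes (1162 vs 171 here) and possibly unequal spreads, passing `equal_var=False` switches to Welch's t-test, which is often the safer default.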
With the help of a Chi-square test I explored the categorical features of the data.
```python
from scipy.stats import chi2_contingency

categorical_features = [sp.STATUS, sp.RACE_TYPD, sp.SEX, sp.EYE_TYPD, sp.HAIR_TYPD]
crosstabs = [pd.crosstab(sp['SUB_TYPD'], feature) for feature in categorical_features]
crosstabs[0]

for crosstab in crosstabs:
    print(crosstab)
    chi2 = chi2_contingency(crosstab, correction=False)  # no Yates correction
    print('Chi-square statistic: {}'.format(chi2[0]))
    print('p-value: {}'.format(chi2[1]))
    print('\n\n')
```
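To make the unpacked values concrete, here is `chi2_contingency` on a hand-built 2x2 table where the two rows have clearly different category distributions:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10],
                  [10, 30]])
# returns (statistic, p-value, degrees of freedom, expected counts)
stat, p, dof, expected = chi2_contingency(table, correction=False)
print(stat, dof)  # → 20.0 1
```

Every expected count here is 20, so the statistic is 4 * (30-20)^2 / 20 = 20, and the tiny p-value flags a real association.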

The Chi-square test confirmed that the HAIR_TYPD column differs significantly between the SUB_TYPD groups (offender & predator).
I can use folium to see each predator's and/or offender's last known location.
```python
import folium

locations_list = list(zip(sp_predator['Y'], sp_predator['X']))
len(locations_list)  # 171

folium_map = folium.Map(location=[25.761681, -80.191788], zoom_start=12)
for point in locations_list[:170]:
    folium.Marker(point).add_to(folium_map)
folium_map
```

In the end we found three significant differences between the SUB_TYPD groups (offender & predator): height, year of birth and hair type.
What else would I like to do?
Find a sample of men's heights in Miami-Dade County and compare it to the results I got.
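That comparison could be done with a one-sample t-test against a reference mean; a sketch with made-up height values (in inches) and an assumed reference mean of 69.3 in:

```python
from scipy.stats import ttest_1samp

heights_in = [70, 68, 72, 69, 71, 70, 73, 68]  # hypothetical sample
stat, p = ttest_1samp(heights_in, popmean=69.3)
# p > 0.05 here, so this toy sample is consistent with the reference mean
```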
link to Github