Bootstrapping Case study: Profiling internet offenders

Sai Krishna Dammalapati
3 min read · Mar 5, 2024


I hope you’ve read my previous blogs on Confidence Intervals and Bootstrapping before continuing with this one.

David Glasgow, a Forensic & Clinical Psychologist from the UK, reached out to me with an application of Bootstrapping.

As part of his work, he has to profile internet offenders from their digital records. Essentially, he gets data on a suspect’s internet history: each row is a website the suspect has visited, and each website is flagged as either sexual or not. Based on this data, he has to assess whether the suspect is an internet offender. The hypothesis is that the more sexual content a suspect consumes, the greater their propensity to be an internet offender.

Ideally, if David had the entire internet history of a suspect (the population dataset), he would simply calculate the percentage of sexual content in the suspect’s web history. He could then use a threshold to profile the suspect’s interests as hebephilic, paedophilic, coercive/violent, or adult.

But the police may not succeed in retrieving the entire internet history every time; they may recover only a sample of it. Based on this sample internet history, David wants to build a Bootstrapping Confidence Interval.

I asked David if he could share this data, but given that it contains illegal websites, he was afraid it would land me in trouble. So I’ll work on dummy data based on the summary stats he shared.

Dummy Data

There is a suspect whose internet history was retrieved by the police. It consists of 780,999 URLs that the suspect visited in his lifetime. Each URL is flagged as sexual or not, and 44.67% of these URLs are of a sexual nature. We’ll create dummy data that serves as the population dataset.

import numpy as np

# Total number of URLs and the number flagged as sexual (1 = sexual content, ~44.67%)
total_rows = 780999
num_ones = 348872

# Create an array with the desired number of ones and zeros
column = np.zeros(total_rows)
column[:num_ones] = 1

# Shuffle the array to randomize the positions of ones and zeros
np.random.shuffle(column)
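
As a quick sanity check (my addition, not part of David’s workflow), the mean of this 0/1 column is just the proportion of sexual URLs, so it should come out to roughly 0.4467:

# The mean of the 0/1 column equals the fraction of sexual URLs
print(column.mean())  # ~0.4467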

Now, imagine David got access to only 78 records of the suspect’s internet history (not the entire population).

# Simulate the recovered records by drawing 78 URLs from the dummy population
sample = np.random.choice(column, size=78, replace=True)

Based on these 78 records, he wants to construct a Bootstrapping CI.

from scipy.stats import bootstrap

# scipy expects the data as a sequence of samples
sample = (sample,)

# Calculate a 95% bootstrap confidence interval for the mean
bootstrapped_percentile = bootstrap(sample, np.mean, confidence_level=0.95,
                                    random_state=111, method='percentile')
print(bootstrapped_percentile.confidence_interval)

Upon performing the bootstrapping, I get the following results:

ConfidenceInterval(low=0.3076923076923077, high=0.5256410256410257)

David also wants to know how the sample size affects the width of the Bootstrapping CI.

I repeat the above steps with sample sizes of 78, 781, and 7,810, as sketched below.
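
The original post doesn’t show the comparison code, but a minimal sketch of this experiment could look like the following (reusing the dummy population column from above):

from scipy.stats import bootstrap

for n in [78, 781, 7810]:
    # Draw a fresh sample of size n from the dummy population
    sample_n = np.random.choice(column, size=n, replace=True)
    result = bootstrap((sample_n,), np.mean, confidence_level=0.95,
                       random_state=111, method='percentile')
    ci = result.confidence_interval
    print(f"n={n}: CI=({ci.low:.3f}, {ci.high:.3f}), width={ci.high - ci.low:.3f}")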

The width of the CI shrinks as the sample size grows, as expected. This also tells us that one must be careful when profiling a suspect from a small sample.

These Confidence Intervals can then be used to profile the suspect’s interests as hebephilic, paedophilic, coercive/violent, or adult.
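
For illustration only, here is one way a CI could be compared against a threshold. The 0.40 cutoff below is a made-up placeholder, not one of David’s actual profiling criteria:

# Hypothetical threshold (placeholder value, not an actual profiling criterion)
threshold = 0.40

ci = bootstrapped_percentile.confidence_interval
if ci.low > threshold:
    print("Entire CI lies above the threshold")
elif ci.high < threshold:
    print("Entire CI lies below the threshold")
else:
    print("CI straddles the threshold: inconclusive, more data needed")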

Caveat:

1. In real life, the sample may not be random. Suppose the police extracted the sample only from the suspect’s phone. In general, people might consume more sexual content on their phones, so a CI estimated from a phone-only sample would be positively biased.

