What the Data Doesn’t Tell: The Dark Side of Data Manipulation

Ahmet Doser
Machine Learning Turkiye
Feb 8, 2024

“I see a train wreck looming…” cautioned Nobel Prize winner and “Thinking, Fast and Slow” author Daniel Kahneman in a 2012 open letter, expressing concern about the shaky foundations of priming research. In the years that followed, several researchers systematically attempted to replicate numerous highly cited priming experiments. Unfortunately, many of these replication attempts proved unsuccessful.

Knowing what is wrong is as important as knowing what is right.

The principle that identical experiments will consistently produce equivalent outcomes, regardless of who performs them, is fundamental to science’s pursuit of objective truth. The law of gravity, for instance, holds for every experimenter, and that universality is what gives it the status of objective truth.

Data snooping

Data snooping, also known as data dredging or p-hacking, is the misuse of data analysis to unearth patterns that appear statistically significant, inflating the risk of false positives. In practice, it means conducting numerous statistical tests and selectively reporting only those that yield significant results.

In 2015, John Bohannon, using the pseudonym Johannes Bohannon, authored a deliberately flawed study titled ‘Chocolate with High Cocoa Content as a Weight-Loss Accelerator.’ The purpose was to assess how the media would amplify these ‘meaningless’ findings. “Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages.

John Bohannon explains: “Here’s a dirty little science secret: if you measure a large number of things about a small number of people, you are almost guaranteed to get a ‘statistically significant’ result. Our study included 18 different measurements — weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc. — from 15 people. (One subject was dropped.) That study design is a recipe for false positives.”

P(winning) = 1 - (1 - p)^n

“With our 18 measurements, we had a 60% chance of getting some ‘significant’ result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.”

“It’s called p-hacking — fiddling with your experimental design and data to push p under 0.05 — and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it ‘works.’ Or they drop ‘outlier’ data points.”
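The 60% figure follows directly from this formula. Here is a quick sanity check in plain Python, assuming n independent tests that each have a 5% false-positive rate:

# Chance of at least one "significant" result among n independent tests,
# each with false-positive rate p: P(winning) = 1 - (1 - p)^n
p = 0.05

for n in (1, 5, 18):
    prob = 1 - (1 - p) ** n
    print(f"n = {n:2d} -> P(at least one p < 0.05) = {prob:.2f}")

# n =  1 -> P(at least one p < 0.05) = 0.05
# n =  5 -> P(at least one p < 0.05) = 0.23
# n = 18 -> P(at least one p < 0.05) = 0.60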

“If you torture the data long enough, it will confess to anything”

Ronald H. Coase

Jelly beans experiment

In this Python simulation, I will demonstrate an example of data snooping.

Part 1: Does eating jelly beans cause acne?

To explore the impact of jelly bean consumption on acne, I simulate a survey of 500 participants. The data includes two variables: ‘acne_condition,’ representing acne severity on a scale of 0 to 1, drawn from a uniform distribution U(0, 1); and ‘eating,’ a binary variable drawn from a Bernoulli distribution Bern(0.9), assuming that 90% of the population consumes jelly beans regularly.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(2)

people = 500
p_eat_jelly_bean = 0.9

# draw a sample
acne_condition = np.random.rand(people)
eating = np.random.choice(['eating', 'not eating'], people, replace=True, p=[p_eat_jelly_bean, 1 - p_eat_jelly_bean])

# create data frame
data = pd.DataFrame({'acne_condition': acne_condition, 'eating': eating})
data.head()

# Set the plot size
plt.figure(figsize=(5, 4))

# Boxplot for eating vs acne_condition; passing hue with legend=False
# avoids the deprecated palette-without-hue usage (seaborn >= 0.13)
sns.boxplot(x='eating', y='acne_condition', data=data, hue='eating',
            palette={'eating': 'yellow', 'not eating': 'blue'}, legend=False)

# Adjust the plot title
plt.title('Distribution of acne condition\nfor people eating/not eating jelly beans', loc='center')

# Set axis labels
plt.xlabel('Eating jelly beans')
plt.ylabel('Acne condition')

# Show the plot
plt.show()

We formulate the null hypothesis: ‘There is no effect of consuming jelly beans on acne condition.’ To assess this hypothesis, we conduct a two-sample t-test.

from scipy.stats import ttest_ind

# Extract the acne_condition values for each group
eating_values = data[data['eating'] == 'eating']['acne_condition']
not_eating_values = data[data['eating'] == 'not eating']['acne_condition']

# Perform t-test
t_statistic, p_value = ttest_ind(eating_values, not_eating_values)

# Display the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

With a high p-value (p = 0.95 > 0.05), we fail to reject the null hypothesis. In simpler terms, there is insufficient evidence to claim that consuming jelly beans causes acne. This aligns with expectations, since the ‘eating’ and ‘acne_condition’ variables were generated independently.

Part 2: Which jelly bean color causes acne?

Given the non-significant p-value, suggesting no clear relationship between general jelly bean consumption and acne, my colleague proposes a more nuanced analysis. To explore potential differences among jelly bean colors, we augment the dataset with color information. Assuming 20 different jelly bean colors, each consumed with equal probability, we introduce a new random variable, ‘jelly_bean_color,’ for further investigation.

np.random.seed(2)

# Create a list of colors
colors = ['red', 'grey', 'blue', 'yellow', 'orange', 'purple', 'limegreen',
          'cyan', 'brown', 'pink', 'gold', 'salmon', 'magenta',
          'peachpuff', 'tan', 'aquamarine', 'green', 'coral', 'steelblue', 'beige']

# Add color column to the data
data['jelly_bean_color'] = np.random.choice(colors, people, replace=True)
data.loc[data['eating'] == 'not eating', 'jelly_bean_color'] = np.nan

# Display the head of the modified DataFrame
data.head()

We visualize the distribution of acne conditions across various jelly bean colors.

# Set the plot size
plt.figure(figsize=(18, 5))

# Filter out rows with NA in jelly_bean_color
filtered_data = data.dropna(subset=['jelly_bean_color'])

# Boxplot for jelly_bean_color vs acne_condition; legend=False because
# the colors are already visible on the x-axis
sns.boxplot(x='jelly_bean_color', y='acne_condition', data=filtered_data,
            hue='jelly_bean_color', palette=colors, hue_order=colors, legend=False)

# Set axis labels
plt.xlabel('Jelly bean color')
plt.ylabel('Acne condition')

# Show the plot
plt.show()

To check the influence of a specific jelly bean color on acne condition, we again run a t-test, starting with red.

import pandas as pd
from scipy.stats import ttest_ind

def test_color(data, color):
    data['eating_color'] = data.apply(
        lambda row: 'yes' if row['jelly_bean_color'] == color and not pd.isna(row['jelly_bean_color']) else 'no',
        axis=1)
    p_value = ttest_ind(data[data['eating_color'] == 'yes']['acne_condition'],
                        data[data['eating_color'] == 'no']['acne_condition']).pvalue
    return p_value

p_value_red = test_color(data, 'red')
print(f"P-value for 'red': {p_value_red}")

Red turns out not to be significant. We try blue next. Same result!

p_value_blue = test_color(data, 'blue')
print(f"P-value for 'blue': {p_value_blue}")

So we run t-tests for all 20 jelly bean colors and observe the following distribution of p-values.

# Create t-test data
ttest_data = pd.DataFrame({'color': colors, 'pval': [test_color(data, color) for color in colors]})

# Set the plot size
plt.figure(figsize=(18, 4))

# Scatter plot with colors
sns.scatterplot(x='color', y='pval', hue='color', data=ttest_data,
                palette=colors, s=100, legend=False)

# Add dashed line at y = 0.05
plt.axhline(y=0.05, linestyle='--', color='black')

# Set y-axis limits
plt.ylim(0, 1)

# Set axis labels
plt.xlabel('Jelly bean color')
plt.ylabel('p-value')

# Show the plot
plt.show()

We discover that the yellow jelly bean color has a significant p-value (p < 0.05).

p_value_yellow = test_color(data, 'yellow')
print(f"P-value for 'yellow': {p_value_yellow}")

We finally publish our breaking news!

Explanation

Let’s delve into what’s happening here. Initially, it appears counterintuitive: the data generation process suggests that acne should be entirely unrelated to consuming jelly beans of any color.

The issue lies not with the data itself, but with the testing procedure. Even though there is no actual relationship between consuming jelly beans and acne, each individual test still has a 0.05 probability of producing a significant result. This happens because p-values are uniformly distributed under the null hypothesis, so there is always a 5% chance of observing a p-value below 0.05. To see this, we generate the jelly bean data 10,000 times and conduct a t-test on each dataset under the null hypothesis ‘there is no effect of eating jelly beans on acne condition.’ Note that the null hypothesis is true by construction: the data generation process creates ‘eating’ and ‘acne_condition’ independently.

from scipy.stats import ttest_ind

# Function to generate data
def generate_data(people, p_eat_jelly_bean):
    acne_condition = np.random.rand(people)
    eating = np.random.choice(['eating', 'not eating'], people, replace=True,
                              p=[p_eat_jelly_bean, 1 - p_eat_jelly_bean])
    data = pd.DataFrame({'acne_condition': acne_condition, 'eating': eating})
    return data

# Number of people and probability
people = 500
p_eat_jelly_bean = 0.9

# Number of trials
trial = 10000

# Function to compute p-values for 10000 random data sets
def compute_pvalues(trial):
    pvals = np.zeros(trial)
    for i in range(trial):
        data = generate_data(people, p_eat_jelly_bean)
        pvals[i] = ttest_ind(data[data['eating'] == 'eating']['acne_condition'],
                             data[data['eating'] == 'not eating']['acne_condition']).pvalue
    return pvals

# Run the function
pvals = compute_pvalues(trial)

The histogram illustrates a uniform distribution of p-values, leading to the conclusion that there’s a 5% probability of observing a significant p-value even when the null hypothesis is true.

# Set the plot size
plt.figure(figsize=(5, 4))

# Plot histogram using seaborn
sns.histplot(pvals, bins=np.arange(0, 1.1, 0.1), kde=True, stat='density', color='orange', alpha=0.5)

# Set axis labels
plt.xlabel('p-values')
plt.ylabel('Density')

# Show the plot
plt.show()

If there is a 5% chance of observing a significant result under the null hypothesis, testing 20 independent hypotheses increases the probability of observing at least one significant test to 1 - (1 - 0.05)^20 ≈ 64%.

This explains why examining 20 different jelly bean colors led to finding one color seemingly associated with acne.
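We can check this 64% figure empirically with a minimal simulation. It assumes the 20 per-color tests are independent, which is only approximately true here, since every color’s test reuses the same pool of participants:

np.random.seed(2)

n_colors = 20
n_trials = 10000

# Under the null hypothesis, p-values are uniform on (0, 1); draw 20 per
# trial and check how often the smallest one falls below 0.05
null_pvals = np.random.rand(n_trials, n_colors)
family_wise_rate = (null_pvals.min(axis=1) < 0.05).mean()

print(f"Analytical: {1 - (1 - 0.05) ** n_colors:.3f}")  # ~0.642
print(f"Simulated:  {family_wise_rate:.3f}")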

Conclusion

Vigilance in interpreting data patterns is crucial. As demonstrated above, conducting numerous tests and selectively presenting only the significant results can be misleading. To mitigate the risks of data snooping, consider the following remedies:

  1. Clearly articulate the testing procedure, providing details on the number of tests conducted.
  2. Implement randomized out-of-sample tests or utilize cross-validation to validate hypotheses.
  3. Document the total number of significance tests performed during the study and consider applying corrections such as the Bonferroni correction; a less conservative alternative is Benjamini and Hochberg’s false discovery rate (see the sketch below).
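As a sketch of the third remedy, we can apply both corrections to the 20 per-color p-values stored in ttest_data, using the multipletests function from statsmodels (an extra dependency not used elsewhere in this post). Bonferroni effectively compares each p-value against 0.05/20 = 0.0025, so a lone p-value just under 0.05 is unlikely to survive:

from statsmodels.stats.multitest import multipletests

# Raw p-values from the 20 per-color t-tests
raw_pvals = ttest_data['pval'].values

# Bonferroni controls the family-wise error rate: each p-value is
# compared against alpha / number_of_tests
bonf_reject, _, _, _ = multipletests(raw_pvals, alpha=0.05, method='bonferroni')

# Benjamini-Hochberg controls the false discovery rate, a less
# conservative criterion
bh_reject, _, _, _ = multipletests(raw_pvals, alpha=0.05, method='fdr_bh')

print("Significant after Bonferroni:", ttest_data['color'][bonf_reject].tolist())
print("Significant after BH (FDR):", ttest_data['color'][bh_reject].tolist())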

References:

https://gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800

https://statmodeling.stat.columbia.edu/2015/05/29/i-fooled-millions-into-thinking-chocolate-helps-weight-loss-heres-how/

https://datascience.stanford.edu/news/data-snooping

https://en.wikipedia.org/wiki/Data_dredging

https://www.nature.com/news/polopoly_fs/7.6716.1349271308!/suppinfoFile/Kahneman%20Letter

https://www.chem.ucla.edu/dept/Faculty/merchant/pdf/How_Science_Goes_Wrong
