Daily Dose of Bias for Data Scientists and Everyday Life — Day 1
6 Main Biases with Some Examples in Python (That You Can Skip if You Want)
Cognitive biases are systematic errors in thinking that affect people’s judgments and decisions. They arise from patterns of thinking that simplify or alter the processing of information, leading to deviations from logic and objective judgment. These biases can cause distorted perceptions, inaccurate assessments, and irrational interpretations of reality.
In this first article of our series, we will explore the following cognitive biases and their impact on data science and daily life:
- Anchoring Bias in Data Science: How initial information can unduly influence subsequent judgments and decisions.
- Apophenia in Data Science: The tendency to perceive meaningful connections and patterns in random data.
- Overfitting in Machine Learning: Creating models that are too complex and capture noise rather than the true underlying pattern.
- Spurious Correlations: Identifying false relationships between variables that appear to be correlated but are not causally related.
- Confirmation Bias in Data Science: Interpreting data in a way that confirms preexisting beliefs, leading to incorrect conclusions.
- Hindsight Bias in Data Science: Believing, after an event has occurred, that we could have predicted or expected the outcome.
For some biases, we provide examples in Python to illustrate their practical implications. However, if you’re more interested in understanding the concepts without diving into the technical details, you can skip the code sections.
Stay tuned as we explore these biases, shedding light on how they impact our work and decision-making in data science and beyond.
Anchoring Bias in Data Science
Anchoring bias is a cognitive bias where individuals rely too heavily on the first piece of information (the “anchor”) they encounter when making decisions. This initial information serves as a reference point and can heavily influence subsequent judgments and decisions, even if the anchor is irrelevant or misleading.
Explanation of Anchoring Bias
In data science, anchoring bias can manifest in various ways, such as:
- Analysts being overly influenced by initial data points or prior information when interpreting new data.
- Decision-makers relying too heavily on the first results they encounter during analysis, which may skew their final conclusions.
- Misleading initial data causing biased models or misinterpreted results.
Anchoring bias can negatively impact the objectivity of data analysis and lead to incorrect conclusions, making it crucial to recognize and mitigate its effects.
Example of Anchoring Bias in Python
To demonstrate anchoring bias, let’s consider a simple example where an analyst is evaluating house prices. We’ll simulate a situation where the analyst’s initial price estimate influences their subsequent judgments.
We will use Python to simulate what happens in our minds when anchoring bias is at work.
Step-by-Step Code Example
- Simulate House Prices Data: We’ll create a list of house prices to serve as our dataset.
- Introduce an Anchor: We’ll set an initial anchor value and see how it affects subsequent price estimates.
- Analyze the Impact of Anchoring: We’ll compare the estimates with and without the influence of the anchor.
Here’s the code to illustrate this:
import numpy as np
# Simulate house prices (in thousands of dollars)
np.random.seed(0)
house_prices = np.random.randint(100, 1000, size=100)
# Function to calculate the average estimate without anchoring
def unbiased_estimate(prices):
    return np.mean(prices)
# Function to calculate the average estimate with anchoring
def biased_estimate(prices, anchor):
    # Apply anchoring bias by weighting the anchor heavily
    biased_prices = np.append(prices, [anchor] * 10)
    return np.mean(biased_prices)
# Unbiased estimate
unbiased_avg = unbiased_estimate(house_prices)
print(f"Unbiased Average House Price: ${unbiased_avg:.2f}k")
# Anchored estimate with an initial anchor of 200k
anchor = 200
biased_avg = biased_estimate(house_prices, anchor)
print(f"Anchored Average House Price: ${biased_avg:.2f}k")
Unbiased Average House Price: $570.69k
Anchored Average House Price: $536.99k
Code explanation
[anchor] * 10 creates a list with the value of anchor repeated 10 times.
When you multiply a list by an integer in Python, it repeats the elements in the list that many times.
The np.append() function is used to add elements to an array. In this line, you are appending the list [anchor] * 10 to the original prices array. This effectively adds the anchor value 10 times to the prices array, which then biases the average calculation.
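If you want to see these two operations in isolation, here is a quick sketch (the values are arbitrary, chosen only for illustration):
import numpy as np
anchor = 200
# Multiplying a list by an integer repeats its elements
print([anchor] * 3)  # [200, 200, 200]
# np.append() returns a new array with the extra values added at the end
prices = np.array([500, 600, 700])
print(np.append(prices, [anchor] * 3))  # [500 600 700 200 200 200]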
Analysis of the Results — What happens in our mind:
- Unbiased Average: The average house price calculated without any bias is approximately $570.69k.
- Anchored Average: The average house price influenced by an anchor value of $200k is approximately $536.99k.
This example shows how an initial anchor value (in this case, $200k) can skew the overall average, leading to a biased estimate. The biased estimate is significantly lower than the unbiased average, demonstrating how anchoring can distort the interpretation of data.
Mitigating Anchoring Bias
To mitigate anchoring bias in data science:
- Be aware of initial anchors: Recognize potential anchor values and their influence on your analysis.
- Use multiple reference points: Instead of relying on a single initial value, consider a range of values or multiple sources of information (see the sketch after this list).
- Cross-validate results: Validate your findings using different subsets of data and various methodologies to ensure robustness.
- Peer review and collaboration: Engage colleagues in the review process to identify and correct potential biases.
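As a minimal, hedged sketch of the "multiple reference points" idea, the snippet below reuses the house_prices array and the biased_estimate function from the example above and compares the anchored estimate against estimates computed from several independent random subsets of the data. It is only an illustration of the idea, not a full debiasing procedure.
# Compare the anchored estimate against estimates taken from several
# independent random subsets of the same data (no anchor involved)
rng = np.random.default_rng(1)
subset_means = [np.mean(rng.choice(house_prices, size=50, replace=False))
                for _ in range(5)]
for i, m in enumerate(subset_means, start=1):
    print(f"Subset {i} estimate (no anchor): {m:.2f}k")
print(f"Anchored estimate:             {biased_estimate(house_prices, 200):.2f}k")
# If the anchored figure sits outside (or at the edge of) the spread of the
# unanchored estimates, that is a hint the anchor is pulling the result.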
Apophenia in Data Science
Apophenia is the human tendency to perceive connections and patterns in random or unrelated information. This can lead to seeing familiar shapes in clouds, hearing melodies in noise, or finding connections in data that do not actually exist. In data science, apophenia can result in the identification of spurious correlations, leading to misleading conclusions and faulty models.
In data science, apophenia can manifest in the following ways:
- Overfitting: Creating models that are too complex and capture noise rather than the true underlying pattern.
- Spurious Correlations: Finding correlations in data that are due to randomness rather than a real relationship.
- Confirmation Bias: Interpreting data in a way that confirms preexisting beliefs, leading to incorrect conclusions.
In statistics, apophenia could be classified as a type I error (false positive).
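To see how easily a "pattern" can emerge from pure noise, here is a minimal sketch (assuming SciPy is available for the p-values). It correlates one random variable with many other, completely unrelated random variables; with a 0.05 significance threshold, roughly 5% of the tests can be expected to come out "significant" by chance alone, which is exactly the type I error that apophenia corresponds to.
import numpy as np
from scipy.stats import pearsonr
# Purely random data: there is no real relationship anywhere in it
rng = np.random.default_rng(0)
target = rng.normal(size=200)
noise_features = rng.normal(size=(100, 200))  # 100 unrelated "features"
# Count how many unrelated features correlate "significantly" with the target
false_positives = sum(pearsonr(feature, target)[1] < 0.05
                      for feature in noise_features)
print(f"'Significant' correlations found in pure noise: {false_positives} out of 100")
# With a 0.05 threshold, around 5 false positives are expected by chance alone.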
Overfitting in Machine Learning
Overfitting occurs when a machine learning model learns not only the true underlying patterns in the training data but also the noise. This results in a model that performs well on the training data but poorly on unseen data, as it fails to generalize.
Overfitting happens when:
- Model complexity is too high: The model has too many parameters relative to the number of observations.
- Noise is learned: The model learns random fluctuations in the training data as if they were important features.
- Poor generalization: The model performs excellently on training data but poorly on test or validation data.
To mitigate overfitting, techniques such as cross-validation, regularization, and simplifying the model can be employed.
Example of Overfitting in Python
To demonstrate overfitting, let’s consider a dataset where we try to fit polynomial regression models of varying degrees.
Step-by-Step Code Example
- Simulate Data: We’ll create a simple dataset with a known underlying pattern and some added noise.
- Fit Polynomial Models: We’ll fit polynomial regression models of different degrees to the data.
- Visualize Overfitting: We’ll visualize how higher-degree polynomials fit the noise rather than the true underlying pattern.
Here is the code to illustrate this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Simulate data
np.random.seed(0)
X = np.linspace(0, 10, 100)
y = 2 * np.sin(X) + np.random.normal(0, 0.5, len(X))
# Reshape data for sklearn
X = X[:, np.newaxis]
# Function to fit and plot polynomial regression models
def plot_polynomial_regression(X, y, degrees):
    plt.figure(figsize=(14, 10))
    for i, degree in enumerate(degrees, start=1):
        # Create polynomial features
        poly = PolynomialFeatures(degree)
        X_poly = poly.fit_transform(X)
        # Fit linear regression model
        model = LinearRegression().fit(X_poly, y)
        # Make predictions
        y_pred = model.predict(X_poly)
        # Calculate MSE
        mse = mean_squared_error(y, y_pred)
        # Plot data and predictions
        plt.subplot(2, 2, i)
        plt.scatter(X, y, color='blue', label='Data')
        plt.plot(X, y_pred, color='red', label=f'Degree {degree} (MSE: {mse:.2f})')
        plt.title(f'Polynomial Regression (Degree {degree})')
        plt.xlabel('X')
        plt.ylabel('y')
        plt.legend()
        plt.grid(True)
    plt.tight_layout()
    plt.show()
# Plot polynomial regression models of varying degrees
plot_polynomial_regression(X, y, degrees=[1, 3, 6, 20])
Analysis of the Results
The plot will show polynomial regression fits for degrees 1, 3, 6, and 20.
- Degree 1: The linear model (degree 1) underfits the data, missing the true underlying pattern and providing a very simplistic model.
- Degree 3: The polynomial model of degree 3 captures the underlying sinusoidal pattern well without overfitting, providing a reasonable fit.
- Degree 6: The polynomial model of degree 6 starts to capture some noise in the data, leading to a more complex model that slightly overfits.
- Degree 20: The polynomial model of degree 20 significantly overfits the data, capturing noise and fluctuations and resulting in a very complex and wiggly curve.
Explanation
- Degree 1 (Underfitting): The model is too simple to capture the underlying pattern.
- Degree 3 (Good Fit): The model captures the underlying pattern well without overfitting.
- Degree 6 (Mild Overfitting): The model starts to capture noise, leading to a slightly more complex fit.
- Degree 20 (Overfitting): The model is too complex, fitting noise and fluctuations in the data, which leads to poor generalization on unseen data.
Mitigating Overfitting
To mitigate overfitting:
- Use Cross-Validation: Evaluate model performance on different subsets of the data to ensure it generalizes well (a sketch follows at the end of this section).
- Simplify the Model: Use simpler models with fewer parameters.
- Regularization: Apply techniques like Lasso or Ridge regression to penalize large coefficients and reduce model complexity.
- More Data: Use more training data to ensure the model learns the true underlying patterns rather than noise.
By recognizing and addressing overfitting, data scientists can create models that generalize better to new, unseen data, leading to more robust and reliable predictions.
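As a minimal sketch of the first two points, reusing the X and y arrays from the overfitting example, the snippet below compares the cross-validated mean squared error of polynomial fits of different degrees. The exact numbers will vary; the pattern is what matters.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Shuffled folds, so each fold covers the whole x-range rather than one end of it
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in [1, 3, 6, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    print(f"Degree {degree:2d} - cross-validated MSE: {-scores.mean():.2f}")
# The training error keeps shrinking as the degree grows, but the cross-validated
# error typically bottoms out at a low degree and then climbs again, flagging
# the high-degree fits as overfitting.
The same pipeline pattern also covers the regularization point: swapping LinearRegression for Ridge or Lasso penalizes large coefficients and reins in the more complex fits.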
Spurious Correlations
To understand spurious correlation, you can read my previous article:
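In the meantime, here is a minimal sketch of the classic trap: two independent random walks share no causal link at all, yet their correlation coefficient often comes out surprisingly large, simply because both series drift over time.
import numpy as np
# Two completely independent random walks
rng = np.random.default_rng(7)
walk_a = np.cumsum(rng.normal(size=500))
walk_b = np.cumsum(rng.normal(size=500))
# Their correlation is often strikingly high despite having no causal link
correlation = np.corrcoef(walk_a, walk_b)[0, 1]
print(f"Correlation between two independent random walks: {correlation:.2f}")
# Correlating the step-to-step changes instead of the levels usually makes
# the spurious relationship disappear
diff_correlation = np.corrcoef(np.diff(walk_a), np.diff(walk_b))[0, 1]
print(f"Correlation between their step-to-step changes: {diff_correlation:.2f}")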
Confirmation Bias in Data Science
Confirmation bias is the tendency to search for, interpret, and remember information in a way that confirms one’s preexisting beliefs or hypotheses. This bias can lead to incorrect conclusions because it causes people to favor information that supports their views while disregarding or undervaluing evidence that contradicts them.
In data science, confirmation bias can manifest in the following ways:
- Selective Data Analysis: Analysts may focus on data that supports their hypotheses and ignore data that contradicts them.
- Misinterpretation of Results: Results are interpreted in a way that confirms the analyst’s preconceptions, even if alternative interpretations are more plausible.
- Overlooking Flaws: Potential flaws or limitations in the data or methodology are overlooked if the results support the desired outcome.
Recognizing and mitigating the effects of confirmation bias is crucial for objective and accurate data analysis.
Example of Confirmation Bias in Python
To demonstrate confirmation bias, let’s consider an example where an analyst believes that more advertising leads to higher sales and selectively interprets data to support this hypothesis.
Step-by-Step Code Example
- Simulate Data: We’ll create a dataset with advertising spend and sales, including some noise.
- Analyze Data with Bias: We’ll perform an analysis that selectively interprets the data to support the hypothesis that more advertising leads to higher sales.
- Perform Unbiased Analysis: We’ll perform an unbiased analysis to compare the results.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Simulate data
np.random.seed(0)
advertising_spend = np.linspace(0, 100, 100)
sales = 3 * advertising_spend + np.random.normal(0, 30, len(advertising_spend)) + 200
# Create a DataFrame
data = pd.DataFrame({'Advertising_Spend': advertising_spend, 'Sales': sales})
# Introduce confirmation bias by selecting only a subset of data
biased_data = data[data['Advertising_Spend'] > 50]
# Fit linear regression model on biased data
biased_model = LinearRegression().fit(biased_data[['Advertising_Spend']], biased_data['Sales'])
biased_predictions = biased_model.predict(biased_data[['Advertising_Spend']])
biased_mse = mean_squared_error(biased_data['Sales'], biased_predictions)
biased_r2 = r2_score(biased_data['Sales'], biased_predictions)
# Fit linear regression model on unbiased data
unbiased_model = LinearRegression().fit(data[['Advertising_Spend']], data['Sales'])
unbiased_predictions = unbiased_model.predict(data[['Advertising_Spend']])
unbiased_mse = mean_squared_error(data['Sales'], unbiased_predictions)
unbiased_r2 = r2_score(data['Sales'], unbiased_predictions)
# Plot the results
plt.figure(figsize=(14, 6))
# Plot biased analysis
plt.subplot(1, 2, 1)
plt.scatter(biased_data['Advertising_Spend'], biased_data['Sales'], color='blue', label='Data')
plt.plot(biased_data['Advertising_Spend'], biased_predictions, color='red', label='Fit')
plt.title('Biased Analysis')
plt.xlabel('Advertising Spend')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.text(60, 350, f'MSE: {biased_mse:.2f}\nR2: {biased_r2:.2f}', color='red')
# Plot unbiased analysis
plt.subplot(1, 2, 2)
plt.scatter(data['Advertising_Spend'], data['Sales'], color='blue', label='Data')
plt.plot(data['Advertising_Spend'], unbiased_predictions, color='red', label='Fit')
plt.title('Unbiased Analysis')
plt.xlabel('Advertising Spend')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.text(10, 350, f'MSE: {unbiased_mse:.2f}\nR2: {unbiased_r2:.2f}', color='red')
plt.tight_layout()
plt.show()
Analysis of the Results
- Biased Analysis: The biased analysis uses only the data points where advertising spend is greater than 50. Because only the high-spend region is examined, the fit (its MSE, or Mean Squared Error, and R2) describes just the slice of the data the analyst chose to look at, and it appears to confirm the hypothesis.
- Unbiased Analysis: The unbiased analysis uses all the data points and shows the full relationship between advertising spend and sales, with a more realistic MSE and R2.
Explanation
- Biased Analysis: By selectively using data where advertising spend is high, the analyst reinforces their belief that more advertising leads to higher sales, because the low-spend data that could weaken or contradict the hypothesis is never examined.
- Unbiased Analysis: Using all the data provides a more accurate representation of the relationship between advertising spend and sales; the true relationship may be weaker, or shaped differently, than the cherry-picked subset suggests.
Mitigating Confirmation Bias
To mitigate confirmation bias in data science:
- Use All Relevant Data: Ensure that all relevant data is considered in the analysis, not just data that supports the hypothesis (a quick check is sketched after this list).
- Blind Analysis: Conduct analyses in a way that prevents the analyst from knowing the expected outcome in advance.
- Peer Review: Engage colleagues to review and critique the analysis, helping to identify potential biases.
- Cross-Validation: Use cross-validation techniques to validate the findings and ensure they are not due to selective data interpretation.
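One simple check for the first point, sketched here using the data, biased_model, and unbiased_model objects from the example above: score the model that was fitted only on the cherry-picked subset against the full dataset, side by side with the model fitted on everything. A large gap between the two evaluations is a warning sign that the selective analysis is not telling the whole story.
# Score both models on ALL of the data, not just the subset they were trained on
full_X, full_y = data[['Advertising_Spend']], data['Sales']
biased_on_full_r2 = r2_score(full_y, biased_model.predict(full_X))
unbiased_on_full_r2 = r2_score(full_y, unbiased_model.predict(full_X))
print(f"R2 of subset-trained model on the full data: {biased_on_full_r2:.2f}")
print(f"R2 of fully-trained model on the full data:  {unbiased_on_full_r2:.2f}")
# Evaluating every candidate analysis on the same, complete dataset removes
# the option of quietly keeping only the evidence that fits the hypothesis.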
Mitigating Apophenia
Recognizing and mitigating the effects of apophenia is crucial to ensure that analyses are accurate and reliable.
To sum up, to mitigate apophenia in data science:
- Use Statistical Tests: Employ statistical tests to confirm the significance of observed patterns and correlations.
- Cross-Validation: Validate findings with different subsets of data to ensure they are not due to randomness.
- Avoid Overfitting: Use regularization techniques to prevent overfitting models to noise.
- Peer Review: Engage colleagues to review and critique findings, helping to identify potential biases and spurious connections.
- Avoid spurious correlations: Check whether observed correlations are spurious and look for a plausible causal mechanism before treating them as meaningful.
By being aware of apophenia and taking steps to mitigate its effects, data scientists can improve the robustness and reliability of their analyses and models.
Hindsight Bias in Data Science
Hindsight bias, also known as the “knew-it-all-along” effect, is the tendency to believe, after an event has occurred, that we could have predicted or expected the outcome. This bias leads to the perception that events were more predictable than they actually were, often causing people to overestimate their ability to foresee outcomes.
In data science, hindsight bias can manifest in several ways:
- Misinterpretation of Model Predictions: Believing that a model’s prediction was obvious after the fact.
- Overconfidence in Predictive Models: Overestimating the accuracy and reliability of predictive models based on past events.
- Neglecting Model Limitations: Ignoring the limitations and uncertainties inherent in models because the outcome is already known.
Recognizing and mitigating the effects of hindsight bias is crucial to ensure objective and accurate data analysis.
Example of Hindsight Bias in Python
To demonstrate hindsight bias, let’s consider an example where we build a model to predict stock prices. After knowing the actual stock prices, we might incorrectly assume that the model’s predictions were obvious and predictable.
Step-by-Step Code Example
- Simulate Stock Price Data: We’ll create a dataset with stock prices over time, including some randomness to simulate real-world data.
- Build a Simple Predictive Model: We’ll build a simple linear regression model to predict stock prices.
- Analyze Hindsight Bias: We’ll compare the model’s predictions with actual prices and demonstrate how hindsight bias can lead to the false belief that the predictions were obvious.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Simulate stock price data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100, freq='D')
prices = np.linspace(100, 200, 100) + np.random.normal(0, 5, 100)
# Create a DataFrame
data = pd.DataFrame({'Date': dates, 'Price': prices})
# Split the data into training and test sets
train = data.iloc[:80]
test = data.iloc[80:]
# Prepare the data for linear regression
X_train = (train['Date'] - train['Date'].min()).dt.days.values.reshape(-1, 1)
y_train = train['Price']
X_test = (test['Date'] - train['Date'].min()).dt.days.values.reshape(-1, 1)
y_test = test['Price']
# Fit linear regression model
model = LinearRegression().fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Calculate MSE and R2
mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
# Plot the results
plt.figure(figsize=(14, 6))
# Plot training data and predictions
plt.subplot(1, 2, 1)
plt.scatter(train['Date'], train['Price'], color='blue', label='Training Data')
plt.plot(train['Date'], y_pred_train, color='red', label='Model Prediction')
plt.title('Training Data and Model Prediction')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.text(train['Date'].iloc[-10], train['Price'].min() + 10, f'MSE: {mse_train:.2f}\nR2: {r2_train:.2f}', color='red')
plt.xticks(rotation=45)
# Plot test data and predictions
plt.subplot(1, 2, 2)
plt.scatter(test['Date'], test['Price'], color='blue', label='Test Data')
plt.plot(test['Date'], y_pred_test, color='red', label='Model Prediction')
plt.title('Test Data and Model Prediction')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.text(test['Date'].iloc[-10], test['Price'].min() + 10, f'MSE: {mse_test:.2f}\nR2: {r2_test:.2f}', color='red')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Analysis of the Results
- Training Data and Model Prediction: The model fits the training data well, with an R2 value indicating the proportion of variance explained by the model.
- Test Data and Model Prediction: The model’s performance on the test data gives a more realistic measure of its predictive power.
Hindsight Bias Key Points
- Before Knowing the Outcome: Without knowing the actual test data outcomes, the model’s predictions may seem uncertain.
- After Knowing the Outcome: Once the actual test data outcomes are known, it’s easy to fall into hindsight bias and believe that the model’s predictions were obvious and could have been foreseen.
Mitigating Hindsight Bias
To mitigate hindsight bias in data science:
- Separate Training and Test Data: Always use a separate test set to evaluate model performance on unseen data.
- Cross-Validation: Use cross-validation techniques to assess model performance across different subsets of the data (a time-series-aware sketch follows below).
- Acknowledge Uncertainty: Recognize and communicate the uncertainties and limitations of predictive models.
- Avoid Retrospective Adjustments: Avoid adjusting models or interpretations based on knowledge of the outcomes.
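For time-ordered data such as the stock prices above, ordinary shuffled cross-validation leaks future information into the past, so a time-aware splitter is the safer choice. Here is a minimal sketch using scikit-learn's TimeSeriesSplit on the X_train and y_train arrays from the example; only the general pattern matters, not the exact scores.
from sklearn.model_selection import TimeSeriesSplit
# Each split trains only on the past and evaluates only on the future,
# which is how the model would actually have been used at the time
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_train), start=1):
    fold_model = LinearRegression().fit(X_train[train_idx], y_train.iloc[train_idx])
    fold_mse = mean_squared_error(y_train.iloc[test_idx],
                                  fold_model.predict(X_train[test_idx]))
    print(f"Fold {fold}: out-of-sample MSE = {fold_mse:.2f}")
# Reporting these forward-looking errors, rather than the error measured after
# the outcome is known, keeps the evaluation honest about how predictable the
# series really was.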
Our Daily Dose of Bias for Data Scientists and Everyday Life continues with the second article of the series: