Daily Dose of Bias for Data Scientists and Everyday Life — Day 2

6 Main Biases with Some Examples in Python (That You Can Skip if You Want)

Gianpiero Andrenacci
17 min read · Sep 10, 2024

Cognitive biases continue to shape our judgments and decisions in subtle yet powerful ways. In this second article of our series, we will explore another set of biases and their implications in data science. Understanding these biases helps us recognize potential pitfalls in our thinking and improve our decision-making processes.

6 Main Biases with Some Examples in Python (That You Can Skip if You Want)

  1. Gambler’s Fallacy in Data Science: The mistaken belief that future probabilities are influenced by past events in a statistically independent series of events.
  2. Outcome Bias in Data Science: Judging a decision based on its outcome rather than the quality of the decision at the time it was made.
  3. The IKEA Effect in Data Science: Overvaluing things that we have created ourselves, leading to potential biases in evaluating models or solutions.
  4. Overconfidence Effect: Overestimating one’s own abilities, knowledge, or predictions, leading to risky decisions.
  5. Impostor Syndrome: Feeling inadequate and doubting one’s accomplishments, despite evident success, which can hinder performance and growth.
  6. The Planning Fallacy: Underestimating the time, costs, and risks of future actions while overestimating the benefits, leading to unrealistic plans and timelines.

For some biases, we provide examples in Python to illustrate their practical implications. However, if you’re more interested in understanding the concepts without exploring the technical details, you can skip the code sections.

Stay tuned as we dive into these biases, shedding light on how they impact our work and decision-making in data science and beyond.

Gambler’s Fallacy in Data Science

Gambler’s fallacy is the erroneous belief that the occurrence of one event in a random sequence affects the probability of future events. For example, believing that after a series of heads in coin tosses, a tail is “due” to occur, even though each coin toss is independent and the probability remains 50/50.

In data science, gambler’s fallacy can manifest in the following ways:

  • Misinterpretation of Random Processes: Assuming that random processes will “balance out” over short sequences.
  • Bias in Predictive Models: Incorporating past random outcomes as predictors of future events inappropriately.
  • Incorrect Decision Making: Making decisions based on the mistaken belief that past events influence future probabilities.

Recognizing and mitigating the effects of gambler’s fallacy is crucial to ensure objective and accurate data analysis.

Example of Gambler’s Fallacy in Python

To demonstrate gambler’s fallacy, let’s simulate a series of coin tosses and analyze the belief that the outcome of future tosses depends on past outcomes.

Step-by-Step Code Example

  1. Simulate Coin Tosses: We’ll create a dataset of coin tosses and count the number of heads and tails.
  2. Analyze Gambler’s Fallacy: We’ll calculate the probabilities and demonstrate the fallacy that future outcomes depend on past events.
import numpy as np
import matplotlib.pyplot as plt

# Simulate coin tosses
np.random.seed(42)
n_tosses = 100
tosses = np.random.choice(['Heads', 'Tails'], size=n_tosses)

# Calculate cumulative counts of heads and tails
heads_count = np.cumsum(tosses == 'Heads')
tails_count = np.cumsum(tosses == 'Tails')

# Calculate probabilities of heads and tails over time
prob_heads = heads_count / np.arange(1, n_tosses + 1)
prob_tails = tails_count / np.arange(1, n_tosses + 1)

# Plot the cumulative counts and probabilities
plt.figure(figsize=(14, 6))

# Cumulative counts
plt.subplot(1, 2, 1)
plt.plot(heads_count, label='Heads Count')
plt.plot(tails_count, label='Tails Count')
plt.xlabel('Number of Tosses')
plt.ylabel('Cumulative Count')
plt.title('Cumulative Counts of Heads and Tails')
plt.legend()
plt.grid(True)

# Probabilities
plt.subplot(1, 2, 2)
plt.plot(prob_heads, label='Probability of Heads')
plt.plot(prob_tails, label='Probability of Tails')
plt.axhline(0.5, color='red', linestyle='--', label='True Probability')
plt.xlabel('Number of Tosses')
plt.ylabel('Probability')
plt.title('Probability of Heads and Tails Over Time')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
[Figure: Gambler’s Fallacy simulation, showing cumulative counts and running probabilities of heads and tails]

Analysis of the Results

  • Cumulative Counts: The cumulative counts of heads and tails fluctuate over time but do not necessarily balance out immediately. There can be streaks where one outcome occurs more frequently than the other.
  • Probabilities: The probabilities of heads and tails approach 0.5 as the number of tosses increases, reflecting the true probability of each outcome. However, short-term fluctuations do not indicate that one outcome is “due.”

Explanation of Gambler’s Fallacy

  • Independence of Events: Each coin toss is an independent event with a 50/50 probability, regardless of previous outcomes.
  • Misinterpretation: Believing that a tail is “due” after a series of heads is a misinterpretation of probability. The likelihood of heads or tails remains constant at 50% for each toss, as the short sketch below checks directly on simulated data.
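
To check this empirically, here is a minimal sketch (it re-simulates a larger sample than the 100 tosses above so the conditional estimate is stable) that compares the overall frequency of heads with the frequency of heads immediately after a run of three heads. If past tosses mattered, the two numbers would differ; in practice both hover around 0.5.

import numpy as np

# Simulate a large number of fair coin tosses (larger than above for stable estimates)
np.random.seed(42)
tosses = np.random.choice(['Heads', 'Tails'], size=100_000)
is_head = (tosses == 'Heads')

# Overall frequency of heads
overall_rate = is_head.mean()

# Frequency of heads immediately after a run of three heads
after_streak = [
    is_head[i]
    for i in range(3, len(is_head))
    if is_head[i - 3] and is_head[i - 2] and is_head[i - 1]
]
streak_rate = np.mean(after_streak)

print(f"P(heads) overall:                 {overall_rate:.3f}")
print(f"P(heads | previous 3 were heads): {streak_rate:.3f}")
# Both values sit near 0.5: a streak of heads does not make tails "due".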

Mitigating Gambler’s Fallacy

To mitigate gambler’s fallacy in data science:

  1. Understand Independence: Recognize that many random processes consist of independent events where past outcomes do not influence future probabilities.
  2. Educate on Probability: Educate stakeholders on the nature of probability and independence to prevent incorrect assumptions and decisions.
  3. Use Statistical Tests: Use appropriate statistical tests to analyze data and avoid drawing conclusions based on false assumptions about randomness (a short sketch follows this list).
  4. Validate Models: Ensure predictive models are validated rigorously and do not incorporate erroneous assumptions about dependencies in random processes.
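
As a concrete instance of point 3, the minimal sketch below (assuming SciPy 1.7 or later, which provides scipy.stats.binomtest) formally tests whether an observed imbalance between heads and tails is surprising under a fair coin, rather than concluding by eye that one outcome is “due.”

import numpy as np
from scipy.stats import binomtest  # requires SciPy >= 1.7

# Re-simulate the 100 tosses from the example above (same seed, same call)
np.random.seed(42)
tosses = np.random.choice(['Heads', 'Tails'], size=100)
n_heads = int((tosses == 'Heads').sum())

# Two-sided test of H0: P(heads) = 0.5
result = binomtest(k=n_heads, n=len(tosses), p=0.5, alternative='two-sided')
print(f"Heads: {n_heads}/{len(tosses)}, p-value = {result.pvalue:.3f}")
# A large p-value means the observed imbalance is consistent with a fair coin,
# so no "correction" toward the other outcome should be expected in future tosses.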

Understanding Outcome Bias in Data Science

Outcome bias is the tendency to judge a decision based on its outcome rather than on the quality of the decision at the time it was made. This bias leads to the belief that a good outcome implies a good decision and a bad outcome implies a bad decision, ignoring the decision-making process and the information available at the time.

In data science, outcome bias can manifest in the following ways:

  • Misjudging Model Performance: Evaluating a model based solely on its performance on a specific dataset without considering the data quality, feature selection, and other factors.
  • Incorrect Business Decisions: Judging business decisions based on outcomes rather than the rationale and data behind the decisions.
  • Overlooking Process: Ignoring the decision-making process and focusing only on the results, which can lead to repeating flawed methodologies.

Recognizing and mitigating the effects of outcome bias is crucial to ensure objective and fair evaluation of decisions and models.

Example of Outcome Bias with Low Success Rate but Lucky Outcome

  1. Simulate Investment Outcomes: We’ll create a dataset with different possible outcomes for a risky investment where the success rate is low.
  2. Highlight a Lucky Outcome: We’ll demonstrate how a single successful outcome might lead to the perception that the decision was good, despite the overall low success rate.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate investment outcomes
np.random.seed(42)
n_simulations = 1000
initial_investment = 10000
risk_factor = 0.5
mean_return = -0.5 # Negative mean return to ensure a low (~15%) success rate
returns = np.random.normal(loc=mean_return, scale=risk_factor, size=n_simulations)

# Calculate final investment values
final_values = initial_investment * (1 + returns)

# Create a DataFrame
data = pd.DataFrame({'Final_Value': final_values})

# Calculate statistics
mean_final_value = np.mean(final_values)
median_final_value = np.median(final_values)
success_rate = np.sum(final_values > initial_investment) / n_simulations

# Identify a lucky outcome
lucky_outcome = final_values[final_values > initial_investment][0]

# Plot the distribution of final investment values
plt.figure(figsize=(10, 6))
plt.hist(final_values, bins=30, color='skyblue', edgecolor='black')
plt.axvline(initial_investment, color='red', linestyle='dashed', linewidth=1, label='Initial Investment')
plt.axvline(mean_final_value, color='green', linestyle='dashed', linewidth=1, label='Mean Final Value')
plt.axvline(median_final_value, color='blue', linestyle='dashed', linewidth=1, label='Median Final Value')
plt.axvline(lucky_outcome, color='purple', linestyle='dashed', linewidth=1, label='Lucky Outcome')
plt.title('Distribution of Final Investment Values')
plt.xlabel('Final Value')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.show()

# Print summary statistics
print(f"Mean Final Value: ${mean_final_value:.2f}")
print(f"Median Final Value: ${median_final_value:.2f}")
print(f"Success Rate (Final Value > Initial Investment): {success_rate:.2%}")
print(f"Lucky Outcome: ${lucky_outcome:.2f}")

Analysis of the Results

  • Mean and Median Final Values: The mean and median final values are both likely to be less than the initial investment, indicating that most outcomes are poor.
  • Success Rate: The success rate indicates the proportion of simulations where the final investment value is greater than the initial investment, which is expected to be low (e.g., 15%).
  • Lucky Outcome: Despite the low success rate, there is a highlighted outcome where the investment was successful, illustrating the potential for outcome bias.

Explanation of Outcome Bias

  • Overall Performance: The overall performance of the investment strategy is poor, with a low mean and median final value and a low success rate.
  • Perception of Decision: The single lucky outcome might lead to the perception that the investment decision was good, despite the overall low success rate.

Mitigating Outcome Bias

To mitigate outcome bias in data science and decision-making:

  1. Evaluate Decision-Making Process: Focus on the quality of the decision-making process, including the data and rationale used at the time, rather than the outcome.
  2. Use Probabilistic Thinking: Consider the probabilities of different outcomes and assess decisions based on expected value and risk management (see the sketch after this list).
  3. Document Assumptions: Clearly document assumptions, data, and analysis methods used in decision-making to provide context for evaluating the decision.
  4. Peer Review: Engage colleagues in reviewing the decision-making process to ensure that it is based on sound principles and not influenced by the outcome.
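
To make point 2 concrete, the minimal sketch below judges the investment decision by its expected value and loss probability over many simulated outcomes, rather than by the single lucky result. The strongly negative mean return mirrors the simulation above; the remaining parameters are illustrative.

import numpy as np

# Same negative-mean return distribution as in the simulation above
np.random.seed(42)
initial_investment = 10_000
returns = np.random.normal(loc=-0.5, scale=0.5, size=1_000)
final_values = initial_investment * (1 + returns)

expected_final_value = final_values.mean()
expected_profit = expected_final_value - initial_investment
prob_loss = (final_values < initial_investment).mean()

print(f"Expected final value: ${expected_final_value:,.2f}")
print(f"Expected profit:      ${expected_profit:,.2f}")
print(f"Probability of loss:  {prob_loss:.1%}")
# A negative expected profit and a loss that is far more likely than a gain mark this
# as a poor decision ex ante, whatever a single lucky run happens to show.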

The IKEA Effect in Data Science

IKEA effect is the cognitive bias where people place a higher value on products they have partially created themselves. This phenomenon can lead individuals to overvalue their contributions and efforts simply because they were involved in the creation process. In data science, this bias can impact the development and evaluation of models, tools, and analyses.

In data science, the IKEA effect can manifest in the following ways:

  • Overvaluing Self-Created Models: Data scientists might overvalue their own models and analyses, believing them to be better than they actually are because of the effort they put into creating them.
  • Resistance to Change: There might be resistance to adopting new methods or models that were not created by the individual or team, even if they are objectively better.
  • Biased Evaluation: The evaluation of self-created models or tools might be biased, leading to overestimation of their quality and effectiveness.

Recognizing and mitigating the effects of the IKEA effect is crucial to ensure objective evaluation and improvement of data science projects.

Example of the IKEA Effect in Python

To demonstrate the IKEA effect, let’s consider a scenario where a data scientist builds a custom model and overvalues it compared to a pre-built model with better performance metrics.

Step-by-Step Example: Custom Model vs. AutoML

  • Simulate Data: We’ll create a simple dataset for a regression problem.
  • Build a Custom Linear Regression Model: We’ll fit a simple linear regression model using scikit-learn.
  • Use PyCaret for AutoML: We’ll use PyCaret to automatically find the best model and hyperparameters.
  • Compare Performance: We’ll compare the performance of the custom linear regression model and the AutoML model using mean squared error (MSE) and R2 score.

Make sure you have pycaret installed. You can install it using pip:

pip install pycaret
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from pycaret.regression import setup, compare_models, predict_model, save_model, load_model

# Simulate data
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 2 * np.sin(X) + np.random.normal(0, 0.5, len(X))

# Create a DataFrame
data = pd.DataFrame({'X': X, 'y': y})

# Split the data into training and test sets
train, test = train_test_split(data, test_size=0.2, random_state=42)

# Build a custom linear regression model
custom_model = LinearRegression().fit(train[['X']], train['y'])
y_pred_custom = custom_model.predict(test[['X']])

# Use PyCaret to find the best model
exp_reg101 = setup(data=train, target='y')
best_model = compare_models()

# Predict using the best model found by PyCaret
predictions = predict_model(best_model, data=test)
# PyCaret stores predictions in the 'prediction_label' column (PyCaret 3.x; 'Label' in 2.x);
# reading the original 'y' column would return the ground truth, not the model's predictions
y_pred_automl = predictions['prediction_label']

# Calculate metrics for the custom model
mse_custom = mean_squared_error(test['y'], y_pred_custom)
r2_custom = r2_score(test['y'], y_pred_custom)

# Calculate metrics for the AutoML model
mse_automl = mean_squared_error(test['y'], y_pred_automl)
r2_automl = r2_score(test['y'], y_pred_automl)

# Plot the results
plt.figure(figsize=(14, 6))

# Plot custom model predictions
plt.subplot(1, 2, 1)
plt.scatter(test['X'], test['y'], color='blue', label='Test Data')
plt.plot(test['X'], y_pred_custom, color='red', label='Custom Model Prediction')
plt.title('Custom Linear Regression Model')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.text(1, 2, f'MSE: {mse_custom:.2f}\nR2: {r2_custom:.2f}', color='red')

# Plot AutoML model predictions
plt.subplot(1, 2, 2)
plt.scatter(test['X'], test['y'], color='blue', label='Test Data')
plt.plot(test['X'], y_pred_automl, color='green', label='AutoML Model Prediction')
plt.title('AutoML Model')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.text(1, 2, f'MSE: {mse_automl:.2f}\nR2: {r2_automl:.2f}', color='green')

plt.tight_layout()
plt.show()

# Print summary statistics
print(f"Custom Model - MSE: {mse_custom:.2f}, R2: {r2_custom:.2f}")
print(f"AutoML Model - MSE: {mse_automl:.2f}, R2: {r2_automl:.2f}")
Custom Model - MSE: 2.45, R2: -0.09
AutoML Model - MSE: 0.00, R2: 1.00

The image below shows the model comparison PyCaret performs in our example to select the best-performing model.

Analysis of the Results

  • Custom Linear Regression Model: A straight-line fit is simple and cannot capture the sinusoidal pattern in the data as effectively as more complex models. In the run reported above, the custom linear regression model has a Mean Squared Error (MSE) of 2.45 and an R2 score of -0.09; the negative R2 means the model predicts worse than simply using the mean of the data.
  • AutoML Model: The AutoML solution (PyCaret) automatically explores a variety of models and hyperparameters and finds one that fits the data far better, with a much lower MSE and a much higher R2 score than the custom model.

Explanation of IKEA Effect

  • Overvaluing Custom Model: The data scientist might overvalue the custom linear regression model due to personal effort and involvement in creating it, even if the performance metrics (MSE and R2) are worse than those of the AutoML model.
  • Objective Evaluation: Despite the effort put into building the custom model, an objective evaluation based on performance metrics might show that the AutoML model is better.

Mitigating IKEA Effect

To mitigate the IKEA effect in data science:

  1. Use Objective Metrics: Rely on objective performance metrics to evaluate models, rather than personal attachment or effort invested.
  2. Prefer Existing Solutions: Before building a new and complicated custom model, check whether an existing one already meets the need.
  3. Peer Review: Engage colleagues to review models and analyses to provide unbiased feedback.
  4. Regular Benchmarks: Compare custom models with standard benchmarks or AutoML solutions to ensure that the chosen models are genuinely the best option (a short benchmarking sketch follows this list).
  5. Embrace Change: Be open to adopting new methods and models that offer better performance, even if they are not self-created.
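
Point 4 can be made routine with a few lines of scikit-learn. The minimal sketch below reuses the synthetic sine data from the example above and compares the hand-built linear regression against a random-forest baseline under cross-validation; the random forest is just an illustrative stand-in for whatever benchmark or AutoML result your team uses.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Same synthetic sine data as the IKEA-effect example above
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * np.sin(X).ravel() + np.random.normal(0, 0.5, len(X))

cv = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in [("Custom linear regression", LinearRegression()),
                    ("Random forest baseline", RandomForestRegressor(random_state=42))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.2f} (+/- {scores.std():.2f})")
# If the generic baseline clearly wins, personal attachment to the custom model
# is not a reason to keep it.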

Overconfidence Effect

Overconfidence Effect refers to our tendency to overestimate our abilities, skills, or the accuracy of our predictions. This cognitive bias is well-documented and occurs when an individual’s subjective confidence in their judgments exceeds the objective accuracy of those judgments. Overconfidence is a type of error in assessing subjective probabilities.

Types of Overconfidence

In research literature, overconfidence has been defined in three distinct ways:

  1. Overestimation of Actual Performance: Believing that one’s actual performance is better than it truly is.
  2. Overplacement: Believing that one’s performance is better compared to others more than it actually is.
  3. Excessive Precision: Expressing unwarranted certainty in the accuracy of one’s beliefs (see the sketch below).
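
Excessive precision is easy to demonstrate numerically. In the minimal sketch below (all parameters are illustrative assumptions), an analyst reports 90% prediction intervals but underestimates the true spread of the data, and the empirical coverage exposes the gap between stated and actual confidence.

import numpy as np

# True data-generating process: standard normal outcomes
rng = np.random.default_rng(42)
true_values = rng.normal(loc=0.0, scale=1.0, size=10_000)

# The analyst builds "90%" intervals but assumes a smaller spread than reality
z90 = 1.645                # z-score for a two-sided 90% interval
overconfident_sd = 0.6     # assumed standard deviation (true value is 1.0)
lower, upper = -z90 * overconfident_sd, z90 * overconfident_sd

coverage = np.mean((true_values >= lower) & (true_values <= upper))
print("Stated confidence: 90%")
print(f"Actual coverage:   {coverage:.1%}")
# Coverage lands well below 90% because the intervals reflect the analyst's
# overconfidence, not the true variability of the data.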

Connection to the IKEA Effect

The IKEA Effect occurs when individuals place higher value on products they have created themselves, leading to an overvaluation of their contributions. This bias can significantly impact data science projects, where a data scientist might overvalue a model they have personally developed, despite its poor performance compared to automated solutions or models created by others.

When we combine the IKEA Effect with the Overconfidence Effect, we see a compounded bias where:

  1. Overestimation of Actual Performance: The data scientist may believe their custom model performs better than it actually does, ignoring objective metrics like MSE and R2 that show otherwise.
  2. Overplacement: They might also believe their custom model is superior to colleagues’ models, existing solutions, or AutoML pipelines, despite evidence to the contrary.
  3. Excessive Precision: The data scientist might express unwarranted confidence in the accuracy and effectiveness of their custom model, even when it underperforms compared to alternative models.

Implications in Data Science

In the context of data science, these biases can lead to:

  • Suboptimal Decision Making: Preferring self-created models over objectively better-performing models.
  • Resistance to Better Solutions: Ignoring or underutilizing existing solutions that might provide better results due to personal investment in self-created models.
  • Overlooking Objective Evaluation: Failing to rely on objective performance metrics and peer reviews, leading to biased model selection and potentially poorer outcomes.

Mitigating the Combined Bias

To mitigate the combined effects of the IKEA Effect and Overconfidence Effect:

  1. Emphasize Objective Metrics: Regularly use and trust objective performance metrics (e.g., MSE, R2) to evaluate models.
  2. Promote Peer Review: Encourage peer reviews and feedback to provide an unbiased assessment of models and predictions.
  3. Adopt Benchmarking Practices: Consistently compare custom models with standard benchmarks or automated solutions to ensure the best models are selected.
  4. Foster Open-mindedness: Cultivate an openness to adopting new methods and models that offer better performance, even if they are not self-created.
  5. Educate on Cognitive Biases: Increase awareness and understanding of cognitive biases among data scientists to foster more critical and objective thinking.

Impostor Syndrome

Impostor Syndrome is a psychological pattern where individuals doubt their accomplishments and have a persistent fear of being exposed as a “fraud,” despite evident success and achievements. This syndrome is the opposite of the Overconfidence Effect and can significantly impact individuals in various fields, including data science.

Characteristics of Impostor Syndrome

  1. Self-Doubt: Constantly questioning one’s abilities and achievements.
  2. Fear of Exposure: Persistent fear of being exposed as incompetent or a fraud.
  3. Attributing Success to External Factors: Believing that success is due to luck, timing, or other external factors rather than one’s own skills and efforts.
  4. Perfectionism: Setting excessively high standards and feeling like a failure when those standards are not met.

Connection to Data Science

In the context of data science, Impostor Syndrome can manifest in several ways:

  1. Underestimating Skills and Knowledge: A data scientist may underestimate their technical skills and analytical abilities, even when they are proficient and competent.
  2. Avoiding Opportunities: Fear of failure or exposure may lead to avoiding new projects, opportunities, or challenges that could enhance their career.
  3. Reluctance to Share Work: Hesitation to share findings, present at conferences, or publish papers due to fear of being judged or criticized.
  4. Over-preparation: Spending excessive time on tasks and projects to ensure perfection, which can lead to burnout.

Implications in Data Science

Impostor Syndrome can have several negative effects on data scientists and their work:

  • Reduced Confidence: Lack of confidence in one’s abilities can hinder decision-making and problem-solving.
  • Missed Opportunities: Avoiding challenging projects or roles can limit career growth and professional development.
  • Decreased Collaboration: Reluctance to share ideas and work can reduce collaboration and innovation within teams.
  • Mental Health Impact: Chronic self-doubt and fear of exposure can lead to anxiety, stress, and decreased job satisfaction.

Mitigating Impostor Syndrome

To mitigate the effects of Impostor Syndrome in data science:

  1. Recognize and Acknowledge: Acknowledge that Impostor Syndrome is a common experience and not a reflection of one’s actual abilities.
  2. Celebrate Achievements: Regularly reflect on and celebrate accomplishments and successes, no matter how small.
  3. Seek Feedback: Request feedback from peers, mentors, and supervisors to gain an objective perspective on skills and performance.
  4. Engage in Professional Development: Continuously build skills and knowledge through training, courses, and workshops to boost confidence.
  5. Practice Self-Compassion: Be kind to oneself and recognize that perfection is unattainable; mistakes and setbacks are part of the learning process.
  6. Find a Support Network: Connect with colleagues, mentors, and professional groups to share experiences and receive support and encouragement.

Impostor Syndrome and the Overconfidence Effect represent two extremes of self-perception. While the Overconfidence Effect involves overestimating one’s abilities and judgments, Impostor Syndrome involves underestimating them. Both can significantly impact performance and decision-making in data science.

The Planning Fallacy

Planning Fallacy is a cognitive bias where individuals underestimate the time, effort, and cost required to complete a task. This bias leads people to set overly optimistic timelines and budgets, often resulting in delays, increased costs, and unmet expectations.

Characteristics of Planning Fallacy

  1. Underestimating Time: Believing that tasks will take less time than they actually do.
  2. Underestimating Effort: Assuming that tasks require less effort and resources than needed.
  3. Ignoring Potential Problems: Failing to anticipate potential obstacles and delays.
  4. Optimism Bias: Displaying an unrealistic level of optimism about one’s abilities and circumstances.

Connection to Data Science

In the context of data science, the Planning Fallacy can manifest in various ways:

  1. Project Timelines: Underestimating the time needed to complete data science projects, including data collection, cleaning, modeling, and evaluation.
  2. Resource Allocation: Failing to allocate sufficient resources, such as personnel, computational power, and budget, leading to project delays and overruns.
  3. Model Development: Assuming that developing, testing, and validating models will be quicker and easier than it actually is.
  4. Implementation: Believing that implementing a data science solution in a production environment will be straightforward, without considering integration challenges and maintenance needs.

Implications in Data Science

The Planning Fallacy can have several negative effects on data science projects:

  • Delays: Projects take longer than expected, impacting deadlines and deliverables.
  • Increased Costs: Underestimated budgets lead to higher actual costs due to additional resource requirements.
  • Frustration and Stress: Unrealistic timelines and expectations create stress for data science teams and stakeholders.
  • Compromised Quality: Rushed timelines may lead to shortcuts and compromised quality in data analysis and model development.

Mitigating the Planning Fallacy

To mitigate the Planning Fallacy in data science:

  1. Realistic Estimations: Base time and effort estimates on historical data and past experiences, not just optimistic assumptions (a short sketch follows this list).
  2. Break Down Tasks: Divide projects into smaller, manageable tasks with individual timelines and resource requirements.
  3. Buffer Time: Add buffer time to project timelines to account for unforeseen delays and obstacles.
  4. Seek Input: Consult with team members and stakeholders to gather diverse perspectives on time and effort estimations.
  5. Regular Updates: Frequently update project timelines and resource allocations based on progress and any changes in scope.
  6. Risk Management: Identify potential risks and develop contingency plans to address them.
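
One way to put point 1 into practice is reference-class forecasting: scale the optimistic plan by how much similar past projects actually overran. The minimal sketch below uses invented historical records purely for illustration.

import numpy as np

# Hypothetical records of past projects: (estimated hours, actual hours)
history = [(40, 62), (25, 31), (60, 95), (10, 14), (30, 48)]
overrun_factors = [actual / estimated for estimated, actual in history]
typical_overrun = float(np.median(overrun_factors))

# Optimistic task-by-task estimates for the new project (hours)
new_estimates = {'data_collection': 2, 'data_cleaning': 1,
                 'model_training': 1, 'model_evaluation': 1}

naive_total = sum(new_estimates.values())
adjusted_total = naive_total * typical_overrun

print(f"Median historical overrun factor: {typical_overrun:.2f}x")
print(f"Naive plan:    {naive_total:.1f} hours")
print(f"Adjusted plan: {adjusted_total:.1f} hours")
# Anchoring the plan on how similar projects actually went counteracts the
# optimism baked into task-by-task guesses.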

The Planning Fallacy is a common bias that affects the estimation of time, effort, and cost required to complete tasks. In data science, this bias can lead to project delays, increased costs, and compromised quality. By recognizing and addressing the Planning Fallacy, data scientists and project managers can create more realistic project plans, improve resource allocation, and enhance the overall success of data science initiatives.

Let’s create a simple Python example to demonstrate the Planning Fallacy. We’ll simulate a data science project that involves data collection, cleaning, model training, and evaluation. We’ll estimate the time required for each task and then compare it with the actual time taken.

Example: Simulating a Data Science Project with Planning Fallacy

Estimate the Time Required for Each Task:

  • Data collection: 2 hours
  • Data cleaning: 1 hour
  • Model training: 1 hour
  • Model evaluation: 1 hour

Actual Time Taken for Each Task (simulated):

  • Data collection: 3 hours
  • Data cleaning: 2 hours
  • Model training: 1.5 hours
  • Model evaluation: 1.5 hours
import time

# Simulated time taken for each task (in hours)
actual_times = {
    'data_collection': 3,
    'data_cleaning': 2,
    'model_training': 1.5,
    'model_evaluation': 1.5
}

# Estimated time for each task (in hours)
estimated_times = {
    'data_collection': 2,
    'data_cleaning': 1,
    'model_training': 1,
    'model_evaluation': 1
}

# Function to simulate task execution with actual times
def perform_task(task_name, actual_time):
    print(f"Starting {task_name}...")
    # Simulate the time taken to perform the task
    time.sleep(actual_time * 0.1)  # Using sleep for simulation purposes
    print(f"Completed {task_name} in {actual_time} hours.\n")

# Simulate the data science project
def simulate_project():
    print("Estimating project timeline...\n")
    total_estimated_time = sum(estimated_times.values())
    print(f"Total estimated time: {total_estimated_time} hours\n")

    print("Starting project...\n")
    total_actual_time = 0
    for task, actual_time in actual_times.items():
        perform_task(task, actual_time)
        total_actual_time += actual_time

    print(f"Total actual time taken: {total_actual_time} hours\n")
    print(f"Planning Fallacy: Estimated time was {total_estimated_time} hours, but actual time was {total_actual_time} hours.")

# Run the simulation
simulate_project()

Explanation of the Code

  • Estimated Times: The dictionary estimated_times contains the estimated time for each task based on optimistic assumptions.
  • Actual Times: The dictionary actual_times contains the actual time taken for each task, which is longer than the estimates.
  • perform_task Function: This function simulates the execution of a task by sleeping for a fraction of the actual time taken.
  • simulate_project Function: This function simulates the entire project, calculates the total estimated and actual times, and prints the output shown below.
Estimating project timeline...

Total estimated time: 5 hours

Starting project...

Starting data_collection...
Completed data_collection in 3 hours.

Starting data_cleaning...
Completed data_cleaning in 2 hours.

Starting model_training...
Completed model_training in 1.5 hours.

Starting model_evaluation...
Completed model_evaluation in 1.5 hours.

Total actual time taken: 8.0 hours

Planning Fallacy: Estimated time was 5 hours, but actual time was 8.0 hours.

This example demonstrates the Planning Fallacy by comparing the estimated time for a data science project with the actual time taken. The actual time significantly exceeds the estimates, illustrating the tendency to underestimate the time required for tasks. By recognizing this bias, data scientists and project managers can create more realistic project plans and improve the accuracy of their time estimates.
