[Data Analysis] Statistical analysis (7/9)

Sam Taylor
12 min read · Oct 25, 2023


Learn the essential steps of statistical analysis using Python and Jupyter notebooks on the Iris dataset. Perfect for aspiring Data Analysts!

[This guide is part 7 of a 9-article walkthrough.]

Key concepts:
Data analysis · Statistical analysis · Data analysis process · Data analysis projects · VS Code · Python

Photo by Antoine Dautry on Unsplash

Welcome to the world of data analysis! In this beginner-friendly guide, we’ll explore the Iris flower dataset and walk you through the common steps of statistical analysis using Python in Jupyter notebooks. By the end of this article, you’ll have a solid understanding of statistical analysis and data visualization.

Statistical analysis (one-way ANOVA) of the Iris dataset

To remind ourselves where in the data analysis process statistical analysis comes into play, here is a general outline of the data analysis process:

  1. Define Objectives: Clearly understand the goals of your analysis.
  2. Data Acquisition: Obtain the dataset you’ll be working with.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. Data Cleaning: Preprocess the data to ensure its quality and consistency.
  5. Data Visualization: Create visualizations to gain insights into the data. Use libraries like Matplotlib, Seaborn, or Plotly to create plots, charts, and graphs.
  6. Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
  7. ➡️ Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
    ◦ Statistical tests (t-tests, ANOVA, chi-square tests, etc.) for hypothesis testing.
    ◦ Correlation analysis.
    ◦ Regression analysis for predictive modeling.
  8. Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm & train and evaluate the model’s performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

Prerequisites

Step 1: Setting Up Your Environment

Before we start, make sure you have the necessary tools and libraries installed:

  • Visual Studio Code (VS Code): A coding environment.
    Step-by-step guide
  • Python: A coding language and the backbone of our data analysis.
    Step-by-step guide
  • Jupyter Extension for VS Code: For interactive notebooks within VS Code.
    Step-by-step guide
  • Pandas, Matplotlib, Seaborn: Python libraries for data manipulation and visualization.
    Step-by-step guide
Installing a Python package via the command terminal (macOS)
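
If any of these libraries are missing, they can be installed from the command terminal. A minimal example, assuming pip is available on your system:

# Install the libraries used in this walkthrough (run in your terminal)
pip install pandas matplotlib seaborn scipy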

Step 2: Creating a Jupyter Notebook

Launch Visual Studio Code, create a new Jupyter Notebook, connect a kernel, and save the notebook with an appropriate name like: “Iris_Flower_Data_Visualization.ipynb”.
Step-by-step guide

Step 3: Importing Libraries

In your Jupyter Notebook, start by importing the necessary libraries:
Step-by-step guide

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 4: Loading the Iris Flower Dataset

We will use a dataset called the Iris dataset. The Iris dataset is a classic dataset in the field of data analysis and machine learning, often used for classification and data exploration.

  • Download the dataset: Download the Iris Flower dataset as a CSV file from a trusted source, for example Kaggle.
    Step-by-step guide
  • Upload the dataset to VS Code: Now, load the CSV dataset into a Pandas DataFrame:
    ◦ Replace ‘your_file_path’ with the actual path to your dataset.
    Step-by-step guide
# Import the iris dataset using pd.read_csv 
# Replace 'your_file_path' with the actual path to your dataset.
df = pd.read_csv('your_file_path/iris.csv')

Step 5: Data exploration

Now, we will have a quick check of the data, to get ourselves familiar with it. To do so, we will use the .head() method.
◦ Click here for a more in-depth guide to data exploration.

#Check the first 5 rows of data
df.head()
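
Optionally, you can also use .info() and .describe() for a quick summary of the column types and basic statistics (an extra step, not required for the rest of the walkthrough):

# Overview of columns, data types and non-null counts
df.info()

# Summary statistics for the numeric columns
df.describe()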

Step 6: Data cleaning & preprocessing

Finally, we will clean the data by handling missing values, removing duplicates, and addressing outliers:
Step-by-step guide

# Handle missing values
df.dropna(inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Outlier removal (if needed)
# Example: keep rows within 2 standard deviations of the mean (numeric columns only)
import numpy as np
from scipy import stats

numeric_cols = df.select_dtypes(include='number')
df = df[(np.abs(stats.zscore(numeric_cols)) < 2).all(axis=1)]

Statistical Analysis

Step 1: Form Your Hypotheses

Start by crafting statistical hypotheses.
◦ For more information about the null and alternative hypotheses, check here¹.

Comparison by Surbhi, S (2017) on keydifferences.com¹

✅ In this example, we’ll test whether there’s a significant difference in sepal length between the Iris species.

[H0] Null hypothesis: There’s no difference in sepal length between Iris species.
[H1] Alternative hypothesis: Sepal length differs between Iris species.

Step 2: Select Your Research Design Method

Now, we can decide whether our research design will be experimental, correlational, or descriptive:

Comparison of different research design methods (Walters, 2020², as cited in Stangor, 2011³)

✅ In our example, we’ll use an experimental design to examine the relationship between:

Independent Variable: Iris species
◦ The different categories of Iris plants, such as “Iris-setosa,” “Iris-versicolor,” and “Iris-virginica”.

Dependent Variable: Sepal length
◦ The variable being measured to see if it differs between the different Iris species.
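
As a quick sanity check (an optional extra), you can confirm the categories of our independent variable directly from the DataFrame:

# List the categories of the independent variable (species)
print(df['species'].unique())

# Count the number of observations per species
print(df['species'].value_counts())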

Step 3: Collect your data

Because we’re eager beavers and good students, we have done this above.
◦ Good practice is also to check, clean and prepare your dataset, which we have also done.

# Import the iris dataset using pd.read_csv 
# Replace 'your_file_path' with the actual path to your dataset.
df = pd.read_csv('your_file_path/iris.csv')
Iris dataset

Step 4: Visualise Your Data

Now, to have an overview of the variables we’re working with, it’s good practice to visualise your data.

Here, we will create a box plot of our dependent variable against our independent variable, to see if there are any visible differences between the species.

# Group the data by species
grouped = df.groupby(df.species)

# Visualize the data
sns.boxplot(x='species', y='sepal_length', data=df)
plt.title('Sepal Length Distribution by Species')
plt.show()
Box plot of the sepal length of each Iris species

This looks very promising for our statistical analysis. There seem to be very clear differences in sepal length between the species, as the distribution for each species is centred at a visibly different level from the others.
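
To back up this visual impression with numbers (an optional extra step), we can compute the mean sepal length per species:

# Mean sepal length per species
print(df.groupby('species')['sepal_length'].mean())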

Step 5: Test Hypothesis

We have our hypothesis, we’ve cleaned our dataset and we’ve visually checked to see if our dependent variable (sepal_length) varies across our independent variable (species).

Now, let’s check if the differences we’ve seen are statistically significant.⁴

Step 5.1: Select a Statistical Test
For the test, we will use a one-way ANOVA test.

The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of two or more independent (unrelated) groups.⁵
[Lund Research Ltd. (2018)]
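
For reference, a standard one-way ANOVA can be run with SciPy’s f_oneway function. This is a minimal sketch; before relying on its result, we first need to check the test’s assumptions, which we do in the next step:

from scipy.stats import f_oneway

# One-way ANOVA on sepal length across the three Iris species
f_stat, p_value = f_oneway(
    df[df['species'] == 'Iris-setosa']['sepal_length'],
    df[df['species'] == 'Iris-versicolor']['sepal_length'],
    df[df['species'] == 'Iris-virginica']['sepal_length'],
)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_value}")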

Step 5.2: Check the Assumptions of the Statistical Test
One important part of statistical testing is making sure that our dataset meets the assumptions of the statistical test we are using.

Reminder: You don’t need to remember any of this information about which tests to use and what assumptions there are for each test — there’s simply too much to remember.
Recommendation: Get into the habit of Googling and researching the information you need for the problem you are facing.

The 3 main assumptions for a one-way ANOVA are (Lund Research Ltd., 2018)⁵:

1. Independence:
Definition: Observations in each group should be independent.
Reasoning: Sepal length measurements in the Iris dataset are independent as each corresponds to a different flower.
Verdict: ✅ The independence assumption is met.

2. Normality:
Definition: Data within each group should approximate a normal distribution.
Reasoning: Sepal lengths in Iris species groups often approach a normal distribution. We will perform the Shapiro-Wilk test to confirm.
Verdict: ✅ The normality assumption is often met, especially with larger sample sizes. Formal testing is still recommended for confirmation.

# Extract the 'sepal_length' data for each species
setosa_sepal_length = df[df['species'] == 'Iris-setosa']['sepal_length']
versicolor_sepal_length = df[df['species'] == 'Iris-versicolor']['sepal_length']
virginica_sepal_length = df[df['species'] == 'Iris-virginica']['sepal_length']

# Perform Shapiro-Wilk test for normality
shapiro_test_setosa = stats.shapiro(setosa_sepal_length)
shapiro_test_versicolor = stats.shapiro(versicolor_sepal_length)
shapiro_test_virginica = stats.shapiro(virginica_sepal_length)

# Return a message depending on the result of the p-value
# (shown here for setosa; repeat for the other two species)
if shapiro_test_setosa.pvalue > 0.05:
    print("Setosa Sepal Length is normally distributed (p-value:", shapiro_test_setosa.pvalue, ")")
else:
    print("Setosa Sepal Length is not normally distributed (p-value:", shapiro_test_setosa.pvalue, ")")
Shapiro-Wilk test of normality on the Iris dataset (species vs. sepal_length)

3. Homogeneity of Variances (Homoscedasticity):
Definition: Variances should be roughly equal between groups.
Reasoning: Sepal lengths among different Iris species often have similar variances. However, we will perform Levene’s test of homogeneity to check.
Verdict: 🔴 The variance is not equal between species.

from scipy.stats import levene

# Perform Levene's test (remember: we defined 'grouped' in our code above)
statistic, p_value = levene(
    grouped.get_group("Iris-setosa")["sepal_length"],
    grouped.get_group("Iris-versicolor")["sepal_length"],
    grouped.get_group("Iris-virginica")["sepal_length"],
)

alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis. Variance is not equal (p-value: {p_value})")
else:
    print(f"Fail to reject the null hypothesis. Variance is equal (p-value: {p_value})")
Levene’s test of homogeneity on the Iris dataset (species vs. sepal_length)

🔴 After testing the assumptions, we cannot reasonably say that the Iris dataset meets the criteria of a one-way ANOVA.
◦ As we cannot run a one-way ANOVA, we will run Welch’s ANOVA.

Welch’s ANOVA is an alternative to the traditional one-way ANOVA that does not assume equal variances across groups. It is appropriate when the assumption of homogeneity of variances is violated.
[Stephanie, G (n.d.)]

Step 5.3: Applying the Test to the data
Uf! At long last, we are finally able to run the test on our data.

  • It’s important to note, however, that it is for reasons like this (i.e. meeting the assumptions of our statistical tests) that we place so much emphasis on the data cleaning and exploration parts of our analysis.
  • As the saying goes: ‘Bad data in, bad data out!’ — and we want to avoid this at all costs.

To apply Welch’s ANOVA to our dataset, we can use the following code:
Note: You will need to install the ‘pingouin’ library if you haven’t already.

# Install pingouin if you haven't already (the '!' prefix runs it as a shell command inside a notebook cell)
!pip install pingouin

import pingouin as pg

# Welch's ANOVA test
result = pg.welch_anova(data=df, dv='sepal_length', between='species')

# Access the p-value from the result
p_value = result['p-unc'].values[0]

# Format the p-value to display with all decimals
formatted_p_value = "{:.35f}".format(p_value)

# Return a message depending on the result of the p-value
if p_value < 0.05:
    print(f"The sepal length differs significantly between Iris species. \np-value: {formatted_p_value}")
else:
    print(f"There's no significant difference in sepal length between Iris species. \np-value: {formatted_p_value}")
The results of our Welch’s ANOVA test

Step 6: Interpret the results

Now that we’ve got the results, we need to interpret them.

Let’s look back at our hypothesis:

[H0] Null hypothesis: There’s no difference in sepal length between Iris species.
[H1] Alternative hypothesis: Sepal length differs between Iris species.

And our results:

The results of our Welch’s ANOVA test

Results: 🎉 We can see here that our p-value is below 0.05, so we reject our null hypothesis in favour of our alternative hypothesis:
◦ ✅ There is a difference in the sepal length between Iris species.

❗️Note:
The Welch’s ANOVA test only indicates that there are significant differences in sepal length among at least some of the Iris species.

However, it doesn’t tell us which ones! 😱

Our next question would then be: which species pairs have these significant differences?

To answer this question, you can perform post-hoc tests to identify which species pairs have these significant differences.

Step 7: Post-hoc tests (if applicable)

As the ANOVA test was statistically significant (i.e., p < 0.05), we can run post-hoc tests, such as Tukey’s HSD (Honestly Significant Difference) or Bonferroni tests.

These tests help identify which specific pairs of species differ significantly in terms of sepal length.⁶

Step 7.1: Run the test(s)

Let’s run the Tukey’s HSD (Honestly Significant Difference) test:

# Import necessary libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison

# Fit a one-way ANOVA model
model = ols('sepal_length ~ species', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Create a MultiComparison object for post-hoc tests
mc = MultiComparison(df['sepal_length'], df['species'])

# Perform Tukey's HSD post-hoc test
tukey_result = mc.tukeyhsd()

# Display the results
print("Tukey's HSD Post-Hoc Test:")
print(tukey_result)

Here’s a breakdown of what the code does:

  • Import of the necessary libraries, including statsmodels.
  • Fit a one-way ANOVA model using the ols function from statsmodels.formula.api. This model tests the effect of the ‘species’ variable on the ‘sepal_length’ variable.
  • Perform the ANOVA analysis and store the results in anova_table.
  • Create a MultiComparison object that takes the ‘sepal_length’ data and the ‘species’ labels as inputs.
  • Use the tukeyhsd() method to perform Tukey’s HSD post-hoc test. The results are stored in the tukey_result variable.
  • Finally, print the results of the Tukey’s HSD test, which will show you which species pairs have statistically significant differences in sepal length.

Step 7.2: Interpret the Post-hoc Test Results

Tukey’s HSD Post-Hoc Test performed on the Iris dataset

Let’s break down the information presented in the table:

  • group1 and group2: These columns specify the two groups being compared.
    ◦ In this context, the groups are the Iris species: Iris-setosa, Iris-versicolor, and Iris-virginica.
  • meandiff: This column shows the difference in means between the two groups.
    ◦ For example, between Iris-setosa and Iris-versicolor, the mean sepal length differs by approximately 0.93 units.
  • p-adj: This column displays the p-value after adjusting for multiple comparisons.
    ◦ A small p-value (typically less than 0.05) suggests a significant difference between the groups.
  • lower and upper: These columns provide the confidence interval for the mean difference.
    ◦ In the case of Iris-setosa and Iris-versicolor, the mean difference lies between 0.6862 and 1.1738.
  • reject: This column indicates whether you should reject the null hypothesis (that there is no significant difference).
    ◦ If ‘True’, it means there is a significant difference between the two groups.
    ◦ If ‘False’, there is no significant difference.

Interpreting the results of the test:

  • Iris-setosa vs. Iris-versicolor: The mean difference (meandiff) is approximately 0.93, and the p-adj value is less than 0.001 (very significant).
    ◦ ✅ Therefore, there is a significant difference in sepal length between Iris-setosa and Iris-versicolor.
  • Iris-setosa vs. Iris-virginica: The mean difference is approximately 1.582, and the p-adj value is less than 0.001 (very significant).
    ◦ ✅ This indicates a significant difference in sepal length between Iris-setosa and Iris-virginica.
  • Iris-versicolor vs. Iris-virginica: The mean difference is approximately 0.652, and the p-adj value is less than 0.001 (very significant).
    ◦ ✅ This shows a significant difference in sepal length between Iris-versicolor and Iris-virginica.

In summary, the results of the Tukey’s HSD test suggest that there are significant differences in sepal length between all pairs of Iris species: Iris-setosa, Iris-versicolor, and Iris-virginica. The p-adj values are very small, indicating strong evidence of these differences.

Summary

By following these steps, you’ve successfully applied statistical analysis to the Iris dataset using Python and Jupyter notebooks in Visual Studio Code.

You can extend this analysis to explore other features and relationships within the dataset.

Happy analyzing!

A statistical analysis of the iris dataset (species vs sepal length)

Reference(s)

¹ Surbhi, S. (2017). ‘Difference Between Null and Alternative Hypothesis’. Key Differences. https://keydifferences.com/difference-between-null-and-alternative-hypothesis.html. Accessed: October 25, 2023.

² Walters, S. (2020). ‘Psychology — 1st Canadian Edition’. Thompson Rivers University. https://psychology.pressbooks.tru.ca/chapter/3-2-psychologists-use-descriptive-correlational-and-experimental-research-designs-to-understand-behaviour/#:~:text=Descriptive%20research%20is%20designed%20to,to%20assess%20cause%20and%20effect. Accessed: October 25, 2023.

³ Stangor, C. (2011). ‘Research methods for the behavioral sciences (4th ed.)’. Mountain View, CA: Cengage.

⁴ Stephanie, G. (n.d.). ‘Statistical Significance: Definition, Examples’. StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/what-is-statistical-significance/. Accessed: October 25, 2023.

⁵ Lund Research Ltd. (2018). ‘One-way ANOVA in SPSS Statistics’. Lund Research Ltd. https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-statistics.php#:~:text=Typically%2C%20a%20one%2Dway%20ANOVA,commonly%20used%20for%20two%20groups). Accessed: October 25, 2023.

⁶ Ott, R. L., & Longnecker, M. (2015). ‘An Introduction to Statistical Methods and Data Analysis (7th ed.)’. Cengage Learning.

Stephanie, G. (n.d.). ‘Welch’s ANOVA: Definition, Assumptions’. StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/what-is-statistical-significance/. Accessed: October 25, 2023.

Version history

  • v1.0 (2023–10–25): First published
  • v1.1 (2024–08–01): Updated the ‘formatted_p_value’ variable, as it was throwing an error.



Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io