[Data Analysis] Visualising a dataset (5/9)

Sam Taylor
14 min read · Sep 27, 2023

Unlock the power of data visualization with Python in Visual Studio Code. From scatter plots to correlation matrices, explore the Iris Flower dataset to become a skilled data analyst.

[This guide is part 5 of a 9-article walkthrough.]

Key concepts:
Data analysis · Data visualisation · Data analysis process · Data analysis projects · VS Code · Python

Photo by Isaac Smith on Unsplash

Aspiring data analysts, are you ready to elevate your data visualization skills using Python in Visual Studio Code? In this beginner’s guide, we’ll explore a variety of essential visualizations to gain some insights into the Iris Flower dataset. Let’s dive in!

As a reminder of where data visualisation fits in, here is a general outline of the data analysis process:

  1. Define Objectives: Clearly understand the goals of your analysis.
  2. Data Acquisition: Obtain the dataset you’ll be working with.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. Data Cleaning: Preprocess the data to ensure its quality and consistency.
  5. ➡️ Data Visualization: Create visualizations to gain insights into the data. Use libraries like Matplotlib, Seaborn, or Plotly to create plots, charts, and graphs.
    ◦ Histograms and bar plots for data distribution.
    ◦ Scatter plots for relationships between variables.
    ◦ Box plots for identifying outliers.
    ◦ Heatmaps for correlation analysis.
  6. Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
  7. Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
  8. Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm & train and evaluate the model’s performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

Prerequisites

Step 1: Setting Up Your Environment

Before we start, make sure you have the necessary tools and libraries installed:

  • Visual Studio Code (VS Code): A coding environment.
    Step-by-step guide
  • Python: A coding language and the backbone of our data analysis.
    Step-by-step guide
  • Jupyter Extension for VS Code: For interactive notebooks within VS Code.
    Step-by-step guide
  • Pandas, Matplotlib, Seaborn: Python libraries for data manipulation and visualization.
    ◦ Open the terminal or command prompt and enter the following commands one at a time, pressing Enter after each:
pip install pandas
pip install matplotlib
pip install seaborn

Ensure these are installed by following the guides above or other online guides if needed.
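To confirm the installation worked, a quick check along the following lines can be run in Python. This is only a sketch: the helper name installed_versions is made up for this example, and it simply tries to import each library and reports the version it finds.

```python
# Check that the three libraries can be imported and report their versions.
from importlib import import_module


def installed_versions(packages=("pandas", "matplotlib", "seaborn")):
    """Return {package name: version string, or None if the import fails}.

    Hypothetical helper written for this guide, not part of any library.
    """
    versions = {}
    for name in packages:
        try:
            module = import_module(name)
            # Most packages expose __version__; fall back to "unknown" if not.
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, re-run the matching pip install command above.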

Installing a Python package via the command terminal

Step 2: Creating a Jupyter Notebook

Launch Visual Studio Code, create a new Jupyter Notebook, connect a kernel, and save the notebook with an appropriate name such as “Iris_Flower_Data_Visualization.ipynb”.
Step-by-step guide

Step 3: Importing Libraries

In your Jupyter Notebook, start by importing the necessary libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Importing Python libraries in VS Code

Step 4: Loading the Iris Flower Dataset

  • Download the dataset: Download the Iris Flower dataset as a CSV file from a trusted source, for example Kaggle.
    Step-by-step guide
  • Upload it to VS Code: Then, load it into a Pandas DataFrame:
    ◦ Replace ‘your_file_path’ with the actual path to your dataset.
    Step-by-step guide

4.1. Dataset context
The Iris dataset is a classic dataset in the field of data analysis and machine learning, often used for classification and data exploration. It contains measurements of four features (sepal length, sepal width, petal length, and petal width) for three species of iris flowers: setosa, versicolor, and virginica.

# Import the iris dataset using pd.read_csv 
# Replace 'your_file_path' with the actual path to your dataset.

df = pd.read_csv('your_file_path/iris.csv')
Loading a CSV dataset in VS Code
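The code in this guide expects the columns sepal_length, sepal_width, petal_length, petal_width and species. Depending on where the CSV was downloaded from, the headers may differ. The sketch below assumes, purely for illustration, a file with an extra Id column and headers such as SepalLengthCm; check your own file with df.columns and adjust the mapping to match. The two rows of data are a stand-in for a real download.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded file whose headers differ from the
# names used in this guide. The two rows are for illustration only.
raw_csv = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
df_raw = pd.read_csv(raw_csv)

# Map the assumed headers onto the names used in the rest of the guide.
rename_map = {
    "SepalLengthCm": "sepal_length",
    "SepalWidthCm": "sepal_width",
    "PetalLengthCm": "petal_length",
    "PetalWidthCm": "petal_width",
    "Species": "species",
}

# Rename the columns and drop the row-number column if it is present.
df_clean = df_raw.rename(columns=rename_map).drop(columns=["Id"], errors="ignore")

print(df_clean.columns.tolist())
print(df_clean.dtypes)
```

If your file already uses the names from this guide, the rename step changes nothing and can be skipped.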

Step 5: Visualizing the Data

Now, we will create a variety of visualizations to explore our dataset. We will look at the following visualization types:

Bar plot · Scatter plot · Histogram · Box plot · Joint plot · Violin plot · Facet grid · Pair plot · Heatmap

For each visualization type, we will:
i. Explain what the visualization type is useful for,
ii. Provide example code to create the visualization type in Python,
iii. Give an explanation of how to read the code,
iv. Give an example of how to interpret the visualization.

5.1. Bar Plot

🤷‍♂ 5.1.1. Use
Bar plots are effective for comparing the means or counts of a numerical variable across different categories. In the Iris dataset, a bar plot can be used to compare the average petal length for each species, providing insights into their differences.

🐍 5.1.2. Code

# Create a bar plot showing the average petal length of each flower species

sns.barplot(x='species', y='petal_length', data=df, ci=None);

  • sns.barplot(): This line creates a bar plot using Seaborn’s barplot function. By default, the height of each bar is the mean of the y values in that category.
    ◦ x=‘species’: specifies that the x-axis represents the ‘species’ column.
    ◦ y=‘petal_length’: specifies that the y-axis represents the ‘petal_length’ column.
    ◦ data=df: specifies that the data you want to use is df, our iris dataset, which we imported at the beginning.
    ◦ ci=None: this argument hides the confidence-interval error bars on the plot. In Seaborn 0.12 and later, ci is deprecated in favour of errorbar=None.

Iris dataset — Bar plot

💡 5.1.3. Interpretation

  • Setosa has the lowest average petal length of the three species (~1.5cm).
  • Versicolor has a higher average petal length than Setosa but a lower one than Virginica (~4.2cm).
  • Virginica has the highest average petal length of the three species (~5.6cm).
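The averages read off a bar plot can also be checked numerically, since each bar is just a per-category mean. Here is a minimal sketch using a small made-up DataFrame (the values are toy numbers, not the real iris measurements); on the real data you would run the same groupby on df.

```python
import pandas as pd

# Toy values for illustration only, not the real iris measurements.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 4.0, 4.4, 5.5, 5.7],
})

# One mean per species: these are the heights a bar plot would draw.
mean_petal_length = toy.groupby("species")["petal_length"].mean()
print(mean_petal_length.round(2))
```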

5.2. Scatter Plot

🤷‍♀ 5.2.1. Use
Scatter plots are excellent for visualizing the relationship between two continuous variables. They help you identify patterns, clusters, outliers, and correlations in your data. In the context of the Iris Flower dataset, a scatter plot can help you understand how features like sepal length and sepal width vary across different species.

🐍 5.2.2. Code

# Create a scatter plot of sepal_length and sepal_width

sns.scatterplot(x='sepal_length', y='sepal_width', data=df, hue='species')

plt.title('Sepal Length vs. Sepal Width')

plt.show();
  • sns.scatterplot(): This line creates a scatter plot using Seaborn’s scatterplot function.
    ◦ x=‘sepal_length’: specifies that the x-axis represents the ‘sepal_length’ column.
    ◦ y=‘sepal_width’: specifies that the y-axis represents the ‘sepal_width’ column.
    ◦ data=df: specifies that the data you want to use is df, our iris dataset, which we imported at the beginning.
    ◦ hue=‘species’: specifies that we should colour the graph, based on the ‘species’ column.
  • plt.title(‘Sepal Length vs. Sepal Width’): This line sets the title of the plot to ‘Sepal Length vs. Sepal Width.’
  • plt.show(): Finally, this command displays the plot.
Iris dataset — Scatter plot

💡 5.2.3. Interpretation

  • Setosa has the shortest sepal length and widest sepal width. The two variables have a strong positive correlation.
    ◦ That is, Setosa flowers with longer sepals also tend to have wider sepals.
  • Versicolor & Virginica have a weaker positive correlation between sepal width and sepal length.
  • Versicolor & Virginica’s sepal length and width overlap with each other.
    ◦ However, Versicolor’s sepal tends to be slightly shorter on average in comparison to Virginica’s.

5.3. Histogram

🤷‍♂ 5.3.1. Use
Histograms are useful for understanding the distribution of a single variable. They display the frequency or density of data points within specified bins or intervals. You can use histograms to visualize the distribution of sepal length or any other numeric variable in the Iris dataset.

🐍 5.3.2. Code

# Create a histogram to show the distribution of sepal length and sepal width

plt.hist(df['sepal_length'], bins=10, alpha=0.5, label='Sepal Length')

plt.hist(df['sepal_width'], bins=10, alpha=0.5, label='Sepal Width')

plt.legend(loc='upper right');
  • plt.hist(): These lines create histograms using Matplotlib’s hist function.
    ◦ df[‘sepal_length’]: selects the ‘sepal_length’ column to be visualised
    ◦ bins=10: specifies the number of bins.
    ◦ alpha=0.5: sets how transparent the colours should be.
    ◦ label=‘Sepal Length’: adds a legend label with the name ‘Sepal Length’.
  • The same call is repeated for ‘sepal_width’.
  • plt.legend(): adds a legend to the plot, using the labels set above.
    ◦ loc=‘upper right’: places the legend in the upper-right corner of the graph.
Iris dataset — Histogram

💡 5.3.3. Interpretation

  • Immediately, we can see that there are likely to be distinct categories or groupings in the data (i.e. different species in our case), as there are multiple peaks (a multimodal distribution), particularly for the sepal length.
  • Sepal length is almost always greater than sepal width.
  • Sepal length has a larger range than sepal width.
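To see what the bins argument does, it can help to compute the bin edges and counts directly with NumPy’s histogram function, which performs the same kind of binning without drawing anything. The sketch below uses made-up values rather than the real dataset.

```python
import numpy as np

# Toy sepal lengths in cm, for illustration only.
values = np.array([4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.7, 7.1])

# bins=5 splits the range from the smallest to the largest value into
# five equal-width intervals and counts the values falling in each one.
counts, edges = np.histogram(values, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} cm: {count} flowers")
```

Fewer bins give a smoother but coarser picture; more bins show more detail but can look noisy on a small dataset.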

5.4. Box Plot

🤷‍♀ 5.4.1. Use
Box plots are ideal for visualizing the distribution of a numerical variable across different categories or groups. They display the median, quartiles, and potential outliers. For the Iris dataset, a box plot can show the distribution of petal length for each species, helping you compare their characteristics.

🐍 5.4.2. Code

# Create a box plot to show the petal length by species

sns.boxplot(x='species', y='petal_length', data=df);
  • sns.boxplot(): This line creates a box plot using Seaborn’s boxplot function.
    ◦ x=‘species’: specifies that the x-axis represents the ‘species’ column.
    ◦ y=‘petal_length’: specifies that the y-axis represents the ‘petal_length’ column.
    ◦ data=df: specifies that the data you want to use is df, our iris dataset, which we imported at the beginning.
Iris dataset — Box plot

💡 5.4.3. Interpretation

  • There seem to be very few outliers for each species, with Setosa having the most.
  • Setosa has the smallest petal length and a very small range.
  • Versicolor and Virginica’s petal lengths overlap with each other. However, Virginica’s is larger on average than Versicolor’s.
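The numbers behind a box plot can be worked out by hand. With default settings, the box spans the first to third quartile, and points lying more than 1.5 times the interquartile range (IQR) beyond the box are drawn as individual outlier markers. The following sketch applies that rule to a handful of made-up values.

```python
import pandas as pd

# Toy petal lengths in cm, for illustration only.
petal_length = pd.Series([1.0, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.6, 2.0])

q1 = petal_length.quantile(0.25)   # bottom edge of the box
median = petal_length.median()     # line inside the box
q3 = petal_length.quantile(0.75)   # top edge of the box
iqr = q3 - q1                      # height of the box

# Points outside these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = petal_length[(petal_length < lower_fence) | (petal_length > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print("Outliers:", outliers.tolist())
```

On the real data, the same steps can be applied to one species at a time, for example to df[df['species'] == 'Iris-setosa']['petal_length'].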

5.5. Joint Plot

🤷‍♂ 5.5.1. Use
Joint plots combine a scatter plot with histograms on the axes, providing a deeper view of the relationship between two variables. They are great for understanding the distribution of data points along with their correlation. In the Iris dataset, a joint plot can reveal how sepal length and sepal width are distributed and correlated.

🐍 5.5.2. Code

# Create a joint plot of sepal length against sepal width

sns.jointplot(
    x='sepal_length',
    y='sepal_width',
    data=df,
    kind='scatter',
    hue='species',
);
  • sns.jointplot(): This line creates a joint plot using Seaborn’s jointplot function.
    ◦ x=‘sepal_length’: specifies that the x-axis represents the ‘sepal_length’ column.
    ◦ y=‘sepal_width’: specifies that the y-axis represents the ‘sepal_width’ column.
    ◦ data=df: specifies that the data you want to use is df, our iris dataset, which we imported at the beginning.
    ◦ kind=‘scatter’: specifies that the central graph should be a scatter plot.
    ◦ hue=‘species’: specifies that we should colour the graph based on the ‘species’ column.
Iris dataset — Joint plot

💡 5.5.3. Interpretation

  • Iris Setosa has the shortest sepal length but the largest sepal width, and the two are strongly positively correlated.
  • Iris Versicolor & Iris Virginica are more similar to each other; for both species, sepal length and width are positively correlated.
  • Iris Virginica’s sepals are slightly longer and wider on average than Iris Versicolor’s.

5.6. Violin Plot

🤷‍♀ 5.6.1. Use
Violin plots are similar to box plots but provide a richer visualization of the data distribution. They show the probability density of the data at different values. In the Iris dataset, a violin plot can offer a more detailed view of how petal width varies among the species.

🐍 5.6.2. Code

# Create a violin plot to see the petal width by species 

sns.violinplot(x='species', y='petal_width', data=df);
  • sns.violinplot(): This line creates a violin plot using Seaborn’s violinplot function.
    ◦ x=‘species’: specifies that the x-axis represents the ‘species’ column.
    ◦ y=‘petal_width’: specifies that the y-axis represents the ‘petal_width’ column.
    ◦ data=df: specifies that the data you want to use is df, our iris dataset, which we imported at the beginning.
Iris dataset — Violin plot

💡 5.6.3. Interpretation

  • We can see that Iris Setosa has the smallest petal width, with values tightly concentrated around 0.25cm.
  • For Versicolor and Virginica, we can see that Virginica has wider petals and a much larger range compared to Versicolor.

5.7. Facet Grid

🤷‍♂ 5.7.1. Use
Facet grids allow you to create multiple plots for different subsets of your data. They are useful when you want to compare distributions or relationships within subgroups of your dataset. In the Iris dataset, you can use a facet grid to compare the distribution of sepal lengths for each species.

🐍 5.7.2. Code

# Create a facet grid to show sepal length by species

g = sns.FacetGrid(df, col='species')

g.map(plt.hist, 'sepal_length');
  • sns.FacetGrid(): This line initializes a FacetGrid using Seaborn.
    ◦ df: is the data we want to visualise (our iris dataset).
    ◦ col=‘species’: specifies that you want one subplot per value in the ‘species’ column, which is why we get one graph per iris species.
  • g.map(): This line maps the plotting function (plt.hist) to each subplot in the FacetGrid.
    ◦ ‘sepal_length’: this is the column passed to plt.hist, so each subplot shows the distribution of sepal length for one species.
Iris dataset — Facet grid

💡 5.7.3. Interpretation

  • Setosa has the shortest sepal length of all species.
  • Virginica has the largest sepal length of all species.
  • Setosa and Versicolor’s sepal length seem more evenly distributed than Virginica’s sepal length.
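To make clearer what a facet grid does, here is a rough equivalent built by hand with plain Matplotlib: filter the data once per species and draw each subset on its own axes. This is a sketch on made-up values, not a replacement for sns.FacetGrid, which takes care of the looping, axis sharing and titles for you.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy sepal lengths in cm, for illustration only.
toy = pd.DataFrame({
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
    "sepal_length": [4.8, 5.0, 5.1, 5.3, 5.6, 5.9, 6.0, 6.3, 6.2, 6.5, 6.8, 7.2],
})

species_names = toy["species"].unique()

# One column of the figure per species, with shared axes so that the
# panels can be compared directly.
fig, axes = plt.subplots(1, len(species_names), figsize=(9, 3), sharex=True, sharey=True)

for ax, name in zip(axes, species_names):
    subset = toy.loc[toy["species"] == name, "sepal_length"]
    ax.hist(subset)
    ax.set_title(f"species = {name}")
    ax.set_xlabel("sepal_length")

fig.tight_layout()
```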

5.8. Pair Plot

🤷‍♀ 5.8.1. Use
Pair plots create a grid of scatter plots for multiple variables, helping you quickly assess the relationships and distributions across all pairs of variables. They are especially valuable for identifying patterns and correlations in multivariate data. For the Iris dataset, a pair plot will display scatter plots for sepal length, sepal width, petal length, and petal width.

🐍 5.8.2. Code

# Create a pair plot to compare our dataframe across multiple graphs

sns.pairplot(df, hue='species');
  • sns.pairplot(): This line creates a pair plot using Seaborn’s pairplot function. It visualizes relationships between all pairs of numeric columns in the dataset.
    ◦ df: This is the data we want to visualise (our iris dataset).
    ◦ hue=‘species’: The hue parameter colors the data points based on the ‘species’ column.
Iris dataset — Pair plot

💡 5.8.3. Interpretation
There are many possible observations here, given the number of graphs. We won’t cover everything, but here are a few that caught our eye:

  • For all species, sepal length and width are positively correlated.
  • For Versicolor and Virginica, {petal length and sepal length} are also positively correlated, but not for Setosa (or at least, not as strongly correlated).
    ◦ We see the same, but weaker, pattern between {petal length and sepal width}.
  • For Versicolor, petal width and length are strongly correlated.

5.9. Heatmap (Correlation Matrix)

🤷‍♂ 5.9.1. Use
Correlation matrices are used to understand the strength and direction of relationships between multiple variables. They are crucial for identifying which variables are positively or negatively correlated. In the Iris dataset, a correlation matrix can reveal how sepal length, sepal width, petal length, and petal width are related to each other.

🐍 5.9.2. Code

# Create a correlation matrix of our data to show the relationships between columns

# Drop the "species" column, as it isn't numeric and cannot be included in the correlation
correlation_matrix = df.drop(columns=["species"]).corr()

# Make the figure bigger
plt.figure(figsize=(8, 6))

# Plot the correlation_matrix and set the value labels (annot) and the colours (cmap)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm');
  • correlation_matrix = df.drop(columns=[“species”]).corr(): This line calculates the correlation matrix for all numeric columns in the Iris dataset (df).
    ◦ df.drop(columns=[“species”]): Removes the ‘species’ column from our dataframe, as it isn’t numeric and would cause an error when the correlations are calculated in recent versions of Pandas.
    ◦ .corr(): Calculates the pairwise correlation (Pearson, by default) between the remaining columns.
  • plt.figure(figsize=(8, 6)): sets the figure size to 8 by 6 inches.
  • sns.heatmap(): This line creates a heatmap using Seaborn’s heatmap function.
    ◦ correlation_matrix: tells the heatmap that we want to visualize the correlation matrix we created above.
    ◦ annot=True: writes the correlation value inside each cell of the heatmap.
    ◦ cmap=‘coolwarm’: uses the ‘coolwarm’ colour map, which runs from blue for low values to red for high values, making the strength and direction of each correlation easy to see.
Iris dataset — Correlation matrix

💡5.9.3. Interpretation

  • We can see that {sepal length and petal length} & {petal width and petal length} are very strongly positively correlated.
    ◦ That is, flowers with longer sepals tend to have longer petals.
    ◦ Likewise, flowers with wider petals tend to have longer petals.
  • We can also see a negative correlation between {petal length and sepal width}.
    ◦ That is, flowers with longer petals tend to have narrower sepals.
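Each cell of the heatmap is a Pearson correlation coefficient, which ranges from -1 (perfect negative relationship) through 0 (no linear relationship) to +1 (perfect positive relationship). As a rough illustration of where the number comes from, the sketch below computes it by hand for two made-up columns and compares the result with Pandas’ built-in corr method.

```python
import numpy as np
import pandas as pd

# Toy measurements in cm, for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.6, 6.0],
    "petal_width": [0.2, 0.3, 1.5, 1.4, 2.1, 2.5],
})

x = toy["petal_length"].to_numpy()
y = toy["petal_width"].to_numpy()

# Pearson r: how much x and y vary together, scaled by how much each
# varies on its own.
x_centred = x - x.mean()
y_centred = y - y.mean()
r_manual = (x_centred * y_centred).sum() / np.sqrt(
    (x_centred ** 2).sum() * (y_centred ** 2).sum()
)

# The same value from Pandas (Pearson is the default method).
r_pandas = toy["petal_length"].corr(toy["petal_width"])

print(round(r_manual, 4), round(r_pandas, 4))
```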

❗️Note: Here, all three species are combined, so although we see some interesting trends for the iris dataset as a whole, it is also worth examining the correlations for each species separately.
◦ The code below does this for Iris Virginica.
◦ To compare the other species, replace ‘Iris-virginica’ with the other species names: ‘Iris-versicolor’ | ‘Iris-setosa’.

# Create a correlation matrix for the Iris Virginica species

# Filter the dataframe for 'Iris-virginica'
df_virginica = df[df["species"] == "Iris-virginica"]

# Drop the 'species' column (as it isn't numerical and cannot be plotted)
df_virginica = df_virginica.drop(columns=["species"])

# Compute the correlation matrix
correlation_matrix = df_virginica.corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Iris-virginica Features')
plt.show()
Correlation heatmap for the Iris Virginica species
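Why does splitting by species matter? A correlation computed on all species together can differ from, or even point in the opposite direction to, the correlation within each species. The sketch below shows this on a small made-up dataset that was deliberately constructed to exaggerate the effect; it makes no claim about the real iris values, which you can check by running the same loop on df.

```python
import pandas as pd

# Toy data constructed for illustration only: within each species, wider
# sepals go with longer petals, but the species with wider sepals has much
# shorter petals overall.
toy = pd.DataFrame({
    "species": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
    "sepal_width": [3.0, 3.4, 3.8, 2.6, 3.0, 3.4],
    "petal_length": [1.3, 1.6, 1.7, 5.0, 5.8, 6.0],
})

# Correlation with all species pooled together.
pooled_r = toy["sepal_width"].corr(toy["petal_length"])

# Correlation computed separately for each species.
per_species_r = {
    name: group["sepal_width"].corr(group["petal_length"])
    for name, group in toy.groupby("species")
}

print(f"All species pooled: {pooled_r:.2f}")
for name, r in per_species_r.items():
    print(f"{name}: {r:.2f}")
```

Looping over groupby in this way also avoids copying the filtering code once per species.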

Step 6: Findings

Overall, the Iris dataset demonstrates that each iris species has unique characteristics in terms of sepal and petal dimensions. These distinct features make it possible to classify iris flowers into their respective species based on their measurements.

  1. Iris Setosa:
    Sepal Characteristics: Iris setosa has the shortest sepal length and widest sepal width among the three species.
    Petal Characteristics: It has the shortest petal length and petal width.
    Distinct Features: Iris setosa is the most distinguishable species, with significantly different sepal and petal dimensions compared to the other two species.
  2. Iris Versicolor:
    Sepal Characteristics: Iris versicolor has intermediate values for sepal length and sepal width.
    Petal Characteristics: It has moderate petal length and petal width, falling between setosa and virginica.
    Overlap with Other Species: Versicolor’s sepal measurements overlap with both setosa and virginica, while its petal measurements overlap mainly with virginica, making it less distinct.
  3. Iris Virginica:
    Sepal Characteristics: Iris virginica typically has the longest sepal length among the three species.
    Petal Characteristics: It has the longest petal length and widest petal width.
    Overlap with Versicolor: While virginica has some overlap with versicolor, it generally has larger petal dimensions, making it distinguishable.

Visualisations like scatter plots, box plots, and pair plots have helped to explore and confirm these observations.

Taking this one step further, data analysts can use these findings to build machine learning classification models or to gain further insight into the morphological differences between the iris species.

That’s all for now — happy analysing!

Version history

  • v1.0 (2023-09-27): First published.
  • v1.1 (2024-08-01): Updated code for the correlation matrices to drop the ‘species’ column, as it was causing an error when plotting.

Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io