[Data Analysis] Data Exploration: Summary Statistics (3/9)

5 min read · Sep 20, 2023

Learn how to summarise the Iris Flower dataset in VS Code using Python. Explore df.head(), df.tail(), df.info(), and df.describe() in this beginner-friendly guide for new Data Analysts.

[This guide is part 3 of a 9-article walkthrough.]

Key concepts:

Visual Studio Code · Python · Jupyter Notebook · Data analysis · Summary statistics

Are you an aspiring Data Analyst looking to dive into real-world datasets?

In this tutorial, we’ll walk you through performing an initial data summary using four essential pandas commands: df.head(), df.tail(), df.info(), and df.describe().

Summary statistics for the iris dataset using VS Code

To remind ourselves where summary statistics fit in, here is a general outline of the data analysis process, with the data exploration step marked:

Define Objectives: Clearly understand the goals of your analysis.
Data Acquisition: Obtain the dataset you’ll be working with.
➡️ Data Exploration: Explore the dataset to get an initial understanding of its structure and content. Key steps include:
◦ Viewing the first few rows using head() or sample() functions.
◦ Checking the data types and data distribution.
◦ Identifying missing values.
◦ Exploring unique values in categorical columns.
◦ Generating summary statistics using describe().
Data Cleaning: Preprocess the data to ensure its quality and consistency.
Data Visualization: Create visualizations to gain insights into the data.
Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm, then train the model and evaluate its performance using metrics like accuracy, precision, recall, or F1-score.
Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.
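The data exploration steps listed above can be sketched in a few lines of pandas. This is only a minimal illustration: the small DataFrame below is made-up sample data (not the real Iris CSV), and the column names sepal_length and species are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for a real CSV; values are illustrative only
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None, 6.3, 5.8],
    "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})

first_rows = df.head(3)                        # view the first 3 rows
random_rows = df.sample(n=2, random_state=0)   # view 2 random rows (reproducible)
column_types = df.dtypes                       # check the data type of each column
missing_counts = df.isna().sum()               # count missing values per column
species_values = df["species"].unique()        # unique values in a categorical column
summary = df.describe()                        # summary statistics for numeric columns

print(missing_counts)
print(species_values)
```

Each line maps to one bullet in the exploration step, so you can run them one at a time in separate notebook cells.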

Prerequisites

Before we start summarising our data, make sure the following are installed:

Visual Studio Code
Python
Jupyter Notebook
Pandas

Check here for step-by-step instructions.

Step 1: Loading the Iris Flower Dataset

In order to start, we first need to load a dataset into VS Code to analyse.

Open VS Code: launch VS Code on your computer.
◦ If you haven’t installed it yet, check out our step-by-step guide.
Create a new Jupyter Notebook: click on “File” > “New File” > “Jupyter Notebook” to create a new notebook.
Import the required libraries: At the top of your notebook, import the necessary libraries:

# Import required libraries 

import pandas as pd

Load the Iris flower dataset: load the dataset using the pandas read_csv() function:
◦ You can download the Iris dataset from reputable sources online. I used Kaggle.

# Load the dataset from a CSV file

df = pd.read_csv('iris_dataset.csv')

Note:

It’s good practice to keep the CSV files you’re working with in the same directory as your Jupyter notebook.
◦ Check here if you run into trouble during this step.
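If read_csv() fails with a "file not found" error, a small helper like the one below can make the problem easier to spot. This is a sketch, not part of pandas: load_csv and its folder parameter are hypothetical names introduced here, and it assumes your notebook's working directory is the folder the notebook lives in.

```python
from pathlib import Path

import pandas as pd


def load_csv(filename, folder=None):
    """Load a CSV from `folder` (defaults to the current working directory)."""
    base = Path(folder) if folder is not None else Path.cwd()
    csv_path = base / filename
    if not csv_path.exists():
        # Fail early with a message that shows exactly where we looked
        raise FileNotFoundError(
            f"Could not find {csv_path}. Is the file next to your notebook?"
        )
    return pd.read_csv(csv_path)
```

With the CSV saved next to your notebook, df = load_csv('iris_dataset.csv') would behave like the read_csv() call above, but with a clearer error if the file is missing.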

Step 2: Summarizing the Data

Now that the data is loaded into VS Code, let’s run some summary statistics on the dataset to get an initial understanding of how it looks.

df.head()

First, let’s display the first few rows:

# Display the first 5 rows of data

df.head()

df.head() — The first 5 rows of the iris dataset

df.tail()

Followed by the last few rows:

# Display the last 5 rows of data

df.tail()

df.tail() — The last 5 rows of the iris dataset

This helps us to understand how the dataset looks, much like how you would check out an Excel file when you first open it. It gives us important information such as:

What columns are there?
What does the data in each column look like:
◦ Numerical?
◦ Categorical?
◦ Timestamps?
What is the ‘shape’ and ‘feel’ of our dataset: what does it look like?
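Both head() and tail() accept an optional number of rows, and the shape and columns attributes answer the "what columns are there" and "what shape is it" questions directly. Here is a minimal sketch on made-up sample data (the values and column names are assumptions, not the real Iris file):

```python
import pandas as pd

# Hypothetical sample data; the real Iris dataset has 150 rows
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.3, 5.8],
    "species": ["setosa", "setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

top_three = df.head(3)            # first 3 rows instead of the default 5
bottom_two = df.tail(2)           # last 2 rows instead of the default 5
n_rows, n_cols = df.shape         # (number of rows, number of columns)
column_names = list(df.columns)   # column names as a plain Python list

print(f"{n_rows} rows x {n_cols} columns: {column_names}")
```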

df.info()

Next, let’s use the info() method to examine the columns of our dataset.

This gives a more comprehensive overview than the first and last rows we saw with .head() and .tail(), including:

Column number: the order in which the columns appear.
Column names: the name of each column.
Non-null count: the number of non-null (non-missing) values in each column. A count lower than the total number of rows means the column has missing values.
Dtype: the data type of each column, for example:
◦ Float: decimal values or fractional numbers
◦ Integer: whole numbers
◦ Datetime: a timestamp made up of a date component and a time component
◦ String: a sequence of characters (pandas often reports text columns as the object dtype)

# Display a summary of the dataset and its columns

df.info()

df.info() — A summary of the columns in the iris dataset
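To see how the non-null counts and dtypes relate to the data, here is a small sketch on made-up data with one missing value. The column names and values are assumptions for illustration; count() and dtypes return the same information that info() prints, but as objects you can work with.

```python
import pandas as pd

# Hypothetical sample data with one missing petal_length value
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, None, 4.7],
    "sample_id": [1, 2, 3, 4],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "measured_on": pd.to_datetime(
        ["2023-09-01", "2023-09-02", "2023-09-03", "2023-09-04"]
    ),
})

df.info()                       # prints column order, names, non-null counts and dtypes

non_null_counts = df.count()    # non-null values per column, as a Series
column_dtypes = df.dtypes       # dtype per column, as a Series

# Missing values per column = total rows minus non-null count
missing_per_column = len(df) - non_null_counts
print(missing_per_column)
```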

df.describe()

We can also generate a statistical summary of the numerical columns of our dataset using .describe(). This method will provide us with:

Count: the number of non-null (non-missing) values.
Mean: the average value.
Std: the standard deviation.
Min: the minimum value.
25%: the 25th percentile, the value below which 25% of the data falls.
50%: the 50th percentile, also known as the median, the value that splits the data into a lower half and an upper half.
75%: the 75th percentile, the value below which 75% of the data falls.
Max: the maximum value.

# Display a statistical summary of the dataset and its columns

df.describe()

df.describe() — Summary statistics for the iris dataset
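A quick way to build intuition for these numbers is to recompute a few of them by hand. The sketch below uses five made-up petal_width values (an assumption for illustration, not the real Iris data) and checks them against describe().

```python
import pandas as pd

# Hypothetical sample data: five illustrative measurements
df = pd.DataFrame({"petal_width": [0.2, 0.4, 1.3, 1.5, 2.1]})

summary = df.describe()

col = df["petal_width"]
manual_mean = col.mean()            # sum of the values divided by the count
manual_median = col.quantile(0.5)   # the 50th percentile
manual_std = col.std()              # sample standard deviation, as used by describe()

print(summary)
print(manual_mean, manual_median, manual_std)
```

By default describe() only covers numerical columns; passing include="all" also summarises text columns such as the species name.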

Conclusion

And that’s it — congratulations! 🎉 You’ve successfully loaded the Iris Flower dataset into VS Code using Python and performed an initial summary of the data. This is a fundamental step in any data analysis project, providing you with a solid foundation for further analysis.

We’re now ready to start the next step: cleaning our dataset so that it is ready to visualise.

Stay tuned for more data analysis tutorials — and until then, happy data analysing!