[Data Analysis] Cleaning data (4/9)

Sam Taylor
9 min read · Sep 24, 2023

--

Master dataset cleaning using Python and pandas with the Iris dataset. Perfect for budding Data Analysts — unlock the power of clean data!

[This guide is part 4 of a 9-article walkthrough.]

Key concepts:

Visual Studio Code · Python · Jupyter Notebook · Data analysis · Data cleaning

Photo by Towfiqu barbhuiya on Unsplash

Welcome to the fascinating world of data cleaning! In this step-by-step tutorial, we will demystify the art of cleaning datasets using the renowned Iris dataset as our playground. Whether you’re just starting or looking to refine your skills, this guide will equip you to handle real-world data with confidence.

To remind ourselves where in the process data cleaning comes into play, here is a general outline of the data analysis process:

  1. Define Objectives: Clearly understand the goals of your analysis.
  2. Data Acquisition: Obtain the dataset you’ll be working with.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. ➡️ Data Cleaning: Preprocess the data to ensure its quality and consistency:
    ◦ Handle missing data (e.g., imputation or removal).
    ◦ Handle duplicates if present.
    ◦ Convert data types as needed.
    ◦ Correct any data inconsistencies or errors.
  5. Data Visualization: Create visualizations to gain insights into the data.
  6. Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
  7. Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
  8. Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm, then train the model and evaluate its performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

In this guide, we will be looking at step 4: data cleaning.

Step 1: Setting Up Your Environment

First, let’s begin by ensuring you have the necessary tools.

  • Install Python & Pandas: If you don’t already have Python and pandas installed, don’t worry, this guide has your back.
  • Install VS Code: Additionally, you’ll need a code editor like Visual Studio Code (VS Code) installed (step-by-step guide).
Installing VS Code for macOS
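If pandas is missing from your setup, one quick way to get it (assuming you’re working inside a Jupyter notebook) is the %pip magic:

# Run this once in a notebook cell if pandas isn't installed yet

%pip install pandas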

Step 2: Importing the Essential Libraries

  • Open VS Code: Open VS Code, create a new Jupyter notebook (.ipynb) and connect your kernel.
    ◦ Check this guide if you need help connecting a kernel.
Opening a new Jupyter notebook (.ipynb) in VS Code
  • Import the required libraries: Once you’ve opened a new Jupyter notebook file, import the pandas library in the first cell. This is the heart and soul of data manipulation:
# Import the pandas library for data manipulation

import pandas as pd

Step 3: Loading the Iris Dataset

  • Download the Iris dataset: You can easily find this dataset online in CSV format. For example, on Kaggle.
    ◦ Check my previous article, or Google, for step-by-step instructions
  • Save the Iris dataset to your computer: Once you have the dataset, place it in your working directory, or anywhere you can easily access it.
  • Load the dataset into VS Code: To load it into a pandas DataFrame, use the following line of code:
    ◦ Ideally, the dataset (CSV file) and your Jupyter Notebook (.ipynb) should be in the same folder, for easy access.
    If you encounter an error here, check here for troubleshooting tips.
# Load iris_dataset.csv into a pandas DataFrame

df = pd.read_csv('iris_dataset.csv')
Loading the iris dataset in VS Code with pd.read_csv()
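Speaking of troubleshooting: if pd.read_csv() raises a FileNotFoundError, a common cause is that the notebook is running from a different folder than the CSV file. A quick way to check:

# Print the folder the notebook is currently running from,
# then adjust the path passed to pd.read_csv() if needed

import os

os.getcwd()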

Step 4: Initial Exploration

To understand your dataset, it’s crucial to take a peek inside.

  • Summary statistics: Execute the following commands to gain a quick overview of your dataset:
# Display the first few rows
df.head()
Iris dataset: df.head()
# Get data types and non-null counts
df.info()
Iris dataset: df.info()
# Summary statistics for numerical columns
df.describe()
Iris dataset: df.describe()

Here we can see that:

  • There are 5 columns of data: 4 numeric and 1 categorical
  • There are 150 rows of data, with no null values in any of the columns
  • The min, max and standard deviation suggest that there are unlikely to be many outliers, if any.
    ◦ For each numeric column, the min and max values lie within roughly three standard deviations of the mean (a quick check of this follows below).
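To make that reasoning concrete, here is a small sketch (assuming the df from Step 3 is still in memory) that checks whether each numeric column’s min and max fall within three standard deviations of its mean:

# For each numeric column, confirm that min and max fall
# within mean ± 3 standard deviations

for col in df.select_dtypes('number').columns:
    mean, std = df[col].mean(), df[col].std()
    print(col, df[col].min() >= mean - 3 * std,
          df[col].max() <= mean + 3 * std)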

For more about summary statistics: check here.

Step 5: Data Cleaning

Now that we’ve got a feel for our data, we can turn our attention to cleaning it. This usually involves a combination of:

  • Handling missing values
  • Handling duplicates
  • Converting data types
  • Renaming columns
  • Creating new columns
  • Handling outliers

5.1. Handling Missing Values

Luckily for us, we saw that our data doesn’t have any null or missing values.

  • To (re-)check for missing values, we can use the following code:
# Count missing values in the dataset

df.isnull().sum()
isnull().sum() — Iris dataset

In this case, as our values are all there, we don’t have to do anything and we can skip this step.
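As a side note, it can also help to look at missing values as a percentage of each column, which puts the scale of the problem into perspective:

# Share of missing values per column, as a percentage

(df.isnull().mean() * 100).round(1)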

However, if our dataset was missing values, we could take a closer look at which rows were affected, with the following code:

# Examine rows with missing values in any of the columns

df.loc[df.isnull().any(axis=1)]
The rows of data that contain missing values in the iris dataset — in this case, none.

We could then decide how to address the missing values, either by removing the rows or by filling in the missing values with a number of our choice.

  • Removing the missing values using the dropna() function:
    ◦ This will remove any row of data with missing values.
    ◦ This method usually makes sense if there are a lot of missing values for a particular column or row — you could then simply remove the row/column in question.
# Remove any rows where there are missing values

df = df.dropna()
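The same function can also remove columns rather than rows. As a sketch, the thresh parameter below keeps only columns with at least half of their values present (the 50% cutoff here is an arbitrary choice for illustration):

# Drop any column where more than half of the values are missing
# (thresh = minimum number of non-missing values required to keep it)

df = df.dropna(axis=1, thresh=int(len(df) * 0.5))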
  • Filling the missing values using the fillna() function:
    ◦ There are many ways to fill in missing values; a full discussion is out of scope for this article, as the ‘best’ method differs for each dataset, and for each column within a dataset.
    ◦ For beginners, one of the most common ways to fill in missing values is with the average value of each column.
# Fill rows where there are missing values with the average value
# of each numeric column (numeric_only=True skips the text column,
# which would otherwise raise an error)

df = df.fillna(df.mean(numeric_only=True))
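If a single column needs special treatment, you can also fill column by column. A minimal sketch, assuming our dataset’s column names: the median for a numeric column, and the most frequent value (mode) for a text column:

# Fill a numeric column with its median, and a categorical
# column with its most frequent value

df['sepal_length'] = df['sepal_length'].fillna(df['sepal_length'].median())
df['species'] = df['species'].fillna(df['species'].mode()[0])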

5.2. Handling Duplicates

Once we have handled our missing values, we can then check for duplicate values. Duplicate values are rows of data that have the same data points for every column of your dataset. This can, of course, distort your analysis and should be avoided.

To handle duplicate values, we can:

  • Count how many duplicates there are, to understand the scale of the problem:
# Count duplicate values

df.duplicated().sum()
A sum of the duplicate values in the iris dataset
  • Examine the affected rows, to see if the duplicate values really are duplicates or if there is some other explanation:
# Check the duplicate rows

df.loc[df.duplicated()]
Duplicate rows in the iris dataset
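Note that df.duplicated() only flags the later copies by default. To see every row involved, including the first occurrence of each, you can pass keep=False:

# Show all copies of each duplicated row, sorted so that
# identical rows appear next to each other

df[df.duplicated(keep=False)].sort_values(by=list(df.columns))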
  • Drop the duplicate values, if you think that the values should be dropped, based on your knowledge of the dataset.
    ◦ This is where your domain knowledge and understanding of the topic of the dataset will come in handy.
    ◦ ❗️In our case, I would recommend against dropping the duplicate values as (i) there aren’t many of them and (ii) it’s feasible for flowers to have the same characteristics — akin to multiple people having the same height and weight.
# Drop duplicate values

df.drop_duplicates(inplace=True)

5.3. Data Type Conversion

Sometimes, you may need to convert the data types of some of your columns.

To check what data types you have, you can use the info() function:

# Check the data types of your columns

df.info()
df.info() — iris dataset

In our case, our datatypes (Dtype) are well matched:
◦ Our numerical values are floats, which is a good fit, as our numerical data has decimal points.
◦ Our species column is an object, which is usually used for text-based data.

Should we want to change the data type of a column, we could use the astype() function, like below:
◦ You can read the code as:

“Create a new column called ‘sepal_length’ (or overwrite the existing ‘sepal_length’ column, if there is one). The values for this new column will take the current ‘sepal_length’ column, but with its values converted to the float data type”

# Convert a column to the float data type

df['sepal_length'] = df['sepal_length'].astype(float)
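Another conversion worth knowing about (not needed for our analysis): text columns with a small set of repeated labels, like species, can be stored as pandas’ category dtype, which saves memory and can speed up group-by operations:

# Store the 'species' column as a 'category' dtype

df['species'] = df['species'].astype('category')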

Check here for more information on the different data types available.¹

5.4. Renaming Columns

For clarity and consistency, we can also consider renaming columns:

  • To rename a column, enter the following code:
    ◦ In our case, our columns are already well named.
# Rename a column(s)

df.rename(columns={'old_column_name': 'new_column_name'}, inplace=True)

Good practice is usually to have column names in lowercase and, if there are multiple words, to have each word joined by an underscore.
For example: sepal_length
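If a dataset arrives with messy column names, you can enforce this convention in one line:

# Normalise all column names: lowercase, with spaces
# replaced by underscores

df.columns = df.columns.str.lower().str.replace(' ', '_')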

5.5. Data Transformation

Depending on your analysis goals, you might need to perform various data transformations, such as encoding categorical variables, scaling numerical features, or creating new ones.

In our case, we don’t need to do any data transformation.

However, as an example, we could create a new column called ‘sum_lengths’, where we add together the ‘sepal_length’ and ‘petal_length’ columns.

# Create a new column called: 'sum_lengths'
## Calculation: 'sepal_length' + 'petal_length'

df['sum_lengths'] = df['sepal_length'] + df['petal_length']

We can then check the new column using the head() function:

# Check the sum_lengths column

df.head()
Checking out the ‘sum_lengths’ column with df.head() function
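For completeness, here is what encoding a categorical variable could look like, using pandas’ get_dummies() on the species column (again, not needed for our analysis):

# One-hot encode the 'species' column into separate 0/1 columns

encoded = pd.get_dummies(df, columns=['species'])
encoded.head()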

5.6. Outlier Detection and Handling

Finally, we can also look at any outliers our dataset might have.

Outliers in a dataset are data points or observations that significantly deviate from the majority of other data points. They are data values that are unusually extreme or different from the typical pattern or distribution of the dataset. Outliers can greatly impact your results.

One method for detecting outliers is by using statistical methods like z-scores²:

# Create a z-score for the column 'sepal_length'

from scipy import stats

z_scores = stats.zscore(df['sepal_length'])

# Display the z-scores
z_scores
z-scores for the ‘sepal_length’ column of the iris database
# Check rows with an absolute z-score over 3 (z-scores can be
# negative, so take the absolute value to catch both directions)

df[(abs(z_scores) > 3)]
Rows where the z-score is over 3 — in our case, none.

In our case, we don’t have any rows with an absolute z-score of 3 or more.

If we did, we could run the following code to remove them:

# Remove any rows with an absolute z-score over 3
# (keep only rows within ±3 standard deviations)

df = df[(abs(z_scores) < 3)]

❗️Note: A data point whose z-score exceeds 3 in absolute value is commonly treated as an outlier, due to its significant deviation from the other data points.³ (Ektamaini, 2020)
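As a brief aside, another widely used rule of thumb (not covered in this series) is the IQR method, which flags values more than 1.5 times the interquartile range beyond the first or third quartile:

# IQR method: flag values beyond 1.5 × IQR from the quartiles

q1 = df['sepal_length'].quantile(0.25)
q3 = df['sepal_length'].quantile(0.75)
iqr = q3 - q1

df[(df['sepal_length'] < q1 - 1.5 * iqr) | (df['sepal_length'] > q3 + 1.5 * iqr)]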

Step 6: Saving the Cleaned Dataset

Once your dataset is pristine, don’t forget to save your hard work. You can create a new file to store the cleaned data:

# Save the cleaned data to a csv file on your computer

df.to_csv('cleaned_iris.csv', index=False)
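As a final sanity check, you can re-load the saved file and confirm it matches what you expect:

# Re-load the cleaned file and check its dimensions (rows, columns)

check = pd.read_csv('cleaned_iris.csv')
check.shape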

Conclusion:

Congratulations! 🎉 You’ve just completed an introduction to data cleaning with Python and pandas, using the Iris dataset as your guide.

Here’s the process as a whole:

Cleaning the iris dataset in VS Code using Python

Data cleaning is a vital skill in the Data Analyst’s toolkit. With these skills, you’re well on your way to transforming messy data into valuable insights. Keep practicing with various datasets, and soon, you’ll be a proficient data cleaner and analyst.

Up next, we will look at how we can visualise this clean data, to start looking for patterns and insights.

Happy data cleaning!

Reference(s):

¹ Pandas. (2023). ‘Essential basic functionality: dtypes’. Pandas. https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes

² Glen, S. (2023). ‘Z-Score: Definition, Formula and Calculation’. StatisticsHowTo.com. https://www.statisticshowto.com/probability-and-statistics/z-score/

³ Ektamaini. (2020, August 27). ‘Z score for Outlier Detection — Python’. GeeksforGeeks. https://www.geeksforgeeks.org/z-score-for-outlier-detection-python/


Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io