Exploratory Data Analysis with Python Jupyter Notebook: A tutorial on how to perform exploratory data analysis (EDA) in Jupyter Notebook, covering data cleaning, data preprocessing & data visualization techniques.
Introduction
Exploratory data analysis is an important first step when working with any new data. It allows us to get familiar with the data, identify any issues or anomalies, and gain insights that help guide the rest of the analysis process.
In this blog post, I will walk through performing exploratory data analysis in Python using Jupyter Notebook.
Importing the Data
The first step is to import the data into your Jupyter Notebook. We will use libraries such as NumPy, pandas, and seaborn, with pandas reading the data into a DataFrame.
For example, if your data is in a CSV file, you can import it like this:
import pandas as pd
df = pd.read_csv("data.csv")
Data Cleaning
Once the data is imported, the next step is to clean the data. This includes:
- Handling missing values: check for null values using .isnull() and fill them in using .fillna()
- Removing duplicates: use .drop_duplicates()
- Converting data types: use .astype()
Here are the key steps for data cleaning in exploratory data analysis:
Handle missing values
This is one of the most common issues with real-world data. You can handle missing values in a few ways:
- Drop rows with missing values
- Fill in missing values with the mean, median, or mode of the column
- Fill in missing values with a constant value (like 0)
The approach you take depends on the nature of the data and your analysis goals.
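For example (a minimal sketch of all three options; the 'age' column is hypothetical):
df_complete = df.dropna()                          # drop rows with missing values
df['age'] = df['age'].fillna(df['age'].median())   # fill with the column median
df = df.fillna(0)                                  # fill remaining gaps with a constant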
Remove duplicates
Duplicate rows can skew your analysis, so it’s important to identify and remove them. You can use pandas .duplicated() and .drop_duplicates() methods.
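For example:
print(df.duplicated().sum())   # count rows that repeat an earlier row
df = df.drop_duplicates()      # keep only the first occurrence of each row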
Check for outliers
Outliers can significantly impact your results, so you’ll want to identify them and either remove them or handle them appropriately. You can use box plots, histograms, and z-scores to find outliers.
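A small sketch of the z-score approach, assuming a hypothetical numeric column 'amount':
z_scores = (df['amount'] - df['amount'].mean()) / df['amount'].std()
outliers = df[z_scores.abs() > 3]   # rows more than 3 standard deviations from the mean
print(len(outliers), 'potential outliers')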
Correct data types
Ensure all columns have the correct data type (string, integer, float, boolean, etc.). You may need to convert some columns using .astype().
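For example (column names are hypothetical):
df['price'] = df['price'].astype(float)       # numeric strings to floats
df['order_id'] = df['order_id'].astype(str)   # IDs are labels, not numbers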
Fix inconsistent data
Look for inconsistencies in the data, like spelling variations, different formats, etc., and standardize the data.
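For example, standardizing a hypothetical 'city' column:
df['city'] = df['city'].str.strip().str.lower()                           # trim whitespace, lowercase
df['city'] = df['city'].replace({'nyc': 'new york', 'n.y.': 'new york'})  # unify spelling variants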
Address class imbalance (if relevant)
If you have a target variable with class imbalance, you may need to resample the data to get meaningful results.
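One simple option is to upsample the minority class with plain pandas; a sketch assuming a hypothetical binary 'target' column:
minority = df[df['target'] == 1]
majority = df[df['target'] == 0]
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)  # resample with replacement
df_balanced = pd.concat([majority, minority_up])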
Feature engineering
Derive new features from your existing data that might be useful for your analysis. This is an important step for creating an effective model.
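For example, deriving features from hypothetical 'order_date', 'amount', and 'quantity' columns:
df['order_month'] = pd.to_datetime(df['order_date']).dt.month   # seasonal signal
df['price_per_item'] = df['amount'] / df['quantity']            # ratio feature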
Those are the main steps for cleaning your data as part of the exploratory data analysis process.
Data Preprocessing
Next, we perform some preprocessing:
- Renaming columns using .rename()
- Changing column order using .reindex()
- Dropping unnecessary columns using .drop()
- Encoding categorical variables
Here are the key steps for data preprocessing in exploratory data analysis:
Rename columns
Give your columns meaningful and consistent names to make the data easier to work with. Use .rename() in pandas.
Reorder columns
Put the columns in a logical order so related columns are next to each other. Use .reindex() in pandas.
Drop unnecessary columns
Remove any columns that are not relevant to your analysis. Use .drop() in pandas.
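Putting these three steps together (all column names here are hypothetical):
df = df.rename(columns={'cust_nm': 'customer_name'})          # meaningful, consistent names
df = df.drop(columns=['internal_id'])                         # remove an irrelevant column
df = df.reindex(columns=['customer_name', 'age', 'amount'])   # logical column order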
Encode categorical variables
If you have categorical variables like ‘gender’ or ‘color’, encode them as integers to prepare for modeling. You can use label encoding, one-hot encoding, or target encoding.
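For example, with pandas (the 'gender' and 'color' columns are hypothetical):
df['gender_code'] = df['gender'].astype('category').cat.codes   # label encoding
df = pd.get_dummies(df, columns=['color'])                      # one-hot encoding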
Normalize/standardize numeric variables
If you have numeric features on different scales, normalize or standardize them to the same scale. This helps prevent features with larger ranges from dominating.
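A minimal sketch with plain pandas (the 'amount' column is hypothetical):
col = df['amount']
df['amount_std'] = (col - col.mean()) / col.std()                # z-score standardization
df['amount_norm'] = (col - col.min()) / (col.max() - col.min())  # min-max normalization to [0, 1]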
Impute missing values
Fill in any remaining missing values after your initial data cleaning. You can use mean, median or mode imputation.
Transform skewed variables
If you have variables with skewed distributions, apply a transformation such as a log or square root to make them more normal. Many models work better on roughly normal data.
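For example, a log transform with NumPy (np.log1p handles zeros safely; the column is hypothetical):
import numpy as np
df['amount_log'] = np.log1p(df['amount'])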
Create new features
Derive additional features from your existing data that may be useful for your analysis. This is an important part of feature engineering.
Handle outliers
You may choose to cap outliers at a certain value, winsorize them, or remove them, depending on your analysis goals.
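For example, capping a hypothetical 'amount' column at its 1st and 99th percentiles with .clip(), a simple form of winsorizing:
lower, upper = df['amount'].quantile([0.01, 0.99])
df['amount'] = df['amount'].clip(lower, upper)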
Those are some of the main techniques for preprocessing your data as part of exploratory data analysis. The goal is to transform your raw data into a form that’s easier to analyze and model.
Data Visualization
Now we can visualize the data to gain insights:
- .hist() for histograms
- .plot() for line plots, scatter plots, etc.
- .value_counts() to see counts of categorical variables
- .describe() for summary statistics
Here are the main techniques for data visualization during exploratory data analysis:
Histograms
Use .hist() in pandas to get a visual representation of the distribution of a numeric variable. This can reveal outliers, skewness, and other patterns.
Box plots
Use .boxplot() in pandas to visualize the distribution through quartiles, extremes, and outliers for a numeric variable.
Scatter plots
Use .plot(kind='scatter') to visualize the relationship between two numeric variables. This can reveal correlations, clusters, and outliers.
Bar plots
Use .plot(kind='bar') to compare categorical variables or the counts of categorical variables. This gives a quick visual summary.
Line plots
Use .plot(kind='line') to visualize trends over time for time series data.
Pair plots
Use seaborn pairplot() to visualize the relationships between all variables in a dataset.
Correlation heatmaps
Use a seaborn heatmap() to visualize the correlation between all numeric variables.
Pie charts
Use .plot(kind='pie') to visualize the proportional breakdown of a categorical variable.
Word clouds
Generate a word cloud to visualize the most common words in a text column.
Descriptive statistics
Use .describe() to get summary stats like count, mean, standard deviation, minimum, and maximum for numeric columns.
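A compact sketch pulling a few of these together (column names are hypothetical; plotting imports shown for completeness):
import matplotlib.pyplot as plt
import seaborn as sns

df['amount'].hist(bins=30)                     # distribution of a numeric column
plt.show()
df.boxplot(column='amount')                    # quartiles and outliers
plt.show()
df.plot(kind='scatter', x='age', y='amount')   # relationship between two numerics
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation heatmap
plt.show()
print(df.describe())                           # summary statistics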
These techniques help you gain a quick understanding of your data and reveal patterns, outliers, and relationships that you can then investigate further. Data visualization is a critical part of the exploratory data analysis process.
To see how all of the above steps fit together (importing the data, data cleaning, data preprocessing, and data visualization), let's walk through one example:
Demonstration
Note
If you are looking to quickly set up and explore AI/ML & Python Jupyter Notebook Kit, Techlatest.net provides an out-of-the-box setup for AI/ML & Python Jupyter Notebook Kit on AWS, Azure and GCP. Please follow the below links for the step-by-step guide to set up the AI/ML & Python Jupyter Notebook Kit on your choice of cloud platform.
For AI/ML KIT: AWS, GCP & Azure.
Why choose the Techlatest.net VM, AI/ML Kit & Python Jupyter Notebook?
- In-browser editing of code
- Ability to run and execute code in various programming languages
- Supports rich media outputs like images, videos, charts, etc.
- Supports connecting to external data sources
- Supports collaborative editing by multiple users
- Simple interface to create and manage notebooks
- Ability to save and share notebooks
During VM selection, select a GPU instance by going to the GPU tab and choosing the desired GPU instance type. A GPU instance gives 10 to 15 times better performance than an equivalent CPU instance, but it also has a significantly higher cost, so choose the right instance type for your performance and budget requirements.
The setup guides cover all three clouds, AWS, GCP & Azure, for your reference.
After setting up the VM, we can log in to JupyterHub. Below is a step-by-step guide.
Step 1
This VM comes with Ubuntu as the default admin user. To access the web UI and install additional packages, log in as the ubuntu user with the password you set during your first login to the Jupyter Notebook.
Step 2
Open a terminal in your Jupyter Notebook and enter the command below to install any additional package you need using pip (replace <package-name> with the actual package name):
sudo -E pip install <package-name>
Note: Don’t forget to use sudo in the above command.
Step 3
Import some libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # visualizing data
%matplotlib inline
import seaborn as sns
Step 4
Import the data. If your data is in a CSV file, you can import it like this:
# import csv file
df = pd.read_csv('Diwali Sales Data.csv', encoding='unicode_escape')
Step 5
Once the data is imported, the next step is to clean the data.
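A sketch of typical cleaning steps for this dataset (the 'Status' and 'unnamed1' column names are assumptions about mostly-empty columns in this CSV; check df.info() for what is actually there):
df.info()                                               # column types and non-null counts
df.drop(['Status', 'unnamed1'], axis=1, inplace=True)   # drop mostly-empty columns (assumed names)
df.dropna(inplace=True)                                 # drop the remaining rows with missing values
df['Amount'] = df['Amount'].astype('int')               # Amount is read in as float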
Step 6
Next, we perform some preprocessing:
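For example, renaming a column and checking summary statistics (the 'Cust_name', 'Age', 'Orders', and 'Amount' column names are assumptions about this dataset):
df.rename(columns={'Cust_name': 'Customer_Name'}, inplace=True)   # clearer column name
print(df[['Age', 'Orders', 'Amount']].describe())                 # summary statistics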
Step 7
Now we can visualize the data to gain insights:
# plotting a bar chart for Gender and its count
ax = sns.countplot(x='Gender', data=df)
for bars in ax.containers:
    ax.bar_label(bars)
# plotting a bar chart for gender vs total amount
sales_gen = df.groupby(['Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(x='Gender', y='Amount', data=sales_gen)
From the above graphs, we can see that most of the buyers are female, and that the purchasing power of female buyers is greater than that of male buyers.
Age
ax = sns.countplot(data=df, x='Age Group', hue='Gender')
for bars in ax.containers:
    ax.bar_label(bars)
Conclusion
This post walked through the exploratory data analysis process using an example dataset, showing how to import the data, clean it, preprocess it, and visualize it. It closed with a brief recommendation to use Techlatest.net's VM, AI/ML Kit, and Python Jupyter Notebook for setting up an AI/ML development environment.
Overall, the aim was to provide a comprehensive overview of the exploratory data analysis process and to highlight the importance of each step in understanding and preparing the data for further analysis and modeling.