Exploratory Data Analysis with Python Jupyter Notebook: A tutorial on how to perform exploratory data analysis (EDA) in Jupyter Notebook, covering data cleaning, data preprocessing & data visualization techniques.
Introduction
Exploratory data analysis is an important first step when working with any new data. It allows us to get familiar with the data, identify any issues or anomalies, and gain insights that help guide the rest of the analysis process.
In this blog post, I will walk through performing exploratory data analysis in Python using Jupyter Notebook.
Importing the Data
The first step is to import the data into your Jupyter Notebook. We will use libraries such as NumPy, pandas, and seaborn, with pandas reading the data into a DataFrame.
For example, if your data is in a CSV file, you can import it like this:
import pandas as pd
df = pd.read_csv("data.csv")
Data Cleaning
Once the data is imported, the next step is to clean the data. This includes:
- Handling missing values: check for null values using .isnull() and fill them in using .fillna()
- Removing duplicates: use .drop_duplicates()
- Converting data types: use .astype()
Here are the key steps for data cleaning in exploratory data analysis:
Handle missing values
This is one of the most common issues with real-world data. You can handle missing values in a few ways:
- Drop rows with missing values
- Fill in missing values with the mean, median, or mode of the column
- Fill in missing values with a constant value (like 0)
The approach you take depends on the nature of the data and your analysis goals.
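For example (a minimal sketch of all three options; the 'age' column is hypothetical):
df_complete = df.dropna()                          # drop rows with missing values
df['age'] = df['age'].fillna(df['age'].median())   # fill with the column median
df = df.fillna(0)                                  # fill remaining gaps with a constant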
Remove duplicates
Duplicate rows can skew your analysis, so it’s important to identify and remove them. You can use pandas .duplicated() and .drop_duplicates() methods.
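For example:
print(df.duplicated().sum())   # count rows that repeat an earlier row
df = df.drop_duplicates()      # keep only the first occurrence of each row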
Check for outliers
Outliers can significantly impact your results, so you’ll want to identify them and either remove them or handle them appropriately. You can use box plots, histograms, and z-scores to find outliers.
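A small sketch of the z-score approach, assuming a hypothetical numeric column 'amount':
z_scores = (df['amount'] - df['amount'].mean()) / df['amount'].std()
outliers = df[z_scores.abs() > 3]   # rows more than 3 standard deviations from the mean
print(len(outliers), 'potential outliers')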
Correct data types
Ensure all columns have the correct data type (string, integer, float, boolean, etc.). You may need to convert some columns using .astype().
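For example (column names are hypothetical):
df['price'] = df['price'].astype(float)       # numeric strings to floats
df['order_id'] = df['order_id'].astype(str)   # IDs are labels, not numbers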
Fix inconsistent data
Look for inconsistencies in the data, like spelling variations, different formats, etc., and standardize the data.
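For example, standardizing a hypothetical 'city' column:
df['city'] = df['city'].str.strip().str.lower()                           # trim whitespace, lowercase
df['city'] = df['city'].replace({'nyc': 'new york', 'n.y.': 'new york'})  # unify spelling variants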
Address class imbalance (if relevant)
If you have a target variable with class imbalance, you may need to resample the data to get meaningful results.
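One simple option is to upsample the minority class with plain pandas; a sketch assuming a hypothetical binary 'target' column:
minority = df[df['target'] == 1]
majority = df[df['target'] == 0]
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)  # resample with replacement
df_balanced = pd.concat([majority, minority_up])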
Feature engineering
Derive new features from your existing data that might be useful for your analysis. This is an important step for creating an effective model.
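For example, deriving features from hypothetical 'order_date', 'amount', and 'quantity' columns:
df['order_month'] = pd.to_datetime(df['order_date']).dt.month   # seasonal signal
df['price_per_item'] = df['amount'] / df['quantity']            # ratio feature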
Those are the main steps for cleaning your data as part of the exploratory data analysis process.
Data Preprocessing
Next, we perform some preprocessing:
- Renaming columns using .rename()
- Changing column order using .reindex()
- Dropping unnecessary columns using .drop()
- Encoding categorical variables
Here are the key steps for data preprocessing in exploratory data analysis:
Rename columns
Give your columns meaningful and consistent names to make the data easier to work with. Use .rename() in pandas.
Reorder columns
Put the columns in a logical order so related columns are next to each other. Use .reindex() in pandas.
Drop unnecessary columns
Remove any columns that are not relevant to your analysis. Use .drop() in pandas.
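Putting these three steps together (all column names here are hypothetical):
df = df.rename(columns={'cust_nm': 'customer_name'})          # meaningful, consistent names
df = df.drop(columns=['internal_id'])                         # remove an irrelevant column
df = df.reindex(columns=['customer_name', 'age', 'amount'])   # logical column order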
Encode categorical variables
If you have categorical variables like ‘gender’ or ‘color’, encode them as integers to prepare for modeling. You can use label encoding, one-hot encoding, or target encoding.
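For example, with pandas (the 'gender' and 'color' columns are hypothetical):
df['gender_code'] = df['gender'].astype('category').cat.codes   # label encoding
df = pd.get_dummies(df, columns=['color'])                      # one-hot encoding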
Normalize/standardize numeric variables
If you have numeric features on different scales, normalize or standardize them to the same scale. This helps prevent features with larger ranges from dominating.
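A minimal sketch with plain pandas (the 'amount' column is hypothetical):
col = df['amount']
df['amount_std'] = (col - col.mean()) / col.std()                # z-score standardization
df['amount_norm'] = (col - col.min()) / (col.max() - col.min())  # min-max normalization to [0, 1]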
Impute missing values
Fill in any remaining missing values after your initial data cleaning. You can use mean, median or mode imputation.
Transform skewed variables
If you have variables with skewed distributions, apply a transformation such as a log or square root to make them more normal. Many models work better on roughly normal data.
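For example, a log transform with NumPy (np.log1p handles zeros safely; the column is hypothetical):
import numpy as np
df['amount_log'] = np.log1p(df['amount'])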
Create new features
Derive additional features from your existing data that may be useful for your analysis. This is an important part of feature engineering.
Handle outliers
You may choose to cap outliers at a certain value, winsorize them, or remove them, depending on your analysis goals.
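For example, capping a hypothetical 'amount' column at its 1st and 99th percentiles with .clip(), a simple form of winsorizing:
lower, upper = df['amount'].quantile([0.01, 0.99])
df['amount'] = df['amount'].clip(lower, upper)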
Those are some of the main techniques for preprocessing your data as part of exploratory data analysis. The goal is to transform your raw data into a form that’s easier to analyze and model.
Data Visualization
Now we can visualize the data to gain insights:
- .hist() for histograms
- .plot() for line plots, scatter plots, etc.
- .value_counts() to see counts of categorical variables
- .describe() for summary statistics
Here are the main techniques for data visualization during exploratory data analysis:
Histograms
Use .hist() in pandas to get a visual representation of the distribution of a numeric variable. This can reveal outliers, skewness, and other patterns.
Box plots
Use .boxplot() in pandas to visualize the distribution through quartiles, extremes, and outliers for a numeric variable.
Scatter plots
Use .plot(kind='scatter') to visualize the relationship between two numeric variables. This can reveal correlations, clusters, and outliers.
Bar plots
Use .plot(kind='bar') to compare categorical variables or the counts of categorical variables. This gives a quick visual summary.
Line plots
Use .plot(kind='line') to visualize trends over time for time series data.
Pair plots
Use seaborn pairplot() to visualize the relationships between all variables in a dataset.
Correlation heatmaps
Use a seaborn heatmap() to visualize the correlation between all numeric variables.
Pie charts
Use .plot(kind='pie') to visualize the proportional breakdown of a categorical variable.
Word clouds
Generate a word cloud to visualize the most common words in a text column.
Descriptive statistics
Use .describe() to get summary stats like count, mean, standard deviation, minimum, and maximum for numeric columns.
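A compact sketch pulling a few of these together (column names are hypothetical; plotting imports shown for completeness):
import matplotlib.pyplot as plt
import seaborn as sns

df['amount'].hist(bins=30)                     # distribution of a numeric column
plt.show()
df.boxplot(column='amount')                    # quartiles and outliers
plt.show()
df.plot(kind='scatter', x='age', y='amount')   # relationship between two numerics
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation heatmap
plt.show()
print(df.describe())                           # summary statistics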
These techniques help you gain a quick understanding of your data and reveal patterns, outliers, and relationships that you can then investigate further. Data visualization is a critical part of the exploratory data analysis process.
To see how all of the above steps fit together (importing the data, data cleaning, data preprocessing, and data visualization), let's walk through one example:
Demonstration
Note
If you are looking to quickly set up and explore AI/ML & Python Jupyter Notebook Kit, Techlatest.net provides an out-of-the-box setup for AI/ML & Python Jupyter Notebook Kit on AWS, Azure and GCP. Please follow the below links for the step-by-step guide to set up the AI/ML & Python Jupyter Notebook Kit on your choice of cloud platform.
For AI/ML KIT: AWS, GCP & Azure.
Why choose the Techlatest.net VM, AI/ML Kit & Python Jupyter Notebook?
- In-browser editing of code
- Ability to run and execute code in various programming languages
- Supports rich media outputs like images, videos, charts, etc.
- Supports connecting to external data sources
- Supports collaborative editing by multiple users
- Simple interface to create and manage notebooks
- Ability to save and share notebooks
During VM selection, select a GPU instance by going to the GPU tab and choosing the desired GPU instance type. A GPU instance gives 10 to 15 times better performance than an equivalent CPU instance, but it also has a significantly higher cost, so choose the right instance type for your performance and budget requirements.
The setup guides cover all three clouds, AWS, GCP & Azure, for your reference.
After setting up the VM, we can log in to JupyterHub. Below is a step-by-step guide.
Step 1
This VM comes with Ubuntu as the default admin user. To access the web UI and install additional packages, log in as the ubuntu user with the password you set during your first login to the Jupyter Notebook.
Step 2
Open a terminal in your Jupyter Notebook and enter the command below to install any additional package you need using pip (replace <package-name> with the actual package name):
sudo -E pip install <package-name>
Note: Don’t forget to use sudo in the above command.
Step 3
Import some libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # visualizing data
%matplotlib inline
import seaborn as sns
Step 4
Import the data. If your data is in a CSV file, you can import it like this:
# import csv file
df = pd.read_csv('Diwali Sales Data.csv', encoding='unicode_escape')
Step 5
Once the data is imported, the next step is to clean the data.
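A sketch of typical cleaning steps for this dataset (the 'Status' and 'unnamed1' column names are assumptions about mostly-empty columns in this CSV; check df.info() for what is actually there):
df.info()                                               # column types and non-null counts
df.drop(['Status', 'unnamed1'], axis=1, inplace=True)   # drop mostly-empty columns (assumed names)
df.dropna(inplace=True)                                 # drop the remaining rows with missing values
df['Amount'] = df['Amount'].astype('int')               # Amount is read in as float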
Step 6
Next, we perform some preprocessing:
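For example, renaming a column and checking summary statistics (the 'Cust_name', 'Age', 'Orders', and 'Amount' column names are assumptions about this dataset):
df.rename(columns={'Cust_name': 'Customer_Name'}, inplace=True)   # clearer column name
print(df[['Age', 'Orders', 'Amount']].describe())                 # summary statistics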
Step 7
Now we can visualize the data to gain insights:
# plotting a bar chart for Gender and its count
ax = sns.countplot(x='Gender', data=df)
for bars in ax.containers:
    ax.bar_label(bars)
# plotting a bar chart for gender vs total amount
sales_gen = df.groupby(['Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(x='Gender', y='Amount', data=sales_gen)
From the above graphs, we can see that most of the buyers are female, and that the purchasing power of female buyers is greater than that of male buyers.
Age
ax = sns.countplot(data=df, x='Age Group', hue='Gender')
for bars in ax.containers:
    ax.bar_label(bars)
Conclusion
This post walked through the exploratory data analysis process using an example dataset, showing how to import the data, clean it, preprocess it, and visualize it. It closed with a brief recommendation to use Techlatest.net's VM, AI/ML Kit, and Python Jupyter Notebook for setting up an AI/ML development environment.
Overall, the aim was to provide a comprehensive overview of the exploratory data analysis process and to highlight the importance of each step in understanding and preparing the data for further analysis and modeling.