Step by Step — Run Exploratory Data Analysis
How can you draw relevant conclusions without knowing anything about the underlying data?
Let’s say you dream about building your own fishing cabin along the lake. In your garage, you find a bunch of tools, a pile of wooden planks, various nails, and screws. Would you rather:
a) Start immediately to build the walls, without checking if you have enough nails, a proper hammer or even if you have enough planks. Who cares? You will figure this out eventually in the middle of the process.
b) Or, take a moment to check your set of tools, sort the planks by size and state, perhaps dropping the rusty nails and going to the shop to buy new ones. And then, start to build the walls.
If you can manage your impatience and want to work efficiently, you will most likely choose option b). Even if this option delays the moment you start building your cabin, it ensures you have everything you need to fulfill your purpose and build strong foundations.
This is precisely the same with Exploratory Data Analysis. Before applying inferential statistics and drawing conclusions, it is always good to check the structure, format, and distribution of the variables, as well as the relationships between them. Summarizing the main characteristics and exploring the data, often through visuals, will support the modeling by ensuring that you have a reliable dataset or by leading you to collect more or better data.
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats
import seaborn as sns
%matplotlib inline

# Read the data
df = pd.read_csv(Path.cwd() / 'notes.csv')
The dataset is labelled and contains the following information:
- The length of the banknote (in mm);
- The height of the banknote (measured on the left and right sides, in mm);
- The upper and lower margins between the edges of the banknote and the image (in mm);
- The diagonal (in mm);
- Label column (type boolean) defining fake or genuine banknotes.
Start by running a univariate analysis to get a better understanding of each variable: observing the distribution, finding patterns, and detecting outliers are some common steps for this analysis. The main purpose here is to describe the data. Let’s start.
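A minimal sketch of these first checks could look like the block below. It runs on a small synthetic stand-in for notes.csv: only the column name `is_genuine` comes from the code in this article, while `diagonal`, `margin_low`, `length` and all the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for notes.csv (column names other than is_genuine
# and all values are assumptions made for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "is_genuine": [True] * 6 + [False] * 4,
    "diagonal": rng.normal(171.9, 0.3, 10),
    "margin_low": rng.normal(4.6, 0.7, 10),
    "length": rng.normal(112.6, 0.9, 10),
})

df.info()                    # column types and non-null counts
summary = df.describe()      # count, mean, std, min, quartiles, max (numeric columns)
missing = df.isna().sum()    # number of missing values per column

# Negative measurements would point to data errors
negative = (df.select_dtypes("number") < 0).sum()

print(summary)
print(missing)
print(negative)
```

On the real dataset, the same calls would be applied directly to the `df` loaded from the CSV file.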
Nothing looks odd: there are no missing values and no atypical or negative values (which would be errors). The data format looks good.
# How many individuals do we have for each category?
The dataset is slightly imbalanced: we have 30 more individuals in the “genuine” set than in the “fake” set.
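One way to get these counts, sketched on made-up labels (the real ones come from the `is_genuine` column of notes.csv), is pandas' `value_counts`:

```python
import pandas as pd

# Made-up labels for illustration only
df = pd.DataFrame({"is_genuine": [True] * 6 + [False] * 4})

counts = df["is_genuine"].value_counts()                # rows per category
shares = df["is_genuine"].value_counts(normalize=True)  # proportion per category
gap = counts.max() - counts.min()                       # size of the imbalance

print(counts)
print(shares)
print(gap)
```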
# Remove the boolean column
tmp = df.iloc[:, 1:]

# Check the distribution of each column, split by label
tmp1 = df[df['is_genuine'] == True]
tmp2 = df[df['is_genuine'] == False]
for i in tmp.columns:
    plt.hist(tmp1[i], bins=50, alpha=0.5, label="genuine")
    plt.hist(tmp2[i], bins=50, alpha=0.5, label="fake")
    plt.title(i)
    plt.legend()
    plt.show()
Examples of output:
The distributions of each variable look slightly different depending on whether or not the banknote is genuine, except for the variable diagonal. For this variable, the distributions of the 2 categories overlap, which might indicate that this variable is not relevant for separating the 2 categories.
➡️ All in all, each variable seems to be roughly normally distributed, with a right or left skew. Let’s use boxplots to see whether or not the extreme values are suspected outliers.
# Create boxplots to visualize the potential outliers
fig, ax_new = plt.subplots(3,2, sharey=False,figsize=(20,17))
We can detect some outliers (extreme values) in both categories. It might be interesting to investigate these data points further and decide whether or not to keep them in the dataset.
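As a rough sketch of how such boxplots could be drawn, the block below uses seaborn's `boxplot` on synthetic data and then counts the points beyond the whiskers with `boxplot_stats`. The column names (apart from `is_genuine`), the values and the grid layout are assumptions, not the article's actual plotting code.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats
import seaborn as sns

# Synthetic stand-in for notes.csv (made-up values and column names)
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "is_genuine": np.repeat([True, False], n // 2),
    "diagonal": rng.normal(171.9, 0.3, n),
    "margin_low": rng.normal(4.6, 0.7, n),
    "length": rng.normal(112.6, 0.9, n),
})
features = [c for c in df.columns if c != "is_genuine"]

# One boxplot per feature, split by label
fig, axes = plt.subplots(1, len(features), sharey=False, figsize=(15, 5))
for ax, col in zip(axes, features):
    sns.boxplot(data=df, x="is_genuine", y=col, ax=ax)
    ax.set_title(col)
fig.tight_layout()

# Count the points beyond the whiskers (1.5 * IQR by default)
fliers = {
    (col, label): len(boxplot_stats(group[col].to_numpy())[0]["fliers"])
    for col in features
    for label, group in df.groupby("is_genuine")
}
print(fliers)
```

The `fliers` dictionary gives, for each feature and label, the number of points that the boxplot would draw as individual dots.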
Machine learning modeling can be improved by understanding, handling, or removing the outlier values. One method for removing outliers when the data follow a Gaussian-like distribution consists of:
- Using a threshold equal to three times the standard deviation.
- A value that falls outside of this range is still part of the distribution, but it is an unlikely or rare event.
- For smaller samples of data, you might prefer a smaller threshold, for instance two standard deviations, and for larger datasets a larger one, such as four standard deviations.
This is a standard cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. Another similar practice for removing outliers, if the data are not normally distributed, consists of:
- Using the interquartile range (IQR) and a factor to specify the upper and lower limits beyond which values are identified as outliers.
- We often apply a factor of 1.5: values more than 1.5 times the IQR below the first quartile or above the third quartile are flagged.
- To target the extreme outliers, you can use a larger factor (e.g., three or more).
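The standard-deviation rule described above could be sketched as follows. The function name `get_outliers_std`, its `n_std` parameter and the sample data are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def get_outliers_std(df, n_std=3.0):
    """Return the rows with at least one value further than
    n_std standard deviations from its column mean (hypothetical helper)."""
    mean = df.mean()
    std = df.std()
    lower = mean - n_std * std
    upper = mean + n_std * std
    return df[((df < lower) | (df > upper)).any(axis=1)]

# Made-up measurements with one planted extreme value
rng = np.random.default_rng(42)
data = pd.DataFrame({"length": rng.normal(112.6, 0.9, 200)})
data.loc[0, "length"] = 120.0

outliers = get_outliers_std(data, n_std=3.0)
print(outliers)
```

Changing `n_std` to 2 or 4 gives the stricter or looser variants mentioned for smaller and larger samples.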
Further reading: How to Remove Outliers for Machine Learning (Machine Learning Mastery)
# Create a function to identify the outliers for each feature and for each category
def get_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df_out = df[((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
    return df_out

# Apply the function at label level
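One possible way to apply the function per label is to group by `is_genuine` and pass only the measurement columns. The sketch below runs on synthetic data with two planted outliers; the column names other than `is_genuine` and all the values are assumptions.

```python
import numpy as np
import pandas as pd

def get_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return df[((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

# Synthetic stand-in for notes.csv (made-up values)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "is_genuine": np.repeat([True, False], n // 2),
    "margin_low": rng.normal(4.6, 0.5, n),
    "length": rng.normal(112.6, 0.9, n),
})
df.loc[0, "length"] = 120.0       # planted outlier among the genuine notes
df.loc[150, "margin_low"] = 9.0   # planted outlier among the fake notes

# Run the IQR rule separately within each label, on the measurements only
features = df.columns.drop("is_genuine")
outliers = {
    label: get_outliers(group[features])
    for label, group in df.groupby("is_genuine")
}
for label, rows in outliers.items():
    print(label, len(rows))
```

Computing the quartiles within each group matters here: a value that is typical for fake banknotes could otherwise be flagged only because it differs from the genuine ones.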
We count 14 outliers in total: 9 for the fake banknotes and 6 for the genuine banknotes. The questions now are:
- Can they affect the accuracy of the detection algorithm?
- Are they due to a measurement error or data entry error?
- Are they (or not) a normal part of the population?
The truth is, having outliers in the group of fake banknotes is something we could expect. By definition, these banknotes are not genuine, so they do not follow any standard in terms of dimensions. So, to the question “Are they a normal part of the population?”, we could answer: most likely.
On the contrary, having outliers among the genuine banknotes is not normal. Banknotes are supposed to follow rigorous standards in terms of dimensions, so we shouldn’t consider these values a normal part of the population. They could be due to measurement errors or data entry errors.
➡️ We suggest keeping the outliers in the fake set and removing them from the genuine set.
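A sketch of that clean-up step, under the same IQR rule and on made-up data, could look like this. The name `df_clean` and the single `length` column are assumptions chosen to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for notes.csv (made-up values)
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "is_genuine": np.repeat([True, False], n // 2),
    "length": rng.normal(112.6, 0.9, n),
})
df.loc[0, "length"] = 120.0     # planted outlier, genuine note
df.loc[150, "length"] = 120.0   # planted outlier, fake note

# Flag outliers among the genuine notes only (1.5 * IQR rule)
genuine = df.loc[df["is_genuine"], ["length"]]
q1, q3 = genuine.quantile(0.25), genuine.quantile(0.75)
iqr = q3 - q1
is_outlier = ((genuine < q1 - 1.5 * iqr) | (genuine > q3 + 1.5 * iqr)).any(axis=1)

# Drop the flagged genuine rows; every fake row is kept as-is
df_clean = df.drop(index=genuine.index[is_outlier.to_numpy()])
print(len(df), "->", len(df_clean))
```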
In the previous part, we visualized the structure and distribution within a single column. In this part, we’ll expand the exploration by visualizing the relationships between two variables at a time to understand:
How do variables interact with one another?
Does an increase in one variable correlate with an increase/decrease in another?
# Visualize the correlation & distribution of the variables
We can also visualize the correlations on a heatmap, as follows:
# Heatmap of correlation matrix
sns.heatmap(df.corr(),annot = True)
The correlation coefficient between the variable diagonal and the label column “is_genuine” is very low. This suggests that the variable carries little linear information for predicting whether a banknote is genuine or not, so we can drop it before moving on to the modeling. The other variables, and in particular the low margin, are correlated with the label column.
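To read these values without the heatmap, we could rank the features by their absolute correlation with the label and then drop the weakest one. The sketch below uses made-up data in which only `margin_low` depends on the label; the column names other than `is_genuine` and the name `X` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for notes.csv (made-up values)
rng = np.random.default_rng(4)
n = 200
is_genuine = np.repeat([True, False], n // 2)
df = pd.DataFrame({
    "is_genuine": is_genuine,
    "diagonal": rng.normal(171.9, 0.3, n),
    "margin_low": np.where(is_genuine, 4.1, 5.2) + rng.normal(0, 0.4, n),
})

# Cast the boolean label to float so it enters the correlation matrix
corr = df.astype(float).corr()

# Absolute correlation of each feature with the label, weakest first
corr_with_label = corr["is_genuine"].drop("is_genuine").abs().sort_values()
print(corr_with_label)

# Keep every feature except the one judged irrelevant
X = df.drop(columns=["is_genuine", "diagonal"])
```

Keep in mind that this coefficient only captures linear relationships, so a low value is a hint rather than a proof that a variable is useless.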
You have now performed an Exploratory Data Analysis (EDA) with a strong focus on:
- Univariate Analysis
- Bivariate Analysis
- and Visualizations of the outliers.