EDA: Exploratory Data Analysis

Exploring, Questioning, Cleaning and Documenting

Nicole Scott
The Startup
Aug 27, 2019


Photo by Adli Wahid on Unsplash

EDA: It’s what we do as Data Scientists.

So much has already been written about Exploratory Data Analysis (EDA), but I am compelled to weigh in further because it is such an integral part of the overall Data Science process. And because there are consequences for neglecting this key step!

In The Ultimate Guide to Data Cleaning, Omar Elgabry puts forth a cautionary tale:

“. . . You ingested a bunch of dirty data, didn’t clean it up, and you told your company to do something with these results that turn out to be wrong. You’re going to be in a lot of trouble! Incorrect or inconsistent data leads to false conclusions. And so, how well you clean and understand the data has a high impact on the quality of the results.”

As outlined in the image below, two-thirds of the Data Science Hierarchy of Needs consists of EDA-related tasks!

Image per Monica Rogati

EDA Defined

The National Institute of Standards and Technology (NIST) describes EDA as an approach/philosophy for data analysis that employs a variety of techniques to:

  1. maximize insight into a data set;
  2. uncover underlying structure;
  3. extract important variables;
  4. detect outliers and anomalies;
  5. test underlying assumptions;
  6. develop parsimonious models; and
  7. determine optimal factor settings.

EDA Process

While no two EDA efforts are exactly alike, there are a few basic (and iterative) steps to follow when becoming familiar with a new dataset.

  1. Explore / Question
  2. Clean / Verify
  3. Document / Save

1. Explore / Question

Don’t explore alone! Like any worthwhile adventure, this one is better with the proper guides and companions. As you seek to understand your dataset and its original intent, a data dictionary, codebook, or entity-relationship (ER) diagram can be a valuable reference. If you’re not sure where to find these resources, ask a Data Engineer or Database Administrator (DBA) at your organization. If your data was scraped from a website, refer to any documentation provided with the website’s API. Additionally, you can use Chrome’s built-in DevTools to learn more about individual data elements:

  • right-click an element on the page and select Inspect to jump into the Elements panel. Or press Command+Option+C (Mac) or Control+Shift+C (Windows, Linux, Chrome OS).

In this step, you are looking for missing data (a.k.a. “null” or “NaN” values) and other anomalies such as duplicates, incorrect data types, or otherwise suspicious-looking observations. It’s also a good idea to examine the distribution of data values for outliers and skewness.

The code below (provided by Adi Bronshtein) includes a few of my favorite go-to commands for initial exploration of any dataset.

# pandas function that creates a report from several common EDA commands
def eda(dataframe):
    print("missing values: {}".format(dataframe.isnull().sum()))
    print("dataframe index: {}".format(dataframe.index))
    print("dataframe types: {}".format(dataframe.dtypes))
    print("dataframe shape: {}".format(dataframe.shape))
    print("dataframe describe: {}".format(dataframe.describe()))
    # print each column name and its number of unique values
    for item in dataframe:
        print(item)
        print(dataframe[item].nunique())

After creating the function above, you can pass your dataframe in to generate the EDA report, for example eda(df_titanic).
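The eda() report above doesn’t cover the distribution checks mentioned earlier. Here is a minimal sketch for those, assuming a pandas DataFrame named df with some numeric columns (df and its contents are placeholders):

# df is assumed to be a pandas DataFrame you have already loaded

# Summary statistics surface suspicious minimums and maximums at a glance
print(df.describe())

# Skewness per numeric column; values far from 0 suggest a skewed distribution
print(df.skew(numeric_only=True))

# Histograms give a quick visual check for outliers and skew (requires matplotlib)
df.hist(figsize=(10, 8));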

Another excellent resource is provided by Lukas Frei in Speed Up Your Exploratory Data Analysis With Pandas-Profiling which describes in detail the pandas_profiling package.
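As a minimal sketch of that package in action (assuming pandas-profiling is installed, with df again a placeholder DataFrame):

import pandas_profiling

# Build an interactive HTML report covering types, missing values,
# distributions, and correlations in a single command
profile = pandas_profiling.ProfileReport(df)
profile.to_file("eda_report.html")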

Additional outputs from this first phase may include some basic exploratory visualizations. For example, the code below will generate a bar chart showing how many missing values are in each column of the train dataframe.

# imports:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Plot the number of missing values in each column of the train dataframe.
train.isnull().sum().plot(kind='bar')
# Add a title.
plt.title('Number of Missing Values Per Column')
# Rotate the tick mark labels on the X axis.
plt.xticks(rotation=45)
# Create X axis label.
plt.xlabel("Columns")
# Create Y axis label.
plt.ylabel("NaN Values");

2. Clean / Verify

Once you have a good grasp of the problems within your data, you can begin to deal with them. In general, you’ll need to fix, impute, or remove problematic data. Key pandas methods for handling the common issues mentioned above include dropna, fillna, drop_duplicates, and astype; a few are sketched below.
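Here is a minimal sketch of those methods in action (the column names and fill strategy are hypothetical, chosen only for illustration):

# df is assumed to be your raw DataFrame

# Drop rows in which every value is missing
df = df.dropna(how='all')

# Impute missing numeric values, e.g. with the column median (hypothetical column)
df['age'] = df['age'].fillna(df['age'].median())

# Remove exact duplicate rows
df = df.drop_duplicates()

# Fix an incorrect data type, e.g. a numeric column read in as strings
df['price'] = df['price'].astype(float)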

A note on intended EDA output

Not every problem is the same, so not every EDA exercise will be the same either. Beyond the basic quality and cleanliness of your dataset, as Omar Elgabry states, understanding what you are trying to accomplish (your ultimate goal) is critical prior to taking any action. A clear understanding of the purpose and intended use case for your final dataset will inform any additional EDA tasks you may wish to perform.

For example, if your goal is to run your data through a machine learning algorithm to solve a binary classification problem (i.e. predict an outcome of yes/no, likely/not likely, etc.), you’ll need to perform some type of “preprocessing” on your target (outcome) variable to assign binary values to your positive (1) and negative (0) classes. If your data includes “categorical” data types, you’ll need to convert them to numerical values prior to machine learning. A common pandas method to accomplish this is pd.get_dummies, sketched below.
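Here is a minimal sketch of both preprocessing steps (the survived and embarked columns are hypothetical, loosely in the spirit of the Titanic example above):

import pandas as pd

# Map a yes/no target to binary classes: positive (1) and negative (0)
df['survived'] = df['survived'].map({'yes': 1, 'no': 0})

# One-hot encode a categorical feature into numerical indicator columns
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)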

3. Document / Save

Finally, you’ll want to save your clean dataset in a safe place. For example, the code below exports an entire dataframe to a .csv file.

# export dataframe to .csv
df.to_csv('export_2018_pricelist.csv', index=False)
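Passing index=False prevents pandas from writing the row index as an extra column, which keeps things tidy if the file is later re-imported with pd.read_csv.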

Just as important as saving your clean dataset is documenting the cleaning steps you took, along with justifications for the decisions you made. Outputs from this phase may include an export file of your clean dataset and a Jupyter Notebook with a README.md file (or other summary report) to support your analysis.

Conclusion

I hope this brief overview has served as a reminder of the importance of Exploratory Data Analysis within the overall scope of Data Science projects. While by no means an exhaustive list, this post has covered the basic steps and iterative nature of the EDA workflow, as well as links to additional resources and documentation.

Thanks for reading and…

Photo by Justin Luebke on Unsplash


Nicole Scott

Data Scientist | Photographer | Capturing what I see | Communicating what I discover