Exploratory Data Analysis — What is it and why is it so important? (Part 1/2)

This is week 3 of my “52 Weeks of Data Science” series.

Terence Shin
Dec 13, 2019 · 6 min read

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA), also known as Data Exploration, is a step in the Data Analysis Process, where a number of techniques are used to better understand the dataset being used.

‘Understanding the dataset’ can refer to a number of things including but not limited to…

  • Extracting important variables and leaving behind useless variables

Here’s why this is important.

Have you heard of the phrase, “garbage in, garbage out”?

With EDA, it’s more like, “garbage in, perform EDA, possibly garbage out.”

By conducting EDA, you can turn an almost usable dataset into a completely usable one. I’m not saying that EDA can magically make any dataset clean; that is not true. However, many EDA techniques can remedy common problems that show up in most real-world datasets.

Exploratory Data Analysis does two main things:

1. It helps clean up a dataset.

2. It gives you a better understanding of the variables and the relationships between them.

Components of EDA

To me, there are three main components of exploring data:

  1. Understanding your variables

  2. Cleaning your dataset

  3. Analyzing relationships between variables

In this article, we’ll take a look at the first two components.

1. Understanding Your Variables

You don’t know what you don’t know. And if you don’t know what you don’t know, then how are you supposed to know whether your insights make sense or not? You won’t.

To give an example, I was exploring data provided by the NFL (data here) to see if I could discover any insights regarding variables that increase the likelihood of injury. One insight that I got was that Linebackers accumulated more than eight times as many injuries as Tight Ends. However, I had no idea what the difference between a Linebacker and a Tight End was, and because of this, I didn’t know if my insights made sense or not. Sure, I can Google what the differences between the two are, but I won’t always be able to rely on Google! Now you can see why understanding your data is so important. Let’s see how we can do this in practice.

As an example, I used the same dataset that I used to create my first Random Forest model, the Used Car Dataset here. First, I imported all of the libraries that I knew I’d need for my analysis and conducted some preliminary analyses.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding my variables (df is the Used Car Dataset loaded as a pandas DataFrame)
df.shape
df.head()
df.columns

.shape returns the number of rows by the number of columns for my dataset. My output was (525839, 22), meaning the dataset has 525839 rows and 22 columns.

.head() returns the first 5 rows of my dataset. This is useful if you want to see some example values for each variable.

.columns returns the names of all of the columns in the dataset.

df.columns output
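To make these three attributes concrete, here is a minimal sketch on a tiny made-up DataFrame. The `toy` frame and its values are hypothetical stand-ins, not the Used Car Dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "price": [4500, 12000, 0, 8900, 15500, 7300],
    "year": [2008, 2015, 1999, 2012, 2017, 2010],
    "condition": ["good", "excellent", "fair", "like new", "good", None],
})

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
first_rows = toy.head()           # first 5 rows by default
column_names = list(toy.columns)  # column labels as a plain list

print(n_rows, n_cols)
print(first_rows)
print(column_names)
```

Passing a number to .head(), for example toy.head(3), changes how many rows are returned.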

Once I knew all of the variables in the dataset, I wanted to get a better understanding of the different values for each variable.

df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

.nunique(axis=0) returns the number of unique values for each variable.

.describe() summarizes the count, mean, standard deviation, min, quartiles, and max for numeric variables. The chained .apply() call simply formats each value in fixed-point notation, which suppresses scientific notation (see here).

df.nunique(axis=0) output
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f'))) output
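As a small illustration of what the formatting step does, the sketch below runs the same two calls on a hypothetical toy frame (the values are invented, apart from reusing the extreme maximum price mentioned in this article). Note that after formatting, every cell of the summary is a string rather than a number.

```python
import pandas as pd

# Hypothetical toy data with one extreme value in each column
toy = pd.DataFrame({
    "price": [0.0, 4500.0, 12000.0, 3048344231.0],
    "odometer": [120000.0, 85000.0, 43000.0, 10000000.0],
})

# Number of distinct values in each column
unique_counts = toy.nunique(axis=0)

# Summary statistics, then every cell rendered as a fixed-point string
summary = toy.describe()
formatted = summary.apply(lambda s: s.apply(lambda x: format(x, "f")))

print(unique_counts)
print(formatted)
```

Because the formatted frame holds strings, it is best used for display only; keep the unformatted `summary` around for any further calculations.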

Immediately, I noticed an issue with price, year, and odometer. For example, the minimum and maximum price are $0.00 and $3,048,344,231.00 respectively. You’ll see how I dealt with this in the next section. I still wanted to get a better understanding of my discrete variables.


Using .unique(), I took a look at my discrete variables, including ‘condition’.
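Here is a minimal sketch of that step on a hypothetical toy column (the values are made up for illustration). It also shows .value_counts(), which is not used in this article but is a handy companion because it reports how often each value occurs.

```python
import pandas as pd

# Hypothetical toy column standing in for the real 'condition' variable
toy = pd.DataFrame({
    "condition": ["good", "excellent", "fair", "like new", "good", None, "salvage"],
})

# unique() lists distinct values in order of first appearance, missing values included
distinct = toy["condition"].unique()

# value_counts() counts occurrences of each value (missing values excluded by default)
counts = toy["condition"].value_counts()

print(distinct)
print(counts)
```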


You can see that several of the values are near-synonyms, like ‘excellent’ and ‘like new’. While this isn’t the greatest example, there will be some instances where it’s ideal to group different labels together. For example, if you were analyzing weather patterns, you may want to reclassify ‘cloudy’, ‘grey’, ‘cloudy with a chance of rain’, and ‘mostly cloudy’ simply as ‘cloudy’.

Later you’ll see that I end up omitting this column due to having too many null values, but if you wanted to re-classify the condition values, you could use the code below:

# Reclassify condition column
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']

    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition

# Clean dataframe
def clean_df(df):
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(lambda row: clean_condition(row), axis=1)
    return df_cleaned

# Get df with reclassified 'condition' column
df_cleaned = clean_df(df)

And you can see that the values have been re-classified below.

print(df_cleaned.condition.unique()) output
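The same regrouping can also be sketched with a dictionary and Series.replace(), which avoids calling a Python function once per row. This is an alternative to the row-wise approach above, not the code used in this article; the `toy` frame and `condition_map` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy column standing in for the real 'condition' variable
toy = pd.DataFrame({
    "condition": ["good", "fair", "excellent", "like new", "salvage", None],
})

# Assumed mapping that mirrors the grouping above: fair -> good, like new -> excellent
condition_map = {"fair": "good", "like new": "excellent"}

toy_cleaned = toy.copy()
# replace() swaps only the listed values; everything else, including missing values, is untouched
toy_cleaned["condition"] = toy_cleaned["condition"].replace(condition_map)

print(toy_cleaned["condition"].unique())
```

Keeping the mapping in one dictionary also makes it easy to review or extend the groupings later.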

2. Cleaning Your Dataset

You now know how to reclassify discrete data if needed, but there are a number of things that still need to be looked at.

a. Removing Redundant Variables

First I got rid of variables that I thought were redundant. This includes url, image_url, and city_url.

df_cleaned = df_cleaned.copy().drop(['url','image_url','city_url'], axis=1)
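If you are not sure every column in the list actually exists, .drop() accepts errors="ignore" so that missing labels are skipped instead of raising an error. A minimal sketch on a hypothetical toy frame, where city_url is deliberately absent:

```python
import pandas as pd

# Hypothetical toy data; note that there is no 'city_url' column here
toy = pd.DataFrame({
    "url": ["a", "b"],
    "image_url": ["c", "d"],
    "price": [4500, 12000],
})

redundant = ["url", "image_url", "city_url"]

# columns= is equivalent to axis=1; errors="ignore" skips labels that are not present
toy_cleaned = toy.drop(columns=redundant, errors="ignore")

print(toy_cleaned.columns.tolist())
```

Since .drop() returns a new DataFrame by default, the original `toy` frame keeps all of its columns.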

b. Variable Selection

Next, I wanted to get rid of any columns that had too many null values. Thanks to my friend, Richie, I used the following code to remove any columns in which 40% or more of the values were null. Depending on the situation, I may want to raise or lower the threshold. The remaining columns are shown below.

# Count null values per column
NA_val = df_cleaned.isna().sum()

def na_filter(na, threshold=.4):  # only select variables that pass the threshold
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass

df_cleaned = df_cleaned[na_filter(NA_val)]
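The same filter can be sketched without an explicit loop by taking the mean of .isna(), which gives the fraction of null values per column. This is an equivalent formulation under the same 40% threshold, shown on a hypothetical toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values per column
toy = pd.DataFrame({
    "price": [4500, 12000, 8900, 7300, 15500],        # 0% null
    "condition": ["good", None, None, None, "fair"],  # 60% null
    "year": [2008, np.nan, 2012, 2010, 2017],         # 20% null
})

threshold = 0.4

# isna() gives True/False per cell; the column mean is the fraction of nulls
null_fraction = toy.isna().mean()

# Keep only the columns whose null fraction is below the threshold
kept_columns = null_fraction[null_fraction < threshold].index.tolist()
toy_cleaned = toy[kept_columns]

print(null_fraction)
print(kept_columns)
```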

c. Removing Outliers

Revisiting the issue raised earlier, I set bounds for price, year, and odometer and removed any values outside of them. In this case, I used my intuition to choose the bounds. I’m sure there are methods to determine the optimal boundaries, but I haven’t looked into them yet!

df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]
df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

After filtering, the minimum and maximum values in the .describe() output fall within the chosen bounds.
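For readers who want a data-driven starting point instead of hand-picked bounds, one common rule of thumb is the interquartile range (IQR) rule: treat anything more than 1.5 × IQR below the first quartile or above the third quartile as an outlier. This is not the method used in this article; the `iqr_bounds` helper, the multiplier `k`, and the toy prices below are all assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy prices with one extreme value
toy = pd.DataFrame({
    "price": [4500, 5200, 6100, 7300, 8900, 9500, 12000, 3048344231],
})

lower, upper = iqr_bounds(toy["price"])

# Keep only the rows whose price lies within the fences (inclusive)
toy_cleaned = toy[toy["price"].between(lower, upper)]

print(lower, upper)
print(toy_cleaned)
```

The rule is only a heuristic. On heavily skewed variables such as price it can flag legitimate values, so it is worth comparing the computed fences against domain knowledge before dropping rows.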

d. Removing Rows with Null Values

Lastly, I used .dropna(axis=0) to remove any rows with null values. After the code below, I went from 371982 to 208765 rows.

df_cleaned = df_cleaned.dropna(axis=0)
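To see how many rows a call like this removes, you can compare row counts before and after. The sketch below does that on a hypothetical toy frame and also shows the subset parameter, which limits the null check to the columns you name. Using subset is an option to consider, not something done in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls scattered across columns
toy = pd.DataFrame({
    "price": [4500, 12000, np.nan, 7300],
    "year": [2008, 2015, 2012, np.nan],
    "paint_color": ["red", None, "blue", "white"],
})

rows_before = len(toy)

# Drop rows that have a null in any column
dropped_any = toy.dropna(axis=0)

# Drop rows only when 'price' or 'year' is null; a missing paint_color is tolerated
dropped_subset = toy.dropna(axis=0, subset=["price", "year"])

print(rows_before, len(dropped_any), len(dropped_subset))
```

Dropping on any null is the strictest choice, so restricting the check to the variables you actually plan to use can preserve noticeably more rows.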

And that’s it for now! In the second part, we’ll cover exploring the relationship between variables through visualizations. (Click here for part 2.)

You can see my Kaggle Notebook here.


