Exploratory Data Analysis — What is it and why is it so important? (Part 1/2)
This is week 3 of my “52 Weeks of Data Science” series.
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA), also known as Data Exploration, is a step in the Data Analysis Process, where a number of techniques are used to better understand the dataset being used.
‘Understanding the dataset’ can refer to a number of things including but not limited to…
- Extracting important variables and leaving behind useless variables
- Identifying outliers, missing values, or human error
- Understanding the relationship(s), or lack of, between variables
- Ultimately, maximizing your insights of a dataset and minimizing potential error later in the process
Here’s why this is important.
Have you heard of the phrase, “garbage in, garbage out”?
With EDA, it’s more like, “garbage in, perform EDA, possibly garbage out.”
By conducting EDA, you can turn an almost useable dataset into a completely useable dataset. I’m not saying that EDA can magically make any dataset clean — that is not true. However, many EDA techniques can remedy some common problems that are present in every dataset.
Exploratory Data Analysis does two main things:
1. It helps clean up a dataset.
2. It gives you a better understanding of the variables and the relationships between them.
Components of EDA
To me, there are three main components of exploring data:
- Understanding your variables
- Cleaning your dataset
- Analyzing relationships between variables
In this article, we’ll take a look at the first two components.
1. Understanding Your Variables
You don’t know what you don’t know. And if you don’t know what you don’t know, then how are you supposed to know whether your insights make sense or not? You won’t.
To give an example, I was exploring data provided by the NFL (data here) to see if I could discover any insights regarding variables that increase the likelihood of injury. One insight that I got was that Linebackers accumulated more than eight times as many injuries as Tight Ends. However, I had no idea what the difference between a Linebacker and a Tight End was, and because of this, I didn’t know if my insights made sense or not. Sure, I can Google what the differences between the two are, but I won’t always be able to rely on Google! Now you can see why understanding your data is so important. Let’s see how we can do this in practice.
As an example, I used the same dataset that I used to create my first Random Forest model, the Used Car Dataset here. First, I imported all of the libraries that I knew I’d need for my analysis and conducted some preliminary analyses.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding my variables
.shape returns the number of rows by the number of columns for my dataset. My output was (525839, 22), meaning the dataset has 525839 rows and 22 columns.
.head() returns the first 5 rows of my dataset. This is useful if you want to see some example values for each variable.
.columns returns the name of all of your columns in the dataset.
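To make these three calls concrete, here's a minimal sketch on a tiny, made-up DataFrame (the values are illustrative, not from the used-car data):

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the real dataset)
df = pd.DataFrame({
    'price': [5000, 12000, 7500],
    'year': [2005, 2012, 2009],
    'condition': ['good', 'excellent', 'fair'],
})

print(df.shape)    # (rows, columns) -> (3, 3)
print(df.head())   # first 5 rows (all 3 here)
print(df.columns)  # the column names as an Index
```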
Once I knew all of the variables in the dataset, I wanted to get a better understanding of the different values for each variable.
df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
.nunique(axis=0) returns the number of unique values for each variable.
.describe() summarizes the count, mean, standard deviation, min, and max for numeric variables. The code that follows this simply formats each row to the regular format and suppresses scientific notation (see here).
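As a quick illustration of both calls on a small made-up frame (the numbers are hypothetical), including the lambda trick for suppressing scientific notation:

```python
import pandas as pd

# Hypothetical numeric columns to illustrate .nunique() and .describe()
df = pd.DataFrame({
    'price': [5000.0, 12000.0, 7500.0, 12000.0],
    'odometer': [90000.0, 45000.0, 130000.0, 45000.0],
})

print(df.nunique(axis=0))  # unique values per column (3 for each here)

# Format every summary statistic as a fixed-point string
summary = df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
print(summary)
```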
Immediately, I noticed an issue with price, year, and odometer. For example, the minimum and maximum price are $0.00 and $3,048,344,231.00 respectively. You’ll see how I dealt with this in the next section. I still wanted to get a better understanding of my discrete variables.
Using .unique(), I took a look at my discrete variables, including ‘condition’.
You can see that many of the values are effectively synonyms of each other, like ‘excellent’ and ‘like new’. While this isn’t the greatest example, there will be some instances where it‘s ideal to clump different words together. For example, if you were analyzing weather patterns, you may want to reclassify ‘cloudy’, ‘grey’, ‘cloudy with a chance of rain’, and ‘mostly cloudy’ simply as ‘cloudy’.
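If you did want to clump synonyms like this, one simple option is pandas' `.replace()` with a mapping dict — a sketch using made-up weather labels:

```python
import pandas as pd

# Hypothetical weather labels; clump synonyms into one category
weather = pd.Series(['cloudy', 'grey', 'cloudy with a chance of rain',
                     'mostly cloudy', 'sunny'])

synonyms = {'grey': 'cloudy',
            'cloudy with a chance of rain': 'cloudy',
            'mostly cloudy': 'cloudy'}

cleaned = weather.replace(synonyms)
print(cleaned.unique())  # ['cloudy' 'sunny']
```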
Later you’ll see that I end up omitting this column due to having too many null values, but if you wanted to re-classify the condition values, you could use the code below:
# Reclassify condition column
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']
    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition

# Clean dataframe
def clean_df(df):
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(lambda row: clean_condition(row), axis=1)
    return df_cleaned

# Get df with reclassified 'condition' column
df_cleaned = clean_df(df)
print(df_cleaned.condition.unique())
And you can see that the values have been re-classified below.
2. Cleaning your dataset
You now know how to reclassify discrete data if needed, but there are a number of things that still need to be looked at.
a. Removing Redundant variables
First I got rid of variables that I thought were redundant. This includes url, image_url, and city_url.
df_cleaned = df_cleaned.copy().drop(['url','image_url','city_url'], axis=1)
b. Variable Selection
Next, I wanted to get rid of any columns that had too many null values. Thanks to my friend, Richie, I used the following code to remove any columns that had 40% or more of its data as null values. Depending on the situation, I may want to increase or decrease the threshold. The remaining columns are shown below.
NA_val = df_cleaned.isna().sum()
def na_filter(na, threshold=.4):  # only select variables that pass the threshold
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass
df_cleaned = df_cleaned[na_filter(NA_val)]
c. Removing Outliers
Revisiting the issue previously addressed, I set parameters for price, year, and odometer to remove any values outside of the set boundaries. In this case, I used my intuition to determine parameters — I’m sure there are methods to determine the optimal boundaries, but I haven’t looked into it yet!
df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]

df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
You can see that the minimum and maximum values have changed in the results below.
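For what it's worth, one common rule of thumb for choosing boundaries is the 1.5×IQR fence (not the method used above, which was intuition-based). A sketch on hypothetical prices:

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([5000, 7500, 9000, 12000, 15000, 3_000_000])

# 1.5*IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = prices[prices.between(lower, upper)]
print(filtered.tolist())  # the 3,000,000 entry is dropped
```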
d. Removing Rows with Null Values
Lastly, I used .dropna(axis=0) to remove any rows with null values. After the code below, I went from 371982 to 208765 rows.
df_cleaned = df_cleaned.dropna(axis=0)
And that’s it for now! In the second part, we’ll cover exploring the relationship between variables through visualizations. (Click here for part 2.)
You can see my Kaggle Notebook here.