Exploratory Data Analysis — What is it and why is it so important? (Part 1/2)

This is week 3 of my “52 Weeks of Data Science” series.

Terence Shin
Dec 13, 2019 · 6 min read

What is Exploratory Data Analysis?

Exploratory Data Analysis does two main things:

1. It helps clean up a dataset.

2. It gives you a better understanding of the variables and the relationships between them.

Components of EDA

1. Understanding Your Variables

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding my variables
df.shape
df.head()
df.columns

[Figure: output of df.columns]

df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

[Figure: output of df.nunique(axis=0)]

[Figure: output of df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))]

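To make these first-look calls concrete, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are hypothetical and only stand in for the real dataset; the calls themselves are the same ones used above.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in this article
toy = pd.DataFrame({
    "price": [4500, 12000, 12000, 30000],
    "condition": ["good", "fair", "like new", "excellent"],
})

n_rows, n_cols = toy.shape            # (number of rows, number of columns)
unique_counts = toy.nunique(axis=0)   # distinct values in each column
summary = toy.describe()              # summary statistics for numeric columns only

# Format every statistic as a fixed-point string to avoid scientific notation
summary_fixed = summary.apply(lambda s: s.apply(lambda x: format(x, "f")))

print(n_rows, n_cols)
print(unique_counts)
print(summary_fixed)
```

Note that `describe()` skips the text column by default, so only `price` shows up in the summary.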
df.condition.unique()
# Reclassify condition column
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']

    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition

# Clean dataframe
def clean_df(df):
    # Work on a copy so the original dataframe is left untouched
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(clean_condition, axis=1)
    return df_cleaned

# Get df with reclassified 'condition' column
df_cleaned = clean_df(df)
print(df_cleaned.condition.unique())

[Figure: output of print(df_cleaned.condition.unique())]
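The row-wise `apply` above does the job, but the same reclassification can also be sketched as a single column operation. This is an alternative, not the approach used in this article: it relies on pandas' `Series.replace` with a dict, and the `conditions` Series (including the 'salvage' value) is made-up sample data.

```python
import pandas as pd

# Hypothetical sample of condition labels
conditions = pd.Series(["good", "fair", "like new", "excellent", "salvage"])

# Fold 'fair' into 'good' and 'like new' into 'excellent';
# labels that are not in the mapping are left unchanged
mapping = {"fair": "good", "like new": "excellent"}
reclassified = conditions.replace(mapping)

print(reclassified.unique())
```

On a large table, a column-level operation like this is typically faster than calling a Python function once per row.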

2. Cleaning Your Dataset

# Drop the URL columns
df_cleaned = df_cleaned.copy().drop(['url', 'image_url', 'city_url'], axis=1)

# Count the missing values in each column
NA_val = df_cleaned.isna().sum()

def na_filter(na, threshold=.4):
    # Only select columns whose share of missing values is below the threshold
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass

df_cleaned = df_cleaned[na_filter(NA_val)]
df_cleaned.columns
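Here is a small sketch of the same missing-value threshold on made-up data, so the effect is easy to check by hand. The `toy` DataFrame is hypothetical, and instead of the loop above it uses `isna().mean()`, which gives the fraction of missing values per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'size' is mostly missing, 'price' has one gap
toy = pd.DataFrame({
    "price": [4500, 12000, np.nan, 30000, 8000],
    "size": [np.nan, np.nan, np.nan, "compact", np.nan],
    "year": [2005, 2012, 2018, 2015, 1999],
})

threshold = 0.4

# Fraction of missing values in each column (mean of True/False flags)
na_fraction = toy.isna().mean()

# Keep only the columns whose missing fraction is below the threshold
kept = toy.loc[:, na_fraction < threshold]

print(na_fraction)
print(list(kept.columns))
```

With a 0.4 threshold, `size` (80% missing) is dropped, while `price` (20% missing) and `year` (nothing missing) are kept.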
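# Keep rows with a price between 999.99 and 99999.00 (both bounds inclusive)
df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
# Keep rows with a year after 1990
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
# Keep rows with an odometer reading below 899999.00
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]
df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

# Drop every row that still contains a missing value
df_cleaned = df_cleaned.dropna(axis=0)
df_cleaned.shape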
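To see what these range filters and the final `dropna` do row by row, here is a minimal sketch on made-up data. The `toy` DataFrame and its values are hypothetical; the bounds are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; each row is built to fail (or pass) a specific step
toy = pd.DataFrame({
    "price": [0, 4500, 12000, 250000, 8000, 15000, 9000],
    "year": [2010, 1985, 2012, 2016, 2015, 2018, 2014],
    "odometer": [50000, 120000, 95000, 30000, 1500000, np.nan, 60000],
    "condition": ["good", "fair", "good", "excellent", "good", "good", None],
})

in_price = toy["price"].between(999.99, 99999.00)  # inclusive on both ends
in_year = toy["year"] > 1990
in_odometer = toy["odometer"] < 899999.00          # False when odometer is NaN

# Apply all three range filters at once
filtered = toy[in_price & in_year & in_odometer]

# Remove the rows that still have a missing value in any column
cleaned = filtered.dropna(axis=0)

print(len(toy), len(filtered), len(cleaned))
```

One thing worth noticing: a comparison against `NaN` evaluates to `False`, so a filter such as `odometer < 899999.00` already removes rows with a missing odometer value before `dropna` runs.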

The Startup

Terence Shin

Written by

Data Scientist @ KOHO | Top 1000 Writer on Medium | MSc, MBA | https://terenceshin.com/
