# Exploratory Data Analysis — What is it and why is it so important? (Part 1/2)

## This is week 3 of my “52 Weeks of Data Science” series.

Dec 13, 2019 · 6 min read

# What is Exploratory Data Analysis?

Exploratory Data Analysis does two main things:

1. It helps clean up a dataset.

2. It gives you a better understanding of the variables and the relationships between them.
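To make those two goals concrete, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are hypothetical and are not the dataset used in this article; the point is only to show what "cleaning" and "understanding relationships" can look like in pandas.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in this article.
toy = pd.DataFrame({
    "price": [5000, 12000, None, 8000, 15000],
    "year": [2005, 2015, 2010, 2008, 2018],
    "odometer": [150000, 60000, 90000, 120000, 30000],
})

# Goal 1: spot problems to clean up, e.g. missing values per column.
missing_per_column = toy.isna().sum()

# Goal 2: understand relationships, e.g. pairwise correlations.
correlations = toy.corr()

print(missing_per_column)
print(correlations)
```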

# Components of EDA

```python
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding my variables
# (df is assumed to be already loaded as a pandas DataFrame)
df.shape
df.head()
df.columns
```
```python
df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
```
```python
df.condition.unique()
```
```python
# Reclassify condition column
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']

    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition


# Clean dataframe (works on a copy so the original df is left untouched)
def clean_df(df):
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(clean_condition, axis=1)
    return df_cleaned


# Get df with reclassified 'condition' column
df_cleaned = clean_df(df)
print(df_cleaned.condition.unique())
```
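The row-wise reclassification above could also be written as a single vectorized mapping, which tends to be faster on large frames. This is only a sketch of an alternative, not the approach used in this article; the `sample` DataFrame and `condition_map` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample; the real 'condition' column may hold other labels.
sample = pd.DataFrame(
    {"condition": ["good", "fair", "excellent", "like new", "salvage", None]}
)

# Map each label to its merged category; unlisted labels stay unchanged.
condition_map = {"fair": "good", "like new": "excellent"}
sample["condition"] = sample["condition"].replace(condition_map)

print(sample["condition"].unique())
```

Labels that are not in the mapping, including missing values, pass through as they are, which matches the fall-through `return row.condition` in the original function.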

```python
df_cleaned = df_cleaned.copy().drop(['url', 'image_url', 'city_url'], axis=1)
```
```python
NA_val = df_cleaned.isna().sum()


def na_filter(na, threshold=.4):
    # Only select variables that pass the threshold,
    # i.e. columns whose share of missing values is below it
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass


df_cleaned = df_cleaned[na_filter(NA_val)]
df_cleaned.columns
```
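The same missing-value filter can be sketched without an explicit loop, since `isna().mean()` gives the fraction of missing values per column directly. The `toy` frame below is hypothetical and exists only to make the example runnable; the 0.4 threshold mirrors the default in `na_filter`.

```python
import pandas as pd

# Hypothetical toy frame: 'size' is 75% missing, 'price' is 25% missing.
toy = pd.DataFrame({
    "price": [5000, None, 8000, 15000],
    "size": [None, None, None, "compact"],
    "year": [2005, 2015, 2010, 2008],
})

threshold = 0.4  # same default as na_filter

# isna().mean() gives the fraction of missing values per column.
missing_fraction = toy.isna().mean()

# Keep only the columns whose missing fraction is below the threshold.
kept = toy.loc[:, missing_fraction < threshold]

print(list(kept.columns))
```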
```python
df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]
df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
```
```python
df_cleaned = df_cleaned.dropna(axis=0)
df_cleaned.shape
```

