# Exploratory Data Analysis — What is it and why is it so important? (Part 1/2)

## This is week 3 of my “52 Weeks of Data Science” series.

Dec 13, 2019 · 6 min read

# What is Exploratory Data Analysis?

Exploratory Data Analysis does two main things:

1. It helps clean up a dataset.

2. It gives you a better understanding of the variables and the relationships between them.
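
As a rough illustration of those two goals, the sketch below runs one check of each kind on a tiny made-up table. The `toy` DataFrame, its column names, and its values are hypothetical, invented only for this example; they are not taken from the dataset used later in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [5000.0, 12000.0, np.nan, 8000.0, 15000.0],
    "year": [2005, 2015, 2010, 2008, 2018],
    "odometer": [150000.0, 40000.0, 90000.0, np.nan, 20000.0],
})

# Goal 1: find what needs cleaning, e.g. missing values per column
missing_per_column = toy.isna().sum()

# Goal 2: look at relationships between variables, e.g. pairwise correlation
correlations = toy.corr()

print(missing_per_column)
print(correlations.round(2))
```

The missing-value counts point at columns that may need cleaning, and the correlation matrix gives a first, coarse view of how the numeric variables move together.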

# Components of EDA

```python
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding my variables
# (df is assumed to be a pandas DataFrame that has already been loaded)
df.shape
df.head()
df.columns
```

```python
df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
```

```python
df.condition.unique()
```

```python
# Reclassify condition column
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']

    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition

# Clean dataframe
def clean_df(df):
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(clean_condition, axis=1)
    return df_cleaned

# Get df with reclassified 'condition' column
df_cleaned = clean_df(df)
print(df_cleaned.condition.unique())
```
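
The row-wise reclassification above could probably also be written as a single vectorized call. The sketch below is one possible alternative, not the approach used in this article: it applies pandas' `Series.replace` with a dictionary to a small, hypothetical sample of condition labels (the `conditions` Series and `condition_map` name are made up for this example).

```python
import pandas as pd

# Hypothetical sample of condition labels, invented for illustration
conditions = pd.Series(["good", "fair", "excellent", "like new", "salvage"])

# Map only the labels being merged; anything not listed is left unchanged
condition_map = {"fair": "good", "like new": "excellent"}
reclassified = conditions.replace(condition_map)

print(reclassified.tolist())
```

On a large DataFrame this tends to be faster than `apply(..., axis=1)`, because it avoids calling a Python function once per row.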

```python
# Drop the URL columns
df_cleaned = df_cleaned.drop(['url', 'image_url', 'city_url'], axis=1)
```

```python
# Count missing values per column
NA_val = df_cleaned.isna().sum()

def na_filter(na, threshold=.4):
    # Only select variables that pass the threshold
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass

df_cleaned = df_cleaned[na_filter(NA_val)]
df_cleaned.columns
```

```python
# Keep only rows within the chosen price, year, and odometer ranges
df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]
df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
```

```python
# Drop any remaining rows with missing values
df_cleaned = df_cleaned.dropna(axis=0)
df_cleaned.shape
```

## The Startup

Get smarter at building your thing. Join The Startup’s +799K followers.

Written by

## Terence Shin

Data Scientist @ KOHO | Top 1000 Writer on Medium | MSc, MBA | https://terenceshin.com/
