5 Minute EDA: How to explore data when you don’t know what you don’t know

Aya Spencer
Published in 5 Minute EDA
3 min read · Mar 12, 2022
Photo by marianne bos on Unsplash

After you finish cleaning and wrangling your data, the next step is usually to start exploring the data. What if, in this step, you realize that you know absolutely nothing about the data that you are about to explore?

I was recently playing around with a dataset about cars, which included metrics such as miles per gallon, horsepower, model year, and brand. I personally know nothing about cars. Not a damn thing. At times like this, exploratory data analysis can seem especially difficult because my baseline knowledge of the data is minimal. In this situation, a good trick is to use pairplot, a wonderful function in the seaborn library.

What is a pairplot?

According to the seaborn documentation,

A pairplot plots pairwise relationships in a dataset. The pairplot function creates a grid of Axes such that each numeric variable in data will be shared on the y-axis across a single row and on the x-axis across a single column.

In other words, pairplot lets you see the relationship between every pair of numeric variables in a dataset at a glance. It’s easier to show than to explain in words, so let’s jump to the code:

Source & Method

Kaggle has a dataset about cars. It includes information such as each car’s brand, miles per gallon, and model year. I used this as the base for my pairplot.

Prepare Data

Let’s import the base data:

import pandas as pd
df = pd.read_csv("cars_updated.csv")

I notice that some columns have leading and trailing spaces, so let’s get rid of those first:

df = df.rename(columns=lambda x: x.strip())

Perfect.
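A quick way to confirm the cleanup worked is to compare the column names before and after. Here is a minimal sketch on a tiny made-up DataFrame (the column names and values are assumptions for illustration, not the actual Kaggle schema), which also shows an equivalent form using the string accessor:

```python
import pandas as pd

# Hypothetical frame with padded column names, standing in for the CSV.
raw = pd.DataFrame({
    "mpg": [14.0, 31.9],
    " hp": [165, 71],
    " brand ": [" US.", " Europe."],
})

# Same cleanup as in the article: strip whitespace from every column name.
clean = raw.rename(columns=lambda x: x.strip())

# Equivalent vectorised form using the string accessor on the column index.
alt = raw.copy()
alt.columns = alt.columns.str.strip()

print(list(raw.columns))    # padded names
print(list(clean.columns))  # stripped names
```

Note that this only touches the column names. Padded values inside a text column would need their own `.str.strip()` on that column.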

Now, as I mentioned earlier, I know nothing about cars. I don’t know where to start with my exploratory analysis, so I’m going to use the pairplot function to plot all the fields against each other and uncover some relationships:

import seaborn as sns
sns.pairplot(df)

The bar charts along the diagonal are histograms, one per variable. In other words, each shows the distribution of that variable. Looking at mpg, for example, you can see that the majority of cars in the dataset sit closer to 20 mpg than to 40 mpg.
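If you want numbers behind one of those diagonal panels, you can tabulate the same distribution by binning the column yourself. This is a rough sketch on made-up mpg values (not the Kaggle data), and the bin edges are an arbitrary choice for illustration:

```python
import pandas as pd

# Made-up mpg values for illustration only.
mpg = pd.Series([14, 15, 17, 18, 18, 20, 21, 22, 24, 26, 31, 38], name="mpg")

# Bin into 10-mpg-wide intervals: (10, 20], (20, 30], (30, 40].
bins = [10, 20, 30, 40]
counts = pd.cut(mpg, bins=bins).value_counts().sort_index()

print(counts)          # number of cars per mpg bin, in bin order
print(mpg.describe())  # count, mean, and quartiles for the same column
```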

You can even group the pairplot by adding a “hue” condition. I’m going to group my plot by brand, and add some color by selecting the “rainbow” palette:

sns.pairplot(df, hue="brand", palette='rainbow')

Now you can see the distributions and patterns by brand! On to some discoveries:

  1. The dataset has older US and European cars and newer Japanese cars.
  2. The general trend shows miles per gallon decreasing as horsepower increases, but most of that trend is driven by US cars. European cars show lower horsepower at somewhat variable miles per gallon.
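Observations read off a plot are worth confirming with numbers. Here is a rough sketch of how the two points above could be checked with a groupby. The rows are invented and the column names (`brand`, `year`, `hp`, `mpg`) are assumptions, so the printed values mean nothing on their own:

```python
import pandas as pd

# Invented rows for illustration; not the Kaggle data.
df = pd.DataFrame({
    "brand": ["US", "US", "US", "Europe", "Europe", "Europe", "Japan", "Japan", "Japan"],
    "year":  [1971, 1972, 1974, 1972, 1973, 1976, 1979, 1981, 1982],
    "hp":    [165, 150, 110, 90, 75, 70, 70, 67, 65],
    "mpg":   [14.0, 15.0, 19.0, 26.0, 24.0, 29.0, 32.0, 34.0, 37.0],
})

# Point 1: which model years does each brand cover?
year_range = df.groupby("brand")["year"].agg(["min", "median", "max"])

# Point 2: how strongly do hp and mpg move together within each brand?
hp_mpg_corr = df.groupby("brand")[["hp", "mpg"]].apply(
    lambda g: g["hp"].corr(g["mpg"])
)

print(year_range)
print(hp_mpg_corr)
```

A strongly negative correlation for one brand and a weaker one for another would back up what the colored scatter panels suggest.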

Pairplots are useful for finding trends like these. They help you formulate deeper questions that you can then investigate with further analysis. I recommend playing around with a few pairplot variations to find general trends and patterns before diving into model building.

This is part of my 5-minute EDA series, where I run quick exploratory data analysis on an interesting dataset. Thanks for reading!
