20% of EDA Plots Data Scientists Use 80% of the Time

All the plots you need to know for EDA

Anjolaoluwa Ajayi
GDSC Babcock Dataverse
5 min read · Dec 22, 2023

Various EDA Plots cc: author

What is EDA?

Exploratory Data Analysis (EDA), according to IBM, is a method used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization techniques.

So we can say EDA is the process of investigating and understanding your dataset by creating visualizations and summaries.

Why do we Need EDA?

Honestly, EDA is so important in the data science/machine learning workflow that the real question should be “what would we do without EDA!?”

Before doctors prescribe drugs or treatments to patients, they always run a couple of tests, ask a bunch of questions, and all of that.

Data scientists are like doctors, except instead of patients, we’re dealing with data.

EDA is our way of asking the data questions to find out everything we can about it and understand why it is the way it is (i.e. identifying trends, patterns, anomalies, etc.).

Now, instead of drugs and treatments, we’re trying to decide on the best models to use on our data and the best features to use from it.

So the information gathered from EDA helps us with this. And these, my friend, are the major reasons why we need EDA as data scientists.

Note:

For the sake of this blog post, we’ll be using:

  • The Seaborn and Matplotlib libraries
  • ‘Tips’ dataset from Seaborn

Now, let’s look at the few plots, out of the many available, that data scientists use all the time.

1. Bar Plot / Count Plot

Image by author from code

Used for:

  • Displaying the distribution of categorical variables.
  • Visualizing the frequency or count of each category in a dataset.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt


data = sns.load_dataset('tips')
sns.countplot(x='day', data=data)
plt.title('Count of Tips by Day')
plt.show()
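
A count plot is really just a frequency table drawn as bars, so it can help to look at the numbers behind it too. Here is a minimal sketch using pandas' value_counts. The helper name category_counts and the tiny sample are made up for illustration; the same call should work on data['day'] from the tips dataset.

```python
import pandas as pd


def category_counts(values: pd.Series) -> pd.DataFrame:
    """Return the count and share of each category, largest first."""
    counts = values.value_counts()
    shares = values.value_counts(normalize=True)
    # Both Series share the same index, so pandas aligns them by category.
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical sample standing in for data['day']
days = pd.Series(["Sat", "Sun", "Sat", "Thur", "Sat", "Sun"])
print(category_counts(days))
```

The share column is handy when categories are imbalanced and raw counts alone are hard to compare.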

2. Box Plot

Image by author from code

Used for:

  • Displaying the median, quartiles, and outliers in data (the mean is not drawn by default).
  • Comparing the distribution of multiple variables.
  • Identifying the spread of numerical variables.
  • Detecting potential outliers in the dataset.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
sns.boxplot(x='day', y='total_bill', data=data)
plt.title('Box Plot of Total Bill by Day')
plt.show()
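
The points a box plot draws beyond the whiskers come from a simple rule: by default, seaborn flags anything more than 1.5 × IQR below the first quartile or above the third. The sketch below applies that rule by hand so you can list the flagged values. The function name iqr_outliers and the sample numbers are assumptions for illustration, not part of seaborn.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical sample standing in for data['total_bill']
bills = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 50.0])
print(iqr_outliers(bills))
```

Whether a flagged value is a real anomaly or just a big dinner party is still a judgment call; the rule only tells you where to look.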

3. Density Plot

Pro tip: We’re data scientists, so we use density plots instead of histograms because we hate guessing/deciding the optimum number of bins.

Image by author from code

Used for:

  • Visualizing the distribution of a continuous variable.
  • Identifying peaks, valleys, and overall patterns in the data.
  • Understanding the shape of the distribution.
  • Comparing the distributions of multiple variables.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
sns.kdeplot(data['total_bill'], fill=True)  # fill replaces the deprecated shade argument
plt.title('Density Plot of Total Bill')
plt.show()

4. Scatter Plot

Image by author from code

Used for:

  • Exploring the relationship between two continuous variables.
  • Identifying patterns, correlations, or clusters in the data.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', y='tip', data=data)
plt.title('Scatter Plot of Total Bill vs. Tip')
plt.show()
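
A scatter plot shows you the shape of a relationship; a correlation coefficient puts a number on how linear it is. As a rough sketch, the snippet below computes Pearson's r with pandas' Series.corr. The helper name pearson_r and the four-row sample are made up for illustration.

```python
import pandas as pd


def pearson_r(df: pd.DataFrame, x: str, y: str) -> float:
    """Pearson correlation between two numeric columns."""
    return float(df[x].corr(df[y]))


# Hypothetical sample standing in for the tips dataset
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [2.0, 3.0, 5.0, 6.0],
})
print(round(pearson_r(sample, "total_bill", "tip"), 3))
```

Keep the plot next to the number: r only measures linear association, so clusters and curved patterns are easier to spot by eye.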

5. Line Plot

Image by author from code

Used for:

  • Displaying the trend or pattern in a time series.
  • Showing the relationship between two continuous variables over a continuous interval.
  • Comparing changes in variables over a continuous range.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
sns.lineplot(x='total_bill', y='tip', data=data)
plt.title('Line Plot of Tip Over Total Bill')
plt.show()
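
One thing to know about the line plot above: when several rows share the same x value, seaborn's lineplot by default averages y at each x and draws a confidence band around that mean. The sketch below reproduces just the averaged line with a pandas groupby. The helper name mean_by and the sample values are assumptions for illustration.

```python
import pandas as pd


def mean_by(df: pd.DataFrame, x: str, y: str) -> pd.Series:
    """Average y at each distinct x, sorted by x."""
    return df.groupby(x)[y].mean().sort_index()


# Hypothetical sample standing in for the tips dataset
sample = pd.DataFrame({
    "size": [2, 2, 3, 3, 4],
    "tip": [2.0, 3.0, 3.0, 4.0, 5.0],
})
print(mean_by(sample, "size", "tip"))
```

Line plots tend to read best when x has a natural order with few distinct values (dates, party size), which is why they are the go-to for time series.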

6. Heatmap

Image by author from code

Used for:

  • Displaying the correlation matrix of numerical variables.
  • Identifying patterns and relationships in a large dataset.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
correlation_matrix = data.corr(numeric_only=True)  # tips has categorical columns; correlate the numeric ones only
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
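
A correlation matrix is symmetric, so half of the heatmap repeats the other half. A common trick is to hide the upper triangle with heatmap's mask parameter. The sketch below is one way to do it; the helper name upper_triangle_mask and the small numeric sample are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def upper_triangle_mask(corr: pd.DataFrame) -> np.ndarray:
    """Boolean mask that is True for cells strictly above the diagonal."""
    return np.triu(np.ones(corr.shape, dtype=bool), k=1)


# Hypothetical numeric sample standing in for the tips dataset
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 25.0],
    "tip": [2.0, 3.0, 5.0, 6.0, 3.5],
    "size": [2, 2, 4, 4, 3],
})
corr = sample.corr()

# Cells where the mask is True are left blank by sns.heatmap.
sns.heatmap(corr, mask=upper_triangle_mask(corr), annot=True,
            cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap (Lower Triangle)')
plt.show()
```

Pinning vmin and vmax to -1 and 1 keeps the colour scale comparable across datasets.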

Bonus: Combining Plots

Ever heard of the saying ‘kill two birds with one stone’?

Well, we sort of do that a lot when it comes to EDA by employing plots that are actually just a combination of the plots discussed above.

We do this in an attempt to ‘save time’, but let’s be real: decent EDA takes a lot of time regardless.

Now let’s save that talk for another day and get to it already, shall we?

7. Subplot

Image by author from code

Used for: Comparing multiple plots side by side within the same figure.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')

plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
sns.scatterplot(x='total_bill', y='tip', data=data)
plt.title('Scatter Plot of Total Bill vs Tip')

plt.subplot(2, 2, 2)
sns.boxplot(x='day', y='total_bill', data=data)
plt.title('Box Plot of Total Bill by Day')

plt.subplot(2, 2, 3)
sns.barplot(x='day', y='total_bill', data=data)
plt.title('Bar Plot of Mean Total Bill by Day')

plt.subplot(2, 2, 4)
sns.histplot(data['total_bill'], kde=True)
plt.title('Histogram of Total Bill')

plt.tight_layout()
plt.show()
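
The same kind of grid can also be built with Matplotlib's object-oriented interface, where plt.subplots hands back the axes and each seaborn call is pointed at one of them through its ax parameter. This is only a sketch of that alternative: the helper name plot_overview and the small sample are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_overview(df: pd.DataFrame):
    """Draw a scatter plot and a box plot side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    sns.scatterplot(x='total_bill', y='tip', data=df, ax=axes[0])
    axes[0].set_title('Scatter Plot of Total Bill vs Tip')

    sns.boxplot(x='day', y='total_bill', data=df, ax=axes[1])
    axes[1].set_title('Box Plot of Total Bill by Day')

    fig.tight_layout()
    return fig, axes


# Hypothetical sample standing in for the tips dataset
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 25.0, 15.0],
    "tip": [2.0, 3.0, 5.0, 6.0, 3.5, 2.5],
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur", "Thur"],
})
plot_overview(sample)
plt.show()
```

Holding on to the axes objects makes it easier to tweak one panel later without having to remember which subplot is currently active.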

8. Pairplot

Image by author from code

Used for: Exploring correlations and trends between multiple variables by visualizing them in pairs.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')

sns.pairplot(data, hue='day')
plt.suptitle('Pairplot of Numerical Variables by Day', y=1.02)
plt.show()

9. Violin Plot

Combines the features of box plots and kernel density plots.

Image by author from code

Used for: Visualizing the distribution of a numerical variable across different categories.

Used like this:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset('tips')
sns.violinplot(x='day', y='total_bill', data=data)
plt.title('Violin Plot of Total Bill by Day')
plt.show()

Conclusion/C-T-A

And that’s a wrap :)

If you gained a thing or two, please leave as many claps as you can (up to 50) so other data scientists get to see this.

For more articles like this, make sure to follow the GDSC Babcock DataVerse Publication.

Bye for now!

Anjolaoluwa Ajayi
GDSC Babcock Dataverse

Data Scientist @EY. I'm a big data fiend (no pun intended ><). I mostly write about Data Science, ML, and Gen AI. Might write a book soon ;)