two people hang gliding over lake, camera is looking down on people
Just like EDA, source

Your Guide to Exploratory Data Analysis

tanta base
8 min read · Dec 10, 2023

Exploratory Data Analysis (EDA) is an important step in the machine learning process. You can think of EDA as being 1,000 feet above the data and looking down on it (see above). EDA lets you see large patterns and trends within the data and understand the direction the data is pointing you in. More importantly, it gives you a chance to assess whether your data is even usable for analysis or machine learning.

In this article we’ll review the most common visualizations for EDA and some of their variants. This article excludes geospatial data. Some unsupervised learning techniques (PCA, t-SNE, etc.) can also be used for EDA, but we’ll keep to non-machine-learning methods for this article.

As an aside, as a data engineer or machine learning engineer you are almost always expected to give presentations to a large audience. The visualizations we are going to review can also be used to educate your audience about the data you’ve been working with.

TL;DR

Histogram is for numerical data, used to see the frequency distribution of your dataset. It is not a bar chart.

Line graph uses lines to connect (usually discrete) data points over time, used to understand how your data changes over time.

Time series are for discrete samples taken over discrete points in time to see the trend over time.

Scatter plot is for numerical data to determine patterns or correlation between two variables.

Pie chart is typically used to divide categorical data into subsets represented by percentages of the whole data set. This chart helps to aggregate your data.

Bar plots are for visualizing the relationship between numerical and categorical data. They are usually used for comparison.

Heat maps display the magnitude as a color between two variables to understand the relationship between those two variables. They are typically used to understand correlation.

Treemaps arrange your data in a hierarchical view, with subsets of your data nested within larger boxes. They are typically used for aggregation.

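Box-and-whisker plot shows numerical data split into quartiles: the box spans the middle 50% of the data, and the whiskers extend to the lowest and highest points that are not outliers. If outliers exist, they are drawn as individual points beyond the whiskers. Used to see the data distribution.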

Word Cloud is used for qualitative data, specifically words, to see the frequency/significance of words within a corpus. This is usually used for comparisons or trends.

Let’s do a quick review of the data types:

image by author

Histogram

Histograms are typically used to visualize numerical data in order to understand the frequency distribution of your dataset. Data is organized into bins or ranges, where the x axis is the bin range and the y axis is the count of data points in that bin. They are very useful for getting an idea of how many data points occur within a range and whether data is missing from a specified range.

Note that histograms and bar charts look similar but are not the same thing. Bar charts are used for comparison, whereas histograms are used to assess the distribution of your data. For example, in a bar chart you could compare the bars to each other, but histograms are used to zoom out further, understand the amount of data in a given range and see your data as a whole.

Below are some examples. Histograms can take on various shapes:

symmetrical shape: source
skewed right: source
skewed left: source
similar to a histogram, a stem-and-leaf plot can be used to see the distribution of quantitative data, source
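To make the binning described above concrete, here is a minimal sketch using NumPy and Matplotlib. The data is synthetic and the choice of 20 bins is an assumption made for illustration, not something from the original examples.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic, roughly symmetric data (assumed for illustration)
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=1_000)

# np.histogram returns the count per bin and the bin edges
# (there is always one more edge than there are bins)
counts, bin_edges = np.histogram(values, bins=20)

fig, ax = plt.subplots()
ax.hist(values, bins=bin_edges, edgecolor="black")
ax.set_xlabel("value (bin range)")        # x axis: the bins
ax.set_ylabel("count of data points")     # y axis: how many points fall in each bin
ax.set_title("Histogram of synthetic data")

# Bins with a zero count flag ranges where data is missing
empty_bins = [
    (bin_edges[i], bin_edges[i + 1]) for i, c in enumerate(counts) if c == 0
]
# fig.savefig("histogram.png")  # or plt.show() in an interactive session
```

Changing `bins` changes how coarse the picture is, so it can be worth trying a few values before drawing conclusions about the shape.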

Line Graph

A line graph uses lines to connect individual quantitative (usually discrete) data points over a specified time interval in order to assess changes to your data over time or to understand relationships between values. The line is there to aid in the visual representation of your data. The x axis is usually time and the y axis is usually a count or quantity. Example below:

can plot many lines: source
or one: source
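As a rough sketch of the line graphs above, the snippet below plots two lines on one chart with pandas and Matplotlib. The monthly figures and the column names `product_a` and `product_b` are made up for illustration.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly counts for two made-up products
months = pd.date_range("2023-01-01", periods=6, freq="MS")
sales = pd.DataFrame(
    {
        "product_a": [10, 12, 15, 14, 18, 21],
        "product_b": [8, 9, 7, 11, 10, 13],
    },
    index=months,
)

fig, ax = plt.subplots()
for column in sales.columns:
    # One line per column; markers show the individual data points
    ax.plot(sales.index, sales[column], marker="o", label=column)
ax.set_xlabel("month")        # time on the x axis
ax.set_ylabel("units sold")   # quantity on the y axis
ax.legend()
# fig.savefig("line_graph.png")  # or plt.show() in an interactive session
```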

Scatter Plot

With a scatter plot you can visualize correlation and patterns between two quantitative variables. Usually, the position of each dot indicates that data point’s values on the x and y axes. Relationships between variables, or correlations, can be described as positive or negative, weak or strong, and linear or nonlinear.

They can be great for spotting outliers and gaps in the data, and for assessing whether data can be further segmented. An issue with scatter plots is that if you have too much data, the plot can look like one big mass or blob where patterns/correlations are hard to determine. This is known as overplotting.

scatter plot: a million little pieces, source
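The sketch below pairs a scatter plot with a Pearson correlation coefficient, which is one way to put a number on "positive or negative, weak or strong". The data is synthetic, and using small, semi-transparent markers is one common way to soften overplotting, not the only one.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic data with a positive linear relationship plus noise (assumed)
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=5_000)
y = 2.0 * x + rng.normal(0, 3, size=x.size)

# Pearson correlation: close to +1 is strong positive, close to -1 strong negative
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
# Small markers and low alpha keep dense regions from merging into one blob
ax.scatter(x, y, s=5, alpha=0.2)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Pearson r = {r:.2f}")
# fig.savefig("scatter.png")  # or plt.show() in an interactive session
```

Note that Pearson correlation only measures linear relationships, so a nonlinear pattern can still show up clearly in the plot while `r` stays near zero.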

Pie Chart

A pie chart shows all of your data divided into portions, where each portion is a named category represented by a percentage. The data used for this chart is typically categorical. A pie chart is most commonly used to visualize how each subset contributes to the entire dataset. Pie charts with lots of small slices can be difficult to read; if that’s the case, you may need to adjust your pie chart (for example, by combining the smallest slices) to create a better visualization. With that said, on to the example!

source
can also use a donut chart! source
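Here is a small sketch of turning categorical data into the percentages a pie chart displays. The pet labels and the 10% cutoff for folding small slices into an "other" slice are assumptions chosen for illustration.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical categorical column
labels = pd.Series(
    ["cat"] * 50 + ["dog"] * 30 + ["bird"] * 12 + ["fish"] * 5 + ["lizard"] * 3
)

# Fraction of the whole dataset per category, largest first
shares = labels.value_counts(normalize=True)

# Fold slices below the (assumed) 10% threshold into a single "other" slice
threshold = 0.10
small = shares[shares < threshold]
grouped = shares[shares >= threshold].copy()
if not small.empty:
    grouped["other"] = small.sum()

fig, ax = plt.subplots()
ax.pie(grouped.values, labels=list(grouped.index), autopct="%1.1f%%")
ax.set_title("Share of each category")
# fig.savefig("pie.png")  # or plt.show() in an interactive session
```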

Bar Plot

The bar plot, otherwise known as not a histogram. This is typically used to see the relationship between numerical and categorical data. The x axis is the label of your categorical data and the y axis is a count or a numerical value. With a bar plot you can see how much data you have in each category, or compare groups to each other in order to rank them. Example below:

source
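The two uses mentioned above (counting per category and ranking groups) can be sketched with pandas as follows. The `region` and `amount` columns and their values are hypothetical.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical records: one categorical column and one numerical column
orders = pd.DataFrame(
    {
        "region": ["north", "south", "east", "north", "east",
                   "east", "west", "south", "east"],
        "amount": [20, 35, 15, 30, 25, 40, 10, 5, 20],
    }
)

# How much data there is per category
orders_per_region = orders["region"].value_counts()

# A numerical value per category, sorted so the bars are ranked
total_per_region = (
    orders.groupby("region")["amount"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots()
ax.bar(total_per_region.index, total_per_region.values)
ax.set_xlabel("region")         # categorical labels on the x axis
ax.set_ylabel("total amount")   # numerical value on the y axis
# fig.savefig("bar.png")  # or plt.show() in an interactive session
```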

Heat Map

Usually the crowd pleaser and prettiest to look at, on to the heat map! We will be discussing a grid heat map, a two-dimensional matrix in which the color of each cell represents the magnitude of a value for a pair of variables. Heat maps can be used to determine correlations or relationships within your dataset, and they work with both numerical and categorical data.

can never go wrong with a heat map, source
my first introduction to heat maps, source
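A correlation heat map is one common use of the grid described above. The following is a minimal sketch with synthetic columns (`height`, `weight` and `age` are made-up names and values), drawn with plain Matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic data: weight depends on height, age is independent (assumed)
rng = np.random.default_rng(seed=2)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * height + rng.normal(0, 5, n)
age = rng.uniform(18, 65, n)
df = pd.DataFrame({"height": height, "weight": weight, "age": age})

# Pairwise Pearson correlation between every pair of columns
corr = df.corr()

fig, ax = plt.subplots()
# Each cell's color encodes the correlation for one pair of variables
image = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
# fig.savefig("heatmap.png")  # or plt.show() in an interactive session
```

Fixing the color scale to the range -1 to 1 keeps the colors comparable across different datasets.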

Treemap

This type of graph uses nested rectangles to represent categorical data arranged in a hierarchy. Essentially, your data is divided into smaller boxes within a whole box. Treemaps allow you to easily evaluate your entire dataset and the subsets that exist within it.

also another crowd pleaser, source

Box-and-whisker Plot

Just like the histogram, this is a really great way to get that 1,000-foot view of your data. This is a graphical representation of your data distribution that shows the median, the quartiles (the box) and the min/max of your data outside of outliers (the whiskers), with any outliers drawn as individual points beyond the whiskers. The spacing between these subsections shows the spread and skewness of the data. Below are the different parts:

  • minimum is the lowest data point, outside of outliers (the 0th percentile when there are no outliers)
  • maximum is the highest data point, outside of outliers (the 100th percentile when there are no outliers)
  • median/second quartile/50th percentile is the middle value of the data set
  • first quartile/25th percentile/lower quartile is the median of the lower half of the data set
  • third quartile/75th percentile/upper quartile is the median of the upper half of the data set
source
can also use violin plots to see density around each quartile source
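The parts listed above can be computed directly. This sketch assumes the common 1.5 × IQR rule for deciding what counts as an outlier, which the article does not specify but which is Matplotlib's default (`whis=1.5`); the sample numbers are made up.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical sample with one unusually large value
data = np.array([12, 15, 14, 10, 18, 20, 16, 13, 17, 45])

# Quartiles: the box runs from q1 to q3 with a line at the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Assumed rule: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the lowest and highest points that are not outliers
inliers = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inliers.min(), inliers.max()

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)
ax.set_ylabel("value")
# fig.savefig("boxplot.png")  # or plt.show() in an interactive session
```

Here the value 45 falls above the upper fence, so it is plotted as a separate point and the upper whisker stops at 20.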

Word Cloud

This is a good place to start if you are doing some type of natural language processing or analyzing word data within a large text or corpus. It is also another crowd pleaser if you are doing a presentation on text data. This is a graphical representation of word data, where single words are depicted and, often, the frequency/significance of the word is represented by the size/color of the word. Example below:

source
i only see a word cloud, source
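Behind a word cloud there is usually a word-frequency count. The sketch below computes one with the Python standard library only; the sample text, the tiny stop-word list and the font-size formula are all assumptions for illustration, and a plotting package would be needed to draw the actual cloud.

```python
import re
from collections import Counter

# Hypothetical text standing in for a corpus
text = (
    "Data analysis is the process of exploring data. "
    "Exploring data helps you understand the data and the patterns in the data."
)

# Tiny illustrative stop-word list; real lists are much longer
stop_words = {"the", "a", "of", "and", "is", "to", "in", "you"}

# Lowercase, split into words, and drop stop words
tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
frequencies = Counter(tokens)

# The most frequent words are the ones a word cloud would draw largest
top_words = frequencies.most_common(3)

# One possible mapping from frequency to font size (10 to 40 points, assumed)
max_count = top_words[0][1]
font_sizes = {word: 10 + 30 * count / max_count for word, count in frequencies.items()}
```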

Time Series Analysis

Similar to a line chart, I typically would use this for larger data sets over longer lengths of time. Time series data is discrete samples taken over discrete points in time to see the trend over time. This allows you to see fluctuations within your data during small time ranges and overall trends during larger time ranges. Example below:

time series data source

You can also view seasonality within your time series data, that is, a pattern of events that peaks around the same time period in each cycle. Below is an example of what seasonality in time series data looks like:

seasonality in time series data source

A feature of time series data is noise: random or unpredictable variation in the data. Example below:

noise in time series data source

To further analyze your time series data you can use a decomposition procedure, which helps you better understand the trend and seasonal factors in a time series. For example, you may want to remove the seasonality from your dataset to better understand the trend; the result is known as a seasonally adjusted value.

There are two Decomposition models:

  • Additive: Trend + Seasonal + Noise. This is used when the size of the seasonal swings stays constant over time. Example below:
additive model source
  • Multiplicative: Trend * Seasonal * Noise. This is used when the seasonality scales with the trend; the time series is the product of trend, seasonality and noise. Example below:
multiplicative model source
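To illustrate the additive model, here is a hand-rolled sketch of a classical decomposition using NumPy and pandas. It assumes monthly data with a 12-month season and builds a synthetic series from a known trend, seasonal pattern and noise; it is not a replacement for a library routine, just a way to see the three components fall out.

```python
import numpy as np
import pandas as pd

period = 12                      # assumed season length (months)
t = np.arange(period * 4)        # four years of monthly observations
rng = np.random.default_rng(seed=3)

# Synthetic additive series: Trend + Seasonal + Noise
trend_true = 0.5 * t + 10
seasonal_true = 5 * np.sin(2 * np.pi * t / period)
noise = rng.normal(0, 0.2, size=t.size)
series = pd.Series(
    trend_true + seasonal_true + noise,
    index=pd.date_range("2020-01-01", periods=t.size, freq="MS"),
)

# 1. Trend: centered moving average over one full season.
#    For an even period the two end points get half weight (a "2 x 12" average).
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
half = period // 2
trend = pd.Series(np.nan, index=series.index)
trend.iloc[half:-half] = np.convolve(series.to_numpy(), weights, mode="valid")

# 2. Seasonal: average the detrended values for each position in the season,
#    then center them so the seasonal component averages to zero.
detrended = (series - trend).to_numpy()
position = t % period
seasonal_means = np.array(
    [np.nanmean(detrended[position == m]) for m in range(period)]
)
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means[position], index=series.index)

# 3. Noise (residual) is whatever trend and seasonality do not explain
residual = series - trend - seasonal

# Seasonally adjusted values: the series with the seasonal component removed
seasonally_adjusted = series - seasonal
```

The trend is undefined for the first and last six months because the centered average needs a full season around each point. For a multiplicative series with strictly positive values, one common approach is to take logarithms first, which turns the product into a sum so the same additive steps apply.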

I hope this article helps you on your exploratory data journey or to help communicate your data.

Want more? I have a whole series on AWS built-in algorithms starting here and check out these articles on probability distributions, feature engineering and data imputation

tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots and bioinformatics.