An EDA checklist

10 steps to include in a successful EDA report

Karen Warmbein
DataSeries
5 min read · Apr 26, 2020


I’ll admit, I haven’t always understood what to do during an Exploratory Data Analysis, or EDA. Is there a specific set of individual analyses to include in an EDA? I don’t know of any hard and fast rules. However, I researched the topic and discovered 10 helpful steps to include in an EDA written with Python.

Image of John Tukey.

First, before we dive into the steps, I want to mention John Tukey (1915–2000), an American mathematician. He is known for developing the Fast Fourier Transform (FFT) algorithm, the box plot, and statistical tests that bear his name. He is also known for coining the term “bit” while working with John von Neumann at Bell Labs in the late 1940s, and he was involved in code-breaking during and after World War II.

Tukey didn’t start out in statistics or mathematics; he entered graduate school at Princeton for chemistry but soon switched to studying analysis and topology in the mathematics department. It was only after his time as a student at Princeton that he became interested in statistics. To quote Tukey about this time:

“By the end of late 1945, I was a statistician rather than a topologist.”

He credits this interest to a colleague he met while working at the Fire Control Research Office (housed at Princeton), which supported the work of the National Defense Research Committee. He remarked:

“It was Charlie [his colleague at the Fire Control Research Office] and the experience of working on the analysis of real data, that converted me to statistics.”

Tukey promoted the term EDA and its approach as early as the 1970s. A memoir notes:

“Tukey often likened EDA to detective work. The role of the data analyst is to listen to the data in as many ways as possible until a plausible “story” of the data is apparent.”

So how can we initially analyze and explore data to find the story it can tell? Here are 10 suggestions to get started.

1. Write and check assumptions about the data.

From my research, I discovered that in most companies the data is generally noisy and SQL-like languages are used to obtain it. Data scientists make a lot of assumptions about that data, based either on their understanding of the system or on knowledge circulating in the company, and those assumptions might not hold true. It is important to write them down and systematically check them.

This involves inverting the SQL queries so that a query returns an empty dataset when the assumption holds. For example, if we assume that a certain filter (say fur = 'brown') in query A returns only cats with brown fur, then we can write a query B that adds the negated filter (animal != 'cat'). If query B returns an empty dataframe, then no brown-furred non-cats exist, and the initial assumption is verified.
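
Here is a sketch of the same check in pandas; the animals dataframe and its animal and fur columns are hypothetical stand-ins for whatever your own queries return (and the brown-furred dog deliberately breaks the assumption):

    import pandas as pd

    # Hypothetical example data; in practice this would come from a SQL query.
    animals = pd.DataFrame({
        "animal": ["cat", "cat", "dog"],
        "fur": ["brown", "black", "brown"],
    })

    # Assumption: every row with brown fur is a cat.
    # The "query B" check: brown fur AND not a cat should return nothing.
    violations = animals[(animals["fur"] == "brown") & (animals["animal"] != "cat")]

    if violations.empty:
        print("Assumption holds: every brown-furred row is a cat.")
    else:
        print(f"Assumption violated by {len(violations)} row(s):")
        print(violations)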

2. View the data information.

After you load your data into a Pandas dataframe, start by inspecting it with .info(). The method prints the number of rows and columns, a list of all the column names, and each column's datatype and non-null count (see the short sketch after the list below).

Why? This will give you an initial insight into the data. In the next point (#3), I will talk about how you can view the first rows of a dataframe, but there are at least two behaviors there that can lead to wrong conclusions.

  • Perhaps your dataframe is too wide to print out all the columns. If that is the case, the Jupyter notebook will replace the middle columns with an ellipsis (...) instead of their contents.
  • A number, say 16.01, might be stored as an object (a string) instead of a float. If this discrepancy is not resolved, it will lead to errors when modeling!
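
A minimal sketch, where the file name exoplanets.csv is a hypothetical stand-in for your own dataset:

    import pandas as pd

    df = pd.read_csv("exoplanets.csv")  # hypothetical file name

    # Prints the row count plus each column's name, dtype, and non-null count.
    # Numbers that were read in as strings show up here with dtype 'object'.
    df.info()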

3. View the first and last rows in the dataframe.

View the first and last rows in the dataframe for an initial sense of the dataset and the information in the columns. You can view the first five rows in a dataframe with .head(); or, if you want to view a different number of rows (say 20), .head(20) will show the first 20 rows. Alternatively, .tail() will print the last five rows and .tail(2) will return the last two rows in the dataframe.
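
Continuing with the same hypothetical dataframe from step 2:

    df.head()    # first 5 rows (the default)
    df.head(20)  # first 20 rows
    df.tail()    # last 5 rows
    df.tail(2)   # last 2 rows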

4. View a sample of rows.

Pandas has a method that allows you to look at random rows of a dataframe: .sample(). By default, it returns a single row, but you can return more by specifying the number of rows as an argument.

Why? .head() or .tail() might give a false sense of security about the data. (A random sample might reveal missing data in rows that .head() and .tail() never show.)
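
For example (random_state is optional; it just makes the sample reproducible):

    # Draw 10 random rows from the dataframe.
    df.sample(10, random_state=42)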

5. View summary statistics of the data.

Use .describe() to view summary statistics of the data. Why? For numeric data, the result’s index will include the count, mean, standard deviation, min, and max, as well as the 25th, 50th (median), and 75th percentiles.
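
For example:

    # Summary statistics for the numeric columns.
    df.describe()

    # Include non-numeric columns too (count, unique, top, freq).
    df.describe(include="all")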

6. Check for missing values.

Use .isnull().sum() to count the missing values in each column of the dataframe.
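
For example:

    # Number of missing values in each column.
    df.isnull().sum()

    # The same as a fraction of all rows, sorted worst-first.
    df.isnull().mean().sort_values(ascending=False)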

7. Make a boxplot.

Looking at hundreds (or more) of numbers can be daunting, to say the least. It’s impossible for me to pick out outliers just by scanning the raw values. A boxplot will help you home in on outliers and give you more insight into your data.

A boxplot of the distances to stars with exoplanets. Notice that the distances are always greater than zero and there are numerous outliers in the data.
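
A sketch of how such a boxplot might be drawn; the distance column is a hypothetical name for the star-distance feature:

    import matplotlib.pyplot as plt

    # Boxplot of one numeric column; outliers appear as
    # points beyond the whiskers.
    df.boxplot(column="distance")
    plt.title("Distances to stars with exoplanets")
    plt.show()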

8. Create a heatmap.

Understand patterns in the relationships between attributes by computing a correlation matrix and visualizing it as a heatmap. Why? Again, it’s nearly impossible to spot patterns in raw numbers, especially in a large correlation matrix. Turning the matrix into a heatmap lets you spot the correlation between two attributes at a glance, by color.

A heatmap of correlations between exoplanets’ (and their stars’) attributes.
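
A sketch of one common way to build such a heatmap, using seaborn:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pairwise correlations between the numeric columns.
    corr = df.select_dtypes(include="number").corr()

    # annot=True writes each coefficient into its cell.
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()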

9. Make other plots that visualize the data.

A boxplot and a heatmap aren’t the only visualizations you can make. Histograms, scatterplots, trees, and geospatial maps are just some of the plots that can give more insight into your data.
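
For instance, two quick ones in pandas (mass and radius are hypothetical column names):

    import matplotlib.pyplot as plt

    # Histograms of every numeric column at once.
    df.hist(figsize=(12, 8), bins=30)
    plt.show()

    # A scatterplot of two hypothetical columns.
    df.plot.scatter(x="mass", y="radius")
    plt.show()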

10. Normalize the data.

Finally, prep the data by normalizing, scaling, or standardizing it. Why? If you plan to build a model with your data, many machine learning algorithms work better when features are on a similar scale.
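
A sketch with scikit-learn, assuming the missing values from step 6 have already been handled:

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    numeric = df.select_dtypes(include="number")

    # Standardize: rescale each feature to zero mean and unit variance.
    standardized = StandardScaler().fit_transform(numeric)

    # Or normalize: rescale each feature to the [0, 1] range.
    normalized = MinMaxScaler().fit_transform(numeric)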

Including those 10 steps in your EDA will springboard an initial understanding of the data you are working with. To summarize, here are the steps:

  1. Write and check assumptions about the data.
  2. View the data information.
  3. View the first and last rows in the dataframe.
  4. View a sample of rows.
  5. View summary statistics of the data.
  6. Check for missing values.
  7. Make a boxplot.
  8. Create a heatmap.
  9. Make other plots that visualize the data.
  10. Normalize the data.

Do you include any other methods in an EDA? Post them in the comments! Let’s chat!
