Exploring the Data: A Beginner’s Guide to EDA

Anwita Ghosh
15 min read · Apr 30, 2023


Generated on Canva by Anwita G

Here’s a question: what would you do if you were Sherlock Holmes, and a Bohemian Grand Duke offered to pay you handsomely to help him get incriminating material out of his ex-girlfriend’s hands? Here’s the thing: the Duke’s already tried to beg, borrow and steal, and not one of those has worked. Yet. How do we know that, reader? Well, Mr. Holmes’ methods of getting data out of a masked Duke are pure poetry, as are his skills at drawing inferences from limited data (deduction, he calls it) — but then you’d know that too, if you were reading ‘A Scandal in Bohemia’. Or any Arthur Conan Doyle story, really — the BBC show’s a whole scam.

But we’re not Sherlock Holmes, reader. We don’t all run machine learning algorithms in our brains like Mr. Holmes does — we don’t even run enough in our brains to guess at which algorithm we need to use on our data. Here’s a secret: we’re never getting to the first part — that’s what our machines are for. We just need to worry about giving a machine its marching orders.

But all that’s the fancy stuff up there. Let’s add a base layer of cloth to our mannequin before getting to the fluff and the feathers, shall we? Now, imagine staring at five different datasets, each larger than the last, and the biggest of them longer than a freight train. Imagine all of that’s stock market data from several major markets around the world: they’re large, they’re squiggly, and if you’re not on to them in time, well… 🙈.

And so, we need to take a good look at our data. As in, I can stare at those datasets all I want, but that’s not going to be of any help. They’re just words and numbers, not actual information. To ‘get a good look’ at my data, I’ll need to go digging for insights — and the first layers of dirt I uncover would be called Exploratory Data Analysis — or EDA — in formal language.

As the name suggests, I’m trying to explore my data — getting trial packs of a few new brands of cosmetics and comparing them before buying a full-size pack of the one I like, if you’d like to call it that. Basically, given a dataset, I would like to know:

  • What are the features of this data set — what defines it?
  • Which variables in this dataset are related to each other?
  • The nature of the relationships between variables, and,
  • Which of these variables and relationships are important to the problem at hand?

To answer the above, we go through a number of steps, which together form the EDA process. Please note, I could keep adding steps to the EDA process forever: there are many insights to be drawn from data before getting into the actual problem-solving process. Our job is to decide which of these are important to our problem. And so, reader, I present the basic steps of a typical EDA process in a Jupyter Notebook.
Note that one can always switch some of these steps around, add/subtract a few, or even do things differently. Much depends on the type of data we’re using and the problem we’re trying to solve. Here’s a basic flow, nevertheless.

The Data

I have created a toy dataset — or a small, basic dataset to play around with. Though it is not real-world data (nor does it resemble real-world data, except for its tabular shape), the EDA for a real dataset would look quite similar to the flow given here.

Before we actually create the data, let’s import a few libraries to make our lives a little easier:

# Importing relevant libraries to Python

###### 1. Dealing with Dataframes
import pandas as pd

###### 2. Dealing with Arrays
import numpy as np

###### 3. Making Graphs
import matplotlib.pyplot as plt
import seaborn as sns

###### 4. Statistical Functions
from scipy import stats

These libraries exist to ease our way into dealing with our datasets in code.

Now, suppose a bank stores data on the loan applications it receives in a day, which includes the details of each applicant, and whether their loan was approved. That is, the dataset formed has columns titled:

  1. Name — A customer’s name
  2. Age — Their age
  3. Education Level — Whether the applicant has graduated High School (HS) or whether they have a College Degree (CD)
  4. Salary — How much they earn in a month
  5. Loan Status — Whether their loan was approved or not. This is a binary variable that takes the value 1 if the loan was approved, and 0 if it was not.

And so, we first create a dictionary and proceed to convert it into a DataFrame.

# Creating the dataframe

Data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
        'Education Level': ['HS', 'HS', 'CD', 'HS', 'CD', 'CD', 'CD', 'HS', 'CD', 'HS'],
        'Age': [22, 35, 21, 40, 44, 31, 50, 28, 55, 39],
        'Salary': [2500, 3000, 4900, 3700, 6000, 5200, 1900, 6700, 4200, 2400],
        'Loan Status': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]}
df = pd.DataFrame(Data)
df

Which gives us a dataset looking like this:

Created on Jupyter Notebooks by Anwita G

If we had a ready-made dataset handed to us by our bosses (or downloaded off the internet), we could just import it into Jupyter instead of building it by hand.
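
As a minimal sketch (assuming a purely hypothetical file called ‘loan_applications.csv’ sitting in the working directory), such an import might look like this:

# Importing a ready-made dataset (hypothetical file name)

df = pd.read_csv('loan_applications.csv')   # read the CSV file into a dataframe
df.head()                                   # peek at the first five rows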

Now that we have our data, we find a few basic details about it — like how large it is, the names of its columns, the types of data in these columns, whether these columns contain any empty spaces or not, and how many unique values these columns have.

All of this is before we start summarising the data.

Basic Properties of the Data:

  1. The Shape of the Data
    Or the number of rows and columns in the dataset. We can find this with a simple command:
# Size of our Dataframe: The Number of Rows and Columns

df.shape

This command returns a tuple of two numbers, the first of which gives us the number of rows in a dataset, and the second gives the number of columns.
From our little dataset, we get the shape:

Created on Jupyter Notebooks by Anwita G

This dataset has ten rows and five columns — a tiny dataset, by any standards, but okay. The ‘.shape’ command would count the rows and columns of larger datasets with just as much ease.

2. The Columns of Our Dataset
Now that the machine has counted our columns for us, let’s see what these columns are. This would give us an idea of which variables we’re dealing with. Regardless of how many columns we have, there is another simple command for listing their names.

# List of Column Names

df.columns

This returns an Index object containing the names of our columns. That is, the command produces the output:

Created in Jupyter Notebooks by Anwita G

And so, our columns are called ‘Name’, ‘Education Level’, ‘Age’, ‘Salary’, and ‘Loan Status’.

Next up, we take a look at the types of data in our columns.

3. The Types of Data in These Columns
Next, we find out what data types our columns hold. This will help us decide how to deal with each column: we wouldn’t treat columns holding categorical data the same way as columns holding numerical data. To find the data types, we type in the command:

# Data Types

df.dtypes

This gives the type of data in each column. Jupyter Notebook returns ‘object’ for a column containing non-numeric or categorical data, ‘int64’ or ‘int32’ for discrete, integer-type numbers, and ‘float64’ for decimals, where both ‘int’ and ‘float’ represent numeric data. For our dataset, we get the output:

Output in Jupyter Notebook, Anwita G

Which means, of all our columns, ‘Name’ and ‘Education Level’ contain non-numeric data, while ‘Age’, ‘Salary’ and ‘Loan Status’ contain numeric data.

Next up, we check if our dataframe has any missing values or not.

4. Checking for Null Values
Null values are another name for missing values in the dataframe. That is, they are the empty spaces — or holes — in the dataframe, which Python often reads as ‘NaN’. These empty spaces often hinder analysis, as most software does not know what to do with them.
First, let’s find out if there are any missing values:

# Presence of Null Values

val = df.isnull().values.any()
val

This would return a Boolean value based on whether there are any null values in the data. A value of ‘True’ indicates the dataset does, indeed, have missing values, while a value of ‘False’ indicates that there are no ‘holes’ in the dataset. From our little dataset, the above command returns:

Output on Jupyter Notebooks, Anwita G

That is, we find ourselves with a value of ‘False’, meaning our dataset has no null values. If we’d like to say that in numbers, we can type in the command:

# Number of Null Values

df.isnull().values.sum()

Which gives the total number of null values in our data. Thus, the output for our data frame would be:

No. of null values, Jupyter Notebooks, Anwita G

That is, we have exactly 0 missing values or empty spots in our data. Since we don’t have any null values, we don’t need to dwell on them any further.

However, if a dataset does have any null values, there are a few ways of dealing with them.

a. Dropping them
One of the most convenient things we could do about null values is drop them with a single command:

# Drop Null Values

df.dropna()

This command boots the whole row containing the empty spot right off the table. That is, if there are null values in multiple rows of the data, all of those rows get dropped in their entirety. In other words, we also lose whatever other information those rows might have provided. (Note that ‘.dropna()’ returns a new dataframe; the original stays as it is unless we assign the result back or pass ‘inplace=True’.)

Therefore, the wholesale dropping of null values might not be a very good idea. That is, unless they’re such a small part of the dataframe that losing the extra information their rows would provide wouldn’t have much effect on the rest of the analysis.

b. Filling them in
A fairly common way of dealing with null values is to fill them in. While we might not know what values these empty spots were originally supposed to contain, we may as well make our best guesses, if only to avoid the information loss from dropping them altogether.
Standard practice is to fill empty spots in numeric columns with the column’s arithmetic mean or median, and those in non-numeric columns with the column’s mode (i.e. its most frequently appearing value).
Sometimes, one may also fill these empty spots with ‘0’ out of convenience. However, the presence of these zeroes may affect the summary statistics that can be generated from these columns, leading to incorrect values and false inferences.
We can fill the empty spaces caused by null values using the command:

# Fill Empty Spots

df.fillna(0)   # 0 here is just a stand-in fill value

Where we put the value we want to fill our empty spots with inside the brackets (here, 0 as a stand-in).

Note that Python doesn’t always recognise missing values, especially if they’re represented by characters like ‘?’ or ‘-’, as it reads them as data points in the columns. In such cases, we need to manually convert these characters into null values before dealing with them.
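
To make that concrete, here is a rough sketch of how one might convert such stand-in characters and fill the gaps using the mean/mode convention above. Our toy dataframe has no holes, so these lines would change nothing, but they illustrate the idea:

# Filling null values: a sketch (our toy data has none)

df = df.replace('?', np.nan)                                       # turn stray '?' markers into proper NaNs
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())            # numeric column: fill with its mean
df['Age'] = df['Age'].fillna(df['Age'].median())                   # numeric column: fill with its median
df['Education Level'] = df['Education Level'].fillna(df['Education Level'].mode()[0])   # categorical column: fill with its mode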

5. Unique Values in the Dataset
Sometimes, we’re interested in the number of unique values our columns contain, for a variety of reasons: figuring out how many categories we may divide our data into based on a particular column, among other uses. A common way of doing this is to use a lambda function (i.e. a small anonymous function) like so:

# Unique Values

df.apply(lambda x: len(x.unique()))

Which gives us the output:

Output on Jupyter Notebooks, Anwita G

Thus, ‘Education Level’ and ‘Loan Status’ can take 2 unique values each, while each value in the other columns is unique.
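
As an aside, pandas also has a built-in shorthand for the same count, which we could use instead of the lambda:

# Unique values, the shorthand way

df.nunique()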

Statistical Exploration of Data

Now that we have a basic understanding of our data, let’s dig a little deeper and explore our dataset using some basic statistical methods.

  1. Descriptive Statistics
    Descriptive statistics provide a quick summary of our data. The ‘.describe()’ method in pandas applies basic statistical computations to the dataset, like the column-wise count of values, extreme values, mean, etc. Once it has these values, the command presents them in a single table.
    By default, the command skips the missing values in the data while computing the summary statistics of the data.
    For our data, we have:
# Summary Statistics

df.describe().T

Here, I’ve transposed the table (i.e. presented the rows of the original table as columns and vice versa), so that the summary statistics for each variable appear as a single row. Thus, we get the table for our dataset:

Note that the table shows summary statistics for only the numeric columns in the data. In the table, ‘std’ is the column for the standard deviation (a measure of ‘spread’ of data) of a variable, ‘25%’ shows the first quartile, ‘50%’ shows the second quartile or the median, and ‘75%’ shows the third quartile of our data.

We can construct boxplots to show all of these values for a column on a single graph.
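
As a small illustration, a boxplot of, say, the ‘Salary’ column could be drawn with the seaborn library we imported earlier; a minimal sketch:

# Boxplot of Salary: median, quartiles and potential outliers in one picture

sns.boxplot(x = df['Salary'])
plt.show()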

2. Skewness
The skewness is a measure of deviation from symmetry of a variable about its mean. That is, it tells us whether most of the data in a column is clustered around its mean, like in a normal distribution, or away from it to the left or to the right.

# Skewness

df.skew(numeric_only = True)   # skewness of the numeric columns

Which gives the output:

Skewness, Jupyter Notebooks, Anwita G

We see that the skewness for ‘Loan Status’ is exactly zero. This needn’t mean that ‘Loan Status’ follows a normal distribution or has a bell-shaped histogram. In fact, the histogram for ‘Loan Status’ is nowhere near bell-shaped:

Generated on Jupyter Notebooks by Anwita G

In fact, as we can see in the histogram above, ‘Loan Status’ has exactly 2 values, namely 0 and 1, which are located at equal distances from their mean of 0.5. This gives us a skewness of 0.
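
For reference, a histogram like the one above can be produced with something along these lines:

# Histogram of Loan Status

df['Loan Status'].hist()
plt.xlabel('Loan Status')
plt.ylabel('Frequency')
plt.show()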

Moreover, the skewness values for ‘Age’ and ‘Salary’ are small positive numbers, i.e. their distributions are close to symmetric, with a slightly longer tail to the right of the mean.

3. Kurtosis
Kurtosis is a measure of the ‘tailed-ness’ of data w.r.t. a normal distribution, i.e. whether the tails of the distribution are fat (indicative of a larger number of outliers) or thin (indicative of a smaller number of outliers).
It can be seen in the relative ‘flatness’ of the curve w.r.t. the normal bell curve: a flatter curve has a negative kurtosis, indicating fewer points located close to the mean, while a sharper, taller curve has a positive kurtosis, indicating more points located close to the mean. (Pandas reports excess kurtosis, i.e. kurtosis measured relative to the normal distribution, which scores exactly 0.)
We can find the kurtosis for our dataset as follows:

# Kurtosis

df.kurt(numeric_only = True)   # kurtosis of the numeric columns

And we get the values:

Kurtosis, Jupyter Notebooks, Anwita G

Thus, we see that the kurtosis for ‘Age’, ‘Salary’ and ‘Loan Status’ is negative, meaning their distributions are flatter than the normal distribution or that fewer data points are located near the mean.

4. Outliers
Next, we come to outliers. These are data points that differ significantly from the others in the dataset.
Outliers may occur for a variety of reasons, from genuinely novel, never-seen-before observations to plain measurement error. Either way, they often have serious effects on the analysis, such as dragging the mean of the data towards them.
A standard method for catching outliers is to look for the data points that lie beyond a distance of 1.5 times the inter-quartile range (or the IQR — i.e. the difference between the first and third quartiles) below the 1st quartile, and 1.5 times the IQR above the 3rd quartile.
This process can be demonstrated by:
4.a. The 1st Quartile:
The command:

# First Quartile

Q1 = df.quantile(q = 0.25, numeric_only = True)
Q1

Gives us the first quartile (or 25th percentile) of the numeric columns in the data, which is:

Q1, Jupyter notebooks, Anwita G

4.b. The Third Quartile
A command similar to the one above,

# 3rd Quartile

Q3 = df.quantile(q = 0.75, numeric_only = True)
Q3

Gives the 3rd Quartile (or 75th percentile) of the numeric columns of the data, which is:

Q3, Jupyter Notebooks, Anwita G

4.c. The Inter-Quartile range
The Inter-Quartile Range (or IQR) is simply the third quartile minus the first quartile. That is,

# IQR

IQR = Q3 - Q1
IQR

Which gives us,

IQR, Jupyter Notebooks, Anwita G

The IQR is an approximation of the ‘spread’ of the middle fifty percent of the data (that is, the values closest to the median). It may also be called the ‘midspread’.

4.d. 1.5 * IQR
1.5 times the IQR is simply a constant used for discerning outliers. We subtract this number from the first quartile, and add this number to the third quartile to get a ‘range’ for the ‘regular’ data. Anything beyond this range is suspected to be an outlier.
Multiplying our IQR by 1.5, i.e.

# 1.5 times IQR

limit = 1.5 * IQR
limit

We get the values of 1.5*IQR (or ‘limit’) for each column:

1.5*IQR, Jupyter Notebook, Anwita G

Now, subtracting this from the first quartile,

# "lower limit"

Lower_lim = Q1 - limit
Lower_lim

We get the ‘lower limit’ for our ‘regular’ data, i.e.

Lower limit, Jupyter Notebooks, Anwita G

And adding 1.5*IQR to the third quartile,

# "upper limit"

Higher_lim = Q3 + limit
Higher_lim

We get the ‘upper limit’ for our ‘regular data’:

Upper limit, Jupyter Notebooks, Anwita G

Anything beyond these ‘limits’ is considered an outlier. Such points can be counted by:

# Outliers

num = df.select_dtypes(include = 'number')    # compare only the numeric columns
Outliers = num[num < Lower_lim].count() + num[num > Higher_lim].count()
Outliers

Which gives us:

Count of Outliers, Jupyter Notebooks, Anwita G

This means that our data has no outliers in any column. That is, no point ‘stands out’ from the rest of the data. However, if there are outliers in the data, we may deploy one of many methods to deal with them.
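
For instance, one simple option (among many) is to cap, or ‘winsorise’, any stray values at the limits we just computed; a rough sketch, touching only the continuous numeric columns:

# One way to treat outliers: cap them at the IQR-based limits
# (the binary 'Loan Status' column is left alone)

for col in ['Age', 'Salary']:
    df[col] = df[col].clip(lower = Lower_lim[col], upper = Higher_lim[col])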

5. The Correlation Matrix
Until now, we have been looking at one variable at a time. And now, we look at two.
The coefficient of correlation (or, more accurately, Pearson’s correlation coefficient), denoted by ‘r’, gives the strength of the linear relationship between two variables. In other words, the correlation coefficient between two variables tells us how strongly, and in which direction, they tend to move together.
However, please note that the correlation coefficient only gives the strength and direction of the linear relationship between variables. It doesn’t imply which variable depends on the other.
The correlation matrix shows the correlation coefficients between all possible pairings of the variables in our data.
The correlation matrix for our data is given by:

# Constructing the correlation matrix

cm = df.corr(numeric_only = True)   # correlations between the numeric columns
cm

Which gives us the correlation matrix:

Correlation Matrix, Jupyter Notebooks, Anwita G

Thus, ‘Loan Status’ has a fairly strong negative correlation with ‘Age’ (i.e. in this toy data, older applicants were less likely to have their loans approved), and a very small correlation with ‘Salary’ (i.e. the two are practically uncorrelated). ‘Age’ and ‘Salary’ have a weak negative correlation (i.e. older applicants earn a little less, but the relationship is weak).
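
Since a table of coefficients can be hard to eyeball, a common follow-up (and one of the reasons we imported seaborn) is to plot the matrix as a heatmap; a minimal sketch:

# Visualising the correlation matrix as a heatmap

sns.heatmap(cm, annot = True, cmap = 'coolwarm')
plt.show()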

And that’s a wrap, folks. Please note that this isn’t the end for EDA. There are many things that one can keep adding to the process, depending on the kind of problem they may want to solve, or the kind of data that has come to them. This may involve various kinds of plots and graphs, or even slightly more advanced processes like ANOVA.
Either way, the above process does give us some basic insight into our data before we delve into actual problem solving. In fact, one typically frames the formal problem statement (the precise, mathematical presentation of the problem to be solved) only after conducting EDA. All in all, EDA forms a crucial step in the problem-solving process, as it clarifies the exact problem to be solved.

That said, if you were looking for a basic understanding of the EDA process, I hope this article helped. If you have any feedback on this article, reader, don’t hesitate to let me know. And don’t forget to follow this blog for more!


Anwita Ghosh

Data Scientist in FinTech, PGP Data Science, M.Sc. Economics