10 Things to do when conducting your Exploratory Data Analysis (EDA)

With Python

Alifia C Harmadi
Data Folks Indonesia
7 min read · Sep 26, 2021


Image by Adli Wahid on Unsplash

Before we start analysing data, whether manually or with computing tools, we always need to check and understand the data we have. The goal is to find out whether the data is sufficient and ready for the analytical process. As we already know, data is not always clean and ready to use. A lot of data in this world is still messy and inconsistent, with missing values, duplicates, class imbalance, and many other problems.

“Data is not always clean like Kaggle Dataset” - someone on the internet

Some of you may already have dealt with these issues and know well that real-world data is rarely as clean as a Kaggle dataset. That is why Exploratory Data Analysis (EDA) needs to be done beforehand. Through EDA we can investigate data sets, discover patterns, spot anomalies, test hypotheses, and check assumptions, and the process often relies on data visualisation. So, without further ado, let me guide you through the things you need to do for EDA with Python.

Exploratory Data Analysis (EDA) Steps with Python

To do Exploratory Data Analysis in Python, we need a few Python libraries such as NumPy, Pandas, Seaborn, and Matplotlib (which Seaborn builds on). The last two will be used for visualisation.
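A minimal setup for the examples below; the file name here is just a placeholder for your own dataset:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame (replace with your own file)
df = pd.read_csv('your_dataset.csv')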

1. Check data shape (num of Rows & Columns)

This can be done by simply using the code below.
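Assuming the data is already loaded into a DataFrame called df, a single attribute access is enough:

# Returns a (rows, columns) tuple
df.shape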

The output gives you the number of rows and columns in your dataset. In the example above, there are 821,534 rows and 14 columns. One of these 14 columns is the dependent variable, which will usually serve as the target column for machine learning later; the rest are mostly independent variables.

2. Check each data type of columns and missing values
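pandas' info() method covers both checks in one call, listing every column with its non-null count and dtype:

# Column names, non-null counts, and dtypes of the DataFrame
df.info()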

The output will usually be like this:

Source: Google Images

As we can see, there are 7,787 rows and 12 columns. Of these 12 columns, several have missing values: director, cast, country, date_added, and rating.

As for the data types, we only have object and integer here. However, if we look at the date_added column, it should be datetime, right? Not object. We will correct this data type to datetime in a later step.

3. Splitting values

On some occasions, we might want to split the values of a column. For example, suppose there is an address column that contains both city and country, and we want to split it into two separate columns: city and country.

This can be achieved by using the following code:
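A minimal sketch, assuming the address column holds comma-separated values like "Jakarta, Indonesia":

# Split "City, Country" into two new columns on the first comma
df[['city', 'country']] = df['address'].str.split(',', n=1, expand=True)

# Remove leading/trailing whitespace left over from the split
df['city'] = df['city'].str.strip()
df['country'] = df['country'].str.strip()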

4. Change the data type

We can use the astype() function from pandas. For example, say I want to change the data types of the Customer Number, IsPurchased, Total Spend, and Dates columns.
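A sketch using those column names; the target dtypes are my assumptions based on what the names suggest:

df = df.astype({
    'Customer Number': 'int',  # identifier stored as object
    'IsPurchased': 'bool',     # yes/no flag
    'Total Spend': 'float',    # monetary amount
})

# Dates are best parsed with pd.to_datetime rather than astype
df['Dates'] = pd.to_datetime(df['Dates'])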

5. Check the percentages of missing value

I personally do this often so that I have a clear reason either to drop the missing values or to deal with them. If the percentage of missing values is high and the column is not important, I sometimes just drop the column😬.
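A common recipe for this:

# Percentage of missing values per column, highest first
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct.sort_values(ascending=False))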

Source: Author

We can see that the SHOOTING and UCR_PART variables have the highest percentages of missing values, more than and nearly 50% of the total respectively. For this case, I just dropped these two columns, as they are also not important for the analytical process. For the remaining columns, such as OFFENSE_CODE_GROUP, I replaced the missing values with the value from OFFENSE_CODE, as it holds similar information.

6. Summary Statistics
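The describe() method from pandas produces the summary in one call:

# Count, mean, std, min, quartiles, and max for each numeric column
df.describe()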

Output:

Source: Google Images

This pandas function returns the count, mean, std, min, quartiles, and max. From this, you can already get a sense of the distribution of each numeric column and determine whether there are outliers.

7. Check value counts for a specific column

Here I want to see the count of each value in the Player column of the dataset.
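value_counts() does exactly this:

# Frequency of each distinct value in the Player column
df['Player'].value_counts()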

Output:

Source: Author

I sometimes use this function to check whether there are duplicate values in a column. We can see here that several players appear more than once.

8. Check duplicate values and deal with it

Once we know that several players occur more than once in the dataset, we need to investigate further using the following code. Take the player named “Ersan Ilyasova” as an example.
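A simple boolean filter pulls out every row for that player:

# Show all rows belonging to the duplicated player
df[df['Player'] == 'Ersan Ilyasova']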

Output:

Source: Author

This code shows the full row, every column, for each occurrence. Next, we need to determine which one we want to keep; it could be the latest value, the oldest, or any other. This can be done with the following code; for example, here I keep the first row.
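drop_duplicates() handles this:

# Keep only the first occurrence of each player and drop the rest
df = df.drop_duplicates(subset='Player', keep='first')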

9. See the data distribution and data anomaly

From the summary statistics earlier, we may already know which columns potentially contain anomalies. Here, we want to see the data distribution visually using the Seaborn library.
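A minimal sketch with Seaborn, assuming we are inspecting an age column as in the figure below:

# Histogram with a density curve to show the shape of the distribution
sns.histplot(df['age'], kde=True)
plt.show()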

Source: Google Images

If you want to measure the skewness and kurtosis of the distribution, you can use the code below.
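pandas provides both measures as Series methods:

# Positive skew means a longer right tail; kurtosis measures tail heaviness
print(f"Skewness: {df['age'].skew():.2f}")
print(f"Kurtosis: {df['age'].kurt():.2f}")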

We can see here that the age distribution is skewed to the right. Now, let's check for outliers in the total_bill column with a boxplot.
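A one-line boxplot with Seaborn:

# Points beyond the whiskers are potential outliers
sns.boxplot(x=df['total_bill'])
plt.show()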

Source: Google Images

It can be seen clearly that the outliers start after roughly 40. How to handle them is up to you: you can drop the outliers or apply a data transformation. Below are some options for transforming non-normal distributions.

Source: Google Images
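As a sketch of the usual options (log, square root, and Box-Cox; note that log and Box-Cox require strictly positive values), applied here to the total_bill column from the example above:

from scipy import stats

# Log and square-root transforms compress a long right tail
df['log_total_bill'] = np.log(df['total_bill'])
df['sqrt_total_bill'] = np.sqrt(df['total_bill'])

# Box-Cox searches for the power transform that best normalises the data
df['boxcox_total_bill'], _ = stats.boxcox(df['total_bill'])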

10. Check the correlation between variables in the data

Checking the correlation between variables is also necessary to identify potential features that we can use for further analysis or for building a model later. We can use a correlation matrix for this.

First, create a variable that holds the correlation matrix of the dataframe:
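# Pairwise Pearson correlations between the numeric columns
corr = df.corr()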

Next, we visualise the matrix with the Seaborn library:
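# Annotated heatmap; a diverging palette makes strong correlations stand out
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()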

Output:

Source: Author

As you can see, GDP per capita, Family, Life Expectancy, and Freedom each have a correlation >0.5 with Happiness_Score. This indicates that they are strongly associated with the target and are promising features.

I think that's all I can share for now: 10 things I usually do when conducting EDA, before starting to replace, transform, and modify the data to make it ready for further analysis.

You can see my other stories here:

https://alifiaharmd.medium.com/

And if you're interested in unsupervised learning, I wrote an article about it. Check it out here:

Thank You!!


Alifia C Harmadi
Data Folks Indonesia

A philomath & data engineer. Passionate about ML, DL & data storytelling.