Our new friend is “EDA”

Ahmet Talha Bektaş
4 min read · Sep 30, 2022


This photo was designed by “Midjourney”

With this article, you will make a new friend named “EDA”. Let’s meet EDA. Dear reader, EDA stands for exploratory data analysis, and it is a data scientist’s best friend. EDA, this is my reader, who wants to be a good data scientist 😁.

Photo by Natalya Khartukova on Unsplash

Why is “EDA” important for us?

After reading the data, we can’t do anything useful with it until we draw some conclusions from it. Therefore, we need EDA to see the data, understand it, and make it clear.

Let’s find out what EDA is!

We will use plenty of functions, and I will use the “Titanic” dataset for all of them. You can find my notebook for this story on GitHub. Let’s start!

Photo by Nicolas Hoizey on Unsplash

.head()

head() shows you the first 5 rows of the data. If you write a number inside the parentheses, you will see that many rows from the top instead. For example, .head(15) gives you the first 15 rows (positions 0 to 14).

If you don’t know how to read data, you should read this article.

Let’s make an example!

import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()

Output:

df.head(15)

Output:
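If you don’t have the Titanic file at hand, here is a minimal sketch you can run anywhere. The `toy` DataFrame and its values are made up for illustration; they are not the real Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: 20 rows of made-up values.
toy = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Fare": [7.25 + i for i in range(20)],
})

first_five = toy.head()        # no argument: the first 5 rows
first_fifteen = toy.head(15)   # the first 15 rows

print(first_five)
print(len(first_five), len(first_fifteen))
```

If you ask for more rows than the data has, head() simply returns all of them.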

.tail()

tail() is very similar to head(), but it shows rows from the end of the data: the last 5 rows by default, or as many as the number you write inside the parentheses.

Let’s make an example!

df.tail()

Output:

df.tail(10)

Output:
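The same idea can be tried on a small made-up table. The `toy` DataFrame below is a hypothetical example, not the Titanic data.

```python
import pandas as pd

# Hypothetical 20-row table with made-up ids.
toy = pd.DataFrame({"PassengerId": range(1, 21)})

last_five = toy.tail()     # no argument: the last 5 rows
last_ten = toy.tail(10)    # the last 10 rows

print(last_five)
print(last_ten["PassengerId"].tolist())
```

Notice that the rows keep their original order and index labels; tail() does not reverse them.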

.sample()

sample() gives you one random row from the data. If you write a number inside the parentheses, it returns that many random rows.

Let’s make an example!

df.sample()

Output:

df.sample(15)

Output:
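Because sample() is random, you get different rows on every run. Here is a small sketch on a made-up `toy` DataFrame (not the Titanic data) that also uses the `random_state` parameter, which makes the result repeatable.

```python
import pandas as pd

# Hypothetical 20-row table with made-up ids.
toy = pd.DataFrame({"PassengerId": range(1, 21)})

one_row = toy.sample()                        # one random row
three_rows = toy.sample(3, random_state=0)    # three random rows, repeatable
same_again = toy.sample(3, random_state=0)    # same seed, same rows

print(one_row)
print(three_rows.equals(same_again))
```

By default the rows are drawn without replacement, so the same row never appears twice in one sample.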

.describe()

describe() calculates summary statistics (count, mean, standard deviation, min, quartiles, and max) for the numeric columns and gives them to you as a table.

Let’s make an example!

df.describe()

Output:
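To see what the result looks like, here is a minimal sketch on a made-up `toy` DataFrame (the values are invented for illustration). Note that the text column is skipped by default.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

stats = toy.describe()   # summary statistics, numeric columns only

print(stats)
print(stats.loc["mean", "Age"])   # pick a single statistic from the table
```

Since the result is itself a DataFrame, you can pick out any single value with .loc, as in the last line.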

.corr()

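Columns in a dataset are often related to each other. corr() calculates the pairwise correlation between the numeric columns and gives you the result as a table. In pandas 2.0 and later, pass numeric_only=True when the data also has text columns, as the Titanic data does; otherwise corr() raises an error.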

Let’s make an example!

df.corr(numeric_only=True)

Output:
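Here is a small sketch of how to read the correlation table, using a made-up `toy` DataFrame (the values are invented, not taken from the Titanic data). It assumes pandas 1.5 or newer, where `numeric_only` is available for corr().

```python
import pandas as pd

# Hypothetical data: ticket class, fare, and a text column.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 3],
    "Fare": [80.0, 20.0, 8.0, 70.0, 7.0],
    "Sex": ["female", "male", "male", "female", "male"],
})

# numeric_only=True leaves the text column "Sex" out of the calculation.
corr = toy.corr(numeric_only=True)

print(corr)
print(corr.loc["Pclass", "Fare"])   # negative: higher class number, lower fare
```

Values close to 1 or -1 mean a strong linear relationship, and values close to 0 mean a weak one. The diagonal is always 1 because every column is perfectly correlated with itself.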

.info()

info() gives us general information about the data. What is the data type of each column: object, integer, or float? How many values in each column are non-null?
info() answers these questions.

Let’s make an example!

df.info()

Output:
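info() prints its report instead of returning it. The sketch below runs it on a made-up `toy` DataFrame (hypothetical values) and also shows `dtypes` and `count()`, which give the same two facts as objects you can work with.

```python
import io

import pandas as pd

# Hypothetical data with one missing Age value.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

toy.info()   # prints the report to the screen

# The same report, captured as text through the buf parameter.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()

non_null = toy.count()   # non-null values per column
types = toy.dtypes       # data type of each column
print(non_null)
print(types)
```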

.isnull().sum()

isnull().sum() is actually two functions: isnull() and sum().
isnull() returns True or False for each value in the data: True if it is a NaN value, False if not.
sum() adds up the True values in each column, so we can easily see how many NaN values each column has.

Let’s make an example!

df.isnull().sum()

Output:
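To see the two steps separately, here is a minimal sketch on a made-up `toy` DataFrame (the values are invented for illustration).

```python
import pandas as pd

# Hypothetical data with some missing values.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

mask = toy.isnull()    # step 1: True where a value is missing
missing = mask.sum()   # step 2: True counts as 1, so this counts per column

print(mask)
print(missing)
```

The sum works because True is treated as 1 and False as 0 when they are added up.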

.shape

shape is an attribute, not a function, so you write it without parentheses. It tells us how many rows and columns the data has, as a (rows, columns) pair.

Let’s make an example!

df.shape

Output:

It shows us that there are 891 rows and 12 columns in the data.
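Because shape is a plain tuple, you can unpack it into two variables. A minimal sketch on a made-up `toy` DataFrame (hypothetical values, not the Titanic data):

```python
import pandas as pd

# Hypothetical data: 3 rows and 2 columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})

n_rows, n_cols = toy.shape   # unpack the (rows, columns) tuple

print(toy.shape)
print(n_rows, n_cols)
```

If you only need the number of rows, len(toy) gives the same value as toy.shape[0].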

Photo by david Griffiths on Unsplash

Now you are able to understand your data better with “EDA”.

But what will we do if we want to see only specific parts of the data instead of all the columns?

If you want to learn how, you should read May I take a “Filtering Data”?
