Exploratory Data Analysis (EDA): Part 1

Arun
Published in Geek Culture
6 min read, Sep 1, 2021

We live in a world that is dominated by data. Data is everywhere and is more valuable than ever. We have years' worth of data waiting to be analyzed and put to use, so it is very important to understand the nature of the data we are working with. Exploratory Data Analysis, or simply EDA, is the process that helps us understand and analyze the intrinsic nature of the data.

Exploratory Data Analysis

“ Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. ” — IBM

Put simply, EDA is a set of data exploration techniques for understanding various aspects of the data. EDA is a very important step in machine learning because a better understanding of the data leads to better cleaning and feature choices, which in turn can improve the accuracy of the models.

Objectives of EDA

  1. Helps understand the underlying structure of the data set.
  2. Helps understand the intrinsic nature and patterns in the data set.
  3. Helps to find and clean the data set of redundancies and outliers.
  4. Helps to understand the relationship between the variables in the data set.

Steps in EDA

Exploratory Data Analysis involves several steps. We will use an example data set to further our understanding: train.csv from Kaggle's beginner competition "Titanic - Machine Learning from Disaster", in which we have to predict whether a passenger survived or not, given the rest of the data.

Identifying the Variables

This is the first step in EDA. We need to understand the variables in the data set: which are the input variables and which are the output variables? Then we need to understand the type of each variable: is it an integer, a float, or a string? Lastly, we need to determine whether each variable is continuous or categorical. Gender is an example of a categorical variable, while height is an example of a continuous variable.
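As a rough sketch of this step, we can separate the output variable from the inputs and make a first guess at which inputs are categorical. The rows below are made-up toy values that only mimic a few train.csv columns, and the `MAX_CATEGORIES` threshold is an arbitrary assumption for this illustration, not a rule.

```python
import pandas as pd

# Hypothetical toy rows (not real passenger records).
# With the real file you would load it instead: df = pd.read_csv("train.csv")
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
})

target = "Survived"                                # output variable
inputs = [c for c in df.columns if c != target]    # everything else is an input

# First guess: text columns, and numeric columns with very few distinct
# values, are treated as categorical; the rest as continuous.
# The threshold is tiny only because the toy frame has four rows.
MAX_CATEGORIES = 3
categorical = [
    c for c in inputs
    if not pd.api.types.is_numeric_dtype(df[c]) or df[c].nunique() <= MAX_CATEGORIES
]
continuous = [c for c in inputs if c not in categorical]

print("categorical:", categorical)
print("continuous:", continuous)
```

A split like this is only a starting point; a column such as Pclass is stored as an integer but behaves like a category, so the result should always be checked by eye.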

We can see that the data set has 12 columns:

PassengerId: the ID of the passenger.

Survived: whether the passenger survived (1) or not (0).

Pclass: the ticket class, a proxy for socioeconomic status, where 1 is upper class, 2 is middle class and 3 is lower class.

Name: the name of the passenger.

Sex: the gender of the passenger.

Age: the age of the passenger in years, which can be fractional.

SibSp: the number of siblings or spouses travelling with the passenger.

Parch: the number of parents or children travelling with the passenger.

Ticket: the ticket number.

Fare: the fare the passenger paid to board the Titanic.

Cabin: the cabin number.

Embarked: the port at which the passenger boarded, one of S (Southampton), C (Cherbourg) or Q (Queenstown).

We will now analyze the structure of the data set.

From the analysis we find that there are a total of 891 rows and 12 columns. There are 3 data types in the set, float, int and object.
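One way to get this kind of overview with pandas is sketched below. The frame is a made-up stand-in with a handful of columns; on the real train.csv the same calls would report the row and column counts and the data types mentioned above.

```python
import pandas as pd

# Hypothetical toy rows; replace with pd.read_csv("train.csv") for the real data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, None],
    "Cabin": [None, "C85", None],
})

# Number of rows and columns
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# How many columns use each data type
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)

# Column names, non-null counts and data types in a single report
df.info()
```

The non-null counts printed by `info()` are also the quickest way to notice columns with missing values.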

We see that our output variable is ‘Survived’ and the rest can be taken as our input variables.

We also observe that the Cabin column has values in only 204 of the 891 rows, which means about 77% of its values are missing; therefore we can drop the Cabin column from the data set. We also have missing values in the ‘Age’ and ‘Embarked’ columns.
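The missing-value check can be sketched as follows. The toy frame and the 0.7 cut-off are assumptions made for illustration; the only figures taken from the article are the 204 filled Cabin rows out of 891.

```python
import pandas as pd

# Hypothetical toy rows with gaps; replace with pd.read_csv("train.csv").
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

missing_count = df.isna().sum()     # missing cells per column
missing_share = df.isna().mean()    # same, as a fraction of all rows
print(pd.DataFrame({"missing": missing_count, "share": missing_share.round(2)}))

# Drop columns where more than a chosen share of values is missing.
# The cut-off is a judgment call, not a fixed rule.
THRESHOLD = 0.7
sparse_cols = missing_share[missing_share > THRESHOLD].index.tolist()
df_reduced = df.drop(columns=sparse_cols)
print("dropped:", sparse_cols)

# Arithmetic for the real Cabin column: 204 filled rows out of 891
cabin_missing_share = 1 - 204 / 891
print(f"Cabin missing share: {cabin_missing_share:.0%}")
```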

Univariate analysis

In this step, we take the variables one by one and analyze each of them. The method to perform this analysis depends on whether the variable is continuous or categorical.

We will be looking at each of the variables in our example data set.

Survived

First, we take our output variable, ‘Survived’, and plot the number of people who survived and did not survive in our training data set.

We find that about 340 people survived. Therefore the overall survival rate in the training set is about 38%.
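The counts behind such a plot, and the rate derived from them, can be computed directly. The series below is a made-up example; the last lines redo the arithmetic with the approximate figures from the article.

```python
import pandas as pd

# Hypothetical 0/1 outcomes; on the real data this would be df["Survived"].
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

counts = survived.value_counts()   # how many 0s and how many 1s
rate = survived.mean()             # the mean of a 0/1 column is the share of 1s
print(counts)
print(f"survival rate in the toy sample: {rate:.1%}")

# Article figures: roughly 340 survivors among 891 passengers
approx_rate = 340 / 891
print(f"approximate survival rate in train.csv: {approx_rate:.0%}")
```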

Pclass

Pclass is a predictor, or input variable, in this data set. When we draw a countplot() of Pclass using seaborn, we see that the lower class is by far the largest group, followed by the upper class and then the middle class.

If we plot Pclass against Survived, we see that lower-class passengers were less likely to survive, while upper-class passengers were more likely to survive.
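A minimal sketch of both views is shown below, using made-up rows. The numbers are computed with pandas, and the hypothetical helper `plot_pclass` draws the two charts with seaborn; it is one possible way to produce them, not necessarily how the original figures were made.

```python
import pandas as pd

# Hypothetical toy rows; replace with pd.read_csv("train.csv").
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 2, 3, 1, 3, 2],
    "Survived": [0, 1, 0, 1, 0, 1, 1, 0],
})

# What the count plot shows: passengers per class
class_counts = df["Pclass"].value_counts().sort_index()

# What "Pclass against Survived" shows: survival rate per class
survival_by_class = df.groupby("Pclass")["Survived"].mean()

print(class_counts)
print(survival_by_class)


def plot_pclass(data):
    """Draw the count plot and the survival-rate bar plot side by side."""
    # Imported here so the numeric part above runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, (ax_count, ax_rate) = plt.subplots(1, 2, figsize=(10, 4))
    sns.countplot(data=data, x="Pclass", ax=ax_count)
    # barplot shows the mean of y per x value, i.e. the survival rate
    sns.barplot(data=data, x="Pclass", y="Survived", ax=ax_rate)
    return fig
```

Calling `plot_pclass(df)` and then `matplotlib.pyplot.show()` would display the charts.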

Sex

When we count the men and women on the Titanic, we observe that there were fewer women than men on the ship.

When we plot Sex against Survived, we see that women had a much higher chance of survival than men: more than 70% of the women survived.
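For two categorical columns, a normalized cross-tabulation gives the same information as the plot in table form. The rows below are invented for illustration, so the percentages differ from the real data.

```python
import pandas as pd

# Hypothetical toy rows; replace with pd.read_csv("train.csv").
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Passengers per gender
sex_counts = df["Sex"].value_counts()

# Each row sums to 1: share of non-survivors (0) and survivors (1) per gender
survival_table = pd.crosstab(df["Sex"], df["Survived"], normalize="index")

print(sex_counts)
print(survival_table)
```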

Age

The previous variables were categorical, but Age is a continuous variable. For continuous variables we need to understand the central tendency and spread. From the histogram, we can observe that most of the passengers were aged between 20 and 40.

From the boxplot() of Age against Survived, we observe that younger passengers tended to have a higher probability of survival.
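Central tendency, spread and the histogram bins can also be read as numbers. The sketch below uses invented ages and 20-year bins chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy rows, including one missing age.
df = pd.DataFrame({
    "Age":      [4.0, 22.0, 25.0, 31.0, 38.0, 54.0, None, 29.0],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

age = df["Age"].dropna()

# Central tendency and spread in one summary
print(age.describe())
center = {"mean": age.mean(), "median": age.median()}
spread = {"std": age.std(), "iqr": age.quantile(0.75) - age.quantile(0.25)}

# A histogram as a table: number of passengers per age bin
bins = [0, 20, 40, 60, 80]
age_hist = pd.cut(age, bins=bins).value_counts().sort_index()
print(age_hist)

# The centre line of each box in a box plot is the median of that group
median_by_outcome = df.groupby("Survived")["Age"].median()
print(median_by_outcome)
```

With seaborn, `sns.histplot(data=df, x="Age")` and `sns.boxplot(data=df, x="Survived", y="Age")` would draw the corresponding charts.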

SibSp

We observe that, among passengers travelling with siblings or a spouse, those with exactly 1 sibling/spouse are the most numerous.

We now plot SibSp against Survived and observe from the chart that passengers with 1 sibling/spouse were more likely to survive than the other passengers.

Parch

From the chart we can observe that about three quarters of the passengers (roughly 76%) travelled without parents/children.

We also observe that passengers travelling with parents/children were more likely to survive.
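SibSp and Parch are both small counts, so the same summary works for each of them. The helper `summarize` below is a hypothetical function written for this sketch, and the rows are made up.

```python
import pandas as pd

# Hypothetical toy rows; replace with pd.read_csv("train.csv").
df = pd.DataFrame({
    "SibSp":    [0, 1, 0, 1, 2, 0, 0, 1],
    "Parch":    [0, 0, 0, 1, 2, 0, 0, 0],
    "Survived": [0, 1, 0, 1, 0, 1, 0, 1],
})


def summarize(data, column, target="Survived"):
    """Passengers, share of all passengers and survival rate per value of `column`."""
    out = data.groupby(column)[target].agg(passengers="count", survival_rate="mean")
    out["share"] = out["passengers"] / len(data)
    return out


sibsp_summary = summarize(df, "SibSp")
parch_summary = summarize(df, "Parch")
print(sibsp_summary)
print(parch_summary)
```

The `share` column answers questions such as what fraction of passengers travelled without parents/children, while `survival_rate` is what the second chart displays.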

Fare

This is yet another continuous variable in our data set. When we plot the distribution chart for Fare, we see that the distribution is right-skewed.

We also observe that the people who survived paid relatively higher fares than those who did not.
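Right skew can be checked without a chart: the skewness is positive and the mean sits above the median. The fares below are invented values used only to illustrate the check.

```python
import pandas as pd

# Hypothetical toy rows with one very large fare to create a long right tail.
df = pd.DataFrame({
    "Fare":     [7.25, 7.9, 8.05, 13.0, 26.0, 71.3, 263.0, 10.5],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

fare = df["Fare"]

# Positive skewness means a long tail towards large values
skewness = fare.skew()
print(f"skewness: {skewness:.2f}")
print(f"mean: {fare.mean():.2f}, median: {fare.median():.2f}")

# The median is robust to the tail, so it is a fair way to compare groups
median_by_outcome = df.groupby("Survived")["Fare"].median()
print(median_by_outcome)
```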

Embarked

We observe that most of the passengers boarded the Titanic at port S (Southampton).

We observe that passengers who embarked at port C were more likely to survive.

We will learn more about EDA in the next part, where we will discuss bivariate analysis, missing value treatment, outlier treatment, variable transformation, variable creation, etc.

Read the second part here,
