Exploratory Data Analysis (EDA) — Part-2

Arun · Published in Geek Culture · Sep 3, 2021

In the previous part, we discussed the first few steps in the process of exploratory data analysis, such as identifying the variables and univariate analysis. Now we are going to look at bi-variate analysis and the remaining steps in EDA.

Bi-Variate Analysis

In bi-variate analysis we find out the relationship between two variables. The variables can be any combination of continuous or categorical variables, and different methods are used for different combinations.

Continuous and Continuous

Scatter plots are best for two continuous variables; the pattern of the scatter indicates the nature of the relationship between them. A scatter plot shows the relationship but not its strength. For that we use correlation, which varies between -1 and +1: -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and +1 indicates a perfect positive linear correlation between the variables.

From our example data set, we can see that there is almost no correlation between ‘Age’ and ‘Fare’, both of which are continuous variables.
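A minimal sketch (the file name titanic.csv and the DataFrame name df are assumptions, not from the original; df is reused in the snippets below):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed: the Titanic data set in a local CSV file
df = pd.read_csv("titanic.csv")

# Scatter plot of the two continuous variables
df.plot.scatter(x="Age", y="Fare")
plt.show()

# Pearson correlation between 'Age' and 'Fare'
print(df["Age"].corr(df["Fare"]))
```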

We can see that the correlation between ‘Age’ and ‘Fare’ is 0.096067, which is a very weak correlation.

Categorical and Categorical

There are various techniques that can be used to find the relation between two categorical variables.

Two-way table

In the two-way table method, we start analyzing the relationship by creating a two-way table of counts or count percentages, where the rows represent the categories of one variable and the columns represent the categories of the other variable.
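A sketch using pandas’ crosstab, reusing the df from above (‘Pclass’ vs. ‘Survived’ is the pair examined later in this article):

```python
import pandas as pd

# Two-way table of counts: rows are 'Pclass' categories,
# columns are 'Survived' categories
two_way = pd.crosstab(df["Pclass"], df["Survived"])
print(two_way)

# The same table as row-wise percentages
print(pd.crosstab(df["Pclass"], df["Survived"], normalize="index") * 100)
```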

Stacked Column Chart

The two-way table is not a visual method; to get a visual form of the two-way table, we use a stacked column chart.
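Continuing the sketch, the crosstab built above plots directly as a stacked column chart:

```python
import matplotlib.pyplot as plt

# Stacked column chart built from the two-way table
two_way.plot(kind="bar", stacked=True)
plt.ylabel("Count")
plt.show()
```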

Chi-squared test

This test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of the two-way table. It is essentially used to derive the statistical significance of the relationship between the variables. It returns a probability (p-value) for the computed chi-square statistic with the corresponding degrees of freedom.

  • If the probability is 0, the variables are dependent on each other.
  • If the probability is 1, the variables are independent of each other.
  • If the probability is less than 0.05, the relationship between the variables is significant at the 95% confidence level.

We are going to use the SciPy library for the chi-square test.
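A minimal sketch with scipy.stats.chi2_contingency, reusing the two-way table of ‘Pclass’ and ‘Survived’:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed frequencies of 'Pclass' vs 'Survived'
observed = pd.crosstab(df["Pclass"], df["Survived"])

# Returns the test statistic, the p-value, the degrees of
# freedom, and the table of expected frequencies
chi2, p_value, dof, expected = chi2_contingency(observed)
print(p_value)
```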

As we can see, the p-value is very small, so we can reject the null hypothesis and conclude that a passenger’s survival depends on their Pclass. Here we established a relationship between ‘Pclass’ and ‘Survived’, both of which are categorical variables.

Categorical and Continuous

Box plots are really efficient for exploring the relation between a categorical and a continuous variable. To assess the statistical significance, we can perform two types of tests:

Z-test/T-test

Both tests assess whether the means of two groups are statistically different from each other. The T-test differs from the Z-test in that it is used when the number of observations in the groups is small (fewer than 30 in either group).
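The article does not show which pair of groups was tested; as an illustrative sketch, a t-test on the mean ‘Fare’ of survivors versus non-survivors (this choice of groups is an assumption):

```python
from scipy.stats import ttest_ind

# Mean 'Fare' of survivors vs non-survivors (illustrative choice)
fare_survived = df.loc[df["Survived"] == 1, "Fare"].dropna()
fare_died = df.loc[df["Survived"] == 0, "Fare"].dropna()

t_stat, p_value = ttest_ind(fare_survived, fare_died)
print(t_stat, p_value)
```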

Anova

This test assesses whether the averages of more than two groups are statistically different. We can use scikit-learn’s f_classif from sklearn.feature_selection, which implements the ANOVA F-test.
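A sketch of the f_classif call, testing whether the mean ‘Age’ differs across the three ‘Pclass’ groups (the choice of variables is an assumption):

```python
from sklearn.feature_selection import f_classif

# Drop rows where 'Age' is missing before running the test
subset = df[["Age", "Pclass"]].dropna()

# f_classif returns an F statistic and a p-value per feature
f_stat, p_value = f_classif(subset[["Age"]], subset["Pclass"])
print(f_stat, p_value)
```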

After going through the various steps to understand the variables and find relationships between them, we need a way to treat the missing values of the variables, if any. Missing data can reduce the accuracy of a machine learning model and can lead to a biased model.

In our example data set, we found that the ‘Cabin’ column had almost 70% of its data missing, so we dropped that column from our data set. Now we will look at various other ways to deal with missing values.
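A quick sketch of how the missing fractions can be checked and the column dropped:

```python
# Fraction of missing values in each column
print(df.isnull().mean().sort_values(ascending=False))

# 'Cabin' is mostly missing, so we drop the whole column
df = df.drop(columns=["Cabin"])
```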

How to deal with missing values?

Missing values commonly enter a data set at two stages: data extraction and data collection. Errors at the data extraction stage are typically easy to find and correct, but errors that occur during data collection are harder to fix. Missing data arising during collection falls into four categories:

Missing completely at random

In this category, the probability of a value being missing is the same for all observations.

Missing at random

In this category, the proportion of missing values differs across the values or levels of the other input variables.

Missing that depends on unobserved predictors

In this category, the missing values are not random; rather, they are related to unobserved input variables (predictors that were not recorded in the data).

Missing that depends on the missing value itself

In this category, the probability that a value is missing depends on the (unobserved) value itself; for example, people with very high incomes may be less likely to report their income.

There are various methods to treat missing values. Some of them are:

Deletion

One way of treating missing values is to delete them from the data set. There are two types of deletion.

Listwise deletion

In listwise deletion, we delete every observation in which any variable is missing. This reduces the power of the machine learning model because it shrinks the sample size. If we applied listwise deletion to our data set, we would have to delete almost 70% of it.

To avoid that, we dropped the ‘Cabin’ column entirely, so the remaining observations keep the sample size reasonably large.
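For reference, listwise deletion in pandas is a single dropna() call; a sketch:

```python
# Listwise deletion: drop every row that has any missing value
complete_rows = df.dropna()
print(len(df), "->", len(complete_rows))
```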

Pairwise deletion

In pairwise deletion, we perform each analysis using all cases in which the variables of interest are present. The disadvantage of pairwise deletion is that it uses a different sample size for different variables.

Note that deletion methods are usually used only when the missing data is “Missing completely at random”; otherwise, non-random missing values can bias the model output.

Mean/Median/Mode Imputation

In this method we impute missing values with an estimate such as the mean, median, or mode. We use the mean or median for quantitative attributes and the mode for qualitative attributes.

We are going to work with our example and treat the missing values in the ‘Age’ variable. As we saw, there are 177 missing values in the ‘Age’ column.

If we look at the names of the passengers in the data set, we can observe that they contain titles like Mr., Mrs., Master, etc. We are going to extract these titles (Mr./Mrs./Miss/Master) from the passengers’ names and group them.
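A sketch of the extraction with a regular expression (the exact grouping of rare titles is an assumption):

```python
# Names look like "Braund, Mr. Owen Harris";
# capture the word between the comma and the period
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

# Keep the four common titles and lump the rest together
common = ["Mr", "Mrs", "Miss", "Master"]
df["Title"] = df["Title"].where(df["Title"].isin(common), "Other")
print(df["Title"].value_counts())
```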

Now let’s break down the relationship between titles and age using a box plot.
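A one-line sketch with pandas’ built-in boxplot:

```python
import matplotlib.pyplot as plt

# Box plot of 'Age' within each title group
df.boxplot(column="Age", by="Title")
plt.show()
```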

Now we impute the missing age values with the median ‘Age’ within each title group.
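A sketch using a grouped transform so the median is computed per title:

```python
# Fill each missing 'Age' with the median age of that passenger's title group
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))
print(df["Age"].isnull().sum())  # should print 0
```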

Now we do the same for the variables ‘Embarked’ and ‘Fare’, but since each has only one or two missing values, we simply use fillna() to impute the mode and the median respectively.
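A minimal sketch of both imputations:

```python
# 'Embarked' is categorical: impute the mode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# 'Fare' is continuous: impute the median
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
```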

Prediction Model

In this method we build a prediction model to estimate the values that will substitute for the missing ones. For this, we divide the data set into two parts: the rows with no missing values in the variable of interest become the training set, and the rows with missing values become the set we predict on.
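The article does not specify a model; as a hypothetical sketch, a linear regression that predicts ‘Age’ from a few numeric columns (the choice of model and predictors is an assumption, and this would be run instead of the median imputation above):

```python
from sklearn.linear_model import LinearRegression

# Illustrative predictors only; chosen here as an assumption
predictors = ["Pclass", "SibSp", "Parch", "Fare"]

train = df[df["Age"].notnull()]   # rows where 'Age' is known
missing = df[df["Age"].isnull()]  # rows where 'Age' is to be predicted

model = LinearRegression()
model.fit(train[predictors], train["Age"])
df.loc[df["Age"].isnull(), "Age"] = model.predict(missing[predictors])
```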

KNN Imputation

In the K Nearest Neighbor method, the missing value of an attribute is imputed using a given number of observations (neighbors) that are most similar to the observation whose value is missing. The similarity of two observations is determined using a distance function.

We can use scikit-learn’s KNNImputer to fill the missing values.
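A minimal sketch (KNNImputer works on numeric columns only; the column list is an assumption):

```python
from sklearn.impute import KNNImputer

# Numeric columns to impute jointly
numeric_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```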

We saw the various steps in exploratory data analysis and how the analysis can be used to treat missing values in the data set in order to improve our machine learning model. In the next part, we will learn more about EDA with outlier detection and treatment.
