Exploratory Data Analysis in Python

Priya Chaurasiya
5 min read · Apr 3, 2022


What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is the process of understanding a dataset by summarizing its main characteristics, often by plotting it visually. This step is especially important before we move on to modeling the data for machine learning. Plotting in EDA includes histograms, box plots, scatter plots, and many more. Exploring the data often takes considerable time. Through the process of EDA, we can also refine the problem statement or definition for our dataset, which is very important.

Checklist for EDA:

1. Checking the different features present in the dataset & its shape

2. Checking the data type of each columns

3. Encoding the labels for classification problems

4. Checking for missing values

5. Descriptive summary of the dataset

6. Checking the distribution of the target variable

7. Grouping the data based on target variable

Data Visualization:

8. Distribution plot for all the columns

9. Count plot for Categorical columns

10. Pair plot

11. Checking for Outliers

12. Correlation matrix

13. Inference from EDA

Understanding EDA with an interesting use case in Python:

Dataset: In order to understand EDA, we will work with the Breast Cancer Wisconsin (Diagnostic) Data Set. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image.

Using this dataset, we can build a classification system that predicts whether a tumor is benign or malignant. Malignant tumors are cancerous. In the EDA part, we will try to understand the characteristics of the data and its descriptive measures.

As a starter, let’s import the dependencies.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

The next step is to load the dataset from the CSV file into a pandas DataFrame:

breast_cancer_data = pd.read_csv('/content/data.csv')

1. Checking the different features present in the dataset:

For this, we can use the head() function in pandas, which prints the first five rows of the dataset:

breast_cancer_data.head()

Next, let's check the shape of the dataset, which gives the number of rows and columns:

breast_cancer_data.shape

(569, 32)

As we can see here, the dataset contains 569 rows (data points) and 32 columns (features).

The second column is “diagnosis”, where, “M” represents Malignant & “B” represents Benign. This is our Target column.
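As an aside, if the CSV file isn't at hand, the same Wisconsin dataset ships with scikit-learn, so the steps below can be followed along without it. A minimal sketch of loading it that way — note that the scikit-learn copy has no "id" column and already encodes the target numerically, with 0 = malignant and 1 = benign, the reverse of the encoding we will apply below:

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled copy of the Breast Cancer Wisconsin (Diagnostic) dataset
bunch = load_breast_cancer(as_frame=True)
df = bunch.frame  # 30 feature columns plus a 'target' column

print(df.shape)                  # (569, 31)
print(list(bunch.target_names))  # ['malignant', 'benign'] -> encoded as 0 and 1
```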

2. Checking the data type of each column and the non-null count:

breast_cancer_data.info()

I’ll include the first 10 rows of the output.

As we can see here, 'id' is an integer and the 'diagnosis' column is of type 'object', so it is a categorical variable, whereas the remaining columns are continuous numerical variables.
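A quick way to separate the categorical candidates from the numeric columns is pandas' select_dtypes. A sketch on a hypothetical three-column frame mirroring the structure described above (the values are made up):

```python
import pandas as pd

# Toy frame mirroring the dataset's structure (hypothetical values)
df = pd.DataFrame({
    'id': [101, 102, 103],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.99, 13.54, 12.45],
})

# Object-typed columns are the categorical candidates
categorical = df.select_dtypes(include='object').columns.tolist()
numeric = df.select_dtypes(include='number').columns.tolist()
print(categorical)  # ['diagnosis']
print(numeric)      # ['id', 'radius_mean']
```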

3. Encoding the labels for classification problems:

Now let’s encode the “diagnosis” column, so that all the columns are in the numerical format. We will encode “B” as 0 and “M” as 1.

label_encode = LabelEncoder()
labels = label_encode.fit_transform(breast_cancer_data['diagnosis'])
breast_cancer_data['target'] = labels
breast_cancer_data.drop(columns=['id', 'diagnosis'], inplace=True)

Here, we encode the "diagnosis" column, store the result in a new column called "target", and remove the "diagnosis" column. We also remove the "id" column, as it is not needed.
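A point worth noting: LabelEncoder assigns codes to classes in sorted order, which is exactly why "B" becomes 0 and "M" becomes 1. A tiny sketch on a hypothetical stand-in for the "diagnosis" column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'diagnosis' column
diagnosis = pd.Series(['M', 'B', 'B', 'M'])

encoder = LabelEncoder()
labels = encoder.fit_transform(diagnosis)

# LabelEncoder assigns codes in sorted class order, so 'B' -> 0 and 'M' -> 1
print(list(encoder.classes_))  # ['B', 'M']
print(list(labels))            # [1, 0, 0, 1]
```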

4. Checking for missing values:

Now, let’s check whether there are any missing values in the dataset.

breast_cancer_data.isnull().sum()

The above line of code shows how many missing values there are in each column. I have included the first few rows of the output here.

As we can see here, there are no missing values in this case. If a dataset does contain missing values, we can handle them either by dropping the affected rows or by replacing the missing entries; replacing them with a statistic such as the mean or median is called imputation.
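To illustrate both options, here is a sketch on a hypothetical one-column frame with a single missing value, dropping the affected row versus imputing with the median:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({'radius_mean': [17.99, np.nan, 12.45, 14.2]})

print(df.isnull().sum())  # radius_mean    1

# Option 1: drop rows containing missing values
dropped = df.dropna()
print(dropped.shape)      # (3, 1)

# Option 2: impute missing entries with the column median
filled = df.fillna(df['radius_mean'].median())
print(filled['radius_mean'].tolist())  # [17.99, 14.2, 12.45, 14.2]
```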

5. Descriptive summary of the dataset:

The next step is to get some statistical measures of the dataset. This is what we call "descriptive statistics", a summarization of the data. For this, we can use the describe() function in pandas.

breast_cancer_data.describe()

The main inference we can draw here is that, for most of the columns, the mean is larger than the median (the 50th percentile, shown as "50%" in the output). This indicates that those features are right-skewed. This will become visible when we create distribution plots for individual features in the data visualization part.
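The mean-above-median pattern is the classic signature of right skew, and we can verify it on synthetic data: a lognormal sample (an illustrative choice, not part of our dataset) is right-skewed by construction, so its long tail of large values pulls the mean above the median.

```python
import numpy as np

rng = np.random.default_rng(0)
# A lognormal sample is right-skewed: a long tail of large values
# pulls the mean above the median
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

print(sample.mean() > np.median(sample))  # True
```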

6. Checking the distribution of the target variable:

The next step is to check the distribution of the dataset with respect to the target variable, to see whether there is class imbalance. This step is exclusive to classification problems.

breast_cancer_data['target'].value_counts()

0    357
1    212
Name: target, dtype: int64

As we can see, there is a slight imbalance in the dataset (the number of benign (0) cases is greater than the number of malignant (1) cases). The imbalance is not severe enough to worry about in this case.
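For class balance it is often more readable to look at proportions rather than raw counts, via value_counts(normalize=True). A sketch on a hypothetical series with the same class counts as our dataset:

```python
import pandas as pd

# Hypothetical target column with the same class counts as the dataset
target = pd.Series([0] * 357 + [1] * 212)

counts = target.value_counts()
print(counts.tolist())  # [357, 212]

# Class proportions: roughly 63% benign vs 37% malignant
print(target.value_counts(normalize=True))
```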

7. Grouping the data based on target variable:

This step is also exclusive to classification problems. We group the dataset by the target variable, with 0 and 1 representing benign and malignant respectively, and compute the mean value of every column for each group.

breast_cancer_data.groupby('target').mean()

This clearly shows that the mean value of most features is greater for malignant cases than for benign cases. This inference is very important.
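The groupby logic above can be sketched on a hypothetical miniature of the dataset, with one feature and a binary target (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the dataset: one feature, binary target
df = pd.DataFrame({
    'radius_mean': [12.0, 13.0, 18.0, 20.0],
    'target':      [0,    0,    1,    1],
})

# Mean of every column within each target group
group_means = df.groupby('target').mean()
print(group_means)
# Malignant (1) rows have a larger mean radius than benign (0) rows
print(group_means.loc[1, 'radius_mean'] > group_means.loc[0, 'radius_mean'])  # True
```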

Inferences so far:

  • The dataset has 569 rows & 32 columns.
  • We don’t have any missing values in the dataset.
  • We could see that the data is right skewed for most of the features.
  • There is a slight imbalance in the dataset (Benign cases are more than Malignant cases).
  • The mean value of most features is greater for malignant cases than for benign cases.

NOTE: The EDA is not complete yet. We still have to do some data visualization to understand the data better.
