Complete Exploratory Data Analysis using Python

Learn everything you need to know about exploratory data analysis using Python.

10 min read · Sep 6, 2021

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics.

With EDA, we can understand a dataset more easily, find patterns, identify outliers and explore the relationships between variables using both non-graphical and graphical techniques.

EDA also helps us choose which features to use in our machine learning model (also known as feature selection).

EDA is therefore a very important part of a data science project: it is how we understand the data and build intuition about each variable.

In this post, I will be focusing on step-by-step exploratory data analysis to explain EDA smoothly and concisely.

We will use the Medical Cost Personal Dataset to perform our analysis. You can find the dataset on Kaggle.

Let’s get started!

Understanding Business Case
Variable Description
Data Understanding
Data Cleaning
Data Visualization

1. Understanding Business Case

Our business case is to predict customer charges for an insurance company based on the given variables, so that the company can decide how much to charge each customer.

2. Variable Description

After understanding the business case, we need to know our variables before analyzing them, so that we have a clear understanding as we go further.

I skipped the data gathering step since we already have ready-to-use data from Kaggle :)

Age: Age of the primary beneficiary
Sex: Gender of the insurance contractor (female, male)
BMI: Body mass index, an objective index of body weight relative to height (kg / m ^ 2), calculated as weight divided by height squared. It indicates whether weight is relatively high or low for a given height; ideally 18.5 to 24.9
Children: Number of children covered by health insurance / Number of dependents
Smoker: Whether the beneficiary smokes
Region: The beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
Charges: Individual medical costs billed by health insurance
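
To make the BMI definition concrete, here is a minimal sketch of the formula. The function name and the example person (70 kg, 1.75 m) are made up for illustration and are not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 1))  # 22.9, inside the 18.5 to 24.9 range
```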

3. Data Understanding

Now that we understand our business case and have the data we need in CSV format, the next step is to import the necessary libraries. I almost always use Jupyter Notebook for my analysis.

import numpy as np # linear algebra
import pandas as pd # data manipulation and analysis
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
sns.set_style('whitegrid') # set style for visualization
import warnings # ignore warnings
warnings.filterwarnings('ignore')

After importing our libraries, we load the dataset into a data frame named ‘df’.

df = pd.read_csv('insurance.csv')

Get the first 5 rows of the dataset.

df.head()

.head() returns the first 5 rows of the dataset. We can also use df.sample(5) to randomly select 5 rows or df.tail() to get the last 5 rows.

df.info()

The df.info() method prints information about the DataFrame, including the index and column data types, non-null counts, and memory usage.

We see that we have 7 variables and 1338 observations in the dataset. Every column has 1338 non-null values, so there seem to be no missing values in the data frame. We can also see the data type of each column.

df.shape

We see that our dataset has 1338 observations and 7 variables.

df.shape returns a tuple that represents the dimensions of the data frame. Note that shape is an attribute, not a method, so it is written without parentheses; df.shape() raises a TypeError.
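
As a small sketch of how the shape tuple can be used, the snippet below unpacks it into row and column counts. The data frame here is a tiny made-up one, not the insurance data.

```python
import pandas as pd

# Small made-up frame (not the insurance data)
toy = pd.DataFrame({"age": [19, 33, 45], "smoker": ["yes", "no", "no"]})

# shape is an attribute, so there are no parentheses after it
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} variables")
```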

df.columns

If we want to see only the variable names, we can use the df.columns attribute to get all of them.

df.describe()

The df.describe() method generates descriptive statistics for us. For numeric data, the result’s index will include count, mean, std, min and max, as well as the lower (25%), median (50%) and upper (75%) percentiles.

We can easily notice that the minimum age is 18 and the maximum age is 64. We can also see that the mean and median values of age are almost the same.

I also noticed that the maximum charge value is about 63,770, which might be an unusual value. We can investigate this in the data visualization part.

(With the describe method, we can also get a sense of whether our data is skewed by looking at the quantiles and comparing the mean with the median.)
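
Here is a rough sketch of that skewness check. The values are made up so that a single large number pulls the mean above the median; the variable names are only for illustration.

```python
import pandas as pd

# Made-up right-skewed values: one large value pulls the mean upward
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50])

summary = values.describe()
mean_minus_median = summary["mean"] - summary["50%"]
skewness = values.skew()  # sample skewness; positive suggests a right skew

print(summary[["mean", "50%", "max"]])
print("mean - median:", round(mean_minus_median, 2))
print("skewness:", round(skewness, 2))
```

A mean clearly above the median, together with a positive skewness, hints at a long right tail. It is a quick check rather than a formal test.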

df.describe(include='O')

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

We see that the most frequent value for sex is male, which appears 676 times in the dataset.

There are 4 unique regions in our dataset and the most frequent one is southeast, which appears 364 times.

Most people are non-smokers, with 1064 observations.

list(df.sex.unique())

We can also see the unique values of a discrete variable using the .unique() method.

4. Data Cleaning

In this part of the EDA, we will check for:

Missing Values
Duplicated Values

The purpose of data cleaning is to get our data ready to analyze and visualize.

df.isnull().sum()

Combining the .isnull() method with .sum() gives us the number of missing values for each variable.

Luckily, there are no missing values in this dataset. Once we have also checked for duplicates, we can proceed to analyze the data, observe patterns, and identify outliers with the help of visualization methods.
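
Since the insurance data has no gaps, here is a minimal sketch on a made-up data frame that shows what the same check looks like when values are missing. The column names mirror the dataset, but the numbers are invented.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps (the insurance data has none)
toy = pd.DataFrame({
    "age": [19, np.nan, 45, 52],
    "bmi": [27.9, 33.0, np.nan, np.nan],
    "smoker": ["yes", "no", "no", None],
})

missing_counts = toy.isnull().sum()        # missing values per column
missing_share = toy.isnull().mean() * 100  # the same information as a percentage

print(missing_counts)
print(missing_share)
```

Looking at the percentage as well as the count can help when deciding whether a column is worth keeping.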

df[df.duplicated(keep='first')]

The dataset has only one duplicated observation. We can simply drop this row using the drop_duplicates() method.

df.drop_duplicates(keep='first',inplace=True)

We pass two parameters inside the parentheses:

keep='first' keeps the first of the duplicated rows and drops the rest.

inplace=True modifies the data frame in place. Without it, drop_duplicates() returns a new data frame and leaves df unchanged, so the duplicated observation would stay in df unless we assigned the result back to it.
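
As an alternative to inplace=True, the result can be assigned to a new variable. The sketch below uses a made-up three-row frame to show that the original is left untouched in that case.

```python
import pandas as pd

# Made-up frame where the last row repeats the first
toy = pd.DataFrame({
    "age": [19, 33, 19],
    "region": ["southwest", "southeast", "southwest"],
})

# Count the rows that repeat an earlier row
n_duplicates = toy.duplicated(keep="first").sum()

# Without inplace=True a new frame is returned and toy stays as it was
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(len(toy), len(deduped))
```

Resetting the index afterwards is optional; it just avoids a gap in the row labels where the duplicate used to be.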

5. Data Visualization

A picture is worth a thousand words

Now we understand our dataset in general and have checked for missing values. We have also deleted the duplicated row from the data frame.

The next part of this journey is data visualization! Our goal is to perform univariate, bivariate and multivariate analysis to see the distributions of the variables and the relationships between them.

We will use the seaborn library for statistical data visualization. Seaborn is a data visualization library based on matplotlib, and it is my favorite for its ease of use.

5.1 Univariate Analysis

The purpose of the univariate analysis is to understand the distribution of values for a single variable.

We can perform univariate analysis in 3 ways:

Summary Statistics
Frequency Distribution Table
Charts (Boxplot, Histogram, Barplot, Pie Chart)

We will perform univariate analysis by using visualization techniques.

Univariate Analysis for Numerical Features

Charges

plt.figure(figsize=(10,6))
sns.histplot(df.charges,kde=True,stat='density',color='r')
plt.title('Charges Distribution',size=18)
plt.xlabel('Charges',size=14)
plt.ylabel('Density',size=14)
plt.show()

Let’s begin with the sns.histplot() function. With kde=True, it shows us a histogram of the variable with a kernel density estimate drawn on top. (Older tutorials use sns.distplot() for this, which is deprecated in recent seaborn versions.)

The histogram shows us how our variable is distributed.

On the other hand, kernel density estimation allows us to estimate the probability density function of a numerical variable, so we can see visually which values are more or less likely.

We see that our data is right (positively) skewed. Most of the charges are between 0 and 10,000 dollars.

Age

plt.figure(figsize=(10,6))
sns.histplot(df.age)
plt.title('Age Distribution',size=18)
plt.xlabel('Age',size=14)
plt.ylabel('Count',size=14)
plt.show()

We see that the most common ages are 18 and 19. Apart from that peak, the distribution looks roughly uniform.

BMI

plt.figure(figsize=(10,6))
plt.hist(df.bmi,color='y')
plt.title('BMI Distribution',size=18)
plt.show()

As seen in the code block, Matplotlib also gives us an option to create a histogram.

BMI looks roughly normally distributed. That’s what we expected, right? Most people have a BMI between 27 and 34.

Boxplot for Numerical Values

A boxplot is a standardized way of displaying a dataset based on its five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

It also helps us detect outliers using the IQR (interquartile range) method.
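
The five-number summary can also be computed directly. This is a small sketch on made-up values, using the pandas quantile method; the labels are just for readability.

```python
import pandas as pd

# Made-up values, not the insurance charges
values = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 30])

# Quantiles 0 and 1 are the minimum and maximum
five_numbers = values.quantile([0, 0.25, 0.5, 0.75, 1])
five_numbers.index = ["min", "Q1", "median", "Q3", "max"]
print(five_numbers)
```

Note that with seaborn's default settings the whiskers stop at the most extreme points within 1.5 times the IQR of the box, so a value like 30 here would be drawn as a separate point rather than as the end of a whisker.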

plt.figure(figsize = (10,6))
sns.boxplot(x=df.charges)
plt.title('Charges Distribution',size=18)
plt.show()

With a boxplot, we can easily see whether our variable has outliers. Outliers can be easily removed from our dataset, but we should think twice before removing any of them.

We need to examine them, or ask a domain expert, to decide whether they are really anomalies.

A common way to remove outliers is to use the IQR method.

Q1 = df['charges'].quantile(0.25)
Q3 = df['charges'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

IQR = 11911.37345

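After obtaining the interquartile range, we can use the IQR method to see the outliers or remove them from the dataset. An observation counts as an outlier if it lies more than 1.5 times the IQR below Q1 or above Q3.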

df[(df['charges']< Q1-1.5* IQR) | (df['charges']> Q3+1.5* IQR)]

Observations flagged as outliers according to the IQR method

Now we can easily detect outliers with a boxplot or with Python code. In this example, I will proceed with my analysis without removing the outliers.
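
The IQR rule can be wrapped in a small helper so it is easy to reuse on other columns. The function name iqr_bounds and the sample values are made up for this sketch; the multiplier 1.5 is the conventional choice used above.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values, not the insurance charges
values = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 30])

lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)       # -5.0 19.0
print(outliers.tolist())  # [30]
```

Flipping the condition, values[(values >= lower) & (values <= upper)], would keep only the non-outliers instead.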

Univariate Analysis for Categorical Features

A bar chart is commonly used to visualize categorical features. We will use the sns.countplot() function for the sex, children, smoker and region variables.

Gender

plt.figure(figsize=(10,6))
sns.countplot(x = 'sex', data = df)
plt.title('Total Number of Male and Female',size=18)
plt.xlabel('Sex',size=14)
plt.show()

The numbers of female and male customers are almost the same.

Children

plt.figure(figsize = (10,6))
sns.countplot(x='children', data=df)
plt.title('Children Distribution',size=18)
plt.xlabel('Children',size=14)
plt.ylabel('Count',size=14)
plt.show()

Having no children is the most common case: more people have 0 children than any other number.

Few people have 4 or 5 children.

Smoker

plt.figure(figsize = (10,6))
sns.countplot(x='smoker', data=df)
plt.title('Smoker Distribution',size=18)
plt.xlabel('Smoker',size=14)
plt.ylabel('Count',size=14)
plt.show()

The number of non-smokers is almost 4 times the number of smokers. We can also check the exact numbers in a non-graphical way.

Using the value_counts method, we can easily see the number of observations for each value of the variable.

df.smoker.value_counts()
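
If proportions are more useful than raw counts, value_counts also accepts normalize=True. The sketch below uses a made-up column with the same kind of yes/no labels.

```python
import pandas as pd

# Made-up column with yes/no labels
smoker = pd.Series(["no", "no", "yes", "no", "no", "yes", "no", "no"])

counts = smoker.value_counts()                # absolute counts, largest first
shares = smoker.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```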

Region

plt.figure(figsize = (10,6))
sns.countplot(x='region', data=df, palette='Blues')
plt.title('Region Distribution',size=18)
plt.xlabel('Region',size=14)
plt.ylabel('Count',size=14)
plt.show()

All four regions are almost equally distributed. The number of people from the southeast is slightly more than others.

5.2 Bivariate Analysis

Bivariate analysis is the analysis of exactly two variables. We will use bivariate analysis to find relationships between two variables.

For bivariate analysis, we usually use a boxplot (categorical vs numerical), a scatterplot (numerical vs numerical), or a contingency table (categorical vs categorical).
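
A contingency table for two categorical variables can be built with pd.crosstab. Here is a minimal sketch on a made-up frame that borrows the sex and smoker column names; the rows are invented.

```python
import pandas as pd

# Made-up rows using two categorical columns
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes", "yes"],
})

# Counts for every combination of the two variables
table = pd.crosstab(toy["sex"], toy["smoker"])

# The same table as row-wise proportions (each row sums to 1)
row_shares = pd.crosstab(toy["sex"], toy["smoker"], normalize="index")

print(table)
print(row_shares)
```

The row-wise version makes groups of different sizes easier to compare, since each row is scaled to sum to 1.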

Age vs Charges

plt.figure(figsize = (10,6))
sns.scatterplot(x='age',y='charges',color='r',data=df)
plt.title('Age vs Charges',size=18)
plt.xlabel('Age',size=14)
plt.ylabel('Charges',size=14)
plt.show()

A scatterplot is a type of data display that shows the relationship between two numerical variables.

We see that there is a weak positive relationship between age and charges. As age increases, charges also tend to increase slightly.

The .corr() method also quantifies the relationship between the two variables.

print('Correlation between age and charges is : {}'.format(round(df['age'].corr(df['charges']),3)))
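
To show what the correlation coefficient measures, the sketch below computes the Pearson coefficient by hand and compares it with the pandas result. The six rows are made up and only borrow the age and charges column names.

```python
import numpy as np
import pandas as pd

# Made-up rows, not the insurance data
toy = pd.DataFrame({
    "age": [18, 25, 33, 41, 52, 64],
    "charges": [1700, 2900, 4100, 6300, 9800, 14000],
})

# Pearson correlation as computed by pandas
r_pandas = toy["age"].corr(toy["charges"])

# The same coefficient from its definition:
# sum of products of deviations over the root of the product of squared deviations
x = toy["age"] - toy["age"].mean()
y = toy["charges"] - toy["charges"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r_pandas, 3), round(r_manual, 3))
```

Both numbers should agree, because Series.corr uses the Pearson method by default.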

Smoker vs Charges

sns.set_style('darkgrid')
plt.figure(figsize = (10,6))
sns.boxplot(x='smoker',y='charges',data=df)
plt.title('Smoker vs Charges',size=18)
plt.show()

The boxplot shows us that charges for smokers are higher than for non-smokers.

Using Pairplot for Numerical Values

A pair plot is another awesome method that shows us the relationship between each pair of numerical variables as well as the distribution of each variable.

sns.pairplot(df, 
                 markers="+",
                 diag_kind="kde",
                 kind='reg',
                 plot_kws={'line_kws':{'color':'#aec6cf'}, 
                           'scatter_kws': {'alpha': 0.7, 
                                           'color': 'red'}},
                 corner=True);

5.3 Multivariate Analysis

Correlation

Correlation is used to test relationships between quantitative variables or categorical variables. It’s a measure of how things are related. The table above shows us how we can interpret correlation coefficients.

As we said earlier, seaborn is an awesome library that helps us visualize our variables easily and clearly. The heatmap() function shows us the relationships between numeric variables.

There are different methods to calculate the correlation coefficient:

Pearson
Kendall
Spearman

We will combine the .corr() method with a heatmap so that we can see the relationships in a graph. The .corr() method uses Pearson correlation by default.
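
The choice of method matters when a relationship is monotonic but not linear. The sketch below compares Pearson and Spearman on made-up data where y is the cube of x. Kendall is left out here because, as far as I know, pandas relies on SciPy for it.

```python
import pandas as pd

# Made-up data: y grows with x, but not in a straight line
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]    # measures linear association
spearman = toy.corr(method="spearman").loc["x", "y"]  # measures rank (monotonic) association

print(round(pearson, 3), round(spearman, 3))
```

Spearman works on ranks, so a perfectly increasing relationship gets a coefficient of 1 even though the points do not fall on a line, while Pearson comes out lower.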

plt.figure(figsize = (10,6))
sns.heatmap(df.corr(numeric_only=True),annot=True,square=True,
            cmap='RdBu',
            vmax=1,
            vmin=-1)
plt.title('Correlations Between Variables',size=18)
plt.xticks(size=13)
plt.yticks(size=13)
plt.show()

The heatmap shows us there is a positive correlation between age and charges: as age increases, insurance charges also tend to increase.

We can also see that there is a weak correlation between BMI and charges.

There is almost no relationship between children and charges.

Conclusion

In this post, we examined our dataset using exploratory data analysis and tried to understand each variable as well as their relationships with each other.

The main purpose of EDA is to help us understand the data before making any assumptions. EDA helps us see distributions, summary statistics, relationships between variables, and outliers.

Not sure what to read next? I’ve picked my Customer Segmentation using K-Means and PCA in Python article for you!

Thanks for reading!

That’s all for now! I hope you found this post useful.

Are you interested in Data Science? Let’s connect on LinkedIn.