Exploratory data analysis using Seaborn

Published in Analytics Vidhya, Jun 12, 2020

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The objective of data analysis:

To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of treatment and the number of positive lymph nodes.

Attribute information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute):
   1 = the patient survived 5 years or longer
   2 = the patient died within 5 years

Data Description:

The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Key Features

Seaborn is a statistical plotting library.
It has beautiful default styles.
It is designed to work well with pandas DataFrame objects.

Installing and getting started

To install the latest release of seaborn, you can use pip:

pip install seaborn

It's also possible to install the released version using conda:

conda install seaborn

Alternatively, you can use pip to install the development version directly from GitHub:

pip install git+https://github.com/mwaskom/seaborn.git

Another option is to clone the GitHub repository and install from your local copy:

pip install .

Dependencies

Python 2.7 or 3.5+

Mandatory dependencies

NumPy (>= 1.9.3)
scipy (>= 0.14.0)
matplotlib (>= 1.4.3)
pandas (>= 0.15.2)

Recommended dependencies

statsmodels (>= 0.5.0)

Libraries needed for the analysis
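The original notebook cell is not reproduced in the text, so the following is only a plausible sketch of the imports an analysis like this one relies on. The aliases and the `whitegrid` style are assumptions, not necessarily what the author used.

```python
# Assumed set of imports for the analysis (the original cell is not shown).
import numpy as np               # histogram / CDF computations
import pandas as pd              # loading and summarising the data
import matplotlib.pyplot as plt  # figure handling underneath seaborn
import seaborn as sns            # statistical plots

# Optional: pick one of seaborn's built-in styles (an assumption, any style works).
sns.set_style("whitegrid")

# Quick check that the installation succeeded.
print("seaborn version:", sns.__version__)
```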

Load the dataset using pandas
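The loading cell is also missing from the text. A minimal sketch might look like the one below. The file name `haberman.csv` is hypothetical, and an in-memory sample holding the ten rows visible in the printed output stands in for the real file so the snippet runs on its own. The column names are the ones shown in that output. The output in the article starts with `<bound method NDFrame.head ...>`, which suggests `head` was referenced without parentheses; calling `head()` prints only the first five rows.

```python
import io

import pandas as pd

COLUMNS = ["age", "year", "nodes", "status"]

# Stand-in for the real file: the ten rows visible in the article's output.
# With the full dataset you would pass a path instead, for example
# pd.read_csv("haberman.csv", header=None, names=COLUMNS)  # hypothetical file name
SAMPLE_CSV = """30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
75,62,1,1
76,67,0,1
77,65,3,1
78,65,1,2
83,58,2,2
"""

cancer = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

print(cancer.head())   # note the parentheses: head() is a method call
print(cancer.shape)
```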

<bound method NDFrame.head of      age  year  nodes  status
0     30    64      1       1
1     30    62      3       1
2     30    65      0       1
3     31    59      2       1
4     31    65      4       1
..   ...   ...    ...     ...
301   75    62      1       1
302   76    67      0       1
303   77    65      3       1
304   78    65      1       2
305   83    58      2       2

[306 rows x 4 columns]>

Data Preparation:

print(cancer.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     306 non-null    int64
 1   year    306 non-null    int64
 2   nodes   306 non-null    int64
 3   status  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB
None

Observations:

There are no missing values in the dataset, so no data imputation is needed.
The "status" column has an integer data type. It has to be converted into a categorical type.
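Both observations can be checked and acted on in a few lines. This is a sketch under stated assumptions: the ten-row sample stands in for the full dataset, and the labels "yes" and "no" are illustrative choices (the article later refers to status 1 as "yes").

```python
import pandas as pd

# Ten rows taken from the article's printed output (stand-in for the full data).
rows = [(30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
        (75, 62, 1, 1), (76, 67, 0, 1), (77, 65, 3, 1), (78, 65, 1, 2), (83, 58, 2, 2)]
cancer = pd.DataFrame(rows, columns=["age", "year", "nodes", "status"])

# Count missing values per column; all zeros means no imputation is required.
missing = cancer.isnull().sum()
print(missing)

# Map the integer codes to labels (assumed labels) and store them as a categorical.
cancer["status"] = cancer["status"].map({1: "yes", 2: "no"}).astype("category")
print(cancer["status"].dtype)
```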

High-level Statistics

              age        year       nodes      status
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000

target variable distribution
1    225
2     81
Name: status, dtype: int64
1    0.735294
2    0.264706
Name: status, dtype: float64
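The summary table and the two class-distribution listings above correspond to standard pandas calls. The sketch below shows them on the ten-row sample from the article's output, so the numbers it prints differ from the full-dataset figures quoted here.

```python
import pandas as pd

# Ten rows taken from the article's printed output (stand-in for the full data).
rows = [(30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
        (75, 62, 1, 1), (76, 67, 0, 1), (77, 65, 3, 1), (78, 65, 1, 2), (83, 58, 2, 2)]
cancer = pd.DataFrame(rows, columns=["age", "year", "nodes", "status"])

# count, mean, std, min, quartiles and max for every numeric column
summary = cancer.describe()
print(summary)

# Absolute and relative frequency of each class in the target column
counts = cancer["status"].value_counts()
shares = cancer["status"].value_counts(normalize=True)
print(counts)
print(shares)
```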

Observations:

The age of the patients ranges from 30 (min) to 83 (max), with a median of 52.
Although the maximum number of nodes observed is 52, nearly 75% of the patients have fewer than 5 nodes and at least 25% of the patients have 0 nodes.
The dataset is small (306 rows).
The status column is imbalanced: about 73% of the values are yes (status = 1).

2-D Scatter Plot
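The plotting cell for this section is not included in the text. A sketch of a 2-D scatter plot coloured by class could look like this. The choice of `age` against `nodes` is an assumption, and `sns.scatterplot` is used here as one convenient option rather than as the author's exact call.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ten rows taken from the article's printed output (stand-in for the full data).
rows = [(30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
        (75, 62, 1, 1), (76, 67, 0, 1), (77, 65, 3, 1), (78, 65, 1, 2), (83, 58, 2, 2)]
cancer = pd.DataFrame(rows, columns=["age", "year", "nodes", "status"])

fig, ax = plt.subplots(figsize=(6, 4))
# One point per patient, coloured by survival status.
sns.scatterplot(data=cancer, x="age", y="nodes", hue="status", ax=ax)
ax.set_title("age vs. nodes")
```

Outside a notebook, call `plt.show()` afterwards to display the figure.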

Observations:

Using the scatter plot alone, we can't distinguish much between the two classes.

Multivariate Analysis

A pair plot in seaborn draws a scatter plot for every pair of data columns in a given dataframe. It is used to visualize the relationship between two variables.

Pair Plot

Observations:

The scatter plot of year of treatment (year) against number of positive lymph nodes (nodes) shows better separation between the two classes than the other scatter plots.

Univariate analysis:

Distribution plots

* Distribution plots are used to visually assess how the data points are distributed with respect to their frequency.
* Usually the data points are grouped into bins, and the height of the bar representing each bin increases with the number of data points that lie within it (histogram).
* The probability density function (PDF) describes how likely the variable is to take a value near x; it can be seen as a smoothed version of the histogram.
* Here the height of each bar denotes the percentage of data points in the corresponding bin.

CDF
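The arrays printed in this section look like the bin edges and per-bin fractions of a 10-bin histogram for `age`, `year` and `nodes`. One way to compute them, and the cumulative distribution function (CDF) derived from them, is sketched below. The bin count of 10 is inferred from the 11 edges in the output, the rest is an assumption, and the ten-row sample will not reproduce the article's numbers.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Ten rows taken from the article's printed output (stand-in for the full data).
rows = [(30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
        (75, 62, 1, 1), (76, 67, 0, 1), (77, 65, 3, 1), (78, 65, 1, 2), (83, 58, 2, 2)]
cancer = pd.DataFrame(rows, columns=["age", "year", "nodes", "status"])

features = ["age", "year", "nodes"]
fig, axes = plt.subplots(1, len(features), figsize=(12, 4))
results = {}

for ax, column in zip(axes, features):
    counts, bin_edges = np.histogram(cancer[column], bins=10)
    pdf = counts / counts.sum()   # fraction of patients falling in each bin
    cdf = np.cumsum(pdf)          # running total of those fractions
    results[column] = (bin_edges, pdf, cdf)

    print(bin_edges)
    print(pdf)

    # Plot both curves against the right edge of each bin.
    ax.plot(bin_edges[1:], pdf, label="PDF")
    ax.plot(bin_edges[1:], cdf, label="CDF")
    ax.set_xlabel(column)
    ax.legend()

fig.tight_layout()
```

Reading the CDF at a given x gives the share of patients with a value up to x, which is how a statement like "almost 80% of patients have 5 or fewer nodes" can be checked.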

[30.  35.3 40.6 45.9 51.2 56.5 61.8 67.1 72.4 77.7 83. ]
[0.05228758 0.08823529 0.1503268  0.17320261 0.17973856 0.13398693
 0.13398693 0.05882353 0.02287582 0.00653595]
[58.  59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]
[0.20588235 0.09150327 0.08496732 0.0751634  0.09803922 0.10130719
 0.09150327 0.09150327 0.08169935 0.07843137]
[ 0.   5.2 10.4 15.6 20.8 26.  31.2 36.4 41.6 46.8 52. ]
[0.77124183 0.09803922 0.05882353 0.02614379 0.02941176 0.00653595
 0.00326797 0.         0.00326797 0.00326797]

Box Plots
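The box plot cell is not reproduced in the text. A minimal sketch, assuming one box plot per feature split by survival status, could be:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ten rows taken from the article's printed output (stand-in for the full data).
rows = [(30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
        (75, 62, 1, 1), (76, 67, 0, 1), (77, 65, 3, 1), (78, 65, 1, 2), (83, 58, 2, 2)]
cancer = pd.DataFrame(rows, columns=["age", "year", "nodes", "status"])

features = ["age", "year", "nodes"]
box_fig, box_axes = plt.subplots(1, len(features), figsize=(12, 4))

for ax, column in zip(box_axes, features):
    # Each box shows the median and quartiles of the feature for one class.
    sns.boxplot(data=cancer, x="status", y=column, ax=ax)
    ax.set_title(column)

box_fig.tight_layout()
```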

Violin plots
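A violin plot combines the summary of a box plot with a density estimate of the same data. The sketch below assumes the same layout as for the box plots, with one panel per feature split by status. The original cell is not shown.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ten rows taken from the article's printed output (stand-in for the full data).
rows = [(30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
        (75, 62, 1, 1), (76, 67, 0, 1), (77, 65, 3, 1), (78, 65, 1, 2), (83, 58, 2, 2)]
cancer = pd.DataFrame(rows, columns=["age", "year", "nodes", "status"])

features = ["age", "year", "nodes"]
violin_fig, violin_axes = plt.subplots(1, len(features), figsize=(12, 4))

for ax, column in zip(violin_axes, features):
    # The width of each violin reflects how dense the data is at that value.
    sns.violinplot(data=cancer, x="status", y=column, ax=ax)
    ax.set_title(column)

violin_fig.tight_layout()
```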

Observation:

For the survivors, the number of positive lymph nodes is densely concentrated between 0 and 5.
Almost 80% of the patients have 5 or fewer positive lymph nodes.
The patients treated after 1966 have a slightly higher chance of survival than the rest.
The patients treated before 1959 have a slightly lower chance of survival than the rest.

Conclusion

The scatter plot of year against nodes shows better separation between the two classes than the other scatter plots.

Written by Akshay J1n