Exploratory data analysis using Seaborn

Akshay J1n
Published in Analytics Vidhya
5 min read · Jun 12, 2020

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The objective of data analysis:

To predict whether a patient will survive for 5 years or longer after surgery, based on the patient’s age, the year of treatment, and the number of positive lymph nodes.

Attribute information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
   1 = the patient survived 5 years or longer
   2 = the patient died within 5 years

Data Description:

The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Key Features

  • Seaborn is a statistical plotting library.
  • It has beautiful default styles.
  • It is designed to work well with pandas DataFrame objects.

Installing and getting started

To install the most recent release of seaborn, you can use pip:

pip install seaborn

It is also possible to install the released version using conda:

conda install seaborn

Alternatively, you can use pip to install the development version directly from GitHub:

pip install git+https://github.com/mwaskom/seaborn.git

Another option is to clone the GitHub repository and install from your local copy:

pip install .

Dependencies

Python 2.7 or 3.5+

Mandatory dependencies

NumPy (>= 1.9.3)
scipy (>= 0.14.0)
matplotlib (>= 1.4.3)
pandas (>= 0.15.2)

Recommended dependencies

statsmodels (>= 0.5.0)

In[1]:

libraries needed for analysis
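The code for this cell did not survive in the text. A typical import cell for this kind of analysis might look like the sketch below; the exact set of libraries the original notebook imported is an assumption.

```python
# Assumed imports for the analysis; the original cell is not shown in the article.
import numpy as np                 # histograms, PDF and CDF arithmetic
import pandas as pd                # loading and summarizing the dataset
import matplotlib.pyplot as plt    # figure handling underneath seaborn
import seaborn as sns              # statistical plots

print("seaborn", sns.__version__)
```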

In[2]:

load the dataset using pandas
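The loading code itself is missing from the text, so here is a minimal sketch of how it could be done. It assumes the data is a header-less CSV (the file name `haberman.csv` in the usage note is hypothetical) and takes the column names from the printed output. To keep the sketch runnable without the file, it reads the first five rows printed in this article from an in-memory string.

```python
import io

import pandas as pd

# Column names as they appear in the printed output of this article.
COLUMNS = ["age", "year", "nodes", "status"]


def load_haberman(source):
    """Read the header-less Haberman CSV from a path or a file-like object."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Stand-in for the real file: the first five rows printed in this article.
sample = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n31,65,4,1\n")
cancer = load_haberman(sample)

# Note the parentheses: printing `cancer.head` without calling it shows the
# "<bound method NDFrame.head of ..." text seen in the output below.
print(cancer.head())
```

With the real file on disk, the call would be `cancer = load_haberman("haberman.csv")`.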

<bound method NDFrame.head of      age  year  nodes  status
0     30    64      1       1
1     30    62      3       1
2     30    65      0       1
3     31    59      2       1
4     31    65      4       1
..   ...   ...    ...     ...
301   75    62      1       1
302   76    67      0       1
303   77    65      3       1
304   78    65      1       2
305   83    58      2       2

[306 rows x 4 columns]>

Data Preparation:

print(cancer.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     306 non-null    int64
 1   year    306 non-null    int64
 2   nodes   306 non-null    int64
 3   status  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB
None

Observations:

  • There are no missing values in the dataset, so no data imputation is needed.
  • The “status” column has an integer datatype. It should be converted to a categorical type.
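The two observations above can be acted on with a few lines of pandas. This is a sketch on a handful of rows printed in this article; the label names `yes` and `no` are an assumption, since the article only gives the codes 1 and 2.

```python
import pandas as pd

# A few rows printed in this article, used as a stand-in for the full dataset.
cancer = pd.DataFrame(
    {
        "age": [30, 31, 77, 78, 83],
        "year": [64, 59, 65, 65, 58],
        "nodes": [1, 2, 3, 1, 2],
        "status": [1, 1, 1, 2, 2],
    }
)

# Count missing values across all columns (0 means no imputation is needed).
missing = int(cancer.isnull().sum().sum())
print("missing values:", missing)

# Map the integer codes to labels (label names are an assumption),
# then store the column as a pandas categorical.
cancer["status"] = cancer["status"].map({1: "yes", 2: "no"}).astype("category")

print(cancer.dtypes)
```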

High-level Statistics
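The summary in this section looks like the output of `describe()` and `value_counts()`. The sketch below shows those calls on a stand-in frame that contains only the status column, rebuilt from the class counts reported in this article (225 and 81); on the real frame the same calls would cover all four columns.

```python
import pandas as pd

# Stand-in frame built from the class counts reported in this article
# (225 rows with status 1, 81 rows with status 2).
cancer = pd.DataFrame({"status": [1] * 225 + [2] * 81})

summary = cancer.describe()                              # count, mean, std, quartiles
counts = cancer["status"].value_counts()                 # rows per class
shares = cancer["status"].value_counts(normalize=True)   # fraction of rows per class

print(summary)
print("target variable distribution")
print(counts)
print(shares)
```

The shares are simply 225/306 and 81/306, which is where the roughly 73% and 26% figures come from.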

              age        year       nodes      status
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000

target variable distribution
1    225
2     81
Name: status, dtype: int64
1    0.735294
2    0.264706
Name: status, dtype: float64

Observations:

  • The age of the patients ranges from 30 (min) to 83 (max), with a median of 52.
  • Although the maximum number of nodes observed is 52, about 75% of the patients have fewer than 5 nodes and at least 25% of the patients have 0 nodes.
  • The dataset is small (306 rows).
  • The status column is imbalanced: about 73% of the values are 1 (the patient survived 5 years or longer).

2-D Scatter Plot
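The plotting code is not shown in the text. A minimal sketch of a 2-D scatter plot colored by class follows; the choice of `age` and `nodes` as axes, the label names, and the use of `sns.scatterplot` are assumptions, and the data is limited to the ten rows printed in this article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import pandas as pd
import seaborn as sns

# The ten rows printed in this article (first five and last five of the dataset).
cancer = pd.DataFrame(
    {
        "age": [30, 30, 30, 31, 31, 75, 76, 77, 78, 83],
        "year": [64, 62, 65, 59, 65, 62, 67, 65, 65, 58],
        "nodes": [1, 3, 0, 2, 4, 1, 0, 3, 1, 2],
        "status": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    }
)
# Readable class labels (names are an assumption) so seaborn treats status as categorical.
cancer["status"] = cancer["status"].map({1: "survived", 2: "died"})

# One point per patient, colored by survival status.
ax = sns.scatterplot(data=cancer, x="age", y="nodes", hue="status")
ax.set_title("Age vs. positive nodes")
```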

the output of 2d scatter plot

Observations:

  • Using the scatter plot, we cannot clearly distinguish the two classes.

Multivariate Analysis

A pair plot in seaborn draws a scatter plot between every pair of data columns in a given DataFrame. It is used to visualize the relationship between two variables.

Pair Plot

the output of the pair plot

Observations:

  • Plotting the year of treatment (year) against the number of positive lymph nodes (nodes) gives better separation between the two classes than the other scatter plots.

Univariate analysis:

Distribution plots

  • Distribution plots are used to visually assess how the data points are distributed with respect to their frequency.
  • Usually the data points are grouped into bins, and the height of the bar for each bin increases with the number of data points that lie within it (histogram).
  • The probability density function (PDF) describes the relative likelihood that the variable takes a value near x; its curve is a smoothed version of the histogram.
  • Here the height of each bar denotes the percentage of data points in the corresponding bin.

CDF

[30.  35.3 40.6 45.9 51.2 56.5 61.8 67.1 72.4 77.7 83. ]
[0.05228758 0.08823529 0.1503268 0.17320261 0.17973856 0.13398693
0.13398693 0.05882353 0.02287582 0.00653595]
[58. 59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]
[0.20588235 0.09150327 0.08496732 0.0751634 0.09803922 0.10130719
0.09150327 0.09150327 0.08169935 0.07843137]
[ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]
[0.77124183 0.09803922 0.05882353 0.02614379 0.02941176 0.00653595
0.00326797 0. 0.00326797 0.00326797]
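The arrays above look like bin edges followed by per-bin fractions for age, year, and nodes, computed with 10 bins each. Assuming that is how they were produced, the sketch below shows the computation with NumPy on the ten ages printed in this article. Because those rows include the youngest and oldest patients, the bin edges match the first array above, but the fractions differ since only ten rows are used.

```python
import numpy as np

# Ages from the ten rows printed in this article (includes the min 30 and max 83).
ages = np.array([30, 30, 30, 31, 31, 75, 76, 77, 78, 83])

# Count the data points in 10 equal-width bins.
counts, bin_edges = np.histogram(ages, bins=10)

# PDF here means the fraction of points per bin; the CDF is its running total.
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)

print(bin_edges)
print(pdf)
print(cdf)
```

Reading the CDF at a bin edge gives the share of patients at or below that value, which is how a statement such as "about 77% of patients have fewer than 5.2 nodes" can be read off the third pair of arrays.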

Box Plots

Violin plots
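The figures for these two sections are not present in the text. A sketch of how a box plot and a violin plot of the node counts per class could be drawn side by side is given below; the choice of `nodes` as the plotted variable and the label names are assumptions, and the data is the ten rows printed in this article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The ten rows printed in this article (first five and last five of the dataset).
cancer = pd.DataFrame(
    {
        "age": [30, 30, 30, 31, 31, 75, 76, 77, 78, 83],
        "year": [64, 62, 65, 59, 65, 62, 67, 65, 65, 58],
        "nodes": [1, 3, 0, 2, 4, 1, 0, 3, 1, 2],
        "status": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    }
)
# Readable class labels (names are an assumption).
cancer["status"] = cancer["status"].map({1: "survived", 2: "died"})

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: median, quartiles, and outliers of the node count per class.
sns.boxplot(data=cancer, x="status", y="nodes", ax=ax_box)

# Violin plot: the same summary combined with a density estimate of the values.
sns.violinplot(data=cancer, x="status", y="nodes", ax=ax_violin)
```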

Observation:

  • The number of positive lymph nodes of the survivors is highly dense from 0 to 5.
  • Almost 80% of the patients have less than or equal to 5 positive lymph nodes.
  • The patients treated after 1966 have a slightly higher chance of survival than the rest. The patients treated before 1959 have a slightly lower chance of survival than the rest.

Conclusion

  • Plotting year against nodes gives better separation between the two classes than the other scatter plots.
