EDA on Haberman Data set

Raghunath D
Feb 1, 2019 · 13 min read


Exploratory Data Analysis on Haberman data set

Data Set Description: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of data points: 306

Number of Attributes: 4 (including the class attribute)

Attribute Information:

  • Age of patient at time of operation (numerical)
  • Patient’s year of operation (year - 1900, numerical)
  • Number of positive axillary nodes detected (numerical)
  • Survival status (class attribute)
  • — 1 = the patient survived 5 years or longer
  • — 2 = the patient died within 5 years

Missing Attribute Values: None


Importing libraries

We’ll first import all the libraries needed for performing EDA.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Loading the data set

We’ll use Pandas read_csv() method to load the dataset.

haberman = pd.read_csv('haberman.csv')
# head of the data
haberman.head()

# (Q) how many data-points and features?
haberman.shape # the description says 306 data points, but shape shows 305. What's wrong?
Output: (305, 4)

Something is wrong here. The data set description tells us there are 306 data points, but ‘shape’ reports only 305 rows. Also, the column names are not displayed.

  • Manually checking the data set file, we can observe that there are no column names specified in the CSV file. In this case, Pandas automatically takes the first row of the data set as column names.

To tell Pandas not to take the first row as column names, specify ‘header=None’ option while loading the data set.

haberman = pd.read_csv('haberman.csv', header=None)
# now check the head of the data
haberman.head()
# (Q) how many data-points and features?
haberman.shape # now we can see correct number of data points
Output: (306, 4)

Specify column names:

# (Q) how to specify column names to the data set manually
column_names = ['Age', 'Year', 'Positive_Axillary_Nodes', 'Survival_Status']
haberman.columns = column_names
haberman.head()
# specifying column names at the time of loading data set
# haberman = pd.read_csv('haberman.csv', header=None, names=['Age', 'Year', 'Positive_Axillary_Nodes', 'Survival_Status'])

Now you have your data set with all the data points and features/column names defined.

# (Q) What are the column names in our dataset?
haberman.columns
Output: Index(['Age', 'Year', 'Positive_Axillary_Nodes', 'Survival_Status'], dtype='object')

Default pair plot behavior:

# https://seaborn.pydata.org/generated/seaborn.pairplot.html
# plot pair-wise relationships in a dataset
# Pair Plots are a really simple way to visualize relationships between two variables.
# It produces a matrix of relationships between each variable in your data for an instant examination of our data.
sns.pairplot(haberman)
# Note: draws scatter plots for the joint relationships and histograms for the uni-variate distributions.

Note: By default, Seaborn’s pairplot plots relationships for all numerical columns in the data set. In our Haberman data set, Survival_Status is also numerical, so it gets included.

However, we typically want to examine relationships between the independent variables only. Also, while plotting the relationships between variables, we want to color-code the data points by the category of the dependent variable.

Inspect the structure of data:

# (Q) inspect the structure of the data set
haberman.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
Age 306 non-null int64
Year 306 non-null int64
Positive_Axillary_Nodes 306 non-null int64
Survival_Status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

Check for missing values:

# check for missing values
haberman.isnull().sum()
Output:
Age                        0
Year                       0
Positive_Axillary_Nodes    0
Survival_Status            0
dtype: int64

This tells us that there are 4 columns/features in the data set and that none of them have missing values, so there is no need for data imputation.

It also tells us that all the columns are numerical: the datatype of the ‘Survival_Status’ column is integer. But the data set description says the last column (Survival_Status) is a class attribute. Can we map the column to make it categorical, so that Survival_Status becomes ‘Yes’ (survived 5 years or longer) and ‘No’ (died within 5 years)?

Map feature to categorical type

Let’s first find how many unique values exist for the class label.

# print the unique values of the class label column
print(list(haberman['Survival_Status'].unique()))
Output: [1, 2]

Modify the target column/class label to be meaningful as Categorical.

# modify the target column values to be meaningful as well as categorical
haberman['Survival_Status'] = haberman['Survival_Status'].map({1: "Yes", 2: "No"})
haberman['Survival_Status'] = haberman['Survival_Status'].astype('category')
haberman.head()

Check the structure of data again:

haberman.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
Age 306 non-null int64
Year 306 non-null int64
Positive_Axillary_Nodes 306 non-null int64
Survival_Status 306 non-null category
dtypes: category(1), int64(3)
memory usage: 7.6 KB

You can now observe that the ‘Survival_Status’ is no longer a numerical column, but a ‘Categorical’ column.

Distribution of classes in the data set:

We’ll now figure out how many data points exist for each class label/target column.

#(Q) How many data points for each class are present for Survival_Status column?

haberman["Survival_Status"].value_counts()
Output:
Yes    225
No      81
Name: Survival_Status, dtype: int64

Plot the histogram of classes:

# plot the histogram for the classes - binary classification, only two classes
count_classes = pd.value_counts(haberman["Survival_Status"])
count_classes.plot(kind = 'bar')
plt.title("Class distribution Histogram")
plt.xlabel("Survival Status")
plt.ylabel("Frequency")
plt.show()

Distribution of classes as a % :

# percentage of classes
# this gives us the distribution of classes in the data set
haberman["Survival_Status"].value_counts(normalize=True)
Output:
Yes    0.735294
No     0.264706
Name: Survival_Status, dtype: float64

This shows that around 74% of the patients survived 5 years or longer (Survival_Status = ‘Yes’) and around 26% died within 5 years (Survival_Status = ‘No’). The target column is imbalanced, with about 74% of the values being ‘Yes’.

Descriptive statistics on data:

# (Q) High Level Statistics
# Since Survival_Status is a categorical variable, this column will not be shown here.
# By default, describe() function displays descriptive statistics for numerical columns only
haberman.describe()
  • The age of the patients varies from 30 to 83, with a median of 52.
  • Although the maximum number of positive axillary nodes observed is 52, nearly 75% of the patients have fewer than 5 positive axillary nodes and nearly 25% have no positive axillary nodes.
  • The dataset contains only a small number of records (306).

Note: By default, the describe() method doesn’t include categorical variables when calculating descriptive statistics. To show them, we have to use the ‘include’ parameter.

# include descriptive statistics for categorical columns also
haberman.describe(include='all')

We can have descriptive statistics for specific columns as well:

# descriptive statistics only for the categorical variable
haberman['Survival_Status'].describe()
Output:
count     306
unique      2
top       Yes
freq      225
Name: Survival_Status, dtype: object

The objective of our EDA on the data set is to predict whether the patient will survive after 5 years or not based upon the patient’s age, year of treatment and the number of positive axillary nodes. We want to determine if any relationship exists between feature variables or not and how they affect the target column.

Data Visualization — Uni-variate Analysis

Distribution plots

  • Distribution plots are used to visually assess how the data points are distributed with respect to their frequency.
  • Usually the data points are grouped into bins, and the height of each bar grows with the number of data points that lie within that bin (histogram).
  • The Probability Density Function (PDF) describes the relative likelihood of the variable taking a value x (a smoothed version of the histogram).
  • Kernel Density Estimation (KDE) is a way to estimate the PDF. The area under the KDE curve is 1.
  • Here the height of the bar denotes the percentage of data points in the corresponding group.

KDE — We draw Gaussian kernels for each data point in the histogram and sum up the Gaussian kernel values at each point and plot the curve.
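To make that idea concrete, here is a minimal NumPy sketch of a Gaussian KDE (illustrative only: the sample ages and the bandwidth of 3 are arbitrary choices, not values taken from this analysis):

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth=1.0):
    """Place a Gaussian kernel on each data point, evaluate all of them
    on the grid, sum them up, and average."""
    data = np.asarray(data, dtype=float)
    # one Gaussian bump per data point, evaluated at every grid position
    kernels = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / bandwidth) ** 2)
    kernels /= bandwidth * np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / len(data)

ages = np.array([30, 34, 38, 42, 42, 45, 52, 60, 65, 70])  # made-up sample
grid = np.linspace(20, 80, 601)
density = gaussian_kde_1d(ages, grid, bandwidth=3.0)

# the area under the KDE curve should come out to (approximately) 1
dx = grid[1] - grid[0]
print(round(density.sum() * dx, 2))
```

Seaborn’s distplot does all of this (plus automatic bandwidth selection) for us.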

# Univariate analysis - plotting distribution
sns.FacetGrid(haberman, hue="Survival_Status", size=5) \
.map(sns.distplot, "Age") \
.add_legend();
plt.show();

Distribution plot for all features:

# plotting the distribution plot for all features
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    fg = sns.FacetGrid(haberman, hue='Survival_Status', size=5)
    fg.map(sns.distplot, feature).add_legend()
    plt.show()

By looking at the PDF plots and the amount of separation and overlap between the classes, we can decide which features give useful insight and choose those features.

From the univariate distributions, we can see that ‘Positive_Axillary_Nodes’ is the most important feature for determining our dependent variable ‘Survival_Status’. The second most important feature is ‘Age’.

  • Observation: The number of positive lymph nodes of the survivors is densely concentrated between 0 and 5.

PDF & CDF plots for all features

# CDF - The cumulative distribution function (cdf) is the probability that 
# the variable takes a value less than or equal to x.
plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):

    plt.subplot(1, 3, idx+1)
    print("********* "+feature+" *********")

    counts, bin_edges = np.histogram(haberman[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))

    plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
    plt.xlabel(feature)
plt.show()

Output:
********* Age *********
Bin Edges: [30. 35.3 40.6 45.9 51.2 56.5 61.8 67.1 72.4 77.7 83. ]
PDF: [0.05228758 0.08823529 0.1503268 0.17320261 0.17973856 0.13398693
0.13398693 0.05882353 0.02287582 0.00653595]
CDF: [0.05228758 0.14052288 0.29084967 0.46405229 0.64379085 0.77777778
0.91176471 0.97058824 0.99346405 1. ]
********* Year *********
Bin Edges: [58. 59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]
PDF: [0.20588235 0.09150327 0.08496732 0.0751634 0.09803922 0.10130719
0.09150327 0.09150327 0.08169935 0.07843137]
CDF: [0.20588235 0.29738562 0.38235294 0.45751634 0.55555556 0.65686275
0.74836601 0.83986928 0.92156863 1. ]
********* Positive_Axillary_Nodes *********
Bin Edges: [ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]
PDF: [0.77124183 0.09803922 0.05882353 0.02614379 0.02941176 0.00653595
0.00326797 0. 0.00326797 0.00326797]
CDF: [0.77124183 0.86928105 0.92810458 0.95424837 0.98366013 0.99019608
0.99346405 0.99346405 0.99673203 1. ]

In the above plots, the blue line is the PDF and the orange line is the CDF of all the data points.

  • From the plots, we can see that about 80% of the patients have fewer than 10 positive lymph nodes.

PDF & CDF plots for all features based on class label

Separate the data sets based on the class label

survived = haberman[haberman['Survival_Status'] == 'Yes']
notsurvived = haberman[haberman['Survival_Status'] == 'No']

Plot PDF & CDF for all features based on the class label type.

plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):

    plt.subplot(1, 3, idx+1)
    print("********* "+feature+" *********")

    # PDF & CDF for the survived class
    counts, bin_edges = np.histogram(survived[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)

    # PDF & CDF for the not-survived class
    counts, bin_edges = np.histogram(notsurvived[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
    plt.xlabel(feature)
plt.show()
  • Observation: Almost 80% of the patients who survived have 5 or fewer positive lymph nodes.

ECDF plot

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y

Computing ECDF plot for a feature variable:

# Compute ECDF for Postive_Axillary_Nodes data: x_vers, y_vers
x_vers, y_vers = ecdf(haberman['Positive_Axillary_Nodes'])
# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
# Label the axes
plt.xlabel('Positive_Axillary_Nodes')
plt.ylabel('ECDF')
# Display the plot
plt.show()

Box plots

A box plot takes less space and visually represents the five-number summary of the data points; outliers are displayed as points beyond the whiskers.

The five numbers marked on a box plot are:

  1. Q1–1.5*IQR
  2. Q1 (25th percentile)
  3. Q2 (50th percentile or median)
  4. Q3 (75th percentile)
  5. Q3 + 1.5*IQR

Inter-Quartile Range (IQR) = Q3 - Q1

Note: the width of the box has no significance.
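The five numbers above can be computed directly with NumPy percentiles. A quick sketch (the node counts here are made-up illustrative values, not the actual data set):

```python
import numpy as np

nodes = np.array([0, 0, 1, 2, 3, 4, 5, 8, 13, 23])  # made-up sample

q1, q2, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

lower_whisker = q1 - 1.5 * iqr  # points below this are drawn as outliers
upper_whisker = q3 + 1.5 * iqr  # points above this are drawn as outliers

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
print("whisker range:", lower_whisker, "to", upper_whisker)
# with these values, 23 falls above the upper whisker and would be plotted as an outlier
```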

# box plot for all independent variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    sns.boxplot(x='Survival_Status', y=feature, data=haberman, ax=axes[idx])
plt.show()

Violin plots

A violin plot is a method of plotting numeric data that combines a box plot with a kernel density plot: it looks like a box plot with a rotated KDE curve drawn on each side, so it shows the distribution shape of the data together with the box-plot summary.

  • The thick black bar in the centre represents the inter-quartile range, the thin black line extending from it represents the whisker range, and the white dot is the median.
# violin plot for all independent variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    sns.violinplot(x='Survival_Status', y=feature, data=haberman, ax=axes[idx])
plt.show()
  • Observation: The patients treated after 1966 have a slightly higher chance of survival than the rest. The patients treated before 1959 have a slightly lower chance of survival than the rest.

Strip plot — 1-D scatter plot

It produces one dimensional scatter plots (or dot plots) of the given data.

A strip plot is a scatter plot where one of the variables is categorical and we can group data based on this categorical variable.

# strip plot for all independent variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    sns.stripplot(x='Survival_Status', y=feature, data=haberman, ax=axes[idx], jitter=True)
plt.show()

Bee Swarm plots

A Beeswarm/Swarm plot is a two-dimensional visualization technique where data points are plotted relative to a fixed reference axis so that no two data points overlap.

The Beeswarm plot is a useful technique when we wish to see not only the measured values of interest for each data point, but also the distribution of these values.

  • Swarm plots are more detailed than histograms, because every data point is displayed.

# swarm plot for all independent variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    sns.swarmplot(x='Survival_Status', y=feature, data=haberman, ax=axes[idx])
plt.show()

From the above plot, we can observe that patients with a later year of operation have slightly better survival.

Multi-variate Analysis

Scatter plots

Scatter plots help identify whether a measurable relationship exists between two features by plotting them in pairs on a graph, as below. This visually shows the correlation between the two features.

# 2-D Scatter plots are used to visualize relationship between two variables only
# Here 'sns' corresponds to seaborn.
# 2-D Scatter plot with color-coding for each Survival type/class.
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="Survival_Status", size=4) \
.map(plt.scatter, "Age", "Positive_Axillary_Nodes") \
.add_legend();
plt.show();

Pair plots

Pair plot in seaborn draws a scatter plot between every pair of numeric columns in a given dataframe. It is used to visualize pairwise relationships between variables.

# after we have made the categorical variable 'Survival_Status' as of type 'category', 
# the default sns pairplot won't show that feature now.
sns.pairplot(haberman)

Let’s plot the data points using color encoding, based on the class label.

sns.pairplot(haberman, hue = 'Survival_Status', size = 3)
plt.show()

Scattering the data points between Year of treatment and Positive Axillary Nodes shows better separation between the two classes than the other scatter plots.

Convert Categorical variable to Numeric (Label Encoding)

print(list(haberman['Survival_Status'].unique()))
Output: ['Yes', 'No']

Let’s convert the categorical column to numerical using Label Encoder.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(haberman['Survival_Status'])
haberman['Survival_Status'] = le.transform(haberman['Survival_Status'])
# check the structure of data
haberman.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
Age 306 non-null int64
Year 306 non-null int64
Positive_Axillary_Nodes 306 non-null int64
Survival_Status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

Checking the new values for categorical column after encoding:

# print the unique values of the class label column
print(list(haberman['Survival_Status'].unique()))
Output:[1, 0]

Different classes encoded using Label Encoder:

le.classes_
Output: array(['No', 'Yes'], dtype=object)

Data points for each class:

#(Q) How many data points for each class are present for Survival_Status column? 
haberman["Survival_Status"].value_counts()
Output:
1    225
0     81
Name: Survival_Status, dtype: int64

Correlation Matrix:

# generate correlation matrix
haberman.corr() # for this to work, all columns should be numerical

Correlation Heatmap:

# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(haberman.corr(), annot = True)

Correlation of independent variables against the class label

haberman.corr()['Survival_Status'] # numerical correlation matrix

Output:

Age                       -0.067950
Year                       0.004768
Positive_Axillary_Nodes   -0.286768
Survival_Status            1.000000
Name: Survival_Status, dtype: float64


Raghunath D

Software Engineer at Oracle. Data enthusiast interested in Computer Vision, aspiring to be a Machine Learning engineer.