Exploratory Data Analysis

Bacem Etteib
Jul 26, 2018 · 6 min read

This is a blog for people new to Data Science, like me. I hope we learn together through this process. My personal interests lie heavily in Data Analytics and Visualization, so keep an eye out for more posts on those topics.

The image is taken from Green Book

Article outline

1. Definition
2. Prerequisites
3. Data Loading and Preparation: Basic steps
4. Discovering hidden patterns and creating insights
5. Applying our model
6. Useful resources

1. Definition

The field of Data Science is constantly growing, enabling businesses to become more data-driven with better insights and knowledge. According to Harvard Business Review, data science is “the sexiest job of the 21st century”. A century ago, the world’s most valuable resource was oil; today, data is often called the oil of the digital era.

The image is taken from Know Your Meme

2. Prerequisites

There’s one thing to do before we get started, however: learn about pandas, the linchpin of the Python data science ecosystem. pandas contains all of the data reading, writing, and manipulation tools you will need to probe your data, run your models, and, of course, visualize. If you are totally unfamiliar with pandas, I strongly suggest that you take a look at this course: Pandas Tutorial.

3. Data Loading and Preparation

We will be using the Haberman dataset for this tutorial, as I find it incredibly useful for EDA. This dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

We start by importing basic libraries and loading the data.

# pandas for data processing
# NumPy for numerical operations
# seaborn for visualization
# scikit-learn for ML techniques
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.naive_bayes import GaussianNB

# Loading our data
df = pd.read_csv('../habermans.csv')

Understanding the Data

Before diving deep into the details and performing different operations, it’s better to get some basic information about our data, i.e. the number of columns and observations, summary statistics, and the total number of null values.
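In pandas this only takes a few lines. Here is a minimal sketch of the kind of checks I ran (the exact snippet may differ slightly):

# Number of observations and columns
print(df.shape)
# All columns in list format
print(df.columns.tolist())
# Statistical summary of the numerical columns
print(df.describe())
# Column types and non-null counts
df.info()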

All columns in list format

The first thing to notice is that our data comes with no headers. A quick search gave me the following labels:

Data labels found in description

Super! Let’s assign those headers to our dataset.
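Since the file has no header row, the simplest way is to reload it and pass the labels explicitly. The short names below are my shorthand for the labels in the dataset description; adjust them if your copy differs:

# Reload with header=None so the first record is not treated as a header,
# and assign the labels from the dataset description
df = pd.read_csv('../habermans.csv', header=None,
                 names=['age', 'year', 'nodes', 'survival'])
# A quick look at our data
print(df.head())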

A quick look at our data [1]

4. Discovering hidden patterns and creating insights

Useful statistical insights [2]
All columns information [3]

Luckily, our data is clean and has no missing values. This can be checked through the count row of the summary statistics, or by running the code below. This means we won’t have to go through the hassle of imputation. If that term sounds unfamiliar, don’t worry, I will explain it in detail in upcoming posts.
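A quick sanity check, for instance:

# Total missing values per column
print(df.isnull().sum())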

No missing data

The survival column is the target variable, i.e. the one we are going to predict. In the raw data it takes the value 1 if the patient survived 5 years or longer and 2 otherwise, which is not very intuitive, so we remap it to 1 (alive) and 0 (dead). We also add a new ‘status’ column with readable labels for easier interpretation.
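Both steps only take a couple of map calls; roughly, the code looks like this (the exact snippet may differ):

# Remap the target so that 1 means alive and 0 means dead
df['survival'] = df['survival'].map({1: 1, 2: 0})
# Add a readable 'status' column for easier interpretation and plotting
df['status'] = df['survival'].map({1: 'alive', 0: 'dead'})
print(df['survival'].value_counts())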

Observations and questions

1. The target variable is imbalanced: about 75% of patients survived 5 years or longer after surgery.
2. The age of patients varies from 30 to 83, with most patients around 61 years old.
3. While the maximum number of nodes is 52, more than 75% of patients have fewer than 5 nodes.

Now we want to answer some basic questions :

1. How are the variables in the dataset related? Are they linearly separable?
2. At which age do patients most often die?
3. What is the impact of axillary nodes on the survival status?

Variables correlation coefficient

In statistics, the correlation coefficient measures the strength and direction of the linear relationship between a pair of variables. Its value always lies between -1 and +1, and a relationship is usually considered strong when the absolute value is above 0.7. pandas lets us compute the pairwise correlation of columns. Let’s go!
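A one-liner gives us the correlation matrix of the numerical columns (a minimal sketch):

# Pairwise correlation of the numerical columns
print(df[['age', 'year', 'nodes', 'survival']].corr())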

As the correlation matrix above shows, our variables aren’t strongly linearly related, so using Linear Regression won’t be a good idea. This already gives us an idea about which model we’re going to use later on.

Age & Survival Status

The impact of the treatment seems to be positive, as most patients are still alive 5 years after surgery.

Perfect! Let’s now make some visualizations. We’ll be using the seaborn library for this task.
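The charts below were produced with seaborn’s FacetGrid; roughly, the code looks like this (the plot parameters are illustrative, not the original ones):

import matplotlib.pyplot as plt

# Distribution of age, split by survival status
sns.FacetGrid(df, hue='status', height=4).map(sns.histplot, 'age').add_legend()
plt.show()

# Distribution of year of operation, split by survival status
sns.FacetGrid(df, hue='status', height=4).map(sns.histplot, 'year').add_legend()
plt.show()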

Relation between survival status, age and year

Through the charts above, it turns out that more patients pass away between the ages of 35 and 55, while others tend to survive. Furthermore, 1963 stands out as the worst year in this dataset, as the number of deaths is extremely high. Patients treated after 1963 therefore appear to have a higher chance of surviving than the rest.

We can verify what the graphs showed us through the code below.

At which age do breast cancer patients die most often?
In which year of treatment did most breast cancer patients die?
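For example, counting deaths per age and per year of operation, using the ‘status’ labels added earlier (a sketch):

# Ages with the most deaths
print(df[df['status'] == 'dead']['age'].value_counts().head())
# Years of operation with the most deaths
print(df[df['status'] == 'dead']['year'].value_counts().head())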

The charts gave us useful information about the relation between year of treatment, age, and survival status. However, this is still insufficient, since we also want to know how the number of positive axillary nodes detected relates to the patient’s survival status. To do this, we will use box plots for better visualization.
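A box plot of nodes per survival status is a one-liner in seaborn, roughly:

# Distribution of positive axillary nodes for each survival status
sns.boxplot(x='status', y='nodes', data=df)
plt.show()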

Box plots

Observations

1. The number of positive nodes is between 0 and 10 for most patients.
2. Patients with between 0 and 5 nodes survive more often than those with more than 5.
3. The typical number of axillary nodes is 2.

5. Applying our model

We will be using Gaussian Naive Bayes as our ML classifier.

Some of its advantages are:

  • Very simple, easy to implement and fast.
  • Requires relatively little training data.
  • Not sensitive to irrelevant features.

Luckily, our dataset includes only numerical values, so pandas’ values attribute will work like a charm: it returns a NumPy representation of our data frame. From it we extract the features and labels; our target variable is the survival status, and our features are the remaining columns. For the sake of simplicity, we’ll use scikit-learn’s train_test_split to split the data into training and test sets. This will save us a lot of time.
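Roughly, the feature/label extraction and the split look like this (the test size and random seed are arbitrary choices of mine):

from sklearn.model_selection import train_test_split

# Features: every column except the target and its readable label
X = df[['age', 'year', 'nodes']].values
# Labels: the remapped survival status
y = df['survival'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)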

Features + labels + train/test split

Now it’s time for some fun! Let’s create our classifier and feed the data.
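With scikit-learn this is only two lines:

# Create the Gaussian Naive Bayes classifier and fit it on the training data
clf = GaussianNB()
clf.fit(X_train, y_train)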

Gaussian NB classifier

Sublime! To see how well our classifier is performing, sklearn provides us with the accuracy_score function. Running the code below, we get a score of about 78%. Not bad for a start!
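The scoring step looks roughly like this (the exact number depends on the split above):

from sklearn.metrics import accuracy_score

# Predict on the held-out test set and measure accuracy
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))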

That’s all for machine learning today! Not enough? Check back later for more tutorials.

6. Useful resources

You can find the full code on my GitHub. Below, I have included some useful resources that you can check for further details.

1. Gaussian NB
2. Pandas
3. Data Preparation
