Exploring the Breast Cancer Data Set

Anchit Jain
Published in Data Science 101
4 min read · Apr 29, 2018

Visualising and exploring Breast Cancer data set to predict cancer

Hello everyone!

This is my first machine learning blog, and it will help you understand how important it is to analyse a data set before implementing any machine learning algorithm.

Now, you may ask how? Let me show you. Unlike traditional programming, in machine learning you can spend months on a project with no results to show. Jumping straight into implementing an algorithm you feel might work, without analysing the data first, is a big pothole.

In this post I’ll outline the process of visualising and analysing a data set.

Data set: breast-cancer-wisconsin.csv
Source : https://github.com/jeffheaton/aifh/blob/master/vol1/python-examples/datasets/breast-cancer-wisconsin.csv
Description: This data set lets you build a classifier for breast cancer. Have a quick glimpse at the top five rows of the data set.

Probably, like you, I am not a cancer specialist. But let’s assume that the features in the dataset are sufficient to predict the stage of a cancer patient.

Task: Classify the cancer stage of a patient using various features in the dataset

Before we jump into any classification algorithm, here is what I would do to gain an intuition/insight into the problem statement:

  1. Let us first clean up the data.
Importing the CSV data into a pandas DataFrame.
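A minimal sketch of that first step. The column names match the linked breast-cancer-wisconsin.csv; the few rows inlined here are illustrative stand-ins, not real patient records — in practice you would simply call `pd.read_csv("breast-cancer-wisconsin.csv")` on the downloaded file.

```python
import io

import pandas as pd

# Illustrative rows in the same shape as breast-cancer-wisconsin.csv.
# In practice: df = pd.read_csv("breast-cancer-wisconsin.csv")
csv_text = """id,clump_thickness,size_uniformity,shape_uniformity,marginal_adhesion,epithelial_size,bare_nucleoli,bland_chromatin,normal_nucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1017122,8,10,10,8,7,10,9,7,1,4
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # quick glimpse at the top rows
```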

It doesn’t end here. You will probably need to sweat more to clean the data. Cleaning real-life data has always been a big pain, and we will try to cover it in later posts. Just for a taste: cleaning data deals with handling null values, zeros, and special characters (“?”).

Handling NA in the dataset
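Here is one way that NA handling could look, assuming (as in the real file) that the `bare_nucleoli` column occasionally contains “?”, which makes pandas read it as a string column:

```python
import io

import numpy as np
import pandas as pd

# Illustrative rows: one record has "?" in bare_nucleoli.
csv_text = """id,clump_thickness,bare_nucleoli,class
1,5,1,2
2,3,?,2
3,8,10,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Treat "?" as missing, convert back to numbers, then drop incomplete rows.
df = df.replace("?", np.nan)
df["bare_nucleoli"] = pd.to_numeric(df["bare_nucleoli"])
df = df.dropna()
print(len(df))  # the row containing "?" has been dropped
```

Dropping rows is the simplest choice; imputing the missing values (e.g. with the column median) is a common alternative when you can’t afford to lose records.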

2. Start with a heat map for some initial intuition.

Now where does this come from? To understand which attribute (parameter) is correlated with which, we need to understand the concept of correlation among attributes. This is where a heat map comes into play.

This is what a heat map looks like:

Heat Map
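A sketch of how such a heat map can be produced. This version uses plain matplotlib on a few illustrative rows (a one-liner like `seaborn.heatmap(df.corr(), annot=True)` gives the same picture if seaborn is available):

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Illustrative rows in the shape of the cleaned data set.
csv_text = """clump_thickness,size_uniformity,shape_uniformity,class
5,1,1,2
5,4,4,2
3,1,2,2
8,10,10,4
4,8,8,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Pairwise Pearson correlations, drawn as a colour-coded matrix.
corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.tight_layout()
fig.savefig("heatmap.png")
```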

What we need to understand here is the correlation between every pair of attributes, where +1 is the strongest positive correlation and -1 the strongest negative correlation.

Let’s focus on the square where size_uniformity on the X-axis meets shape_uniformity on the Y-axis: the value is 0.91, which shows that these two attributes are highly correlated. In simpler words, the value of size_uniformity increases when the value of shape_uniformity increases. Had it been -0.91, they would still be highly correlated, but this time one would increase when the other decreases.

3. Let’s play with the other attributes as well, using a bar plot.

Before I show you the output, try to visualise it. I am putting a column (bland_chromatin) on the X-axis and counting the outcomes on the Y-axis. That means I’ll get a graph showing how many people in each bland_chromatin category fall into class 2 or class 4. Remember: class 2 means the patient is in the early (benign) stage of cancer, while class 4 is malignant.
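One way to sketch such a plot, again on a handful of illustrative rows rather than the real counts:

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Illustrative rows; class 2 = benign, class 4 = malignant.
csv_text = """bland_chromatin,class
1,2
1,2
2,2
3,2
3,4
7,4
8,4
8,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count patients per (bland_chromatin, class) pair, then draw grouped bars.
counts = pd.crosstab(df["bland_chromatin"], df["class"])
ax = counts.plot(kind="bar")
ax.set_xlabel("bland_chromatin")
ax.set_ylabel("number of patients")
ax.legend(title="class")
ax.figure.savefig("bland_chromatin_bars.png")
```

`pd.crosstab` fills missing combinations with zero, so every bland_chromatin value gets a bar for each class even when no patient falls in it.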

Observation: From the graph it is clear that when bland_chromatin is 1, 2, or 3, roughly 150, 160, and 130 patients respectively are in the benign stage. As the value rises from 3 to 7, more and more patients fall into the malignant class, though a few cases are still benign. Once the value exceeds 7 (values 8, 9, and 10), not a single patient was in the benign class.

4. How does all that help?

Many machine learning projects fail, some succeed. What do you think is the main difference?

The features used have to be the most important factor.

That’s what any machine learning algorithm is trying to do: learn from a set of features so that it can make an accurate prediction based on them.

So let me quickly put the whole story in a few lines. Analysing the data set first:

  1. gives us an intuition about which algorithm could be a good starting point;
  2. helps us with feature engineering, i.e. coming up with new features (when the first attempt gives unsatisfactory results, which it will almost every time);
  3. helps us build a mental model of the kind of data and problem we are dealing with, which lets us make better decisions throughout the process instead of just shooting in the dark and praying something groundbreaking comes out.

You can access the complete code and the dataset here

Thank you for your patience… Claps (echoing)!


Anchit Jain

Machine learning engineer. Loves to work on Deep learning based Image Recognition and NLP. Writing to share because I was inspired when others did.