Simple Machine Learning Model on Breast Cancer Dataset

Olutola Awosiku
Published in The Startup · 7 min read · Dec 29, 2020

Credit: Journal of Public Health Research

According to the Nigeria National Cancer Control Plan report, breast cancer ranks as the fifth leading cause of death from cancer overall (522,000 deaths). While it is the most frequent cause of cancer death among women in less developed regions (324,000 deaths, 14.3% of the total), it is the second most frequent cause of cancer death in more developed regions (198,000 deaths, 15.4%), after lung cancer.

Delays in access to cancer treatment mean that 80–90% of cases are already at an advanced stage by the time patients arrive for treatment.

A tumor can be benign (not dangerous to health) or malignant (has the potential to be dangerous). Early diagnosis of breast cancer can significantly improve the prognosis and chance of survival. Furthermore, accurate classification of benign tumors can prevent patients from undergoing unnecessary treatment.

The objective of this article is to show how to design a simple machine learning model that can correctly classify a tumor as benign or malignant, using the breast cancer dataset from the UCI Machine Learning Repository.

STAGE 1.0 : DATA PREPARATION

First, we import the necessary libraries and load the dataset into the Jupyter notebook.
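
A minimal sketch of this step (the filename data.csv is an assumption; substitute the path of your copy of the dataset):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (filename assumed) and preview the first seven rows
dataset = pd.read_csv('data.csv')
dataset.head(7)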

fig : first seven rows of the dataset

Next, we use the pandas “shape” attribute to check the dimensions of the dataset, i.e. how many rows and columns it contains.
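
In code:

dataset.shape   # returns a tuple: (number of rows, number of columns)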

fig : Dimension of dataset

Our dataset contains 569 rows and 33 columns, which means we have 569 patients and 33 variables. The variables are described below (descriptions for a few columns are missing from the dataset documentation; the filled-in ones follow the naming pattern of the others):

  • id: ID number of the patient
  • diagnosis: the diagnosis of the breast tissue (M = malignant, B = benign)
  • radius_mean: mean of distances from center to points on the perimeter
  • texture_mean: standard deviation of gray-scale values
  • perimeter_mean: mean size of the core tumor
  • area_mean: mean area of the core tumor
  • smoothness_mean: mean of local variation in radius lengths
  • compactness_mean: mean of perimeter² / area - 1.0
  • concavity_mean: mean of severity of concave portions of the contour
  • concave points_mean: mean for number of concave portions of the contour
  • symmetry_mean: mean symmetry of the tumor contour
  • fractal_dimension_mean: mean for “coastline approximation” - 1
  • radius_se: standard error for the mean of distances from center to points on the perimeter
  • texture_se: standard error for standard deviation of gray-scale values
  • perimeter_se: standard error for size of the core tumor
  • area_se: standard error for area of the core tumor
  • smoothness_se: standard error for local variation in radius lengths
  • compactness_se: standard error for perimeter² / area - 1.0
  • concavity_se: standard error for severity of concave portions of the contour
  • concave points_se: standard error for number of concave portions of the contour
  • symmetry_se: standard error for symmetry
  • fractal_dimension_se: standard error for “coastline approximation” - 1
  • radius_worst: “worst” or largest mean value for mean of distances from center to points on the perimeter
  • texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
  • perimeter_worst: “worst” or largest mean value for size of the core tumor
  • area_worst: “worst” or largest mean value for area of the core tumor
  • smoothness_worst: “worst” or largest mean value for local variation in radius lengths
  • compactness_worst: “worst” or largest mean value for perimeter² / area - 1.0
  • concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
  • concave points_worst: “worst” or largest mean value for number of concave portions of the contour
  • symmetry_worst: “worst” or largest mean value for symmetry
  • fractal_dimension_worst: “worst” or largest mean value for “coastline approximation” - 1

Our target variable is “diagnosis”, which is either benign (B) or malignant (M). Let’s see how many patients fall into each category.
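
A quick way to do this is the pandas value_counts method:

dataset['diagnosis'].value_counts()   # counts of B and M in the diagnosis column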

fig : Number of malignant and benign

We can also visualize this using the seaborn library
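
For example, with seaborn’s countplot (reusing the imports from the first snippet):

sns.countplot(x='diagnosis', data=dataset)   # bar chart of B vs M counts
plt.show()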

fig : visualization of the number of benign and malignant cases

We observe that out of 569 patients, 357 are labelled benign (B) and 212 malignant (M).

To get an insight into the statistical summary of the dataset, we use the pandas “describe” function.
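
In code:

dataset.describe()   # count, mean, std, min, quartiles, and max for each numeric column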

fig : statistical summary of dataset

Missing or null values: we can check for any missing or null data points in the dataset using the following pandas methods.

dataset.isnull().sum()   # number of null values per column
dataset.isna().sum()     # isna() is an alias of isnull(), so this gives the same result

fig : observing missing value in the dataset

From the above figure we can observe that there are no missing or null values in the dataset.

STAGE 2.0 : LABEL ENCODING

In many machine learning and data science tasks, the dataset may contain text or categorical (non-numerical) values. For example, a color feature may have values like red, orange, blue, and white, and a sex feature may have values like female and male. Let’s observe the types of data we have in our dataset.

fig : data types of our data

From the above figure we can observe that the “diagnosis” variable is the only categorical one. We will use a label encoder to encode it. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical (text) data into numbers that our predictive models can better understand.
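
A minimal sketch of the encoding step:

from sklearn.preprocessing import LabelEncoder

# Encode the text labels numerically; LabelEncoder assigns classes in
# alphabetical order, so B (benign) becomes 0 and M (malignant) becomes 1
labelencoder = LabelEncoder()
dataset['diagnosis'] = labelencoder.fit_transform(dataset['diagnosis'])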

fig : dataset without encoding
fig : encoded categorical data
fig: dataset after label encoding

From the above figures, we can observe that the diagnosis variable has been encoded: the benign category is now labelled numerically as “0” and malignant as “1”.

STAGE 3.0 : DATA VISUALIZATION

Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Python has several useful visualization libraries, such as Matplotlib and Seaborn.

In this article we will use pandas’ plotting functionality, which is built on top of Matplotlib, to examine the distribution of the features.
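
A sketch using the DataFrame’s built-in hist method (the bin count and figure size are arbitrary choices):

# Histogram of every numeric column to inspect the feature distributions
dataset.hist(bins=20, figsize=(20, 15))
plt.show()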

fig : data visualization

We can compute the correlation between the mean features to understand the relationships between them.
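
A sketch of both steps, selecting the mean features by the “_mean” suffix in their column names:

# Select the ten "mean" columns and compute their pairwise correlation
mean_features = dataset.filter(like='_mean')
corr = mean_features.corr()

# Visualize the correlation matrix as an annotated heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()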

fig : correlation between the mean features
fig : heatmap to visualize the correlation

STAGE 4.0 : SPLITTING THE DATASET

To train any machine learning model, irrespective of the type of dataset being used, you have to split the dataset into training data and testing data. In this article we will split the dataset using the train_test_split function from sklearn.

fig: splitting dataset

Here I have used train_test_split to split the data in a 75:25 ratio, i.e. 75% of the data is used for training the model while 25% is used for testing the model that is built from it.
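
A minimal sketch, assuming the feature matrix is everything except the id and diagnosis columns:

from sklearn.model_selection import train_test_split

# Features: every column except the identifier and the target
X = dataset.drop(['id', 'diagnosis'], axis=1)
y = dataset['diagnosis']

# 75:25 split; random_state is fixed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)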

STAGE 5.0 : FEATURE SCALING

Most of the time, a dataset will contain features that vary widely in magnitude, units, and range, so we need to bring them all to the same scale. This is achieved by feature scaling, which transforms the data so that it fits within a specific range, such as 0–100 or 0–1.
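
The article does not name a specific scaler, so as one common choice here is a sketch with scikit-learn’s StandardScaler:

from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance.
# Fit on the training data only, then apply the same transformation
# to the test data so no information leaks from the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)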

STAGE 6.0 : MODEL SELECTION

All machine learning models are categorized as either supervised or unsupervised. If the model is supervised, it is then sub-categorized as either a regression or a classification model. If the model is unsupervised, it is then sub-categorized as either a clustering or a dimensionality-reduction model.

Our target variable has only two values, M (malignant) or B (benign), so we will use classification algorithms from supervised learning.

Classification algorithms include:

  1. Logistic Regression
  2. Nearest Neighbor
  3. Support Vector Machines
  4. Kernel SVM
  5. Naïve Bayes
  6. Decision Tree
  7. Random Forest

In this article we train three of them: Logistic Regression, Decision Tree, and Random Forest, as sketched below.
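
The hyperparameters below are assumptions (scikit-learn defaults), not necessarily the article’s exact settings:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# The three classifiers compared in this article
models = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'Decision Tree Classifier': DecisionTreeClassifier(random_state=0),
    'Random Forest Classifier': RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, '- training accuracy:', model.score(X_train, y_train))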

fig : model accuracy on train data

Now we will make predictions on the test dataset and check the accuracy of each model. Accuracy is the ratio of the number of correct predictions to the total number of input samples. A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
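
A sketch of the evaluation step, continuing from the training snippet above:

from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate each trained model on the held-out test set
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted class
    print('test accuracy:', accuracy_score(y_test, y_pred))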

fig : Confusion Matrix

After applying the three classification models, we obtain the following test accuracies:

  1. Logistic Regression = 95%
  2. Decision Tree Classifier = 93%
  3. Random Forest Classifier = 96%

We can see that the Random Forest classifier gives the best results for our dataset. Lastly, I applied it to predict the test dataset; the result is shown below.
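
A sketch of that final step, continuing from the snippets above:

# Predictions of the best-performing model (Random Forest) on the test set,
# printed alongside the true labels for a side-by-side comparison
best_model = models['Random Forest Classifier']
print(best_model.predict(X_test))   # 0 = benign, 1 = malignant
print(y_test.values)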

This is a basic application of a machine learning model to a dataset. I hope you find it useful. Thanks for reading.
