Detecting Breast Cancer using Machine Learning

Meghna Asthana PhD MSc DIC · Published in Analytics Vidhya · 9 min read · Apr 12, 2020

Applying Decision Trees on Breast Cancer Wisconsin (Diagnostic) Database

Specialist assessing a mammogram [5]

Disclaimer: This is a tutorial which can be recreated by anyone with a basic knowledge of Python programming and information theory. The inspiration for this article comes from coursework I completed at the University of Bath. Therefore, some code snippets have been taken directly from it (indicated by *** on top of the snippet). The complete code can be found here.

Introduction to Decision Trees

Decision trees are a simple yet powerful method for multivariate analysis. They can substitute or complement traditional multiple linear regression, and they produce results that can be communicated in symbolic and visual terms, which helps in understanding how the outcome depends on the input variables [3].

A decision tree has the following characteristics and components [4]:

  1. Nodes: A node can be one of three types: the root/decision node, internal/chance nodes and leaf/end nodes.
  2. Branches: Branches connect the root node to the leaf nodes and carry the classification decision rules along each path.
  3. Splitting: The process of dividing the data on an input variable, chosen according to its importance, is called splitting.
  4. Stopping: Stopping rules prevent the model from overfitting the input data, which would lead to poor generalisation.
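
To make these components concrete, here is a minimal sketch (not part of the original coursework) of a tiny decision tree represented as nested Python dictionaries, mirroring the structure we will build later in this tutorial. The feature index and the threshold of 15.0 are purely illustrative and not learned from any data.

# A toy decision tree: the root/decision node applies a splitting rule,
# and each branch ends in a leaf/end node that assigns a class.
toy_tree = {
    'leaf': False,      # root (decision) node
    'feature': 0,       # index of the feature to split on (illustrative)
    'split': 15.0,      # threshold of the splitting rule (illustrative)
    'left':  {'leaf': True, 'class': 1},   # leaf node: class 1
    'right': {'leaf': True, 'class': 0},   # leaf node: class 0
}

def classify(node, sample):
    # Follow the branches until a leaf node is reached
    if node['leaf']:
        return node['class']
    branch = 'left' if sample[node['feature']] <= node['split'] else 'right'
    return classify(node[branch], sample)

print(classify(toy_tree, [12.0]))   # 12.0 <= 15.0, so the left leaf: class 1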

Breast Cancer Wisconsin (Diagnostic) Database

In this exercise, we have used the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset from here.

Data Set Characteristics:

Number of Instances: 569
Number of Attributes: 30 numeric, predictive attributes and the class

Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- target class:
- WDBC-Malignant
- WDBC-Benign
Summary Statistics:

                                      Min     Max
radius (mean)                         6.981   28.11
texture (mean)                        9.71    39.28
perimeter (mean)                      43.79   188.5
area (mean)                           143.5   2501.0
smoothness (mean)                     0.053   0.163
compactness (mean)                    0.019   0.345
concavity (mean)                      0.0     0.427
concave points (mean)                 0.0     0.201
symmetry (mean)                       0.106   0.304
fractal dimension (mean)              0.05    0.097
radius (standard error)               0.112   2.873
texture (standard error)              0.36    4.885
perimeter (standard error)            0.757   21.98
area (standard error)                 6.802   542.2
smoothness (standard error)           0.002   0.031
compactness (standard error)          0.002   0.135
concavity (standard error)            0.0     0.396
concave points (standard error)       0.0     0.053
symmetry (standard error)             0.008   0.079
fractal dimension (standard error)    0.001   0.03
radius (worst)                        7.93    36.04
texture (worst)                       12.02   49.54
perimeter (worst)                     50.41   251.2
area (worst)                          185.2   4254.0
smoothness (worst)                    0.071   0.223
compactness (worst)                   0.027   1.058
concavity (worst)                     0.0     1.252
concave points (worst)                0.0     0.291
symmetry (worst)                      0.156   0.664
fractal dimension (worst)             0.055   0.208

Missing Attribute Values: None
Class Distribution: 212 Malignant, 357 Benign
Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
Donor: Nick Street
Date: November, 1995

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

The separating plane described above was obtained using the Multisurface Method-Tree (MSM-T) [1], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1–4 features and 1–3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in [2].

This database is also available through the UW CS ftp server here.

Import essential libraries and data

The first step of any machine learning problem is to load the data. In this tutorial, we will be using a built-in dataset provided by the scikit-learn package.

%matplotlib inline
import numpy as np

from sklearn import datasets as ds
from sklearn.decomposition import PCA
from sklearn import preprocessing

import matplotlib.pyplot as plt
import math

# Load the built-in Breast Cancer Wisconsin (Diagnostic) dataset
data_all = ds.load_breast_cancer()

x = data_all.data      # feature matrix of shape (569, 30)
y = data_all.target    # target classes: 0 = malignant, 1 = benign

y_names = data_all.target_names          # ['malignant', 'benign']

feature_names = data_all.feature_names   # names of the 30 features

Data Preparation

The code block splits the data and the targets into training and test sets; 60% for training, 40% for test.

# Use the first 60% of the rows for training and the remaining 40% for testing
split = int(x.shape[0] * 0.6)

x_train = x[:split, :]
y_train = y[:split]

x_test = x[split:, :]
y_test = y[split:]
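
The split above is sequential: the first 60% of the rows go to training and the rest to testing. Since the scikit-learn version of this dataset is not ordered by class this works in practice, but a shuffled split is a common alternative. A minimal sketch using scikit-learn's train_test_split (not used in the rest of this tutorial) would be:

from sklearn.model_selection import train_test_split

# Shuffled 60/40 split; random_state fixes the shuffle for reproducibility
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.4, random_state=0)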

Data Visualisation

The dataset has a feature dimensionality of 30, which is difficult to visualise directly. Therefore, we use a dimensionality reduction technique called Principal Component Analysis (PCA).

Let us consider a data matrix in R^{nxd}, i.e. a matrix of real entries with n rows and d columns, where n is the number of data points and d is the feature dimensionality. PCA will output an array in R^{nxm}, with m < d.

In order to be able to visualise the data on a 2D plot, we choose m=2.

import matplotlib.patches as mpatches

# Standardise the features and project them onto the first two principal components
pca = PCA(n_components=2)
x_scaled = preprocessing.scale(x)   # standardise all 30 features
x_reduced = pca.fit_transform(x_scaled)

plt.figure()

# Plot each data point in the reduced 2D space, coloured by its class
for target in range(len(y)):
    if y[target] == 0:
        # Malignant
        plt.scatter(x_reduced[target, 0], x_reduced[target, 1], color='b')

    if y[target] == 1:
        # Benign
        plt.scatter(x_reduced[target, 0], x_reduced[target, 1], color='r')

red_patch = mpatches.Patch(color='r', label='Benign')      # red label for benign
blue_patch = mpatches.Patch(color='b', label='Malignant')  # blue label for malignant
plt.legend(handles=[blue_patch, red_patch])
plt.title('Dataset Visualisation using PCA')
plt.xlabel('Reduced Feature 1')
plt.ylabel('Reduced Feature 2')

plt.show()
PCA of Dataset
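
To quantify how much of the dataset's variance the two principal components retain, we can inspect the fitted PCA object. This short check is an addition to the original code:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print('Total variance explained:', pca.explained_variance_ratio_.sum())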

Decision Tree

1. Calculation of Entropy

The following function will be used for calculating entropy, which can be represented as

H(y) = - sum_i p_i * log(p_i)

where p_i is the proportion of data points in y belonging to class i. The natural logarithm is used here, so the entropy is measured in nats.

The input is a column vector of target class values, and the output is its entropy. y is an n X 1 sized matrix where n is the number of data points (targets). The return is a scalar.

def calculate_entropy(y):

    # Unique classes in y and how often each one occurs
    classes, counts = np.unique(y, return_counts=True)

    # Summation variable
    p_sum = 0
    for i in range(len(classes)):
        # Probability (proportion) of class i in y
        prob = float(counts[i]) / float(len(y))
        p_sum += prob * math.log(prob)

    # Final entropy
    entropy = -1 * p_sum
    return entropy
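
As a quick sanity check (not part of the original coursework), the function can be applied to the full target vector. With 212 malignant and 357 benign cases the entropy should come out a little below the two-class maximum of log(2) ≈ 0.693 nats, at roughly 0.66 nats, while a single-class vector has zero entropy:

# Entropy of the full target vector (212 malignant vs. 357 benign cases)
print(calculate_entropy(y))       # roughly 0.66 nats
# A vector containing a single class has zero entropy
print(calculate_entropy(y[:1]))   # zero (may print as -0.0)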

2. Finding Split

The function find_split() allows us to find the optimal feature and the best value to split the data into two chunks. Applying this to the original data set splits it into two new data sets. We can then repeat this on both of the new data sets to get four data sets, and so on. This recursion builds a decision tree. It needs stopping conditions to prevent it from dividing the data forever; here we will use two:

  • Maximum depth: The tree is limited to be no deeper than a provided limit.
  • Perfection: If a node contains only one class then there is little point in splitting it further.

find_split(x, y) takes as input:

  • The data matrix of features, x in R^{nXd}. n is the number of data points and d is the feature dimensionality.
  • y, a column vector of size n containing the target value for each data point in x.

find_split(x, y) outputs 'best_split' which is a dictionary (see the last part of the below code) with the following keys and their corresponding values:

  • 'feature': An integer indexing the attribute/feature chosen to split upon.
  • 'split': The value/threshold of this feature to split at.
  • 'infogain': A scalar representing the amount of information gained by splitting this way.
  • 'left_indices': Indices of the exemplars that satisfy x[feature_index]<=split.
  • 'right_indices': Opposite set of indices to left_indices.

In order to find a split, we first calculate the entropy of the target values before splitting. Then, for every feature, we try each of its unique values as a candidate threshold, partition the data points into left and right branches, and compute the information gain of that partition. The feature and threshold with the maximum information gain are selected for the split.

***
def find_split(x, y):

    # Entropy of the targets before any split
    start_entropy = calculate_entropy(y)

    best = {'infogain' : -np.inf}

    # Try every unique value of every feature as a candidate threshold
    for i in range(x.shape[1]):
        for split in np.unique(x[:, i]):

            left_indices = []
            right_indices = []

            # Partition the data points around the threshold
            for index in range(x.shape[0]):
                if x[index, i] > split:
                    right_indices.append(index)
                else:
                    left_indices.append(index)

            nl = float(len(left_indices))
            nr = float(len(right_indices))
            n = nl + nr

            # Information gain = entropy before the split minus the
            # weighted entropy of the two branches after the split
            infogain = start_entropy - (nl/n)*calculate_entropy(y[left_indices]) - (nr/n)*calculate_entropy(y[right_indices])

            if infogain > best['infogain']:
                best = {'feature' : i,
                        'split' : split,
                        'infogain' : infogain,
                        'left_indices' : left_indices,
                        'right_indices' : right_indices}
    return best
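
For illustration, we can call find_split() directly on the training data to see which single feature and threshold give the largest information gain at the root. The exact values depend on how the data was split, so treat this as a usage sketch only:

# Best first split on the training data
root_split = find_split(x_train, y_train)
print('Feature: ', feature_names[root_split['feature']])
print('Threshold: ', root_split['split'])
print('Information gain: ', root_split['infogain'])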

3. Building Tree

The function build_tree(x, y, max_depth) below constructs a tree. The inputs are:

  • The data matrix of features, x in R^{nXd}. n is the number of data points and d is the feature dimensionality.
  • y, a column vector of size n containing the target value for each data point in x.
  • The maximum depth of the tree, max_depth.

The output of this function is a dictionary. If it has generated a leaf node then the keys are:

  • 'leaf' : True
  • 'class' : The index of the class to assign to exemplars that land here.

If it has generated a split node then the keys are:

  • 'leaf' : False
  • 'feature': The feature to apply the split to.
  • 'split': The split to test the exemplars feature with.
  • 'infogain': The information gain of this split.
  • 'left' : The left subtree, for exemplars where x[feature_index]<=split
  • 'right' : The right subtree, for exemplars where x[feature_index]>split

In order to build a tree, we find the best split using the find_split() function defined earlier. On either side of the split, we build new subtrees by calling build_tree() again, which creates a recursion. The recursion terminates in two cases: when the maximum depth is reached, or when all target values in a node belong to the same class. In either case a leaf node is created.

***
def build_tree(x, y, max_depth = np.inf):

    # Stop when the depth limit is reached or the node is pure (only one class left)
    if max_depth == 1 or (y == y[0]).all():
        # Leaf node: predict the majority class
        classes, counts = np.unique(y, return_counts=True)
        return {'leaf' : True, 'class' : classes[np.argmax(counts)]}

    else:
        # Find the best split and recurse on each side of it
        move = find_split(x, y)

        left = build_tree(x[move['left_indices'],:], y[move['left_indices']], max_depth - 1)
        right = build_tree(x[move['right_indices'],:], y[move['right_indices']], max_depth - 1)

        return {'leaf' : False,
                'feature' : move['feature'],
                'split' : move['split'],
                'infogain' : move['infogain'],
                'left' : left,
                'right' : right}
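
As a usage sketch, we can build a shallow tree and inspect its root decision; the depth of 3 here is arbitrary:

# Build a shallow tree and inspect its root split
shallow_tree = build_tree(x_train, y_train, max_depth=3)
print('Root splits on', feature_names[shallow_tree['feature']],
      'at threshold', shallow_tree['split'])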

4. Predicting Values

After building the tree we should be able to predict the class of a sample. We do that by propagating the sample through the tree, i.e. we check all the splitting conditions until the sample falls in a leaf node, in which case the class of the leaf node is attributed to the sample.

The recursive function predict(tree, samples) takes as input the constructed tree and a data matrix in R^{nXd} representing many data points, and recursively propagates them through the branches of the tree. The output of this function is an array containing the predictions for all the samples in the input data array.

***
def predict(tree, samples):

    # Predictions, initialised to -1 until a leaf node assigns a class
    ret = np.empty(samples.shape[0], dtype=int)
    ret.fill(-1)
    indices = np.arange(samples.shape[0])

    def traverse(node, indices):
        nonlocal samples
        nonlocal ret

        if node['leaf']:
            # Leaf node: assign its class to every sample that reached it
            ret[indices] = node['class']

        else:
            # Route each sample left or right according to the split condition
            going_left = samples[indices, node['feature']] <= node['split']
            left_indices = indices[going_left]
            right_indices = indices[np.logical_not(going_left)]

            if left_indices.shape[0] > 0:
                traverse(node['left'], left_indices)

            if right_indices.shape[0] > 0:
                traverse(node['right'], right_indices)

    traverse(tree, indices)
    return ret
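
Once a tree has been built, predictions for the whole test set can be obtained in a single call. A compact accuracy check, equivalent to the longer loops in the next section and using an arbitrary depth of 4, could be:

# Predict every test example and compute the fraction classified correctly
tree = build_tree(x_train, y_train, max_depth=4)
test_accuracy = np.mean(predict(tree, x_test) == y_test)
print('Test accuracy: ', 100 * test_accuracy, '%')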

5. Accuracy Testing

The accuracy of a tree is calculated as the percentage of correct predictions on both the training and the test data. Comparing the two leads to one of three scenarios: (i) comparably high accuracy on both (a well-fitted model), (ii) high training accuracy but low test accuracy (an overfitted model), or (iii) comparably low accuracy on both (an unsuitable model).

def decision_tree_changable_depth(max_depth):

    # Build a tree of the given depth and predict on both splits
    tree = build_tree(x_train, y_train, max_depth)
    pred_train = predict(tree, x_train)
    pred_test = predict(tree, x_test)

    # Count correct predictions on the training data
    predicted_train_correct = 0
    for i in range(len(pred_train)):
        if pred_train[i] == y_train[i]:
            predicted_train_correct += 1

    # Count correct predictions on the test data
    predicted_test_correct = 0
    for i in range(len(pred_test)):
        if pred_test[i] == y_test[i]:
            predicted_test_correct += 1

    print('For Max Depth: ', max_depth)
    print('Correct Predictions for training data: ', predicted_train_correct)
    print('Correct Predictions for test data: ', predicted_test_correct)

    acc_train = 100 * predicted_train_correct / float(len(y_train))
    acc_test = 100 * predicted_test_correct / float(len(y_test))

    print('Accuracy of Training Data: ', acc_train, '%')
    print('Accuracy of Test Data: ', acc_test, '%')
    print()

    return acc_train, acc_test

6. Finding the best parameter

In order to find the best decision tree for our dataset, we experiment with the depth of the tree; a good range to test is range(2, 6).

best_max_depth = 0
best_acc_train = 0
best_acc_test = 0

for max_depth in range(2, 6):
    acc_train, acc_test = decision_tree_changable_depth(max_depth)
    if acc_test > best_acc_test:
        best_max_depth = max_depth
        best_acc_train = acc_train
        best_acc_test = acc_test

print('Best Max Depth is: ', best_max_depth)
print('Its training data accuracy is: ', best_acc_train, '%')
print('Its test data accuracy is: ', best_acc_test, '%')

The results show that the test accuracy is highest for the decision tree with a maximum depth of 4, at 93.86%.

Accuracy results
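
For reference, scikit-learn ships its own DecisionTreeClassifier, which implements the same idea with a more optimised splitter. A comparison sketch (not part of the original tutorial) on the same train/test split would be:

from sklearn.tree import DecisionTreeClassifier

# Off-the-shelf decision tree with the same depth limit, for comparison only
clf = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=0)
clf.fit(x_train, y_train)
print('sklearn test accuracy: ', 100 * clf.score(x_test, y_test), '%')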

This wraps up my tutorial on decision trees. If you made it this far, thank you for supporting my work. I will be writing more content in the healthcare niche, so stay tuned!

[1] K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97–101, 1992

[2] K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23–34

[3] Petri, C., 2020. Decision Trees. [online] Cs.ubbcluj.ro. Available at: http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf [Accessed 12 April 2020].

[4] Song, Y.Y. and Ying, L.U., 2015. Decision tree methods: applications for classification and prediction. Shanghai archives of psychiatry, 27(2), p.130.

[5] BBC News. 2020. AI ‘Outperforms’ Doctors Diagnosing Breast Cancer. [online] Available at: https://www.bbc.com/news/health-50857759 [Accessed 12 April 2020].
