Machine Learning Starter with Breast Cancer Detection

Asif Ahmed Sohan
Analytics Vidhya
Dec 25, 2019

Start learning Machine Learning today with real-world problems!

Breast cancer is the most commonly occurring cancer in women and the second most common cancer overall. There were over 2 million new cases in 2018, making it a significant present-day health problem.

The key challenge in breast cancer detection is to classify tumors as malignant or benign. Malignant refers to cancer cells that can invade and destroy nearby tissue and spread to other parts of the body. A benign tumor, in contrast, does not invade surrounding tissue or spread, and is generally far less dangerous. Machine learning techniques can significantly improve the accuracy of early diagnosis.

In this article, our goal is to classify tumors as malignant or benign with machine learning, using a dataset of features extracted from cell images.

Phase 1: Environment Setup

You can install Anaconda on your system by following this link. Then open your terminal and type jupyter notebook to launch the Jupyter Notebook app. The notebook interface will appear in a new browser window or tab. You can follow this document to get familiar with Jupyter Notebook.

Phase 2: Import Libraries and Dataset

#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

We can view all the data by running cancer in a cell.

Phase 3: Data Visualization

To view the data in a better format we can use cancer.keys(), which returns all the keys of the dataset dictionary. We can then view any particular entry, such as the dataset description, with print(cancer['DESCR']).
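For example (the exact set of keys can vary slightly between scikit-learn versions):

print(cancer.keys())
# dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(cancer['data'].shape)       # (569, 30): 569 samples, 30 features each
print(cancer['target_names'])     # ['malignant' 'benign']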

We will be using a pandas DataFrame to hold all our data. We will create a DataFrame containing both the cancer data and the target, which lets us store all the inputs and outputs in one place. To do this, we append 'target' to the cancer feature_names and use the result as the column names.

df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
df_cancer.head()
Fig: dataframe

df_cancer.head() returns the first few rows, with the target column included.

We are going to use seaborn's pairplot to visualize the data. Here we plot the first six features, but you can try all 30.

sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness', 'mean compactness'])

Here, blue points indicate malignant cases, which are life-threatening, and orange points represent benign ones. To show the correlation among the features we can use a seaborn heatmap.
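The article does not show the exact call behind the figure, but a correlation heatmap like the one below can be produced roughly as follows:

plt.figure(figsize = (20, 10))
sns.heatmap(df_cancer.corr(), annot = True)   # pairwise correlation of all columns
plt.show()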

Fig: Heatmap

If we observe the heatmap, we can see correlation values ranging between -1 and 1; the closer a value is to 1, the stronger the correlation between those two features. You can go through the correlation matrix to get an overview of how the features in the data relate to each other.

Phase 4: Model Training

To train our model, we first separate inputs from outputs. We drop the target column from the DataFrame with x = df_cancer.drop(['target'], axis=1) and define our output/target column with y = df_cancer['target'].
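Collected in one place (the same two lines, with a reminder of what the labels mean):

x = df_cancer.drop(['target'], axis=1)   # the 30 feature columns
y = df_cancer['target']                  # 0 = malignant, 1 = benign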

In order to train our model, we need to split the data into a training set and a testing set. After the model is trained, we will use the testing data to evaluate its cancer predictions.

Splitting the dataset by hand is impractical, and the split also needs to be random. To help with this task, we will use the scikit-learn function train_test_split, keeping 80% of the dataset for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)

We also apply data normalization to bring all feature values into the range 0 to 1. This helps the model train more accurately, since no feature dominates simply because of its scale.
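Concretely, this is min-max normalization: each feature value is rescaled as X_scaled = (X - X_min) / (X_max - X_min), so the smallest value of every feature maps to 0 and the largest to 1, which is exactly what the figure and code below implement.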

Fig: Data normalization formula
min_train = X_train.min()                   # per-feature minimum of the training data
range_train = (X_train - min_train).max()   # per-feature range (max - min)
X_train_scaled = (X_train - min_train) / range_train
sns.scatterplot(x = X_train['mean area'], y = X_train['mean smoothness'], hue = y_train)
sns.scatterplot(x = X_train_scaled['mean area'], y = X_train_scaled['mean smoothness'], hue = y_train)
Fig: Left (before normalization) and right (after normalization)

We will also need to perform normalization on our testing dataset.

min_test = X_test.min()
range_test = (X_test - min_test).max()
X_test_scaled = (X_test - min_test) / range_test
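As an aside that is not part of the original walkthrough: in practice the test set is usually rescaled with the minimum and range learned from the training data, for example with scikit-learn's MinMaxScaler, so that both sets are mapped with exactly the same statistics. A minimal sketch (the variable names here are illustrative):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                      # learns per-feature min and range
X_train_mm = scaler.fit_transform(X_train)   # fit on the training data only
X_test_mm = scaler.transform(X_test)         # reuse the training statistics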

We are using a Support Vector Classifier (SVC) as our model. Support Vector Machines are best known as linear classifiers, but with kernels they can solve non-linear problems as well; they separate the two classes with a hyperplane. Our objective in machine learning is a model that generalizes, so it can correctly identify most samples as malignant or benign even when it has never seen them before.
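A quick note on scikit-learn's SVC, not shown in the original article: by default it uses the RBF kernel, which can learn non-linear boundaries, while a purely linear separating hyperplane can be requested explicitly.

from sklearn.svm import SVC

# Illustrative only; the model trained later in this article uses SVC() with its defaults.
svc_rbf = SVC(kernel='rbf', C=1.0)        # non-linear decision boundary (default kernel)
svc_linear = SVC(kernel='linear', C=1.0)  # a single separating hyperplane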

Fig: Support vector machine

Phase 5: Model Evaluation

We will use a confusion matrix to show the results on our testing dataset and evaluate our model accordingly.

Fig: confusion matrix

If a prediction matches the true class, we are all good, and we can count how many samples were classified correctly. However, if the prediction says the patient has cancer while the true class is negative, that is a false positive (FP); the prediction is wrong but not dangerous, and it is known as a Type 1 error. If the prediction says the patient does not have cancer while the true class is positive, that is a false negative, known as a Type 2 error. We need to avoid Type 2 errors at all costs, because missing a life-threatening disease is far more serious.

from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

svc_model = SVC()                              # default parameters
svc_model.fit(X_train_scaled, y_train)         # train on the normalized training data
y_predict = svc_model.predict(X_test_scaled)   # predict on the normalized test data
cm = confusion_matrix(y_test, y_predict)
ax = sns.heatmap(cm, annot = True)
ax.set_ylim(2.0, 0)   # keeps heatmap rows from being cut off in some matplotlib versions
Fig: predicted result in confusion matrix

Here we can see a total of 109 correct predictions and 5 Type 1 errors. As mentioned, Type 1 errors are not severe, and we have successfully avoided all Type 2 errors.

From the classification report, we can get a summary of the model's performance on our predictions. The accuracy of our model is about 96%.
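The report comes from the classification_report function we imported earlier:

print(classification_report(y_test, y_predict))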

That's it: we have successfully built a program to detect breast cancer using machine learning, and it classifies tumors effectively.

