Breast Cancer Detection Using Machine Learning Algorithms

Bhaskar Borah
Published in Analytics Vidhya
5 min read · Jan 19, 2020
Image taken from: https://www.youtube.com/watch?v=NSSOyhJBmWY

In this article, we will talk about a method for detecting breast cancer. Breast cancer is cancer that develops from breast tissue. If it is not identified at an early stage, it can be fatal. Worldwide, roughly 12% of women are affected by breast cancer, and the number is still increasing.

To tackle this problem, machine learning (ML) can play a very important role. ML algorithms help determine whether cells are malignant or benign, and they can do so efficiently.

I have used the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle. Follow this link to download the data set: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

So, let us start the project.

1 — Importing of Essential Libraries
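
The original import listing is not preserved here, so the following is only a minimal sketch of the libraries a walkthrough like this typically needs. The exact set used in the project is an assumption.

```python
# Assumed core libraries for this walkthrough (the original list is not shown).
import numpy as np   # numerical arrays
import pandas as pd  # loading and handling tabular data

print("numpy", np.__version__)
print("pandas", pd.__version__)
```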

2 — Importing of the Data Set

We import the data set with the Python library pandas and divide it into the dependent variable (y) and the independent variables (X). The dependent variable holds the cell status (M or B), and the independent variables consist of 30 features.
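
A minimal sketch of this step follows. The helper name `split_features_target` is hypothetical, and the column names `id`, `diagnosis`, and `Unnamed: 32`, as well as the file name `data.csv`, are assumptions about the Kaggle CSV layout. The two rows of toy data are illustrative only, so that the sketch runs without the download.

```python
import io
import pandas as pd

def split_features_target(df):
    """Return (X, y): the feature columns and the diagnosis column as arrays."""
    # Assumed non-feature columns in the Kaggle CSV; errors="ignore" skips
    # any of them that are not present.
    X = df.drop(columns=["id", "diagnosis", "Unnamed: 32"], errors="ignore")
    y = df["diagnosis"]
    return X.values, y.values

# Toy rows in the assumed layout (illustrative values, not real records).
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "1,M,18.0,10.0\n"
    "2,B,13.0,14.0\n"
)
df = pd.read_csv(sample_csv)  # with the real file: pd.read_csv("data.csv")
X, y = split_features_target(df)
print(X.shape, list(y))
```

With the full data set, X would be expected to have 30 columns, one per feature.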

Output->

X Set
y Set

3 — Encoding the Categorical Data

Because the dependent variable (y) consists of strings, it has to be encoded in binary form so that the model has no trouble classifying it. To encode it, we use the LabelEncoder class from the scikit-learn library.
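
As a small illustration of the encoding, here is a sketch on four toy labels (not taken from the data set). LabelEncoder sorts the class names alphabetically, which is why B maps to 0 and M maps to 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["M", "B", "B", "M"])  # toy labels for illustration

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # classes are sorted: 'B' -> 0, 'M' -> 1

print(encoder.classes_)  # ['B' 'M']
print(y_encoded)         # [1 0 0 1]
```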

Output->

1=M and 0=B

4 — Splitting the Data set into Training Set and Test Set

To train and test the model, we first split the data into a training set and a test set. For this we use the model_selection module from scikit-learn.
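
The split can be sketched as below. To keep the example runnable without the Kaggle download, it uses scikit-learn's bundled copy of the Wisconsin Diagnostic data as a stand-in, which is an assumption. The `test_size=0.1` and `random_state=0` values are also assumptions, since the article does not state them; a 10% split gives 57 test samples, which would be consistent with the accuracy figures reported later.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in for the Kaggle CSV: the bundled Wisconsin Diagnostic data set.
data = load_breast_cancer()
X = data.data
y = 1 - data.target  # bundled labels are 0 = malignant, 1 = benign; flip to 1 = M, 0 = B

# test_size and random_state are assumed values, not taken from the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)
print(X_train.shape, X_test.shape)  # (512, 30) (57, 30)
```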

5 — Feature Scaling the Data

Because the features are measured on very different scales, we feature scale the data so that no single feature dominates the model. To do this we use the StandardScaler class from the preprocessing module of scikit-learn.
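
A minimal sketch of standard scaling on toy numbers follows. The key point it tries to show is that the scaler learns the mean and standard deviation from the training set only and then reuses them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales.
X_train = np.array([[10.0, 200.0], [12.0, 400.0], [14.0, 600.0]])
X_test = np.array([[12.0, 400.0]])  # equals the training mean of each column

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training set
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_test_scaled)                # [[0. 0.]]
```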

Machine Learning Model Building

We have to build the best model to classify the cells. To do so, we train and test several machine learning algorithms on the data set and pick the model with the highest accuracy.
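
The comparison can be sketched as a loop over the four classifiers covered in the sections below. This is only an illustration under stated assumptions: it uses scikit-learn's bundled copy of the data as a stand-in for the Kaggle CSV, and the split settings and hyperparameters are assumed, so the scores it prints will not necessarily match the ones reported in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; labels flipped so that 1 = malignant, 0 = benign.
data = load_breast_cancer()
X, y = data.data, 1 - data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0  # assumed split settings
)

# Scale using statistics learned from the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hyperparameters here are assumptions, not the article's settings.
models = {
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = (accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))
    print(f"{name}: accuracy={results[name][0]:.4f}, f1={results[name][1]:.4f}")
```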

Random Forest Regression Model

Because the random forest regression model outputs floating-point numbers rather than class labels, we converted its predictions to binary form before computing the scores.
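
One way to do this conversion is to threshold the regressor's output. The sketch below assumes a cut-off of 0.5, which the article does not specify, and again uses the bundled data set and assumed split settings as a stand-in. Scaling is skipped here because tree-based models do not depend on feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0  # assumed split settings
)

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)

y_raw = regressor.predict(X_test)    # floating-point values between 0 and 1
y_pred = (y_raw >= 0.5).astype(int)  # assumed threshold of 0.5

print("accuracy_score =", accuracy_score(y_test, y_pred))
print("f1_score =", f1_score(y_test, y_pred))
```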

Output->

accuracy_score = 0.9298245614035088

f1_score = 0.9166666666666666

K-Nearest Neighbor Model

Output->

accuracy_score = 0.9473684210526315

f1_score = 0.9333333333333332

Naive Bayes Model

Output->

accuracy_score = 0.9122807017543859

f1_score = 0.8979591836734693

Decision Tree Classification Model

Output->

accuracy_score = 0.9298245614035088

f1_score = 0.9130434782608695

Random Forest Classification Model

Output->

accuracy_score = 0.9649122807017544

f1_score = 0.9565217391304348

Confusion Matrix
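
To show how a confusion matrix is read, here is a small sketch on toy labels and predictions (not the project's actual results). With labels 0 and 1, scikit-learn puts the true classes on the rows and the predicted classes on the columns, so the off-diagonal entries are the wrong predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy values for illustration: 1 = malignant, 0 = benign.
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # row = true class, column = predicted class

print(cm)
# [[3 1]
#  [1 3]]
print("wrong predictions:", fp + fn)  # wrong predictions: 2
```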

Output->

The random forest classifier gives an accuracy of 96.49%, with 2 wrong predictions. We can therefore say that the random forest classifier is our best model for classifying the cells.

Saving and Loading the Model

After the machine learning project is complete, the ML model needs to be deployed in an application. To deploy the model, we first need to save it, and for that we can use the pickle or joblib package. Here I have used Python's pickle package to save and load the model. The package writes the model to a file at the given path; later, to deploy the model, we can simply load it from that pickle file.
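
A minimal sketch of saving and loading with pickle follows. The file name `breast_cancer_rf.pkl` and the temporary directory are hypothetical choices, and the model is trained on scikit-learn's bundled copy of the data as a stand-in.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # 1 = malignant, 0 = benign
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Hypothetical file location for the saved model.
path = os.path.join(tempfile.gettempdir(), "breast_cancer_rf.pkl")

with open(path, "wb") as f:  # save the trained model
    pickle.dump(model, f)

with open(path, "rb") as f:  # load it back, as a deployed application would
    loaded_model = pickle.load(f)

same = (loaded_model.predict(X) == model.predict(X)).all()
print("loaded model reproduces predictions:", same)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.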

Conclusion:

After training all the algorithms, we found that the Random Forest Regression, KNN, Decision Tree Classification, and Random Forest Classification models all have high accuracy. Of these, we choose the Random Forest Classification model, as it gives the highest accuracy.

Please share your valuable feedback on this article, along with any questions, so that I can update it as soon as possible.

I hope you understood this machine learning project and enjoyed it. I hope my efforts will be valuable for saving the lives of breast cancer patients. I hope improved models with higher accuracy come along and help us tackle this problem.


Bhaskar Borah
Analytics Vidhya

I am an engineering undergraduate student at Assam Engineering College, India. My fields of interest are machine learning and artificial intelligence.