Cancer Prediction Model

Shruti Garg
Analytics Vidhya
Published in
3 min readJan 20, 2020

In this article, we will focus on building the Logistic Regression Model in Python to predict the type of Breast Cancer Tumors.

Now, let us first understand, what BREAST CANCER is?

Well, Breast Cancer is a disease in which cells in the breast grow out of control.

Here we’ll be focusing on a dataset that has the data of different ‘customers id’ diagnosed with two types of Breast Cancer tumors. We’ll be building a Logistic Regression Model and also using the Random Forest Classifier.

Logistic Regression: Logistic Regression is the technique of finding relationships between a set of input variables and an output variable (just like any regression), but the output variable, in this case, would be a binary outcome ( i.e yes/no or 1/0).

You can look at the data here: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

The dataset has two types of Diagnosis :

1) Malignant Tumor (M) -They Are Cancerous.

2) Benign Tumor (B)-They are Non-Cancerous.

To build a Logistic Regression model, we will be representing diagnosis M as “1” and B as “0”.

Let us look at the columns of the data:

Clearly, we don’t have any null values in the data. Therefore, no missing value treatment needs to be done.

The libraries which are needed to be imported in python are as follows:

Representing diagnosis M as “1” and B as “0”.

Therefore, 357 number of people are diagnosed with Benign Tumor whereas 212 are diagnosed with Malignant Tumor.

Now, we’ll be splitting the data into the testing data and the training data and will be building a logistic regression model and also we will be checking the accuracy of the model.

Therefore, we get 94.4% of the accuracy with the Logistic Regression Model.

Now let us see how much the accuracy will be there if we use Random Forest Classifier.

Therefore, we get 99.3% of the accuracy by using Random Forest Classifier.

Let us now see the Confusion Matrix for Logistic Regression and Random Forest Classifier.

Confusion Matrix: It is a table that is used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

CONCLUSION:

LOGISTIC REGRESSION MODEL

We can clearly observe from above confusion matrix that from our testing data, 6 customers had the Malignant Tumor but were tested as Benign whereas 2 customers which had Benign Tumor were tested as Malignant Tumor in our model.

RANDOM FOREST CLASSIFICATION

With 99% of accuracy, only 1 customer that was having Benign Tumor were tested as Malignant Tumor in our model.

Therefore, with this article, we get an idea of how we can use the Logistic Regression which is one of the popular algorithm to solve a classification problem.

Thanks for reading:)

--

--