Classification of breast cancer using Intel® DevCloud

Debanjona Bhattacharjya
Intel Software Innovators
7 min read · Jun 29, 2019
DNA CELL FOR BREAST CANCER

Health care is one of the most important concerns in any country, and it has one of the biggest impacts on the welfare of the country as a whole.

Machine learning, deep learning, and computer vision are seeing increasing use in healthcare. I recently got my hands dirty with machine learning and began working on the classification of breast cancer (malignant and benign). This article describes the procedure I followed to implement the project.

Let's get started.

Background:

Breast cancer is the uncontrolled growth of cells in the breast. Signs of breast cancer may include a lump in the breast, a change in breast shape, dimpling of the skin, fluid coming from the nipple, a newly inverted nipple, or a red or scaly patch of skin. In extreme cases, there may be bone pain, swollen lymph nodes, shortness of breath, or yellow skin.

There are several ways to detect cancerous cells in the breast. These include breast exams, mammograms, breast ultrasound, and biopsy. Of these, biopsy is by far the most effective way to detect breast cancer. In a biopsy, the clinician uses a special needle to remove a small sample of cells or tissue from the suspicious area and sends it to a laboratory, where experts determine whether the cells are cancerous.

Early diagnosis significantly increases the chances of survival. The main task is to classify a tumor as malignant (cancerous) or benign (non-cancerous). A malignant tumor is one in which the cells can grow into surrounding tissues or spread to distant areas of the body. A tumor is benign if it does not invade nearby tissue or spread to other parts of the body the way cancerous tumors can. However, benign tumors can still be serious if they press on vital structures such as blood vessels or nerves.

Project Task:

In this study, my task is to classify cancer cells into two categories, benign and malignant (explained in the “Background” section), using machine learning techniques. I have used the Breast Cancer Wisconsin dataset from Kaggle. You can download it from here.

I use this dataset to train a model and classify the cells. Intel® DevCloud is used for faster computation and training.

Intel® DevCloud:

Intel® DevCloud is a cloud-hosted hardware and software platform available to developers, researchers, and startups to learn, sandbox, and get started on their Artificial Intelligence projects. Intel Academy members can gain free access to the DevCloud powered by Intel® Xeon® Scalable Processors for their machine learning and deep learning compute needs. You can register for DevCloud here.

Datasets:

The dataset is the Breast Cancer Wisconsin dataset available on Kaggle. It is also available in the UCI Machine Learning Repository. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from the center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter² / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension (“coastline approximation” - 1)
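
The compactness formula in the list above can be checked with a quick calculation. This is only an illustrative sketch: the function name and the two example shapes (a circle and a 10 x 40 rectangle) are my own assumptions and are not taken from the dataset.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as written in the attribute list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# A perfect circle is the most compact outline; its value is 4*pi - 1 for any radius
r = 10.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A hypothetical elongated outline (10 x 40 rectangle) gives a larger value
rectangle = compactness(2 * (10 + 40), 10 * 40)

print(round(circle, 3), round(rectangle, 3))
```

The more irregular or elongated the outline, the larger the value becomes.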

Procedure:

1. Setting up the DevCloud:

I got access to the Intel® DevCloud by registering through the link mentioned in the DevCloud section. There are two ways to connect: through a terminal (SSH client) or through Jupyter Notebook. I chose Jupyter Notebook for my project as it is user-friendly and stable.

There are two ways to open Jupyter Notebook on the DevCloud: if you are already logged in, you can press the one-click login button; otherwise, you can enter your user ID and password at the URL given there.

SSH client
Jupyter Notebook

Now the dataset and the DevCloud are ready, so let’s move on.

In Jupyter Notebook, create a folder using the New menu on the upper right side. Upload the dataset (data.csv) to that folder and create a new file with the .ipynb extension. The code is written in this .ipynb file.

2. Importing the libraries and dataset:

The libraries needed for computation and plotting are imported using the import statement.

The dataset is loaded as a data frame.
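
A minimal sketch of this step, assuming pandas is the data frame library. The helper name `load_dataset`, the column names, and the two sample rows are made up for illustration so that the snippet runs without the real file; on the DevCloud you would pass the path "data.csv" instead.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read the CSV into a DataFrame; `source` is a path such as "data.csv" or a file-like object."""
    return pd.read_csv(source)


# Made-up two-row sample in an assumed column layout, used only to keep the sketch runnable
sample = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "1,M,17.5,10.2\n"
    "2,B,12.1,18.3\n"
)

df = load_dataset(sample)
print(df.head())   # first rows of the data frame
print(df.shape)    # (rows, columns)
```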

Libraries used
dataset

Output:

3. Cleaning and preparing the data:
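
One way this step could look, as a hedged sketch rather than the exact code used in the project: it assumes the ID column is not useful for prediction and that the CSV may contain a completely empty column (called "Unnamed: 32" here as an assumption). The rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Unnamed: 32" mimics an assumed empty trailing column
df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.5, 12.1, 11.8],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# The ID number carries no diagnostic information, so it is dropped
df = df.drop(columns=["id"])

# Remove every column in which all values are missing
df = df.dropna(axis=1, how="all")

# Count the remaining missing values per column
print(df.isnull().sum())
```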

4. Exploring the data:
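
A short exploration sketch. To keep it runnable without data.csv, it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data as a stand-in; its column names (for example "mean radius") differ from those in the CSV, and its "target" column codes malignant as 0 and benign as 1.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # 30 feature columns plus a "target" column

# Size of the table
print(df.shape)

# Number of samples per class (0 = malignant, 1 = benign in this copy)
print(df["target"].value_counts())

# Summary statistics for two of the features
print(df[["mean radius", "mean area"]].describe())
```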

Output:

5. Splitting the diagnosis into malignant and benign (1 and 0):
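
A minimal sketch of the encoding, assuming the labels are stored in a "diagnosis" column as M and B (as listed under "Attribute Information"). The four labels below are made up.

```python
import pandas as pd

# Made-up labels in the M/B format described in the attribute list
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Map malignant to 1 and benign to 0 so that classifiers receive numeric labels
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

print(df["diagnosis"].tolist())
```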

6. Stacking the data:

In the sixth step, the data is stacked and the final layout is plotted using the matplotlib library. The source code is given below:
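
The original code listing is not reproduced here, so the following is only a sketch of how stacked histograms of this kind can be drawn. It assumes scikit-learn's bundled copy of the dataset (malignant coded as 0) and plots the ten "mean" features; the grid layout, bin count, and colors are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line when working in Jupyter
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.data
is_malignant = data.target == 0  # this copy codes malignant as 0

features = list(X.columns[:10])  # the first ten columns are the "mean" features

fig, axes = plt.subplots(5, 2, figsize=(10, 14))
for ax, name in zip(axes.ravel(), features):
    # Stack the malignant and benign distributions in one histogram per feature
    ax.hist(
        [X.loc[is_malignant, name].to_numpy(), X.loc[~is_malignant, name].to_numpy()],
        bins=30,
        stacked=True,
        color=["r", "g"],
        label=["malignant", "benign"],
    )
    ax.set_title(name)
    ax.legend(loc="upper right")
fig.tight_layout()
```

Features whose two colors overlap little are the ones that separate the classes well.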

The output of this code is here:

Observation:

From the histograms, it is observed that the mean values of cell radius, perimeter, area, compactness, and concave points can be used for classification.

7. Creating the training and testing datasets:

To train the models and then test them to estimate their accuracy, the dataset is split into training and testing sets in a 70:30 ratio. The “train_test_split” function is used for this purpose.
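
A minimal sketch of the 70:30 split with scikit-learn's `train_test_split`. It assumes scikit-learn's bundled copy of the dataset in place of data.csv, and the `random_state` value is an arbitrary choice added only to make the split reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

df = load_breast_cancer(as_frame=True).frame

# test_size=0.3 keeps 70% of the rows for training and 30% for testing
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

print(len(train_df), len(test_df))
```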

splitting of data

8. Creating the classification model:

In this step, a function is created for the classification model, and the performance of various classifiers is observed.
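
A helper of this kind might look like the sketch below. The function name `classification_model`, its signature, and the use of 5-fold cross-validation are assumptions, not the project's actual code; the bundled scikit-learn copy of the dataset stands in for data.csv, with labels flipped so that 1 means malignant.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


def classification_model(model, X_train, y_train, folds=5):
    """Fit `model` and return (training accuracy, mean cross-validation accuracy)."""
    # cross_val_score fits clones of the model, so the model itself is still unfitted here
    cv_score = cross_val_score(model, X_train, y_train, cv=folds).mean()
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    return train_accuracy, cv_score


X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = 1 - y  # flip labels: 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

train_acc, cv_acc = classification_model(DecisionTreeClassifier(random_state=42), X_train, y_train)
print(f"training accuracy: {train_acc:.3f}, cross-validation accuracy: {cv_acc:.3f}")
```

Any scikit-learn classifier can be passed in place of the decision tree, which makes it easy to compare models on equal terms.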

classification model

Decision Tree model:

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. (Source: scikit-learn)

Based on the histograms, only the mean values of radius, area, perimeter, compactness, and concave points are used as predictors.
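
A sketch of a decision tree restricted to those five predictors, assuming scikit-learn's bundled copy of the dataset (where the columns are named "mean radius" and so on) and 5-fold cross-validation. The score it prints need not match the figures reported in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = 1 - y  # flip labels: 1 = malignant, 0 = benign

# The five mean-value predictors picked out from the histograms
predictors = [
    "mean radius",
    "mean perimeter",
    "mean area",
    "mean compactness",
    "mean concave points",
]

tree = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(tree, X[predictors], y, cv=5)
print(f"mean cross-validation accuracy: {scores.mean():.3f}")
```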

Logistic Regression Model:

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success etc.) or 0 (FALSE, failure etc.).
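
To illustrate the binary outcome, the sketch below fits a logistic regression and prints both the predicted probability of malignancy and the resulting 0/1 label for a few samples. The choice of predictors, the feature scaling step, and the bundled scikit-learn copy of the dataset are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = 1 - y  # flip labels: 1 = malignant, 0 = benign

predictors = ["mean radius", "mean perimeter", "mean area", "mean compactness", "mean concave points"]

# Scaling the features first helps the solver converge
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X[predictors], y)

first_rows = X[predictors].iloc[:3]
proba = model.predict_proba(first_rows)[:, 1]  # estimated probability of class 1 (malignant)
labels = model.predict(first_rows)             # dichotomous outcome: 1 or 0

for p, label in zip(proba, labels):
    print(f"P(malignant) = {p:.3f} -> predicted label {label}")
```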

Random Forest Model:

A random forest is a data construct applied to machine learning that develops large numbers of random decision trees analyzing sets of variables. This type of algorithm helps to enhance the ways that technologies analyze complex data.

Using all the features improves both the prediction accuracy and the cross-validation score.
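
A sketch of a random forest trained on all the features, again assuming scikit-learn's bundled copy of the dataset. The number of trees and the random seed are arbitrary choices, and the printed score may differ from the one reported here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = 1 - y  # flip labels: 1 = malignant, 0 = benign

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validation score using every feature column
cv_accuracy = cross_val_score(forest, X, y, cv=5).mean()
print(f"mean cross-validation accuracy: {cv_accuracy:.3f}")

# Fit once on all rows to see which features the forest relies on most
forest.fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(5))
```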

Three different types of classifiers have been trained on the training data.

Now, the test data is examined.

Test Data:
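
A sketch of evaluating a trained model on the held-out 30%, assuming the random forest and scikit-learn's bundled copy of the dataset. The seed is arbitrary, so the accuracy it prints is not expected to reproduce the numbers in the conclusion exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = 1 - y  # flip labels: 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Predict on rows the model has never seen
y_pred = forest.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(f"test accuracy: {test_accuracy:.3f}")
print(matrix)
```

The confusion matrix is worth checking alongside the accuracy, since a malignant tumor predicted as benign is the costlier mistake.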

Conclusion:

After training and testing, the random forest classifier gives the highest accuracy, with a prediction accuracy of 95% and a testing accuracy of 93%.

I hope this accuracy can be increased further by trying other models or tuning these ones. I plan to develop this project further and deploy it as a web app.

Thanks for reading.
