
Advancing Breast Cancer Detection: A Machine Learning Breakthrough in Medical Imaging

Filippos Dounis · Published in The Startup · Apr 21, 2020 · 20 min read

Photo by National Cancer Institute on Unsplash

Breast cancer is a type of cancer that develops from breast tissue. After skin cancer, breast cancer is the most common cancer diagnosed in women in the United States, as well as the most common form of cancer in women over the age of 50 in the United Kingdom. Although symptoms differ from person to person due to the many variables involved, according to experts, signs and symptoms of breast cancer include, but are not limited to:

Change in the size, shape or appearance of a breast

Changes to the skin over the breast

A newly inverted nipple

Redness or pitting of the skin over one's breast

A breast lump or thickening that feels different from the surrounding tissue

Peeling, scaling, crusting or flaking of the pigmented area of skin surrounding the nipple (areola) or breast skin

These are mere examples of the plethora of possible symptoms a woman (or, more rarely, a man) may experience.

The problem is that many women do not pay much attention to such symptoms. They dismiss them as something random that will simply go away on its own, failing to grasp that it will not. Many have also made the flawed assumption that breast cancer is rare and that they are unlikely ever to become one of its victims. Unfortunately, the data simply do not support this. On the contrary, around 1 in 8 women are diagnosed with breast cancer during their lifetime.

Researchers have identified hormonal, lifestyle and environmental factors as indicators of a person's risk of developing breast cancer. Even so, there are many cases where people with no risk factors develop the disease, while others with several risk factors never do. It thus appears that breast cancer is caused by a complex interaction between one's genetic makeup and environment.

Nipple changes observed in breast cancer patients

The Problem

It should by now be apparent that breast cancer is an issue that concerns a great number of people around the world. The problem is that doctors are not always reliable when detecting this type of cancer. From personal experience, my grandmother had to visit a dozen different radiologists across different continents in order for a consensus to be reached concerning her condition. Even then, there was much uncertainty about whether the final diagnosis was the correct one. Years later, doctors were still giving her different opinions and diagnoses.

In a different setting, perhaps such a situation would be acceptable. When lives are on the line, however, professionals being unable to reach common ground on an ideal course of action is a phenomenon that should be averted at all costs.

I do not blame the doctors for this. I believe they are doing the best they can with their existing knowledge and skills. But accepting the situation as it is, is simply not possible.

The Solution

An ideal and realistic solution would be one that in no case removes the doctors from the equation. On the contrary, a third set of eyes should be established to assist doctors in confirming their diagnoses. This same set of eyes could also be used at home by women interested in getting a first opinion without physically visiting a doctor. Remember, prevention is the best way to solve any problem.

With that thought process in mind, in this article I will attempt to build and compare different models that can successfully detect whether a person has breast cancer.

Methods Presented (Read this!)

In this article, I will be pursuing two different ways to detect breast cancer.

The first method will require the patient to input certain data by hand and it will tell the patient whether he/she has a malignant tumor.

The second method, on the other hand, will only require an image from the patient, from which it will detect whether he/she has cancer and also show in which part of the tissue it is located.

*The dataset used in the second method includes samples of Invasive Ductal Carcinoma (IDC). IDC is the most prevalent form of breast cancer (+ 80% of the cases). For more information click here.

Key Terms

In order to proceed, it is crucial to become acquainted with certain key terms that will be used throughout this article.

(If you are not interested in understanding the key-words in-depth, you can directly go to the Conclusion)

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression[1] (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).

Logistic Regression Model Example

K Nearest Neighbor Algorithm

In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.
K Nearest Neighbor

Support Vector Machine

In machine learning, support-vector machines (SVMs, also support-vector networks[1]) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall.

Naive Bayes classifier

In machine learning, naïve Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.[1] Coupled with kernel density estimation, however, they can achieve higher accuracy levels.

Decision Tree Classifier

Decision tree learning is one of the predictive modeling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.

Decision Tree Visualization

Random Forest Classifier

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.[1][2] Random decision forests correct for decision trees’ habit of overfitting to their training set.

Features

A feature is a measurable property of an object we are analyzing. In a dataset, features appear as columns and are the different characteristics of an object (e.g. price, location, id).

Confusion Matrix

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

NaN values

In computing, NaN (Not a Number) is a value representing an undefined or unrepresentable numeric result.

Overfitting

In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

Underfitting/ Good fit/ Overfitting

Preparing the dataset and work environment

First, a supported version of Python needs to be installed. To do so, navigate to this link and follow the instructions for your operating system of choice.

I will be using Python 3.6.9 and Ubuntu 18.04.4 LTS as my operating system. Nevertheless, any supported Python version should work.

Before proceeding with installing the required libraries, pip must also be installed. (pip comes bundled with all Python versions after 2.7.9, but if it is not already installed, follow this guide.)

Libraries

The following libraries should be installed with pip (torch and torchvision are only needed for the second, image-based method):

pip3 install numpy
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
pip3 install scikit-learn
pip3 install torch torchvision

Dataset

Having the right dataset is undoubtedly one of the most important aspects of any data science project. In this case scenario, two different datasets are needed. The first dataset will be used for the first application of the models and it must contain the following features:

- diagnosis: The diagnosis of breast tissues (M = malignant, B = benign).
- radius_mean: Mean of distances from center to points on the perimeter.
- texture_mean: Standard deviation of gray-scale values.
- perimeter_mean: Mean size of the core tumor.
- smoothness_mean: Mean of local variation in radius lengths.
- compactness_mean: Mean of perimeter^2 / area - 1.0.
- concavity_mean: Mean of severity of concave portions of the contour.
- concave points_mean: Mean for number of concave portions of the contour.
- fractal_dimension_mean: Mean for "coastline approximation" - 1.
- radius_se: Standard error for the mean of distances from center to points on the perimeter.
- texture_se: Standard error for standard deviation of gray-scale values.
- smoothness_se: Standard error for local variation in radius lengths.
- compactness_se: Standard error for perimeter^2 / area - 1.0.
- concavity_se: Standard error for severity of concave portions of the contour.
- concave points_se: Standard error for number of concave portions of the contour.
- fractal_dimension_se: Standard error for "coastline approximation" - 1.
- radius_worst: "Worst" or largest mean value for mean of distances from center to points on the perimeter.
- texture_worst: "Worst" or largest mean value for standard deviation of gray-scale values.
- smoothness_worst: "Worst" or largest mean value for local variation in radius lengths.
- compactness_worst: "Worst" or largest mean value for perimeter^2 / area - 1.0.
- concavity_worst: "Worst" or largest mean value for severity of concave portions of the contour.
- concave points_worst: "Worst" or largest mean value for number of concave portions of the contour.
- fractal_dimension_worst: "Worst" or largest mean value for "coastline approximation" - 1.
- area_mean, symmetry_mean, perimeter_se, area_se, symmetry_se, area_worst, perimeter_worst, symmetry_worst

Any sufficiently large dataset containing the above features should be fine. After closely examining the publicly available datasets I could access, I concluded that the dataset best tailored to the project’s needs is the Breast Cancer Wisconsin (Diagnostic) Data Set.


The second dataset should be one with breast histopathology images. I will personally be using a Kaggle dataset which contains image samples of Invasive Ductal Carcinoma (IDC), the most common subtype of all breast cancers.

Coding

Now that both the libraries and the datasets are set up, it is time to begin the actual coding of the models (I will be using a Jupyter notebook).

(1) Detecting breast cancer using the Breast Cancer Wisconsin (Diagnostic) Data Set

I will begin by importing all necessary libraries and dependencies:
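A minimal set of imports covering everything used in this first method looks roughly like this:

# Core data-handling and visualization libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn utilities used throughout this section
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix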

Now that all libraries have been imported, I will be importing the historical data into a pandas dataframe called ‘data’ and viewing its content.
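Assuming the Wisconsin dataset was downloaded as ‘data.csv’ (the file name it carries on Kaggle), this amounts to:

# Load the historical data and preview the first five patients
data = pd.read_csv('data.csv')
data.head()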

Although I am aware of the number of features in the dataset, I do not know how many patients constitute my dataframe (the head only shows the first five entries). To find out, I will view the shape of the data.
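In pandas this is a one-liner:

# (number of patients, number of columns)
data.shape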

It is evident that there are 569 different patients with 33 distinct features. It is of paramount importance that there are no NaN values included. To verify whether this is the case, I will be using the ‘isnull()’ method.
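A quick way to check is to count the missing values in every column:

# Count NaN values per column
data.isnull().sum()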

The majority of the columns are fine, except for the last one, where all values appear to be NaN. There is no need to be alarmed, as this can be swiftly solved by dropping the column.
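Since the offending column is entirely empty, dropping all-NaN columns is enough; a minimal sketch:

# Drop every column whose values are all NaN (the empty trailing column)
data = data.dropna(axis=1, how='all')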

Now that the dataset is formatted in the desired manner, it is time to examine the number of cases where patients were found to have Benign (B) non-cancerous cells and Malignant (M) cancerous cells.
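Counting the two diagnosis labels is again a one-liner:

# Number of malignant (M) and benign (B) diagnoses
data['diagnosis'].value_counts()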

Everything looks better visualized. I will be hence using seaborn, in order to plot the two different cases on a graph.
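A simple count plot does the job:

# Plot how many benign and malignant cases the dataset contains
sns.countplot(x='diagnosis', data=data)
plt.show()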

There is obviously a problem that needs to be dealt with at once. Column ‘diagnosis’, contains strings (‘M’, ‘B’). Its contents must be converted into the integers 0 and 1 with:

M ---> 1
B ---> 0

It is important that not only the values are changed, but that the data type of the column changes with them. Fortunately, Sklearn’s ‘LabelEncoder’ does exactly what I need.
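A sketch of the encoding step:

# Encode the diagnosis column: B -> 0, M -> 1 (LabelEncoder sorts labels alphabetically)
labelencoder = LabelEncoder()
data['diagnosis'] = labelencoder.fit_transform(data['diagnosis'].values)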

Seaborn can now be used in order to create a pair plot.
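Plotting all 30 features against each other would be unreadable, so this sketch assumes only the first few mean features, colored by diagnosis:

# Pair plot of the first mean features, colored by diagnosis (0 = benign, 1 = malignant)
sns.pairplot(data.iloc[:, 1:6], hue='diagnosis')
plt.show()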

Scatter Plot

With the scatter plot plotted I will now be correlating the features and then visualizing the correlation using a heat map.
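A sketch of both steps, restricted to the first dozen columns so the heat map stays readable:

# Correlate the diagnosis with the first features (column 0 is the patient id)
corr = data.iloc[:, 1:12].corr()

# Visualize the correlations as a percentage-annotated heat map
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, fmt='.0%')
plt.show()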

Sample of feature correlation
HeatMap of correlations

The tweaking of the data is now complete. The dataset must now be split into independent (feature) and dependent (target) sets, and then into training and testing sets (the data are also going to be scaled).
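A sketch of the split and the scaling; the 80/20 ratio and the fixed random_state are assumptions, chosen so that the test set contains 114 patients, as in the results further down:

# Independent variables (the 30 features) and dependent variable (the diagnosis)
X = data.drop(columns=['id', 'diagnosis']).values
y = data['diagnosis'].values

# 80% of the patients for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize the features so that no single feature dominates the models
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)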

It is now time to create all of the models to be compared.

(1) Logistic Regression
(2) K Nearest Neighbor
(3) Support Vector Machine (Linear Classifier)
(4) Support Vector Machine (RBF Classifier)
(5) Gaussian Naive Bayes
(6) Decision Tree Classifier
(7) Random Forest Classifier

To make things easier, I will write a single function that builds and fits all of the models, and then call it to create them in one go.
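A sketch of such a function, assuming the seven scikit-learn classifiers listed above with mostly default hyper-parameters:

def models(X_train, y_train):
    # Classifier imports kept local so the function is self-contained
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    # (1) - (7): build and fit every model on the training data
    log = LogisticRegression(random_state=0).fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    svc_lin = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    svc_rbf = SVC(kernel='rbf', random_state=0).fit(X_train, y_train)
    gauss = GaussianNB().fit(X_train, y_train)
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                    random_state=0).fit(X_train, y_train)

    fitted = [('Logistic Regression', log), ('K Nearest Neighbor', knn),
              ('SVM (Linear)', svc_lin), ('SVM (RBF)', svc_rbf),
              ('Gaussian Naive Bayes', gauss), ('Decision Tree', tree),
              ('Random Forest', forest)]

    # Report each model's accuracy on the training data
    for name, clf in fitted:
        print(name, 'training accuracy:', clf.score(X_train, y_train))

    return fitted

model = models(X_train, y_train)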

Random Forest appears to have the best accuracy of the models, at 99.78% on the training data (as you can see, the decision tree did not actually reach 100% accuracy).

This by itself is not enough. I will also be computing the confusion matrix of each model, in order to get a clearer idea of their accuracy on the test set.
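A minimal sketch, reusing the list of fitted models returned above:

# Confusion matrix and testing accuracy for every model
for name, clf in model:
    cm = confusion_matrix(y_test, clf.predict(X_test))
    tn, fp, fn, tp = cm.ravel()
    print(name)
    print(cm)
    print('Testing accuracy:', (tp + tn) / (tp + tn + fp + fn))
    print()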

TN ---> True Negative
TP ---> True Positive
FN ---> False Negative
FP ---> False Positive

By creating the confusion matrices of the models, it becomes obvious that the two most accurate models are the Support Vector Machine (Linear Classifier) and the Random Forest Classifier, with accuracies of 98.24% and 97.36% respectively.

The Support Vector Machine (Linear Classifier) made the following predictions:

- 67 True Negatives
- 45 True Positives
- 0 False Positives
- 2 False Negatives

The Random Forest Classifier made the following predictions:

- 66 True Negatives
- 45 True Positives
- 1 False Positive
- 2 False Negatives

(2) Detecting breast cancer from images

It becomes evident that although the first method appears to be successful, getting hold of the information required may be difficult for most individuals.

An easy way to overcome this problem is by using images. Even if one can access the specialized information required by the first model, having two independent results could dramatically facilitate the patient’s decision-making process.

I will begin building my model by importing the required libraries.
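A rough set of imports for this part; PyTorch and torchvision are assumed to have been installed alongside the earlier libraries:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms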

Now that the required libraries and dependencies have been imported, some basic settings must be set.

The structure of the dataset is confusing to say the least, so I will try to elaborate on its architecture along the way.

What I currently care about, is that there are 279 patients:
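Every patient has their own sub-folder in the dataset, so counting the sub-folders gives the patient count. The directory name below is an assumption; adjust it to wherever the archive was extracted:

# Path to the extracted dataset; every sub-folder corresponds to one patient
base_path = 'IDC_regular_ps50_idx5/'

patient_folders = [f for f in os.listdir(base_path)
                   if os.path.isdir(os.path.join(base_path, f))]
print(len(patient_folders))   # 279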

The number of patients is unfortunately quite small, so much attention must be paid to avoiding overfitting.

In order for our algorithm to decide whether a patient suffers from breast cancer, it must take all of the individual patches as input. It is thus crucial to know the total number of images.

In order to avoid having to store the pixel values of every one of the 277,524 images, I will be storing the path of each image, as well as the ‘patient_id’ and ‘target’.
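Inside each patient folder the patches are split into a ‘0’ (non-IDC) and a ‘1’ (IDC) sub-folder, so the table of paths can be built roughly like this:

# One row per image patch: its path, the patient it belongs to and its label
rows = []
for patient_id in patient_folders:
    for target in ['0', '1']:
        class_dir = os.path.join(base_path, patient_id, target)
        if not os.path.isdir(class_dir):
            continue
        for image_name in os.listdir(class_dir):
            rows.append({'patient_id': patient_id,
                         'path': os.path.join(class_dir, image_name),
                         'target': int(target)})

data = pd.DataFrame(rows)
print(data.shape)   # roughly (277524, 3)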

I can not stress enough how important data visualization is in machine learning and data science in general. Hence, I will be plotting the data I already know from the data-set.

What can we gain from the plots?

(1) The last graph clearly shows that the IDC and non-IDC classes are heavily imbalanced.

(2) The number of image patches per patient varies a lot.

(3) In some patients, IDC tissue reaches 80%. Consequently, either the tissue really is mostly cancerous, or only the cancerous part was covered in the image.

Let's take a look into some sample cancerous and non-cancerous patches.

Cancerous Patches
Healthy Patches

What can we gain from visualizing the patches?

(1) Patches with cancer look more violet and compact than healthy ones.

(2) It might just be random, but the white dots, which I assume are part of the mammary ducts, appear to be fewer in number in patients suffering from cancer.

Perhaps no significantly valuable insight was gained by visualizing the patches, but it is nevertheless interesting to observe them.

It is now time to build our data frames and start visualizing the breast tissue itself. The first half can be handled by some simple functions, sketched below.
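Each file name encodes the patch's position on the slide (for example ‘..._x1351_y1101_class0.png’), so the helper functions boil down to parsing those coordinates; a sketch:

# Parse the x / y coordinates that are encoded in a patch's file name
def get_patch_coords(path):
    name = os.path.basename(path)
    x = int(name.split('_x', 1)[1].split('_', 1)[0])
    y = int(name.split('_y', 1)[1].split('_', 1)[0])
    return x, y

# Store the coordinates next to each patch for later visualization
data['x'] = data['path'].apply(lambda p: get_patch_coords(p)[0])
data['y'] = data['path'].apply(lambda p: get_patch_coords(p)[1])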

Before looking at the actual tissue, binary target visualization will help reach some conclusions by inspecting a sample of patients.

What can we gain from binary target visualization?

Well… surprisingly, not that much. The only thing of interest is the observation that, in some instances, tissue patches appear to have been discarded or lost.

Perhaps binary target visualization was not that helpful. Nevertheless, given the coordinates of the image patches, I can now recreate the entire image of the tissue.

Breast Tissue of Patient 8955

This particular tissue was loaded only because of the sample id I set for the variable ‘patient_identifier’. It can easily be changed by changing the id number.

Ex.

Breast Tissue of Patient 10273
Breast Tissue of Patient 10272

What can we gain from reconstructing the tissues?

(1) The left image depicts the tissue without any target information.

(2) The right image depicts the tissue with the cancerous cells highlighted with a bright red color.

(3) Darker, violet-colored tissue tends to be associated with cancerous cells, but this is not always the case. It is thus crucial that the model does not mistakenly learn that all dark areas are cancerous cells, as they may well be mammary ducts.

The analysis of the data is now complete. The time has come to start setting up the machine learning model.

I usually assign 80% of the dataset for training and the other 20% for testing. In this case, as mentioned before, validation is crucial. I will be thus assigning 70% for the model’s training, 15% for the testing, and 15% for the validation.

In other words, 195 patients will be used for training, 42 for testing, and the remaining 42 for validation.

It is important to remember the uneven distribution of cancer and healthy tissue samples.
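The split is done per patient rather than per patch, so that patches from the same person never leak between sets. A sketch, with the shuffling seed as an assumption:

# Shuffle the 279 patient ids and split them roughly 70% / 15% / 15%
rng = np.random.RandomState(0)
patients = rng.permutation(data['patient_id'].unique())

n_train = int(0.70 * len(patients))        # 195 patients for training
n_test = (len(patients) - n_train) // 2    # ~42 patients for testing

train_ids = patients[:n_train]
test_ids = patients[n_train:n_train + n_test]
val_ids = patients[n_train + n_test:]      # the remaining ~42 for validation

train_df = data[data['patient_id'].isin(train_ids)]
test_df = data[data['patient_id'].isin(test_ids)]
val_df = data[data['patient_id'].isin(val_ids)]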

Now comes the actually difficult part. As I am dealing with images, I will have to use PyTorch to create the datasets. Unfortunately, my familiarity with PyTorch is worse than my familiarity with assembly, so the way I process the data may not be optimal (I will be basing the following code almost entirely on variations of the documentation).

I will be now creating the dataloaders for PyTorch:
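A minimal custom Dataset that loads a patch from its path and returns it with its label, plus the three loaders. This follows the standard PyTorch data-loading pattern rather than any particular optimized pipeline:

class PatchDataset(Dataset):
    """Loads a 50x50 tissue patch and its IDC label from a dataframe of paths."""

    def __init__(self, df, transform=None):
        self.df = df.reset_index(drop=True)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(row['path']).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return image, int(row['target'])

# Every patch is resized to 50x50 (a few patches at the slide borders are smaller)
transform = transforms.Compose([transforms.Resize((50, 50)),
                                transforms.ToTensor()])

train_loader = DataLoader(PatchDataset(train_df, transform), batch_size=32, shuffle=True)
val_loader = DataLoader(PatchDataset(val_df, transform), batch_size=32, shuffle=False)
test_loader = DataLoader(PatchDataset(test_df, transform), batch_size=32, shuffle=False)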

A major problem I personally encountered is with the torch device. It will all become clear in a bit but for now, running the following piece of code is essential:
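The code simply checks whether CUDA is available and picks the torch device accordingly:

# Use the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)   # 'cpu' or 'cuda' -- remember which one is printed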

Defining the model structure is also obviously critical.
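The exact architecture is not critical; a small convolutional network for the 50x50 patches, sketched below, is one reasonable stand-in (a pretrained torchvision model would work just as well):

# A small CNN for 50x50 RGB patches with two output classes (non-IDC / IDC)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),                      # 64 channels x 6 x 6 after three poolings
    nn.Linear(64 * 6 * 6, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 2),
)
model = model.to(device)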

Setting the loss function and evaluation metric is now in order (obviously that if statement does not apply to me, but it is crucial if you are using the CUDA compute platform).
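A sketch of that setup, assuming cross-entropy loss, an Adam optimizer and plain accuracy as the evaluation metric:

# Cross-entropy loss over the two classes and a standard Adam optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Only relevant on a CUDA machine: move the criterion onto the GPU as well
if torch.cuda.is_available():
    criterion = criterion.cuda()

def accuracy(outputs, labels):
    # Fraction of patches whose predicted class matches the true label
    return (outputs.argmax(dim=1) == labels).float().mean().item()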

Everything appears to be going as planned. I will now write a not-so-short function containing the loop that will train the model.
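A stripped-down version of such a loop, assuming the loaders, model, criterion and optimizer defined above:

def train_model(model, train_loader, val_loader, epochs=5):
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation phase
        model.eval()
        val_loss, val_acc, batches = 0.0, 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                val_loss += criterion(outputs, labels).item()
                val_acc += accuracy(outputs, labels)
                batches += 1

        print(f'Epoch {epoch + 1}: '
              f'train loss {train_loss / len(train_loader):.4f}, '
              f'val loss {val_loss / batches:.4f}, '
              f'val accuracy {val_acc / batches:.4f}')
    return model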

Now that the function meant to train the data is complete (a 5-hour session with Google and StackOverflow was involved), an optimal cyclical learning rate must be found (there is a great article that helped a lot, listed in the bibliography section).
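PyTorch ships a CyclicLR scheduler, so one hedged way of wiring it in looks like this (the base and maximum rates below are placeholders, not a tuned result):

# Cyclical learning rate: oscillate between a base and a maximum rate.
# SGD with momentum is used because CyclicLR's default momentum cycling
# expects a momentum-based optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=len(train_loader) // 2)

# scheduler.step() is then called after every optimizer.step()
# inside the training loop above.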

The final step has now been reached. Attention though! I previously highlighted that it is important to remember whether the output was ‘cpu’ or ‘cuda’ (you should know that anyway). The problem is that a different course of action must be followed in each case (for some reason, a simple if statement does not do the trick).

I called this the final step, as I will now be training the model.

For ‘cpu’:

For ‘cuda’:

Training the data took more than a day when running on an overclocked Intel(R) Core(TM) i7–6700K CPU, so keep in mind this might take a while.

The results (and errors) can be now seen below:

Although the errors can give us much insight into the performance of the model, it is time to test it and watch live results!

Update! These are not all the same people. They are three different test subjects. They have been mistakenly labeled as patient ‘8955’.

Conclusion

The aim of this article was to successfully detect breast cancer utilizing different methods. I can proudly say that the models outperformed my expectations.

Although the model was a success, there is always room for improvement. I encourage everyone to experiment with more Machine Learning techniques, in order to reach better results. I will be experimenting more on the image detector with CNN (I will be publishing an article with my findings).

To sum up:

From the first technique, the two most accurate models were the Support Vector Machine (Linear Classifier) and the Random Forest Classifier, with accuracies of 98.24% and 97.36% respectively.

With the second technique, I managed not only to detect whether a person has cancer but also to locate cancer in an image of the breast’s tissue.

Prediction Results

Bibliography:

Cruz-Roa, Angel, et al. “Automatic Detection of Invasive Ductal Carcinoma in Whole Slide Images with Convolutional Neural Networks.” Proc. SPIE Medical Imaging, 2014, spie.org/Publications/Proceedings/Paper/10.1117/12.2043872.

Bhande, Anup. “What Is Underfitting and Overfitting in Machine Learning and How to Deal with It.” Medium, GreyAtom, 18 Mar. 2018, medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76.

“Breast Cancer.” Breast Cancer | Cancer Research UK, 21 Sept. 2017, www.cancerresearchuk.org/about-cancer/breast-cancer.

“Breast Cancer.” Mayo Clinic, Mayo Foundation for Medical Education and Research, 22 Nov. 2019, www.mayoclinic.org/diseases-conditions/breast-cancer/symptoms-causes/syc-20352470.

“Breast Cancer Information and Support.” Breastcancer.org, 17 Apr. 2020, www.breastcancer.org/.

“Breast Cancer: Breast Cancer Information & Overview.” American Cancer Society, www.cancer.org/cancer/breast-cancer.html.

Brownlee, Jason. “Logistic Regression for Machine Learning.” Machine Learning Mastery, 12 Aug. 2019, machinelearningmastery.com/logistic-regression-for-machine-learning/.

Cavaioni, Michele. “Machine Learning: Decision Tree Classifier.” Medium, Machine Learning Bites, 5 Feb. 2017, medium.com/machine-learning-bites/machine-learning-decision-tree-classifier-9eb67cad263e.

“Decision Tree Learning.” Wikipedia, Wikimedia Foundation, 20 Apr. 2020, en.wikipedia.org/wiki/Decision_tree_learning.

Dehaene, Thomas. “Adaptive — and Cyclical Learning Rates Using PyTorch.” Medium, Towards Data Science, 21 Mar. 2019, towardsdatascience.com/adaptive-and-cyclical-learning-rates-using-pytorch-2bf904d18dee.

Janowczyk, Andrew, and Anant Madabhushi. “Deep Learning for Digital Pathology Image Analysis: A Comprehensive Tutorial with Selected Use Cases.” Journal of Pathology Informatics, Medknow Publications & Media Pvt Ltd, 26 July 2016, www.ncbi.nlm.nih.gov/pubmed/27563488.

“K-Nearest Neighbors Algorithm.” Wikipedia, Wikimedia Foundation, 14 Apr. 2020, en.wikipedia.org/wiki/K-nearest_neighbors_algorithm.

“Logistic Regression.” Wikipedia, Wikimedia Foundation, 18 Apr. 2020, en.wikipedia.org/wiki/Logistic_regression.

Menon, Adarsh. “Logistic Regression in Machine Learning Using Python.” Medium, Towards Data Science, 30 Dec. 2019, towardsdatascience.com/logistic-regression-explained-and-implemented-in-python-880955306060.

“Naive Bayes Classifier.” Wikipedia, Wikimedia Foundation, 16 Apr. 2020, en.wikipedia.org/wiki/Naive_Bayes_classifier.

NHS Choices, NHS, www.nhs.uk/conditions/breast-cancer/.

“Overfitting.” Wikipedia, Wikimedia Foundation, 12 Apr. 2020, en.wikipedia.org/wiki/Overfitting.

“Random Forest.” Wikipedia, Wikimedia Foundation, 17 Apr. 2020, en.wikipedia.org/wiki/Random_forest.

Srivastava, Tavish. “K Nearest Neighbor: KNN Algorithm: KNN in Python & R.” Analytics Vidhya, 1 Apr. 2020, www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/.

“Support-Vector Machine.” Wikipedia, Wikimedia Foundation, 14 Apr. 2020, en.wikipedia.org/wiki/Support-vector_machine.
