Using Machine Learning Models for Breast Cancer Detection

Hannah Le
11 min read · Dec 3, 2018


Since the beginning of human existence, we have learned to cure many diseases, from a simple bruise to complex neurological disorders. Such a feat would have been inconceivable to the first Homo sapiens 200,000 years ago.

Now, humanity is on the cusp of conceiving something new: a cure for cancer. Cancer is one of the deadliest diseases in the world, taking the lives of more than eight million people every single year, yet we haven’t been able to find a cure for it.

By merging the power of artificial intelligence and human intelligence, we may be able to optimize the cancer treatment process step by step, from screening to effectively diagnosing and eradicating cancer cells!

In this article, I will discuss how we can leverage several machine learning models to obtain higher accuracy in breast cancer detection. Let’s see how it works!

Phase 1: Preparing Data

First, I downloaded the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository.

The dataset was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA. He analyzed the cell samples using a computer program called Xcyt, which performs analysis of cell features based on a digital scan. The program returns 10 features for each cell nucleus in a sample and computes the mean value, extreme (worst) value, and standard error of each feature across the sample.

Phase 2: Exploring Data

Now, we can import the necessary libraries and the dataset into Spyder. Pandas is one of the Python packages that makes importing and analyzing data much easier. As seen below, the Pandas head() method returns the top n (5 by default) rows of a DataFrame or Series.
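A minimal sketch of this step might look like the following (the file name data.csv is an assumption; use whatever name you saved the UCI file under):

```python
import pandas as pd

# Load the Breast Cancer Wisconsin (Diagnostic) data from a local CSV.
# "data.csv" is an assumed file name, not necessarily the original one.
dataset = pd.read_csv("data.csv")

# Peek at the first rows (5 by default) to sanity-check the import.
print(dataset.head())
```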

We can also find the dimensions of the data set using the dataset.shape attribute. There are a total of 569 rows and 32 columns. In the column that represents the diagnosis, we can observe that 357 of the samples are benign and 212 are malignant.
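A quick sketch of that check, assuming the label column is named diagnosis as in the commonly used CSV version of the data:

```python
# Dimensions of the data set as a (rows, columns) tuple; expected (569, 32).
print(dataset.shape)

# Count benign (B) vs malignant (M) samples in the diagnosis column.
print(dataset["diagnosis"].value_counts())
```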

Phase 3: Categorizing Data

As diagnosis contains categorical data, meaning that it consists of text labels instead of numerical values, we will use a Label Encoder to encode the categorical data.

To do so, we can import the Sci-kit Learn library and use its LabelEncoder class to convert the text labels to numbers, which are easier for our predictive models to work with. For instance, 1 means that the cancer is malignant, and 0 means that the cancer is benign.
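A sketch of the encoding step, assuming the standard CSV layout where the id column comes first, diagnosis second, and the 30 numeric cell features after that:

```python
from sklearn.preprocessing import LabelEncoder

# Separate features and target. Columns 2-31 hold the 30 numeric cell features;
# adjust the slice if your copy of the file has extra columns.
X = dataset.iloc[:, 2:32].values
y = dataset.iloc[:, 1].values

# Encode the text labels alphabetically: 'B' (benign) -> 0, 'M' (malignant) -> 1.
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)
```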

The Sci-kit Learn library also allows us to split our data set into a training set and a test set. The purpose of this is to later validate the accuracy of our machine learning models on data they have never seen. To accomplish this, we use the train_test_split method, as seen below!
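A sketch of the split; the 25% test size and the fixed random_state are assumptions rather than the exact values used originally:

```python
from sklearn.model_selection import train_test_split

# Hold out a quarter of the samples as a test set to validate the models later.
# random_state is fixed only so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```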

Phase 4: Selecting models

Now, to the exciting part! 🔥

To classify the two different classes of cancer, I explored six different machine learning algorithms, namely Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Kernel SVM, Naïve Bayes, and Random Forest Classification.

Logistic regression

What is logistic regression to begin with? The name actually comes from something known as the logistic function, also called the sigmoid function: an S-shaped curve that rises quickly and then levels off at the carrying capacity of the environment. Such a curve is often used to describe population growth in ecology.

The linear equation underlying the model can be represented as y = b0 + b1x, where b0 is the intercept and b1 is the coefficient of the input feature x.

Depending on the values of x, the output can be anywhere from negative infinity to positive infinity. But… there is a slight problem! If you recall the output of our cancer prediction task above, malignant and benign take on the values of 1 and 0, respectively, not infinity.

To ensure the output falls between 0 and 1, we squash the linear function through the sigmoid: p = 1 / (1 + e^(−(b0 + b1x))). The common practice is to take the probability cutoff as 0.5: if the predicted probability of Y is greater than 0.5, the sample is classified as an event (malignant).

Once again, I used the Sci-kit Learn library, importing the LogisticRegression class and fitting it to the training data. The accuracy achieved was 95.8%!
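A sketch of what that code might look like; the hyperparameters below are assumptions rather than the original settings:

```python
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression classifier on the training set.
# max_iter is raised because the features here are unscaled; the original
# code may well have applied feature scaling first.
classifier_lr = LogisticRegression(random_state=0, max_iter=10000)
classifier_lr.fit(X_train, y_train)

# Accuracy on the held-out test set.
print("Logistic Regression accuracy:", classifier_lr.score(X_test, y_test))
```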

k-Nearest Neighbor (kNN)

kNN is often known as a lazy, non-parametric learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.

Now, unlike most other methods of classification, kNN falls under lazy learning (And no, it doesn’t mean that the algorithm does nothing like chubby lazy polar bears — just in case you were like me, and that was your first thought!)

In actuality, what this means is that there is no explicit training phase before classification. Instead, any attempt to generalize or abstract the data is made at classification time.

Such a situation is quite similar to what happens in the real world, where most data does not obey the typical theoretical assumptions (as in linear regression models, for instance). Thus, kNN often appears as a popular choice for a classification study when little is known about the distribution of the data set.

Now that we understand the intuition behind kNN, let’s understand how it works! Essentially, kNN can be broken down into three main steps (a minimal sketch follows the list):

  • Compute a distance value between the item to be classified and every item in the training data set
  • Pick the k closest data points/items
  • Conduct a “majority vote” among those data points: the dominant classification in that pool is chosen as the final classification
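To make those three steps concrete, here is a minimal from-scratch sketch in plain NumPy (illustrative only, not the code used for the results later in this section):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point x_new by a majority vote of its k nearest neighbours.

    X_train is an (n_samples, n_features) NumPy array and y_train an
    (n_samples,) array of labels.
    """
    # Step 1: distance from x_new to every training point (Euclidean here).
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```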

Let’s look at a simple example of how kNN works!

The test sample (the point at the centre) should be classified either as the first class of blue squares or the second class of red triangles. If k = 3 (inner circle), it is assigned to the second class, because there are 2 red triangles and only 1 blue square inside that circle. If k = 1, it would be classified as the first class, since its single nearest neighbour is a blue square.

Easy peasy, right? Not quite! There are still several questions we need to ask: how do we actually compute the distance (step 1), and how do we find the value of k (step 2)? A small value of k means that noise will have a higher influence on the result, while a large value makes it computationally expensive.

There are many ways to compute the distance, two of the most popular of which are Euclidean distance and cosine similarity.

Euclidean distance is essentially the magnitude of the vector obtained by subtracting the training data point from the point to be classified.

It can be determined using the equation d = √((x1 − x2)² + (y1 − y2)²), where (x1, y1) and (x2, y2) are the coordinates of the two data points (assuming the data lie nicely on a 2D plane; if the data lie in a higher-dimensional space, there would simply be more squared differences under the square root).

Another method is cosine similarity. Instead of explicitly computing the distance between two points, cosine similarity uses the difference in direction between two vectors, using the equation cos(θ) = (x · y) / (‖x‖ ‖y‖).
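As a small illustration, both metrics can be computed in a couple of lines of NumPy:

```python
import numpy as np

def euclidean_distance(x, y):
    # Magnitude of the difference vector between the two points.
    return np.linalg.norm(x - y)

def cosine_similarity(x, y):
    # Cosine of the angle between the two vectors (1 means same direction).
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean_distance(a, b), cosine_similarity(a, b))
```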

Next, how do we find the value of k?

Usually, data scientists choose k as an odd number if the number of classes is 2; another simple approach is to set k = sqrt(n), where n is the number of samples. A somewhat more rigorous method is cross-validation: we try several values of k and compare their corresponding accuracies to find the most suitable one.

After doing all of the above and deciding on a metric, the result of the kNN algorithm is a decision boundary that partitions the space of the feature vectors representing our data set into sections. We can then calculate the most likely class for a hypothetical data point in each region, and color that chunk as belonging to that class.

Below is a snippet of code, where I imported the kNN model from Sci-kit Learn and trained it on the cancer data set, resulting in an accuracy of 95.1%! I chose the value of k to be 5 after a three-fold cross-validation.
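A sketch of that snippet might look like this; the list of candidate k values and cv=3 are assumptions:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Compare a few candidate values of k with cross-validation on the training set.
for k in (3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=3)
    print("k =", k, "-> mean CV accuracy:", scores.mean())

# Train the final model with the chosen k and evaluate it on the test set.
classifier_knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)  # p=2 -> Euclidean
classifier_knn.fit(X_train, y_train)
print("kNN accuracy:", classifier_knn.score(X_test, y_test))
```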

Support Vector Machines (SVM)

Wow… sounds cool already!

Suppose we are given a plot of two labeled classes, as shown in image (A). How should we draw a line to separate the two classes?

We would end up with something like this. A green line fairly separates your data into two groups — the ones above the line are labeled “black” and the ones below the line are labeled “blue”.

Making it a bit more complicated, what if our data looks like this?

Now, instead of looking at our data from an x-y plane perspective, we can flip the plot around and see something like the image below. Now that we are on the y-z plane, we can nicely fit a line to separate our data!

When we transform this line back to the original plane, it maps to a circular boundary, as shown below. Such transformations are called kernels.

You can see where we are going with this: overall, the objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates the data points.

Intuitively, we want to find the plane that has the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.

Following this intuition, I imported the algorithm from Sci-kit Learn and achieved an accuracy rate of 96.5%.
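A sketch of the SVM code; the linear kernel below is an assumption, and swapping in kernel="rbf" gives the kernel SVM variant discussed above:

```python
from sklearn.svm import SVC

# Linear-kernel SVM; use kernel="rbf" for a kernel SVM with a non-linear boundary.
classifier_svm = SVC(kernel="linear", random_state=0)
classifier_svm.fit(X_train, y_train)
print("SVM accuracy:", classifier_svm.score(X_test, y_test))
```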

Naïve Bayes

The Naive Bayes algorithm is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered an orange if it is orange in color, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, all of these properties independently contribute to the probability that this fruit is an orange, and that is why the method is known as ‘Naive’. So, how exactly does it work?

Bayes’ Theorem is formally written like this: P(A|B) = P(B|A) × P(A) / P(B), where A and B are events and P(B) is not zero.

Let’s think about a simple example to make sure we clearly understand this concept!

When P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:

  • P(Fire|Smoke) means how often there is fire when we see smoke.
  • P(Smoke|Fire) means how often we see smoke when there is fire.

If dangerous fires are rare (1%) but smoke is fairly common (10%) due to factories, and 90% of dangerous fires make smoke, then:

P(Fire|Smoke) = P(Smoke|Fire) × P(Fire) / P(Smoke) = (90% × 1%) / 10% = 9%

→ In this case, 9% of the time we expect smoke to mean a dangerous fire.

Now, how does this apply to a classification problem?

Essentially, Naive Bayes calculates the probabilities for all input features (in our case, the features of the cell that contribute to cancer). Then, it selects the outcome with the highest probability (malignant or benign). I implemented the algorithm on the cancer detection problem and achieved an accuracy of 91.6%.
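A sketch of the Naive Bayes code, using the Gaussian variant since the cell features are continuous (the choice of GaussianNB is an assumption):

```python
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes models each continuous cell feature with a normal distribution.
classifier_nb = GaussianNB()
classifier_nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", classifier_nb.score(X_test, y_test))
```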

Random Forest Classification

Finally, on to our last algorithm: random forest classification! As the name suggests, this algorithm builds a forest out of a number of decision trees.

Before diving into a random forest, let’s think about what a single decision tree looks like!

A decision tree is drawn upside down with its root at the top.

  • The bold text in black represents a condition/internal node, based on which the tree splits into branches/edges.
  • The end of a branch that doesn’t split anymore is the decision/leaf, in this case, whether the passenger died or survived, represented as red and green text respectively.

Now, let’s consider the following two-dimensional data, where each point has one of four class labels:

A simple decision tree built on this data will iteratively split the data along one axis or the other according to some quantitative criterion. At each level, the label of a new region is assigned according to the majority vote of the points within it.

However, an interesting problem arises if we keep splitting: for example, at a depth of five, there is a tall and skinny purple region between the yellow and blue regions.

It’s clear that this is less a result of the true, intrinsic data distribution, and more a result of the particular sampling. That is, this decision tree, even at only five levels deep, is clearly over-fitting our data!
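To make the over-fitting point concrete, here is a small sketch on a toy four-class data set; make_blobs merely stands in for the scatter plot described above and is not the cancer data:

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Toy 2-D data with four class labels, standing in for the scatter plot above.
X_toy, y_toy = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)

# A shallow tree keeps the splits coarse; an unconstrained tree keeps splitting
# until it memorizes the particular sample, carving out the narrow over-fit
# regions described in the text.
for depth in (2, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_toy, y_toy)
    print("max_depth =", depth, "-> training accuracy:", tree.score(X_toy, y_toy))
```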

Hmmmm… How do we solve this problem?

Well, if we look at the results of two decision trees, we can see that in some places, the two trees produce consistent results (e.g., in the four corners), while in other places, the two trees give very different classifications.

Thus, by combining information from both of these trees, we might come up with a better result!

Intuitively, the more trees in the forest, the more robust the model. In the same way, in a random forest classifier, a higher number of trees generally gives more accurate and stable results!

Finally, I ran our final model on the sample data set and obtained an accuracy of 98.1%.
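A sketch of the random forest code; the number of trees and the entropy criterion are assumptions rather than the original settings:

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of decision trees; n_estimators sets the number of trees in the forest.
classifier_rf = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
classifier_rf.fit(X_train, y_train)
print("Random Forest accuracy:", classifier_rf.score(X_test, y_test))
```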

In the end, the Random Forest Classifier produced the most accurate results of all the models tested!

This is one of my first applications in machine learning. Thank you for reading my article, and I hope you’ve enjoyed it so far!

My goal in the future is to dive deeper into how we can leverage machine learning to solve some of the biggest problems in human health. Feel free to stay connected with me if you would like to learn more about my work or follow my journey!

Personal Website: http://hannahle.ca

LinkedIn: https://www.linkedin.com/in/hannah-le/
