Training Neural Networks for binary classification: Identifying types of breast cancer (Keras in R)

Ariel Goldberger
Duke AI Society Blog
6 min read · Mar 8, 2019


In this article, I will cover training a deep learning algorithm for binary classification of malignant/benign cases of breast cancer. We will use a trustworthy machine learning framework: TensorFlow (the backend) through Keras (the API) in R.

About Deep Learning

There is a lot of interesting mathematics involved in a neural network. To keep things simple, however, let’s focus on two main aspects:

(1) Unlike a linear regression, where multiple input variables explain a single outcome, in a neural network the input variables are combined and re-combined many times, across layers, to explain one or more outcomes.

(2) Neural networks learn by figuring out what they got wrong and then working backward through the network to discover which values and connections made the output incorrect, a process known as backpropagation.
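
To make aspect (2) concrete, here is a toy sketch in base R (illustrative only, not from the original article): a single neuron with one weight repeatedly nudges that weight along the cross-entropy gradient until its prediction matches the target.

# Toy backpropagation: one neuron, one weight, gradient descent
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- 2; y <- 1        # one input and its true label
w <- 0                # initial weight
lr <- 0.5             # learning rate
for (i in 1:25) {
  p <- sigmoid(w * x)         # forward pass: current prediction
  w <- w - lr * (p - y) * x   # backward pass: cross-entropy gradient step
}
sigmoid(w * x)  # now close to the target of 1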

About TensorFlow

TensorFlow is an open-source library that Google released in 2015 to implement machine learning techniques (mainly neural networks). The name comes from the tensors (multidimensional arrays) that hold the data and the flow of that data through each operation node. Today it is one of the most widely used frameworks for building neural networks. Read how to install TensorFlow here [3] and Keras for R here [2].

About Breast Cancer

According to the American Cancer Society [1], breast cancer is the presence of fast-growing cells in the breast that eventually form a tumor. The tumor is malignant (cancer) if the cells can grow into surrounding tissues or spread (metastasize) to distant areas of the body. These uncontrolled cells can start growing in the ducts that carry milk or in the glands that make breast milk.

Although many types of breast cancer can cause a lump in the breast, not all do. Many breast cancers are found on screening mammograms which can detect cancers at an earlier stage, often before they can be felt, and before symptoms develop. — ACS

The data set used in this article is provided by the University of California, Irvine; it contains information from cases in the state of Wisconsin and can be found here.

Image: Phys — SPIE: International Society for Optics and Photonics

Training the machine

The following libraries are required to run the program, along with Google’s TensorFlow package, which can be installed on any computer with Python (v3.6 or below). This article, however, uses R to train and test the model.
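
If the packages are not installed yet, a typical one-time setup looks like this (a sketch; exact steps may vary by system):

# One-time setup (sketch): install the R packages and the Python backend
install.packages(c("keras", "tensorflow"))
library(keras)
install_keras()  # installs TensorFlow and Keras into a Python environment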

# Required libraries
library(tidyverse)    # data manipulation (dplyr, %>%)
library(keras)        # Keras API for building the model
library(fastDummies)  # dummy-variable encoding
library(caret)        # data partitioning and evaluation
library(tensorflow)   # TensorFlow backend
library(kerasR)       # alternative Keras bindings

We will encode our binary outcome as a factor variable with levels 0 and 1, matching the format Keras expects.

# Loading data
raw_df <- read.csv('data.csv')
# Create dummy variables for the diagnosis column (B/M), dropping one level
dummy <- dummy_cols(raw_df, remove_first_dummy = TRUE)
# Keep the feature columns plus the diagnosis dummy
df <- dummy[, 3:33]
df$diagnosis_B <- as.factor(df$diagnosis_B)
head(df)

A glimpse at our data

Our dataset contains information about the shape, texture, and other qualities of the tumors; these features can be seen in the following table:

Some features of the group of tumor cells are strongly related, and we will try to exploit that structure. To learn more about the data, click here.
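
As a quick sanity check, these relationships can be inspected directly; a minimal sketch (the 0.9 threshold is illustrative, and each pair appears twice in the output):

# Correlations among the numeric feature columns
cors <- cor(df[, sapply(df, is.numeric)])
# List the most strongly correlated feature pairs (diagonal excluded)
high <- which(abs(cors) > 0.9 & cors < 1, arr.ind = TRUE)
head(data.frame(feature_1 = rownames(cors)[high[, 1]],
                feature_2 = colnames(cors)[high[, 2]],
                correlation = cors[high]))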

# Preparing a subset for training and another for testing (70/30 split)
index <- createDataPartition(df$diagnosis_B, p = 0.7, list = FALSE)
df.training <- df[index, ]
df.test <- df[-index, ]

The data must be prepared before it can be fed to the model, so we make the following adjustments:

# Size and format of data frame
# Features: drop the label, then standardize each column
X_train <- df.training %>%
  select(-diagnosis_B) %>%
  scale()
# Labels: convert the factor back to 0/1 before one-hot encoding
y_train <- to_categorical(as.integer(df.training$diagnosis_B) - 1)
X_test <- df.test %>%
  select(-diagnosis_B) %>%
  scale()
y_test <- to_categorical(as.integer(df.test$diagnosis_B) - 1)
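
One caveat: scale() above standardizes the test set using its own means and standard deviations. A common alternative (a sketch, not the article’s original code) reuses the training-set statistics so both sets are on exactly the same scale:

# Alternative (sketch): standardize the test set with training-set statistics
train_means <- attr(X_train, "scaled:center")
train_sds <- attr(X_train, "scaled:scale")
X_test <- df.test %>%
  select(-diagnosis_B) %>%
  scale(center = train_means, scale = train_sds)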

At each layer we are allowed to use a different activation function to transform the data mathematically. The most popular one, the sigmoid function, loosely mimics the way neurons in our brain fire chemically, which is where the name “neural network” comes from.

Besides the sigmoid function, we will also use the so-called Rectified Linear Unit (ReLU), which in essence helps the algorithm learn faster and reduces the likelihood of the gradient vanishing while the model is being optimized [4].
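
Both functions are simple enough to write out; a minimal sketch in base R:

# The two activation functions used below, written out in base R
sigmoid <- function(z) 1 / (1 + exp(-z))  # squashes any input into (0, 1)
relu <- function(z) pmax(0, z)            # passes positives, zeroes out negatives
sigmoid(c(-2, 0, 2))  # 0.119, 0.500, 0.881
relu(c(-2, 0, 2))     # 0, 0, 2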

Image: Gradient descent optimization for a linear regression

Programming the neural network in R:

# Network design
model <- keras_model_sequential()
model %>%
  # Input layer: 256 units, ReLU activation, one input per feature
  layer_dense(units = 256, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dropout(rate = 0.4) %>%
  # Hidden layer
  layer_dense(units = 75, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  # Output layer: one unit per class
  layer_dense(units = 2, activation = 'sigmoid')

We will use the Adam optimization algorithm to find good weights for each node. In addition, we drop a fraction of the units during training (dropout) to avoid overfitting by injecting some noise into the learning process. Although the right rate depends on the characteristics of the data set, it is common to use between 20% and 40%.
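
If finer control than the 'adam' string is needed, Keras in R also accepts an optimizer object; a sketch (the learning rate shown is simply Keras’s usual default, stated explicitly):

# Sketch: an explicit Adam optimizer object with a hand-set learning rate
opt <- optimizer_adam(lr = 0.001)
# then pass optimizer = opt to compile() instead of the string 'adam'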

Finally, our loss function of choice is binary cross-entropy, a good way to measure how well the model’s predicted probabilities match the true class of each observation.
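
The formula is short: for a true label y and predicted probability p, the loss is -(y*log(p) + (1-y)*log(1-p)), averaged over observations. A toy computation in base R (illustrative numbers only):

# Toy binary cross-entropy (illustrative values)
bce <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
bce(y = c(1, 0, 1), p = c(0.9, 0.2, 0.8))  # ~0.18: confident and mostly right
bce(y = c(1, 0, 1), p = c(0.1, 0.8, 0.2))  # ~1.84: confidently wrong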

# Network config
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
# Running our data (fit() returns the training history)
history <- model %>% fit(
  X_train, y_train,
  epochs = 100,
  batch_size = 5,
  validation_split = 0.3
)
summary(model)

Model performance

As we can see from the following model-performance plot, our model seems to be overfitting: the validation loss does not drop much as training proceeds. Even so, we achieve a high out-of-sample accuracy of 96.47%. High accuracy despite an overfitted model is plausible here because of the strong correlation between the test data and the training data.
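
The performance plot and the out-of-sample numbers come from standard Keras helpers; a brief sketch (assuming history was returned by fit() as above):

# Plot training vs. validation loss/accuracy across epochs
plot(history)
# Out-of-sample loss and accuracy on the held-out test set
model %>% evaluate(X_test, y_test)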

# Calculating accuracy
predictions <- model %>% predict_classes(X_test)
# Confusion matrix: predicted classes (rows) vs. actual classes (columns)
df.test$diagnosis_B <- as.integer(df.test$diagnosis_B) - 1
table(
  factor(predictions, levels = min(df.test$diagnosis_B):max(df.test$diagnosis_B)),
  factor(df.test$diagnosis_B, levels = min(df.test$diagnosis_B):max(df.test$diagnosis_B))
)
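
Since caret is already loaded, a hedged alternative is its confusionMatrix(), which also reports sensitivity and specificity, both important in a medical setting (a sketch):

# Sketch: confusion matrix with sensitivity/specificity via caret
confusionMatrix(
  factor(predictions, levels = c(0, 1)),
  factor(df.test$diagnosis_B, levels = c(0, 1))
)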

Final thoughts

Although the model appears highly accurate, some of that is luck: the strong correlations among the data points make the problem easier than it looks. The (overfitted) model could be improved by adding regularization methods. Although it is beyond the scope of this article, further research showed that a tuned support vector machine outperforms a plain neural network on this data. Still, reaching 96.5% accuracy with a model as simple as the one above is indeed an excellent start.

References

[1] American Cancer Society, “What Is Breast Cancer?”, https://www.cancer.org/cancer/breast-cancer/about/what-is-breast-cancer.html, February 2019

[2] keras: R Interface to ‘Keras’ (CRAN), https://cran.r-project.org/web/packages/keras/index.html, February 2019

[3] TensorFlow: Install with pip, https://www.tensorflow.org/install/pip?lang=python3, February 2019

[4] Xavier Glorot, Antoine Bordes and Yoshua Bengio, “Deep Sparse Rectifier Neural Networks”

