Camelyon16: Detecting Breast Cancer Metastases in Lymph Node Biopsies

Daryl Chang · Analytics Vidhya · Mar 18, 2020

Over the past several months, I’ve been working on a cancer detection project using a dataset available on Kaggle.[0] I’ve summarized my work below — feel free to comment with any ideas or suggestions!

What is the problem?

Cancer can metastasize, or spread, to parts of the body other than where it originated. In particular, according to the Camelyon website, “the lymph nodes in the underarm are the first place breast cancer is likely to spread.” [1] If breast cancer has metastasized to the lymph nodes, the prognosis is poorer. However, identifying metastases in lymph nodes is “tedious, time-consuming, and prone to misinterpretation.” [1] Therefore, the goal of the Kaggle challenge and of this project is to create a model that can automatically detect breast cancer metastases in lymph node images. Such a model could reduce pathologists’ workload as well as improve diagnostic precision and recall.

Breast cancer can metastasize to the lymph nodes under the arm.

What data is available?

The Camelyon16 dataset consists of 400 hematoxylin and eosin (H&E) whole-slide images of lymph nodes, with metastatic regions labeled. The Kaggle challenge further breaks down the data into 96x96px images drawn from the original whole-slide images. Each image is labeled positive if there is at least one pixel of cancer tissue in the center 32x32px region; otherwise, it is labeled negative. There are 220k training images, with a roughly 60/40 split between negatives and positives, and 57k evaluation images.

A positive data point (contains cancer tissue)

Problem setup — train/dev set and metrics

I decided to split out 10% of the training data as a development set that I could use to compare the efficacy of different models. I also decided to use AUC (area under the ROC curve) as my primary development metric; it represents the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative one. AUC is a better metric than accuracy when dealing with class imbalance — as a trivial example, if the negative/positive split is 99/1, a classifier could achieve 99% accuracy by simply predicting only negatives.
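
A minimal sketch of this setup using scikit-learn is shown below; the variable names are illustrative, and the stratified split is just one reasonable choice:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hold out 10% of the labeled training images as a dev set
train_x, validation_x, train_y, validation_y = train_test_split(
    images, labels, test_size=0.1, stratify=labels, random_state=42)

# After training a model, AUC can be computed from its predicted probabilities:
# dev_auc = roc_auc_score(validation_y, model.predict(validation_x))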

Data augmentation

One common technique to synthetically increase the amount of training data in computer vision tasks is to apply transformations to the data. I decided to create a copy of each training image by applying a random rotation between -90 and 90 degrees, as well as flipping the image horizontally and/or vertically with probability 0.5. Note that other transformations such as brightness attenuation, cropping, and shearing are also possible, but I chose not to apply these since they could have introduced artifacts correlated with the outcome; for example, shearing could introduce morphological abnormalities in the cell shapes.

from imgaug import augmenters as iaa

seq = iaa.Sequential([
    iaa.Affine(rotate=(-90, 90)),  # rotate image between -90 and 90 degrees
    iaa.Fliplr(0.5),               # horizontally flip 50% of all images
    iaa.Flipud(0.5),               # vertically flip 50% of all images
])

I only augmented the training data, and not the dev data — the reason is that we want to keep the dev and test data as close to the actual data we care about as possible, so that our metric readings on them reflect true performance on ‘real’, unaugmented lymph node images.

Also note that it is important to do augmentation after splitting your data into train and dev sets; otherwise, you may end up with training data leakage where an augmented version of a training image ends up in your dev set.
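
Putting those two points together, a sketch of the ordering (continuing from the split above, where seq is the imgaug pipeline defined earlier):

import numpy as np

# Augment only the training portion, and only after the split
train_x_aug = seq(images=train_x)  # apply the imgaug pipeline to a batch of images

# Double the training set with the augmented copies; the dev set stays untouched
train_x = np.concatenate([train_x, train_x_aug])
train_y = np.concatenate([train_y, train_y])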

After data augmentation, we end up with 397k training data images and 21k dev set images.

Baseline Model — Logistic Regression

As a baseline model, I created a logistic regression in Keras/TensorFlow that takes every pixel of the 96x96x3 image as input.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten

lr_model = Sequential()
lr_model.add(Flatten(input_shape=(96, 96, 3)))
lr_model.add(Dense(1))
lr_model.add(Activation('sigmoid'))

lr_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', 'AUC'],
)

lr_model.fit(
    x=train_x,
    y=train_y,
    validation_data=(validation_x, validation_y),
    epochs=1,
    batch_size=32,
    shuffle=True,
)

The baseline model achieved the following metrics:

  • training loss: 0.8519
  • training accuracy: 0.5782
  • training AUC: 0.5975
  • dev loss: 1.4036
  • dev accuracy: 0.4034
  • dev AUC: 0.7228

Clearly, there is significant room for improvement! What next steps should we prioritize? As Andrew Ng suggests in his Deep Learning course [2], we should compare the Bayes error (the lowest achievable error), the training error, and the dev error. Here, AUC is our primary metric, so we can take 1 - AUC as the error. We can also assume that the Bayes error is approximately 0, since the best submissions on the Kaggle leaderboard achieve an AUC close to 1. Since both the training error and the dev error are significantly higher than the Bayes error, we can conclude that we have a bias problem — most likely, the model is not complex enough to capture the interactions in the data. This means we should train a bigger model.
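
To make that concrete, taking error = 1 - AUC:

  • training error ≈ 1 - 0.5975 = 0.4025
  • dev error ≈ 1 - 0.7228 = 0.2772
  • Bayes error ≈ 0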

Transfer Learning — NASNet Model

One common jumping-off point for computer vision tasks is to use a model pre-trained on another vision task, and fine-tune it for the task at hand. This is referred to as transfer learning.

Let’s try the NASNetMobile model, which is available via the TensorFlow/Keras libraries and has demonstrated state-of-the-art performance on ImageNet. Since NASNetMobile normally takes 224x224px images as input, how can it be adapted to a 96x96px image? The answer is that we keep only the convolutional and pooling layers, which impose no requirement on input dimensions, and drop the top fully-connected layer, since it expects an output volume of a specific size from the conv/pool layers.

As an initial approach, I decided to flatten the output volume and feed it into a Dense layer with one neuron. I also decided to unfreeze the NASNet model parameters, since the training data set seemed sufficiently large to train a model of this size.

from tensorflow.keras.applications import NASNetMobile
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

inputs = Input((96, 96, 3))
base_model = NASNetMobile(include_top=False, input_tensor=inputs, weights='imagenet')
x = base_model(inputs)
x = Flatten()(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs, x)

model.compile(
    optimizer=Adam(1e-4),
    loss='binary_crossentropy',
    metrics=['accuracy', 'AUC'],
)

# Save the best model seen during training (the checkpoint path is a placeholder)
model_checkpoint = ModelCheckpoint('nasnet_best.h5', save_best_only=True)

model.fit(
    x=train_x,
    y=train_y,
    validation_data=(validation_x, validation_y),
    epochs=6,
    batch_size=32,
    shuffle=True,
    callbacks=[model_checkpoint],
)

This yielded the following metrics:

  • Train loss: 0.053
  • Train AUC: 0.9977
  • Dev loss: 0.097
  • Dev AUC: 0.9920

(Figures: train and dev loss, and train and dev AUC, over the training epochs.)

Now, we have eliminated most of the bias problem — the gap between the Bayes error and the training error is 1 - 0.9977 = 0.0023. The gap between the training error and the dev error is 0.9977 - 0.9920 = 0.0057, which is larger, telling us that we need to focus on addressing the variance issue. We can do so by regularizing the model, for example through dropout or L2 regularization.
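
As a sketch of what that could look like on the classification head defined above (the dropout rate and L2 strength are placeholder values, not tuned):

from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.regularizers import l2

x = base_model(inputs)
x = Flatten()(x)
x = Dropout(0.5)(x)  # randomly zero half of the features during training
x = Dense(1, activation='sigmoid', kernel_regularizer=l2(1e-4))(x)  # penalize large weights
model = Model(inputs, x)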

Can we speed up training?

Training the NASNet model takes a long time — about 1300s per epoch, even when training on a GPU. Since I was training on Google Cloud, I had access to four GPUs, and wondered if I could use them to speed up training.

I tried to do so using MirroredStrategy in TensorFlow:

import tensorflow as tf

with tf.distribute.MirroredStrategy().scope():
    # Create, compile, and fit the model as before
    # ...

MirroredStrategy copies all of the model’s variables to each GPU and distributes the forward/backward pass for each batch across the GPUs. It then combines the per-GPU gradients using all-reduce and applies the result to each GPU’s copy of the model. Essentially, it divides up each batch and assigns one chunk to each GPU.
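
For example, with the batch size of 32 used here and four GPUs, each GPU sees roughly 8 examples per step:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

global_batch_size = 32
per_replica_batch_size = global_batch_size // strategy.num_replicas_in_sync  # 32 / 4 = 8 per GPU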

Surprisingly, using MirroredStrategy initially made the training even slower — 2200s compared to 1300s! I attributed this to the fact that the batch size was relatively small at 32, and so the overhead of splitting up the batch may have exceeded the time savings of using multiple GPUs.

Then, I used a larger batch size of 1024. This resulted in a 6x training speedup, from 1300s to 218s per epoch. However, while the training loss (0.046) was similar to before, the dev loss exploded to 2.34.

It turns out that increasing the batch size is known to significantly increase the generalization gap — the gap between train and dev performance. The reason is that larger batch sizes tend to result in ‘sharp minimizers’, which do not generalize well to new data, whereas smaller batch sizes introduce some noise that allows them to escape these ‘sharp minimizers’ in favor of ‘flat minimizers’ that generalize better. [3] Unfortunately, this meant that larger batch sizes, and the corresponding speedup due to parallelization, were out of the question, barring techniques to improve the generalization gap on large batch sizes.*

*Others have observed that increasing the learning rate can eliminate the generalization gap at large batch sizes; however, I found that doing so did not result in any gains.
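
One common way to do this is the linear scaling rule: scale the learning rate in proportion to the batch size. A sketch, with the base values matching the settings above:

base_lr = 1e-4            # learning rate used at batch size 32
base_batch_size = 32
large_batch_size = 1024

scaled_lr = base_lr * large_batch_size / base_batch_size  # 1e-4 * 32 = 3.2e-3
# model.compile(optimizer=Adam(scaled_lr), loss='binary_crossentropy', metrics=['accuracy', 'AUC'])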

Does pre-training help?

We saw above that moving from a simple logistic regression model to a pre-trained NASNet model improved AUC significantly. However, was this because NASNet was pretrained, or simply because NASNet’s architecture was more complex and therefore more able to learn the interactions in the training data? In other words, does pre-training actually help?

To answer this question, I ran an identical experiment using randomly initialized weights instead of the pre-trained ImageNet weights. As shown below, it took much longer for the model to learn and achieve good performance — even after 15 epochs, the model still did not perform as well as the pre-trained model did after only 6 epochs!

Since I only trained the model for 15 epochs, I am not sure whether it would have caught up to the pre-trained version eventually — but we can reasonably conclude that a randomly-initialized model takes longer to match the performance of the pre-trained model, if it ever does at all.
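
For reference, the only change needed for the randomly initialized run is the weights argument in the earlier model definition:

# Random initialization instead of the pre-trained ImageNet weights
base_model = NASNetMobile(include_top=False, input_tensor=inputs, weights=None)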

References

[0] https://www.kaggle.com/c/histopathologic-cancer-detection/overview

[1] https://camelyon16.grand-challenge.org/Background/

[2] https://www.coursera.org/learn/neural-networks-deep-learning/

[3] https://openreview.net/pdf?id=H1oyRlYgg
