Transfer Learning for Breast Lesion Detection from Ultrasound Images

Sarah Kefayati, Fellow at Insight Health Data Science (spring 2017), obtained her PhD in Physics from the University of Western Ontario, Canada. During her PhD, she performed experimental studies using laser-based velocimetry and advanced fluid dynamics concepts to both understand and develop metrics to quantify the progression of atherosclerosis leading to ischemic stroke. She then joined the Department of Radiology at UCSF for postdoctoral training and worked on advanced MRI techniques for vascular imaging applications.

I developed an interest in deep learning during my postdoc years as I started working side by side with radiologists analyzing a variety of clinical images. I’ve been curious to learn about the current status of the deep learning field and the advantages that a deep learning network can provide for more efficient and convenient patient diagnosis.

During my fellowship at the Insight program, I decided to start my journey to deep-learning-land by working on a consulting project for iSono Health, a startup based in San Francisco with the great mission of making breast exams more frequent, convenient, safer, and interpretable for patients. I remember that as I was reading through the description of the project, I kept saying “yes!”, “yes!”. The reason for this excitement was twofold: first, I could certainly see myself using their technology, and second, I had a large number of images to do deep learning on for this great cause. And here it is …


Breast cancer is the most common cancer in women, affecting about 1 in 8 U.S. women. Mammography is the primary screening tool for breast cancer but has significant shortcomings for early detection: 1) limited access; 2) insufficient frequency; 3) x-ray radiation; 4) deficient sensitivity. As a result, more than 50% of breast cancers are discovered in the interim period between screenings and 1 in 3 breast cancers are missed at early stages. Ultrasound is a proven, scalable technology and the most accessible solution worldwide for early breast cancer detection. However, dependence on a highly skilled operator has limited the wide implementation of hand-held ultrasound probe-based systems. The iSono Health platform combines automated 3D ultrasound technology with artificial intelligence and cloud computation for accessible and personalized breast health monitoring (Figure 1).

Figure 1: The portable 3D ultrasound coupled with a machine learning algorithm that helps patients monitor their breast health in the convenience of their home (credit: iSono Health)

Images and Model Selection:

The image set included a total of 1061 benign images from 13 subcategories and 1472 malignant images from 14 subcategories. Figure 2 depicts the complexity of the images and the nontrivial nature of detecting the class of the lesion (benign or malignant) given the wide spectrum of characteristics associated with lesions. I was also given about 3000 video frames of normal images (without tumor).

Deep neural networks built from multiple convolutional layers (convolutional neural nets, or ConvNets) have proven to be very efficient for image recognition and classification. They are becoming the norm for large-scale image sets, replacing algorithmic pattern recognition methods, as can be appreciated from the Google Trends report.

Custom-built ConvNets, however, have their own issues: they are often difficult to train and often require an architecture with multiple sequential convolutional layers, demanding high computational power. A more ideal scenario is to have a framework in which a pretrained ConvNet is employed to train a new classifier instead of training a network from scratch.

This concept of using a pretrained ConvNet to train on a new set of images is referred to as “transfer learning”, which is the method that I used for classification of the breast ultrasound images.
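To make the idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the project's code: `frozen_features` is a hypothetical stand-in for a pretrained ConvNet whose parameters are never updated, `train_head` fits a small logistic classifier on top of those fixed features, and the toy "images" are short lists of pixel intensities invented for this example.

```python
import math


def frozen_features(image):
    """Hypothetical stand-in for a pretrained ConvNet (never updated)."""
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread]


def predict_proba(weights, bias, feats):
    """Probability of the positive class from a logistic classifier head."""
    z = sum(w * x for w, x in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_head(feature_rows, labels, epochs=200, lr=0.5):
    """Train only the classifier head with stochastic gradient descent."""
    weights = [0.0] * len(feature_rows[0])
    bias = 0.0
    for _ in range(epochs):
        for feats, label in zip(feature_rows, labels):
            error = predict_proba(weights, bias, feats) - label
            weights = [w - lr * error * x for w, x in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


# Toy data invented for illustration: label 1 marks a high-contrast pattern.
toy_images = [
    [0.5, 0.5, 0.5, 0.5],
    [0.4, 0.45, 0.4, 0.45],
    [0.1, 0.9, 0.1, 0.9],
    [0.0, 1.0, 0.2, 0.8],
]
toy_labels = [0, 0, 1, 1]

# Features are computed once by the frozen extractor; only the head learns.
feature_rows = [frozen_features(img) for img in toy_images]
weights, bias = train_head(feature_rows, toy_labels)
predictions = [int(predict_proba(weights, bias, f) >= 0.5) for f in feature_rows]
```

The point of the design is that the expensive part (the feature extractor) is reused as-is, so only a handful of parameters have to be learned from the new images.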

For my pretrained model, I selected Google InceptionV3, which has shown good performance compared to other ConvNets (Figure 3). InceptionV3 has been trained on ImageNet, a set including about 1.2 million images. In a recent article published in Nature, the authors reported the effectiveness of using InceptionV3 for classification of 129,450 clinical images to detect skin cancer across a range of lesion classes, including melanoma.

InceptionV3 performance in comparison with other deep ConvNets. Credit: Canziani and Paszke

I then trained a classifier on top of the ConvNet using the softmax function in TensorFlow and used a 0.5 threshold for binary classification. As mentioned earlier, compared to mammograms, ultrasound provides superior sensitivity (recall). Therefore, the goal for automated classification of ultrasound breast images is to optimize sensitivity, capturing the malignant lesions without sacrificing much accuracy.
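The softmax-plus-threshold decision can be sketched in a few lines of plain Python. This is an illustration rather than the TensorFlow code from the project; the function names and the `[benign, malignant]` ordering of the logits are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, threshold=0.5):
    """Label a lesion malignant when P(malignant) reaches the threshold.

    Assumes logits are ordered [benign, malignant]; that ordering is an
    illustrative choice, not taken from the original project.
    """
    p_malignant = softmax(logits)[1]
    label = "malignant" if p_malignant >= threshold else "benign"
    return label, p_malignant
```

For example, `classify([0.2, 0.0])` gives P(malignant) of about 0.45 and therefore the label "benign" at the default threshold, while `classify([0.2, 0.0], threshold=0.44)` labels the same image "malignant".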

Image Preprocessing:

Upon assessing the images, I found that each had a dark margin whose width and height varied from image to image. Therefore, the first image preparation step was to detect the dark margins and crop them out via an algorithm that I wrote which, in a nutshell, 1) took the average of the rows and columns; 2) found the gradient of the averaged rows and columns; 3) found the spikes at the beginning and end of rows and columns by flagging the gradients larger than three times the standard deviation (Figure 3). The cropping step yielded images of varying sizes since the dark margins differed as well. Other image preprocessing steps included grayscale conversion and optional augmentations.

Example of a raw image with dark margins.
Cropped image after auto-removing the dark margins.
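The three cropping steps can be sketched as follows. This is a reconstruction from the description above in plain Python, not the original implementation: the function names, the use of a forward difference as the gradient, the choice of the population standard deviation of the gradient as the reference, and the rule of looking for a rising spike in the first half and a falling spike in the second half are all assumptions. The image is represented as a list of equal-length rows of grayscale values.

```python
import statistics


def profile_gradient(profile):
    """Forward difference between neighbouring entries of a 1D profile."""
    return [b - a for a, b in zip(profile, profile[1:])]


def content_bounds(profile, n_sigma=3.0):
    """Return (start, stop) indices of the bright region of a mean profile."""
    grad = profile_gradient(profile)
    sigma = statistics.pstdev(grad)
    # Step 3: flag gradients larger than n_sigma standard deviations.
    spikes = [i for i, g in enumerate(grad) if abs(g) > n_sigma * sigma]
    middle = len(grad) / 2
    rising = [i for i in spikes if grad[i] > 0 and i < middle]
    falling = [i for i in spikes if grad[i] < 0 and i >= middle]
    start = rising[0] + 1 if rising else 0
    stop = falling[-1] + 1 if falling else len(profile)
    return start, stop


def crop_dark_margins(image, n_sigma=3.0):
    """Crop dark margins from a grayscale image (list of pixel rows)."""
    # Step 1: average intensity of every row and every column.
    row_means = [sum(row) / len(row) for row in image]
    col_means = [sum(col) / len(image) for col in zip(*image)]
    # Steps 2 and 3 happen inside content_bounds.
    top, bottom = content_bounds(row_means, n_sigma)
    left, right = content_bounds(col_means, n_sigma)
    return [row[left:right] for row in image[top:bottom]]


def make_demo_image(height=30, width=40, top=4, bottom=3, left=5, right=6):
    """Synthetic image: textured bright interior framed by zero margins."""
    image = []
    for r in range(height):
        row = []
        for c in range(width):
            inside = top <= r < height - bottom and left <= c < width - right
            row.append(100 + (7 * r + 3 * c) % 10 if inside else 0)
        image.append(row)
    return image


cropped = crop_dark_margins(make_demo_image())
```

On the synthetic 30 × 40 image, the crop removes the 4/3-pixel top/bottom and 5/6-pixel left/right margins and leaves a 23 × 29 interior. Note that with only two sharp edges in a profile, the three-sigma rule needs roughly 20 or more rows (or columns) before the edges stand out from the gradient's own standard deviation.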

Image Augmentation:

The model was trained both with the original images and with a set of augmented images, using augmentation steps that were deemed meaningful for ultrasound breast imaging. These included 5 random repetitions of each of the following effects: rotation (-20 to 20 degrees) with random left-right flip, shearing, a combination of shearing and rotation, translation, edge sharpening, and contrast. The augmentation steps were implemented using the imgaug module. After augmentation, the number of images increased 31-fold, including the original images.

Different augmentation steps applied on the original images.
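The 31-fold figure follows from 6 effects × 5 repetitions + 1 original image. The sketch below only enumerates that plan in plain Python; it does not call imgaug or transform any pixels. The effect labels, the function names, and the 50% flip probability are assumptions made for illustration; only the rotation range of -20 to 20 degrees comes from the description above.

```python
import random

EFFECTS = [
    "rotation_with_flip",
    "shear",
    "shear_and_rotation",
    "translation",
    "edge_sharpening",
    "contrast",
]
REPETITIONS = 5


def sample_rotation_with_flip(rng):
    """Draw parameters for one rotation: angle in [-20, 20] degrees plus
    a left-right flip (the 50% flip probability is an assumption)."""
    return {
        "angle_deg": rng.uniform(-20.0, 20.0),
        "flip_lr": rng.random() < 0.5,
    }


def augmentation_plan(image_id):
    """List every variant generated for one image, original first."""
    plan = [(image_id, "original", 0)]
    for effect in EFFECTS:
        for repeat in range(REPETITIONS):
            plan.append((image_id, effect, repeat))
    return plan


variants_per_image = len(augmentation_plan("img_001"))
```

Here `variants_per_image` is 31, matching the growth in image count reported above.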

Data Distribution:

10% of the total data was held out for testing, 10% was assigned to validation, and 80% to the training set. After retraining on the training data, the model was tested on the held-out test set and its performance was then assessed.
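A split along these lines can be sketched as below. The function name, the fixed seed, and the rounding rule are assumptions for illustration, not details of the original pipeline.

```python
import random


def split_dataset(items, test_frac=0.10, val_frac=0.10, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# 1061 benign + 1472 malignant = 2533 lesion images, indexed 0..2532.
train_ids, val_ids, test_ids = split_dataset(range(2533))
```

With the 2533 lesion images this yields 2027 training, 253 validation, and 253 test images. A common precaution when augmentation is used is to split first and augment only the training portion, so that variants of one image cannot land in both the training and test sets.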

Pipeline Design:

In the design of the classifier pipeline, I took into consideration the level of information that is going to be shared with patients and their doctors separately. For patient use, it is important to inform them of the nature of the breast image and whether it is normal or abnormal (with lesion). At the clinical stage, more detailed information on the type of the lesion can be shared with the clinicians. Therefore, the classifier that I considered was a 2-tier classifier as shown below. The normal classifier yielded an accuracy of 98.5%, which can deliver high-confidence results to the patients. The training sample for the normal label was two times as large as the sum of the benign and malignant images constituting the tumor training set.

2-Tier classifier: patients are given information on the nature of their images (normal versus abnormal). If classified as abnormal breast tissue, clinicians will then receive information on the type of the tumor.
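The control flow of the two tiers can be sketched as follows. The two classifiers are passed in as plain callables; `is_abnormal` and `lesion_type` are hypothetical placeholders for the trained models, and the report keys are invented for this example.

```python
def two_tier_classify(image, is_abnormal, lesion_type):
    """Run tier 1 (normal vs. abnormal); run tier 2 only when abnormal.

    is_abnormal(image) -> bool and lesion_type(image) -> str are
    hypothetical stand-ins for the two trained classifiers.
    """
    if not is_abnormal(image):
        # Tier 2 is skipped entirely for normal tissue.
        return {"patient_report": "normal", "clinician_report": None}
    return {
        "patient_report": "abnormal",
        "clinician_report": lesion_type(image),
    }
```

Keeping the tiers separate means the patient-facing answer depends only on the normal-versus-abnormal classifier, while the benign-versus-malignant output is produced only for the clinician.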

Model Performance:

The InceptionV3-retrained classifier was directly compared with a simpler neural network that was previously tested on this data set. The architecture of the custom-built ConvNet consisted of three convolution layers. For a matched comparison, the same pickled data used with the previous neural net was examined with the InceptionV3-retrained classifier. The data set was 12 times the size of the original image set, obtained by random rotations, with each image downsampled by a factor of 5. With the InceptionV3 model (current), recall improved by 4 points (65% to 69%) over the custom-built ConvNet (old).

I also evaluated the model using just the original images without any downsampling or augmentation. Although downsampling saves training time, it removes some of the features that could help train the model better. Eliminating the downsampling resulted in a further improvement of recall to 73%, which is the reported sensitivity that is clinically achieved for breast examination with ultrasound. For further evaluation of the model, I increased the sample size to 31 times the original data by applying the augmentation steps (as described above). This resulted in an even further improvement of recall to 75%.

Performance evaluation of the InceptionV3-based model (Current) and the custom convolutional neural net with 3 convolution layers (Old) on matching data. Model performance further improved after eliminating downsampling (Current+). Augmentation of the images resulted in even higher recall at the expense of longer training time.

The effect of augmentation was more apparent at higher recall values, achievable by lowering the threshold for the softmax classifier from the conventional 0.5 value. As shown in the figure below, lowering the threshold for malignant classification to 0.44 results in an 80% recall for both the model trained with the original data set (left) and the model trained with the augmented data set (right). For the augmented-trained model, achieving this high recall lowers the specificity to 71% (as opposed to 75% with a softmax threshold of 0.5). For the model trained with the original data set (no augmentation), increasing the recall to 80% results in a low specificity of 63% (74% with a softmax threshold of 0.5).

ROC plots of the test data classified with the model trained with original images with no augmentation (left) and the model trained with the augmented image set (right).
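The trade-off between recall and specificity as the threshold moves can be reproduced on a small scale. The sketch below is illustrative only: the function name is an assumption, and the scores and labels are invented toy values, not project data.

```python
def recall_specificity(scores, labels, threshold):
    """Recall and specificity when score >= threshold means malignant.

    scores: predicted P(malignant); labels: 1 = malignant, 0 = benign.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_malignant = score >= threshold
        if label == 1:
            if predicted_malignant:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_malignant:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy values invented for illustration (not results from the project).
toy_scores = [0.9, 0.7, 0.55, 0.47, 0.3, 0.6, 0.46, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

at_default = recall_specificity(toy_scores, toy_labels, 0.5)
at_lowered = recall_specificity(toy_scores, toy_labels, 0.44)
```

On these toy values, lowering the threshold from 0.5 to 0.44 raises recall from 0.6 to 0.8 while specificity drops from 0.8 to 0.6. Each point on a ROC curve corresponds to one such threshold.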

Final Remarks:

For the given image set:

• Transfer learning vs. 3-Conv Net → improved recall

• No downsampling → preserved features → better recall

• Without augmentation → same recall → saved training time

• Augmentation → improved specificity for a given recall

Future work:

• Multi-class classification for each etiology subtype