Poster: Image Recognition Techniques on Digital Images of Colon and Stomach Biopsies

Poster design: Tunke Lauriks, Kan Design, Antwerp, Belgium

Background

Digital Pathology

2017 was the breakthrough year for Whole Slide Imaging (WSI) devices. WSI devices can now quickly turn glass slides into high-resolution digital images. These digital images can be viewed, analysed and managed in ways physical glass slides cannot. The practice of acquiring, managing, sharing and interpreting pathology information in a digital environment, known as digital pathology, has received increased attention since the breakthrough of WSI. Digital pathology offers many new advantages. If analytic tools, machine learning and artificial intelligence techniques become available and easily accessible to pre-analyse slides, more pathologists will be persuaded to embrace these new technologies, an approach known as computational pathology.

Machine Learning

Machine Learning (ML), a subfield of Artificial Intelligence (AI), is the science of building programs that improve their performance by making inferences from data: 'learning from data'. Increasingly powerful hardware, the growth of digital data and the development of artificial neural networks have all contributed to accelerated ML activity. With ML techniques it is possible to recognise patterns in digital images and to classify digital images based on their contents with high accuracy. However, for algorithms to perform recognition and classification tasks with high accuracy, a large amount of labelled data is necessary.

Aim of the pilot study

The aim of this pilot study is to explore the possibilities of training an algorithm on digital images of colon and gastric biopsies labelled as abnormal or normal. The data set consists of high-quality images labelled by acknowledged, highly skilled pathologists. If the algorithm classifies new, unseen images with sufficient accuracy and sufficient precision, we will start a full project whose final aim is an end-to-end product that pre-analyses Whole Slide Images.

Method and Material

Data set

The data set consists of digital images taken from parts of whole slide scans of gastric mucosa and colon mucosa biopsies (JPG format, average size 400 KB). All patient data was omitted, making the images fully anonymous. The digital images were taken at different magnifications: 1x, 2x, 5x, 10x and 20x. Each digital image was labelled as gastric mucosa abnormal/normal or colon mucosa abnormal/normal.

The series of gastric mucosa biopsies, data set A, included biopsies without abnormalities, labelled as normal, and biopsies with inflammatory lesions. Images of the latter were labelled as abnormal if the following features were present:

  • increased number of inflammatory cells
  • interstitial oedema
  • differentiation abnormalities of the epithelial lining

The series of colon mucosa biopsies, data set B, included biopsies without abnormalities, labelled as normal, and biopsies with inflammatory lesions, hyperplastic polyps, adenomatous and villous polyps and malignant lesions, labelled as abnormal. Images were labelled as abnormal if showing one or several of the following histological features:

  • presence of aberrant glands: distortion of the glands (dilatation, branching), presence of villous structures or both
  • differentiation abnormalities of the epithelial lining (decreased numbers or absence of goblet cells, cellular atypia)
  • increased number of inflammatory cells

Images were collected as shown in table 1.

Method

Because of the limited number of images we chose to use a pre-trained convolutional neural network[1] to extract features[2] from the images. We used the VGG16 model[3] with weights pre-trained on the ImageNet data set[4], with the default input size for this model, 224x224x3 (image height, image width, colour channels). We used RGB colour channels and did not convert to grayscale.
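This feature-extraction step can be sketched as follows, assuming TensorFlow/Keras is installed; a random array stands in for a batch of real biopsy image tiles, and the use of global average pooling to obtain one vector per image is an assumption, not stated in the text:

```python
# Sketch: VGG16 (ImageNet weights, convolutional base only) as a fixed
# feature extractor. Dummy data replaces real biopsy tiles.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# include_top=False drops the ImageNet classification head; pooling="avg"
# collapses the final feature maps into one vector per image.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3), pooling="avg")

batch = np.random.rand(4, 224, 224, 3).astype("float32") * 255.0  # 4 dummy RGB tiles
features = base.predict(preprocess_input(batch))                  # shape (4, 512)
```

Swapping in ResNet50, as done later in this study, is a matter of importing `ResNet50` and its own `preprocess_input` instead; its pooled features are 2048-dimensional rather than 512-dimensional.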

The extracted features were fed to different models that classify the images into two categories: abnormal or normal.

Using all the images at different magnifications did not lead to acceptable accuracy on the training set, and neither did training the models on the images at low magnifications (1x, 2x, 5x). These low-magnification images were therefore omitted from the data sets. The remaining data sets comprised images at magnifications 10x and 20x: data set A contained 20 abnormal and 31 normal gastric mucosa images, and data set B contained 73 abnormal and 19 normal images of colon mucosa. To avoid the information loss caused by downsizing the images, each image was split into 6 tiles. For data set A, gastric mucosa, this resulted in 116 abnormal tiles and 185 normal tiles.

In data set B we checked each tile. Some tiles belonging to an image labelled abnormal did not show any characteristics of abnormal tissue and were relabelled as normal. Blank tiles were removed from the data sets, as were tiles more than 90% blank. For data set B, colon mucosa, this resulted in 407 abnormal tiles and 134 normal tiles. Both data sets were split into a train set (80%) and a test set (20%).
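The tiling and blank-tile filtering can be illustrated with a minimal NumPy sketch. The 2x3 tile grid and the 90% threshold come from the text; the near-white grey-value cut-off of 240 used to decide what counts as "blank" is an assumption:

```python
import numpy as np

def split_into_tiles(img, rows=2, cols=3):
    """Split an H x W x 3 image array into rows*cols equal tiles
    (2x3 = 6 tiles per image, as in this study); remainder pixels
    at the edges are dropped."""
    h, w = img.shape[0] // rows, img.shape[1] // cols
    return [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def fraction_blank(tile, white_thresh=240):
    """Fraction of near-white pixels; tiles above 0.9 were discarded.
    The 240 grey-value threshold is an assumed choice."""
    return float((tile.mean(axis=-1) >= white_thresh).mean())

image = np.zeros((600, 900, 3), dtype=np.uint8)  # dummy stained section
tiles = [t for t in split_into_tiles(image) if fraction_blank(t) <= 0.9]
```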

To train the different models, each train set was enlarged five times with data augmentation[5]: rotation, horizontal and vertical shift, zoom and flip. We chose these augmentations because the models have to recognise patterns that are insensitive to translation. The resulting 'empty' sections were filled either in 'constant' mode (background colour) or in 'reflect' mode; using 'reflect' mode resulted in slightly better performance on the train set.
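One way to realise this augmentation in Keras is sketched below; the transformation types and the 'reflect' fill mode mirror the text, but the numeric ranges are hypothetical, as the exact values are not given:

```python
# Sketch: data augmentation with rotation, shifts, zoom and flips.
# fill_mode="reflect" mirrors edge pixels into the 'empty' regions that
# rotation and shifting create ("constant" would use a flat colour).
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=40,        # hypothetical range, not stated in the text
    width_shift_range=0.2,    # hypothetical
    height_shift_range=0.2,   # hypothetical
    zoom_range=0.2,           # hypothetical
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode="reflect",
)

tiles = np.random.rand(10, 224, 224, 3).astype("float32")  # dummy tiles
flow = augmenter.flow(tiles, batch_size=10, shuffle=False)
augmented = next(flow)  # one randomly augmented batch, same shape as input
```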

We trained 3 different models using VGG16 features:

  1. K-Nearest Neighbour[6] classification with n=3 (number of neighbours to use)
  2. Support Vector Machine (SVM)[7] with RBF kernel, C=1 and gamma=1.3
  3. Random Forest classifier[8] with n=300

The models were trained using 5-fold cross validation[9].
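Under the assumption that the extracted features are plain vectors, the three classifiers, the 80/20 split and the 5-fold cross-validation can be sketched with scikit-learn. Random vectors stand in for real VGG16 features here, so the resulting scores are meaningless; only the pipeline shape is illustrated:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(301, 512)            # stand-ins for VGG16 feature vectors
y = rng.randint(0, 2, size=301)   # 0 = normal, 1 = abnormal (dummy labels)

# 80% train / 20% test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The three classifiers with the hyper-parameters quoted in the text.
models = {
    "kNN (n=3)": KNeighborsClassifier(n_neighbors=3),
    "SVM (rbf, C=1, gamma=1.3)": SVC(kernel="rbf", C=1, gamma=1.3),
    "Random Forest (n=300)": RandomForestClassifier(n_estimators=300,
                                                    random_state=0),
}

# Mean accuracy over 5 cross-validation folds, per model.
scores = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
          for name, model in models.items()}
```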

Results

The SVM model returned the highest accuracy on the test set.

Because of the complexity of the patterns to be recognised, we decided to also extract features using the ResNet50 model[10], pre-trained on ImageNet. ResNet50 is a deeper network and yields higher-dimensional features than VGG16. We used data set B for the comparison. The ResNet50 features were fed to the SVM classifier and resulted in a slight improvement: 91.7% accuracy on the training set and 94.9% on the test set.

Interpretation

Because of the small and unbalanced data sets, the results have to be interpreted with caution. The SVM model returned the best results on new images: an accuracy score of 75% on the gastric data set and 94% on the colon data set.

This difference in results can be explained by the larger number of available images in the colon data set, by the relabelling of tiles before training the model, and by the use of ResNet50 features.

To be noted: in data set B only 1% of images labelled abnormal were misclassified. This is important if the model is to serve as a pre-analytic tool. Despite the unbalanced data set, 81% of normal images were recognised.

Finally, we would like to stress that these results could only be obtained because of the high quality of the HE sections of the gastric and colon mucosa biopsies, the high quality of the digital images and the correct labelling.

Conclusion

When digital images of gastric and colon mucosa are labelled by skilled pathologists, it is possible to train a simple model on a small data set that classifies digital images as abnormal or normal with acceptable accuracy.

With the results of this pilot study on a limited number of images, we are now building a model that will analyse whole slide scans, so that we can construct an end-to-end product that will significantly streamline histological analysis.


Extra information

  1. A convolutional neural network (CNN) is a deep artificial neural network used for image analysis. A pre-trained CNN is a network trained on a large data set for a large-scale image classification task. If this data set is large and general, the CNN has 'learned' a spatial feature hierarchy that can be reused in different computer vision problems with different, smaller data sets. Using a pre-trained network allows the new network to start from previously learned patterns instead of learning from scratch. A CNN takes a 3-dimensional object as input. Each image can be represented as such a 3D object: height in pixels, width in pixels and RGB colour channels, i.e. three 2-dimensional matrices stacked together. A black-and-white image is represented by a single 2-dimensional matrix.
  2. Feature extraction is the process of transforming an image into a set of features that represent the image well. The three stacked matrices are converted into a single multi-dimensional vector by feature extraction.
  3. VGG16 is a CNN, 16 layers deep. Ref.: K. Simonyan and A. Zisserman, 'Very Deep Convolutional Networks for Large-Scale Image Recognition', Visual Geometry Group, Department of Engineering Science, University of Oxford. https://arxiv.org/abs/1409.1556
  4. ImageNet is a data set of over 14 million URLs to images, classified into over 20,000 categories. http://image-net.org/index
  5. Data augmentation is a technique used in machine learning to enlarge a data set. Small alterations are made to the original data, such as flipping, rotating or shifting an image. A larger data set lets the algorithm learn more relevant features, so that it generalises better to new, unseen data.
  6. The K-Nearest Neighbour algorithm is a machine learning technique used for classification and regression. http://scikit-learn.org/stable/modules/neighbors.html#neighbors: 'Despite its simplicity, nearest neighbours has been successful in a large number of classification and regression problems, including handwritten digits or satellite image scenes.'
  7. A Support Vector Machine (SVM) is another machine learning technique used for classification and regression. The technique finds the optimal hyperplane that separates two classes with a maximum margin between the data points of each class. Ref.: C. Cortes and V. Vapnik, 'Support-Vector Networks', AT&T Labs-Research, USA. http://homepages.rpi.edu/~bennek/class/mmld/papers/svn.pdf; more info: http://scikit-learn.org/stable/modules/svm.html, http://www.svms.org/history.html
  8. A Random Forest classifier is another machine learning classification technique. From the input images and their labels, the classifier builds multiple decision trees, creating a 'forest'. In doing so it formulates sets of rules that are then used to predict the labels of unseen images. Ref.: L. Breiman, 'Random Forests', Statistics Department, University of California, Berkeley, January 2001. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
  9. Cross-validation is a technique to evaluate the performance of a classifier. The data set is split into a training set to train the model and a test set to evaluate the final performance. In 5-fold cross-validation the training set is split into 5 subsets: the classifier is trained on 4 subsets and validated on the 5th, and this is repeated 5 times, each time using a different subset as validation set and the remaining subsets as training data. After each round the parameters of the classifier can be tuned to enhance performance; when performance no longer improves, the classifier is evaluated on the held-out test set. Ref.: M. Stone, 'Cross-validatory choice and assessment of statistical predictions', J. Royal Stat. Soc., 36(2), 111–147, 1974.
  10. ResNet50 is a residual neural network, 50 layers deep. Ref.: K. He, X. Zhang, S. Ren and J. Sun, 'Deep Residual Learning for Image Recognition', Microsoft Research. https://arxiv.org/pdf/1512.03385.pdf
  11. Precision and recall can be used to express the performance of a classifier. Precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN are the numbers of true positives, false positives and false negatives.

Recall measures the model's ability to find all images in the data set that belong to a given class, while precision expresses the proportion of the images the classifier assigns to a given class that actually belong to that class.
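These two formulas can be computed directly from a list of labels and predictions; the labels below are made up for illustration:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN) for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

# 1 = abnormal, 0 = normal (dummy labels and predictions)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
prec, rec = precision_recall(y_true, y_pred)  # → (2/3, 2/3)
```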