Ocular Disease Intelligent Recognition (ODIR-2019)

Yannick Bodin
DataScientest
Jun 5, 2023
Close-up of a human eye (Credits: Unsplash)

Introduction

Over the years, Artificial Intelligence (AI) techniques have steadily developed and improved in healthcare, with AI-powered tools promising significant improvements in image analysis, robot-assisted surgery, patient monitoring, medical device automation, personalized medicine and drug identification.

In this context, Peking University organized the “Peking University International Competition on Ocular Disease Intelligent Recognition (ODIR-2019)” (3) in 2019, for the development of an AI for image-based diagnosis using a dataset of eye fundi.

This dataset is a structured ophthalmic database covering a “real-life” set of 5,000 patients from different hospitals and medical centers across China, with information on their age and sex, color fundus photographs (fundoscopy) of the left and right eyes, and diagnostic keywords from doctors.

In the following, we try to classify fundus images by disease using computer vision. This work was done with my colleagues Okba Bentoumi and Thibault Vausselin, and supervised by Anthony Jaillet. It is the final project of the Data Scientist course at DataScientest. The development took about 100 hours and allowed us to apply our newly acquired knowledge to a practical problem.


Eye anatomy and ophthalmic diseases

Fundoscopy is an examination of the retina that allows for the diagnosis of eye diseases and the identification of risk factors associated with vision loss.

As light enters the eye, it passes through the crystalline lens, which refracts it. With age, this lens can opacify, a condition named cataract, causing blurry vision and eventually blindness.

When refracted, the light is focused onto the macula, a specific spot of the retina that can also be altered with age, causing a vision-impairing disease named Age-related Macular Degeneration (AMD).

Once it hits the retina, the light is converted into information transmitted to the brain through the optic nerve which can be damaged in some diseases such as glaucoma.

The retina is also supplied with blood by the retinal and cilioretinal arteries. Diseases affecting the morphology or diameter of the blood vessels, such as hypertension or diabetes, will affect these arteries and can eventually lead to blindness.

Finally, modifications in the shape of the eyeball can prevent the light from being properly focused onto the retina and can damage it. This happens in myopia, which is marked by choroidal thinning and morphologic changes in the retina.

Being able to analyze eye fundi thus makes it possible to detect, prevent and treat these diseases, making fundoscopy a cost-effective examination.

(For further details on the eye anatomy : https://www.exetereye.co.uk/the-eye/eye-anatomy/).

Aim

Because these diseases worsen over time, it is critical to detect them properly to prevent severe vision impairment.

The aim here is to develop an AI-based classification of the diseases based on the dataset provided.

Data description

The dataset contains 8 “label” columns: N, D, G, C, A, H, M and O, corresponding respectively to normal, diabetes, glaucoma, cataract, AMD (Age-related Macular Degeneration), hypertension, myopia and other diseases/abnormalities.

The labels are determined by the following rules:
a: The classification labels of one patient depend on the left and right fundus images and the corresponding diagnosis keywords given by medical doctors.
b: A patient is classified as normal if and only if both left and right diagnosis keywords are “normal fundus”.
c: When one of the fundus images is marked as “normal fundus”, the classification labels are decided by the other fundus image.
d: All suspected diseases or abnormalities are treated as diagnosed diseases or abnormalities.

Dataset overview

Each row of the dataset corresponds to one patient and contains information related to the age, sex, eye-fundi (left and right), the diagnostic keywords assigned by the medical staff, and their corresponding labels.

A first look at the dataset info and descriptions shows that no data seems to be missing or duplicated.

Sex-related data are of dtype ‘object’ and may need to be converted to ‘int’ for further analysis. The mean of each label column already reveals a class imbalance problem (see the means of N to O in df.describe()).
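As a minimal sketch of this inspection (assuming the annotation file has been loaded into a pandas DataFrame named df; the file name and the ‘Patient Sex’ column name are assumptions):

import pandas as pd

df = pd.read_excel('ODIR-5K_data.xlsx')   # hypothetical annotation file name

df.info()                      # dtypes and non-null counts: no missing data
print(df.duplicated().sum())   # 0: no duplicated rows

# Convert sex from 'object' to 'int' for numerical analysis
df['Patient Sex'] = df['Patient Sex'].map({'Male': 0, 'Female': 1})

# The mean of each binary label column reveals the class imbalance
print(df[['N', 'D', 'G', 'C', 'A', 'H', 'M', 'O']].mean())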

  • Age distribution of the population
Figure 1. Age distribution of the total population (left) or gender based (right).
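A plot such as Figure 1 can be sketched as follows (assuming the df defined above; the ‘Patient Age’ and ‘Patient Sex’ column names are assumptions):

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Left panel: age distribution of the whole population
sns.histplot(df, x='Patient Age', kde=True, ax=axes[0])
# Right panel: the same distribution, split by sex
sns.histplot(df, x='Patient Age', hue='Patient Sex', kde=True, ax=axes[1])
plt.show()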

The distribution curve exhibits a non-normal, bell-like shape with a main peak centered around 55–60 years (mean age of 57.9, see df.describe()) and a smaller peak at the very early stages of life.

Assessing the clinical relevance of this distribution requires an expert opinion. The main peak may however correlate with the age of onset of most of these diseases, which would explain the average age of the population reported in this dataset.

  • Sex-based age distribution of the population

The sex-based distributions are similar when comparing both groups over 60. Under 60, however, men are more prevalent within the population.

Such an observation again requires an expert opinion to validate its clinical relevance.

  • Distribution of the labels
Figure 2. Age distribution of the overall population for each label (red & green), or based on the patient’s sex (green shades).

For each label, the general distribution of both positive and negative populations seems similar in shape.

The distribution of the cataract-positive population, however, is skewed to the right, towards the elderly, when compared to the negative population, which is consistent with the pathophysiology and age of onset of the disease.

Yet, remarkably, no such observation is made for AMD, another pathology occurring at older ages.

The previously reported class imbalance is also visible, with the labels “Normal”, “Diabetes” and “Others” each assigned to around 30% of all patients, while the other labels represent about 3 to 6% each.

  • Diagnostic keywords

There are 227 and 233 individual diagnostic keywords used by the medical team for the left and right fundi, respectively, with each keyword possibly being combined with others.

Using a word cloud, we see a strong predominance of the terms “normal fundus” as well as “retinopathy” (a term related to diabetes) in both left and right fundi.

Figure 3. Word Cloud of left- and right- fundi’s diagnostic keywords.
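A minimal sketch of such a word cloud (the ‘Left-Diagnostic Keywords’ column name is an assumption):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Concatenate all left-eye keywords into one string and build the cloud
text = ' '.join(df['Left-Diagnostic Keywords'])
wc = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()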

Terms related to technical issues, such as “lens” and “dust”, also seem more predominant than some disease-related terms (“myopia”), indicating potential biases and issues to come in the analysis of the images.

Finally, we see slight differences in some terms such as “degeneration”, “cataract” or “glaucoma” when comparing left and right fundi.

Problem 1

In this project, the relevance of the data should first be discussed and validated with medical experts.

However, with 200+ diagnostic keywords for each fundus and the label assignment relying on the diagnostic keywords of both fundi, it is important to determine the actual correspondence between labels and keywords.

Thus, for this project, we will start addressing this issue in order to optimize the training of a future classification model.

To do so, it appears essential to split the left and right eye fundi apart in order to obtain a fundus-based dataset where each row corresponds to one eye fundus and its related information, rather than to one patient with both eye fundi, allowing for a proper classification of each eye.

Processing of label data

To achieve our goal, we decided to :

  1. Identify all possible label combinations within the dataset and determine their weight.
  2. Identify all diagnostic keywords used and which labels they are predominantly assigned to.
  3. Re-assign each label with its diagnostic keywords.
  4. Properly label each individual eye-fundus based on its individual diagnostic keywords.

Once the processing is finished, we obtain a data frame of 6,077 rows, where each row indicates the diagnosis of one eye and the name of the corresponding image. All images with several diagnoses or flagged with a bad-image-quality comment have been removed.
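A minimal sketch of this per-eye split (the column names, quality terms and keyword-to-label mapping are illustrative assumptions; the real mapping comes from steps 1–3 above, and pandas is imported as in the earlier sketch):

# Build one row per eye fundus instead of one row per patient
left = df[['Left-Fundus', 'Left-Diagnostic Keywords']].rename(
    columns={'Left-Fundus': 'image', 'Left-Diagnostic Keywords': 'keywords'})
right = df[['Right-Fundus', 'Right-Diagnostic Keywords']].rename(
    columns={'Right-Fundus': 'image', 'Right-Diagnostic Keywords': 'keywords'})
fundi = pd.concat([left, right], ignore_index=True)

# Drop fundi flagged with image-quality issues (term list is illustrative)
fundi = fundi[~fundi['keywords'].str.contains('lens dust|low image quality')]

# Keep only single-diagnosis fundi (keywords are comma-separated)
fundi = fundi[~fundi['keywords'].str.contains(',')]

# Map each remaining keyword to its label (mapping built in steps 1-3)
keyword_to_label = {'normal fundus': 'N', 'cataract': 'C'}   # illustrative
fundi['label'] = fundi['keywords'].map(keyword_to_label)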

Image overview

Now that our dataset has been cleaned, we checked the correspondence between the images and their related information within the dataset, as well as the quality and shape of the images.

In doing so, we noted that the images are not uniform in shape, size or quality, with the area corresponding to the actual fundus being more or less zoomed in or out depending on the image. This could be due to the fact that the eye fundi were collected from different medical centers and hospitals, and therefore with different kinds of acquisition devices.

Problem 2

These differences, from the image quality to each disease’s characteristics, make the processing of the images critical in this project.

Moreover, due to the class imbalance, it is important to generate new images for each disease/label prior to training our model, to avoid any over-fitting.

Image processing

To do so, we decided to first crop and center the images, so that all images have roughly the same shape and are centered around the area of interest, i.e. the fundus.
Then we equalized the images to improve their contrast, based on the number of pixels in each color component.
Finally, we resized all images to 224×224.
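A minimal OpenCV sketch of these three steps (the crop threshold and the choice of equalizing the luminance channel are assumptions):

import cv2
import numpy as np

def preprocess(path):
    img = cv2.imread(path)

    # 1. Crop to the bounding box of the fundus (the non-black area)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ys, xs = np.where(gray > 10)   # threshold of 10 is an assumption
    img = img[ys.min():ys.max(), xs.min():xs.max()]

    # 2. Equalize the luminance channel to improve contrast
    yuv = cv2.cvtColor(img, cv2.COLOR_BGR2YUV)
    yuv[:, :, 0] = cv2.equalizeHist(yuv[:, :, 0])
    img = cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)

    # 3. Resize every image to the same 224x224 shape
    return cv2.resize(img, (224, 224))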

The figure below shows the number of fundus images per class:

Once processed, we balanced the classes through image data augmentation using Keras’ ImageDataGenerator, with modifications on rotation, size as well as brightness, contrast, color saturation, hue, gamma and crop.

The code below shows the parameters applied to the generators:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    data_format='channels_last',   # images are (height, width, channels)
    rotation_range=40,             # random rotations of up to 40 degrees
    height_shift_range=0.3,        # random vertical shifts
    width_shift_range=0.2,         # random horizontal shifts
    zoom_range=0.2,                # random zoom in/out
    brightness_range=[0.4, 0.9],   # random brightness adjustments
    fill_mode='constant')          # fill newly created pixels with a constant
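Augmented copies of an image can then be drawn from this generator. For example (a sketch, assuming images is a NumPy batch of minority-class images of shape (n, 224, 224, 3)):

# Draw a batch of randomly transformed images from the generator
augmented_batch = next(datagen.flow(images, batch_size=32))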

Given the imbalance of the classes, a different number of images was generated depending on the population per class.

Below, the number of images per class after the rebalancing operation:

Problem 3

In order to correct the class imbalance, we generated additional images for the minority classes, going from 6,000 pictures initially to 18,000 in total. However, such a large number of pictures might impair the training efficiency of our model.

To address this issue, we see 3 options:
- as done above, use generators, which require less RAM (see the sketch after this list);
- convert the images into a NumPy array, allowing for an increase in training speed;
- limit the number of images used for the training (e.g. 100 images per class).
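For the first option, a minimal sketch of a disk-based generator (the directory layout, one sub-folder per class under data/train, is an assumption):

# Stream batches from disk instead of loading all 18,000 images into RAM
train_flow = datagen.flow_from_directory(
    'data/train',              # hypothetical folder, one sub-folder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical')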

The image below shows the complete process of data processing:

Modeling

Convolutional Neural Networks (CNNs) are typically used for image classification. This subtype of neural network is made of several alternating convolutional and pooling layers, whose filters detect the features of the images.

Overall, the process extracts the most relevant features from the images and provides output for classification. The network will be composed of two parts: a first part for feature extraction and a second part for classification.

Keras and PyTorch are the most common libraries for creating such a network. However, training one from scratch can be time-consuming and expensive.
Therefore, pre-trained models, which have learned from large databases (such as ImageNet and its over 14 million images), have been developed and made available.
Here, we have chosen to use some of these pre-trained models for the feature extraction part, among which VGG16, VGG19, ResNet50, Inception and Xception.

As for the classification part, it will contain as many output neurons as there are diseases/labels. To determine the number of hidden layers and the number of neurons in each of them, we used tools such as Tensorboard.
Tensorboard is a TensorFlow visualization tool that allows the graphical visualization of our models’ metrics as a function of the network’s hyperparameters.
To evaluate the overall performance of our model, we used the recall and f1-score metrics.
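As an illustration, a minimal transfer-learning sketch with a VGG16 base (the Flatten layer and the head width of 256 neurons are assumptions; the actual values were tuned with Tensorboard):

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Feature extraction part: pre-trained convolutional base, frozen
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Classification part: dense head with one output neuron per label
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),    # width tuned with Tensorboard
    layers.Dense(8, activation='softmax')])  # N, D, G, C, A, H, M, O

# Log the metrics of each run for visualization in Tensorboard
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='logs/run_dense256')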

Using Tensorboard, we found that adding hidden layers does not improve the model, while increasing the number of neurons does improve the accuracy. However, if the number of neurons increases too much, we may reach an over-fitting of the model.

Results

The training phase was performed taking into account the following metrics:

  • Accuracy
  • Precision
  • Recall

The chosen optimizer is SGD, with the following parameters:

optimizer = tf.keras.optimizers.SGD(  # stochastic gradient descent with Nesterov momentum
    learning_rate=0.001, decay=1e-6, momentum=0.9, nesterov=True)
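This optimizer is then passed to the model along with the metrics listed above (a sketch; the categorical cross-entropy loss is an assumption):

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',   # assumed multi-class loss
              metrics=['accuracy',
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])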

The loss obtained for each model:

The F1-score obtained for each class and for each model:

The recall obtained for each class and for each model:

And finally, the accuracy obtained for each model:

Below is the confusion matrix for the VGG19 model:
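Such a matrix can be produced with scikit-learn as follows (a sketch, assuming a test set X_test with one-hot encoded labels y_test):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = model.predict(X_test)                # class probabilities, shape (n, 8)
cm = confusion_matrix(y_test.argmax(axis=1),  # true class indices
                      y_pred.argmax(axis=1))  # predicted class indices
ConfusionMatrixDisplay(cm, display_labels=list('NDGCAHMO')).plot()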

Conclusion

Our results show a good training of the model, with good classification results overall. However, the confusion matrix shows a lower efficiency when it comes to discriminating between some diseases, such as glaucoma and myopia, and this observation is consistent regardless of the model used. In our opinion, this issue shows the importance of an expert opinion in this project.

Indeed, though myopia-related eye fundi clearly display large bright plaques due to the choroidal thinning, those related to glaucoma also seem to display brighter “traces”. [1] Interestingly, a quick search on the topic shows that glaucoma is characterized by modifications in the cup-to-disc (C/D) ratio, right where the optic nerve leaves and the blood vessels enter the retina. Such information suggests that image segmentation would have been a more appropriate way to process the images, helping the model focus on the areas of interest for the classification.

Moreover, in the eventuality of a future deployment, it seems critical to have expert opinions on the relevance of our data regarding the sex and age distribution of the diseases and population, as well as to help determine whether some of these factors should be taken into account in the model development.

References

  1. Bajwa J. et al. Artificial intelligence in healthcare: transforming the practice of medicine. Future Healthcare Journal, 2021, Vol 8, No 2: e188–94.
  2. Bohr A. & Memarzadeh K. The rise of artificial intelligence in healthcare applications. Artificial Intelligence in Healthcare, 2020, Chapter 2, pages 25–60.
  3. ODIR-2019 source: https://odir2019.grand-challenge.org/

All the code is available in our GitHub repositories.
