AI in Medicine — Majority decision isn’t always right

Susan Ruyu Qi

Published in Health.AI · Jul 17, 2018 · 5 min read

Following their landmark publication in JAMA, a major breakthrough for both the AI and healthcare communities, Google fine-tuned their deep learning model and published the new results on arXiv. That paper was subsequently published in Ophthalmology, one of the most important journals in the field, with an impact factor of 6.1.

I explained in a previous article how Google designed their initial AI model to detect diabetic retinopathy (DR) from fundus photos. Here’s what’s new in Google’s AI 2.0:

1. Redefining the Gold Standard Using “Adjudication”

Garbage in, garbage out. We all know that an AI model is biased if it is trained on inaccurate labels; in medicine, that can even be dangerous.

While high-quality ground-truth labels are critical for training machine learning models, obtaining them is easier said than done, since medicine is often subjective. Take diabetic retinopathy: doctors will most often agree when lesions are obvious, that this eye has retinopathy and that one does not. But when asked to grade the disease on a scale of 1 to 5, disagreements occur. In the image below, Google showed that different ophthalmologists grade the same retinal image differently, consistent with themselves and with other graders only around 60% of the time.

Each row is an image; each column is an ophthalmologist grader. Colours represent the severity grade given by each ophthalmologist.
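To make that 60% figure concrete, here is a minimal Python sketch, using a made-up grade matrix rather than Google’s data, of how exact inter-grader agreement can be computed from a table like the one in the figure:

```python
import numpy as np

# Hypothetical grade matrix (not Google's data): rows are fundus
# images, columns are graders, values are ICDR grades 0-4.
grades = np.array([
    [1, 2, 1, 1],
    [3, 3, 4, 3],
    [0, 1, 0, 0],
    [2, 2, 2, 3],
])

def pairwise_agreement(grades: np.ndarray) -> float:
    """Fraction of grader pairs that assign exactly the same grade,
    averaged over all images."""
    n_images, n_graders = grades.shape
    agree, total = 0, 0
    for i in range(n_images):
        for a in range(n_graders):
            for b in range(a + 1, n_graders):
                agree += int(grades[i, a] == grades[i, b])
                total += 1
    return agree / total

print(f"Exact inter-grader agreement: {pairwise_agreement(grades):.2f}")
```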

That is, for the most part, due to subjective variation in the exact definitions of the grades and the boundaries between them. For example, while mild DR is defined as “having microaneurysms”, image artefacts often resemble microaneurysms and were a common source of disagreement. Moderate DR is defined as “more than microaneurysms but less than severe NPDR”, which is also open to interpretation. Luckily, clinical care accounts for much more than a single image: the treatment plan is personalized for each patient according to age, medical and family history, disease progression, diabetes control, and a much more thorough eye exam using other imaging modalities (OCT) and direct stereoscopic examination of the retina after dilation. So while grading disagreements do not affect patient care, they make research, and the creation of image labels, much more difficult.

When doctors disagree, who is right? Whose answer should we label as ground truth?

Traditionally, taking the “majority decision” has been a popular way to define the reference standard. For example, Google hired enough ophthalmologists for each image in their initial dataset (128k images) to be read independently by 7–8 different people, then took the majority decision as the final label for each image. This method is flawed: it introduces a bias whereby the algorithm will miss subtle findings that the majority of ophthalmologists fail to identify. In their second study, Google proposed a more rigorous way to define the gold standard: adjudication. Instead of taking the majority grade when there is disagreement, the doctors discuss face to face until they reach a final decision.
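For intuition, here is a toy sketch of majority-vote labelling and of how it can vote a subtle finding away (the tie-breaking rule here is arbitrary; this article does not describe one):

```python
from collections import Counter

def majority_grade(grades: list[int]) -> int:
    """Majority-vote reference standard, as used in the first study.
    Ties are broken arbitrarily here; the actual tie-breaking rule
    is not described in this article."""
    return Counter(grades).most_common(1)[0][0]

# Seven independent grades for one image on the 0-4 ICDR scale:
# one grader spots a subtle lesion (grade 1), the rest see nothing.
print(majority_grade([0, 0, 1, 0, 0, 0, 0]))  # -> 0
# The subtle finding is voted away, which is exactly the bias
# adjudication is meant to remove.
```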

They tested this adjudication process on a subset of around 6,000 images. The images were first evaluated independently by 3 fellowship-trained retina specialists, who then discussed face to face to resolve disagreements and determine a “final diagnosis”.

3,737 images with adjudicated ground-truth labels were used as the tuning set, i.e., for tuning the algorithm’s hyperparameters (e.g., image resolution, learning rate) and making model choices (e.g., network architecture), but not for training the model parameters. The rest of the images with adjudicated grades were used as the validation set.
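The split works roughly like the sketch below, where a tiny scikit-learn model and random features stand in for the real fundus images and deep network; only the roles of the three sets are the point:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy stand-ins: random features instead of fundus images,
# random 0-4 grades instead of real labels.
X_train, y_train = rng.normal(size=(500, 8)), rng.integers(0, 5, 500)
X_tune, y_tune = rng.normal(size=(100, 8)), rng.integers(0, 5, 100)  # adjudicated
X_val, y_val = rng.normal(size=(100, 8)), rng.integers(0, 5, 100)    # adjudicated

# The tuning set selects hyperparameters; it never updates model weights.
best_acc, best_model = -1.0, None
for C in (0.01, 0.1, 1.0):  # hyperparameter candidates
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_tune, model.predict(X_tune))
    if acc > best_acc:
        best_acc, best_model = acc, model

# The validation set is touched once, for the final performance report.
print(accuracy_score(y_val, best_model.predict(X_val)))
```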

In this study, they demonstrated that model performance improved significantly even though only a small subset (0.22%) of the training image grades were adjudicated.

2. Upgrade from Binary Prediction to a 5-class Rating Prediction

Instead of a binary “referable” vs. “non-referable” diabetic retinopathy prediction, in this second study the Google team trained a 5-class model that grades an image’s disease severity: none, mild, moderate, severe, and proliferative. This is in accordance with the most commonly used International Clinical Diabetic Retinopathy (ICDR) disease severity scale, and it makes the model more suitable for clinical practice.
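Schematically, the change amounts to swapping a binary output for a 5-way softmax head. The sketch below uses an off-the-shelf Keras InceptionV3 backbone purely as a stand-in (the paper used Inception-v4, which is not in keras.applications); layer choices are illustrative, not the paper’s architecture:

```python
import tensorflow as tf

NUM_CLASSES = 5  # none, mild, moderate, severe, proliferative (ICDR)

# Generic CNN backbone as a stand-in for the paper's Inception-v4.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, pooling="avg",
    input_shape=(299, 299, 3),
)

model = tf.keras.Sequential([
    backbone,
    # The structural change vs. a binary model: a 5-way softmax
    # head instead of a single sigmoid unit.
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A binary "referable DR" flag can still be derived from the 5-class
# output, e.g. referable if predicted grade >= 2 (moderate or worse).
```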

3. Bigger Data Set (1.6M Images)

The training portion of the development set grew from 128,175 images in the first study to over 1.6M images in this one, containing fundus images from 238,610 patients.

The input resolution of the new model is 779 × 779 pixels, a large increase over the 299 × 299 pixels used in the previous study. The model architecture was also upgraded, from Inception-v3 to Inception-v4.
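As a rough illustration of what the resolution change means for the input pipeline (the decoding and pixel-scaling details here, and the helper name, are my assumptions, not the paper’s):

```python
import tensorflow as tf

def load_fundus_image(path: str, resolution: int = 779) -> tf.Tensor:
    """Decode a fundus photo and resize it to the model's input
    resolution: 779x779 here vs. 299x299 in the earlier study.
    Scaling pixels to [0, 1] is an assumption, not the paper's recipe."""
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.resize(img, (resolution, resolution))
    return img / 255.0
```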

Overall, for each of the five grades, the algorithm achieves AUC values between 0.986 and 0.998. In other words, it can label all five grades of the disease with high sensitivity and specificity!
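Per-grade AUCs like these are typically computed one-vs-rest, with one ROC curve per grade. Here is a sketch using random stand-in predictions (not the paper’s results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Random stand-ins: true grades and predicted 5-class probabilities.
y_true = rng.integers(0, 5, size=1000)
probs = rng.dirichlet(np.ones(5), size=1000)

# One ROC curve per grade, treating that grade as "positive"
# and the other four grades as "negative" (one-vs-rest).
for grade in range(5):
    auc = roc_auc_score((y_true == grade).astype(int), probs[:, grade])
    print(f"Grade {grade}: AUC = {auc:.3f}")
```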

Future: An Increased Need for Ophthalmologists

When tele-ophthalmology was first introduced, it allowed patients to be screened remotely using fundus photos, making care much more accessible to rural populations. It also allowed more diabetic retinopathy cases to be diagnosed and referred to retina specialists for closer examination, treatment, and follow-up, which has overwhelmed many ophthalmology clinics’ already busy schedules and wait-lists. The AI algorithm, with its high throughput and low cost, will make screening even more accessible to patients worldwide.

Diabetic retinopathy is the leading cause of blindness in working-age adults.

DR is an insidious disease that slowly damages the retina, causing symptoms only in the late stages, when the damage has become irreversible. Current guidelines therefore recommend yearly screening for all diabetic patients. In real life, however, fewer than one third of patients get this recommended exam (Ontario, Canada).

By 2040, an estimated 600 million people will have diabetes, with one third expected to have diabetic retinopathy. (Ting et al., JAMA)

With 600 million people requiring yearly screening, alongside increased life expectancy and an aging population, the demand for ophthalmologists will be higher than ever. As a future ophthalmologist, I’m glad that AI algorithms will be there to assist me, freeing me from repetitive pattern-recognition tasks so I can focus on patient care and innovative research.

Read more from Health.AI:

Deep Learning in Ophthalmology — How Google Did It

Machine Learning and OCT Images — the Future of Ophthalmology

Machine Learning and Plastic Surgery

AI & Neural Network for Pediatric Cataract

The Cutting Edge: The Future of Surgery


Susan Ruyu Qi
MD, Ophthalmology Resident | Clinical AI, innovations in ophthalmology and vision sciences