Evaluating AI-Powered Autonomous and Assistive Diagnostic Tools — Performance Metrics

Swish Team 💫 · Published in Swish Labs · Feb 8, 2019

As more and more companies launch AI-powered assistive and autonomous diagnostic software products, care providers will be increasingly called upon to evaluate them. Vendors provide a standard set of performance metrics, and I will step through how those values can be used to compare and select products. There are many other considerations, such as workflow integration and processing speed, but before you even get into those, you need to know how much predictive power the system has. As much as possible, I will highlight purchasable products to anchor the discussion in what is presently possible for care providers to use. Let’s start with a few AI-driven products for motivation.

The first autonomous diagnostic AI system to receive FDA authorization is IDx-DR, a tool for detecting more than mild diabetic retinopathy (mtmDR). In the announcement, the FDA remarks, “IDx-DR is the first device authorized for marketing that provides a screening decision without the need for a clinician to also interpret the image or results, which makes it usable by health care providers who may not normally be involved in eye care.” Putting on my hat as a clinician who might be considering this product, I need to think through a few questions: How well does it perform relative to human ophthalmologists? How well does it perform on an absolute basis? For IDx-DR, the published sensitivity score is 0.87 and the specificity is 0.90. These metrics will help us answer the suitability questions.

A second product, this one focused on arrhythmia detection and meant to be used as an assistive device, is the Zio EKG monitoring patch by iRhythm. Impressively, they report that “A new study published in Nature Medicine shows that a deep learning algorithm can learn to detect and classify arrhythmias with the accuracy of a cardiologist.” They provide an average area under the receiver operating characteristic curve (AUROC) of 0.97, and remark that “With specificity fixed at the average specificity achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes.” The AUROC metric will help us understand the predictive value of the software.

A Quick Break To Go Over How To Read The Numbers

There are two main questions of interest when assessing model performance.

  • How does it stack up relative to humans?
  • What is the performance on an absolute basis?

If it’s not on par with people, it’s going to be hard to justify clinical use. If both people and algorithms perform poorly, there is of course little benefit either. Where it starts to get useful is when performance reaches human or superhuman levels and the absolute values are consistent with clinical goals, that is, when the true positive and false negative rates are appropriate for the condition being screened.

Three metrics help make both the relative and the absolute comparison.

  • Sensitivity: Sensitivity, also called the ‘True Positive Rate’, is the proportion of actual positive observations that are correctly labeled positive: TPR = TP/(TP + FN). Everything that is actually positive is the sum of the items correctly labeled as positive plus the positive items incorrectly labeled negative.
  • Specificity: Specificity, also called the ‘True Negative Rate’, is the proportion of actual negative observations that are correctly labeled negative: TNR = TN/(TN + FP). Everything that is actually negative is the sum of the items correctly labeled as negative plus the negative items falsely labeled as positive.
  • Receiver Operating Characteristic (ROC) Curve: ROC curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model as the probability threshold is varied (a short code sketch below shows how these values are computed).
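
To make these definitions concrete, here is a minimal sketch in Python (assuming scikit-learn and NumPy are available; the labels and scores are made up for illustration) that computes sensitivity, specificity, and AUROC for a small set of predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical ground-truth labels (1 = has the condition) and model scores.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.91, 0.85, 0.78, 0.42, 0.65, 0.30, 0.22, 0.15, 0.08, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # apply a 0.5 probability threshold

# Confusion matrix counts for the chosen threshold.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)            # True Positive Rate: TP / (TP + FN)
specificity = tn / (tn + fp)            # True Negative Rate: TN / (TN + FP)
auroc = roc_auc_score(y_true, y_prob)   # area under the ROC curve, threshold-free

print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"AUROC:       {auroc:.2f}")
```

Note that sensitivity and specificity depend on the probability threshold you apply, while AUROC summarizes the ranking quality across all thresholds.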

A number of articles and videos cover these topics nicely.

  • Interactive ROC visualization, link.
  • Narrated explanation of the ROC visualization, link.
  • Wikipedia entry for Sensitivity & Specificity, link.
  • Wikipedia entry for ROC, link.

There are a handful of related metrics, such as precision and the F-measure, which I won’t discuss here but which are also frequently provided and reveal similar insights about model performance.

Whenever either a human or a model categorizes something, there are two possible outcomes: they either get it right or get it wrong. Let’s forget about man or machine for a minute, since the numbers mean the same thing regardless of who or what does the classifying. In the set of items being classified, there will be a mix: some items have the thing we are trying to label, and some do not. What we are interested in is, for the things that are the thing, how many did we get right? And out of the things that are not the thing, how many did we get wrong? If we are looking at cars we might try to label them as new or used, if we are looking at a pile of clothes we may want to separate them into darks and whites, and if we are examining patients we want to identify who is sick and who is healthy. In all of these scenarios, to measure performance we need to know 1) the number of actual items per class (e.g. new, not-new; dark, not-dark; healthy, not-healthy), and 2) the number of items correctly or incorrectly labeled per class.
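
As a toy illustration of that counting, here is a short sketch (plain Python; the laundry labels are made up) that tallies the per-class counts for the darks-versus-whites example and turns them into the rates defined above.

```python
# Toy example: sorting laundry into darks ("positive") and whites ("negative").
actual    = ["dark", "dark", "dark", "white", "white", "white", "white", "dark"]
predicted = ["dark", "white", "dark", "white", "dark", "white", "white", "dark"]

tp = fp = tn = fn = 0
for truth, guess in zip(actual, predicted):
    if truth == "dark" and guess == "dark":
        tp += 1   # correctly labeled dark
    elif truth == "dark" and guess == "white":
        fn += 1   # a dark item we missed
    elif truth == "white" and guess == "white":
        tn += 1   # correctly labeled white
    else:
        fp += 1   # a white item wrongly labeled dark

print(f"Actual darks: {tp + fn}, actual whites: {tn + fp}")
print(f"True Positive Rate (sensitivity): {tp / (tp + fn):.2f}")
print(f"True Negative Rate (specificity): {tn / (tn + fp):.2f}")
```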

Evaluating Vendors

With that review of the comparison concepts in hand, let’s take another look at the IDx and iRhythm metrics mentioned previously, along with a product for electronic health record predictive analytics, Dascena Insight, and a dermatology model, and evaluate their relative and absolute performance.

IDx-DR/Diabetic Retinopathy

As mentioned in the intro, IDx-DR is an FDA-authorized system for the autonomous diagnosis of diabetic retinopathy. Let’s look at how well it performs on a relative basis and an absolute basis. The reported Sensitivity and Specificity values are 0.87 and 0.90, respectively. A recent Google publication shows that trained ophthalmologists achieve comparable results. In the ROC from the paper displayed below, the colored dots are the ophthalmologist results and the blue dots are retinal specialists.

Therefore, for the relative performance, the system appears to perform at a human level. However, it would be helpful to have a larger study of ophthalmologist performance in order to be certain.

iRhythm/Cardiac Monitoring

Switching gears to cardiology, let’s take a look at the metrics provided by iRhythm to understand the quality of their classifier. For the relative performance, an interesting property of an AI model is that you can fix the TPR and minimize the FPR, or fix the FPR and maximize the TPR. In their research, iRhythm did exactly this, fixing the specificity (and therefore the false positive rate) at the cardiologist level and measuring whether the sensitivity was better or worse than the cardiologists’. The result they found was that “Fixing the specificity at the average specificity level achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes.” What this means is that when they locked the false positive rate at the average cardiologist level, their model had a higher true positive rate than the average cardiologist. How much higher, though? The table of sensitivities and specificities lets us figure that out.

(from nature.com)

Let’s imagine we have a set of 100 patients with atrial fibrillation and flutter (AFF), and 100 without it, and those are the two classes. Taking the first line of results, which corresponds to atrial fibrillation and flutter, the table tells us that the average cardiologist would have correctly identified 71 of the 100 patients who actually had AFF (a 71% True Positive Rate) and incorrectly labeled about 5.9 of the patients who did not have AFF as having it (a 5.9% False Positive Rate). The iRhythm model, since the False Positive Rate was fixed, would also have flagged about 5.9 patients as having AFF who in fact did not. The model would have done better at identifying the 100 patients who actually had AFF, however, labeling about 86.1 of them correctly (an 86.1% True Positive Rate). The significance is that, with no worse a false positive rate, the model achieved a higher True Positive Rate. The results tell a similar story for each of the 12 arrhythmias the model classified (all rhythm classes).
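
To see how the fix-one-rate comparison works mechanically, here is a rough sketch (Python with scikit-learn and NumPy; the labels, scores, and targets are synthetic stand-ins, not iRhythm’s data) that picks the model threshold whose specificity matches the cardiologists’ average and then reads off the model’s sensitivity at that threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-ins: true AFF labels and the model's predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, size=1000), 0, 1)

target_specificity = 0.941          # the cardiologists' average (a 5.9% FPR)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Find the ROC point whose false positive rate is closest to 1 - target specificity.
idx = np.argmin(np.abs(fpr - (1 - target_specificity)))

print(f"Threshold to use:     {thresholds[idx]:.2f}")
print(f"Specificity achieved: {1 - fpr[idx]:.3f}")
print(f"Model sensitivity:    {tpr[idx]:.3f}  (compare to the cardiologists' 0.71)")
```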

Dascena/Sepsis

While I’ve been evaluating image-based diagnostic models so far, similar metrics are used to evaluate other AI systems. Let’s consider Dascena’s platform for predicting the onset of sepsis from electronic health records, Sepsis Insight. In the performance section of their product offering, they report Sensitivity of 90% and Specificity of 90% at onset. In a joint study with the University of California, San Francisco (UCSF) Medical Center, they set up a trial “…designed to demonstrate the superiority of using an algorithmic predictor relative to the hospital’s current EHR-native rules-based severe sepsis surveillance system.” To evaluate the differences, we can look at the AUROC, Sensitivity, and Specificity scores across the systems.

(from BMJ)

Using physiological data collected during the study from the enrolled participants, the algorithm more accurately detected severe sepsis than MEWS, the SIRS criteria, the SOFA score, or the qSOFA score in a retrospective analysis. Breaking down the reported values, we can understand the improvements more specifically. Let’s imagine that there are 200 patients in the hospital, of whom 100 have an onset of severe sepsis and 100 do not. First, looking at the AUROC of 0.952, it’s evident that the machine learning algorithm (MLA) from Dascena does a good job of separating the two classes (sepsis onset, no sepsis onset). In comparison, the other systems had lower true positive rates and higher false positive rates. Across all combinations of metrics, the MLA outperformed.

Looking at a specific point on the ROC, the 0.9 Sensitivity and 0.9 Specificity listed in the table, we know this means that out of the 100 imaginary patients with sepsis the MLA would correctly identify 90 (True Positive Rate of 90%) and miss 10 (False Negative Rate of 10%, or 1 − Sensitivity). For the hypothetical 100 patients without sepsis, the model would flag 10 as having sepsis who are in fact healthy (False Positive Rate of 10%, or 1 − Specificity).
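
The arithmetic of turning reported sensitivity and specificity into expected patient counts generalizes easily. Here is a small sketch (plain Python; the 100/100 split mirrors the hypothetical cohort above) of that calculation.

```python
def expected_counts(sensitivity, specificity, n_positive, n_negative):
    """Expected confusion-matrix counts from reported rates and cohort sizes."""
    tp = sensitivity * n_positive          # sick patients correctly flagged
    fn = (1 - sensitivity) * n_positive    # sick patients missed
    tn = specificity * n_negative          # healthy patients correctly cleared
    fp = (1 - specificity) * n_negative    # healthy patients falsely flagged
    return tp, fn, tn, fp

# Dascena's reported operating point, applied to the hypothetical 100/100 cohort.
tp, fn, tn, fp = expected_counts(0.90, 0.90, n_positive=100, n_negative=100)
print(f"Correctly identified sepsis cases: {tp:.0f}, missed: {fn:.0f}")
print(f"Correctly cleared patients: {tn:.0f}, false alarms: {fp:.0f}")
```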

Stanford/Dermatology

Returning to image-based classification approaches, let’s evaluate how well models are able to diagnose skin cancer. The Stanford ML Group built a deep learning model that matched the performance of dermatologists at skin cancer classification.

(from nature.com)

“We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images with two critical binary classification use cases: malignant carcinomas versus benign seborrheic keratoses; and malignant melanomas versus benign nevi. The first case represents the identification of the most common cancers, the second represents the identification of the deadliest skin cancer.”

The results demonstrate that their model outperformed the average dermatologist, as can be seen in the ROC diagrams. One thing worth calling out is that these plot the True Positive Rate (sensitivity) against the True Negative Rate (specificity). The prior ROC diagrams plotted Sensitivity vs. 1 − Specificity. The information conveyed is identical, with a low false positive rate corresponding to the right side of the axis instead of the left.

(from nature.com)
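
If the flipped axis is confusing, a quick sketch (Python with scikit-learn, NumPy, and matplotlib; the data is synthetic) shows that plotting sensitivity against specificity is just the mirror image of the usual sensitivity versus 1 − specificity ROC.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic scores, just to draw a curve.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.5 + rng.normal(0.35, 0.25, size=500), 0, 1)
fpr, tpr, _ = roc_curve(y_true, y_prob)

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

left.plot(fpr, tpr)                 # conventional ROC: sensitivity vs 1 - specificity
left.set_xlabel("1 - Specificity (FPR)")
left.set_ylabel("Sensitivity (TPR)")

right.plot(1 - fpr, tpr)            # same curve mirrored: high specificity (low FPR)
right.set_xlabel("Specificity (TNR)")   # now sits on the right side of the axis
right.set_ylabel("Sensitivity (TPR)")

plt.tight_layout()
plt.show()
```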

Benefits and Considerations with AI-Driven Autonomous and Assistive Diagnostics

Benefits

  • AI can help address physician shortages, particularly among underserved populations. Brazil, for instance, recently deployed a teledermatology platform and cleared a backlog of 60k patients with their computer vision application.
  • Models can be run against historical records, as there is a low marginal cost for processing existing images.
  • Other benefits, in Google’s words, include: Increasing efficiency, reproducibility, and coverage of screening programs; reducing barriers to access; and improving patient outcomes by providing early detection and treatment (link).

Comparing Vendor Performance

When multiple vendors have a diagnostic solution for the same disease or condition, the metrics we have been going through, sensitivity, specificity, and AUROC, can be used to compare the performance between them. One challenge that comes up in doing this is that the datasets vendors benchmark against are often different, and the thresholds chosen for reporting sensitivity and specificity can differ as well. Additionally, in speaking with data scientists about vendor comparisons, the observation has been made that every group of patients will differ from the datasets vendors train and evaluate their models against. A good step, therefore, is to run a representative set of about 1k patient data points through the models to generate metrics for your specific patient population. In this way, you will be able to see how the model will likely perform in your own clinical use.
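
In practice, that local benchmarking can be a very small harness. Here is a rough sketch (Python with pandas and scikit-learn; the file name, label column, and vendor score columns are hypothetical placeholders) of computing sensitivity, specificity, and AUROC for each vendor on your own labeled cohort.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical local cohort: one row per patient, a ground-truth label column,
# and one probability column per vendor model the patient data was run through.
cohort = pd.read_csv("local_cohort_scores.csv")   # placeholder file name
vendors = ["vendor_a_score", "vendor_b_score"]    # placeholder column names

for vendor in vendors:
    y_true = cohort["has_condition"]              # placeholder label column (0/1)
    y_prob = cohort[vendor]
    y_pred = (y_prob >= 0.5).astype(int)          # or the vendor's recommended cutoff

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"{vendor}: sensitivity={tp / (tp + fn):.2f}, "
          f"specificity={tn / (tn + fp):.2f}, "
          f"AUROC={roc_auc_score(y_true, y_prob):.2f}")
```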

Liability

Where does the liability lie for mistakes made by an autonomous system? It makes sense that it would be the software vendor. This is the view taken by IDx, and articulated by their founder, Michael Abramoff, MD, PhD, in this Washington Post interview. When the question was raised about who is ultimately responsible when the AI gets something wrong, he noted, “The AI is responsible, and therefore the company is responsible. […] We have malpractice insurance, AI is not perfect, doctors are not perfect […] it will make errors.” (interview, 15:00–16:00). I think the interesting thing here is that when one has a baseline for sensitivity and specificity, you can determine if the model is making more or fewer mistakes than a person would. That leads into the next topic: what are the benefits of applying a model instead of human evaluations anyway?

Choosing What To Be Right About

Clinicians are not explicitly targeting a True Positive Rate or a False Positive Rate as they make diagnoses; they are doing their best at both, and of course they ultimately do generate TPR and FPR metrics. The curious thing about AI models, however, is that you get to choose which rate to prioritize. If you play with the ROC charting tool linked above, you can see the trade-offs as you move the false positive rate from 0 to 1.

In a scenario where it is acceptable to have a lot of false positives, or in which a high true positive rate is the most crucial thing, you can pick the probability threshold used to separate the classes accordingly. Here’s the link to the interactive ROC chart again to explore the trade-offs, link.
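
Choosing that operating point is the mirror image of the iRhythm example earlier. Here is a brief sketch (Python with scikit-learn and NumPy; synthetic scores) that picks the lowest threshold still meeting a required true positive rate and then reports the false positive rate you pay for it.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic scores standing in for a model's predicted probabilities.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.55 + rng.normal(0.3, 0.25, size=1000), 0, 1)

required_tpr = 0.95                       # e.g. we must catch 95% of true cases
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# First ROC point (lowest FPR) that meets the required sensitivity.
idx = np.argmax(tpr >= required_tpr)

print(f"Threshold to use:         {thresholds[idx]:.2f}")
print(f"Sensitivity achieved:     {tpr[idx]:.3f}")
print(f"False positive rate cost: {fpr[idx]:.3f}")
```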

The autonomous diagnosis of medical conditions is in its early days, and over the coming years, we can expect true positive rates will get better and false positive rates will decline as additional data is generated and algorithms improve.

Written by Trevor Z Hallstein, Healthcare Product Lead at Swish Labs
