Getting the right answer, probably

The Only Probability You Need to Know

Data Science (DS) and Diagnostic Testing (DT)

Stephen Ruberg

--

OK. Here is the spoiler. It’s positive predictive value (PPV).

Introduction

As I mentioned in an earlier article (Analytics, Data Science and Statistics), when I was leading Advanced Analytics at a pharmaceutical company, I would get occasional phone calls from headhunters who were looking for someone to lead a chemistry function or (bio)analytical group. For most of the world and most of history, including the present day, analytical scientists have been those who perform laboratory (bio)chemical assays to obtain an accurate estimate of the true contents of a sample (e.g. fluid or tissue). Hence, I have advocated for the term Data Analytical Sciences (or Scientists) to make clear that our field occupies a vastly different realm from that of (bio)analytical scientists. But do we?!


In this article, I will describe how Data Analytical Scientists can learn a lot from (bio)Analytical Scientists. After all, both disciplines deal with samples from which they want to extract the true contents.

Diagnostic Testing

Diagnostic Testing (DT) has been around for a very long time. People/patients have always wanted answers to important questions about their health or prognosis. Do I have the disease? Will I get better? And now in these modern times: Do I have the gene or other biomarker? Do I have cancer? Will I progress to kidney failure? Certainly, AI and ML have jumped into this space in a very big way — scanning individual patient images to find tumors that humans can’t see or can easily miss; or using neural nets to build models on vast arrays of data to predict some medical outcome for an individual patient and so on.

One problem I see quite often is that Data Scientists are using DS methods, but do not fully realize that they are actually doing DT. Perhaps such Data Scientists merely think of DT as those things you do when you go to the doctor, and they take a blood sample and then send it off to a lab where someone does a (bio)chemical assay that sends back a result that is positive or negative. While that is certainly the 20th century history of DT, the 21st century version of DT now includes the Data Scientist as the laboratory analyst and the computer algorithm as the assay machinery.

Allow me to explain this direct analogy. Well, it’s not really an analogy; it is reality. First, I will explain the principles of DT. They are simple but fundamental to so many problems related to uncertainty — decision-making, statistical hypothesis testing, interpretation of results and ultimately understanding what is likely to be true (remember this is the fundamental role of statistical science).

The Merriam-Webster definition of a diagnostic is “serving to distinguish or identify” (as in “a diagnostic feature”). In other words, finding out whether a patient does or does not have some feature. In the most general sense, that feature could be a present state (e.g. are you pregnant?) or a future state predicted from your present state (e.g. at present, do you have the features of a patient who will progress to a heart attack or acute kidney injury?). In this way, DT incorporates many elements of “predictive analytics.”

When viewed from this general perspective, many activities of DS fall into the realm of DT, though I suspect that often this is unrecognized or at least under-appreciated.

DTs are often based on a biochemical assay. Such assays and the processes to produce a diagnostic result for an individual patient are the same as many DS applications. For a DT, a sample (e.g. blood, urine, sputum, tissue) is processed in a very defined way — e.g. separate the red blood cells from the serum; add a pinch of this chemical or that. The resulting fluid is run through a set of complex machinery — e.g. liquid chromatography (LC) or mass spectrometry (MS) or both — that performs sophisticated manipulations of the sample fluid. That machinery has a variety of settings (e.g. pressure, flow rate, collision energy, etc.) which are fine-tuned using test samples (i.e. samples with known contents). The whole system is validated using a final set of samples with known contents, and a calibration curve is produced relating the assay output (e.g. the peak height on a chromatogram) to the contents of the sample. In the final step of the diagnosis, the contents of the sample are assessed versus a cut-off value (threshold) to determine whether the patient has the feature of interest or not. That is, if the level of a certain protein circulating in your plasma (or urine or whatever fluid or tissue sample) is above the threshold, then you are positive for the feature; if below the threshold, you are negative for the feature. For example, if human chorionic gonadotrophin exceeds 50 milli-International Units per milliliter in a urine sample from the patient, then it is predicted with 99% probability that she is pregnant.
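To make those last two steps concrete, here is a minimal sketch, assuming a hypothetical linear calibration curve and a made-up peak-height input; only the 50 mIU/mL threshold comes from the pregnancy example above:

```python
# A minimal sketch (hypothetical numbers) of the two final steps of a DT:
# map the raw assay output to a concentration via a calibration curve,
# then compare that concentration to a pre-specified threshold.

def calibration_curve(peak_height: float) -> float:
    """Hypothetical linear calibration: concentration (mIU/mL) from peak height."""
    slope, intercept = 2.5, 0.0  # assumed values fitted from validation samples
    return slope * peak_height + intercept

def diagnose(peak_height: float, threshold: float = 50.0) -> str:
    """Declare the feature present if the calibrated value exceeds the threshold."""
    concentration = calibration_curve(peak_height)
    return "positive" if concentration >= threshold else "negative"

print(diagnose(peak_height=24.0))  # -> "positive" (24.0 * 2.5 = 60 mIU/mL >= 50)
```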

Many so-called ML/AI applications in healthcare are simply diagnostics tests. The sample is merely data — clinical data, digital data in the form of pixels or waveforms. The “black box” chemical assay is replaced by an algorithm, and computing equipment substitutes for the assay machinery. Test samples and the validation/calibration process for the assay are conceptually identical for building, training, and validating the algorithm. The output of the algorithm is some quantified measure that is compared to a threshold for declaring whether the feature (e.g. tumor, retinopathy, COVID-19) is present or not.

  • Side Note 1: In a general sense, there are DTs that are not (bio)chemical assays, but are “machine based,” for example, electrocardiograms. It is the same thought process. The patient is “processed” in a certain way (i.e. lying down), and the assay set-up is consistent (the 12 leads are placed around the chest in a precise way). A complex, validated machine takes measurements and produces an output. That quantified output is compared to some standard or threshold to produce a diagnosis for the individual patient. For example, when the QRS complex is greater than 120 milliseconds and downwardly deflected in lead V1, then the patient is diagnosed with a left bundle branch block in their heart.

Developing a Diagnostic Test (DT)

Perhaps by now it is apparent that determining the threshold at which a measurement is declared positive or negative, indicating the feature is present or absent, is a crucial part of the DT design. Sometimes this is referred to as calibration. The calibration of a DT is (or at least should be) governed by how decisions will be made about the patient and the subsequent care they receive. A woman who learns she is pregnant may immediately forego drinking alcoholic beverages and seek an appointment with a doctor. Before we jump to calibration, let’s examine all the pieces of development that lead up to it.

To understand how to optimize decisions made based on a DT result, let’s examine the development, operating characteristics, and interpretation of a DT in general. These have been well-known for decades (if not centuries) and are clearly depicted in the familiar schema shown below, but they appear not to be well understood by some physicians and many patients, and they are certainly overlooked too often in DS work.

A schema for describing the operating characteristics for a diagnostic test (DT).

The DT is designed to have suitable sensitivity and specificity. These are common terms. Sensitivity is determined by taking a large sample of patients who are known to have the feature of interest (i.e. positive control samples) and running them through the diagnostic test. The percentage of those patients who test positive is the sensitivity. It may be written as

Pr( DT result is positive | the patient has the feature).

For those not familiar with probability notation, the vertical bar in the statement is translated as “given” or “assuming.”

Specificity is the converse. Take a large sample of patients who are known not to have the feature, and the percentage of them who show up negative from the DT is the specificity, written as

Pr( DT result is negative | patient does not have the feature).

The false positive (FP) rate (the probability that the DT declares the patient has the feature when in truth they do not) is simply

FP = 1 - specificity.

Similarly, the false negative (FN) rate (the probability that the DT declares the patient does not have the feature when in truth they do) is

FN = 1 - sensitivity.

Of course, a DT having high sensitivity and specificity is a good thing. Sometimes the accuracy of a DT is quoted (i.e. the probability that the DT gets the right answer — a true positive or a true negative). Accuracy is the weighted average of sensitivity and specificity, where the weights are the prevalence of the feature in the population being tested (for sensitivity) and one minus that prevalence (for specificity).
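As a small illustration of these definitions (the labels and DT results below are made up, not data from any study):

```python
import numpy as np

# A sketch of how sensitivity, specificity, and accuracy are estimated from
# validation samples whose true status is known.
# truth: 1 = feature present, 0 = absent; result: 1 = DT positive, 0 = DT negative.
truth  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # made-up true labels
result = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])   # made-up DT calls

sensitivity = result[truth == 1].mean()          # Pr(DT positive | feature present)
specificity = (1 - result[truth == 0]).mean()    # Pr(DT negative | feature absent)
prevalence  = truth.mean()                       # proportion with the feature

# Accuracy as the prevalence-weighted average of sensitivity and specificity.
accuracy = prevalence * sensitivity + (1 - prevalence) * specificity
print(sensitivity, specificity, accuracy)        # approximately 0.75, 0.83, 0.80
```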

Now that brings us to prevalence. Prevalence is the proportion of patients in the population of interest who have the feature of interest. So, it can be thought of as the probability that any given patient taken at random from the population has the feature. For example, the prevalence of a gene, call it the XYZ gene, that is important for some cancer treatments might be 5%. Let’s use this as an example to put these ideas to work, though it has its basis in reality.

Suppose there are 2000 patients who may be candidates for the personalized cancer treatment based on the XYZ gene (I use 2000 to make the arithmetic easier and abundantly clear below). It is important to know the patient’s XYZ status — does the patient have the gene or not? This is critical to the decision about which treatment to use for that individual patient. For ease of exposition, suppose the DT for the XYZ gene has 95% sensitivity and 95% specificity. This would be considered an exceptionally good diagnostic test. For example, the current COVID-19 test that is in wide use can have a 20–30% false negative rate (at least according to my doctor and some other reports). OK, armed with this basic information we can construct the table for this DT.

The 2x2 Table for displaying the operating characteristics of the diagnostic test (DT) for the Gene XYZ example.

If there are 2000 patients and 5% have the gene, then there are 100 patients in the column of the table labelled “Present.” There are 1900 patients in the “Absent” column. With 95% sensitivity, 95 of the 100 patients in the “Present” column will have a positive test result — i.e. a true positive — and 5 patients in that column will have a negative test result — i.e. a false negative. With 95% specificity, 1805 of the 1900 patients in the “Absent” column will have a negative test result — i.e. a true negative — and 95 patients will have a positive test result — i.e. a false positive.

What a physician and a patient really want to know is the answer to the following question: if the DT result is positive, what is the probability that I have the XYZ gene and should get the novel/special treatment for my cancer? When written as a probability statement, it looks like this:

Pr (the patient has the feature | the DT is positive).

This is the inverse probability of the sensitivity stated above. That is, if sensitivity is Pr(A|B), then what we really want to know is Pr(B|A). This latter probability is called the Positive Predictive Value (PPV) and is computed using the “DT Result Positive” row of the table. It is the number of true positives (TP) divided by all test results that are positive — TP plus false positives (FP):

PPV = TP / (TP + FP).

In this example, that calculation is PPV = 95 / (95 + 95) = 50%!
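The arithmetic can be reproduced with a short Python sketch (this simply encodes the expected cell counts implied by prevalence, sensitivity, and specificity; the function name and structure are mine, not from any particular library):

```python
# Reproducing the XYZ-gene arithmetic from the 2x2 table above.
def two_by_two(n, prevalence, sensitivity, specificity):
    """Expected cell counts of the DT table plus PPV and NPV."""
    present, absent = n * prevalence, n * (1 - prevalence)
    tp, fn = present * sensitivity, present * (1 - sensitivity)
    tn, fp = absent * specificity, absent * (1 - specificity)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}

print(two_by_two(n=2000, prevalence=0.05, sensitivity=0.95, specificity=0.95))
# TP=95, FP=95, FN=5, TN=1805, PPV=0.50, NPV=~0.997
```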

That’s right, even with an excellent DT (95% sensitivity and 95% specificity), the probability of getting the right answer when there is a positive DT result is no different than the flip of a coin! How can that be so?! Let’s look at another example.

Suppose we have the same set-up, but the feature ABC is of interest and has a prevalence of 50%. Reproducing the table in the same way, we find PPV = 950 / (950 + 50) = 95%.

The 2x2 Table for the operating characteristics of the diagnostic test (DT) for the Feature ABC example.

The same thinking goes into the Negative Predictive Value (NPV), which is also important. If a patient’s DT comes back negative, how certain can the physician and the patient be that the patient truly does not have the feature (e.g. COVID-19)? In the first example with the XYZ gene, the NPV is 1805 / (1805 + 5) = 99.7%. So, if that DT is negative, then the physician and the patient can be quite confident that the patient is not eligible for the novel, personalized medicine. In the second example of Feature ABC, the NPV is 950 / (950 + 50) = 95%, still quite high.
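Reusing the two_by_two sketch from the XYZ example above, both NPVs fall out directly:

```python
# NPV for the two worked examples (same 95%/95% test, different prevalence).
print(two_by_two(n=2000, prevalence=0.05, sensitivity=0.95, specificity=0.95)["NPV"])  # ~0.997 (XYZ gene)
print(two_by_two(n=2000, prevalence=0.50, sensitivity=0.95, specificity=0.95)["NPV"])  # 0.95  (Feature ABC)
```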

The key message here is that the PPV is related to the prevalence (again, a well-known fact in the DT world for decades). The lower the prevalence, the lower the PPV. Said differently …

The key message is … when you are searching for something rare (a needle in the haystack, a predictive feature in a ‘big data’ dataset), be very careful about interpreting a positive finding (for your model, for your statistical test, your prediction or your diagnosis). It can be quite likely that the “finding” is a false positive — more likely than your intuition might tell you.
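The dependence on prevalence follows directly from Bayes’ rule. The short sketch below, using the same 95%/95% test, shows how quickly the PPV collapses as the feature becomes rarer:

```python
def ppv(prevalence, sensitivity=0.95, specificity=0.95):
    """Pr(feature present | DT positive) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for p in (0.50, 0.10, 0.05, 0.01):
    print(f"prevalence {p:.0%}: PPV = {ppv(p):.0%}")
# prevalence 50%: PPV = 95%
# prevalence 10%: PPV = 68%
# prevalence 5%:  PPV = 50%
# prevalence 1%:  PPV = 16%
```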

Real-Life Example

An example will help to illustrate some very real difficulties when DS is applied to DT, and I hope to share some learnings or suggestions. Let’s take the prediction of acute kidney injury (AKI). The objective is to take as much data as possible about a patient’s current state and predict whether they will progress to AKI in the coming few days. This is especially important since AKI can progress rapidly and be fatal if not treated aggressively. Furthermore, treating AKI early can avert deleterious downstream consequences. So, in this case, the DT is a prediction algorithm, and the feature is AKI. I will use a publication on this topic to share some practical details.

The example comes from an article entitled “A Clinically Applicable Approach to Continuous Prediction of Future Acute Kidney Injury” published in the prestigious journal Nature by a large team of researchers primarily from DeepMind and University College London (Tomašev et al, Nature. 2019 Aug; 572(7767): 116–119.). Here are a few brief details to set the stage along with my commentary.

First, the algorithmic approach to the problem is “a recurrent neural network that operates sequentially over individual electronic health records” (EHR). Furthermore, the authors note, “This model was trained using data that were curated from a multi-site retrospective dataset of 703,782 adult patients from all available sites at the US Department of Veterans Affairs.” This was a large EHR database — “The total number of independent entries in the dataset was approximately 6 billion, including 620,000 features.” Now, here is where some comments are warranted.

  • First, I suspect there were not 6 billion independent entries since there were only 703,782 patients in the database. Certainly, there were some repeated measures or other correlated measures within patients in the database. This can be important in modeling or creating predictions — understanding which observations are truly independent and provide distinct information versus which are dependent (i.e. correlated).
  • Second, and perhaps more importantly, I agree with Nate Silver about being cautious with ‘big data’ (see The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t). “More Information, More Problems” is the title of the very first section of his landmark book. Spurious relationships and over-fitting can lead to dangerously wrong predictions.
  • Lastly, of course, the VA database is not very representative of the general US population, which the authors acknowledge.

The model building process followed a familiar path — “Patients were randomized across training (80%), validation (5%), calibration (5%) and test (10%) sets.” A longer treatise is needed to describe why this is not a foolproof approach (a quick read on the topic is covered in a separate article), but let’s move on to the main points of this article.
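As a sketch of what such a patient-level split looks like (the proportions are those quoted in the paper; the patient-ID array and random seed are stand-ins), splitting by patient rather than by row keeps all of a patient’s repeated records in one partition, so repeated measures cannot leak across partitions:

```python
import numpy as np

# Patient-level random split into the quoted proportions:
# 80% train / 5% validation / 5% calibration / 10% test.
rng = np.random.default_rng(0)
patient_ids = np.arange(703_782)        # stand-in for the real patient identifiers
shuffled = rng.permutation(patient_ids)

n = len(shuffled)
cuts = np.cumsum([int(0.80 * n), int(0.05 * n), int(0.05 * n)])
train_ids, valid_ids, calib_ids, test_ids = np.split(shuffled, cuts)
print(len(train_ids), len(valid_ids), len(calib_ids), len(test_ids))
```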

There are many results presented in the Nature paper about model fitting, calibration and prediction accuracy, but I teased apart some pieces of the article to assess it in the context of traditional DT.

  • First, the prevalence of AKI in this study was 13.4% (“the incidence of KDIGO AKI was 13.4% of admissions”) where KDIGO is an internationally accepted standard for diagnosing or classifying AKI patients.
  • Second, this “diagnostic test” uses a probability of future AKI as the assay quantity of interest. “The model outputs a probability of AKI occurring at any stage of severity within the next 48 h. When the predicted probability exceeds a specified operating-point threshold, the prediction is considered positive.” With a positive signal defined by a probability above the threshold, an alert is created for the physician to act (a positive diagnostic test).
  • Third, the authors quote that there is a “ratio of 2 false alerts for every true alert.” This means that the Positive Predictive Value (PPV) is (true positives) / (true positives + false positives) = 1/(1+2) = 33%. That’s very low, but not surprising as we have learned that low prevalence leads to low PPV.
  • Fourth, in numerous parts of the article and its “Extended Information” online, the authors quote various levels of sensitivity and specificity as well as Area Under the Receiver Operating Characteristic curve (ROC AUC). As noted in this article, sensitivity and specificity are important in designing a diagnostic test, but it is the PPV that is most relevant to the physician and/or patient. As for the ROC AUC, again that is the subject of another article, and in my opinion, while it may have a role in designing a diagnostic test, it does not have a lot of meaning for interpreting an individual diagnostic test result (in this case a prediction of impending AKI).
  • Fifth, from all this information, including that in Extended Table 2a, I have constructed the following diagnostic testing table as described earlier in this article. This format allows a clear and concise understanding of how this AKI algorithmic DT works.

The 2x2 display of the operating characteristics of the AKI algorithmic diagnostic test (DT).

The sensitivity of the diagnostic prediction is Se = 0.85 / 13.4 = 6.3%. So, overall there are very few positive predictions of progression to AKI relative to the size of the AKI population in the VA database. The specificity is Sp = 84.71 / 86.6 = 97.8%, which is exceptionally good. Again, the KEY elements for measuring the performance of a diagnostic test, whether a biochemical assay or an algorithmic assay, are its PPV and NPV. The PPV of approximately one-third, as noted before and explicitly described by the authors, is very low. On the other hand, the NPV appears to be suitably high at 87%.
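For readers who want to check the arithmetic, here is a short sketch that reproduces these operating characteristics from the figures quoted above (all values are percentages of admissions as I reconstructed them from the paper; no new analysis is involved):

```python
# Percentages of admissions, read off the reconstructed AKI 2x2 table above.
tp, present = 0.85, 13.40           # true positives and column total (AKI present)
tn, absent  = 84.71, 86.60          # true negatives and column total (AKI absent)
fn, fp = present - tp, absent - tn  # ~12.55 and ~1.89

sensitivity = tp / present          # ~6.3%
specificity = tn / absent           # ~97.8%
ppv = tp / (tp + fp)                # ~31%, i.e. roughly 1 true alert per 2 false alerts
npv = tn / (tn + fn)                # ~87%
print(sensitivity, specificity, ppv, npv)
```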

Is the PPV unacceptably low or the NPV acceptably high? That requires understanding a utility function, that is, the cost (benefit) of a false (true) positive finding relative to the cost (benefit) of a false (true) negative finding. Let’s explore these concepts.

The low PPV means that if this algorithm predicts you (an individual patient) will progress to AKI in the next 48 hours, there is only a 1/3rd chance that you actually will. Yet, a physician or care team at the hospital will be given an electronic alert, and they are likely to act by taking aggressive measures to treat you as if you are on the path to AKI in order to avert that possible eventuality. There is a clear benefit for the 1/3rd of patients who get such treatment, hopefully averting dire consequences. I do not know what the averted negative consequences are — the duration of ICU stay, the probability of dying — and a lot depends on such quantitative information. The cost is that 2/3rds of the alerts result in unnecessary treatment for patients, potentially causing harm and certainly increasing costs to the healthcare system. Maybe that is a reasonable trade-off — 2 wrong decisions for every correct decision — for such a life-threatening disease. Such considerations are not addressed in the paper, but they are CRITICAL to designing the diagnostic test; that is, to choosing the right cut-off (in this case, a probability) for fine-tuning sensitivity and specificity in order to obtain the optimal PPV based on benefits and risks to the individual patient.
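One way to make such a trade-off explicit is an expected net benefit per tested patient. The sketch below is purely illustrative; the benefit and cost numbers, and the second operating point, are hypothetical placeholders, not values from the paper:

```python
# A sketch of how a utility function could guide the choice of operating point.
def expected_net_benefit(prevalence, sensitivity, specificity,
                         benefit_tp, cost_fp, cost_fn, benefit_tn=0.0):
    """Expected net benefit per tested patient for a given operating point."""
    tp = prevalence * sensitivity
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    fp = (1 - prevalence) * (1 - specificity)
    return benefit_tp * tp + benefit_tn * tn - cost_fp * fp - cost_fn * fn

# Compare two hypothetical operating points at the quoted 13.4% prevalence.
for sens, spec in [(0.063, 0.978), (0.30, 0.90)]:
    print(sens, spec,
          expected_net_benefit(0.134, sens, spec,
                               benefit_tp=10.0, cost_fp=1.0, cost_fn=5.0))
```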

Similar considerations apply to NPV. Of note, 12.5% of patients have false negative findings. That is, the algorithm fails to identify these patients who are truly progressing to AKI. This is perhaps of greatest concern and where the value of a machine learning (ML) algorithm could be most helpful. An algorithm that consumes vast quantities of data and can identify what humans may miss seems to be one of the key value propositions of ML in healthcare. This may be an unacceptably high “miss rate,” but others more versed in the disease progression and healthcare economics are needed to help make such judgments.

  • Side Note 2: These are precisely the evaluations done by U.S. Preventive Services Task Force (USPSTF) when evaluating breast cancer mammography screening (a technological DT) in terms of age of onset and frequency of testing (https://www.acpjournals.org/doi/10.7326/M15-2886). Very recent work has emerged in the use of AI for aiding radiologists in the interpretation of mammographic images (https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30003-0/fulltext). In this latter article, the primary focus was on sensitivity, specificity and ROC AUC with no mention of optimizing the cut-off for PPV or NPV and the cost benefits of such considerations.

The last question that is also central to this or any algorithmic diagnostic test assessment is: “Does this diagnostic test system do any better than existing systems?” Those existing systems could be based on quantitative laboratory measurements or on the knowledge and experience of the physicians and nurses in the ICU. Certainly, ICU physicians and nurses use their knowledge and experience to evaluate measurable characteristics of the patients to decide whether they need to intervene more aggressively or not to avert a potential AKI event. Perhaps their intuitive or experiential “system” and judgments have better PPV and NPV, albeit not necessarily measured formally. One way to explore this is to do a formal test. For example, one hospital could implement the algorithmic diagnostic test, and another similar hospital could proceed with their current approaches, whether it be existing quantitative approaches or such quantitative measures in combination with physician judgment.

This is exactly what was done and reported in the publication “Evaluation of a digitally-enabled care pathway for acute kidney injury management in hospital emergency admissions” by Connell et al from University College London and DeepMind along with others (https://www.nature.com/articles/s41746-019-0100-6). In that publication and a resulting commentary in the British Medical Journal (https://www.bmj.com/content/366/bmj.l5011.long), the high-level result was that the algorithmic diagnostic test for predicting AKI did no better (maybe a little worse) than the existing “system” in the hospital that did not use the algorithm.

Summary

DS holds the promise to help humans do a better job of identifying important health features of the human condition. Essentially, these approaches are DTs at their core, even if they are most often not recognized as such. The traditional DT design and interpretation paradigm can and should serve as a roadmap for algorithmic DT development and interpretation. It would behoove the DS world to embrace the existing paradigm for DT, report on its essential measures — PPV and NPV — and perhaps display the operating characteristics in the familiar and recognizable tabular format presented herein. Any proposed DS/algorithmic-based DT should be compared against existing methods using these important quantities to understand whether it provides any improvement and value. Also, knowing these quantities in conjunction with the costs of incorrect decisions and the benefits of correct decisions allows fine-tuning of the DT — be it bioanalytical or algorithmic — to achieve optimal healthcare results.

--


Steve Ruberg is the President of AnalytixThinking, LLC. He formerly spent 38 years in the pharma industry as a statistician and leader of advanced analytics.