Why Clinicians Don’t Trust Your ML/DL Algorithm…
As a physician data scientist and healthcare administrator, one of the frequent complaints I hear from other data scientists is that it is difficult to get clinicians to accept the validity of their “new prediction tool”. While I personally feel that the perception of the clinical community is shifting towards embracing big data and predictive analytics, I am also acutely aware that there is indeed an environment of mistrust between clinicians, administrators, and data analysts/scientists. What steps can we take to change these perceptions and shift towards an environment of collaboration? Here are my thoughts.
1) Clinicians and Data Scientists speak different languages.
Over 77% of clinicians have undergraduate degrees in either the biological sciences, premed, or another physical science (see US Bureau of Labor link below). Further, while most have participated in some type of bench research, they tend to focus on the biochemical aspects of the research more than on statistical and mathematical modeling. There are certainly exceptions, and finding an MD/PhD, clinical nurse informaticist, or clinician with an MPH in the field is not necessarily difficult. Such clinicians bridge a critical gap between the academic and operational medical communities. That said, while one or two classes of calculus are a requirement for medical school admission, statistics is not.
From premed to physician: Pursuing a medical career : Career Outlook: U.S. Bureau of Labor…
Data offer insight into becoming a medical doctor: what physicians and surgeons studied, where they work, what their…
Most clinicians receive their first introduction to statistics during the academic years of medical school. Gradually, we gather more detailed exposure throughout clinical rotations, residency, and pragmatic analysis of journal articles. I still remember sitting in journal rounds, where every resident either attempted to talk around the statistics section of the article or simply admitted that they had no idea whether the statistical methods chosen were valid.
One is left to wonder: do clinicians even care about statistics? YES! In fact, physicians literally make life and death choices based on these statistics. That said, clinicians are taught to rapidly amalgamate large amounts of information in order to derive practical applications to care for a patient. They simply don’t have time, while trying to learn their craft, to worry about the details behind regressions, random forests, support-vector machines, and deep neural networks. Most clinicians focus entirely on p-values, likelihood ratios (LRs), confidence intervals (CIs), sensitivity, specificity, NPV, and PPV.
Further, most data scientists come from a computer science or mathematics background and have little, or no, experience in clinical medicine. This is a critical point to keep in mind as we attempt to convince the clinical community that we are capable of developing data-driven answers to complex clinical problems.
2) Data Scientists Need To Think In Terms of “Gold Standards” and Clinical Decision Making Structures…
In the clinical world, a “Gold Standard” test is a test that defines a condition. As an example, the gold standard test for Influenza A is a viral culture. This culture has a specificity of 100% because, by definition, it defines the diagnosis of influenza A. There can be no false positives with a confirmatory gold standard test. Likewise, a gold standard screening test has a sensitivity of 100%, meaning that false negatives cannot exist. If the scientific community develops a test that supersedes the gold standard, then the definition of the diagnosis changes. Wikipedia has a good discussion on this topic if you want to dive in further.
Gold standard (test) - Wikipedia
When the gold standard is not a perfect one, its sensitivity and specificity must be calibrated against more accurate…
To complicate things a little, we often speak of gold standard tests when they are only the gold standard in terms of practicality. An example is the diagnosis of a stroke. The gold standard for clinical diagnosis of a hemorrhagic stroke is a CT scan obtained within the first three hours of suspected diagnosis. In this case the clinical gold standard does not match the pathological gold standard, which is a postmortem autopsy (most patients try to avoid this test!).
As data scientists, if we don’t acknowledge gold standard tests and speak with the clinical community in those terms when discussing new algorithms, then we appear as outsiders with little understanding of how the clinical community functions. In fact, every physical exam technique, laboratory test, radiographic test, and treatment is compared against a gold standard for specificity, sensitivity, NPV, PPV, LRs, or CIs. When we report an F2 score, AUC, accuracy, R², or MSE without speaking in terms of gold standards and other clinically useful alternatives, we ignore the context in which clinical decision making takes place.
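The good news is that the translation is mechanical: the metrics clinicians use daily and the metrics data scientists usually report all derive from the same confusion matrix. A minimal sketch (the validation counts below are made up for illustration):

```python
def clinical_metrics(tp, fp, tn, fn):
    """Translate confusion-matrix counts into clinically familiar metrics."""
    sensitivity = tp / (tp + fn)               # true positive rate (recall)
    specificity = tn / (tn + fp)               # true negative rate
    ppv = tp / (tp + fp)                       # positive predictive value (precision)
    npv = tn / (tn + fn)                       # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "accuracy": accuracy,
            "LR+": lr_pos, "LR-": lr_neg}

# Hypothetical counts from validating a new screening algorithm
m = clinical_metrics(tp=90, fp=20, tn=80, fn=10)
print(f"sensitivity={m['sensitivity']:.2f}, specificity={m['specificity']:.2f}, "
      f"PPV={m['ppv']:.2f}, NPV={m['npv']:.2f}, LR+={m['LR+']:.1f}")
```

Reporting both columns of this translation alongside the usual AUC or F-score costs nothing and meets clinicians in their own language.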
3) There Is Always A Give and Take Between Sensitivity and Specificity…
Finally, as a clinician, I am always deciding whether to screen for or to confirm suspected ailments. A typical encounter proceeds as follows:
- Patient presents with symptoms
- Generate a differential diagnosis list
- Determine how to eliminate potential diagnoses on that list (screening tests — sensitivity)
- Verify suspected diagnosis (confirmatory testing — specificity)
- Treat the patient and confirm expected response
Most of the time elimination of a differential diagnosis occurs based on a simple history and physical exam.
As data scientists, we need to recognize and speak to the potential impact that an ML algorithm could have within the context of this workflow; specifically, how it could either increase or decrease the pretest probability of a differential diagnosis and thus change the positive and negative predictive value of the clinician’s history, exam, and medical testing.
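This pretest-to-posttest shift can be made explicit with likelihood ratios: convert the pretest probability to odds, multiply by the appropriate LR, and convert back. A sketch with made-up numbers (a 10% pretest suspicion and a test with 90% sensitivity and 80% specificity):

```python
def posttest_probability(pretest_prob, sensitivity, specificity, positive=True):
    """Update a pretest probability after a test result, via likelihood ratios."""
    lr = (sensitivity / (1 - specificity) if positive
          else (1 - sensitivity) / specificity)
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# A positive result raises a 10% suspicion to about 33%...
p_pos = posttest_probability(0.10, sensitivity=0.90, specificity=0.80, positive=True)
# ...while a negative result drops it to about 1.4%
p_neg = posttest_probability(0.10, sensitivity=0.90, specificity=0.80, positive=False)
print(f"after positive: {p_pos:.3f}, after negative: {p_neg:.3f}")
```

Framing a model’s output this way — “a positive flag moves this patient from 10% to 33%” — is far more actionable at the bedside than an AUC.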
There are many examples within the clinical community where this has been done well and has thus gained acceptance within the clinical workflow. One such example is found in the Centor Strep Pharyngitis Study.
Centor Score (Modified/McIsaac) for Strep Pharyngitis - MDCalc
Steroids and NSAIDS improve symptoms; antibiotics are often indicated in streptococcal pharyngitis, but do not prevent…
This risk scoring algorithm is an excellent example of both the benefits and potential pitfalls of data-driven algorithms. While not a complex ML algorithm (the study used simple logistic regression for analysis), the study provides an excellent example where simple history and physical exam data could quickly be combined to rule out a diagnosis that had previously required a lab test. The study also reminds us of the limitations of such studies. Initially, the Centor Score did not include age as a risk factor, but additional studies found that the sensitivity of the criteria was significantly lower in children. This was especially concerning given that rheumatic fever, the primary sequela of untreated strep pharyngitis and precursor of rheumatic heart disease, primarily occurs in younger children. Eventually, age was added to the criteria to restore the sensitivity of the exclusionary rule to 98%.
To further the point about the importance of clearly communicating in the shared language of the clinical community, the Centor criteria communicate their relevance through the sensitivity and specificity of the score based on the number of risk factors identified. With no risk factors identified, the sensitivity of the test is over 98%. With all 5 risk factors present, the specificity is only 52%. Thus, in the setting of a complete lack of risk factors, a clinician can safely assume the absence of strep pharyngitis; in the setting of all 5 risk factors, the clinician can only say that there is roughly a 50% chance the patient has strep. This fact alone led the clinical community to stop empirically treating strep pharyngitis without laboratory confirmation.
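For readers unfamiliar with the score, the scoring logic itself fits in a few lines. The point values below follow the commonly cited Modified Centor (McIsaac) rules — one point per clinical criterion, plus an age adjustment — but management thresholds vary by guideline, so treat this as an illustration rather than clinical guidance:

```python
def mcisaac_score(age, fever_over_38, no_cough, tender_anterior_nodes, tonsillar_exudate):
    """Modified Centor (McIsaac) score: one point per criterion, adjusted for age."""
    score = sum([fever_over_38, no_cough, tender_anterior_nodes, tonsillar_exudate])
    if age < 15:
        score += 1   # younger patients: higher risk (the later age correction)
    elif age >= 45:
        score -= 1   # older patients: lower risk
    return score

# A 10-year-old with all four clinical criteria scores the maximum of 5
print(mcisaac_score(10, True, True, True, True))       # -> 5
# A 50-year-old with none of the criteria scores -1
print(mcisaac_score(50, False, False, False, False))   # -> -1
```

Note how the age correction described above is just one added term — yet omitting it originally cost the rule its sensitivity in exactly the population most at risk.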
This has been a very short article on a topic that I have become highly passionate about. It has been over two decades since “To Err Is Human” was published, yet medical errors still plague the healthcare system. I believe the reason behind this lack of progress lies within the title of the original report. When we rely on human processes to solve problems derived primarily from human oversight, it is little surprise that progress has stagnated. Harnessing the power of big data holds the promise of advancing patient safety within healthcare and literally saving lives. There are many obstacles, both technical and administrative, that need to be overcome to get there. More than either of these, however, is the need for clinical practitioners and data scientists to become collaborators working in a shared language to improve clinical workflows, advance healthcare quality, and improve the safety of care delivered.