Developing a Measured Approach for the Use of AI in Healthcare

Ian Moura
Human-Machine Collaboration
Sep 30, 2019

The potential applications of artificial intelligence — and in particular, machine learning — within healthcare have generated significant interest in recent years. However, although technological developments have rendered it comparatively easy to create and implement neural networks and other AI tools for medical use, healthcare remains a complex arena, where many of the most pressing problems are multifaceted. Additionally, the types of high-stakes decisions which are common in medicine can mean that poorly implemented AI comes at a real cost to human well-being.

The broad adoption of electronic health records (EHRs), along with the development of wearable devices like the Apple Watch and the rise of direct-to-consumer genetic testing, has led to an unprecedented amount of patient data, largely available in digital format. However, what remains to be devised is a well-developed plan for using that data to improve and augment existing healthcare delivery and research. Furthermore, having more data is not the same as having better data, and issues with data quality continue to impact both research and clinical medicine. The amount of data doctors are expected to interact with, synthesize, and respond to is itself a complicating factor in the delivery of medical care.

As Atul Gawande wrote in the New Yorker in 2018, “The story of modern medicine is the story of our human struggle with complexity.” While the causative agents for some diseases are clear-cut and well-established, outcomes in medicine frequently result from more than one potential cause, or from the interaction of a number of different, potentially interrelated factors. Much of medicine still relies heavily on correlation — the association of a particular behavior with increased risk of a particular outcome, for example — rather than on absolute causal proof. As a result, when implementing new technology in medicine, especially for diagnostics, there is a tendency to gravitate towards techniques that also rely on correlation.

Black box models are a poor match for medicine.

In the past few years, the potential uses of machine learning within medicine, especially for diagnostic applications, have received particular attention. Recent technological advancements have made it relatively easy to train and deploy neural networks, even without an understanding of how they work. However, their ease of implementation doesn’t mean that neural networks are the best approach, or the appropriate tool for answering questions in medicine. Researchers and ethicists have questioned the use of such “black-box” models in high-stakes decisions. Notably, Dr. Cynthia Rudin has pointed out that even building an accompanying explainable model alongside a black box approach is not an adequate replacement for the use of models that are inherently interpretable.

The risks that uninterpretable models pose were demonstrated recently by the performance of CheXNet — a deep neural network constructed by a team of researchers at Stanford. The model was trained to detect pneumonia on a set of X-ray images released by the NIH in 2017, and it seemed to work. However, upon examination, it became clear that the model was considering information from the edges of images, not just from the medically relevant areas that showed patients’ lungs. The algorithm was more likely to classify an image as showing pneumonia when it detected that the image came from a portable X-ray machine — a result that had to do with trends in the data used to train the model. Because the algorithm was simply looking for patterns, and pneumonia patients in the NIH dataset were more likely to be in the hospital (and thus more likely to get an X-ray in their room, from a portable machine), it incorrectly associated information that revealed the source of the X-ray with the likelihood that the lungs in an X-ray showed pneumonia.
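To make this failure mode concrete, here is a minimal sketch of the kind of audit that can surface such a confound: checking whether a model’s predictions track image metadata (such as whether the X-ray came from a portable machine) rather than the pathology itself. The file and column names are hypothetical, and this is not the procedure the Stanford team used; it is simply one way to probe a trained classifier.

```python
# Hypothetical audit: does a classifier's output track a suspected confounder
# (e.g., "image came from a portable X-ray machine") instead of the pathology?
# File name and column names below are illustrative, not from CheXNet.
import pandas as pd
from scipy.stats import chi2_contingency

# One row per image: model prediction, confirmed label, and image metadata.
df = pd.read_csv("predictions_with_metadata.csv")  # hypothetical file

# Cross-tabulate predictions against the suspected confounder.
table = pd.crosstab(df["predicted_pneumonia"], df["portable_machine"])
chi2, p_value, _, _ = chi2_contingency(table)
print(table)
print(f"Association between prediction and X-ray source: p = {p_value:.4g}")

# Compare accuracy within each stratum; a large gap suggests the model is
# leaning on the image source rather than on what the lungs actually show.
for source, group in df.groupby("portable_machine"):
    acc = (group["predicted_pneumonia"] == group["true_pneumonia"]).mean()
    print(f"portable_machine={source}: accuracy={acc:.3f} (n={len(group)})")
```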

Currently, models like CheXNet are ubiquitous in research literature, but are not generally being deployed in clinical practice. Issues resulting from uninterpretable models are one reason, but there are other hindrances as well. In addition to problems of transferability, where models that work well in theory don’t necessarily operate well in reality, medicine and healthcare are complex fields. One challenge to implementing artificial intelligence (and especially machine learning) approaches is the rarity of well-defined questions. Medicine is often complicated, full of variability and nuance that can make it difficult to determine the best metric to use.

More data isn’t necessarily better data.

Even in cases where questions are very well-defined, however, issues with data quality significantly impede the practicality of relying on computational models. In medical research, problems with data constitute a serious barrier to meaningful use of machine learning techniques. For example, much medical research relies on findings from experiments performed on cultured cells. However, up to 10 or even 20% of in vitro cell lines are estimated to be contaminated — frequently by HeLa cells, which originated as a sample of cervical cancer cells taken without consent from Henrietta Lacks in the early 1950s.

HeLa cells were the first documented immortal cell line, and in the decades since they began to be used in medical research, they have so adapted to laboratory settings that they easily overgrow cultures when accidentally introduced. This means that, for example, experiments intended for pancreatic or liver cells may instead be performed on cervical cancer cells. Though overgrowth of HeLa cells, and the frequency of cell line contamination, have been documented for decades, laboratories continue to publish results based on mislabeled cells.

The issues with data in clinical practice are even more complex. Although widespread adoption of EHRs means there is now a tremendous amount of patient data in digital form, much of that data is of poor quality. While EHRs were touted as the end of errors in medical records, they have not eliminated error so much as they have transformed it. With over 90% of hospitals and providers now using EHRs, patients no longer need to worry, for the most part, about receiving the wrong dosage or form of medication due to a prescriber’s illegible handwriting; but an incorrect selection from a lengthy drop-down menu may lead to the same outcome.

Furthermore, because EHRs were in many cases designed to prioritize the needs and desires of insurers and hospitals rather than those of individual doctors, they store some data, such as the ICD codes used for billing, more easily than other data, such as complex notes from a clinic visit. Because of various practices, including upcoding (selecting a more serious diagnostic code than is warranted in order to increase revenue), ICD codes are not necessarily well-matched to a patient’s symptoms.

EHRs are also not generally interoperable; there are hundreds of vendors who create and distribute such products to hospitals and individual medical providers, and few of the systems are designed in ways that allow them to communicate with one another. Additional customizations for specific facilities compound the problem, and can mean that switching from one provider to another requires a patient’s records to be printed and faxed. As a result, patient data is often incomplete, fragmented, and siloed.

Systemic problems with healthcare create additional obstacles

In the United States, in particular, systemic aspects of healthcare delivery exacerbate issues of data quality. For example, until the passage of the Affordable Care Act, many people were unable to obtain health insurance, and as a result, could not access health care. Since the grounds for excluding people from purchasing a health insurance plan included pre-existing medical conditions, certain groups of people — particularly those with ongoing or previous health problems, and those who did not have a job that provided healthcare benefits — were much less likely to be included in healthcare datasets. Because the United States still does not have universal healthcare, this remains an ongoing issue in spite of the decrease in the uninsured rate since the passage of the ACA, especially in states that did not act to expand Medicaid. These kinds of sampling issues can lead to serious problems in model performance and accuracy, most notably when models are applied to populations that differ from the population represented in the training data.

An additional healthcare issue, particularly in the United States, is physician burnout. Although EHRs were supposed to make doctors’ work easier, they have ultimately created new types of work. Doctors now typically spend more time filling out EHRs — often after hours, without pay — than they do seeing patients. Although burnout is a significant problem across fields of medicine, it is especially common among doctors who spend more of their day interacting with a computer.

Beyond questions of methodology and data, unique considerations are necessary to implement AI in healthcare

Assuming, for a moment, that all of these problems were solved, there are still particular issues that must be considered when implementing AI within healthcare. As alluded to previously in discussing CheXNet, it is imperative that models work as intended. Specifically, care must be taken to ensure that training transfers appropriately, so that what works in theory applies to reality.

Additionally, care must be given to defining benchmarks. For example, if a human doctor can correctly differentiate between a normal and an abnormal imaging result half the time, does an algorithm merely need to outperform the human? If this means performing only slightly better than chance, is that enough to justify the time, expense, and potential consequences of failure that come with the use of AI?
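One way to frame the question is to ask whether a model’s measured accuracy is even distinguishable from the human baseline, and if so, by how much. The numbers in the sketch below are invented purely for illustration; the 0.5 baseline comes from the hypothetical doctor above.

```python
# A minimal sketch of the benchmark question: does a model's accuracy
# meaningfully exceed an assumed human baseline, or is it only trivially
# better than chance? All numbers are made up for illustration.
import math

n_cases = 400          # hypothetical evaluation set size
n_correct = 216        # hypothetical count of correct model classifications
human_baseline = 0.50  # the assumed human accuracy from the example above

acc = n_correct / n_cases
# Normal-approximation 95% confidence interval for the model's accuracy.
se = math.sqrt(acc * (1 - acc) / n_cases)
low, high = acc - 1.96 * se, acc + 1.96 * se

print(f"Model accuracy: {acc:.3f} (95% CI {low:.3f}-{high:.3f})")
if low > human_baseline:
    print("Statistically better than the baseline -- but are a few points of")
    print("accuracy worth the cost and the consequences of failure?")
else:
    print("Indistinguishable from the human baseline on this evaluation set.")
```

With these made-up numbers, a model that is “better than chance” on paper cannot even be distinguished from the baseline, which is exactly the kind of result that should prompt scrutiny before deployment.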

More specifically, the trade-offs that exist in AI (and especially machine learning) can have profound effects when applied to healthcare. When an algorithm makes an incorrect prediction about someone’s likely enjoyment of a movie, or classifies someone as likely to purchase additional clothing from a vendor when in reality they will not, the consequences are limited. However, the impact of an incorrect prediction or classification within healthcare can be profound — literally a matter of life and death, even.

Nonetheless, false negatives and false positives cannot, in general, be minimized at the same time. In a medical context, this means that a test can be tuned to be very sensitive (correctly flagging nearly everyone who has a specific disease, for example) or very specific (correctly ruling out nearly everyone who does not), but improving one typically comes at the expense of the other. Both kinds of error carry consequences. If a test (or a model) classifies someone as having a disease when they don’t, they may receive treatment (with attendant risks and side effects) that they don’t need. However, if a model fails to catch all the cases in which someone does have the disease in question, patients may not receive treatment that they do need. Unfortunately, there is no universal answer as to whether it is better to prioritize sensitivity or specificity; which one matters more depends on the purpose of the test, the effects of the treatment, the risks of forgoing it, and so on.
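The trade-off is easy to see with a toy example. The risk scores and labels below are invented, but they show how moving a single decision threshold raises sensitivity at the cost of specificity, and vice versa.

```python
# Toy illustration of the sensitivity/specificity trade-off: the same
# hypothetical risk scores evaluated at three decision thresholds.
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold counts as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model outputs (probability of disease) and true labels.
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0,    0,    0,    1,    0,    1,    1,    0,    1,    1]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
# Lowering the threshold catches more true cases (higher sensitivity) but
# flags more healthy people (lower specificity); raising it does the reverse.
```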

Finally, a lack of contextual knowledge, especially of how medicine is actually practiced, remains a significant barrier to developing AI applications for clinical use. While much of medicine can be represented in terms of test results, long-term survival rates, and other quantitative data, the value of other parts of clinical practice is harder to quantify. The role of human interaction — of a doctor’s rapport with patients, for example — is hard to measure and to account for, yet research has demonstrated that it can impact patient outcomes. The discounting of interpersonal connection within medicine is, perhaps, symptomatic of a bigger problem in the tech industry — a tendency to ignore or devalue things that are hard to quantify.

In light of the issues, what areas of healthcare present good opportunities for using AI?

All these difficulties aside, there are at least two use cases within medicine where AI appears to be a promising tool. The first is in “data heavy” scenarios — situations where there is a large amount of information that can be used to generate predictions. This is especially promising within domains that humans tend to struggle with, such as accurately sorting normal versus abnormal imaging results. Part of what is promising about this application of AI is that it has the potential to free up doctors to play more of a role in interpreting and translating results so that patients can better understand them. In his recent book, Deep Medicine, Dr. Eric Topol considers how such developments might allow for shifts in the responsibilities currently associated with certain subspecialties of medicine. For example, much of the work that radiologists currently do involves looking at imaging; if that burden is shifted to an algorithm in the future, then perhaps the role of radiologists will change to something more patient-focused.

A second use case where AI has shown potential for improving healthcare is in “functional” scenarios — areas where automation or AI can reduce the time doctors and other medical staff spend on repetitive, non-critical tasks such as charting. There have been efforts to reduce the amount of time doctors must spend inputting medical information into EHRs by assigning them scribes who can type notes during appointments. However, logistical and privacy issues mean that this is not the most feasible (nor the most cost effective) solution. The development of effective natural language processing, which could easily and accurately transcribe notes from a doctor’s conversation with a patient during an appointment, would not only allow for better data within medical records, but could also allow doctors the chance to develop deeper, more human relationships with their patients, and could reduce physician burnout.

One example of the use of AI to serve functional needs is the creation of Moxi, a medical delivery robot. Moxi was created to support nursing staff, and the development of the robot included extensive interviewing and shadowing of nurses on their clinical rounds. Moxi’s creators also considered how to make the robot approachable and non-threatening for patients.

As in many areas where AI is being tested and developed, there are real problems in healthcare that may potentially be solved using AI. However, medicine is unique in the high-stakes nature of the decisions it involves. Consequently, the use of AI within healthcare requires thoughtful consideration of the potential impact models may have on patients, physicians, and other healthcare workers. Furthermore, the complex nature of healthcare systems, particularly in the U.S., means that it is vital that those who seek to develop models for use in medical settings engage with people who have first-hand experience of healthcare, whether as a patient, provider, or researcher.

Finally, as in other fields, perhaps the most promising aspect of AI’s application to healthcare is its potential to free up human time and energy for use on tasks humans excel at, such as building relationships with patients, responding to novel situations, and contextualizing information. By using technology to support doctors, nurses, and other medical staff, and by engaging them in discussions leading to its creation, there is potential to make healthcare work more effectively — and more humanely — for more people.

About the Human-Machine Collaboration Publication and the Berkeley AI Meetup

Preparing and equipping humans to work and live with machines is far easier when creating those machines involves thoughtful consideration of human abilities and human needs. Given our interest in these issues, Bob Stark and Ian Moura decided to create a discussion group for the purpose of research and problem-solving through the Berkeley AI meetup group. This Medium publication summarizes the background information that we cover in our meetings.

References and Further Reading

Ferguson, J.W. (2019). Medical Delivery Robot Moxi Being Tested at Texas Hospital. The Washington Times. https://www.washingtontimes.com/news/2019/jan/25/medical-delivery-robot-moxi-being-tested-at-texas-/

Gawande, A. (2018, November 12). Why Doctors Hate Their Computers. The New Yorker. https://www.newyorker.com/magazine/2018/11/12/why-doctors-hate-their-computers

Gold, M. (1985). A Conspiracy of Cells. New York: SUNY Press.

Harris, R. (2019). How Can Doctors Be Sure A Self-Taught Computer Is Making The Right Diagnosis? NPR. https://www.npr.org/sections/health-shots/2019/04/01/708085617/how-can-doctors-be-sure-a-self-taught-computer-is-making-the-right-diagnosis

Panch, T., Mattie, H., & Celi, L.A. (2019). The “inconvenient truth” about AI in healthcare. https://www.nature.com/articles/s41746-019-0155-4.pdf

Rashidi, H.H., Tran, N.K., Betts, E.V., Howell, L.P., & Green, R. (2019). Artificial Intelligence and Machine Learning in Pathology: The Present Landscape of Supervised Methods. https://journals.sagepub.com/doi/pdf/10.1177/2374289519873088

Rudin, C. (2018). Please Stop Explaining Black Box Models for High-Stakes Decisions. https://arxiv.org/pdf/1811.10154.pdf

Schulte, F., & Fry, E. (2019). Death by a Thousand Clicks: Where Electronic Health Records Went Wrong. https://khn.org/news/death-by-a-thousand-clicks/

Topol, E.J. (2019). High-performance medicine: the convergence of human and artificial intelligence. https://www.gwern.net/docs/ai/2019-topol.pdf

Topol, E.J. (2019). Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. New York: Basic Books.

Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V.X., Doshi-Velez, F., Jung, K., Heller, K., Kale, D., Saeed, M., Ossorio, P.N., Thadaney-Israni, S., & Goldenberg, A. (2019). Do no harm: a roadmap for responsible machine learning for health care. https://www.nature.com/articles/s41591-019-0548-6.pdf
