Healthcare is Not Ready for Artificial Intelligence. Yet.

Sidharth Ramesh · Published in The Startup · May 8, 2020 · 8 min read

Motivation

Healthcare is a peculiar field. It has made great advances in treating patients: new therapies, new surgeries, new drugs. But somehow, the breakthrough technologies of our decade, like machine learning and cloud computing, have yet to penetrate the field.

A lot of progress has already happened in the field. Qure.AI can detect tuberculosis on lung X-rays with superhuman accuracy. Google can detect diabetic retinopathy with superhuman accuracy. While this sounds like AI is about to replace radiologists and ophthalmologists, the reality is far from it. Most of these are laborious tasks that would require verification by a doctor anyway. But these algorithms surely act as a safety net.

Although I would be happy to see widespread adoption of such technology, that is just not the case today.

I have no doubt that the future of medicine will be algorithm-driven, but several impediments stand in its way today.

The Data Uniformity Problem

To train any algorithm, especially a complicated one that is actually useful in practice, we need a lot of data. And not just any sort of data; we need carefully assembled, structured data. The data generated by healthcare today is on the opposite end of the spectrum. Just take the simple act of prescribing a drug at a particular dose. You’d think that’s structured enough, right? Not at all! To begin with, hospitals that don’t have an e-prescribing system have to rely on doctors writing prescriptions by hand. There are countless jokes about how bad doctors’ handwriting is. Even for a top-notch handwriting-recognition AI, a doctor’s handwriting is a bit too much.

When it comes to electronic health records, that problem is in fact multiplied. Let’s say we want to train an algorithm to detect findings on X-rays, but with preexisting data recorded by doctors. The same concept, for example “lung consolidation”, is represented by different doctors in different ways. Some may record it as an auscultatory finding (“On examination — lung consolidation”), while others may directly make a diagnosis like pneumonia. And there are multiple abbreviations and synonyms used when actually typing this out. Any intern who has tried to read a patient file and struggled to understand what on earth is written will relate.

It’s not all grim, though. Smart people have already thought of some solutions.

SNOMED

SNOMED CT stands for Systematized Nomenclature of Medicine, Clinical Terms, and it’s a pretty good attempt at solving this problem. Everything is represented as a concept, and concepts have relationships with other concepts. For example, the concept “Lung consolidation” looks like this:
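(The sketch below is a rough rendering of the concept’s defining relationships; the attribute names are real SNOMED CT relationship types, but the exact term strings are illustrative rather than authoritative.)

```python
# A rough sketch of the SNOMED CT concept "Consolidation of lung" and its
# defining relationships. The attribute names ("is a", "associated
# morphology", "finding site") are real SNOMED CT relationship types;
# the term strings here are illustrative, not copied from the terminology.
lung_consolidation = {
    "preferred_term": "Consolidation of lung (disorder)",
    "is_a": ["Disorder of lung (disorder)"],  # parent concept
    "associated_morphology": "Consolidation (morphologic abnormality)",
    "finding_site": "Lung structure (body structure)",
}
```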

The relationships tell us that it has the associated morphology of consolidation at the finding site lung. The parent concept also tells us that it’s a disorder of the lung. While this may seem redundant to us, for a computer this information is gold.


SNOMED CT includes almost all things medical: drugs, substances, diseases, organisms, and more.

LOINC

LOINC stands for Logical Observation Identifiers Names and Codes. Anything that can be measured or assessed can be represented as a LOINC concept. Take blood pressure, for example. There’s a LOINC concept for that:
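(The codes below are the ones commonly cited for blood pressure; treat them as illustrative and verify them against loinc.org before relying on them.)

```python
# Commonly cited LOINC codes for blood pressure (verify against loinc.org).
blood_pressure_loinc = {
    "85354-9": "Blood pressure panel with all children optional",
    "8480-6": "Systolic blood pressure",
    "8462-4": "Diastolic blood pressure",
}
```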

And it pretty much has all the common investigations covered. Now, the answer to a “LOINC question” can either be a number, like 120/80 mmHg, or another SNOMED concept, for describing things like the organism grown in a blood culture.

FHIR

Both SNOMED CT and LOINC are good for standardising concepts, but we don’t want someone recording all of this in an Excel sheet. FHIR, which stands for Fast Healthcare Interoperability Resources, solves this problem. It brings a set of standard representations that can be understood by everyone. For example, the Condition FHIR resource looks like this in JSON:
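(What follows is a minimal, hand-written sketch rather than a complete, validated resource; the SNOMED code value is a placeholder, not the real concept ID.)

```json
{
  "resourceType": "Condition",
  "clinicalStatus": {
    "coding": [{
      "system": "http://terminology.hl7.org/CodeSystem/condition-clinical",
      "code": "active"
    }]
  },
  "code": {
    "coding": [{
      "system": "http://snomed.info/sct",
      "code": "0000000",
      "display": "Consolidation of lung"
    }]
  },
  "subject": {
    "reference": "Patient/example"
  }
}
```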

It’s very much compatible with the existing web API infrastructure and is a good way to store and transmit health information.
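As a rough illustration of that compatibility, here is a sketch that fetches Condition resources from a FHIR server over plain HTTP; the base URL points at the public HAPI FHIR test server, an assumption made for the example, and any FHIR R4 endpoint would do.

```python
import requests

# Fetch Condition resources from a FHIR server over its standard REST API.
# The base URL is the public HAPI FHIR test server, used here only as an
# example endpoint; substitute your own server.
BASE_URL = "http://hapi.fhir.org/baseR4"

response = requests.get(
    f"{BASE_URL}/Condition",
    params={"_count": 5},  # limit the number of results per page
    headers={"Accept": "application/fhir+json"},
)
response.raise_for_status()

bundle = response.json()  # search results come back as a FHIR Bundle
for entry in bundle.get("entry", []):
    code = entry["resource"].get("code", {})
    display = code.get("text") or code.get("coding", [{}])[0].get("display")
    print(display)
```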

The Private Training Problem

Now we have structured data, but medical data is sensitive by nature. It’s obvious that you wouldn’t want another person knowing that you have a certain disease, especially when it’s associated with a lot of social stigma. In India, even completely treatable diseases like leprosy and tuberculosis are looked at with a disproportionate amount of fear and discrimination. Knowing the stigma attached, doctors even rename diseases in front of the patient to make them sound better: HIV is called “retro-positive”, leprosy is renamed “Hansen’s disease”, and cancer is only referred to as “malignancy”.

The recent discoveries we have made in the field of molecular biology have given society the ability to rip us apart at the genetic level and subject us to systematic discrimination for conditions we don’t even have. It’s not hard to imagine a dystopian future where your employer does not hire you because you have genes that push you to sleep late and wake up late every day, where police systematically monitor and harass people “at a high risk of committing crimes”, and where health insurance is extremely expensive for the people who need it the most.

Coming back to the present, releasing your medical data to the public can have more immediate consequences and even put you in danger. Take, for example, the market for illegal organ trafficking that exists in many developing countries. Leaking your HLA type or blood group to such groups could put you at risk of being kidnapped or even killed.

Nagamma, a victim of organ traffickers, with two of her three sons in front of her shack in Chennai.

Anonymization can be undone

Let’s say we have researchers who are interested in studying a sudden outbreak of disease X. Hospitals are getting flooded with patients suffering from disease X, but doctors are nowhere close to figuring out why or how this disease occurs. They are bound by law to keep patient information confidential and cannot just release it to the researchers. However, the hospitals are ready to release the data with only the fields of interest to the researchers, with the personally identifying fields removed.

For a researcher who has no idea about the disease, or for a consultant with “machine learning” in their title, all fields are “fields of interest”. However, they decide that they only need the age, gender, symptoms, duration of symptoms, and the place each patient comes from. Very conservative fields of interest for an infectious disease.

At first glance, this seems like a decent approach. But notice that many of these fields are commonly released in other datasets. So, with the help of another public dataset, for example the voters’ list (EPIC numbers can easily be brute-forced to leak the details of every voter in India, thanks to apps built to help voters), the name, age, and address of every patient can be recovered.
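To make the linkage concrete, here is a toy sketch with fabricated records; the attack amounts to a join on the quasi-identifiers the two datasets share.

```python
# Toy reidentification: join an "anonymized" medical dataset against a
# public voter roll on shared quasi-identifiers. All records are fabricated.
medical = [
    {"age": 34, "gender": "F", "place": "Chennai", "diagnosis": "Disease X"},
]
voters = [
    {"name": "A. Kumar", "age": 34, "gender": "M", "place": "Chennai"},
    {"name": "R. Lakshmi", "age": 34, "gender": "F", "place": "Chennai"},
]

QUASI_IDS = ("age", "gender", "place")

for record in medical:
    key = tuple(record[f] for f in QUASI_IDS)
    matches = [v["name"] for v in voters
               if tuple(v[f] for f in QUASI_IDS) == key]
    if len(matches) == 1:  # a unique match means the patient is reidentified
        print(f"{matches[0]} likely has {record['diagnosis']}")
```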

This attack is called a reidentification attack. And it does not exist only in theory; it has been carried out in practice multiple times. Netflix released anonymized data on thousands of people, with even the names of the movies replaced with tags, and researchers quickly combined it with the IMDb database to reidentify many of them. It happened again when the health records of Governor William Weld were reidentified, and again when the genetic data of the anonymous participants of the Personal Genome Project was reidentified. There are even machine learning algorithms that reidentify people automatically by pulling data from the internet, with up to 99.98% accuracy.

Possible Solutions

There are many methods that try to prevent reidentification attacks. k-anonymity is a technique that ensures at least k people in the dataset share the same values for any combination of the identifying fields. This makes reidentification attacks harder to perform, but not impossible. Others, like l-diversity and t-closeness, are slight variations and refinements of the same basic idea: change the dataset so that the data remains anonymous. Another technique is to perturb the data, so that the data points are nudged a little before release. This preserves the macro-statistics of the dataset while hindering reidentification attacks. It is a step in the right direction; however, to really protect the data, the amount of noise that has to be added severely limits the utility of the dataset. Moreover, some fields, like free-text strings, cannot be generalized easily.
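A minimal sketch of what the k-anonymity check looks like (the records, field names, and choice of k are all illustrative):

```python
from collections import Counter

# A dataset is k-anonymous with respect to a set of quasi-identifiers if
# every combination of their values appears at least k times.
def is_k_anonymous(records, quasi_ids, k):
    counts = Counter(tuple(r[f] for f in quasi_ids) for r in records)
    return all(count >= k for count in counts.values())

records = [
    {"age": 34, "gender": "F", "place": "Chennai", "diagnosis": "Disease X"},
    {"age": 34, "gender": "F", "place": "Chennai", "diagnosis": "Flu"},
    {"age": 51, "gender": "M", "place": "Mumbai", "diagnosis": "Disease X"},
]

# The third record is unique on (age, gender, place), so this prints False.
print(is_k_anonymous(records, ("age", "gender", "place"), k=2))
```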

But all of these techniques fail when the attacker has some background knowledge of people in the dataset, which is not a hard feat in the information-rich world we live in.

Differential Privacy

There is only one technique that guarantees privacy, no matter what. It is mathematically rigorous and maintains anonymity even when the attacker is given all the background knowledge in the world, and it’s called Differential Privacy.

Its modern version was introduced in 2006 by Cynthia Dwork et al.

Screenshot from The Algorithmic Foundations of Differential Privacy by Cynthia Dwork, Aaron Roth
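In standard notation, the definition says that a randomized mechanism M is ε-differentially private if, for every pair of datasets differing in a single record and every set S of possible outputs:

```latex
% Epsilon-differential privacy (Dwork et al.): for all datasets D_1, D_2
% differing in a single record, and all sets S of outputs,
\Pr[\mathcal{M}(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D_2) \in S]
```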

I’ll leave it to you to research differential privacy further if you want, because a full treatment is out of the scope of this article.
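That said, the core trick is easy to taste. The Laplace mechanism answers a query with noise scaled to the query’s sensitivity; a minimal sketch, with illustrative numbers, is below.

```python
import random

# Laplace mechanism: release a query answer plus noise with scale
# sensitivity / epsilon. For a counting query ("how many patients have
# disease X?"), adding or removing one patient changes the answer by at
# most 1, so the sensitivity is 1.
def laplace_mechanism(true_answer, sensitivity, epsilon):
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential samples with the same rate
    # is Laplace-distributed.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_answer + noise

# A private count with epsilon = 0.5 (smaller epsilon means more noise).
print(laplace_mechanism(true_answer=42, sensitivity=1, epsilon=0.5))
```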

Laws and Regulation

While there are laws like the GDPR preventing companies from misusing data, capitalism always finds loopholes. For example, Google offers some of its services for cheaper if you allow it to use your data.

And there is no guarantee that they will store this data safely and use it anonymously. So this is still a tough problem to solve.


Sidharth Ramesh

Interested in data-driven healthcare. Founder and consultant at Medblocks.