Artificial Intelligence in the Diagnosis of Rare Diseases
by Anni Coden
Unless you have a personal connection to them, you may never have heard of alpha-1 antitrypsin deficiency, eosinophilic esophagitis, or Barth syndrome. These are just three of an estimated 7,000 rare diseases that together affect about 1 in 10 people in the U.S. Fifty percent of these rare diseases affect children, 30% of whom will die from them before the age of 5.
Those researching more prevalent diseases such as breast cancer and multiple sclerosis have a wealth of data to mine for possible treatments. Precisely because rare diseases are rare, those who study them are hard-pressed to find comparable data. That’s where artificial intelligence can help.
Early Diagnosis of Rare Diseases is Critical
Making researchers’ tasks even harder is the fact that the symptoms of rare diseases are often similar to those of more common diseases. In general, patients are treated first for the common illness. The symptoms of idiopathic pulmonary fibrosis, for example, include dry cough, shortness of breath and fatigue, all of which also occur in many common conditions. Often, a proper diagnosis is made only once the disease has progressed. Nausea, bloating and fatigue can be indicators of Crohn’s disease, but they can also be the early signs of the much rarer metastatic gastroenteropancreatic neuroendocrine tumor. With the right data, doctors may be able to make a diagnosis earlier and intervene to change the patient’s outcome.
Artificial intelligence methods applied to big data can be used to overcome some of the challenges in diagnosing rare diseases. Given enough data, it may be possible to determine whether an undiagnosed patient’s symptoms are more similar to those of a common illness or to those of a rare disease.
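One simple way to picture such a similarity comparison is sketched below. It is only an illustration, not a clinical method: the symptom sets are made up, the labels feature_a, feature_b and feature_c are hypothetical placeholders, and Jaccard similarity is just one of many possible similarity measures.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(patient, cohort):
    """Average similarity between one patient and every patient in a cohort."""
    return sum(jaccard(patient, other) for other in cohort) / len(cohort)


# Hypothetical, made-up symptom sets for already diagnosed patients.
common_cohort = [
    {"nausea", "bloating", "fatigue"},
    {"nausea", "fatigue"},
    {"bloating", "fatigue", "feature_a"},
]
rare_cohort = [
    {"nausea", "bloating", "feature_b", "feature_c"},
    {"fatigue", "feature_b", "feature_c"},
]

# Hypothetical undiagnosed patient.
undiagnosed = {"nausea", "fatigue", "feature_b", "feature_c"}

scores = {
    "common": mean_similarity(undiagnosed, common_cohort),
    "rare": mean_similarity(undiagnosed, rare_cohort),
}
closest = max(scores, key=scores.get)
print(scores, "-> closest cohort:", closest)
```

With only a handful of rare-disease patients in the cohort, a score like this would be very unstable, which is why the size of the data pool matters so much.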
To acquire sufficient data on rare diseases, a large pool of diagnosed patients has to be assembled. Given current health-privacy protection laws, however, this data can be very hard to amass — unless artificial intelligence methods can be used to overcome these limitations without breaching privacy.
Improving Patient Clustering
Let us assume for a moment that we have access to a very large pool of patients, their illnesses and their diagnoses. For the sake of simplicity, let’s further assume that these patients’ diagnoses fall neatly into one of two categories (rare diseases or common diseases) that share some but not all attributes of the patients and their disease journeys. This information would be sufficient to enable researchers to build a model for each group of patients sharing the same diagnosis.
There are many different machine learning methods to build such a classification model. Decision trees, random forests composed of many decision trees, and neural networks are just a few of the methods that can be trained on the known examples. There are also several techniques to test whether the classifications are correct or not. Once these steps are taken, it is time to apply the model to the undiagnosed patient to ascertain the cluster to which he or she belongs.
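The train, test and apply steps can be sketched with the smallest possible decision tree: a single split, often called a decision stump. This is a toy illustration under stated assumptions, not a recommended model. The three binary features and all patient rows are hypothetical, and leave-one-out testing is used here as just one example of a technique for checking classifications.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels):
    """Pick the one binary feature whose split misclassifies the fewest rows."""
    best = None
    for f in range(len(rows[0])):
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # this feature does not split the data
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, f, l_lab, r_lab)
    if best is None:  # nothing splits the data: always predict the majority
        m = majority(labels)
        return (0, m, m)
    return best[1:]


def predict(stump, row):
    """Apply the single split to one patient."""
    f, l_lab, r_lab = stump
    return r_lab if row[f] == 1 else l_lab


def leave_one_out_accuracy(rows, labels):
    """Train on all patients but one, test on the held-out one, and repeat."""
    hits = 0
    for i in range(len(rows)):
        stump = train_stump(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict(stump, rows[i]) == labels[i]
    return hits / len(rows)


# Hypothetical diagnosed patients: three made-up binary features each.
rows = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0],
        [1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1]]
labels = ["common"] * 4 + ["rare"] * 4

accuracy = leave_one_out_accuracy(rows, labels)
model = train_stump(rows, labels)
prediction = predict(model, [0, 1, 1])  # hypothetical undiagnosed patient
print("leave-one-out accuracy:", accuracy, "prediction:", prediction)
```

Real patient data would need deeper trees, forests or other models, and the toy data here is separable by construction, so the accuracy it reports says nothing about real-world performance.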
Classification methods like these can also be used to divide patients into several cohorts, or to divide groups into subgroups, such as people who share not only the same diagnosis but also similar outcomes after receiving the same treatment.
Patient descriptions can be multifaceted. They can include such data points as the patients’ illnesses, their symptoms and how they’ve changed over time, the treatments the patients have received, the tests performed on them and the results of those tests. This means that once a group of patients with the same diagnosis is identified, researchers can use different machine learning methods to cluster patients. Clustering could be done on similarities between patients or between disease parameters. It may turn out, for example, that patients who experience the same amount of post-treatment chronic pain belong in one cluster, while patients with pre-treatment high blood pressure belong to a different cluster.
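As a rough sketch of such clustering, the example below runs a bare-bones k-means on two made-up features per patient. Everything in it is an assumption for illustration: k-means is only one of many clustering methods, the two features (a pain score and a blood pressure score, both already scaled to the range 0 to 1) are hypothetical, and the starting centroids are simply the first k patients so that the result is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(points, k, iterations=20):
    """Plain k-means; the first k points serve as deterministic start centroids."""
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assign every patient to the nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids


# Hypothetical patients with the same diagnosis:
# (post-treatment pain score, pre-treatment blood pressure score), scaled 0..1.
patients = [(0.9, 0.2), (0.1, 0.9), (0.8, 0.1),
            (0.85, 0.3), (0.2, 0.8), (0.15, 0.95)]

assignment, centroids = k_means(patients, k=2)
print(assignment)
```

On this toy data the high-pain patients end up in one cluster and the high-blood-pressure patients in the other. On real data the features would need careful scaling and the number of clusters would itself be a modeling decision.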
Finding Enough Data to Build an Effective Model
The problem is that any such clustering depends on the aggregation of data from a large enough group of patients whose disease journeys and attributes are known. As noted above, privacy is one of many barriers to such aggregation. Patient data is stored in silos and, in general, made available only after it has been de-identified according to strict rules such as the blinding of names and other attributes that would identify individual patients. But it is not just the removal of this identifying information that gets in the way. The bigger problem is that without such identifiers, databases cannot be joined — and that means researchers cannot determine if a patient in one database is the same patient described in a different database.
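The joining problem can be made concrete with a small sketch. It assumes, purely for illustration, that each database de-identifies records by replacing the patient name with a token derived from a secret that only that database knows. Real de-identification rules are considerably more involved, and the record fields, salts and values below are all hypothetical.

```python
import hashlib


def de_identify(records, salt):
    """Replace each name with a salted SHA-256 token and drop the name itself."""
    result = []
    for rec in records:
        digest = hashlib.sha256((salt + rec["name"]).encode("utf-8")).hexdigest()
        stripped = {k: v for k, v in rec.items() if k != "name"}
        result.append({"token": digest[:12], **stripped})
    return result


# The same hypothetical person appears in two separate silos.
emr = [{"name": "Patient A", "test_result": "abnormal"}]
claims = [{"name": "Patient A", "billing_code": "CODE-1"}]

# Each silo uses its own secret salt (both salts are made up here).
emr_deid = de_identify(emr, salt="emr-secret")
claims_deid = de_identify(claims, salt="claims-secret")

# Try to join the two de-identified databases on the token.
shared_tokens = {r["token"] for r in emr_deid} & {r["token"] for r in claims_deid}
print("records that can be joined:", len(shared_tokens))
```

The same person receives a different token in each database, so a join on the token finds no match even though both records describe one patient.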
The more general problem is that different databases contain different kinds and amounts of patient information. Electronic medical records (EMRs) generally specify disease details, test results and procedure findings. Insurance claims carry billing codes for diagnoses, procedures or health measurements, along with drugs dispensed. EMRs contain far fewer patients than insurance claims records. The federally funded Medicare and Medicaid programs hold vast amounts of data on patients, most of whom are 65 years old and older, but the data cannot be used for commercial purposes. Insurance claims in the United States include the names of treating physicians; those in Europe do not. Simply put, researchers must painstakingly stitch together several differing databases to get a richer picture of a patient’s disease journey.
When it comes to patients with rare diseases, though, even such diligent research can still fail to uncover sufficient data. For a rare disease, a single electronic medical records database will contain some patients, but usually an insufficient number of them. The only way to get a sufficiently large pool of data is to use insurance claims records. For example, let’s say we want to distinguish whether a patient has Crohn’s disease or is suffering from a metastatic gastroenteropancreatic neuroendocrine tumor. Insurance claims would specify a gastroenteropancreatic neuroendocrine tumor, but not whether it is metastatic (that is, whether it has spread to other organs). The approach would be to use EMRs to build a model of patients that distinguishes between non-metastatic and metastatic cancer, and then apply that model to insurance claims. Such a method is normally referred to as transfer learning.
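The following sketch illustrates the idea of building a model on EMR data and then applying it to claims data. It is a minimal example under several assumptions: the model is a simple smoothed log-odds score over codes (a naive Bayes style scorer chosen for brevity, not a method named in this article), it uses only codes that could appear in both data sources, and every code and patient below is a made-up placeholder.

```python
import math
from collections import Counter


def train(emr_patients):
    """Count, per label, how many EMR patients carry each code."""
    counts = {True: Counter(), False: Counter()}
    totals = Counter()
    for codes, metastatic in emr_patients:
        totals[metastatic] += 1
        counts[metastatic].update(set(codes))
    return counts, totals


def score(model, codes):
    """Sum of smoothed log-odds; a positive score leans toward metastatic."""
    counts, totals = model
    total = 0.0
    for code in set(codes):
        p_met = (counts[True][code] + 1) / (totals[True] + 2)
        p_non = (counts[False][code] + 1) / (totals[False] + 2)
        total += math.log(p_met / p_non)
    return total


# Hypothetical EMR patients: (codes, is_metastatic). The label is known here.
emr_patients = [
    (["dx_net", "proc_a", "drug_c"], True),
    (["dx_net", "proc_a"], True),
    (["dx_net", "drug_c"], True),
    (["dx_net", "proc_b"], False),
    (["dx_net", "proc_b", "proc_d"], False),
    (["dx_net", "proc_d"], False),
]

# Hypothetical claims patients: same kinds of codes, but no metastatic label.
claims_patients = [
    ["dx_net", "proc_a", "drug_c"],
    ["dx_net", "proc_b"],
]

model = train(emr_patients)
predictions = [score(model, codes) > 0 for codes in claims_patients]
print(predictions)
```

The sketch works only because both sources describe patients with overlapping codes. In practice, mapping EMR content onto claims codes is itself a substantial part of the work.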
Ensuring Patient Privacy
When using model building and transfer learning, patient privacy must still be preserved. Differential privacy, a formal mathematical framework, guarantees privacy protection when analyzing or releasing statistical data. More specifically, it guarantees that releasing the number of metastatic cancer patients in a database will not make it possible to determine whether any particular patient has metastatic cancer. It protects against the possibility of determining whether a particular individual is part of such model building, and then using that determination to link that person to a particular disease profile. Any such linking could reduce an individual’s ability to secure employment or health insurance.
Typically, differential privacy works by adding some noise to the data, such as by adding fictitious patients to the database. But the introduction of noise creates a trade-off: the data becomes more anonymous but also less useful. In differential privacy, this trade-off is formally controlled using a parameter called epsilon (ε); a smaller ε means more noise and stronger privacy. Research has already shown that many algorithms, such as histograms, linear regressions and clustering, can be executed within the differential privacy framework.
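To make the role of ε concrete, here is a minimal sketch of one standard way to release a count under differential privacy: the Laplace mechanism, which adds random noise with scale 1/ε to the true count. This is a swapped-in textbook technique rather than the fictitious-patient approach mentioned above, and the database, field name and ε values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """One Laplace(0, scale) sample, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count. Adding or removing one patient changes a count by at most 1,
    so noise with scale 1/epsilon is the usual choice for this query."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def is_metastatic(record):
    return record["metastatic"]


# Hypothetical database: 100 patients, 20 of them flagged as metastatic.
records = [{"metastatic": i % 5 == 0} for i in range(100)]
true_count = sum(1 for r in records if is_metastatic(r))


def mean_abs_error(epsilon, trials=2000):
    """Average distance between the noisy and the true count over many releases."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    return sum(
        abs(private_count(records, is_metastatic, epsilon, rng) - true_count)
        for _ in range(trials)
    ) / trials


errors = {eps: mean_abs_error(eps) for eps in (0.1, 1.0)}
print(errors)
```

Running it shows the trade-off directly: the smaller ε produces a much larger average error in the released count, which is the price paid for stronger privacy.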
As we have shown in this article, artificial intelligence methods can facilitate the diagnosis of rare diseases by building classification models of patients and then applying those models to a not-yet-diagnosed patient. Transfer learning methods and differential privacy frameworks enable us to correlate and link multiple databases so we can have enough data to build these models without infringing on patient privacy. Doing so has the added advantage of helping us understand treatment patterns and outcomes across many different scenarios.
References:
http://privacytools.seas.harvard.edu/files/privacytools/files/pedagogical-document-dp_0.pdf
https://rarediseases.info.nih.gov/diseases/pages/31/faqs-about-rare-diseases