Ground Predictive Healthcare Models in Real-World Medical Data — and in Real-World Medical Practice

Erwan Bigan
GAMMA — Part of BCG X
Feb 22, 2019

With every passing day, advanced analytics that leverage machine learning or artificial intelligence are opening up new possibilities in healthcare. In the recent past, the field has seen significant advances in the use of automated image analysis for cancer detection. What we are now witnessing is a rapid increase in the breadth and depth of available healthcare data, along with increasingly sophisticated artificial intelligence algorithms. Tempered by significant input from healthcare practitioners, this combination of data and analytics promises to lead to increasingly accurate predictive treatment models.

In the case of medical treatment involving medication, advanced analytics enable healthcare practitioners to predict those patients for whom a given drug will work best, those who are at increased risk of medication side effects, and optimal times in the course of care to initiate specific drug treatments. For patients, these predictive powers can lead to improved outcomes from more targeted treatment. For healthcare systems and insurance companies (payers), these analytics can improve the ability to negotiate drug prices and levels of reimbursement based on the expected value. And they can enable pharmaceutical companies to better promote their drugs through such strategies as improved labelling and more precise patient targeting.

Patient Data Comes from Two Specific Sources

The data on which these predictions are built comes from two main sources: clinical trials and real-world data gathered as patients receive treatment. Clinical trials are among the most important sources of data on patient responses to medication, including improved outcomes and side-effect risk. These trials typically enroll as many as a thousand patients and generate data that is exhaustive, homogeneous (the same measurements are available for all patients), and tied to clearly defined end points. Clinical trial data is also the only data source for drugs not yet on the market. (This data is normally the sole property of the pharmaceutical company whose drug is being tested in the trial.)

Real-world data is data generated and collected in the field, usually after a drug has completed its trials and is available to the general public. The data can range from claims or reimbursement data from health insurers to electronic medical records collected by healthcare providers such as private medical doctors or hospitals.

Compared to clinical trial data, data based on real-world evidence is generally available for many more patients, thus providing information on the effectiveness of a drug across the general population. Unlike homogeneous clinical trial data, real-world evidence is typically heterogeneous, since not all patients see a doctor at the same time and patient profiles are more diverse than in clinical trial settings. Real-world data also provides more longitudinal information, because patient information can be captured over a longer time span than the typical duration of a clinical trial. Unlike trial data, however, it does not contain clearly defined end points.

The Power of Combining Data Sources

While relevant analytics can be implemented using only claims data from payers, more robust data sources that combine claims data with real-world clinical information, such as electronic health records, are richer and may lead to better predictive models. It is possible, for example, to infer a patient’s medical status using information based on the drugs they have been prescribed or the physicians who have treated them. A more precise picture, however, can be created when that information is combined with the results of blood tests or other diagnostic tools.
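As a minimal sketch of why the combined picture is sharper, the toy example below first infers a status from prescriptions alone and then refines it with a lab value. Everything in it is an illustrative assumption rather than a clinical rule: the patient records, the use of metformin as a proxy for diabetes treatment, and the 7.0% HbA1c cutoff are all hypothetical.

```python
# Hypothetical toy records; real claims and EHR schemas are far richer.
CLAIMS = {
    "p1": ["metformin"],
    "p2": ["metformin"],
    "p3": ["atorvastatin"],
}
LABS = {
    "p1": {"hba1c_pct": 8.1},
    "p2": {"hba1c_pct": 5.4},
}

# Illustrative assumptions, not clinical guidance.
DIABETES_DRUGS = {"metformin"}
HBA1C_CUTOFF = 7.0


def status_from_claims(patient_id):
    """Infer a coarse status from prescribed drugs only."""
    drugs = set(CLAIMS.get(patient_id, []))
    return "possible diabetes" if drugs & DIABETES_DRUGS else "no signal"


def status_with_labs(patient_id):
    """Refine the claims-based status with a lab result when one exists."""
    base = status_from_claims(patient_id)
    hba1c = LABS.get(patient_id, {}).get("hba1c_pct")
    if base == "no signal" or hba1c is None:
        return base
    if hba1c >= HBA1C_CUTOFF:
        return "treated, HbA1c above cutoff"
    return "treated, HbA1c below cutoff"


for pid in CLAIMS:
    print(pid, "|", status_from_claims(pid), "|", status_with_labs(pid))
```

On claims alone, patients p1 and p2 look identical. Only the lab value tells them apart.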

These more precise patient models can then be leveraged to map and predict more general patient journeys. For example, major categories (clusters) of disease-progression trajectories and care pathways can be extracted from the data, and advanced modeling techniques can be devised to predict future patient trajectories as a function of the chosen treatment option. Such models can be used to determine the best therapeutic option at any point in time for a given patient. The challenge is often gaining access to this real-world patient information.

Accessing Real-World Data Can Be a Challenge

Some vendors, especially those serving patients in the United States, sell access to real-world data such as claims, integrated claims or electronic medical records. In some countries and under specific conditions, national payer systems that cover consumption of any medical services, or national registries that cover specific disease areas such as cancer, may make their data available.

Broadly speaking, access to commercial sources tends to be costlier, but usually comes without topical restrictions. While leading commercial sources cover primarily U.S. patients, clinical aspects of a disease derived from these patients can often be transposed to other regions such as Europe, taking into account any differences in clinical practices. Access to European national payer systems may be free or may come at a minimal cost that reimburses the system for specific data-preparation expenses. However, the institution may limit the ways in which the data can be used for analysis, and it may take longer to access the data. Gaining access to data from the French national payer system, for example, can take as long as six months.

Another limitation is that, typically, national payer systems cover claims data only. Legislative steps have been taken in some countries to extend coverage to electronic health records, but this process will likely take several years to complete. Fortunately, access to clinical data for very specific disease areas (e.g. cancer) is possible today through dedicated national registries. Real-world evidence can also be accessed by working with a targeted payer or healthcare provider to carry out joint projects.

The Right Fuel for Machine Learning

Clinical trial data lends itself particularly well to artificial intelligence and machine learning because the data usually includes a clear baseline composed of information collected before treatment initiation, clear outcomes based on measurements during or at the end of the clinical trial, and clearly defined end points.

To use real-world evidence as a basis for machine learning, however, it is often necessary to reconstruct end points from the longitudinal data. For example, treatment success or failure will not be codified as such in the data: It must be inferred, such as from laboratory data or subsequent treatments or procedures. The advantage of this real-world evidence is that it presents a diversity of patient profiles and prescription patterns that cannot be matched by clinical studies. As such, real-world evidence uniquely enables patient-journey and disease-progression mapping.
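One way such an end point might be reconstructed is sketched below: treatment failure is inferred when a patient switches away from the index drug within a fixed window. The event records, the drug names, and the 180-day window are hypothetical choices made for illustration. A real definition would be agreed with clinicians and would usually draw on laboratory data and procedures as well.

```python
from datetime import date

# Hypothetical prescription events: (patient_id, date, drug).
EVENTS = [
    ("p1", date(2018, 1, 10), "drug_A"),
    ("p1", date(2018, 3, 2), "drug_B"),   # switch 51 days after index date
    ("p2", date(2018, 1, 15), "drug_A"),
    ("p2", date(2018, 4, 15), "drug_A"),  # refill of the same drug
    ("p3", date(2018, 2, 1), "drug_A"),
    ("p3", date(2018, 12, 1), "drug_B"),  # switch outside the window
]


def infer_treatment_failure(events, index_drug, window_days=180):
    """Return {patient_id: bool}; True means a switch within the window."""
    by_patient = {}
    for pid, day, drug in sorted(events, key=lambda e: (e[0], e[1])):
        by_patient.setdefault(pid, []).append((day, drug))

    outcome = {}
    for pid, history in by_patient.items():
        starts = [day for day, drug in history if drug == index_drug]
        if not starts:
            continue  # patient never received the index drug
        index_date = starts[0]
        outcome[pid] = any(
            drug != index_drug and 0 < (day - index_date).days <= window_days
            for day, drug in history
        )
    return outcome


print(infer_treatment_failure(EVENTS, "drug_A"))
```

The result is a per-patient label that can serve as the outcome column for a model, which is what the trial data provides directly and the real-world data does not.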

Using either clinical trial data or real-world evidence, it is possible to build a hundred or more baseline patient characteristics (features) describing the patient before the start of treatment, or at a given stage of a disease. These features can include laboratory measurements such as blood and urine tests and genotyping information, concomitant medication, medical history and demographic information. If a given outcome is to be predicted, recursive feature elimination can then be used to determine an optimum feature set, which typically consists of ten or fewer features. Compared to classical statistical analysis, machine learning based on this kind of rich data makes possible highly multi-dimensional analyses that leverage sophisticated algorithms such as random forests or gradient-boosted decision trees.
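The idea behind recursive feature elimination can be sketched in a few lines: fit a model, drop the least important feature, and repeat until the desired number remains. The sketch below is a simplified stand-in, not the method used in any particular project. It ranks features by the absolute weight of a small logistic regression rather than by a tree ensemble, and it runs on synthetic data in which only the two hypothetical features lab_a and lab_b carry signal.

```python
import math
import random


def fit_logistic(rows, labels, n_steps=200, lr=0.5):
    """Fit logistic-regression weights (no intercept) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_steps):
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):
            z = sum(w * v for w, v in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for j, v in enumerate(x):
                grad[j] += (p - y) * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def recursive_feature_elimination(rows, labels, names, n_keep):
    """Refit and drop the feature with the smallest |weight| until n_keep remain.

    Assumes features are on comparable scales (here: standard normal).
    """
    active = list(range(len(names)))
    while len(active) > n_keep:
        subset = [[x[j] for j in active] for x in rows]
        weights = fit_logistic(subset, labels)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        del active[weakest]
    return [names[j] for j in active]


# Synthetic data: the outcome depends on lab_a and lab_b only.
rng = random.Random(0)
NAMES = ["lab_a", "lab_b", "noise_1", "noise_2", "noise_3", "noise_4"]
rows = [[rng.gauss(0, 1) for _ in NAMES] for _ in range(200)]
labels = [
    1 if 2.0 * x[0] - 2.0 * x[1] + rng.gauss(0, 0.5) > 0 else 0 for x in rows
]

selected = recursive_feature_elimination(rows, labels, NAMES, n_keep=2)
print(selected)
```

In practice the number of features to keep would itself be chosen by cross-validated performance rather than fixed in advance.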

Keeping Real-World Practitioners in the Equation

Realizing the full potential of advanced analytics for healthcare applications usually requires going beyond mere predictive models and changing the operating models themselves. To do so, a number of challenges must be addressed. For one, when healthcare practitioners are presented with predictive models, they often display a greater interest in the nature of predictors themselves than in the actual model performance. Before they are willing to adopt a model, practitioners must be convinced that both the predictors — and the way the algorithms use the predictors to arrive at a prediction — make medical sense.

Some medical doctors suggest that algorithms should be so simple that they can be remembered by practitioners. The use of simple decision trees may seem like the best path to communicate the logic behind these models. However, the performance of standard simple decision-tree algorithms, such as CART, is generally insufficient. For this reason, the development of predictive models for healthcare applications usually proceeds in two phases:

Phase One: The best model — the one with the maximum predictive power — is developed.

Phase Two: This optimum model is then translated into simpler, interpretable, actionable tools that come with a limited number of predictors. An example of such a tool is a fixed-structure decision tree with optimized cutoff values.
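A fixed-structure tree with optimized cutoffs might look like the following sketch. The tree shape (test marker_a, then marker_b) is fixed in advance, as if chosen with clinicians, and only the two cutoff values are searched. The marker names, the toy patients, and the target labels are hypothetical. In a real project the targets could be observed outcomes or the predictions of the phase-one model.

```python
from itertools import product


def tree_predict(patient, cut_a, cut_b):
    """Fixed structure: test marker_a first, then marker_b; only cutoffs vary."""
    if patient["marker_a"] >= cut_a:
        return 1
    if patient["marker_b"] >= cut_b:
        return 1
    return 0


def optimize_cutoffs(patients, targets):
    """Grid-search observed values as cutoffs, maximizing agreement with targets."""
    candidates_a = sorted({p["marker_a"] for p in patients})
    candidates_b = sorted({p["marker_b"] for p in patients})
    best = None
    for cut_a, cut_b in product(candidates_a, candidates_b):
        correct = sum(
            tree_predict(p, cut_a, cut_b) == t for p, t in zip(patients, targets)
        )
        if best is None or correct > best[0]:
            best = (correct, cut_a, cut_b)
    return best[1], best[2], best[0] / len(patients)


# Hypothetical toy cohort; target 1 = high risk.
PATIENTS = [
    {"marker_a": 1.0, "marker_b": 10},
    {"marker_a": 2.0, "marker_b": 12},
    {"marker_a": 3.0, "marker_b": 15},
    {"marker_a": 2.0, "marker_b": 20},
    {"marker_a": 5.0, "marker_b": 11},
    {"marker_a": 6.0, "marker_b": 9},
    {"marker_a": 1.5, "marker_b": 30},
    {"marker_a": 2.5, "marker_b": 35},
]
TARGETS = [0, 0, 0, 0, 1, 1, 1, 1]

cut_a, cut_b, accuracy = optimize_cutoffs(PATIENTS, TARGETS)
print(f"marker_a >= {cut_a} or marker_b >= {cut_b} (agreement {accuracy:.0%})")
```

The resulting rule has two predictors and two numbers to remember, which is the kind of tool practitioners can check against their own medical judgment.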

Data Alone Not Enough to Inform Medical Decisions

Stakeholders across the healthcare ecosystem are increasing their use of machine learning and artificial intelligence to benefit individual patients and the healthcare ecosystem as a whole. Generalized access to future data sources such as whole-genome sequencing data and data from personal connected devices such as medical monitoring devices and smartphones will further expand the potential of advanced analytics for healthcare.

At the same time, and no matter what the application, using predictive models to assist in medical decision-making requires stringent performance requirements. For example, the number of false positives (patients for whom the model falsely predicts disease progression) should be minimized when predicting disease progression, and a specific process must be envisioned to deal with the false positives that remain. Such a process might include the addition of medical exams to rule out these false results.

Conversely, when predicting potentially harmful side effects, false negatives (patients for whom the model falsely predicts that they are safe) should be minimized, with attention paid to the background incidence rate of such events in the general population.
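The role of the background incidence rate can be made concrete with Bayes' rule. The sketch below computes the positive and negative predictive values of a model from its sensitivity, its specificity, and the prevalence of the event. The numbers used are illustrative assumptions, not results from any study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a classifier applied to a given population.

    PPV: share of flagged patients who truly have the event.
    NPV: share of cleared patients who truly do not have it.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical model (90% sensitivity, 90% specificity), two populations.
for prevalence in (0.01, 0.30):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With a 1% incidence rate, fewer than one in ten flagged patients is a true positive even though the model is right 90% of the time in each class. This is why the follow-up process for flagged patients matters as much as the model itself.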

The combination of expanding data sources and increasingly sophisticated predictive modeling bodes well for improved patient outcomes and increases in efficiency throughout the healthcare ecosystem. The challenge will come in identifying and accessing the appropriate data sets on which to base these models. Steps must also be taken to ensure that healthcare practitioners both understand the medical sense behind the models and are able to play an ongoing role in providing a real-world check on how the models are applied to actual patients.
