Understanding the effects of label noise on healthcare prediction accuracy
Part 1 of a 2-part series
What is label noise?
Label noise refers to mislabeled data: observations whose recorded class is incorrect. Machine learning practitioners use existing observations to predict outcomes on incoming data, but if the model is trained on noisy labels, the patterns it learns won't hold on new data and its predictions will underperform. Oftentimes in healthcare data we can easily identify true positives, but accurate true negatives become murky.
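As a minimal sketch of that second case (with hypothetical labels and an illustrative noise rate), this kind of negative-class noise can be simulated by flipping a fraction of the true positives to zero, so the observed data understates the positive class:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical true labels: 1 = event occurred, 0 = no event.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# One-sided label noise: some true positives are recorded as negatives
# (e.g., an unreported adverse event). The noise rate here is illustrative.
noise_rate = 0.5
y_observed = y_true.copy()
positive_idx = np.flatnonzero(y_true == 1)
flipped = rng.choice(positive_idx, size=int(noise_rate * len(positive_idx)),
                     replace=False)
y_observed[flipped] = 0

# Observed positives can only undercount the truth.
print(y_true.sum(), y_observed.sum())
```

Every recorded positive is still genuine; the damage is entirely in the negatives, which now mix true negatives with unrecorded positives.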
EHR (Electronic Health Record) systems are often fragmented, hard to update, out of date, and lacking in oversight of data quality. Additionally, the exchange of accurate medical data is limited in scope and heavily siloed, which further exacerbates the conditions that lead to high levels of label noise.
For instance, hospitals have a high incidence of unreported adverse events. During my time in QA at a large hospital, I was tasked with predicting these events during a given inpatient stay. Unfortunately, these observations were self-reported, so if a clinician felt afraid to report a mistake or incident, we had no record of it. On top of negative-class label noise, we were also facing extreme class imbalance.
Although the examples above involve accidental mislabeling, label noise can also be introduced by a malicious third party. As data-driven analytic tools gain prevalence in the healthcare industry, bad actors will have a growing incentive to intentionally corrupt data sources. For example, dermatology practices typically operate under a 'fee for service' model, which incentivizes clinicians to complete more procedures. As a result, some doctors perform unnecessary procedures to increase their billed amount; recently, a physician in Florida was convicted of performing several thousand purposeless surgical procedures. If a machine learning model were used to authorize procedures, a bad actor could inject label noise to trick the system into approving an unnecessary claim.
Research shows that noise from both accidental and adversarial mislabeling can trick a diagnostic model into making an incorrect life-and-death decision. Slight deviations in training data can cause grave consequences in the real world.
Applying label noise concepts to healthcare fraud
When developing a model to detect provider billing fraud, we have only a few positive examples of fraud: providers who were caught. Unfortunately, they do not represent the "universe" of fraudulent providers, so our dataset contains providers who have committed fraud but are labeled 'non-fraudulent'.
Similarly, Healthfirst's Social Determinants of Health analysis was performed on noisy data. The CDC describes the social determinants of health as the economic and social conditions that influence individual and group differences in health status. A previous analysis determined that our members face more financial hardship than our claims data reflects. This means our member population contains false negatives: members who suffer economic hardship but have no claims indicating it.
Are our models still applicable?
The good news
We hypothesize that although label noise is pervasive in data science tasks, its effects are not strong enough to invalidate our models! Furthermore, one can estimate the maximum deterioration of a model from its initial performance on a separate validation set.
As a proof of concept, DataScience@HF downloaded 13 open-source benchmark datasets and randomly mislabeled the 0-class at different thresholds to examine the decline in accuracy of a random forest trained on the corrupted labels. At each threshold (for each dataset) we recorded the ROC AUC, Brier score, and logarithmic loss. To assess decline we used the "percentage of baseline metric" (e.g., the metric at 90% mislabeled data divided by the metric at 0% mislabeled data).
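The experiment can be sketched as follows. The synthetic dataset, noise thresholds, and model settings below are illustrative stand-ins, not the exact datasets or configuration used in the study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=42)

# Stand-in for one benchmark dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

def evaluate(noise_fraction):
    """Mislabel `noise_fraction` of the training 0-class, train a random
    forest on the corrupted labels, and score on the clean test set."""
    y_noisy = y_tr.copy()
    zeros = np.flatnonzero(y_tr == 0)
    flip = rng.choice(zeros, size=int(noise_fraction * len(zeros)),
                      replace=False)
    y_noisy[flip] = 1
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_tr, y_noisy)
    p = clf.predict_proba(X_te)[:, 1]
    return {"roc_auc": roc_auc_score(y_te, p),
            "brier": brier_score_loss(y_te, p),
            "log_loss": log_loss(y_te, p)}

baseline = evaluate(0.0)
for frac in (0.1, 0.3, 0.5, 0.9):
    scores = evaluate(frac)
    # "Percentage of baseline metric": metric at this noise level
    # divided by the metric at 0% mislabeled data.
    pct_of_baseline = {m: scores[m] / baseline[m] for m in scores}
    print(frac, pct_of_baseline)
```

Note that the test set is left clean: the point is to measure how a model trained on corrupted labels degrades when judged against the truth.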
In my next blog post, we’ll talk about how we did it, why it matters — and why we’re relatively insulated from the adverse effects of label noise!