Fraud, Waste & Abuse (FWA) Prepay Risk Scoring at Oscar

Oscar Health · Published in Oscar Tech · Jan 20, 2022 · 7 min read

How we implemented a fraud detection model running live in our claims system without any training data

By Solomon Foster

Context

Healthcare fraud, waste and abuse (FWA) is a societal ill that plagues every payor in the United States, from Medicare and its more than 60 million beneficiaries all the way down to smaller private insurers like Oscar. There are many flavors of FWA, but a simplified narrative for understanding the phenomenon is that healthcare providers, including individual physicians, hospital corporations, and lab companies, are typically reimbursed for the services they provide in proportion to the frequency and complexity of those services. This incentive structure gives rise to many of the most common FWA schemes: upcoding, for instance, is the practice of consistently billing a more complex version of a service than was actually rendered in order to increase the reimbursement received from the insurer. Other FWA schemes include inappropriate use of claim modifiers resulting in overpayment for procedures, and tacking on codes for services that are not supported — often labs, imaging, or niche examinations — to a basic service.

Data Mining in FWA

At Oscar, the main purpose of data science in the FWA space is to capture potentially fraudulent, wasteful, or abusive billing behavior in our claims data. We conduct research into FWA schemes and aberrant billing patterns like upcoding and configure data mining routines called signals that search for providers perpetrating them. Signals are often one-to-one with a given FWA scheme and have a core value, called a metric, that quantifies the FWA-predictive billing behavior. We aggregate these metrics for each billing entity (typically a provider's NPI) for which the signal is relevant, and further segment the aggregation along variables, such as inpatient/outpatient, that are meaningful when a claim we suspect was improperly billed undergoes medical coding review to determine its legitimacy.
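
To make the idea of a signal metric concrete, here is a minimal pandas sketch of this kind of per-NPI aggregation. The column names, codes, and data are illustrative only, not Oscar's actual schema.

```python
import pandas as pd

# Hypothetical claim-line extract; column names are illustrative.
claims = pd.DataFrame({
    "npi":   ["1234567890", "1234567890", "9876543210", "9876543210"],
    "cpt":   ["99203", "99205", "99204", "99205"],
    "place": ["outpatient", "outpatient", "outpatient", "inpatient"],
})

# Metric for an office-visit upcoding signal: the average billed complexity,
# taken from the final digit of the office visit code (99201-99205 -> 1-5).
office_visits = claims[claims["cpt"].str.startswith("9920")].copy()
office_visits["complexity"] = office_visits["cpt"].str[-1].astype(int)

# Aggregate per billing entity (NPI), segmented by a review-relevant variable.
metric = (
    office_visits
    .groupby(["npi", "place"])["complexity"]
    .agg(metric="mean", n_lines="count")
    .reset_index()
)
print(metric)
```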

Motivating Prepay Risk Scoring

After relying initially on binary flags for providers suspected of perpetrating FWA, we set out to substantially increase the sophistication of our data mining by developing a probabilistic notion of potentially fraudulent billing derived from our existing set of signals. A probabilistic notion of FWA allows us to assess the potential ROI of conducting medical coding reviews on a claim-specific basis. In the Oscar FWA team’s nascent stages, coding reviews were carried out at the provider level, which surfaced many claims that weren’t valuable to review. A claim-specific understanding of the probability of FWA also enables us to implement a system that holds high-predicted-ROI claims in prepay, avoiding the arduous process of trying to recoup funds from providers after claims are paid out. We call the model that makes these ROI predictions The Prepay Risk Scoring Model.

The Prepay Risk Scoring Model

The model we use for Prepay Risk Scoring is a logistic regression in which each feature is an FWA signal and the response is the probability that a claim line will be judged improperly billed during medical coding review. The initial rollout of the model limited the feature set to FWA signals we had already configured, including signals for office visit upcoding and revenue code upcoding. In theory, the feature set could be expanded in the future to any claim or claim-derived attribute we think might influence the likelihood of FWA.

The difficulty in executing the model was not theoretical; it lay in the circumstances of its implementation. One major challenge was that, in order for the model to make predictions in prepay, we had to configure it in the domain-specific language our homegrown claims system uses to adjudicate claims. This meant we needed to map signals, the features of the model, to claim-level attributes like CPT code or network status, so that the feature values could be looked up in our production databases based on the input claim. The feature weights of the model also had to be looked up from the claims system, which necessitated a model with interpretable coefficients, like a logistic regression.
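
For intuition, the prediction-time arithmetic amounts to looking up feature values and coefficients for the input claim and passing their weighted sum through a sigmoid. The sketch below shows that step in Python; in production this logic is expressed in the claims system's DSL, and the signal names and weights here are hypothetical.

```python
import math

# Interpretable coefficients fit offline; names and values are illustrative.
WEIGHTS = {
    "intercept": -2.5,
    "office_visit_upcoding_score": 1.1,
    "revenue_code_upcoding_score": 0.9,
}

def prepay_risk_score(claim_features: dict) -> float:
    """Predicted probability that the claim line is improperly billed."""
    z = WEIGHTS["intercept"] + sum(
        WEIGHTS[name] * value
        for name, value in claim_features.items()
        if name in WEIGHTS
    )
    return 1.0 / (1.0 + math.exp(-z))

# Feature values would be looked up from production tables keyed on claim
# attributes (e.g. rendering NPI, CPT code, network status).
print(prepay_risk_score({"office_visit_upcoding_score": 1.8,
                         "revenue_code_upcoding_score": 0.2}))
```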

The other major challenge was that we had no training data! Because this kind of prepay system is novel at Oscar, we hadn’t done reviews at a granular enough level to fit the regression. Two key analytical steps — (1) Feature Engineering and (2) Creating Synthetic Training Data — helped us overcome this obstacle and permitted us to implement a model that we were comfortable with conceptually, even though it was not based on real review data.

1. Feature Engineering

The first analytical step we took created consistency across the inputs to the model. Signals, the features of the prepay risk scoring model, are capable of capturing outlying billing behavior, but the metrics they rely on aren’t standardized either within the scenarios of a given signal or across signals. In the case of upcoding, it is difficult to tell how a provider scoring an average complexity of 4.6 out of 5 for the initial office visits scenario (CPTs 99201–5, where the final digit represents the relative level of complexity of medical decision making involved in the visit) should be compared to a provider scoring an average intensity of 2.9 out of 3 for the definitive drug testing scenario (CPTs 80305–7, where the final digit represents the sophistication of the instruments used for the drug test).

To standardize across scenarios, we constructed reference distributions from the metrics within a given scenario and computed standard scores against those reference distributions (fig 1). We also ensured the reference distributions were robust to small-sample issues by using a Bayesian hierarchical model, which outputs distributions that are a compromise between the signal-level distribution as a whole (the fully pooled, or population-level, distribution) and the scenario-level distribution specifically (the fully unpooled, or observed, distribution). When a given scenario is relatively poorly defined because of a small sample of data, such as a scenario based on a relatively new or uncommon CPT code, the output distribution is much closer to the pooled, signal-level distribution; when we have a lot of data for a given scenario, such as upcoding of office visits, the output distribution is very close to the distribution of that specific scenario. (See here for a good walkthrough of a similar modeling problem.)
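
As a rough illustration of the partial-pooling idea (a simplified empirical-Bayes stand-in, not the exact hierarchical model we fit), the sketch below shrinks each scenario's observed mean and spread toward the pooled, signal-level values in proportion to how much data the scenario has, then computes standard scores against the shrunken reference. All numbers and the prior_strength parameter are made up.

```python
import numpy as np

def partially_pooled_reference(metrics_by_scenario: dict, prior_strength: float = 30.0):
    """Shrink each scenario's reference distribution toward the pooled one.

    Scenarios with few observations are pulled strongly toward the pooled,
    signal-level distribution; well-observed scenarios keep roughly their own.
    """
    all_values = np.concatenate([np.asarray(v, dtype=float)
                                 for v in metrics_by_scenario.values()])
    pooled_mean, pooled_std = all_values.mean(), all_values.std()

    reference = {}
    for scenario, values in metrics_by_scenario.items():
        values = np.asarray(values, dtype=float)
        n = len(values)
        w = n / (n + prior_strength)  # little data -> w near 0 -> pooled
        shrunk_mean = w * values.mean() + (1 - w) * pooled_mean
        shrunk_std = (w * values.std(ddof=1) + (1 - w) * pooled_std
                      if n > 1 else pooled_std)
        reference[scenario] = (shrunk_mean, shrunk_std)
    return reference

def standard_score(metric_value: float, scenario: str, reference: dict) -> float:
    mean, std = reference[scenario]
    return (metric_value - mean) / std

# Example: many office-visit observations, few drug-testing ones (made-up data).
reference = partially_pooled_reference({
    "office_visit_upcoding": np.random.default_rng(0).normal(3.2, 0.6, size=5000),
    "definitive_drug_testing": [2.9, 2.7, 3.0],
})
print(standard_score(4.6, "office_visit_upcoding", reference))
```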

The outcome of normalizing signals in this fashion is a feature set that

  1. Consists of statistical measures of billing outlyingness rather than arcane metrics
  2. Is interpretable both between and within specific features
  3. Adjusts conservatively to new claims, new billing behavior and new procedure categories

These qualities are critical for a feature set used in a model that does not initially leverage real training data.

Fig. 1. Constructing Reference Distributions Using Bayesian Hierarchical Modeling

2. Creating Synthetic Training Data

The second key analytical step we undertook solved the challenge of setting weights for the different regression features without any review outcome data to use as labels. We realized in short order that, even if we made our lives easier by assuming uniform weights for each of the features, we couldn't simply set all of the weights to 1, say, and get reasonable predicted review probabilities out of the other side of the inverse logit (sigmoid) function (fig 3). Fraudulent claims are roughly Pareto distributed: most claims don't have issues, while claims billed by some of the most outlying providers are very likely to be improperly billed. To satisfy this and a few other requirements for our model's predicted probabilities, we decided that the best approach would be to fit the regression weights on fake claim-line-level outcome data, which we call "synthetic labels," rather than to try to set the weights explicitly. To accomplish this, we specified a function that maps feature values to expected probabilities for each signal; an upcoding score of 0 (roughly average) might map to a 7 percent hit rate, while an upcoding score of 2 might map to a 60 percent hit rate (fig 2).

Fig. 2. Creating synthetic labels
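
As an illustration of the score-to-probability mapping (the exact curve we used is not reproduced here), the sketch below interpolates between a few anchor points consistent with the examples above; every value other than the 0 → 7% and 2 → 60% anchors is invented.

```python
import numpy as np

# Illustrative mapping from a standardized upcoding score to an expected hit
# rate, anchored on the two points mentioned in the text (0 -> ~7%, 2 -> ~60%);
# the other anchor values are made up for the sketch.
ANCHOR_SCORES = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
ANCHOR_HIT_RATES = np.array([0.03, 0.07, 0.25, 0.60, 0.85])

def expected_hit_rate(score: float) -> float:
    """Piecewise-linear interpolation between anchors, clipped at the ends."""
    return float(np.interp(score, ANCHOR_SCORES, ANCHOR_HIT_RATES))

print(expected_hit_rate(0.0))  # 0.07
print(expected_hit_rate(2.0))  # 0.60
```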

To create the synthetic labels, we then sampled from a binomial distribution where p is the mapped probability corresponding to each claim line. Finally, we simply fit our regression weights on claim lines and their synthetic labels, and the model was ready to be used in production.
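
Putting the pieces together, a minimal sketch of the label-generation and fitting steps might look like the following, using made-up claim-line scores and the illustrative score-to-probability anchors from the previous sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Standardized signal scores for a sample of claim lines (made-up data);
# each column is one FWA signal, e.g. office-visit and revenue-code upcoding.
X = rng.normal(size=(10_000, 2))

# Map each line's strongest signal score to an expected hit rate using the
# illustrative anchor points from the previous sketch.
anchor_scores = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
anchor_hit_rates = np.array([0.03, 0.07, 0.25, 0.60, 0.85])
p = np.interp(X.max(axis=1), anchor_scores, anchor_hit_rates)

# Synthetic labels: a single binomial (Bernoulli) draw per claim line.
y = rng.binomial(n=1, p=p)

# Fit the interpretable regression weights that get configured into the
# claims system.
model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)
```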

Figs. 3 & 4. Modeled probabilities w/ weights set to 1, and using the synthetic label method

Results and Next Steps

Bayesian feature standardization and generating synthetic data allowed us to implement a risk scoring model that is conceptually well-founded, despite not being based on any real training data. This version of prepay risk scoring is live in the claims system at Oscar and is already successfully holding improperly billed claims at a high rate: the initial launch of the model on a subset of upcoding scenarios has generated a review hit rate of 60% on more than 500 claims.

The large number of compromises and assumptions we made in building this minimum viable model has left us with no shortage of improvements to work on over the next months and years. Upcoming projects include incorporating realized review hit rates into the training of model weights, expanding the model's feature set with high-quality new signals, assessing collinearity among already-implemented and prospective features, and developing methods to sample from new sets of claims with unknown expected ROI. We're excited for what the future holds as we continue to leverage our differentiated technology to drive innovation in fraud detection.

Solomon Foster is a Data Scientist working on analytical methods and data infrastructure to detect Fraud, Waste and Abuse.

Want to talk more tech? Send our CEO, Mario, a tweet @mariots
