Accordion APIs How-To series, part V
How does Arise perform compared to industry standard algorithms?
In our previous post of this series, we presented a case study that shows how our machine learning engine Arise determines whether a member is likely to have Parkinson’s disease: Arise builds hundreds of machine learning models, including gradient boosting and deep learning models, and ensembles the predictions of these models.
At this point, you may ask, “I see that you are using very cool techniques. But does Arise actually perform better than industry-standard algorithms? If so, by how much?”
Today’s post answers that question. To measure and demonstrate Arise’s performance, we picked the following five diagnosis codes:
- 10-C50919: Malignant neoplasm of unspecified site of unspecified female breast
- 10-C61: Malignant neoplasm of prostate
- 10-E119: Type 2 diabetes mellitus without complications
- 10-G20: Parkinson’s disease
- 10-J449: Chronic obstructive pulmonary disease, unspecified
** “10-” stands for ICD-10.
Note that 10-C50919 can appear only in a female member’s medical record, while 10-C61 can appear only in a male member’s. We added them to the measurement in memory of Ada Lovelace (our Person of the Post), who died of uterine cancer, which, like 10-C50919 and 10-C61, can be linked to Hierarchical Condition Category 12, “Breast, Prostate, and Other Cancers and Tumors”.
Before we jump into the numbers, we’d like to briefly describe how we set up the experiment. From our database, we randomly picked 5,000 members and their medical and pharmacy records from 2015 and 2016. For those members who had the target diagnosis codes in 2016, we removed the target codes from their 2016 medical records, but left their 2015 medical records untouched (if the codes were present there). After this step, we had a dataset with many missing target diagnosis codes in the 2016 records.
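The masking step above can be sketched in plain Python. The record layout and member data here are hypothetical, chosen only to illustrate the idea: for each member we keep the codes observed in each year, delete the target code from the 2016 record, and remember whether it was present as the ground-truth label the models must recover.

```python
TARGET = "10-G20"  # Parkinson's disease

def mask_target(records, target=TARGET):
    """records: {member_id: {"2015": set_of_codes, "2016": set_of_codes}}.
    Returns (masked_records, labels), where labels[m] is True iff the
    target code was present in member m's original 2016 record."""
    masked, labels = {}, {}
    for member, years in records.items():
        labels[member] = target in years["2016"]
        masked[member] = {
            "2015": set(years["2015"]),             # 2015 record left as-is
            "2016": set(years["2016"]) - {target},  # target code removed
        }
    return masked, labels

records = {
    "A": {"2015": {"10-G20"}, "2016": {"10-G20", "10-E119"}},
    "B": {"2015": set(), "2016": {"10-J449"}},
}
masked, labels = mask_target(records)
# labels == {"A": True, "B": False}; "10-G20" is gone from A's 2016 record
```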
We then applied three different algorithms (listed below) to this dataset to see how well each one can identify the members who had the target diagnosis codes in 2016.
- Arise: Accordion’s machine learning algorithms
- Baseline-1 (a.k.a. Persistency Model): a method commonly used in the industry that predicts a member has the condition in 2016 if and only if the member had it in 2015.
- Baseline-2 (a.k.a. Generalized Linear Model or Actuarial Model): another commonly used prediction method, a regularized logistic regression model with various diagnosis and demographic features.
Note that 10-C50919 and 10-C61 are gender-specific diagnosis codes. For these two codes, we restricted the evaluation to members of the appropriate gender.
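Baseline-1 is simple enough to sketch directly. The snippet below is a minimal illustration of the persistency idea on a hypothetical record layout (not our production code): score 1.0 if the member had the target code in 2015, else 0.0.

```python
def persistency_predict(masked_records, target):
    """Baseline-1 sketch: a member's score is 1.0 if the target code
    appears in their 2015 record, else 0.0.
    masked_records: {member_id: {"2015": set_of_codes, "2016": set_of_codes}}."""
    return {member: float(target in years["2015"])
            for member, years in masked_records.items()}

masked_records = {
    "A": {"2015": {"10-G20", "10-E119"}, "2016": {"10-E119"}},
    "B": {"2015": {"10-J449"}, "2016": {"10-J449"}},
}
scores = persistency_predict(masked_records, "10-G20")
# scores == {"A": 1.0, "B": 0.0}
```

Because the persistency model's scores are already 0 or 1, its ROC curve collapses to a single operating point, whereas Baseline-2 and Arise output continuous probabilities that can be thresholded anywhere.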
Now let’s take a look at some numbers about these target diagnosis codes.
As the stacked bar chart below shows, of the 5,000 randomly selected members, 5% were diagnosed with 10-C50919, 12% with 10-C61, 23% with 10-E119, 2% with 10-G20, and 17% with 10-J449. Although these numbers do not necessarily represent the actual prevalence of each diagnosis code across the entire U.S., they do tell us one thing: even for the most prevalent diagnosis code, the ratio between the positive and negative classes is severely skewed. In other words, this is a class-imbalance problem.
How well can it perform?
Because this prediction task is imbalanced and binary, we chose a metric called the receiver operating characteristic curve (ROC curve) to measure the performance of the methods described earlier. The ROC curve was first developed by electrical engineers and radar operators during World War II for detecting enemy objects on battlefields (obviously an imbalanced, binary classification problem!).
Each method/model will have one ROC curve. To plot the ROC curve, we first recognize that each model outputs a probability that the member has the target diagnosis code. We then pick a threshold at which we classify the member as having the condition: if the model’s output is below the threshold, we classify the member as not having the diagnosis; if it’s at or above the threshold, we classify the member as having it. To generate the ROC curve, we vary this threshold between 0 and 1 and calculate the true positive rate (TPR) and the false positive rate (FPR) at each value. The ROC curve plots the TPR against the FPR across these threshold values.
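The threshold sweep just described can be sketched in a few lines of plain Python. The scores and labels below are toy values, not actual model output; a member counts as positive when their score is at or above the threshold.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold.
    scores: model probabilities; labels: 1 = had the code in 2016, else 0."""
    pos = sum(labels)            # number of true positives in the data
    neg = len(labels) - pos      # number of true negatives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 1, 0, 0]            # a perfectly separable toy example
points = roc_points(scores, labels, [0.0, 0.5, 1.0])
# threshold 0.0 -> (1.0, 1.0); 0.5 -> (0.0, 1.0); 1.0 -> (0.0, 0.0)
```

In practice a library routine such as scikit-learn’s `roc_curve` does the same sweep over every distinct score.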
Next we calculate the area under the ROC curve (AUROC or AUC) for these three methods. AUROC is a number between 0 and 1: a value of 1 means the model makes perfect predictions, while 0.5 means the model is only as good as a random guess.
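Given the (FPR, TPR) points from the threshold sweep, the area under the curve can be computed with the trapezoid rule. This is a minimal sketch on toy points (not the actual curves from our experiment):

```python
def auc(points):
    """Integrate TPR over FPR with the trapezoid rule.
    points: iterable of (FPR, TPR) pairs in any order."""
    pts = sorted(points)  # order by FPR before integrating
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # perfect classifier
random_ = [(0.0, 0.0), (1.0, 1.0)]              # coin-flip diagonal
# auc(perfect) == 1.0, auc(random_) == 0.5
```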
From the plots above, we can see that:
- Depending on the target diagnosis code, the prediction performance can vary.
- Arise consistently outperforms both Baseline-1 and Baseline-2, no matter which condition we target.
- Baseline-1 sometimes outperforms Baseline-2 and sometimes does not; neither baseline is clearly better than the other.
Ending this series…
In this post, we evaluated the prediction performance of Arise and compared it with two other prediction models widely used in the industry. We created a test dataset and applied three different algorithms to it to compare their performance. The experiment shows that Arise consistently outperforms the industry-standard approaches, accurately identifying the members with the target diagnosis codes.
This post will be the final one in this series (time flies!). We’d like to thank you for your time and attention. Should you have any interest or questions about us, please visit accordionhealth.com or email us at firstname.lastname@example.org.
Oh wait! Don’t go too far away! We will come up with a new series about our newest technology and show how you can thrive in the world of value-based care. Stay tuned!