Can You Predict My Star Rating?

Yubin Park
Published in Accordion Health
Aug 4, 2016

At Accordion Health, we help Medicare Advantage (MA) plans improve their Star Ratings. We identify the right measures to focus on and provide data-driven platforms that engage members and providers to improve those measures. The MA Star Rating System is arguably one of the most sophisticated and complex quality rating systems: 44 different measures are collected from multiple sources, the cutoffs for Star Rating measures change annually, and a large portion of enrollees move between plans from year to year. This dynamic nature makes it unreliable to simply project future ratings from the latest Star Rating data and pick focus points from them. A better solution is to develop data analytics that predict a plan’s Star Ratings for the next few years.

Many people have asked us whether accurate Star Rating prediction is even possible. It turns out that it is, and we will illustrate the basic idea using public datasets. The results are even more impressive when combined with a plan’s private data sources.

Predicting the Blood Sugar Control Measure

We will illustrate our methodology by predicting the Star Rating of the Blood Sugar Control (BSC) measure (C15 in 2016). Our prediction process can be summarized in three steps. First, we predict the raw scores of the measure. Next, we calculate the Star Rating cutoffs by applying a Hierarchical Clustering algorithm. Finally, by combining the predicted raw scores and cutoffs, we derive the Star Rating of the BSC measure.
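Before diving into the details, here is a toy, end-to-end Python sketch of the three steps. Everything in it, the numbers, the naive trend extrapolation, and the quantile-based cutoffs, is illustrative only and not our actual method:

    import numpy as np

    # Toy raw-score history per plan (fraction of members with controlled blood sugar).
    history = {"plan_A": [0.70, 0.74, 0.78],
               "plan_B": [0.60, 0.61, 0.59],
               "plan_C": [0.72, 0.71, 0.75]}

    # Step 1 (toy): extrapolate each plan's last year-over-year change.
    raw_pred = {p: s[-1] + (s[-1] - s[-2]) for p, s in history.items()}

    # Step 2 (toy): cutoffs from the predicted score distribution; CMS actually
    # uses Hierarchical Clustering, sketched later in this post.
    cutoffs = np.quantile(list(raw_pred.values()), [0.2, 0.4, 0.6, 0.8])

    # Step 3: map each predicted raw score to a 1-5 Star rating.
    stars = {p: int(np.digitize(v, cutoffs)) + 1 for p, v in raw_pred.items()}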

The Centers for Medicare and Medicaid Services (CMS) releases the raw scores of the BSC measure every year. Although the data is available from 2008 to 2016, the data formats differ across years, so we had to spend a considerable amount of time cleaning and preprocessing the data to merge the files. After all this dirty work, we can visualize the temporal trajectories of the BSC measure.
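As an illustration of that preprocessing, the sketch below harmonizes hypothetical per-year CSV files into one plan-by-year table. The file names and column maps are made up, and the real CMS layouts vary far more than this:

    import pandas as pd

    # Hypothetical column names per year; the actual CMS files differ more than this.
    COLUMN_MAPS = {
        2015: {"Contract Number": "contract_id",
               "Diabetes Care - Blood Sugar Controlled": "bsc"},
        2016: {"CONTRACT_ID": "contract_id",
               "C15: Blood Sugar Controlled": "bsc"},
    }

    frames = []
    for year, colmap in COLUMN_MAPS.items():
        df = pd.read_csv(f"star_ratings_{year}.csv")        # one CMS file per year
        df = df.rename(columns=colmap)[["contract_id", "bsc"]]
        # Scores appear as strings like "81%" in some years; coerce to numbers.
        df["bsc"] = pd.to_numeric(df["bsc"].astype(str).str.rstrip("%"), errors="coerce")
        df["year"] = year
        frames.append(df)

    panel = pd.concat(frames, ignore_index=True)            # one row per plan-year
    trajectories = panel.pivot(index="contract_id", columns="year", values="bsc")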

[Figure: raw score trajectories of the BSC measure, one line per MA plan]

The chart above shows the raw score trajectories of the BSC measure, where each line represents the scores of the same MA plan over time. Do you see any patterns in the chart? I know. It is questionable whether there is any significant pattern at all.

Step 1: Estimating Improvement Momentum

What is the first (and perhaps most important) step in applying machine learning algorithms? The answer is feature engineering. Without proper features, even top-notch machine learning algorithms fail to produce meaningful predictions. As the famous saying goes, “garbage in, garbage out”.

From our carefully curated public dataset, we extracted thousands of features by engineering temporal characteristics, parent organization structures, enrollment regions, and other relevant Star Rating measures. Among these, we found a set of features that is extremely powerful, which we call “improvement momentum” features. While the theory behind them is quite complex, the basic idea is simple: if an MA plan has been consistently improving its score over the years, the momentum of that improvement is likely to continue. Our feature set quantifies this momentum in various ways.
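Although the exact features are proprietary, a minimal sketch of the momentum idea might look like the following; the three features below are simple proxies, not our actual feature set:

    import numpy as np

    def momentum_features(scores):
        # Toy improvement-momentum features from one plan's yearly raw scores.
        scores = np.asarray(scores, dtype=float)
        deltas = np.diff(scores)                            # year-over-year changes
        slope = np.polyfit(np.arange(len(scores)), scores, deg=1)[0]  # linear trend
        streak = 0                                          # consecutive recent gains
        for d in deltas[::-1]:
            if d <= 0:
                break
            streak += 1
        return {"last_delta": deltas[-1],
                "trend_slope": slope,
                "improving_streak": streak}

    momentum_features([74, 75, 73, 76, 79, 81])
    # -> last_delta 2.0, trend_slope ~1.43, improving_streak 3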

Step 2: Every Plan is Different, Yet Similar

Another secret to improving the raw score prediction is acknowledging that every plan is different, yet similar. There are more than 400 different MA plans out there. Many of them, however, can be grouped into fewer categories by their parent organizations, enrollment regions, growth rates, organization types, temporal patterns, or other characteristics extracted by our custom algorithms. With this approach, you can come up with thousands of different grouping schemes. Among these, we are interested in the schemes that produce homogeneous (or similar) raw score trajectories. Does this process sound laborious? It would be by hand, so we have automated it with a custom machine learning algorithm.
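To give a flavor of one such grouping (much simpler than our actual algorithm), the sketch below clusters plans purely by the shape of their score trajectories:

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for the real plan-by-year score matrix built earlier.
    rng = np.random.default_rng(0)
    trajectories = rng.uniform(60, 90, size=(400, 8))       # 400 plans, 8 years

    # Center each row so plans are grouped by trajectory shape, not score level.
    centered = trajectories - trajectories.mean(axis=1, keepdims=True)
    groups = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(centered)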

After hundreds of meaningful groups are extracted, our algorithm looks at the group-level trends of the raw scores. Since MA plans in the same group share similar temporal trajectories, we can calibrate each plan’s prediction using the predictions of the other plans in its group.
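One simple way to use such groups for calibration is to shrink each plan’s prediction toward its group’s mean prediction. The sketch below is a stand-in for our actual calibration step, and the weight is a hypothetical tuning knob:

    import numpy as np

    def calibrate(pred, groups, weight=0.3):
        # Shrink each plan's prediction toward the mean prediction of its group.
        pred, groups = np.asarray(pred, dtype=float), np.asarray(groups)
        out = pred.copy()
        for g in np.unique(groups):
            mask = groups == g
            out[mask] = (1 - weight) * pred[mask] + weight * pred[mask].mean()
        return out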

Validating Predictive Accuracies

To validate the predictive accuracy of our algorithms, we designed a retrospective experiment. We hold out the raw scores from 2016 and train our model on the data from 2008 to 2015. We then evaluate the algorithms by predicting the 2016 raw scores and comparing them with the true scores in terms of Root Mean Square Error (RMSE). In other words, we simulate predicting the next year’s raw scores. Here is the list of models used in our benchmark; a sketch of the evaluation itself follows the list.

  • Baseline: This model uses the current year’s score as the prediction for the next year’s score. For example, if a plan’s BSC measure was 81% in 2015, the baseline predicts that the measure will remain 81% in 2016.
  • ML.step1: This is our custom machine learning model with the improvement momentum features.
  • ML.step2: This model extends ML.step1 with the grouping calibration technique.
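The evaluation itself is straightforward once the predictions are in hand. A minimal sketch, assuming aligned arrays of 2015 scores, 2016 scores, and a model’s 2016 predictions (the variable names are illustrative):

    import numpy as np

    def rmse(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    # Baseline simply carries the 2015 score forward as the 2016 prediction.
    # rmse(scores_2016, scores_2015)       # Baseline
    # rmse(scores_2016, ml_pred_2016)      # ML.step1 or ML.step2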
[Figure: RMSEs of the baseline, ML.step1, and ML.step2 models]

The chart above shows the RMSEs of the baseline (blue), ML.step1 (red), and ML.step2 (yellow) models. Both ML.step1 and ML.step2 show drastically better predictive accuracy than the baseline.

Next, we calculate cutoffs using the predicted raw scores, applying the same Hierarchical Clustering algorithm that CMS uses to set the cutoffs. We derive two sets of cutoffs, one from the baseline predictions and one from the ML.step2 predictions. By combining the predicted raw scores and cutoffs, we finally derive the predictions for the BSC Star Rating. Since the Star Rating is an ordinal variable, we define our error metric as follows: a prediction counts as a success if the predicted and actual Star Ratings are within one Star of each other, and as a failure otherwise.
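For a concrete, simplified picture of the cutoff step, the sketch below clusters raw scores into five groups with SciPy’s hierarchical clustering and reads the cutoffs off the cluster boundaries. CMS’s actual procedure has additional details that this sketch ignores:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def star_cutoffs(raw_scores, n_stars=5):
        # Cluster raw scores into n_stars groups (a simplified take on CMS's method).
        x = np.sort(np.asarray(raw_scores, dtype=float))
        labels = fcluster(linkage(x.reshape(-1, 1), method="ward"),
                          t=n_stars, criterion="maxclust")
        # In sorted 1-D data the clusters are contiguous intervals, so the
        # cutoffs are simply the first score of each new cluster.
        return x[1:][labels[1:] != labels[:-1]]

    def assign_stars(score, cutoffs):
        return int(np.digitize(score, cutoffs)) + 1        # 1 to 5 Stars

Applying this procedure to both sets of predictions and scoring them with the error metric above, we get the following. The results may surprise you a bit.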

  • Baseline Star Rating Prediction: 21.5% errors
  • ML.step2 Star Rating Prediction: 2.2% errors

Our predictions beat the baseline predictions by almost a factor of ten. Note that we have used only public datasets so far; the performance improves further when combined with your private data sources.

It Is Possible To Accurately Predict Star Ratings

Better strategy starts with better prediction. If you want to know more about your 2017 or 2018 Star Rating predictions, please contact us at info [at] accordionhealth [dot] com.
