How We Estimated Senior Housing Costs for 2,000 Cities and Towns at A Place for Mom

When you think of state-of-the-art machine-learning algorithms, I bet you don’t think about the senior living industry. At A Place for Mom, the nation’s largest senior living referral service, we’re changing that.

In April, we launched our Senior Living Cost Index, the first free data source where seniors and their families can compare senior housing costs across more than 2,000 cities and towns in the U.S. We couldn’t have built it without solving some challenging statistical and machine-learning problems.

Warning: This post gets a little bit wonky.

The Statistical Challenges

Small local sample sizes — To build cost estimates for cities, metropolitan areas and states, we used a sample of over 50,000 senior move-ins to our senior living community partners. While 50,000 move-ins sounds like a large sample, it really isn’t once you break it out by city or even some states.

Lack of data for some care types in some areas — What’s more, we do not have partners for some types of senior care in some cities, but we still want families to have a sense of what those care types might cost in those areas.

The median or average cost ain’t enough — Senior living costs have a broad range, even in the same city. To help seniors and their families understand their options, we should show them the range of costs that the majority of families end up spending, not just a point estimate. The breadth of the cost range may depend on some of the same factors — such as location or income — as the point estimate. That means we need to estimate full probability distributions of senior living costs at the city level using geographically sparse data.

Effects aren’t always linear or important — It makes sense to include median household income in any model of real estate costs. But our early graphical analysis led us to believe that the relationship between income and senior living costs is curvilinear and somewhat complex. We also found that it made sense to include varying intercepts for geographic groupings (e.g., city, Metropolitan Statistical Area, state), but we didn’t know how many geographic levels were necessary to model the move-in charges adequately.

Spotty geographic representation — The biggest issue is that our sample is not representative at the city or even state level. Election forecasters run into this same issue when using state polls to forecast national elections.

So to review, we’ve got small local samples, a target range that is difficult to estimate, a complex model with unknown structure and a sample that is not representative of the distribution of seniors across cities and states. What to do?

Multilevel Regression and Poststratification to the Rescue

Multilevel regression and poststratification (MRP) was developed to solve small-area estimation problems with geographically non-representative samples. The first step of MRP is to build a multilevel regression model with both top-level and varying group-level effects. The second step of MRP is to use that model to make posterior predictions for each possible combination of the predictors of interest in your study. The final step of MRP is to build a weighted average of these estimates at the desired geographic level, with weights proportional to the share of the population represented by each combination of predictors. This final step is called poststratification.
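
Here’s a minimal sketch of the poststratification step in base R, with hypothetical cell-level predictions and population weights standing in for our real tables:

```r
# One row per cell (e.g., zip code x care type), with a model prediction
# and the population count used as the poststratification weight.
cells <- data.frame(
  city = c("Seattle", "Seattle", "Tacoma"),
  pred = c(4200, 5900, 3800),   # hypothetical cell-level predictions ($)
  pop  = c(12000, 3000, 9000)   # hypothetical population weights
)

# Geographic estimate = population-weighted average of the cell predictions
sapply(split(cells, cells$city), function(d) weighted.mean(d$pred, d$pop))
```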

To make this more concrete, let’s talk about the predictors we used in our multilevel model. Our model includes effects for inflation-adjusted zip-code-level median household income (from the 2014 American Community Survey Five-Year Estimates) and care type. We also included varying intercepts for city, Metropolitan Statistical Area, state, and US Census Region. After building our model (which we describe in more detail in the next section), we made move-in-charge predictions for each care type in each of the zip codes where we had any move-ins. Then we looked up the 55+ population in each zip code to construct the weights. Finally, we took the weighted average of our move-in-charge predictions in a given city, metro or state. By the way, we got median household income and 55+ population statistics from the Census API via the acs package for the R statistical programming language.
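
As a concrete illustration, fetching those zip-level income figures looks roughly like the sketch below. B19013 is the ACS median-household-income table; the API key and wildcard geography are hedged stand-ins rather than our production code.

```r
library(acs)

api.key.install("YOUR_CENSUS_API_KEY")   # hypothetical placeholder key

# All ZIP Code Tabulation Areas; B19013 is median household income
zips <- geo.make(zip.code = "*")
income <- acs.fetch(endyear = 2014, span = 5, geography = zips,
                    table.number = "B19013")

head(estimate(income))   # point estimates, one row per zip
```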

Boosted Generalized Additive Models of Location, Scale and Shape are Freaking Awesome and Everyone Should Use Them

The MRP procedure helps solve our small-area estimation problem, but that leaves open the challenges of a complex unknown model structure, plus our desire to estimate the full distribution of move-in charges, not just point estimates. Thankfully, there’s an R package called gamboostLSS that can solve both problems simultaneously.

The “gam” in “gamboostLSS” stands for generalized additive models, which are an extension of generalized linear models that allow for smoothing spline effects in addition to linear effects.
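
For intuition, here’s a toy mboost-style formula (the data frame and column names are hypothetical): bols() contributes a linear base learner and bbs() a P-spline base learner, so the boosted model can keep an effect linear or let it bend as the data demand.

```r
library(mboost)   # gamboostLSS builds on mboost's base learners

# Hypothetical data frame `movein_data`; default Gaussian() family.
# bbs() can capture the curvilinear income effect described earlier.
fit <- gamboost(log_charge ~ bbs(median_income) + bols(care_type),
                data = movein_data)
```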

The “boost” stands for “component-wise gradient boosting”, a method of estimating and selecting the effects of predictors in statistical models. As Hofner et al. put it:

The key idea of statistical boosting is to iteratively fit the different predictors with simple regression functions (base-learners) and combine the estimates to an additive predictor.

The boosting capabilities of the gamboostLSS package are based on the mboost package. It turns out that aggregating many simple models (called “base learners”) is a great way to navigate the bias-variance tradeoff.

The “LSS” stands for “location, scale and shape”, meaning that the package simultaneously models each of the parameters of the likelihood (say, a normal distribution), including a location parameter (e.g., the mean of a normal distribution), a scale parameter (e.g., the standard deviation) and a shape parameter (e.g., a zero-inflation parameter in a zero-inflated negative binomial distribution). The LSS strategy comes from the amazing gamlss package, which boasts one of the most beautiful and informative websites of any R package.

Another feature of the gamboostLSS package is its random-effect and random-slope base learners. We used separate random-effect base learners for city, Metropolitan Statistical Area, state, and US Census Region, and let the model decide how to weight their importance.
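
Putting those pieces together, a model in the spirit of ours might look like the sketch below. The data frame and column names are hypothetical, while bbs(), bols(), brandom(), and GaussianLSS() are the actual gamboostLSS/mboost building blocks.

```r
library(gamboostLSS)   # also loads mboost's base learners

# Same additive predictor for the mean (mu) and standard deviation (sigma)
# of log move-in charges; brandom() adds ridge-penalized random intercepts.
form <- log_charge ~
  bbs(median_income) +                 # smooth income effect
  bols(care_type) +                    # care-type effect
  brandom(city) + brandom(msa) +       # geographic random intercepts
  brandom(state) + brandom(region)

fit <- gamboostLSS(list(mu = form, sigma = form),
                   families = GaussianLSS(),
                   data = movein_data,                  # hypothetical
                   control = boost_control(mstop = 1000))
```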

To use the gamboostLSS package, we had to find a likelihood suitable for the move-in-charge data that was also from a model family supported by the package. Happily, move-in charges are strictly positive with a long right tail, which suggests a log-normal distribution. To check, we compared the quantiles of the log-transformed move-in-charge data to the quantiles of a normal distribution with the same mean and standard deviation. They were basically identical.
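
That check is a few lines of R (the column name is a hypothetical stand-in):

```r
# Compare empirical quantiles of the log charges to the quantiles of a
# normal distribution with the same mean and standard deviation.
x     <- log(movein_data$charge)
probs <- seq(0.05, 0.95, by = 0.05)
cbind(empirical = quantile(x, probs),
      normal    = qnorm(probs, mean = mean(x), sd = sd(x)))
```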

So we fit a gamboostLSS model from a Gaussian family to the log-transformed move-in charges, tuning the hyperparameters that control the number of boosting iterations for the mean and standard-deviation sub-models. Using the fitted model, we predicted the mean and standard deviation for each zip code, then took the population-weighted averages of these two parameters within each desired geographic area. Using the weighted-average mean and standard deviation, we computed the 95% prediction intervals of the log-transformed data, then exponentiated to get back to the original scale. We also used the conditional distribution parameters to simulate the median move-in-charge figures.
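
A hedged sketch of that last mile, assuming `fit` is the model from the earlier sketch and that the weighted-average parameters for one geography are already in hand (the numbers below are made up):

```r
# Cross-validate the number of boosting iterations (mstop) for the mu and
# sigma sub-models jointly, then set the model to the optimum.
cv <- cvrisk(fit, grid = make.grid(max = c(mu = 2000, sigma = 2000)))
mstop(fit) <- mstop(cv)

# Population-weighted parameters for one geography (hypothetical values):
mu_hat    <- 8.25   # weighted-average mean of log charges
sigma_hat <- 0.40   # weighted-average sd of log charges

# 95% prediction interval on the log scale, exponentiated back to dollars
exp(qnorm(c(0.025, 0.975), mean = mu_hat, sd = sigma_hat))

# Simulated median move-in charge
median(exp(rnorm(1e5, mean = mu_hat, sd = sigma_hat)))
```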

So that’s how we did it. Hard to believe how much work went into making our city-search widget possible. Try it out for yourself.

What’s Next?

When it comes to the Senior Living Cost Index, we’ll continue to improve our model, experimenting with different types of models as well as with added base learners. We’d also eventually like to have more robust cost-growth estimates at the city level, which would add a time-series-analysis component to the task, with all its complexities. We also want to do a better job cross-validating the model so that we can show people how accurate it is at different geographic levels. Finally, we should do some sensitivity analyses, specifically around the model family and transformations. For example, we could fit a lognormal regression rather than a Gaussian regression on log-transformed data (sketched below). We could also experiment with different population-weighting schemes, since most of our residents are 65+, not 55+. And if we included an age-group effect in our model, we could compute weighted averages for each care-type/age-group/zip-code combination, which may lead to better predictions.
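
To make the lognormal sensitivity check concrete, here’s a minimal sketch, assuming gamboostLSS’s as.families() bridge to the gamlss.dist families (LOGNO is gamlss’s log-normal); the formula and data frame reuse hypothetical names from the earlier sketches.

```r
library(gamboostLSS)
library(gamlss.dist)   # provides the LOGNO (log-normal) family

# Model the raw charges with a log-normal likelihood instead of
# log-transforming first and using a Gaussian likelihood.
form_raw <- charge ~ bbs(median_income) + bols(care_type) +
  brandom(city) + brandom(msa) + brandom(state) + brandom(region)

fit_logno <- gamboostLSS(list(mu = form_raw, sigma = form_raw),
                         families = as.families("LOGNO"),
                         data = movein_data)   # hypothetical data frame
```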

This is just the beginning of data science at A Place for Mom. We also built a model that predicts how long it takes to move into senior living once you call one of our Senior Living Advisors. Last quarter, we released our first Family Quality of Life Survey, which uses state-of-the-art sample-matching algorithms to compare quality of life between families who have moved to assisted living and those still looking for care. This quarter, we’re looking at the relationship between senior housing costs and consumer ratings of senior living communities on SeniorAdvisor.com. Next year, we’ll update the Senior Living Cost Index, making it more user-friendly and more accurate.

So stay tuned.

Originally published at www.aplaceformom.com on October 12, 2016.