Machine Learning in Action for Compass’s Likely-to-Sell Recommendations
Varun D N, Panos Ipeirotis, Foster Provost
[Thanks to Compass’s NYC_AI & CRM teams!]
We previously wrote about Compass’s Likely-to-Sell recommendations, the first product in a line of AI-powered suggestions for helping real estate agents to increase their business systematically and efficiently. These Likely-to-Sell (LTS) recommendations guide agents in the nurturing of their social and professional relationships.
In this post, we will dig more deeply into the “AI-powered” part. Our goal is not to present fancy math behind the AI algorithms but to describe the often-overlooked details involved in building a real, business-enhancing AI product. Many of the pieces are not what one typically finds in published discussions of AI for business, or even in most machine learning or data science classes.
The first half of this post walks through the data science process, describing the problem formulation and the more-complicated-than-usual machine learning (ML) setup. We then present our initial look at evaluation, which goes beyond the evaluations we see in ML classes (much more on that in a future post).
In the second half of this post, we show several examples of homes’ actual LTS predictions and examine why these homes are deemed likely to sell, which gives insight into “how the AI thinks” and, in particular, shows the substantial non-linearity of the learned model.
Machine learning for Likely-to-Sell estimations: Problem formulation
We are going to talk about “supervised” machine learning, which means (using machine learning jargon) that we will take advantage of instances of homes that are labeled — i.e., we know whether they have or have not sold. We will use this labeled data in the machine learning phase to produce a predictive Likely-to-Sell (LTS) model that will then be used to estimate the likelihood that a current home will sell soon.
To implement a supervised machine learning solution for the LTS recommendations, we followed a fairly standard process. The first steps in the process focus on the most crucial part: precise problem formulation.
Define the target variable: Whether a home will sell soon (or not)
What does it mean for a home to have sold? The key will be to have data about the home at a particular point in time. For that specific time point, we can ask whether the home sells within some future period, say, the next 12 months. That will be the ground truth for our “target variable.”
Of course, if we were to predict today the likelihood of a home selling, we would not know our target variable’s ground-truth value until this point next year. This introduces a complication for creating labeled training data and contrasts with many machine learning applications that we read about, where we know the actual value of the target variable very soon after we make predictions. Since we need a twelve-month window to tell whether a home has sold (or not), we train our models using some (past) time window. Specifically, we pretend that we are at a time point in the past (say, January 1, 2019) and label as positive instances all the homes sold within the subsequent 12 months. All the other homes are labeled as negative.
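To make the labeling concrete, here is a minimal sketch, assuming a pandas DataFrame `sales` of (property_id, sale_date) transaction records and a `homes` frame of properties; the names are illustrative, not our actual schema.

```python
import pandas as pd

# Pretend "today" is January 1, 2019; the label window is the following 12 months.
reference_date = pd.Timestamp("2019-01-01")
window_end = reference_date + pd.DateOffset(months=12)

def label_home(property_id, sales: pd.DataFrame) -> int:
    """1 if the property sold within 12 months of reference_date, else 0."""
    dates = sales.loc[sales["property_id"] == property_id, "sale_date"]
    return int(dates.between(reference_date, window_end, inclusive="left").any())

homes["sold_within_12_months"] = [
    label_home(pid, sales) for pid in homes["property_id"]
]
```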
Why 12 months? In real estate, there is significant seasonality in sales activity. Having a 12-month window allows us to sidestep that complication in the early stages of modeling. A 12-month window also makes sense in the context of our particular business application. As we described previously about giving LTS recommendations in the Compass Customer Relationship Management (CRM) system, the goal of our Likely-to-Sell recommendations is to help Compass agents engage systematically with specific contacts, providing both guidance and discipline. The recommendations focus on a small number of connections that are most likely to generate business soon. Selling within the next year captures both homeowners ready to sell now and those who will be ready to sell in several months, at which point our agents would like to be top of mind.
Define the features: What correlates with homes selling soon?
Defining our target variable is just the first step in the machine learning problem formulation (and you machine learning aficionados will know that the problem formulation is iterative — returning to reformulate as we learn more is standard operating procedure).
Next, we ask: how exactly are we going to formulate data on the home, such that the data will be predictive of selling in the next 12 months? In machine learning jargon, this is “feature engineering.” The “features” are the data points on the home that the algorithms will draw on for their predictions.
Others have cataloged the fundamental drivers of selling homes. Our blog post introducing our LTS recommendations presented some of the general categories of features that our modeling considers:
Here is a partial list of factors that the likely seller modeling currently considers (we add new features continually):
- Details about the property (bedrooms, bathrooms, square footage, etc.)
- Time since the last sale and frequency of past transactions for the property
- Home value appreciation; home value compared to others in the neighborhood
- Mortgage status and estimated equity held in the home
- People movement data (percent of owners, renters, how often they move)
How do we go from these high-level factors to specific, useful features?
Don’t we need to create features capturing the drivers of home selling? Feature engineering depends on two things: (1) understanding the underlying phenomenon being modeled (why homeowners sell), and (2) understanding what data might be acquired that could correlate with future selling. Note that we said “correlate” and not “cause.” While we do want to do our best to understand the causal drivers of selling and to engineer features based on that causal understanding, it is unnecessary for the model to capture causality in order to predict. As an example, a homeowner outgrowing the house may be one causal driver of selling. However, we do not use data on family composition, so we won’t know that the homeowner has outgrown the home. On the other hand, a homeowner having lived in a starter home for five years may correlate strongly with a substantially increased likelihood of selling, even though for many homeowners, the simple fact of being in the home is not a causal driver of wanting or needing to sell. The difference here is subtle but is the basis for successful machine learning, especially when data on causal drivers are not available or observable. Below, in the examples showing what our model uses to predict a high likelihood of a sale, we will see this in action.
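To make this concrete, here is a minimal sketch of how one such feature might be engineered, assuming a pandas DataFrame `homes` with illustrative column names (`last_sale_date`, `region`); the percentile scaling mirrors how feature values are normalized in the SHAP examples later in this post, where, for example, years_since_sale = 0.89 means the 89th percentile for the region.

```python
import pandas as pd

def add_years_since_sale(homes: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Add a percentile-scaled 'time since last sale' feature, computed as of `as_of`."""
    out = homes.copy()
    # Raw feature: how many years the current owner has held the home.
    out["years_since_sale_raw"] = (as_of - out["last_sale_date"]).dt.days / 365.25
    # Percentile-scale within each geographic region, so that 0.89 reads as
    # "89th percentile for the region."
    out["years_since_sale"] = out.groupby("region")["years_since_sale_raw"].rank(pct=True)
    return out
```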
But didn’t deep learning kill feature engineering? As “deep learning” has become increasingly popular, some people tend to think that “feature engineering is dead.” This is not true, except in limited settings. Deep learning does indeed allow learning complex combinations of raw features that we already have in our data. For tasks where all the necessary information is included in the raw data (e.g., “detect objects in a given image”), with enough training data, deep learning can indeed learn representations that are much better than hand-crafted features. However, for many tasks, feature engineering also includes modeling the relevant context for a particular domain, acquiring the right data, and incorporating domain knowledge into the problem formulation, which is especially important when we do not have a massive amount of training data. For LTS modeling, feature engineering includes locating data sources that provide information about the fundamental drivers of selling homes and building these sources into our training and inference data.
In a separate blog entry, we will explain in detail a (causal) “generative” model of the fundamental influences on a home’s likelihood to sell and the relationship between this generative model and feature engineering for machine learning.
Training setup in more detail: temporal holdout formulation
Note that standard cross-validation and other “typical” holdout-evaluation methodologies are not well suited to this problem, because here the target label is not realized until up to a year after the predictions have been made, while for testing, the features should be computed before the evaluation period begins. Accordingly, the values of the features for training should be calculated at the beginning of the training period, and the values of the features for testing should be calculated at the beginning of the testing period. Furthermore, in practice, the models will be learned from sales in one year and then used to predict sales in the next, and in machine learning it is generally best for testing to simulate use as closely as possible.
Figure 1. The temporal relationships between the training periods, test periods, and use (inference) periods.
Let’s talk about training and testing. Our test set for each training year uses as the prediction target whether or not a home sold in the year following the year of training. To test before 2020, we evaluate the quality of the 2018 model by examining how well it can predict whether a property will be sold in 2019. Specifically, the features were calculated for training as if the date were January 1st, 2018, and the target for training is whether the home sold within 2018. For testing, we calculate feature values as if the date were January 1st, 2019, and we are predicting whether a sale will occur in 2019.
For actual application (use) in 2020, we retrain the model on the 2019 data and then recalculate the features for use in 2020.
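In code, the temporal setup of Figure 1 might look like the following sketch; `featurize` and `label` are hypothetical helpers along the lines sketched above, and the model class is the kind of tree ensemble we discuss below.

```python
from sklearn.ensemble import RandomForestClassifier

# Training: features as of Jan 1, 2018; target = sold during 2018.
X_train = featurize(as_of="2018-01-01")
y_train = label(start="2018-01-01", end="2019-01-01")
model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Testing: features as of Jan 1, 2019; target = sold during 2019.
X_test = featurize(as_of="2019-01-01")
y_test = label(start="2019-01-01", end="2020-01-01")
test_scores = model.predict_proba(X_test)[:, 1]

# Use (inference) in 2020: retrain on the 2019 window, then score
# features recomputed as of Jan 1, 2020.
```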
The figure hides an additional major complication. Different geographic areas exhibit very different home-selling behavior. For example, even the base rate of selling across large regions (say, NYC vs. San Francisco) can vary by a factor of two. In theory, given enough training data and including features about the geographic area, the machine learning could learn these differences. However, instead of trusting the ML to learn this, we separate the modeling into several dozen different geographic areas and learn models separately for each region. Therefore, there are several dozen replicas of the training process depicted in Figure 1. This will become critical when we discuss evaluation and model comparison, as:
(1) There is never “a model”; there are several dozen, one per geographic region.
(2) One model “version” or one training procedure likely will not do better uniformly across all of the dozens of regions.
More on that later.
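A sketch of what that per-region training looks like in practice, assuming a training DataFrame `train_df` with a `region` column and an illustrative `FEATURES` list (the names are ours, not the production code’s):

```python
from sklearn.ensemble import RandomForestClassifier

# One independently trained model per geographic region, keyed by region name.
models = {
    region: RandomForestClassifier(n_estimators=500, random_state=0).fit(
        region_df[FEATURES], region_df["target"]
    )
    for region, region_df in train_df.groupby("region")
}
# At inference time, each home is scored by its own region's model.
```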
Can ML capture enough signal about LTS to be useful at all?
Machine learning projects have substantially higher “science uncertainty” than most other non-research IT projects that companies routinely undertake. We cannot say with certainty that we will achieve a level of predictive accuracy sufficient to achieve the product goals. This is separate from the usual product uncertainty and engineering uncertainty (which machine-learning-based products also share). What’s more, in a new domain, we rarely have enough in-depth knowledge to know the best problem formulation, features to use, or ML algorithm to apply.
We will describe the machine learning algorithms used in more detail in a later post. For now, we can summarize as follows: linear modeling did not work well; non-linear modeling (ensembles of trees, such as Random Forests) worked well.
Walking through some real examples below, we will see the non-linearity in action more than once.
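To illustrate the kind of comparison behind that summary, here is a hedged sketch using the temporal holdout variables from the sketch above; logistic regression stands in for the linear modeling, and the actual models and hyperparameters we deployed are not shown here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

for name, clf in [
    ("linear (logistic regression)", LogisticRegression(max_iter=1000)),
    ("non-linear (random forest)", RandomForestClassifier(n_estimators=500, random_state=0)),
]:
    clf.fit(X_train, y_train)  # temporal training window, as in Figure 1
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```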
Figure 2 shows an example of the Areas under the ROC curve (AUC) for five major deployed modeling iterations (“versions”). We will describe the iterations in more detail in a later blog post. The point here is to illustrate the iterative improvement we just discussed. Specifically, moving left to right in the figure, each box-and-whiskers plot shows the results of a subsequent successful modeling iteration; each version represents some result from R&D reaching sufficient maturity to be productized. So V1 was the first version that was “useful enough” to deploy. Each subsequent version was more useful. You might notice: hey, V4 doesn’t actually look to be better! It has lower AUC and higher variance. Recall what George Box taught us: all models are wrong, but some are useful. Model V4 was more useful not because it was more accurate, but because we figured out how to include large sets of previously excluded properties without sacrificing accuracy; it covered substantially more homes and therefore provided many more useful recommendations.
Figure 2. The Areas under the ROC curve (AUCs) as the models evolved across five major deployments (“versions”). There are dozens of models for each version, one for each geographic area, and thus a distribution of AUC values.
Wait. We learn models from a past time period and apply them to the future. Are the models stable over time?
A natural question arises: Do models trained on past data perform well when asked to make predictions for future years?
To test this out, we train multiple models using different years as training periods (the sales from 2012 to 2019). This results in 9 models for each of the several dozen geographic areas of interest. Each of these models is trained in periods increasingly far from the testing period. If the underlying factors that drive home sales change significantly, then old models will not represent the current reality, and their performance will suffer. If the underlying factors remain stable, the performance of the models will be stable as well.
Figure 3 below shows how the AUCs of the models change when they are evaluated over the most recent testing period. To allow for easier visual comparison, we normalize the performance of the most recent model to 1.0 and show each past model’s performance as a ratio of its AUC to the AUC of the most recent model. Notice that the performance degrades gracefully as the models get older and older, giving us concrete evidence that our evaluations will not substantially overestimate our models’ performance when we use them to predict sales in the following year. (And then COVID happens, throwing a wrench into any form of “business as usual,” but that deserves a separate blog post.)
Figure 3. The stability of learned models over time. Models learned in each year are tested on data from 2019, and the relative AUCs (% of the 2019 AUC) are reported in the heatmap. By and large, the models are quite stable, even going back several years.
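The normalization behind Figure 3 is a simple ratio; here is a sketch under assumed names (a `models_by_year` dict of trained models plus 2019 holdout data), not our actual evaluation code.

```python
from sklearn.metrics import roc_auc_score

def auc_on_2019(model):
    return roc_auc_score(y_test_2019, model.predict_proba(X_test_2019)[:, 1])

baseline = auc_on_2019(models_by_year[2019])  # most recent model, normalized to 1.0
relative_auc = {year: auc_on_2019(m) / baseline for year, m in models_by_year.items()}
```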
LTS prediction in action: What are the learned models actually doing?
So, let’s look in detail at some examples to understand what the learned models are doing when estimating the likelihood of selling. Here we will look at specific predictions of models for the calendar year 2019. Recall from above what that means: for prediction (inference), we simulate the state of the home on January 1, 2019, and featurize it accordingly (the model having been trained on 2018; see Figure 1 above). Then we can observe whether each home actually did sell in 2019. These are not our current production models, but the behavior we see in these examples is representative.
In the examples below, we will use SHAP (SHapley Additive exPlanations) analyses to see which features of the properties had the most influence on the model output scores, and we will attempt to connect these with some (admittedly speculative) causal reasons that may drive the sales of the particular properties.
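For readers who want to try this kind of analysis, here is a minimal sketch using the shap library, assuming a fitted tree-ensemble `model` and a one-row DataFrame `home` holding that property’s feature values; depending on the shap version, the return shapes may differ slightly.

```python
import shap

explainer = shap.TreeExplainer(model)
# For a binary classifier, index [1] selects the positive ("will sell") class.
shap_values = explainer.shap_values(home)[1]

# Red segments push the score up; blue segments pull it down,
# as in the figures that follow.
shap.force_plot(explainer.expected_value[1], shap_values, home)
```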
Let’s start with a home that got a very high estimated likelihood of selling in 2019.
Example LTS Prediction #1
Our first example is for a 4 bedroom, 2.5 bath home in the Atlanta area. This home’s model output score was 0.48, which was substantially above the threshold (around 0.12) for classifying this home as having a HIGH likelihood of selling in the next 12 months.
The figure below shows a SHAP analysis of which features of this home had the most influence on the model output score. Specifically, the features closer to 0.48 in the chart have more influence on the score. The amount of influence is represented by the size of the red segment (and quantified by the scale above the segments). Below the largest segments are the feature names and their values for this particular home. (Ignore the blue for the time being; we will discuss it in the following examples.) One detail is necessary for interpretation here: the values of the variables are “normalized” with percentile scaling. Therefore, “years_since_sale = 0.89” means that the home has had the same owner for a relatively long time (in the 89th percentile) compared to all the other homes in the region.
The five most influential features for this home’s high prediction essentially reveal three things:
- the property is on the large side (both bedrooms and bathrooms are around the 60th percentile for the region),
- it’s been a pretty long time since it was purchased (years_since_sale is at the 89th percentile for the area), and
- the price paid for the home is high, on a price-per-square-foot basis, relative to all the homes sold across all years.
The second and third factors seem to be at odds, as we expect homes’ value generally to appreciate. However, this home was purchased in 2006, at the peak of the early-century housing boom. Prices have only just recovered to those levels in the past year or two.
So, what does that mean in terms of what the model is “getting at” here? We can’t know for sure, and we should be cautious about thinking that an AI model reasons the way we might. But hey, it’s a blog entry. Let’s speculate!
The fact that the home was purchased in 2006 relates to a common reason why many homes stay off the market: people don’t like the idea of selling for less than they initially paid for their home, OR they cannot afford to sell because they are underwater on their loan. Thus, the model may well have learned that homes bought in 2006 (or thereabouts) no longer are “losses” in this sense, and therefore the homes become significantly more likely to sell. (This reason is exacerbated if the homeowner has built up little equity in the home; the model we are evaluating here does not take estimated equity position into account.)
The fact that 2006 is around 12 years ago provides information on home selling in a completely different way. If you moved to the neighborhood to put a kid in the local school system, this is about the point that the home will no longer be valuable for that reason. If there is any economic stress or desire to make alternative investments, we may see homes listing soon after the kids graduate.
Of course, this is just an estimation based on property sale data, not an identification of the type of homeowner. It is complicated by the fact that you might have moved before first grade, or after first grade, or you might have multiple children; but again, let’s not fall into the trap of thinking that the AI reasons the way we do. The model estimates the likelihood of a sale; it does not say whether some particular family actually will sell (which we wouldn’t know even if we did know all of those other things anyway). So, if, in fact, 12 years, plus or minus, seems to be a cusp where the likelihood of sale changes for homes like this, that may be sufficient to increase the likelihood of selling substantially.
And note the “homes like this.” Here we get to the most influential features, according to the SHAP analysis. This is a bigger-than-average home — precisely the sort of home from which owners may be interested in downsizing after a number of years in the home, after various potential life changes.
Oh, by the way. This home did indeed sell in the subsequent year (2019).
Example LTS Prediction #2
Let’s look at another example to show that the high-likelihood homes identified by the models are not all deemed high likelihood for the same reasons. Example #1 showed a home whose larger-than-average size was the most influential factor for the model giving it a high likelihood-to-sell score.
Let’s take a look at a small home in the same (Atlanta) geographic region that nonetheless also was estimated to have a HIGH likelihood of selling in the next year. In this case, the model gave the home a score of 0.18 (again, the threshold for HIGH LTS in this region and year is around 0.12). The SHAP analysis below shows the influential features for this 3 bedroom, 1.5 bath home. Having only one-and-a-half baths puts this house in the lowest decile for the number of bathrooms, and having 3 bedrooms is below average in this region. Also, notice that the home’s square footage is in the lowest decile (it’s only 1100 square feet). All three of these small-size features drive the likelihood of sale up. This illustrates the considerable non-linearity in the predictive model: in the first example, the large number of bathrooms is the top factor pushing the score up; here, the small number of bathrooms is the top factor pushing the score up!
The other pair of features driving the high score is that the home was purchased a relatively long time ago (96th percentile), and the price paid is relatively low for the region (23rd percentile).
Those are the technical reasons the model gave the home a high score. Once again, we are left to speculate on what the model is “thinking” here. As the house sold more than 12 years ago, part of the speculative argument from above (that the kids may be done with the local schools) would be relevant here as well; on the other hand, the “downsizing” part of the argument does not hold. However, there are other potential reasons that sale likelihood can track with time-in-home. For a smaller home, the homeowners may have advanced in their careers to the point where they are ready for a larger home. Or at least are ready for a place with two bathrooms! (Recall that the small number of bathrooms is the #1 factor in the HIGH likelihood of selling.)
And by the way. This home did indeed sell in 2019.
Example LTS Prediction #3
Let’s examine a third example. Consider the following home, also near Atlanta. It’s a small, 3 bedroom, 1 bath ranch with a carport. The model assigned this home a score of 0.11, which resulted in a classification of a MEDIUM likelihood of selling in the next 12 months (recall, MEDIUM means it is in the top quartile of likelihood to sell, but not in the top decile).
The figure below shows the SHAP analysis for this home, with some different model-in-action dynamics. In particular, notice that this SHAP analysis has red feature segments and blue ones, mainly one large blue piece for bathrooms=0.04. Just as the red segments show the most influential features in driving up the model score for a particular home, the blue segments show the most significant features in pulling down the score. We can interpret the big blue bathrooms segment as saying: if this home had had the average number of bathrooms, instead of a small number of bathrooms (one), then the model would have given it a score of 0.146 rather than 0.11 (and therefore it would have been classified as HIGH LTS). Note that this feature value is not all that much different from the bathrooms feature value in the previous example: bathrooms = 0.04 here and bathrooms = 0.09 in Example #2; both are in the lowest decile of the number of bathrooms. In both cases, the number of bathrooms has the largest effect on the model score. However, here having very few bathrooms drags down the model score significantly. In Example #2, having very few bathrooms increased the score significantly. This provides another clear demonstration of the model’s non-linearity. The effect of having very few bathrooms on the estimated likelihood of selling is different depending on the rest of the home context.
So, what features here increase the model score? The most influential is “tract_ratio_ppsf = 0.033”. The tract_ratio_ppsf represents how the price-per-square-foot (ppsf) of this home (from its most recent prior sale) relates to the ppsf of other homes in the same census tract. The value of 0.033 means that this home sold for a very low price (third percentile in that census tract). From another red feature in the SHAP analysis, we see that years_since_sale = 0. Again, that does not mean that the home just sold last year; it means that the home sold in the most recent time period in the data (the zeroth percentile; note that homes sold last year are removed from consideration). The other red features effectively are different proxies for these same factors.
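As an aside, one plausible construction of a feature like tract_ratio_ppsf takes only a couple of lines (column names illustrative, as before):

```python
# Price per square foot from the most recent prior sale, then the home's
# percentile rank among homes in the same census tract.
homes["ppsf"] = homes["last_sale_price"] / homes["square_feet"]
homes["tract_ratio_ppsf"] = homes.groupby("census_tract")["ppsf"].rank(pct=True)
```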
This suggests that the model estimates a high likelihood to sell for this home because it sold recently for very little (relatively speaking). We can speculate that here the model has captured either a distressed sale or a flip-in-progress, both of which may result in a subsequent sale soon after that. We can also see that the home having only one bathroom substantially decreased the model’s estimated likelihood of it selling in the next year.
As with the two examples above, this home indeed did sell in the next 12 months.
Example LTS Prediction #4
To drive home the extreme non-linearity of the learned model, let’s take a quick look at another home in Atlanta, which, like the home in Example #3, has one of the highest values for the feature geo_ppsf (see the SHAP analysis below). However, check it out: while that feature value contributes substantially positively to the Example #3 prediction, here it contributes substantially negatively!
Similarly, years_since_sale = 0 as in Example #3, but there it was one of the top positive contributors; here, it is the principal negative contributor. This illustrates that in a different context, the same features can have dramatically different effects. Here these features (in blue) pulled the score all the way down to 0.05. Without the blue “pull down,” the home would have also been above the threshold for a MEDIUM score. The actual final score of 0.05 was below the threshold for any LTS recommendation.
This home did not sell in the subsequent 12 months.
Example LTS Prediction #5
One final example — not to miss! Our learned model scored the next home as being very likely to sell. A property in Loveland, CO received one of the highest estimated sale likelihoods in 2018 (based on the 2017 model this time). The SHAP analysis helps us to understand what features of this home contributed to the high score:
In particular, see here that the model drew on yet a different set of features, including that the probability of a birth in this census tract in the prior year was very high (96th percentile). And we know that outgrowing the home is a prime reason for moving.
Perhaps we should not be surprised about the high probability of birth in Loveland, though!
What’s next?
As you probably knew coming in, applying machine learning to build high-impact business applications is complicated. We hope that this installment of our blog post series has shed more light on some of the details, but there is a lot more to discuss. Next, we will dig deeper into the broad evaluation of our Likely-to-Sell recommendation models.
Stay tuned!