Framing Feature Engineering for Machine Learning: A Generative Model of Homeowners’ Likelihood to Sell

Published in

Compass True North

May 10, 2022

Foster Provost & Panos Ipeirotis

TL;DR: We often hear about the 80/20 rule in machine learning: 80% of the work is getting the right data in order. Having a generative or causal model in mind for “business understanding” helps guide us through the vast data jungle. Such a model can direct our investments in data acquisition and feature engineering, so that these (monetary or engineering) investments generate positive returns.

Note that in this blog post, we will talk about features that could be engineered to estimate a home’s likelihood of selling. We are not saying that we actually use any of those features (nor that we should use any of them).

Feature engineering and data acquisition

Contemporary discussions of machine learning (ML) often focus on successes at image understanding, natural language processing, and game playing. These applications can give us a skewed view of feature engineering. In particular, these are applications where feature engineering is necessary — done either manually or by complex machine learning algorithms and architectures. However, in most cases, for these applications, all the data that is necessary is present: the current game board, the image itself, the text passage.

Many business applications do not exhibit this characteristic; we do not have all the relevant or useful data when we start our work. Part of the work of doing machine learning is deciding on what data to use or what data to acquire, possibly at a cost — a monetary cost, an analysis cost, an engineering cost, or all of those. For this sort of application, it is crucial to make sure that we do not fall into the trap of thinking of machine learning as “mining the data.”

“Creating the data” is often where we need to start.

Let’s think of data as an asset that we would like to (try to) get value from via machine learning. Viewing data as an asset can change our perspective: as with other assets, we have some already that we possibly could get value (“return”) from. But we also can invest in assets that we don’t currently hold. We then have a subproblem to solve: what data should we invest in? Can we (somehow) estimate that we will get sufficient return from a new data source, to compensate for the cost we will incur to acquire it — and to engineer features from it?
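To make the "data as an investment" question concrete, here is a minimal sketch of the arithmetic involved. Everything in it is a hypothetical illustration: the `DataSource` fields, the source names, and the numbers are assumptions made up for this example, and in practice the expected value gain is the hard part to estimate.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """A candidate data investment (all fields are hypothetical estimates)."""
    name: str
    acquisition_cost: float      # e.g. licensing or purchase cost
    engineering_cost: float      # cost to build and maintain features from it
    expected_value_gain: float   # estimated value of the resulting model improvement

    def expected_net_return(self) -> float:
        # Return must compensate for both acquisition and feature engineering.
        return self.expected_value_gain - (self.acquisition_cost + self.engineering_cost)


def rank_by_net_return(sources):
    """Order candidate sources from most to least attractive."""
    return sorted(sources, key=lambda s: s.expected_net_return(), reverse=True)


# Made-up numbers in arbitrary units, purely for illustration.
candidates = [
    DataSource("job_change_feed", 40.0, 25.0, 90.0),
    DataSource("marriage_filings", 5.0, 30.0, 50.0),
    DataSource("hobby_survey", 60.0, 20.0, 30.0),
]

ranked = rank_by_net_return(candidates)
worth_pursuing = [s.name for s in ranked if s.expected_net_return() > 0]
print(worth_pursuing)
```

The point of the sketch is only that acquisition cost and engineering cost sit on the same side of the ledger, so a cheap source that is expensive to engineer can still be a poor investment.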

Conceptual models and machine learning models

Machine learning usually is used to build “models” that estimate some target quantity of interest. For example, our first business-growth recommendations in the Compass CRM are based on a model that estimates each homeowner’s likelihood to sell their home. These are statistical models of the relationship between some predictor variables (features) and a target variable (often a not-yet-known outcome). For our likely-to-sell (LTS) modeling, we use features of the home and the homeowner (let’s call that home+homeowner) to estimate the likelihood that the homeowner will sell the home in the near future.

Best practice for solving problems with machine learning includes investing significant effort in understanding the phenomenon being modeled, especially to guide the selection of, engineering of, and investment in the data that will be used for prediction. ML practitioners loosely call this “feature engineering.”

Feature engineering ideally should be based on some theory — a conceptual model — of what quantities ought to be correlated with the quantity being predicted and why. More deeply it should be based on a conceptual model of the data-generating process and the actual (causal) drivers of the phenomenon being modeled and the quantity being estimated. (This is not to say that the machine learning here is doing causal modeling, but rather that we should have a model of the causal process in our heads when we are engineering features and building models.) These conceptual models — different from the statistical models that we will build from the data — then can be used to inform feature engineering.

Here is a picture of the model that drove our feature engineering:

Causal models of the data-generating process can be very complex, but even simplified approximations can help guide the feature engineering process. In what follows, we will explore a complex causal model for the drivers of homeowners’ likelihood to sell. The full model, depicted above, is complex, but we will analyze it systematically before putting it all together. After we do that, the overall model should be understandable.

A Conceptual Model of the Influences on Likelihood of Selling

One of the most common and most intuitive frameworks for modeling causes and effects is with graphical models, where quantities in the world are visualized by nodes in a graph and direct causal influences are represented by directed edges in the graph.

The highest-level influences on the likelihood of selling

The figure below presents a graphical model showing the five highest-level factors influencing likelihood of selling. We will elaborate on these factors below, but first let’s consider this high-level view.

High-level factors influencing a homeowner’s likelihood-to-sell (LTS). The U(.) terms are utilities, that is, the multi-factor value that the homeowners realize. (“Multi-factor” meaning beyond the strict monetary value of the home on the market.) The terms are: U(home), U(other homes the homeowner can afford), Affordability of the current home, Alternative investments/uses of equity in the home, and resistance to change.

Let’s go through these five factors.

The first factor is the “utility” of the home. “Utility” is a term from economics that encapsulates all the different sorts of value one gets from something. We use “utility” here instead of just “value” to highlight the non-monetary value of a house. For example, a home may make it particularly convenient for both homeowners to get to work (in different directions); it may be close to the children’s school; and it may have quick and easy access to the best road-biking routes in the area. Those are all very valuable to some people (high utility), but they are not necessarily reflected in the monetary value of the house.
The second factor influencing likelihood to sell is the utility of other homes that the homeowner can afford. The house may not really match the needs and desires of the homeowner, but they may not be able to afford a home that matches better. In that case, they may not have a high likelihood of selling. On the other hand, if their financial situation has changed markedly for the better, and the homeowners can now afford a home that has higher utility, then they will be more likely to sell.
The third factor is the affordability of the current home. The house may have high utility, but the cost of mortgage and maintenance may make it unaffordable. This may be because of changes in the status of the homeowner, or because the purchase was not well thought through.
The fourth factor is whether there are better ways to invest the equity in the home. Alternatively, the homeowner may want to downsize a little and use the funds for other purposes (invest in a vacation home, start a new business, travel the world, …).
Finally, the fifth factor encapsulates that different people have different inherent resistances to (or loves of) change, and resistance to change can change, based on circumstances. For example, life events like marriage and retirement not only change the utilities of specific homes, but also change individuals’ resistance to or desire for other changes — including changing homes. What’s more, we may be able to reduce the resistance to change by reducing the information uncertainty, by providing examples of alternatives that are clearly better than the current status quo.

Certain things might affect multiple influence factors, and so these things may be particularly useful to acquire data about. Certain life events (e.g., marriage and retirement) not only affect the utility of the home and the resistance to change, but possibly the affordability of other homes as well as the other investments that one might want to make. So if we could create features indicating that or correlated with the fact that a homeowner has (likely) gotten married, that could be very useful in predicting a higher likelihood of selling.
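One way to act on this observation is to write the graphical model down as data and count how many high-level factors each candidate signal touches. The sketch below is an illustration only: the edge list is our reading of the examples in this post (marriage, retirement, job changes, wealth events), the identifiers are hypothetical, and counting edges is a crude heuristic rather than an estimate of predictive value.

```python
# The five high-level factors from the model above.
HIGH_LEVEL_FACTORS = {
    "U(home)",
    "U(other affordable homes)",
    "affordability of current home",
    "alternative uses of equity",
    "resistance to change",
}

# Directed edges: candidate signal -> factors it plausibly influences.
# These edges are assumptions drawn from the examples in the text.
INFLUENCES = {
    "marriage": {
        "U(home)",
        "resistance to change",
        "U(other affordable homes)",
        "alternative uses of equity",
    },
    "retirement": {
        "U(home)",
        "resistance to change",
        "U(other affordable homes)",
        "alternative uses of equity",
    },
    "job_change": {"U(home)"},
    "wealth_event": {"U(other affordable homes)"},
}


def rank_signals(influences):
    """Rank signals by how many high-level factors they touch (ties broken by name)."""
    for signal, factors in influences.items():
        unknown = factors - HIGH_LEVEL_FACTORS
        if unknown:
            raise ValueError(f"{signal} points at unknown factors: {sorted(unknown)}")
    return sorted(influences, key=lambda s: (-len(influences[s]), s))


ranking = rank_signals(INFLUENCES)
for signal in ranking:
    print(signal, len(INFLUENCES[signal]))
```

Signals near the top of such a ranking are natural first candidates when deciding where to look for data, subject to the cost of acquiring it.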

What influences the influences?

The beauty of the high-level model just discussed is that it is simple and very intuitive, and also immediately suggests possible features to acquire or engineer, as illustrated above. However, its simplicity has a drawback as well. It is less useful for helping us be comprehensive: if we use this to brainstorm possible features, how sure would we be at the end that we didn’t miss entire categories of important features?

Fortunately, the graphical model framework elegantly enables us to address this, by “recursively” asking: what influences the influences? Can we take each of the five high-level factors and expand it, drawing out as many dimensions of influence as we can?

What makes a home more or less valuable to its owner?

Let’s take the first factor and expand it. The next figure shows five main drivers of the utility of the home to the homeowner. The five categories of drivers of the utility of the homeowners’ home are: The features of the home and the neighborhood, the interests of the homeowners, family/personal needs, professional needs (like a home office), and the value homeowners place on novelty. These factors interact with each other, which is indicated by the x in the circle. For example, family or professional needs are more or less satisfied by the features of the home and neighborhood.

The utility of the home to the homeowner and the factors that influence it. The x in the circle indicates that these factors interact with each other. For example, family and personal needs are satisfied to a greater or lesser extent by home/neighborhood features. In some cases, these have further factors that influence them that are worth calling out. For example, the particular needs of the family are driven significantly by family composition (or plans in that regard).

In some cases, these next-level factors have further factors that influence them that are worth calling out. For example, family needs are influenced by relationship status and family composition. In the extreme, someone inherits a house and has no utility for the home beyond its monetary value. In that case, a sale is almost certain.

So, we can envision lots of different features we could engineer: home size, number of bedrooms, school district, the homeowner just changed jobs, the homeowner is a cyclist, etc.
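The recursive question "what influences the influences?" maps naturally onto a tree, and walking that tree gives a checklist of feature categories to brainstorm against. The following is a minimal sketch under that assumption: the nested dictionary reflects only the factors named in this post, the structure and names are hypothetical, and the real model is a graph with interactions, not a strict tree.

```python
# Nested dict: each key is a factor, each value holds the factors that influence it.
# An empty dict marks a factor we have not expanded further.
INFLUENCE_TREE = {
    "likelihood to sell": {
        "U(home)": {
            "home and neighborhood features": {},
            "homeowner interests": {},
            "family/personal needs": {
                "relationship status": {},
                "family composition": {},
            },
            "professional needs": {},
            "value placed on novelty": {},
        },
        "U(other homes the homeowner can afford)": {
            "wealth": {},
            "income": {},
        },
        "affordability of current home": {},
        "alternative uses of home equity": {},
        "resistance to change": {
            "home equity": {},
        },
    }
}


def leaves(tree, path=()):
    """Yield the full path to every unexpanded factor, depth first."""
    for name, children in tree.items():
        here = path + (name,)
        if children:
            yield from leaves(children, here)
        else:
            yield here


leaf_paths = list(leaves(INFLUENCE_TREE))
for p in leaf_paths:
    print(" <- ".join(reversed(p)))
```

Each printed line is a candidate category: for every leaf we can ask whether we hold data that measures it, could acquire such data, or could engineer a proxy.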

A home’s utility also changes when the homeowner changes jobs. Could one engineer features to capture this change in utility? Possibly yes, as certain third-party companies provide information on consumers changing jobs.

Thinking about engineering features such as this immediately brings to the fore the fact that we may not have data on many of the things that we will discuss throughout this post. However, we may be able to acquire such data, incurring some costs in the process. For example, depending on the state, marriage filings are often part of the public record.

What makes other homes more or less valuable to the homeowner?

Ok, let’s take the second factor in the high-level model and expand it. The next figure expands the “Utility of other homes the homeowner can afford.”

The utility of other homes that the homeowner can afford is affected mainly by what other homes the homeowner can afford or expects to be able to afford, which is a function of the homeowner’s wealth and income. For example, across the population of homeowners, a person’s wealth likely correlates strongly with the value of their home. In fact, for many homeowners, their primary source of wealth is the equity in their homes. Also, wealth typically increases over the years, and this may correlate moderately with the time the homeowner has been in the home, or with the age of the homeowner (if available). Another potential source of information is an event signaling a sudden increase in wealth: a big promotion, the sale of a company, an IPO, and so on. Features engineered from wealth events could be quite predictive of subsequent home sales.
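As an illustration of how the wealth proxies just described might be turned into features, here is a small sketch. The record layout, field names, and the one-year window for "recent" events are all assumptions made for this example. As noted at the top of the post, nothing here implies that such features are actually used.

```python
from datetime import date


def wealth_proxy_features(record, as_of, recent_days=365):
    """Derive simple wealth-related proxies from a hypothetical home+homeowner record."""
    # Time in the home, as a rough proxy for wealth accumulated over the years.
    tenure_years = (as_of - record["purchase_date"]).days / 365.25

    # Flag any known wealth event (promotion, company sale, IPO, ...) in the recent window.
    events = record.get("wealth_events", [])
    has_recent_event = any(0 <= (as_of - d).days <= recent_days for d in events)

    return {
        "tenure_years": tenure_years,
        "estimated_home_value": record["estimated_home_value"],
        "recent_wealth_event": int(has_recent_event),
    }


# Made-up example record.
example = {
    "purchase_date": date(2012, 5, 10),
    "estimated_home_value": 750_000,
    "wealth_events": [date(2022, 1, 15)],
}

features = wealth_proxy_features(example, as_of=date(2022, 5, 10))
print(features)
```

Computing features relative to an explicit `as_of` date keeps them consistent with what would have been known at prediction time, which matters when the target is a future sale.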

What keeps a homeowner from wanting to sell despite seemingly better utility?

Different homeowners have different psychological attitudes toward change. Home equity itself may drive some such effects, like a fear of missing out if friends and neighbors are cashing in when the market values homes highly.

Finally, the likelihood of sale is dependent on the homeowner’s inherent resistance to change, which is affected by various psychological effects. One key driver of such psychological effects is the equity in the home (recall the discussion above of the fear of missing out, as an example). Thus, home equity could be a key feature in a likely-to-sell model, and we could think about engineering more complex features based on home equity.
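For concreteness, here is a sketch of what "more complex features based on home equity" could look like. It assumes we have an estimated current value, an outstanding mortgage balance, and the purchase price for each home. Those inputs, the function name, and the choice of derived features are hypothetical and would all need validating against real data.

```python
def equity_features(estimated_value, mortgage_balance, purchase_price):
    """Derive equity-based features from hypothetical valuation and loan inputs."""
    if estimated_value <= 0 or purchase_price <= 0:
        raise ValueError("estimated_value and purchase_price must be positive")

    # Equity: what the homeowner would keep after paying off the loan.
    home_equity = estimated_value - mortgage_balance

    return {
        "home_equity": home_equity,
        # Share of the home's value that the homeowner owns outright.
        "equity_ratio": home_equity / estimated_value,
        # Relative gain in value since purchase, a possible "cash in" signal.
        "appreciation_since_purchase": (estimated_value - purchase_price) / purchase_price,
    }


# Made-up numbers for illustration.
feats = equity_features(
    estimated_value=800_000,
    mortgage_balance=300_000,
    purchase_price=500_000,
)
print(feats)
```

Ratios such as these are often easier for a model to use than raw dollar amounts, because they are comparable across markets with very different price levels.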

Can a real estate agent or platform affect a homeowner’s likelihood to sell?

Real estate agents and real estate platforms can themselves affect a homeowner’s likelihood to sell, by revealing the availability of high-utility homes, and also by helping them to understand their own “utility functions.”

A final factor that affects the likelihood of homeowners selling their homes is whether they are aware that there are alternative homes out there that would give them higher utility. Real estate agents and well-designed real estate technology platforms should therefore be able to themselves affect a consumer’s likelihood to sell, by revealing the availability of homes that would provide the individuals with high utility, and also by helping them to understand their own “utility functions.”

A self-fulfilling prophecy: Agents and AI

While the last part is not directly a “feature” for likely-to-sell modeling, it does suggest that the very idea of providing real estate agents with recommendations may be self-fulfilling: our generative model predicts that agents talking with individuals likely to sell may act to increase their likelihood of selling! (This was definitely the experience of your author: once we started working with our agent to explore the possibility of buying new homes, we got increasingly excited about the prospect of buying a new home.)

Beyond feature engineering, this last point has implications for the relationship between residential real-estate platforms/agents and overall consumer welfare. The machine learning model is symbiotic with the agents, who become an integral part of the process. The agents who actively engage with the platform end up shaping outcomes that are better than expected by the AI alone.