What Every Machine Learning Company Can Learn from the Zillow-pocalypse

Dilshan Kathriarachchi
11 min read · Nov 26, 2021


Zillow has lost a reported $381 million after a data science model went rogue, according to Bloomberg.

Last week, the company indicated it will have to let go of 25% of its workforce. Zillow is also trying to divest the more than 7,000 homes it had acquired, with a combined asset value exceeding $2.8 billion.

For most people, this will be their first exposure to how even routine machine learning models can go horribly wrong when mismanaged.

This article will explore the sequence of events leading up to Zillow Offers being shut down, including bad decisions and structural flaws (not unique to Zillow) that should be a lesson for companies embracing machine learning.

Zillow Offers powered by predictive models — Source: Zillow.com

Over its 16 years of existence — sitting on the most enviable dataset in modern real estate — Zillow had finally gotten its data science practice to a confidence level where it could launch its most disruptive product: Zillow Offers.

A prospective homeowner listing their house would trigger a valuation from Zillow’s price forecast model, built on the predictive Zestimate price. If the model predicted future value growth, one of Zillow’s sales representatives would call with an offer to buy the home at a predetermined price. Sellers got a near-instantaneous sale. Zillow would then time the market — acting as both seller and agent, guided by the model’s forecast — to offload the property at a profit.

Executed properly, Zillow would have had a giant arbitrage play in the real estate market, unmatched by any REIT in the long term.

Except, they didn’t.

Real estate agent Sean Gotcher (@seangotcher) explaining iBuying in a TikTok video that went viral in September 2021

These programs are generally referred to as iBuyer products — where machine learning price forecasting models trade single-family homes as an asset class. Zillow is not the only iBuyer in the market: Redfin, Opendoor and OfferPad also have similar programs.

Commonly referred to as Automated Valuation Models, or AVMs, these machine-driven models have been around for well over a decade. They’re the valuations underpinning both home mortgages and home insurance. The work EQ Works does in insurance puts me in very close proximity to the output of these types of models. The partners we work with — OPTA Intelligence, for one — have built and tested their models responsibly precisely to avoid the kinds of outcomes Zillow experienced.

Duck Test

September 2021 began with many complaints from realtors about Zillow Offers (and competitor Redfin’s RedfinNow program). The most notable was Sean Gotcher’s viral TikTok video calling out iBuyer business models as unsustainable, which has now amassed over half a million views.

The conversation around it got heated. Redfin CEO Glenn Kelman actually took to Twitter in response to Gotcher’s video, essentially denying his claims and implying the criticism was coming from brokers who didn’t want their commissions cut.

Redfin CEO Glenn Kelman on Twitter
Redfin CEO Glenn Kelman on Twitter (thread continues)

Things clearly started going south over the past few months, as the efficacy of Zillow’s pricing forecast tanked. Twitter is littered with tweets from people who sold homes to Zillow, only to see the same properties re-listed below the purchase price.

Twitter user making a quick $200,000 profit on Zillow’s broken Offers model

What’s really shocking, though, is that despite clear efficacy problems in the model, Zillow went ahead with over 9,900 home purchases through Zillow Offers in Q3 2021 alone.

This was an asset purchase well above one billion dollars, reviewed by data scientists, engineers, real estate experts and the corporate custodians of a multibillion-dollar public company. Yet it passed all these checks and balances without encountering any meaningful resistance.

Instead of slowing down the Offers product — perhaps taking it offline to analyze and learn from the real-world performance they were seeing — Zillow accelerated tremendously from the second to the third quarter of 2021, amplifying its financial exposure to the systemic issues of the deployed model. This was, without a doubt, a failure in both data science best practices and corporate leadership.

From Q2 to Q3 2021, Zillow’s exposure to the faulty model had exploded. Source: Zillow Investor Relations

I wasn’t referring to any formal statistical test when I said Duck Test (unlike the Dickey–Fuller test, which they probably should’ve used when dealing with housing price data). No, I was referring to the very common act of oversimplifying and drawing anecdotal conclusions about machine learning models, which leads to extremely dangerous scenarios. If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck — unless it’s actually a mechanical duck automaton.
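For the curious, the core of the Dickey–Fuller idea fits in a few lines: regress the one-step change of a series on its previous level. A coefficient near zero means shocks never decay (a unit root, as in a random walk, which is roughly how a trending price index behaves), while a clearly negative coefficient means the series mean-reverts. Here is a minimal numpy sketch of that core regression, not the full augmented test with lagged differences and proper critical values:

```python
import numpy as np

rng = np.random.default_rng(0)

def df_coefficient(y):
    """OLS slope of the first difference regressed on the lagged level.
    This is the core regression behind the Dickey-Fuller test."""
    lagged = y[:-1]
    diff = np.diff(y)
    x = lagged - lagged.mean()
    return float(np.dot(x, diff - diff.mean()) / np.dot(x, x))

# Random walk: non-stationary, like a steadily trending price index.
walk = np.cumsum(rng.normal(size=2000))

# AR(1) with phi = 0.5: stationary and mean-reverting.
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.5 * ar[t - 1] + rng.normal()

print(df_coefficient(walk))  # near 0: shocks persist, no pull to a mean
print(df_coefficient(ar))    # near -0.5: strong mean reversion
```

A proper analysis would use something like `statsmodels`’ `adfuller`, which adds lagged difference terms and compares the statistic against the correct non-standard critical values.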

Magical Predictions

Any sufficiently advanced technology is indistinguishable from magic.
- Clarke’s Third Law

So, why did no one question the model before it was too late?

The answer to that likely lies in Arthur C. Clarke’s Third Law. Once a technology gets sufficiently advanced, relative to a given audience, it inextricably becomes indistinguishable from magic. Clearly, the Zillow Offers model that drove all of these destructive decisions had become magic to the corporate leadership. And it’s very difficult to govern and make decisions about anything magical.

That’s a problem that will extend further than Zillow — much further. In fact, it’s a problem that will touch every space where machine learning is deployed.

The humble neural network, a foundation of much of modern data science and machine learning, has a lot to do with this. Neural networks — despite underpinning the modern digitally connected society and influencing vast swaths of the economy — are largely not explainable.

Modern AI has become a black box: while we can run limited tests to evaluate whether a machine learning model is making good predictions, we can’t understand why it’s making them. This opaqueness can hide huge issues until it’s too late.

An example of a simple neural network with 5 hidden layers and a handful of neurons
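The opacity is easy to see even at toy scale. A network like the one pictured is nothing but chained matrix multiplications and nonlinearities: you can check whether its output is a good prediction, but the individual weights offer no human-readable “why”. A minimal sketch in numpy, where the architecture, the random weights, and the feature names are all purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy fully connected network: 4 inputs -> two hidden layers -> 1 output.
# Random weights stand in for trained ones; a trained model's weights are
# just as uninterpretable to a human reader.
layer_sizes = [4, 16, 16, 1]
weights = [rng.normal(scale=0.5, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def predict(x):
    """Forward pass: alternate matrix multiply and ReLU, then a linear output."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)   # ReLU hidden activations
    return h @ weights[-1]           # linear output layer

# Hypothetical listing features: beds, baths, square feet, year built.
features = np.array([3.0, 2.0, 1500.0, 1987.0])
print(predict(features))  # a single number, with no reason attached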

Zillow’s most popular home pricing model is the Zestimate. This is the model you see when you fire up the Zillow app and navigate to a listing. It was developed through a crowdsourced contest on Kaggle, with a bounty of $1.2 million, called the Zillow Prize. 3,770 competing teams gave Zillow a license to use their work in exchange for a shot at the prize. The public nature of this competition means the underlying techniques behind the Zestimate model are well understood to be stochastic. What is unclear is to what extent Zillow Offers actually leveraged the Zestimate model.

Zillow published as early as June 2021 that it had decreased the Zestimate’s error rate to 6.9%, which is very impressive. While not authoritative by any measure, there have been a lot of individuals claiming to be ex-Zillow data science members indicating that there were in fact two models in play: one neural-network-based stochastic model for Zestimates, and a separate, possibly non-stochastic model for the Offers product. Why would Zillow use a separate model for its iBuyer program? It could very well be that the Zillow Offers business model (or having to explain the decision tree for legal compliance) required a time-series forecast, which the Zestimate is not designed to do.

Zillow would then need to forecast the trajectory of a Zestimate in order to make a purchasing decision. You’re now modelling a forecast on top of a modelled prediction, the Zestimate. Doing so significantly increases the risk of compounding errors across both models. This alone would’ve made the challenge before the Zillow Offers team daunting, and also admirable.
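The compounding is straightforward to quantify under simplifying assumptions. If the valuation error and the forecast’s own error are independent and multiplicative, their variances approximately add, so the stacked system ends up roughly sqrt(2) times noisier when both stages are equally noisy. A quick simulation with hypothetical 5% error levels:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

true_price = rng.uniform(200_000, 800_000, size=n)

# Stage 1: a Zestimate-like valuation with ~5% relative error.
zestimate = true_price * (1 + rng.normal(0, 0.05, size=n))

# Stage 2: a forecast with its own ~5% error, applied on top of stage 1.
forecast_on_estimate = zestimate * (1 + rng.normal(0, 0.05, size=n))

# The same forecast applied to the (unknowable) true price, for comparison.
forecast_on_truth = true_price * (1 + rng.normal(0, 0.05, size=n))

def rel_rmse(pred):
    """Root mean squared relative error against the true price."""
    return float(np.sqrt(np.mean(((pred - true_price) / true_price) ** 2)))

print(rel_rmse(zestimate))             # ~0.05
print(rel_rmse(forecast_on_truth))     # ~0.05
print(rel_rmse(forecast_on_estimate))  # ~0.07: the two errors compound
```

Neither 5% figure is Zillow’s actual error rate; the point is only that stacking a forecast on a prediction inflates the combined error relative to either stage alone.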

The only public insight into the actual Zillow Offers forecasting model (that we have so far) can be gleaned from some of the company’s job postings for the Offers data science team. This is the first hint that we get of what commonplace, but often flawed, hiring practices may have contributed towards the catastrophic failure at Zillow.

Hiring for the Wrong Skills

Every machine learning model, stochastic or not, is only as good as the teams and individuals that trained, tested and deployed it. Looking at data science job postings for the now-defunct Zillow Offers team, you see the first sign of trouble.

Job listing for Zillow’s now defunct Zillow Offers data science team. Source: Glassdoor via ryxcommar

While this has a lot of the boilerplate data science job description, the key thing to note is the strong emphasis on competency with Prophet, a time-series forecasting and analysis tool from Facebook (research paper). This is a Python package that has grown in popularity by making powerful, non-stochastic forecasts on time-series data quick and incredibly easy. It’s a tool I love: whether it’s measuring the trajectory of a marketing campaign or analyzing a lead-generation pipeline, it’s one of my go-to tools.

While Prophet has plenty of tools to model seasonality, change points, and outliers — ultimately it’s a curve-fitting algorithm. There’s absolutely nothing wrong with that; I’d even argue it’s the most powerful and elegant aspect of regression analysis.
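The danger only appears when you extrapolate a fitted curve past the regime it was trained on. This sketch doesn’t use Prophet itself, just a plain linear trend fit with numpy on a hypothetical price index, but the failure mode is the same for any curve fitter trained on a one-way market: fit on the boom, and every extrapolation keeps booming.

```python
import numpy as np

# Hypothetical monthly price index: eight years of steady appreciation,
# then a sharp six-month correction at the end.
months = np.arange(102)
index = 100 * 1.005 ** months                      # ~0.5%/month growth
index[96:] = index[95] * 0.98 ** np.arange(1, 7)   # late downturn, ~2%/month

# Fit a trend on the "boom" portion only, as trailing data would suggest.
slope, intercept = np.polyfit(months[:96], index[:96], deg=1)

forecast_next_6 = slope * months[96:] + intercept
actual_next_6 = index[96:]

print(forecast_next_6.round(1))  # keeps climbing
print(actual_next_6.round(1))    # falls ~2% every month
print(float(np.mean(forecast_next_6 - actual_next_6)))  # systematic overbid
```

A model like this would keep recommending purchases at prices the market no longer supports, which is exactly the pattern of buying high and re-listing low that showed up in Zillow’s Q3 numbers.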

Given the job posting, it is reasonable to assume that the Offers team disproportionately used Prophet (or similar SARIMA-style models) to forecast future real-estate prices from historical data. However, when you look at the time series of the US (and Canadian) real-estate markets from 2013 to the present, you’d be hard pressed to find a training dataset appropriate for this type of technique.

Housing market from 2000 onwards. Source: S&P Major Markets House Price Indices

An obvious issue with using a (S)ARIMA-style model on residential real estate is that it cannot capture the true complexity of the market.

Similarly, when everything is going up, and trailing historic data tells the same story, you quickly end up with a speculative price forecast model that cannot handle price volatility.

The overemphasis by the Zillow Offers team on Prophet, or any other highly opinionated framework, sets a dangerous precedent. Teams can easily become framework-centric as opposed to solution-centric.

That’s why it’s important for a team to cycle through very different frameworks and methodologies in the lifecycle of a machine learning product. Elevating any single framework like we see in the job posting can quickly lead to confirmation bias around how the problem gets solved. This has a material impact on diversity of thought and the types of strategies the Zillow Offers team explored in developing the model.

Unaccounted for Adverse Selection

Alongside these other issues, Zillow chose to apply a premature machine learning model to one of the most difficult domains in existence.

Unlike airline tickets, the housing supply is very inelastic. Unlike products on e-commerce sites, houses carry emotional attachments across thousands of variables, leaving no suitable substitutes: every home is one of a kind. Unlike digital marketing, there is no clear feedback loop allowing the model to build in causality. Deals fall through for hundreds of opaque reasons, including financing, an appraiser’s valuation, or simply someone changing their mind. Even when a feedback loop is possible, closing timelines make that cycle anything but real-time.

While machine learning models have solved pricing in some markets, more complex markets like residential real estate remain very difficult. Even at Zillow’s scale, there is not enough data to handle all the corner cases. Deploying such an ill-prepared model prematurely likely exposed Zillow to a huge amount of adverse selection risk.

Adverse selection is when one party in a transaction, say the buyer of insurance, has an information advantage over the seller: for example, knowing that they don’t always change into their winter tires. This information gap makes a material difference to whether the insurance company (the seller) would have completed the transaction.

Given the complex market they were operating in, paired with considerable feedback delay, Zillow’s Offers model would have been at a consistent informational disadvantage to sellers. This built-in structural problem is quite possibly what ultimately sank the Offers product.
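The mechanism is easy to simulate. Under the deliberately stark assumption that each seller privately knows their home’s true value and accepts only offers above it, an offer model that is perfectly unbiased across all offers still overpays on every deal that actually closes:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

true_value = rng.uniform(300_000, 700_000, size=n)

# The model's offer: unbiased on average, but noisy (~7% relative error).
offer = true_value * (1 + rng.normal(0, 0.07, size=n))

# Sellers know more than the model: in this stark setup they accept
# only offers above what the home is actually worth.
accepted = offer > true_value

overpayment = offer[accepted] - true_value[accepted]
print(accepted.mean())                      # about half the offers close
print(float(overpayment.mean()))            # average loss per closed deal
print(float((offer - true_value).mean()))   # ~0 across ALL offers: unbiased!
```

Real sellers are not perfectly informed, so the effect is softer in practice, but the direction is the same: the deals that close are systematically the ones where the model overshot.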

Takeaways to Move Forward With

Looking back, it can be easy to dismiss these issues as Zillow-specific. However, if we don’t pay attention, this could become all too familiar. Now, more than ever, companies are bringing machine learning models into production workflows — often, like Zillow, replacing established processes and markets.

As the dust settles on the Zillow-pocalypse, here are some lessons we can learn, to avoid making the same mistakes.

Model Governance is Corporate Governance

The first takeaway is that corporate governance needs to extend into model governance. For the longest time, corporate governance was about managing the human side of risk. However, as humans increasingly get taken out of the loop, governance that focuses only on the people training and managing the models is not good enough. This is why I hate the use of phrases like “solve X with machine learning” in the corporate vernacular. If they are to responsibly manage the introduction of machine-led decisioning into their businesses, corporate leaders need to become well versed in the intricacies of machine learning. No more magic.

Hire Problem-Solvers to Solve Problems, Not Use Frameworks

That isn’t to say there shouldn’t be a focus on the people building the models — a lot of thought needs to go into hiring them, and then empowering them. These are the individuals who will build the models that become a business’ moat. They need to be able to make proper strategic decisions as experts in their domain.

It would be great to see a shift away from listing specific frameworks or technologies in job descriptions, toward listing the methodologies or types of models the team values. The former culturally locks in a framework from hiring all the way to production; the latter provides a starting point while still promoting diversity of approach. You can read how my team at EQ Works hires — we encourage everyone to copy, borrow and steal our hiring methodology.

Source: Unsplash

Be Realistic About Your Model Capabilities, and Test Them

Lastly, we need to stop treating machine learning as a one-size-fits-all silver bullet. As Zillow learnt the hard way, there is a lot of complexity in problem spaces. Markets are incredibly complex, often in deceptively hidden ways. Solutions that work in one setting abjectly fail in another. Finding good solutions requires testing, learning, and flexibility.

This is by no means a pessimistic take on companies using machine learning to disrupt the status quo. The Zillow Offers team was incredibly brave in charting new ground and taking on an incredibly difficult challenge. Rather, it’s an acknowledgement of machine learning’s coming of age, and of its need for responsible governance. Zillow is an expensive, but necessary, lesson for the whole industry to move beyond the current plateau.

I have no doubt the Zillow Offers data science team will walk away from this and go on to solve bigger, harder and even more impressive problems. They are now the custodians of some of the most valuable learnings necessary to build the next generation of machine learning businesses. The rest of us, as observers, should glean as much knowledge from what happened at Zillow and fix our own gaps in governance, structure and hiring.

At EQ Works, my team and I have been testing, adapting and iterating our own approach to bringing machine learning to production. We constantly learn from our mistakes, and from our peers like Zillow. Some of our learnings, constantly evolving engineering culture and best practices can be found on our engineering site.
