How Opendoor processes over 500 videos each day

Building Enricher: A computer vision-based tool

Open House · Apr 5, 2023

By Shashwat Srivastava, Senior Software Engineer

If you’ve ever bought or sold a home before, you know how hard it can be. The traditional process can be fraught with complexity due to dozens of steps, multiple middlemen, and months of uncertainty. On the backend, it can be just as complex: transacting on a home is operationally intensive. However, Opendoor has developed tools and technology to reduce this burden on our operations teams and improve the customer experience.

As a result, we provide a simple and certain way to sell a home, where you can request a cash offer in minutes and close in weeks. Here’s how it works:

  1. A seller visits Opendoor.com and requests an offer. Behind the scenes, our proprietary algorithm compares individual features for hundreds of comparable homes, considering everything from home details and recent listings to current market conditions and broader economic trends. In addition, our human expertise validates local market nuances.
  2. Then, in as little as three minutes, a customer receives a preliminary offer for their home. They can review the offer on their own time or with an Opendoor representative who will explain what went into it.
  3. An Opendoor expert will visit the home to make an assessment of the exterior and take a quick walk through the interior. Opendoor will then finalize our cash offer for the home.
  4. If all looks good, the customer accepts! 🙌 Now the operational hard work begins.

Virtualizing our Home Assessment Process

While I could dig into all of the operational efficiencies we’ve created, I’m going to focus on the assessment process and explain how we built Enricher: a computer vision-based tool used to enrich videos of home assessments.

One of the core facets of the Opendoor experience is to make selling a home as convenient and easy as possible. At the onset of the pandemic, Opendoor had to think through how to make the assessment process contactless. As a result, we virtualized our home assessment process: customers could now record a walkthrough of their home using their phone and upload it to their Opendoor dashboard. We used that video to make decisions on the final offer. Customers loved the flexibility of the new process. They could do the assessment on their own timeline and quickly (it takes as little as 10 minutes!), not worry about strangers coming into their home, and have more control, highlighting what they think is most unique about their home — all while preserving the quality of our assessments.

So how does Opendoor process all of these videos and extract valuable information in a timely manner? The v0 of this process required human input to go through each video and find representative screenshots of the major aspects of the home — photos of each bedroom, each bathroom, the kitchen, and so on. It was labor-intensive, requiring Opendoor employees to watch the entire video, take screenshots, and label each screenshot by hand.

We thought: how can we do this better? Combining data science and engineering, we did just that. Introducing Enricher — a computer vision-based tool used to enrich videos of home assessments!

Using Data Science for Speed

Our Senior Scientist, Nelson Ray, who worked on this model, often quotes AI pioneer Andrew Ng: “If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.” This sentiment was our inspiration for developing Enricher.

The classification performed on each screenshot from the video consisted of two parts:

  1. Labeling the type of room shown (is it a bedroom, bathroom, kitchen, etc.?)
  2. Scoring its condition (is it in good shape, or does it need repairs?)

Given the advances in computer vision, our thesis was that this should be a fairly simple task.

Creating the model

We decided to use fastai (a Python machine learning library) to generate the ML model that classifies rooms within homes. fastai lets us generate and customize models very quickly. We used resnet34 (an open-source computer vision model) and trained it on labeled images of rooms we already had. This quickly gave us a model that was very good at classifying rooms.
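
For readers curious what such a training run looks like, here is a minimal sketch using the fastai v2 API. The directory layout, validation split, and epoch count are illustrative assumptions, not our actual training configuration:

# Assumes labeled images are organized by room type, e.g.
# room_images/bedroom/001.jpg, room_images/kitchen/002.jpg, ...
from fastai.vision.all import (
    ImageDataLoaders, Resize, accuracy, resnet34, vision_learner,
)

dls = ImageDataLoaders.from_folder(
    "room_images", valid_pct=0.2, item_tfms=Resize(224)
)

# Start from a resnet34 backbone pretrained on ImageNet, fine-tune it
# on the room labels, then export the model for later inference.
learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(3)
learn.export("room_classifier.pkl")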

The basic solution now is to:

  1. Get the assessment video
  2. Slice it using ffmpeg — we decided to capture five frames for each second of video (see the sketch after this list)
  3. Score each frame using our models
  4. Intelligently choose a sample of images that we believe best capture the home
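
As a sketch of what the slicing step might look like, ffmpeg’s fps filter can sample a fixed number of frames per second; the paths and filenames here are hypothetical:

import subprocess

def slice_video(video_path: str, out_dir: str) -> None:
    # Hypothetical sketch of step 2: sample five frames per second of
    # video into numbered JPEGs using ffmpeg's fps filter.
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,             # input assessment video
            "-vf", "fps=5",               # sample five frames per second
            f"{out_dir}/frame_%05d.jpg",  # numbered output frames
        ],
        check=True,
    )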

We tested this by running it in shadow mode alongside our existing solution and collecting feedback from our operations team. The feedback was generally positive: the automatically produced labels helped as much as the manually created ones. After running it for a couple of months and gaining confidence in the solution, we rolled it out to production.

Handling images of people

Some of these videos contain the faces of our customers. To protect their privacy, we trained another model on people’s faces to detect whether a frame contains a face. Before scoring any frame, we first run this model and, if a face is detected, discard the frame completely. This ensures we never capture people’s faces.
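
The paragraph above describes what the face gate does rather than how it is built. As a stand-in, here is a hedged sketch of such a gate using OpenCV’s bundled Haar cascade detector (not necessarily the model we trained):

import cv2

# Illustrative face gate: any frame with a detected face is discarded
# before scoring. OpenCV's Haar cascade stands in for the real model.
_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def contains_face(frame_bgr) -> bool:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0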

Overall, we run three models on each image, which tell us whether there is a person in the screenshot (we discard these photos), which room we think the photo represents, and what we believe the condition of the room is.
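
Putting the three models together, the per-frame logic is roughly the following; the model objects and their methods are hypothetical placeholders, not our production interfaces:

from typing import Optional

def score_frame(frame, face_detector, room_classifier, condition_scorer) -> Optional[dict]:
    # Privacy gate first: never keep a frame that contains a face.
    if face_detector.contains_face(frame):
        return None
    return {
        "room_type": room_classifier.predict(frame),   # bedroom, kitchen, ...
        "condition": condition_scorer.predict(frame),  # condition score
    }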

Using Engineering for Scale

High Level Design

At this point, we had the models and a solid plan for processing a single video. The main issue with this approach was that processing an entire video can take fairly long — some back-of-the-envelope math suggested about one second of processing for each second of video. Each video we receive can be anywhere from 15 to 45 minutes long, and when you’re processing ~500 videos a day, it adds up: on average, 250 hours of processing a day (30 minutes × 500 videos ÷ 60 minutes per hour). The obvious solution is to distribute the workload across multiple machines. But how?

At Opendoor, we use Apache Spark for most of our computational workloads and decided to use that for parallelizing this workload because we already possessed the infrastructure and expertise. Below is a diagram which roughly outlines the Spark plan we ended up creating.

This screenshot is from our design doc for this project. The propertyID and assessmentID columns form the primary key for each video and are the initial inputs.

One of the big engineering obstacles was how to represent the images after slicing them with ffmpeg. The options were to:

  1. Save them all to S3 (very, very expensive from a data transfer perspective)
  2. Collect all the images on the driver node, where we could use the built-in Spark “image” format
  3. Save the images as bytes and make them part of the DataFrame

We decided to go with converting the images into bytes and storing them in the DataFrame itself, as this was the most cost-effective option and the easiest to reason about. Once the images are scored and we have our representative sample, we simply save those images to S3 and kick off multiple downstream jobs.
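
To make option 3 concrete, the frames can be carried through the pipeline as a binary column alongside the video’s primary key. The column names below follow the design-doc description but are assumptions, not the real schema:

from pyspark.sql.types import (
    BinaryType, IntegerType, StringType, StructField, StructType,
)

# One row per sliced frame; image_bytes holds the raw JPEG bytes.
frame_schema = StructType([
    StructField("property_id", StringType()),
    StructField("assessment_id", StringType()),
    StructField("frame_index", IntegerType()),
    StructField("image_bytes", BinaryType()),
])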

Implementation details

For each step, we used the PySpark API and `dataframe.rdd.mapPartitions` extensively. This method allows us to perform almost any Python operation on each row of data. The data can be partitioned appropriately, and each partition can be processed on a different worker node, which parallelizes the entire workload. The number of partitions roughly controls how many concurrent tasks run, which is especially useful when calling other services so we do not overload them. In addition, the number of output rows from this method does not need to match the number of input rows. For all of these reasons, this method can be repurposed for almost any distributed computation.

For the first step in our implementation, each worker downloads the videos in its partition from S3 one at a time, slices them locally using ffmpeg, and produces a new row for each image containing the bytes of that image.
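
Here is a sketch of what that first step might look like with `mapPartitions`: each task downloads its videos, slices them locally, and yields one row per frame, so the output has many more rows than the input. `videos_df` stands for the input DataFrame of property/assessment keys, and `download_from_s3` is a hypothetical helper:

import glob
import subprocess
import tempfile

def slice_partition(rows):
    # Runs once per partition on a worker; 'rows' is an iterator of
    # (property_id, assessment_id) records.
    for row in rows:
        with tempfile.TemporaryDirectory() as tmp:
            # download_from_s3 is a hypothetical helper that fetches the
            # assessment video and returns its local path.
            video_path = download_from_s3(row.property_id, row.assessment_id, tmp)
            subprocess.run(
                ["ffmpeg", "-i", video_path, "-vf", "fps=5",
                 f"{tmp}/frame_%05d.jpg"],
                check=True,
            )
            for i, path in enumerate(sorted(glob.glob(f"{tmp}/frame_*.jpg"))):
                with open(path, "rb") as f:
                    yield (row.property_id, row.assessment_id, i, f.read())

# videos_df: input DataFrame keyed by (property_id, assessment_id)
frames_df = videos_df.rdd.mapPartitions(slice_partition).toDF(
    ["property_id", "assessment_id", "frame_index", "image_bytes"]
)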

The next step uses the same method to process each image in parallel, running the pre-generated computer vision models against each image. This is where things get really interesting: instead of processing one image at a time, we process 1,000 images at the same time! And this number could easily be higher — it’s bounded only by how many machines you are willing to use. The last step selects a representative set of images from all the scored images. And voilà, we’re done!
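
One common way to structure this scoring step is to load the exported model once per partition rather than once per image. The sketch below follows the earlier training sketch (the `room_classifier.pkl` artifact and row layout are assumptions, not the production code):

import io

from fastai.vision.all import PILImage, load_learner

def score_partition(rows):
    # Load the exported model once per partition; the .pkl file must be
    # accessible on every worker node.
    learn = load_learner("room_classifier.pkl")
    for row in rows:
        img = PILImage.create(io.BytesIO(row.image_bytes))
        label, _, probs = learn.predict(img)
        yield (row.property_id, row.assessment_id, row.frame_index,
               str(label), float(probs.max()))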

Using Spark has worked really well for us. We can process large numbers of videos with very little effort: handling more volume simply requires horizontally scaling our clusters, which lets us process any number of videos and sets us up for future growth.

Pro Tip: When running PyTorch model classifications, PyTorch tries to maximize CPU thread usage, which can lead to resource contention. To get multiple classifications running sanely on the same worker node, we simply had to add the following code. Leaving it as a snippet for any other engineers running into this issue:

# By default PyTorch tries to use all available CPU cores on a machine. This
# leads to intense resource contention: scoring an image went from taking
# 30ms to close to 30s when done in parallel on Spark workers. This code is
# needed to allow scoring images in parallel on Spark.
from fastai.torch_core import set_num_threads  # pylint: disable=import-outside-toplevel
set_num_threads(1)

Results

Overall, this is a great example of how recent leaps in machine learning are helping to modernize operations for the housing industry and improve the customer experience. Developing Enricher ended up being a success for us. We were able to hit our original aggressive timeline thanks to a lot of help from our colleagues in Operations and active collaboration between all teams involved.

This system resulted in huge savings for us as we were able to get more standardized data from our videos. On top of that, we also delivered a delightful customer experience. We believe the future of buying and selling a home is on-demand and online. Every update we make, every tool we build, however big or small, is to bring us a little closer to making that future a reality.

Want to learn more about working at Opendoor? Check out our Product Management and Engineering and Data Science blogs.
