Computing at the Edge: On-Device Stitching with Zillow 3D Homes

Sing Bing Kang
Zillow Tech Hub
Jun 24, 2020

Co-Authored by Sean Cier, Principal Software Development Engineer, Zillow Group

Zillow’s original 3D Home app for iPhone allowed home sellers and real estate professionals to capture and create interactive virtual tours of homes using a set of interlinked panoramic images. A panorama was captured by holding an iPhone steady and turning slowly in a circle while the app recorded video frames and motion (IMU) data to help guide the user. The original app captured 180 frames as the user rotated and sent this data (70 MB or more) to the cloud to be stitched into a panorama. This approach, documented here, produced attractive results but suffered from a few problems:

  1. Speed — It took several minutes or more before the results could be previewed. Sometimes it felt like you were back in the film days, sending your photos off to be processed. That also meant users had to be extra-careful when capturing tours to eliminate any chance of an error, which further slowed down the whole experience.
  2. Connectivity — Many homes on the market are not occupied, meaning they don’t have active WiFi while being listed for sale. Multiple gigabytes of data per home is an awful lot of data to send over cellular, so most users opted for capturing the tour offline. That meant that the photographer couldn’t see the panoramas they had captured until hours later, after they had already left the site and it was too late to capture any new ones.
  3. Compatibility — Being limited by upload data size meant it wasn’t practical to take full advantage of ever-improving cameras that offer features such as high-resolution capture and ultra wide angle lenses.
  4. Scalability — As more and more tours were created, compute demand skyrocketed, increasing costs.
  5. Unprocessed Data — Not having processed data available during capture prevented the application of machine learning and computer vision techniques that could be leveraged to create richer tours.

The evolution of the 3D Home app, and features like integration with a 360-degree camera, meant it was time to revisit the panorama stitching system. We wanted to fully unlock the power of a user’s phone to improve their experience — deliver faster results (within a minute), take advantage of high-resolution and ultra-wide-angle capture features, and work offline.

Computing at the edge

We understood the problem pretty well from the first go-round, and so picking supporting technologies to improve 3D Home was fairly straightforward.

OpenCV: The highly popular computer vision library hasn’t always been at home on mobile devices, but nowadays it flat out hums — providing all the flexibility and efficiency we needed. We considered using platform technologies like Vision and Metal to help speed up processing by taking better advantage of the GPU, and this remains an area where we’ll gradually tweak the implementation, but for the time being, OpenCV has met our needs.

AVFoundation: AVFoundation is the iOS system-level framework for controlling the device’s camera and accessing the raw frame data. While it provides many options for modifying the frame capture frequency and exposure time, we found that capturing at 60 Hz strikes the right balance: exposures stay short enough to minimize motion blur without introducing additional artifacts.

CoreMotion: We needed IMU (accelerometer and gyroscope) data, and iOS’s CoreMotion framework provides this data and filters it into a clean, stable signal. This allowed us to provide an AR-style overlay that shows the user how they’re moving, lets us know when they’ve rotated far enough to capture another frame, and applies heuristics to warn them when they’re moving too quickly or pitching or tilting too far from true. Our goal was to provide enough subtle feedback to help them get back to slow-and-level before they’ve tilted too far for the data to be usable, because forcing them to stop and go back almost inevitably results in small discontinuous movements that show up as ghosts, bent lines, or other artifacts in the final panorama.
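
To give a flavor of those heuristics, here is a toy sketch in Python; the thresholds and the specific checks are ours for illustration, not the values the app actually uses.

    import math

    # Illustrative limits only -- not the shipping app's tuned values.
    MAX_YAW_RATE_DEG_S = 40.0   # rotating faster than this risks blur and missed frames
    MAX_TILT_DEG = 8.0          # pitch or roll beyond this bends lines in the panorama

    def capture_feedback(yaw_rate_rad_s, pitch_rad, roll_rad):
        """Map IMU-style readings (as CoreMotion would report them) to gentle user guidance."""
        warnings = []
        if abs(math.degrees(yaw_rate_rad_s)) > MAX_YAW_RATE_DEG_S:
            warnings.append("slow down")
        if abs(math.degrees(pitch_rad)) > MAX_TILT_DEG:
            warnings.append("tilt the phone back to level (pitch)")
        if abs(math.degrees(roll_rad)) > MAX_TILT_DEG:
            warnings.append("straighten the phone (roll)")
        return warnings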

We also explored using ARKit as an alternative (or in addition) to AVFoundation and CoreMotion, for its ability to stably track the user’s position and provide real-time feedback on translation. Simply put, the more the user moves while they capture, the more any depth disparities in the scene will result in parallax effects which spoil the imagery. Integrating with ARKit and utilizing that data to improve the experience presented its own set of challenges, though. While this remains an area we are convinced holds promise for future versions of the capture experience, we decided to work with AVFoundation for this version.

That’s an awful lot of data you’ve got there

Our old system captured 180 frames, one per 2 degrees of rotation. The new system has a different set of trade-offs (more on that later) and thrives on smaller jumps between frames. We experimented and arrived at 540 frames, which minimizes processing time while producing results indistinguishable from those of higher frame counts. Now, 540 full HD frames (or, on the iPhone 11, 540 4K frames) is quite a lot of data to store. Ideally we’d stream and discard it, but that has two problems. First, we want the algorithm to work well even when data is captured more quickly than it can be immediately processed, so we still need to keep a backlog of frames for it to crunch through. Second, as the algorithm uses multiple passes, we need to keep the old data, and in fact save more partially-processed data for each frame as we go.

Back-of-the-napkin math suggested this would be too much data to store, and experiments confirmed it. Modern iPhones allow a lot more memory usage per app than earlier models, with recent devices permitting just north of a gigabyte. Still, that can’t all be dedicated to storing frame data for processing, especially when you’re stitching in a background thread while the user goes about other tasks in the app. A few hundred megabytes are practical, but once you press past 600 or so, you start running into problems. Even an occasional crash can ruin a user’s day, especially when their job depends on their tools working right.
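
Here is roughly what that back-of-the-napkin math looks like, assuming uncompressed full HD frames at 4 bytes per pixel (the exact buffer formats differ):

    # Rough estimate: can we simply hold every captured frame uncompressed?
    frames = 540
    width, height, bytes_per_pixel = 1920, 1080, 4      # full HD, 4-byte pixels (assumed)

    full_mb = frames * width * height * bytes_per_pixel / 1e6
    strip_mb = full_mb / 5                               # keeping only the middle fifth (see below)

    print(f"all full frames:     ~{full_mb:,.0f} MB")    # ~4,479 MB
    print(f"middle-fifth strips: ~{strip_mb:,.0f} MB")   # ~896 MB -- still past the comfort zone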

We addressed this with a multiple-stage pipeline. First, we pick one incoming frame for each “slot”, or 1/540th the full circle. Sometimes only a single frame is captured during that slot, depending on how quickly the user’s moving and how noisy the IMU data is, but often we have 2 or 3 to choose from, so we try to pick the best one using some simple heuristics: smoothest motion, stable focus, etc.
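
In simplified Python, the slot bookkeeping and the “pick the best frame” step might look like the sketch below; the sharpness proxy (variance of the Laplacian) and the weighting are illustrative stand-ins for the app’s actual heuristics.

    import cv2

    SLOTS = 540  # one slot per 1/540th of the full circle

    def slot_index(yaw_deg):
        """Map an absolute yaw angle (degrees) to its slot."""
        return int((yaw_deg % 360.0) / 360.0 * SLOTS)

    def frame_score(gray, yaw_rate_deg_s):
        """Higher is better: in focus (sharp) and captured while rotating smoothly."""
        sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()   # simple focus/blur proxy (assumption)
        return sharpness - 5.0 * abs(yaw_rate_deg_s)        # weighting is illustrative

    def pick_best(candidates):
        """candidates: list of (gray_frame, yaw_rate_deg_s) tuples captured within one slot."""
        return max(candidates, key=lambda c: frame_score(c[0], c[1]))[0]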

Next, we associate that frame with IMU data from the same period, and crop it into a vertical strip, because we only need the middle fifth or so. We keep more of the width for some frames, specifically the first and last frames. We also retain the width for those where the user had tilted too far and had to go back and recapture a frame. This is because until we’ve actually done the registration passes and then distributed any error in later passes (again, more on that later), we won’t know exactly how much of the width we’ll end up needing, and frames with possible jumps are more likely to need more spatial information.

After cropping, frames get streamed to the linear first phase of the stitching algorithm, namely, image registration. At the same time, an asynchronous operation is working its way through the queue of recently captured frames, compressing them in a lossy manner (we tried HEIC, but that file type was a bit processor-hungry, so we stuck with trusty JPEG), and dropping the uncompressed copy on the floor. Later passes of the algorithm access these older frames by uncompressing them on the fly, with a cache of a few dozen uncompressed frames. The algorithm will also store secondary processed data, including frame buffers, which are also compressed on the fly.
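
In spirit, the crop-and-recompress step looks something like the following; the JPEG quality and cache size are placeholders rather than the app’s settings.

    from functools import lru_cache
    import cv2
    import numpy as np

    def crop_strip(frame, keep_fraction=0.2):
        """Keep a vertical strip from the middle of the frame (roughly a fifth of its width)."""
        h, w = frame.shape[:2]
        strip_w = int(w * keep_fraction)
        x0 = (w - strip_w) // 2
        return frame[:, x0:x0 + strip_w]

    def compress(strip, quality=85):
        """Lossy-compress a strip so the uncompressed copy can be dropped on the floor."""
        ok, buf = cv2.imencode(".jpg", strip, [cv2.IMWRITE_JPEG_QUALITY, quality])
        assert ok
        return buf.tobytes()

    @lru_cache(maxsize=48)  # keep a few dozen decoded strips around for the later passes
    def decompress(jpeg_bytes):
        return cv2.imdecode(np.frombuffer(jpeg_bytes, np.uint8), cv2.IMREAD_COLOR)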

Finally, there are cases where you need to shut down a stitching operation before it can be finished: either because the user backgrounds the app, or they started capturing another panorama. Trying to continue a previous stitch while starting a new capture is a recipe for poor performance, a wonky and unpredictable experience, and a blown memory budget. In these situations, we stream the frames and metadata to storage, which generally takes a couple seconds. Then, when the app is relatively idle again in the foreground, it reinflates these datasets and stitches them, one-by-one. We quickly realized that these stored datasets are immensely valuable for another purpose, as well: recreating a capture for the purpose of tuning and debugging the algorithm. So while the datasets are usually discarded behind the scenes after stitching is complete, there are options in the app to retain them, export them, and send them back to our support folks to help us continue to improve the app.

All of this data plumbing is handled by a dedicated module in the app so that what’s finally passed to the stitching algorithm itself is clean, minimal, and reliable.

Overview of Algorithm

As described above, panoramic stitching occurs in two phases, namely, the initial image registration phase during capture, and the final stitching phase done post-capture. Each phase consists of the following steps:

During capture (registration phase):

  • Capturing and storing only vertical strips (except for first and last frames)
  • Computing transforms between adjacent strips

Post capture (final stitching phase):

  • Computing the transform between first and last frames
  • Closing the loop by computing the drift error between first and last frames, and distributing it across the other frames
  • Smoothing the exposure (i.e., reducing high-frequency exposure variation) by finding changes in intensity between overlapping adjacent frames
  • Smoothing the spatial overlaps to reduce artifacts by blending (feathering) adjacent strips (a sketch of this feathering follows the list)
  • Cropping by computing bounding box to leave out uncovered (black) regions
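
The feathering step doesn’t get its own section below, so here is a minimal sketch of blending two horizontally adjacent strips with a linear feather; the app presumably applies this to the warped strips in panorama space rather than to raw frames.

    import numpy as np

    def feather_blend(left, right, overlap):
        """Blend two horizontally adjacent strips of equal height.

        `overlap` is the width (in pixels) of the region both strips cover.
        Weights ramp linearly from 1 to 0 for the left strip and 0 to 1 for the right.
        """
        alpha = np.linspace(1.0, 0.0, overlap)[None, :, None]     # shape (1, overlap, 1)
        blended = (left[:, -overlap:] * alpha +
                   right[:, :overlap] * (1.0 - alpha)).astype(left.dtype)
        return np.hstack([left[:, :-overlap], blended, right[:, overlap:]])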

Capture Process

The previous cloud-based panorama stitcher algorithm captured 180 frames, or a frame every 2° on average. Since we are no longer uploading input frames to the cloud for processing, we can afford to capture more frames. More specifically, we capture 540 frames, or a frame every 0.67° on average, and process them on-the-fly. Given the spatial proximity between adjacent frames, we can align them using a simple motion model, namely a translational motion model. (The full resolution images are captured, but downsampled versions are used for alignment.) The proximity also allows us to use a cropped version; the cropped image is a vertical strip from the middle of the original frame, with the strip width being a fifth of the original. This reduces computational and memory costs. The figure below shows what is used for generating the panorama; here, N = 540.

If we just concatenate thin central strips of the input frames without any alignment or post-processing, we get:

Computing Transforms Between Adjacent Frames

The motion model between adjacent frames (except for that between the first and last pair) is just that of translation. This is a reasonable approximation given the small motion of 0.67°; in fact, this algorithm is akin to assuming a pushbroom camera model. A pushbroom camera consists of a linear array of pixels that is moved in the direction perpendicular to its length, and the image is constructed by concatenating the 1D images. Note that such an image is multi-perspective, because different parts of the image have different camera centers. Satellite cameras are typically pushbroom cameras used to generate images of the earth’s surface. In our case, we are constructing the panorama using a thin swath from each image, with the exception being the last image. This is due to the possibly large motion between the first and last images.
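
To get a feel for how small that inter-frame motion is, here is a rough calculation of the expected horizontal shift between adjacent frames; the field-of-view value is an assumption for illustration, not the app’s calibration.

    import math

    frame_width_px = 1920        # full HD; alignment actually runs on a downsampled copy
    horizontal_fov_deg = 65.0    # assumed wide-camera field of view, illustration only
    step_deg = 360.0 / 540       # ~0.67 degrees of rotation between adjacent frames

    # Pinhole model: focal length in pixels from the horizontal field of view.
    focal_px = frame_width_px / (2.0 * math.tan(math.radians(horizontal_fov_deg) / 2.0))
    shift_px = focal_px * math.tan(math.radians(step_deg))
    print(f"~{shift_px:.1f} px of horizontal shift per frame")   # ~18 px at full HD

Under these assumptions, this is also the kind of “theoretical shift given the camera focal length” that the registration can fall back to when a frame has too little texture, described below.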

The translation is computed using a direct method in OpenCV. As an image is captured, it is downsampled (once) and cropped. It is then registered with the previous cropped frame. The figure below shows the relative transforms computed: t_{1,0}, t_{2,1}, …, t_{N-2,N-3}, and t_{N-1,N-2}. Note that these transforms are computed on-the-fly as images are captured. If there is insufficient texture in the images (e.g., the images are of a textureless wall), the motion defaults to a horizontal translation equal to the theoretical shift given the camera focal length.

Next, these transforms are concatenated to produce absolute transforms t_{1,0}, t_{2,0}, …, t_{N-2,0}, and t_{N-1,0}. This is used to estimate the length of the panorama in pixels.
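
A stripped-down version of this bookkeeping is sketched below, using OpenCV’s phase correlation as one possible direct method (the exact routine and the low-texture test in the app may differ), along with the running concatenation into absolute transforms.

    import cv2
    import numpy as np

    def pairwise_translation(prev_gray, curr_gray, fallback_shift_px):
        """Estimate the (dx, dy) shift between consecutive cropped, downsampled strips."""
        (dx, dy), response = cv2.phaseCorrelate(np.float32(prev_gray), np.float32(curr_gray))
        if response < 0.05:   # weak peak, e.g. a textureless wall (threshold is illustrative)
            return fallback_shift_px, 0.0
        return dx, dy

    def accumulate(relative_shifts):
        """Concatenate relative transforms t_{i,i-1} into absolute transforms t_{i,0}."""
        absolute = [(0.0, 0.0)]
        for dx, dy in relative_shifts:
            px, py = absolute[-1]
            absolute.append((px + dx, py + dy))
        return absolute   # the final x component approximates the panorama length in pixels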

The first and last frames are a special case. Their transform is critical for loop closure. Since the motion between them can be significant, whole frames are used for registration. We use (2D point) feature-based registration instead of direct dense registration. See the figure below for an example, which produces the homography (2D perspective transform) H_{N-1,0}. The red crosses are the detected corresponding points, and the green line segments correspond to motion from one frame to the other prior to warping. Since we care more about closer alignment at the center of the images (we blend pixels in the central vertical strips), we cull corresponding points located at the image periphery.
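
Sketched with ORB features and RANSAC (the detector, matcher, and culling radius used in the app may well differ), the first-to-last registration looks roughly like this:

    import cv2
    import numpy as np

    def first_last_homography(first_gray, last_gray, center_fraction=0.6):
        """Estimate H_{N-1,0} from matched 2D features, ignoring matches near the periphery."""
        orb = cv2.ORB_create(2000)
        k0, d0 = orb.detectAndCompute(first_gray, None)
        k1, d1 = orb.detectAndCompute(last_gray, None)

        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(d1, d0)   # last frame (query) -> first frame (train)

        w = first_gray.shape[1]
        lo, hi = w * (1 - center_fraction) / 2, w * (1 + center_fraction) / 2
        src, dst = [], []
        for m in matches:
            p_last, p_first = k1[m.queryIdx].pt, k0[m.trainIdx].pt
            # Cull correspondences near the left/right edges; the central strip matters most.
            if lo <= p_last[0] <= hi and lo <= p_first[0] <= hi:
                src.append(p_last)
                dst.append(p_first)

        H, _ = cv2.findHomography(np.float32(src), np.float32(dst), cv2.RANSAC, 3.0)
        return H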

Distributing Errors for Loop Closure

It is highly unlikely that the concatenated translation relating the first and last frames (t_{N-1,0}) and the full-frame homography (H_{N-1,0}) agree exactly. For loop closure, we need to update the concatenated transforms for frames I_1, …, I_{N-2} such that the concatenated transform for I_{N-1} is consistent with H_{N-1,0}. To do this, we first compute the errors in transforming the corners of the image, as shown in the figure below.

The transformed corners for each concatenated transform are shifted by increments of d_A/(N-1), d_B/(N-1), d_C/(N-1), and d_D/(N-1); the updated transforms are then obtained by computing the homographies that map to these adjusted corners.
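
Conceptually, the correction can be sketched as follows: each frame’s transformed corners are nudged by its share of the corner errors d_A through d_D, and a corrective homography is refit from the nudged corners. The linear per-frame share below is our reading of the scheme, not a verbatim port of the production code.

    import cv2
    import numpy as np

    def close_loop(corners_per_frame, corner_errors):
        """Distribute the loop-closure error linearly across the frames.

        corners_per_frame: list of (4, 2) float arrays -- corners A, B, C, D of each frame
                           after applying its concatenated transform t_{i,0}.
        corner_errors:     (4, 2) float array holding d_A, d_B, d_C, d_D, the discrepancy
                           between t_{N-1,0} and H_{N-1,0} at the last frame's corners.
        """
        n = len(corners_per_frame)
        corrections = []
        for i, corners in enumerate(corners_per_frame):
            adjusted = corners + corner_errors * (i / (n - 1))  # frame i absorbs i/(N-1) of the error
            H = cv2.getPerspectiveTransform(np.float32(corners), np.float32(adjusted))
            corrections.append(H)  # compose with t_{i,0} to get the closed-loop transform
        return corrections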

Smoothing Exposure Over Frames

Since the camera auto-exposes as it is manually rotated, there may be significant changes in intensity between nearby frames. An example of a composited panorama without accounting for exposure changes is shown below. The vertical intensity artifacts are evident, despite the feathering-based blending of adjacent strips described in the overview above.

For each pair of adjacent images, we use the computed motion to find the overlap between them. The average colors in the overlap regions are computed, and from these we derive intensity ratios. These ratios are concatenated across the panorama, and adjustments are made for loop closure in a similar manner as for the spatial transforms. The difference is that we make use of anchor frames, whose original intensities are preserved; this significantly reduces the problem of color drift. In our case, we use five anchor frames spaced equally along the panorama. A result of applying this algorithm is shown below.
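
A simplified, single-gain-per-frame version of this chain is sketched below; the app works with color ratios and its own anchor handling, so treat the details as illustrative.

    import numpy as np

    def exposure_gains(overlap_means, num_anchors=5):
        """Compute a per-frame gain that smooths exposure while pinning anchor frames to 1.0.

        overlap_means[i] is the average intensity of frame i measured in its overlap with
        frame i+1 (with the last entry wrapping around to frame 0).
        """
        n = len(overlap_means)
        ratios = [overlap_means[i] / overlap_means[(i + 1) % n] for i in range(n)]

        # Chain the ratios into raw gains, then remove drift so anchor frames keep their exposure.
        raw = np.cumprod([1.0] + ratios[:-1])
        anchors = np.linspace(0, n - 1, num_anchors, dtype=int)
        gains = raw.copy()
        for a0, a1 in zip(anchors[:-1], anchors[1:]):
            t = np.linspace(0.0, 1.0, a1 - a0 + 1)
            correction = (1.0 / raw[a0]) * (1 - t) + (1.0 / raw[a1]) * t
            gains[a0:a1 + 1] = raw[a0:a1 + 1] * correction
        return gains   # multiply frame i by gains[i] before blending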

Using Optical Flow to Improve First-Last Frame Registration

Unless the capture is on a tripod, there is usually a shift in viewpoint between the first and last frames. To mitigate the blending artifacts in loop closure, we apply full-frame optical flow between these images, and warp them towards each other. This step improves the visual quality unless the shift is too significant. Below is the effect of applying optical flow for loop closure (left: without optical flow correction, right: with optical flow correction). The ghosting artifact is substantially reduced.
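
A minimal sketch of that warp, using Farnebäck dense flow and pulling one frame halfway toward the other (the production flow method and how the two halves are blended are likely different):

    import cv2
    import numpy as np

    def warp_toward(reference_gray, moving_gray, moving_color, amount=0.5):
        """Warp `moving_color` part-way toward `reference_gray` using dense optical flow.

        The flow is computed from the reference to the moving frame so remap() can pull
        moving-frame pixels back toward the reference (a backward warp).
        """
        flow = cv2.calcOpticalFlowFarneback(reference_gray, moving_gray, None,
                                            0.5, 4, 21, 3, 5, 1.1, 0)
        h, w = reference_gray.shape
        xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
        map_x = xs + amount * flow[..., 0]
        map_y = ys + amount * flow[..., 1]
        return cv2.remap(moving_color, map_x, map_y, cv2.INTER_LINEAR)

Calling this twice, once in each direction with amount=0.5, warps the first and last frames toward each other so they meet in the middle, which is what reduces the ghosting shown above.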

Examples

Kitchen (iPhone X):

Narrow hallway (iPhone X):

Living room (iPhone 11, ultra-wide mode):

Conclusion

Our on-device stitching system is a good tradeoff between speed of execution and output quality. (The best results are obtained if the device is manually rotated on a tripod during capture.) The key to accomplishing this tradeoff is the application of the pushbroom concept associated with dense capture, which simplifies the process of pairwise frame alignment and subsequent panorama stitching.

Compared to the previous cloud-based version, our on-device stitching algorithm is more lightweight. For instance, a simple shift motion model is used to align consecutive frames during capture; to handle strong parallax, a more complex algorithm would be required. By comparison, the cloud-based version uses optical flow for much more effective reduction of blending artifacts across all the frames, but that step is significantly more compute-intensive. In addition, we use a simpler version of exposure averaging to reduce the effect of temporally changing exposures, and it may be less effective in handling very rapid exposure changes.

The future brings new challenges, of course, some of which we’re already tackling: supporting our system on more devices, taking advantage of exciting new improvements in camera systems and sensors, and finding new ways to help understand the scene at a higher level even while the user is capturing panoramas.

Sing Bing Kang
Zillow Tech Hub

I’m a Distinguished Scientist on the RMX (Rich Media Experiences) team at Zillow Group, working on computational photography.