Processing Millions of Listing Images Asynchronously

How we improved our image processor with fear-driven programming

Stephen Malina
Compass True North
9 min read · May 22, 2017


A few months ago, we dramatically improved the throughput and error handling of our image processing pipeline, which processes real estate listing images. But first, a flashback to a little over a year ago.

Not Again!

7:15 PM: It’s happening again. Reports of missing listing image updates are flooding in. I see Finagle networking exceptions and signs of deadlock in the logs of one of our backend Thrift services. This is the latest in a series of image processor issues. The late-night debugging, compounded by my failure to solve the problem, has led to a dangerous rise in my stress levels over the past few months.

Screenshot of our Slack chat from that night

I reach out to our DevOps team to help debug the issue. We go back and forth in our incident-specific Slack channel, #war-room. Someone tracks down the root failure. Requests to our image processing service — the service responsible for downloading, resizing, and uploading listing images to S3 — are timing out … again. Our load balancer, client, and server timeouts don’t match. The string of image processing issues we’d fixed over the past few months had fatigued us, but it had also left us open to more radical solutions.

Original Listing Data Pipeline

Our listing pipeline pulls and processes listings, along with their images, from APIs provided by third-party data providers called Multiple Listing Services (MLSs). It has four major steps, shown in the diagram of our listings pipeline below (and sketched in code after it):

  1. Transforming listings from MLS data schemas to our own.
  2. Geo-locating listings by address. We use a combination of census data and Google Maps to connect each street address to a latitude and longitude.
  3. Copying and resizing images for presentation in our Search, Listing Page, and Collections products.
  4. Combining multiple listings for the same home into a unified view.
    If a home has sold multiple times over a few years, we want to aggregate its atemporal data into a single object (this step is not shown in the diagram below).
Listings Pipeline with Old Image Processing Flow
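
To make the flow concrete, here is a toy sketch of those four steps chained together. Everything in it (the field names, the placeholder geocode, the merge rule) is hypothetical and only meant to show the shape of the pipeline, not our actual code.

```python
# A toy sketch of the four pipeline steps; all field names and bodies are
# hypothetical stand-ins, not our production implementation.

def transform_to_internal_schema(raw):          # step 1: MLS schema -> ours
    return {"address": raw["StreetAddress"], "image_urls": raw.get("Photos", [])}

def geolocate(listing):                         # step 2: address -> lat/lng
    listing["lat"], listing["lng"] = 40.7410, -73.9897   # placeholder lookup
    return listing

def process_images(listing):                    # step 3: copy + resize for our products
    listing["processed_images"] = [url + "?w=1024" for url in listing["image_urls"]]
    return listing

def merge_listings(listing, previous=None):     # step 4: unify multiple sales of one home
    return {**(previous or {}), **listing}

def run_pipeline(raw_listing, previous=None):
    listing = transform_to_internal_schema(raw_listing)
    listing = geolocate(listing)
    listing = process_images(listing)
    return merge_listings(listing, previous)
```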

The initial version of the pipeline used a synchronous API provided by our image processing service for step 3. This image processing API accepted requests containing a listing object with image URLs. For each request, the service downloaded images, resized them, and uploaded them to S3.
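
For illustration, a synchronous flow like this looks roughly like the sketch below, written with requests, Pillow, and boto3; the helper name, bucket, and target size are mine for this post, not the service’s actual API.

```python
# Sketch of a synchronous image flow: download, resize, upload to S3.
# The function name, bucket, and target size are illustrative.
import io

import boto3
import requests
from PIL import Image

s3 = boto3.client("s3")

def process_listing_images_sync(listing_id, image_urls, bucket="example-listing-images"):
    """Blocks the caller (the listings pipeline) until every image is handled."""
    for i, url in enumerate(image_urls):
        raw = requests.get(url, timeout=10).content              # download
        img = Image.open(io.BytesIO(raw)).convert("RGB")
        img.thumbnail((1024, 1024))                               # resize in place
        out = io.BytesIO()
        img.save(out, format="JPEG")
        out.seek(0)
        s3.upload_fileobj(out, bucket, f"{listing_id}/{i}.jpg")   # upload
```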

Scalability Woes

As we scaled to more MLSs, this synchronous image processing system slowed down the rest of the pipeline and broke more often. Processing an entire MLS listing dataset or large hourly incremental updates from multiple MLSs could overload the synchronous image processing flow at any time, leading to situations like the one described in “Not Again!” above.

Fortunately, we gave up on patching our synchronous system to accommodate the increased scale. Rather than trying to decrease the likelihood of transient failures and overload, we wanted greater control over the load on the system and over the impact of failures when they did happen.

The past incidents related to the image processor highlighted three key issues with our synchronous image processing flow. The listings pipeline:

  1. Had to wait for image processing to finish.
  2. Overloaded the image processor and caused it to fail.
  3. Couldn’t recover images lost as a result of transient image processor failures.

Robust Image Processing

Following in the footsteps of many data engineers before us, we decided to solve our problem by making our image processing system asynchronous. We chose message queues as our async communication mechanism based on the four key benefits they would bring to our listing pipeline and its image processing flow:

  1. Faster listings processing: the listings pipeline could request image processing and immediately continue with its subsequent steps.
  2. Steady request load: image processors could pull from the queue at a rate they could handle.
  3. Horizontal scalability: if the queue backed up, we could increase the number of image processors, and they would start consuming from the request queue immediately.
  4. Reprocessing: the image processor could recover from transient failures by re-queueing requests.

We decided to build our message queues on our newly created internal Publish-Subscribe (PubSub) library, backed by Amazon’s SNS and SQS. In this new system, request and response queue items replace the synchronous image processing requests and responses, decoupling image processing from the listings pipeline.

The request queue items include unprocessed images and just enough information — a unique foreign key, an MLS source tag, and a region name — to keep track of the listings to which each set of images belongs. The response queue items include processed images and the same listing tracking information.
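
As a concrete (and simplified) illustration, a pair of queue items might look like the dictionaries below. The exact field names are hypothetical; the tracking fields mirror what is described above.

```python
# Illustrative shape of the queue items (field names are hypothetical).
image_request = {
    "listing_key": "mls-12345-abc",      # unique foreign key back to the listing
    "mls_source": "example_mls",         # MLS source tag
    "region": "nyc",                     # region name
    "image_urls": [                      # unprocessed images to download and resize
        "https://photos.example-mls.com/12345/1.jpg",
        "https://photos.example-mls.com/12345/2.jpg",
    ],
}

image_response = {
    "listing_key": "mls-12345-abc",      # same tracking fields as the request...
    "mls_source": "example_mls",
    "region": "nyc",
    "processed_images": [                # ...plus the processed results on S3
        {"s3_key": "mls-12345-abc/1.jpg", "width": 1024, "height": 683},
    ],
}
```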

Listings Pipeline with New Image Processing Flow

Our first new component, the PubSub Image Request Processor, processes images from the request queue and pushes the processed image results onto the output queue. Our second new component, the PubSub Image Response Processor, consumes from the output queue and saves the processed results to region-specific image MongoDB collections. [1]
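
Our PubSub library hides the SNS/SQS plumbing, but conceptually the request processor runs a loop like the raw boto3 sketch below; the queue URL, topic ARN, and the resize_and_upload helper are all illustrative.

```python
# Conceptual sketch of the request processor's consume loop with raw boto3;
# our PubSub library wraps these details, and the URLs/ARNs are examples.
import json

import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

REQUEST_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-requests"
RESPONSE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:image-responses"

def resize_and_upload(image_urls):
    """Stand-in for the CPU-heavy download/resize/upload step sketched earlier."""
    return [{"s3_key": url.rsplit("/", 1)[-1]} for url in image_urls]

def run_request_processor():
    while True:
        received = sqs.receive_message(
            QueueUrl=REQUEST_QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,               # long polling
        )
        for msg in received.get("Messages", []):
            request = json.loads(msg["Body"])
            results = resize_and_upload(request["image_urls"])
            sns.publish(
                TopicArn=RESPONSE_TOPIC_ARN,
                Message=json.dumps({**request, "processed_images": results}),
            )
            # Delete only after publishing, so a crash re-queues the work.
            sqs.delete_message(QueueUrl=REQUEST_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```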

Performance Bottlenecks of New Image Processing Components

Combining the request and response processors would have simplified the new architecture, but we kept these components separate to optimize for their different resource requirements. The request processor performs CPU-heavy image resizing. It uses multiple threads, each matched to one CPU core, and therefore works best running on its own set of machines. The response processor formats and saves processed image results to Mongo, I/O-heavy operations that demand little from the CPU. Furthermore, using too many parallel threads in the response processor could overload Mongo, a hard-to-track and dangerous failure mode we’ve since encountered.
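
To illustrate the resource reasoning (our processors are not written in Python, and the numbers below are made up), this is roughly how you would size workers for each side:

```python
# Illustrative worker sizing for the two processors; the figures are examples,
# not our production configuration.
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Request processor: CPU-bound image resizing, so one worker per core
# (in Python you'd reach for processes to sidestep the GIL).
request_workers = ProcessPoolExecutor(max_workers=os.cpu_count())

# Response processor: I/O-bound formatting and Mongo writes, cheap on CPU,
# but cap the parallelism so a burst of results can't overload Mongo.
response_workers = ThreadPoolExecutor(max_workers=8)
```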

Rollout and Results

My manager deserves most of the credit for the smooth rollout of our asynchronous image processor. I was hungry for the glory of rolling out the cool new infrastructure, but in his wisdom he insisted that we move off our synchronous system incrementally, using MLS-specific feature flags to turn on the asynchronous mode at a rate of one feed per day.
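
The flag itself can be as simple as a per-feed allowlist; the sketch below is hypothetical (the flag store, feed names, and helper are made up) and only meant to show the shape of the gate.

```python
# Hypothetical sketch of a per-MLS rollout flag; names are illustrative.
ASYNC_IMAGE_MLSS = {"example_mls_a", "example_mls_b"}   # grown by one feed per day

def image_processing_mode(mls_source: str) -> str:
    """Pick the flow for a feed: 'async' once its flag is on, else the old path."""
    return "async" if mls_source in ASYNC_IMAGE_MLSS else "sync"

print(image_processing_mode("example_mls_a"))  # -> async
print(image_processing_mode("legacy_mls"))     # -> sync
```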

Overload

9:00 AM: I’m walking into work, caffeinated and ready to go. I open up my computer and see the Slack notification icon from my nightmares, one new message from our product manager: “Hey, is something going on with images? All the new ones are showing the ‘coming soon’ placeholder.”

Uh oh. I check the SQS dashboard — the request queue is piled up with a few hundred thousand images. I panic for a minute and then ping #devops.

I start my back-of-the-envelope (read: Google Calculator) calculations with a goal: drain the queue in an hour. With on the order of 100,000 listing image requests (listings average ~20 images each) and each thread processing about one image per second, 60 threads working roughly full-time should get us through all the images in less than an hour. I launch four additional c4.2xlarge EC2 machines (two more than I’d initially told #devops I would) and deploy the request processor job on each. I wait. An hour of frantic checking and re-checking of Amazon’s SQS CloudWatch graphs passes, and the queue is back to normal. I shut down the machines.
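
For the curious, the Google Calculator math spelled out; the thread count and per-image rate come from above, and the backlog size was only ever a rough figure.

```python
# Back-of-the-envelope throughput check; all figures are rough.
threads = 60
images_per_second_per_thread = 1

images_per_hour = threads * images_per_second_per_thread * 3600
print(images_per_hour)   # 216000 -- enough to chew through a backlog of a
                         # couple hundred thousand images in roughly an hour
```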

In our old system, jobs would have started failing at this point, so a backlog that merely showed ‘coming soon’ placeholders caused far less stress than immediate failures would have. The disaster-averting potential of our new infrastructure affirmed our decision to move forward with this project. Asynchronous image processing also eliminated our data-loss problem: whereas in our old architecture exceptional failures caused us to drop images, requiring manual steps to reprocess them, our new system automatically re-queued image requests.

Testing

While asynchronous processing has clear advantages, it also has downsides. Our new system makes development less convenient because we can no longer run everything locally. Testing changes that affect image processing requires running the two jobs and making sure they use the right SNS and SQS namespaces. [2]

Debugging

Debugging actual issues with the image processor has become easier in some ways, harder in others.

Pros

  • We can trace failures back to the image processor and avoid layers of misleading errors.

Cons

  • We can’t query or look at the items on our SQS request or response queues until those items arrive at the image processor.
  • Troubleshooting failures can require cross-referencing logs between image processors and their clients to reconstruct the timeline of events.
  • We have to ensure that real image processing failures (e.g., failing to download an image) don’t get re-queued forever [3].
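
One common guard for that last point with SQS (not necessarily what we run in production) is a dead-letter queue with a bounded receive count; the queue URL and ARNs below are illustrative.

```python
# One common way to bound retries on SQS: after maxReceiveCount failed
# receives, the message moves to a dead-letter queue instead of being
# re-queued forever. The URL and ARNs here are examples.
import json

import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/image-requests",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:image-requests-dlq",
            "maxReceiveCount": "5",   # a few chances for transient failures, then park the message
        })
    },
)
```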

Follow-Up Improvements

Unexpectedly, a few months after introducing this change, we started encountering more and more queue backups when we needed to reprocess large batches of images. Updates to high-priority listings’ images would get blocked for hours when we tried to batch-reprocess feed data, angering customers. After this same issue arose a few times, we took two mitigating actions: adding Amazon CloudWatch alerts that trigger when the number of items in the queue exceeds a threshold, and splitting the request queue into two queues with different priorities.

In retrospect, setting up the Amazon CloudWatch alerts was obvious. These alerts email our team members when the number of items in the queue exceeds 100,000, the magic number we found, through trial and error, to signal potential issues.
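
A sketch of the kind of alarm we mean, using boto3 against the standard SQS queue-depth metric; the queue name and the SNS topic that emails the team are illustrative.

```python
# Sketch of a CloudWatch alarm on SQS queue depth; names and ARNs are examples.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="image-request-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "image-requests"}],
    Statistic="Maximum",
    Period=300,                      # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=100_000,               # the "magic number" found through trial and error
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-team-alerts"],  # topic with email subscription
)
```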

During our initial implementation, we didn’t split the request queue into two queues because we wanted to keep the system as simple as possible. In the end, we split the request queue into a time-sensitive queue and a batch queue. This split required less work than expected, since our PubSub processor library allows per-thread configuration of SQS topics. For image processing, the processor polls the time-sensitive request queue on some threads and the batch request queue on others, providing simple but good-enough load balancing between the two queues. We haven’t had a single issue with queue backup in production since we split the queues.
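
A rough sketch of the idea follows; our PubSub library does this through configuration rather than hand-rolled threads, and the queue names, thread counts, and elided polling loop are all illustrative.

```python
# Illustrative per-thread queue assignment: most worker threads poll the
# time-sensitive queue, a couple poll the batch queue. Names are examples.
import threading
import time

TIME_SENSITIVE_QUEUE = "image-requests-priority"
BATCH_QUEUE = "image-requests-batch"

def consume_forever(queue_name):
    """Stand-in for the request processor's polling loop against one queue."""
    while True:
        # receive from `queue_name`, resize images, publish results (elided)
        time.sleep(1)

threads = (
    [threading.Thread(target=consume_forever, args=(TIME_SENSITIVE_QUEUE,), daemon=True) for _ in range(6)]
    + [threading.Thread(target=consume_forever, args=(BATCH_QUEUE,), daemon=True) for _ in range(2)]
)
for t in threads:
    t.start()
```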

Takeaways

Message queues are often recommended by bloggers on the internet, but I’ll do my best to offer some unique insights.

I stand by a contested decision we made to limit the scope of this project. Switching from Mongo or moving image processing to Python would have slowed down the project and increased the likelihood of failure. While new features and technologies excite me as much as anyone, keeping the scope of our change relatively narrow allowed us to focus and deeply understand the requirements and risks associated with it.

This project taught me fear-driven development, or how to let fear constructively drive my development efforts. Before my work on the image processor, I’d constantly feared that changing or deploying it would cause an outage. Occasionally, I’d even get imposter syndrome, wondering if the fact that I was so scared to touch the ImageMagick operations in the processor meant that I wasn’t a “real programmer”.

https://xkcd.com/378/

Changing the image processor taught me to treat fear as a signal. I’ve since been delighted to see other relatively new members of the team come to me with changes they want to make to the image processor and not experience the same crippling fear.

Obligatory Compass Plug

If you liked this blog post and would like to work on things similar to what I’ve described, apply to work on the data team at Compass!

Acknowledgements

Thank you to Raymond Wang, Matthew Ritter, Russell Stephens, Raquel Bujans, and Jean Barmash for their helpful feedback on drafts of this post.

Footnotes

[1] As an aside, when we first presented this piece of the design to engineers on other teams, a few objected that using a SQL database would make joining images back to their listings easier. This is undoubtedly true, but we ignored it anyway. Taking advantage of SQL joins for this would have required moving our listing collections to SQL in addition to these new image collections. This violates the informal rule I mentioned before: decouple infrastructural changes as much as possible (some might call this the legacy code version of “worse is better”).

[2] To support unit and end-to-end testing of components that use the PubSub, we’ve built in-process and local PubSub implementations that allow clients to use the PubSub without an external dependency on AWS.
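
A minimal sketch of what an in-process PubSub can look like for tests; our actual library’s interface differs, this only shows the idea.

```python
# Toy in-memory PubSub for tests: no SNS/SQS or network required.
from collections import defaultdict, deque

class InProcessPubSub:
    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, topic, message):
        self._queues[topic].append(message)

    def poll(self, topic):
        """Return the next message for `topic`, or None if the queue is empty."""
        queue = self._queues[topic]
        return queue.popleft() if queue else None

# Usage in a test:
pubsub = InProcessPubSub()
pubsub.publish("image-requests", {"listing_key": "mls-1", "image_urls": []})
assert pubsub.poll("image-requests")["listing_key"] == "mls-1"
```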

[3] This becomes more complicated for transient failures, where requests need to be re-queued after a data provider fixes some images. We’ve settled on a simple but reliable solution that allows developers to manually trigger an ingestion pipeline run which includes downloading and reprocessing the images for one or more listings. In the old system, where failures cascaded, an incident would require running this mode for hundreds or even thousands of listings. In the new system, where failures affect specific listings and don’t cascade, we seldom need to resort to this manual solution.
