The Big Data Infrastructure Powering Autocomplete for Adobe Lightroom
Chances are you’re familiar with autocomplete: a feature in many modern applications that suggests completions as you type. In Adobe Lightroom, one of our most popular apps for organizing and editing images on desktop and mobile platforms, autocomplete suggestions include the customer’s assets and related metadata, guiding them through the product and helping them discover product features. Autocomplete also reduces spelling errors, which in turn reduces null search results and improves the overall search experience.
As shown in the example above for Lightroom desktop, when you type “f” in the Lightroom search field, autocomplete suggests completions, including:
- “Flag,” a facet name that starts with “f”.
- Facet values that contain “f”, such as location, lens type, and F-stop value.
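The matching behavior above can be sketched in a few lines: facet names match on prefix, while facet values match anywhere in the string. This is a minimal illustration with hypothetical data, not Lightroom’s actual matcher (which, as described below, is backed by Elasticsearch).

```python
def suggest(query, facet_names, facet_values, limit=5):
    """Return autocomplete suggestions for a query.

    Facet names match on prefix; facet values match on substring.
    Illustrative sketch only, with made-up field values.
    """
    q = query.lower()
    names = [n for n in facet_names if n.lower().startswith(q)]
    values = [v for v in facet_values if q in v.lower()]
    return (names + values)[:limit]

facet_names = ["Flag", "Camera", "Lens"]
facet_values = ["San Francisco", "f/2.8", "50mm f/1.8"]
print(suggest("f", facet_names, facet_values))
# ['Flag', 'San Francisco', 'f/2.8', '50mm f/1.8']
```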
Autocomplete is particularly useful on mobile devices, as illustrated below, since it reduces the effort required to perform a search on a small device where typing can be hard. For both mobile and desktop users, it’s a huge time saver. In this post, we’ll break down how the Adobe Sensei & Search team partnered with the Lightroom team to implement autocomplete in Lightroom, and shed some light on the scaling challenges of indexing every aspect of an asset for such a popular application.
Breaking down autocomplete in Lightroom
In general, there are two main types of autocomplete:
- Autocomplete based on user queries: Suggestions are non-personalized, drawn from the most common queries across all users of the product and their search hit counts; they are not necessarily tailored to a particular user.
- Autocomplete based on metadata: The results are personalized and based on the specific user’s content and asset metadata. The results are tailored specifically for each user, and every user may get different results for the same queries. This is the kind of autocomplete used in Adobe Lightroom.
In Lightroom, users can type any of three categories of terms and receive matching completions:
- Metadata values
- Facet values
- Facet names
Next, let’s look at a common use case for autocomplete in Lightroom:
As illustrated below, by simply starting to type “location” in the search field, Lightroom autocomplete shows suggestions of the places where your photos were taken, allowing you to easily find all photos associated with a place or a recent trip.
The challenges associated with autocomplete in Adobe Lightroom
Lightroom autocomplete works by aggregating common terms across metadata and other user information, indexing the result, and displaying suggestions to the user with a count-based ranking. It is personalized, requiring a custom aggregation of information for each user.
The amount of data that Lightroom generates in production is huge. Every user action generates an event: uploading a new asset (a photo or other media file), changing an asset’s metadata, or simply applying a filter to an existing photo. Each of these events is sent to the HBase datastore. Lightroom has many customers, and each customer may have anywhere from a few thousand assets to millions in the case of professional customers, which gives you a sense of the scale at which this aggregation needs to be done. The scale increases even more once you understand how we fan out this information, as described in the next section.
Batch ingestion pipeline
A Storm pipeline processes all Lightroom data that comes from asset metadata changes (upsert/delete) and populates the Search Elasticsearch index. Apache Storm is a distributed real-time computation system that makes it easy to reliably process unbounded streams of data.
Storm also writes that data into an HBase data store called the content store. Apache HBase is a large scale distributed key-value data store and the data residing in it acts as the “source of truth” for Lightroom autocomplete and the starting point of the batch ingestion pipeline.
A Spark job then picks up the data in the content store, computes the aggregated metrics before finally writing into the Autocomplete Elasticsearch index. Apache Spark provides both batch and stream-processing capabilities and is ideal for performing distributed analytics at scale. It is used for tasks from batch data ingestion for analytical pipelines to efficient large-scale feature aggregation for machine learning applications.
Each asset event contains a payload of all possible information about the asset, and is stored in the HBase datastore. Users can search within their collection(s) of assets in Lightroom, using basic metadata and facet textual metadata, as detailed below.
Currently, Lightroom provides autocomplete for 22 fields. These fields are all part of each asset’s event payload and span a variety of data types.
Now comes the slightly tricky “fan out” part. Since autocomplete in Lightroom is personalized, it needs to aggregate the counts for values in each of these fields across all assets for a given user, and store them in an Elasticsearch index. The aggregated counts are needed to ensure that autocomplete on those fields is consistent with the user’s activity, so that we don’t show suggestions for metadata or assets that have been removed. This step essentially converts asset-level information into net aggregated user-level information, as illustrated below.
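The fan-out step described above can be sketched as a roll-up from asset-level records to per-user (field, value) counts. The field names and events below are hypothetical; in production this aggregation runs as a distributed Spark job rather than an in-memory loop.

```python
from collections import Counter, defaultdict

# Hypothetical asset events: each carries a user id and facet metadata.
assets = [
    {"user": "u1", "location": "Paris", "lens": "50mm"},
    {"user": "u1", "location": "Paris", "lens": "35mm"},
    {"user": "u1", "location": "Tokyo", "lens": "50mm"},
    {"user": "u2", "location": "Paris", "lens": "85mm"},
]

def aggregate_per_user(assets, fields=("location", "lens")):
    """Roll asset-level metadata up to per-user (field, value) counts."""
    counts = defaultdict(Counter)
    for asset in assets:
        for field in fields:
            if field in asset:
                counts[asset["user"]][(field, asset[field])] += 1
    return counts

per_user = aggregate_per_user(assets)
print(per_user["u1"][("location", "Paris")])  # 2
```

The resulting per-user counts are what get written to the autocomplete Elasticsearch index, so each user only ever sees suggestions backed by their own assets.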
The logic also gets more involved for partial updates to existing assets. Inserting an asset is relatively easy, but handling the deletion of assets or even just removing metadata for an existing asset is challenging, since the user should not get suggestions for metadata that they’ve explicitly deleted!
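One way to see why deletes are the hard case: an insert only ever increments a count, but a delete must decrement it and, when the count reaches zero, remove the entry entirely so the deleted metadata stops being suggested. A minimal sketch of that bookkeeping (the event shape here is hypothetical):

```python
from collections import Counter

def apply_event(counts, event):
    """Apply an upsert/delete event to one user's (field, value) counts.

    counts: Counter mapping (field, value) -> number of assets carrying it.
    event:  {"type": "insert" | "delete", "metadata": {field: value}}
    Counts that drop to zero are removed, so explicitly deleted metadata
    no longer surfaces as a suggestion. Illustrative sketch only.
    """
    delta = 1 if event["type"] == "insert" else -1
    for field, value in event["metadata"].items():
        key = (field, value)
        counts[key] += delta
        if counts[key] <= 0:
            del counts[key]
    return counts

counts = Counter()
apply_event(counts, {"type": "insert", "metadata": {"location": "Paris"}})
apply_event(counts, {"type": "delete", "metadata": {"location": "Paris"}})
print(("location", "Paris") in counts)  # False: deleted metadata is gone
```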
The primary datastore is HBase, a distributed key-value store, which is well suited to random-access reads: it retrieves records quickly by row key and offers high-throughput guarantees. Our use case, however, required a very large number of reads to pull the unchanged data from the base content table, at times 100,000 records or more. Past a certain number of records, we noticed a significant drop in HBase retrieval performance: beyond 500,000 keys, filtering HBase by row key became extremely slow, so we had to estimate how many row keys we could efficiently filter at one time.
We worked around this bottleneck by making requests to HBase in batches. The number of executors that could read from HBase was limited by how the data was partitioned: even with a 100-executor Spark job, if HBase has only five partitions, then only one executor is assigned per partition and the other 95 sit idle.
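The batching above amounts to slicing the full set of row keys into fixed-size chunks and issuing one multi-get per chunk. A minimal sketch, where `fetch_batch` stands in for a hypothetical HBase client call (the batch size of 10,000 here is arbitrary; in practice it would come from the measured filtering threshold):

```python
def batched(row_keys, batch_size):
    """Yield fixed-size batches of row keys for successive HBase multi-gets."""
    for i in range(0, len(row_keys), batch_size):
        yield row_keys[i:i + batch_size]

def fetch_all(row_keys, fetch_batch, batch_size=10_000):
    """Read all records by issuing one bounded request per batch.

    fetch_batch is a placeholder for the real datastore client call.
    """
    records = []
    for batch in batched(row_keys, batch_size):
        records.extend(fetch_batch(batch))
    return records
```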
Writing to S3
We explored replacing the HBase content store with data stored in an Amazon S3 bucket, partitioned at the user level in Parquet format. S3 is a flat blob storage system with no real concept of a file directory structure, and we ran into S3 rate limits while writing from a Spark job: S3 currently supports 3,500 PUT/POST/DELETE requests per second per prefix.
Another issue is that, when writing to S3 at a high rate, some spots of the storage namespace can get “hot” while others stay cold (untouched), and as a result we started seeing records being dropped. This happens because the data is not written uniformly; it depends on how the data (assets) for various users is distributed. Since these hot spots were causing data loss, we decided to rethink how we partitioned the data and to write it out to HDFS instead.
When dealing with large data sets, a skew in the data (data imbalance) can affect the speed at which the pipeline processes the data, since certain stages in the Spark job could be held up, depending on how the data is partitioned and sent to the various workers.
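A common way to counter this kind of skew and hot-spotting is to “salt” the partition key: prefixing each user id with a stable hash bucket spreads heavy users across partitions instead of piling their writes onto one. This is a general technique we sketch here under that assumption, not necessarily the exact scheme used in the pipeline; the bucket count of 32 is arbitrary.

```python
import hashlib

def salted_key(user_id, num_buckets=32):
    """Prefix a user id with a stable hash bucket.

    The bucket is derived deterministically from the id, so all records
    for one user land in the same bucket, while different users spread
    evenly across buckets (and hence across partitions/prefixes).
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return f"{bucket:02d}|{user_id}"
```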
Updates in Parquet
In our efforts to replace the base HBase table, we tried storing the data in various storage formats, as mentioned above. We wanted a format that would let us partition and write to disk, as well as efficiently read and filter the data, mainly to get around the HBase bottleneck. We stored the data for many of our intermediate stages in Parquet.
Parquet is a columnar storage format, whose benefits include:
- Fast reading through compression and support for partition pruning
- Fast filtering: column projection and predicate pushdown
But here again we hit a hiccup: Parquet does not support in-place updates, so the only way to modify a Parquet file is to rewrite it completely. This becomes problematic when, for example, a user has millions of assets but only a few changed in the last few minutes: we would have to rewrite all of those assets just to incorporate the changes to a few.
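The rewrite pattern this forces is copy-on-write over a whole partition: unchanged rows are copied verbatim, updated rows replace their old versions, and new rows are appended. A minimal sketch over plain dictionaries (the `asset_id` key name is hypothetical), which makes the cost visible: the output always contains every row, no matter how few actually changed.

```python
def rewrite_partition(existing_rows, updates, key="asset_id"):
    """Merge a handful of updated rows into an immutable partition.

    Because the file format (e.g. Parquet) cannot be updated in place,
    the whole partition is re-materialized: every existing row is either
    copied or replaced, and brand-new rows are appended at the end.
    """
    pending = {u[key]: u for u in updates}
    merged = [pending.pop(row[key], row) for row in existing_rows]
    return merged + list(pending.values())  # leftovers are new rows
```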
Throttling writes to Elasticsearch
Throttling “write” operations to the Elasticsearch cluster from a Spark job is difficult. While indexing tags, we ran into bulk rejections from the cluster, so it is critical to find a way to constrain these writes and protect the cluster. This can be done in two ways:
- Controlling the number of concurrent writes (the number of Spark executors running and the number of task threads per executor).
- Controlling the batch size of each entry, which helps control the number of documents written at a given time in each request.
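Both controls can be seen in one small sketch: a bounded batch size caps how many documents each request carries, and a semaphore caps how many requests are in flight at once. Here `send_bulk` stands in for the actual Elasticsearch bulk call; the specific limits are illustrative, not the production values.

```python
import threading

def bulk_index(docs, send_bulk, batch_size=500, max_in_flight=2):
    """Throttled bulk indexing: bounded batch size, bounded concurrency.

    send_bulk is a placeholder for a real bulk-index client call. The
    semaphore blocks new requests until an in-flight one completes,
    which protects the cluster from a flood of simultaneous writes.
    """
    sem = threading.Semaphore(max_in_flight)
    threads = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        sem.acquire()  # wait for an in-flight slot to free up

        def work(b=batch):
            try:
                send_bulk(b)
            finally:
                sem.release()

        t = threading.Thread(target=work)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```

A production job would additionally retry rejected batches with backoff, but the two knobs above are the ones called out in this section.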
Key takeaways from implementing autocomplete in Lightroom
Working on this search feature was not always straightforward: our logic worked smoothly in development and staging environments, but not in production. It was also a lesson in flexibility, since we had to tear up our old design and come up with a completely new one to tackle the issue. Finally, it highlighted the importance of having an environment that closely replicates not just the functional flow of data in production, but also the rate at which data arrives.
One of the new autocomplete features powered by this pipeline is range suggestions for facet values, which help you find all photos that fall within a given range:
In addition, Lightroom autocomplete also now enables you to directly find photos of specific people from your collection of photos by providing suggestions for the names of people in your collection.
Moving forward, we plan to look at ways to optimize the infrastructure costs incurred, as well as explore alternate storage options and a new design (such as structured streaming) to improve the pipeline while reducing costs.
For more examples of the work the Adobe Sensei & Search team is engaging in to power intelligent features across our suite of products, head over to the Adobe Sensei hub on our Tech Blog and check out Adobe Sensei on Twitter for the latest news and updates.