Handling large data sets at scale

Tim Koh | Pinterest engineer, Discovery

Millions of people search for ideas on Pinterest every day. Since launching search guides two years ago, we’re now handling two billion search queries every month. A core component of our search stack is our query rewrite service, which we use to understand the query and rewrite it as needed. We do many key-value lookups in this service, so it’s essential that we pick the right data structures.

A common solution for a key-value structure is an in-memory HashMap, but it caused us problems for a couple of reasons. First, as our memory footprint continued to grow, we faced garbage collection issues, such as long GC pauses. Second, loading the data into memory became increasingly slow, which made restarting the service a huge pain and slowed development. But we’re getting ahead of ourselves. Let’s first look at a brief overview of the service.

Architecture

When a search is initiated, the query is passed along to the different workers in the system. Each worker performs a rewrite, and the final rewritten query is output. Workers need access to various data sources to make rewrite decisions. Each data source is a key-value store that a worker may access multiple times. For example, the spell correction worker does many lookups to find the most probable valid spell correction.
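To make the shape of the service more concrete, here’s a minimal sketch of the worker/data-source contract described above (the interface and class names are illustrative, not our actual code):

```java
import java.util.List;
import java.util.Optional;

/** Each data source is a read-only key-value lookup (hash map, FST, or HFile-backed). */
interface KeyValueStore {
    Optional<byte[]> get(byte[] key);
}

/** Each worker receives the current query and may rewrite it, consulting its data sources. */
interface RewriteWorker {
    String rewrite(String query);
}

/** The service chains the workers; the output of one becomes the input of the next. */
final class QueryRewriteService {
    private final List<RewriteWorker> workers;

    QueryRewriteService(List<RewriteWorker> workers) {
        this.workers = workers;
    }

    String rewrite(String query) {
        for (RewriteWorker worker : workers) {
            query = worker.rewrite(query);
        }
        return query;
    }
}
```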

Requirements

Now that we have a brief understanding of the system, let’s talk about requirements. When we used in-memory hash maps for our key-value structures, the service had a low latency of <10ms, and we wanted to keep it that way. Since we do many lookups for each search query, we decided not to rely on external data sources because the round-trip latency would be too costly. We also needed a low heap memory footprint to avoid garbage collection issues. Finally, we wanted to reduce the startup time of the service.

We explored two different solutions, each with its own limitations, but found that combining them worked well for us.

Finite state transducer (FST)

A Finite State Transducer (FST) is a special kind of finite state machine that transduces a sequence of input symbols into another sequence of symbols. An FST can be represented as a directed graph where each node represents a possible state and each edge represents a possible transition from one state to another. Each edge is associated with an input symbol and an output symbol. An FST transduces input symbols by traversing the graph from the initial state and accumulating the output symbols associated with the traversed edges. FSTs are widely used in natural language processing.

Because an FST stores input symbols as edges in a directed graph, it stores a set of strings efficiently when they’re short and have many overlapping characters. For example, in the set of strings {“apple”, “apples”, “apply”}, all three share the same states and edges for the common prefix “appl”. An in-memory hash map, on the other hand, needs to store the original strings, because there’s no guarantee their hash values are unique.

Lucene provides an implementation of FST in Java, so we built a simple FST-based key-value store where the keys are NGrams extracted from our search documents and the values are the normalized frequencies of those NGrams. A Scalding job extracts the NGrams from our search documents, builds an FST from them and stores the FST binary on Amazon S3. The FST-based NGram storage is used in our indexing pipeline and our query understanding service to help analyze search documents and queries. By switching from in-memory hash map storage to FST-based storage, we lowered the memory consumption of our query understanding service by 90 percent.
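To give a feel for what this looks like in code, here’s a minimal sketch that builds a tiny FST mapping keys to long-valued frequencies with Lucene and looks one up. The Builder class shown here is from the older Lucene API (newer releases call it FSTCompiler), and the keys and frequencies are made up:

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstKeyValueExample {
    public static void main(String[] args) throws Exception {
        // Outputs are non-negative longs (e.g., scaled NGram frequencies).
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);

        // Keys must be added in sorted order; the shared prefix "appl"
        // is stored only once in the resulting graph.
        String[] keys = {"apple", "apples", "apply"};
        long[] frequencies = {42L, 7L, 13L};

        IntsRefBuilder scratch = new IntsRefBuilder();
        for (int i = 0; i < keys.length; i++) {
            builder.add(Util.toIntsRef(new BytesRef(keys[i]), scratch), frequencies[i]);
        }
        FST<Long> fst = builder.finish();

        // Lookup returns the stored output, or null if the key is absent.
        Long frequency = Util.get(fst, new BytesRef("apples"));
        System.out.println("apples -> " + frequency);  // prints 7
    }
}
```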

HFile

HFile is the file format HBase uses to store its data on disk. We use HBase version 0.98.13 to maintain compatibility with other systems, but the same concepts should apply to other versions as well. Instead of sending a network request to an HBase database, we use code from HBase as a library to read an HFile directly from local disk. Yes, disk reads are slow, but bloom filters and caching keep the number of reads down. Using HFiles worked well for us, so here we’ll share some of the ideas we explored (because if it worked for us, it could work for you too!).
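To show what “HBase as a library” means in practice, here’s a minimal sketch of opening an HFile from local disk and doing a point lookup, roughly against the 0.98-era API. It assumes a file that stores raw string keys; the path is hypothetical, and class and method signatures shift between HBase versions:

```java
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class LocalHFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path("/data/ngrams.hfile");  // hypothetical local file

        // CacheConfig controls the block cache consulted on every read (see below).
        CacheConfig cacheConf = new CacheConfig(conf);
        HFile.Reader reader = HFile.createReader(fs, path, cacheConf, conf);
        reader.loadFileInfo();

        // cacheBlocks=true keeps hot blocks in the block cache; pread=true uses
        // positional reads, which suit random point lookups.
        HFileScanner scanner = reader.getScanner(true, true);
        byte[] key = Bytes.toBytes("apple");
        if (scanner.seekTo(key) == 0) {  // 0 means an exact match was found
            ByteBuffer value = scanner.getValue();
            System.out.println(Bytes.toString(Bytes.toBytes(value)));
        }
        reader.close();
    }
}
```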

In deciding which data sets we could use HFiles for, a major factor was the nature of the data and its access patterns. Because HFile reads can be guarded by a bloom filter, HFiles work perfectly in situations where most lookups are for keys that don’t exist in the data set. For example, most of the lookups for spell correction aren’t valid words, and the bloom filter prevents us from doing unnecessary disk reads.
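HBase can embed the bloom filter in the HFile itself when the file is written. Purely to illustrate the guard pattern, here’s a standalone sketch using Guava’s BloomFilter in front of an arbitrary disk-backed lookup; the numbers and the lookup function are hypothetical, not HBase’s internal mechanism:

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

/** Wraps a slow disk-backed lookup with a bloom filter to skip misses cheaply. */
final class BloomGuardedLookup {
    private final BloomFilter<CharSequence> filter;
    private final Function<String, Optional<byte[]>> diskLookup;

    BloomGuardedLookup(Iterable<String> keys, long expectedKeys,
                       Function<String, Optional<byte[]>> diskLookup) {
        // ~1% false positive rate: a false positive only costs one wasted disk read.
        this.filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), expectedKeys, 0.01);
        for (String key : keys) {
            filter.put(key);
        }
        this.diskLookup = diskLookup;
    }

    Optional<byte[]> get(String key) {
        // "Definitely not in the set" avoids touching disk at all, which is the
        // common case for spell-correction candidates that aren't real words.
        if (!filter.mightContain(key)) {
            return Optional.empty();
        }
        return diskLookup.apply(key);
    }
}
```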

The cache also helps reduce the number of disk reads. HBase comes with a block cache implementation we can use almost right out of the box (yes, there’s some tuning involved, such as deciding which cache implementation to use and how much heap and/or off-heap memory to allocate to it). (This blog post gives a good explanation of the different implementations, and this post provides insight into tuning them.) The different implementations let us cache blocks on the heap, in off-heap memory, or on disk, such as an SSD. The trade-off is, of course, between performance and the size of the data you can serve. Either way, this makes HFiles scalable to much larger data sets, since we’re not limited to the heap.
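For reference, the knobs we mean here are ordinary HBase configuration properties picked up by the CacheConfig the reader is built with. A minimal sketch follows; property names and units vary a bit across HBase versions, so double-check against the version you run:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;

public class BlockCacheConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // On-heap LRU block cache: the fraction of the JVM heap used for cached blocks.
        conf.setFloat("hfile.block.cache.size", 0.2f);

        // To move the cache off the heap (or onto an SSD-backed file), HBase's
        // BucketCache can be enabled instead; off-heap mode also needs
        // -XX:MaxDirectMemorySize on the JVM:
        //   conf.set("hbase.bucketcache.ioengine", "offheap");  // or "file:/ssd/bucketcache"
        //   conf.setFloat("hbase.bucketcache.size", 4096);      // cache size (MB in newer versions)

        // The CacheConfig built from this Configuration is what the HFile reader
        // in the earlier sketch consults on every block read.
        CacheConfig cacheConf = new CacheConfig(conf);
        System.out.println(cacheConf);
    }
}
```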

Another important consideration was which encoding and/or compression algorithm to use for the data. (These links provide information on choosing the right algorithm, link 1 and link 2.) Whether to use encoding and/or compression at all isn’t obvious and depends on the data and the use case. The obvious trade-off is the size of the data vs. the speed of retrieval, but there are other, more subtle considerations. For example, encoding allows for a smaller block size in the cache, so more data can fit, but decoding takes time. Similarly, compression makes the data smaller, so disk reads are faster; however, there’s also the time cost of decompression. Another question is whether to decompress a block before or after it’s cached.
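Encoding and compression are chosen when the HFile is written. As a rough sketch of where those knobs live in the 0.98-era API (the path, column names, and values are made up, and this writes HBase-style KeyValue cells rather than our actual key-value layout):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path("/tmp/ngrams.hfile");  // hypothetical output path

        // Per-file choices: block size (the unit of caching and IO),
        // data block encoding, and compression codec.
        HFileContext context = new HFileContextBuilder()
            .withBlockSize(64 * 1024)
            .withDataBlockEncoding(DataBlockEncoding.FAST_DIFF)  // or PREFIX, NONE, ...
            .withCompression(Compression.Algorithm.GZ)           // or SNAPPY, LZ4, NONE
            .build();

        HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
            .withPath(fs, path)
            .withFileContext(context)
            .create();

        // Cells must be appended in sorted key order.
        writer.append(new KeyValue(Bytes.toBytes("apple"), Bytes.toBytes("f"),
            Bytes.toBytes("freq"), Bytes.toBytes("42")));
        writer.close();
    }
}
```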

Serving data from HFiles solved both of the above problems. There’s no need to load the data into memory, which greatly reduces startup time, and the amount of heap memory to use for caching is configurable, so we could tune this to avoid garbage collection issues. In addition, the ease of scalability is a huge win as we continue to grow our data size. Using HFile has also provided us with opportunities for product improvements since we’re able to support product changes that require large amounts of data at serving time.

Conclusion

FSTs and HFiles are both great ways to tackle the problems we faced. With the size savings gained from using FSTs, we can keep our smaller data sets on the heap. We found FSTs to be a little faster than HFiles, so we use FSTs where we can and HFiles for data sets that are too large to fit into memory. The ability to serve large data sets also opened up opportunities for new product features that rely on large amounts of data. Through the use of these structures, we stabilized our systems and made our developers happier.

Acknowledgements: Thanks to Keita Fujii for co-writing this post and implementing FSTs in our system, as well as Roger Wang for his technical guidance and help in important optimizations.