How We Improved Our Performance Using ElasticSearch Plugins: Part 2

Xiaohu Li
Xiaohu Li
Sep 19, 2019 · 7 min read

Written By: Daniel Geng, Software Engineer | Pierre Poitevin, Senior Software Engineer| Xiaohu Li, Engineering Manager

We introduced the architecture and baseline performance benefits of the ES plugin in Part 1. In this post, we will focus on a specific customization that removes one of the largest bottlenecks in the recommendations ecosystem.


When we query ES to fetch recommendations to serve, we need to send a list of users to skip. For example, users that you have already seen recently and users that you are already matched with should not be recommended to you again. This skip list can be reasonably high for very active users. We use the terms query on ES for the skip list.

Terms query example:

However, we suspected that the “terms” query was inefficient for very large lists. We conducted a performance test using queries with skip lists of different sizes using the “terms” filter. From the results below, performance and skip list size have a clear inverse relationship.

p50 latency comparison with different skip list sizes (terms filter)

As seen above, the skip list is a considerable bottleneck in our ES performance. We also performance tested the omission of other query components, but the skip list was by far the largest contributor to latency. While it would be possible to reduce its size to improve performance, this could result in a negative user experience since users would potentially see duplicate recommendations. Our goal is removing the bottleneck but leaving the business logic intact.


Fundamentally, the solution is to find an alternative to using the terms query. Our idea was to send a serialized skip list using a compressed data structure, which could then be deserialized and used on the ES server. Assuming that the serialization and deserialization overhead is acceptable, not only would this reduce latency by avoiding a large terms query, but it could also greatly reduce the size of our query requests.

Now that we are familiarized with the usage of the ES plugin, we thought about how we could leverage it to optimize the skip list. In addition to adding a new script that could be used by our LoaderPlugin, another possibility was to add a new custom API using an ActionPlugin (similar to what we did for observability in Part 1). We will cover implementation details and tradeoffs below.

Plugin Types


To use the serialized skip list through a custom API, which we will call “_newsearch”, the following steps must be made.

  1. Serialize skip list on the ES client.
  2. Send a query to ES using the _newsearch API and pass in the serialized list.
  3. In the ES cluster, the query node sends a search query to the data node without the skip list. The requested document count is equal to the requested document count sent by the client plus the size of the skip list because the skip list will be applied on the query node.
  4. Receive the ranked documents in the query node. Deserialize the skip list. Include documents that are not in the skip list up to the requested size sent by the client and return to the client.


  • Easy to implement


  • Unnecessary processing: the skip logic occurs after the sorting phase, so relevance factor is calculated for documents in the skip list
  • Effect on data node load is query dependent
  • Increased load: potentially needs to rank extra documents, which may be heavy for queries with large skip lists
  • Decreased load: skip list handling is moved to query nodes
  • Increased load on query nodes since it needs to deserialize and apply the skip list
  • Updates require cluster restart since it does not take advantage of the LoaderPlugin
  • Increased network traffic between query and data nodes


To use the serialized skip list by leveraging the LoaderPlugin from Part 1, we will need to add a new script to deserialize and apply the skip list. This new script will use the following workflow.

  1. Serialize skip list on the ES client.
  2. Send a query to ES through the standard _search API. Send the serialized skip list through “params” in the request. Specify a script that uses a skip list deserializer in the “source” field. Add a “min_score” (a field from ES) parameter to the query (used in next step). Here is an example:

3. On the data node, the skip list will be deserialized. For documents that should be skipped, the script will return a relevance factor lower than min_score, so they will be omitted.

4. The remaining documents are returned to the client.


  • Updating a script is easy since it uses the LoaderPlugin
  • No unnecessary relevance computation done on data nodes
  • Reduced load on data nodes, since deserializing a skip list is much faster than having a large terms filter
  • No increased load on query nodes


  • Using min_score for skipping may have limitations — it can be difficult to determine the correct threshold

We conducted a performance test comparing the two implementations using bitmap serialization and a skip list of size 50k.

ActionPlugin vs. LoaderPlugin: p50 latency (50k skip list size, bitmap serialization)

While the latency was similar at lower QPS, the difference is quite obvious at 125 QPS. As we originally expected, the LoaderPlugin yielded much better performance.

Data structures

Now that we have determined which plugin implementation to use, we still had to decide which data structure to use to serialize the skip list. To help us make a decision, we conducted more performance tests for comparison. We tested the following data structures.

The size of the serialized skip list has a potential impact on ES network latency since it will be included in the request. Using a skip list of size 10 million, the serialized skip list size of each implementation is shown below.

Since a standard hash set is not meant for compression, it was expected that it would be larger than the raw values in terms. Inversely, the bloom filter and roaring bitmap generated serialized skip lists that were much smaller. Although a smaller size will result in reduced network bandwidth usage, it may not have any correlation with reduced latency or cluster load. Therefore, we implemented each data structure using a LoaderPlugin script and tested latency using various skip list sizes.

Below are the results when comparing the data structures using a skip list of size 10k.

P50 latency with 10k skip list size

It was clear that bloom filter and bitmap were much better than the rest of the pack. We did more performance testing comparing those two with larger skip lists. Below are the results when using a skip list of size 40k.

Bloom vs. Bitmap: p50 latency with 40k exclusion list

Even though bloom filter is slightly faster than bitmap, it has the issue of false positives. Since the difference in performance is small, we decided to use bitmap because it does not impact our business logic.

Final decision

In the end, we decided to adopt RoaringBitmap as the data structure and implement it as a LoaderPlugin.


By quickly iterating through our ES plugin development cycle, we are able to validate the functionality of this plugin in production by keeping the hit size intact and roll this out 100% transparently to our valued users.

More importantly, we see even greater performance gains in production than our performance test results. Here are the numbers:

  • Around 80% reduction across all latency percentiles(green line is original “terms” query, blue line is RoaringBitmap + LoaderPlugin):
P50: 65ms -> 13ms
P90: 170ms -> 35ms
P99: 400ms -> 80ms
  • Around 35% — 50% CPU utilization drop for query nodes and data nodes respectively.
Query Nodes
Data Nodes
  • Around 50% reduction on query and data nodes’ network IO

We are very excited to announce that by releasing this plugin, we are not only able to provide a better user experience while keeping business logic intact, but also gain significant headroom of our cluster capacity for future growth.


After building out a framework for plugin dynamic loading and iterating, we pushed our cluster’s performance to the next level by actively identifying our current bottlenecks, investigating and testing different options, and finally delivering benefits to our end users. A few key findings we got during this process:

  • Large size of terms filter results in bad performance. Lucene/ES is not designed to handle such cases efficiently
  • Among all the solutions we explored, RoaringBitmap provides the best compression rate, a relatively fast serialization/deserialization performance, and avoids false positives
  • Performance tests provide critical insights for design decisions
  • By leveraging scoring functions to handle skip, we are able to support more sophisticated logic going forward, as long as the information/logic is passed through the query parameter

This concludes our latest innovation on how we operate our ES cluster and make it hyper-scaled, which is just one of many engineering challenges we are tackling at Tinder. If you are interested in challenging yourself and want to work with talented teammates, please take a look at our job website for openings.