Migrating to Elasticsearch with dense vector for Carousell Spotlight search engine
As a classifieds marketplace, Carousell has tremendous amounts of listings from sellers. To present the most relevant listings to buyers, a search engine is required. Usually, search engines are equipped with basic functionalities such as keyword matching, predicate filtering, relevance ranking, and pagination etc. In Carousell, we experimented with a new relevance model called embedding, and we managed to apply it on one of our internal products called Spotlight. This article is going to explain how we built our own internal search engine to support embedding-based relevance ranking, and eventually migrated over to Elasticsearch with dense vector.
In 2018, we implemented an internal search engine called Carolene with Apache Lucene to support embedding-based relevance ranking. Two years after that, we migrated to Elasticsearch 7.8 with dense vector feature, and it works like a charm!
Spotlight is one of our internal seller tools for achieving faster seller success. It helps boost the visibility of the seller’s listings in the marketplace.
When sellers purchase a Spotlight for their listing, we store the original listing details in our dedicated search engine (different from the one for organic search purposes) and serve them in search results and curated content sections whenever deemed relevant.
To make the Spotlight listings relevant to the placement, we have to retrieve ones adhering to the context. For instance, we want to display Spotlights on an organic search response page, we need to apply the same predicates that the organic search uses, which can include keyword search, predicate filtering, geo-range filtering, and even AND-OR operations of these mentioned. The sequence diagram below illustrates a simplified flow of how Spotlights are served on a search page.
Before jumping into the sections about Carolene, I want to address a few questions that you might have.
Why not Elasticsearch in the first place?
Apart from the normal information retrieval functionalities (TF-IDF, filter, AND/OR), Spotlights need to be indexed with embeddings, and more importantly, queried with embeddings using dot product as the final relevance score. At the time of implementing Carolene, Elasticsearch was not capable of this. Hence, we decided to build our own.
What is the embedding score?
Embedding is a very commonly used representation in Natural Language Processing (NLP) for text analysis. It is usually in the form of vectors in a high-dimensional space. As shown in the picture below, each dot in the space represents a document, and distances between dots represent their similarities. You can use various measures, like cosine distance or dot product distance. Choosing which measure depends on how your embedding was trained in the first place.
Embedding has proven to be not only mathematically convenient, but also much more performant in polysemy, synonymy, and semantic matching.
Internal Solution: Carolene
The system was required to support keyword matching, (keyword, geo-range) filtering and most importantly, finding the nearest embedding documents. We chose to implement a gRPC service that internally wraps around the Lucene core library and extends from it to support vector operations. We named it Carolene as a wordplay of Carousell and Lucene. This project was majorly contributed by my colleague Varma. 👏
Vector and VectorNearQuery
It is fairly straightforward to support TF-IDF, filters, pagination, etc with Lucene in Carolene. The essence of Carolene is the vector plugin, which helps us achieve embedding/vector operations. Embeddings are stored in a customised Field of type VectorType. When an embedding is provided, Lucene’s DocIDIterator traverses the entire collection performing the dot product score calculation before returning the top-n. This is an O(MxN) operation. N is the number of documents and M is the dimension of your embeddings.
To achieve higher availability, Carolene is architected to be distributed with a single-leader-multi-follower mechanism, orchestrated by Consul. Each index in Carolene is a single-shard index, and fully replicated in all nodes. Changes committed in the leader will replicate to followers using the Lucene replication interface via gRPC streaming. The graph below illustrates the basic architecture of Carolene, followed by some explanations.
- The Envoy load balancer is agnostic about nodes’ identity. It routes reads and writes requests to all nodes in a round-robin fashion.
- All nodes can process read requests.
- Only the leader node can process write requests.
- If a follower node receives a write request, it routes that to the leader to process and wait for the leader’s response.
There’s a lot more about Carolene skipped here, such as index storage, delta change replication, leader election, etc. As an internal search engine solution, Carolene served its purpose for 2 to 3 years. Embedding has also proven to be very performant in retrieving relevant documents than just TF-IDF.
As our index size grows from a few hundred documents to more than 10,000, Carolene started to surface its shortcomings.
- When there’s a sudden spike in traffic, it can cause multiple nodes to go out of memory (OOM). We have a caching layer on top of Carolene. It normally has a 70%~80% hit rate. But it still cannot help in situations where there are too many cache-miss queries.
- When the leader goes into OOM state, the election protocol might elect a follower with a stale version of the index and other followers will start to reject its replication. This is mostly because writes are only committed to the leader and followers sync with the leader via periodic replication. If we implemented the write process to be quorum writes, we might’ve avoided this.
- We also observed that if a follower dies during a replication write, on recovery it can get a corrupted index (ES and SOLR have an index cleaner operation on startup)
We have developed a few workarounds to mitigate the issues above. We also minimised the reindex time of the entire cluster to be within 20 minutes. Whenever the whole cluster enters an unhealthy or unrecoverable state, we recreate it in half an hour. It is very risky to use that as our primary mitigation approach.
Hence, we started to re-evaluate the system and seek new solutions. We have evaluated Facebook’s Faiss and Spotify’s Annoy. Both are Approximate-Nearest-Neighbour (ANN) solutions. While our case is more of a precise K-Nearest-Neighbour (KNN) use case, it is not the reason we didn't go ahead with them. One of the reasons that we didn't use Annoy was that it requires offline creation of the index, while our index pool is pretty dynamic. Most importantly, neither of them supports TF-IDF, filtering, or geo-range queries, which are mandatory in our use cases.
Elasticsearch with Dense Vector
Finally, we found one promising candidate: Elasticsearch dense vector. Since Elasticsearch 7.0, they have introduced dense vector as an experimental feature and made it generally available since version 7.6 in Feb 2020. Dense vector with the similarity functions Elasticsearch provided (DotProduct and CosineSimilarity) offers the exact functionality of Carolene’s vector plugin.
Be aware that there’s NO magic in Elasticsearch! It’s still a O(MN) operation.
But Elasticsearch as a holistic search engine solution, offers mature cluster orchestration, extensive management APIs, high availability, and resilience to dynamic traffic patterns. Also, index sharding and replicating can boost performance, which Carolene was not capable of.
To safely onboard Elasticsearch, we developed a proof-of-concept (POC) to examine its functionality and performance.
This POC system was deployed as
- 1 Master Node (1vCPU, 4GB)
- 6 Data Nodes (1vCPU, 4GB)
- 1 General purpose node for running benchmark scripts, metrics collection, and monitoring
The functionality test was done separately and it was fairly easy to finish as well. There’s no surprise that Elasticsearch supports everything we require including embedding operations.
The major concern is whether Elasticsearch’s dense vector operation can meet our latency requirements, given our index size and query patterns. In addition, we wanted to test out the best index configurations for our index size, which are index shard number and shard replica number.
In a nutshell, we captured the variables of the tests in 4 dimensions. Below are those with their tuning values.
- Index shard number: 1, 2, 4, 6
- Shard replica number: 1, 2, 4, 6 (max)
- Number of documents in index: 13k, 26k, 52k
- Embedding Queries per second: 200, 400, 800, 1200
We tested all permutations of above 4 dimensions. Each round, we configured a brand new index with index shard number and shard replica number. Then we ingested the configured number of documents. Finally, we spawned up enough goroutines and fired random embedding queries at the configured QPS to the system.
Each round will run for 10 minutes, and we observed P50, P90, P99 latencies in each round. We also monitored the general health of nodes which includes CPU usage, memory usage, open file descriptors, etc.
To simulate the real-world production traffic patterns, we used a side goroutine to generate constant write traffic.
Carolene vs. Elasticsearch
We didn’t observe any abnormal activities in those nodes during the performance test. CPU usage was kept within full capacity in all rounds except those with 1200 QPS, which is way beyond our current production QPS. Moreover, Elasticsearch did give promising results in terms of latency.
We use our production latency of Carolene as the baseline for this evaluation. Latencies observed for the index with 13k documents under 400 QPS were as follows. It’s worth mentioning that the Carolene cluster was deployed with more powerful machines (9 x 4vCPU, 16GB), and those latencies below include non-embedding queries, which are much less latent in general.
- P50: ~4ms
- P90: ~30ms
- P99: ~100ms
In the POC setup, for the same 13k documents index under 400 QPS embedding queries, we observed latencies as follows, in the most performant index configuration: 1 shard, max replica
- P50: ~43ms
- P90: ~77ms
- P99: ~121ms
If we estimate the current Carolene setup to be 6 times more powerful, the latency looks very decent for Elasticsearch. We soon started the real migration and it turned out to work better than we expected 🚀 with even fewer resources. After migrating all traffic to Elasticsearch, the actual latencies observed were
- P50: ~8ms
- P90: ~24ms
- P99: ~77ms
With the flexibility of Elasticsearch, we quickly tried a few more experiments, such as field boosting and Chinese tokenizer, which both worked very well⚡.️ We also have standardised APIs to manage our indexes and monitor the cluster’s health. As much as Carolene was missed, Elasticsearch quickly became our new favourite.
If you are also considering Elasticsearch or dense vector for your use case, I do have a few tips and hope you can find them useful.
- There’s NO silver bullet for index configuration. Run a performance test to find the best config for your system.
- Use separate indexes as early as possible. For instance, if you have observed that certain predicates always appear in your query, you should consider splitting your index by that. Leaner indexes are easier to manage and also faster for retrieval.
- Filter as much as possible, before vector operations. In the O(MxN) operation (N: number of filtered docs, M: the dimension of embedding), M is pre-defined so it’s fixed. But we can reduce N by applying filters first.
- Evaluate properly before deciding to go ahead with dense vector. The dense vector approach is to find precisely K Nearest Neighbours (KNN). It works well for small or medium-sized indexes which favour high precision. On the other hand, Approximate Nearest Neighbours (ANN) solutions might work better for use cases with a huge index that favours high recall and can tolerate approximation. Again, you can explore Facebook Faiss or Spotify Annoy.
Thank you for reading this.
Clap if you like it.
Join Carousell if you love it!