Scaling Search by Sharding at Blibli

SAYAN DAS
Blibli.com Tech Blog
8 min read · Dec 16, 2021

Blibli's product catalog has been growing at a steady rate of about 30% per year, and the volume of catalog updates has grown right along with it. At Blibli we use Apache Solr to power our search systems, and since 2017 we had been running on just a single shard. Since then our catalog size has grown 15x and our average requests per minute ~13x (as of Q3 2021). Those numbers look great on paper, but our Solr collection, still running on a single shard, was clinging to its last breath, given that we use expensive query components like faceting over 12 different fields, collapsing, stats, and query-time synonym expansion in every request.

Given the situation, we did the math for the upcoming anniversary sale in Q3 2021 against the projected traffic: our search systems would not scale. That left us just one month before the sale to scale the systems so we could sail through the event with zero downtime while keeping response times below our SLAs. Here is how sharding our collection helped us achieve record-high request throughput with record-low response times during the sale.

When should you shard your collection?

Here are a few symptoms to look out for in your systems to know it's time:

  • High garbage collection time: With a growing index, the IndexSearcher collects far more matching documents for every search request, leaving a huge pile of dead objects to clean up. We saw a slow but steady increase in GC CPU time as the index grew. You can tune the system up to a certain level (do check out this article on how we tuned our search system), but beyond that point it gives up.
  • High memory consumption: The system suffers a memory crunch, and we had to increase VM memory every month. The crunch is not in the memory used by the Solr JVM but in the OS buffer cache, because Lucene tries to fit all of the index segments inside it. Once the OS can no longer fit more index segments in the buffer cache, the disk takes a hit with high read ops, resulting in high response times.
  • Frequent OOMs while faceting: With a huge index, faceting over a high-cardinality field is an expensive operation that often leads to OOM errors.
  • Rising infrastructure cost: With the growing hunger for memory and for compute to handle garbage collection, our search infrastructure was burning a big hole in our pocket.

How to? — Sharding / Document routing

Now that you know you need to shard your collection, here is how to do it. The optimal approach depends on the type of query (or business use case) you are serving and how the data is organized in the collection.

Routed Search

This mechanism works best when we know, at query time, which shard all the matching documents should come from. We can pass the _route_ param along with the query to redirect it to the correct shard. In the example below, we have sharded the collection on the basis of location. Now, when a user searches for "samsung" products in stores in Jakarta, the query is automatically redirected to the correct shard.

Solr hashes the route key (the compositeId router uses a MurmurHash3 implementation) and determines which hash bucket it falls into. The buckets are created when the collection is created, and their ranges can be seen in the Solr admin UI. In the example above, the hash function returned 25, which falls within shard3's range of 21 to 32.

At Blibli we use this sharding strategy for merchants selling products in multiple stores all over Indonesia. The merchant code and the store location are combined into a composite routing key, so all of that merchant-store's data is served from a single shard.
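To make this concrete, here is a minimal sketch of routed indexing and querying over Solr's HTTP API. The host, the store_products collection name, the field names, and the routing prefix (merchant code plus store location) are all illustrative assumptions, not Blibli's actual schema:

```python
import requests

# Hypothetical host and collection; the field names and key format are illustrative.
SOLR = "http://localhost:8983/solr/store_products"

# With the compositeId router, an ID of the form "<routeKey>!<docId>" makes every
# document sharing the prefix land on the same shard.
docs = [
    {"id": "MERCH42_JAKARTA!SKU-001", "name_t": "samsung galaxy s21"},
    {"id": "MERCH42_JAKARTA!SKU-002", "name_t": "samsung 43 inch tv"},
]
requests.post(f"{SOLR}/update", params={"commit": "true"}, json=docs).raise_for_status()

# Passing _route_ with the same prefix sends the query only to the shard that owns
# that hash range, instead of fanning out to every shard.
resp = requests.get(
    f"{SOLR}/select",
    params={"q": "samsung", "df": "name_t", "_route_": "MERCH42_JAKARTA!"},
)
print(resp.json()["response"]["numFound"])
```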

This type of system is easy to scale: if one or more shards outgrow their index, we simply split those shards.

Distributed Search

There are scenarios where you don't know which shard the search results will come from. This is the architecture we followed back in Q3 2021 for the online product collection, where 95% of Blibli's search requests land. Here is how it works.

Fetch TOP_IDS: The client asks for results for the query "samsung". Solr hits one replica of each shard to fetch just the top document IDs and scores (hence this phase is called fetch TOP_IDS). Essentially, the query q=samsung&fl=id,score&shards=shard_i, where i ∈ {1, 2, 3}, runs in parallel on each shard. In our example below, matching documents were present only in shard1 and shard3.

Fetch TOP_IDS

Fetch GET_FIELDS: Now the results are aggregated and sorted by Lucene score. The remaining fields are fetched in a second phase from only the shards that contributed results (in our example, just shard1 and shard3). The queries look something like q=id:x,y,z&fl=*&shards=shard1 (assuming TOP_IDS returned documents x, y and z from shard1) and q=id:p,q&fl=*&shards=shard3 (and p and q from shard3).

Fetch FIELDS

The results are stitched together and returned to the client. This flow can be traced easily by passing debug=track in the request and observing the requestPurpose field.
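As a quick way to see the two phases for yourself, here is a hedged sketch against a hypothetical products collection on a local Solr; the exact layout of the debug output may vary between Solr versions:

```python
import requests

# debug=track asks Solr to record the internal per-shard requests it made for this
# search, grouped by distributed phase.
resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={"q": "samsung", "fl": "id,score", "debug": "track"},
).json()

# The "track" section lists phases such as EXECUTE_QUERY and GET_FIELDS; each
# per-shard entry carries the params of that internal call, including requestPurpose.
track = resp.get("debug", {}).get("track", {})
for phase, per_shard in track.items():
    print(phase, "->", per_shard)
```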

There will be TPᵢₙₜₑᵣ internal Solr calls for a single search-page Solr call, where N is the number of shards. The first N calls gather the top IDs, the facet counts (if facet=true) and the Lucene scores, i.e. EXECUTE_QUERY → GET_TOP_IDS along with GET_FACETS. In the second part, 'i' calls are made to fetch the metadata, i.e. GET_FIELDS, which contacts the respective shards and gathers the data from any replica of each of those shards.

TPᵢₙₜₑᵣ = (N + i), ∀ i ∈ {0, 1, 2, …, N}

TPᵢₙₜₑᵣ increases linearly with the number of shards, which makes this type of system harder to scale beyond a certain point. For example, with N = 3 shards where matches come from i = 2 of them, a single search request fans out into 3 + 2 = 5 internal Solr calls. We will discuss this more below.

Checklist for designing the system

Now that we know how querying works in a multi-shard collection, here are a few things to be careful about while migrating your collection to multiple shards. Please note that this checklist applies to distributed search only.

Dealing with skewed data distribution

You should know how the data is distributed in the collection; it can be highly skewed, resulting in imbalanced shard index sizes. For example, there will likely be far more stores in Jakarta than in other provinces of Indonesia, because it is the country's capital.

Index size per shard

Skewness can easily be verified from the Solr admin UI > Nodes page, under Disk Usage, in Solr v8.0.0 and above. Here is the official Solr documentation on how you can use two levels of routing to reduce the skew to an extent.
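For reference, the composite-key syntax that documentation describes looks roughly like the sketch below. The prefixes and IDs are hypothetical; check the Solr document routing docs for the exact semantics:

```python
# Two-level routing with the compositeId router: hash on the province first, then on
# the merchant code, so related documents stay together while provinces still spread
# across shards.
two_level_id = "jakarta!MERCH42!SKU-001"

# Spreading a hot prefix: "jakarta/2!..." takes only 2 bits of the hash from the
# prefix, so documents with the "jakarta" prefix are spread over up to 1/4 of the
# collection's hash range instead of landing in a single shard.
spread_id = "jakarta/2!SKU-900"

print(two_level_id, spread_id)
```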

And here is how Apple tackles the issue by automating shard splits in their workflow.

Determining the correct number of shards

The most common question we get is how many shards is the right number for us. Well, it all "depends": on the workload, the query types, and how many documents we are dealing with. As a starting point, we use this rule of thumb to determine the shard count. Say a single shard can handle up to 'x' documents, the collection currently holds 'y' documents, the yearly increase in catalogue size is 'n' (as a fraction of the previous year), and we want the system to stay future proof for 'z' years:

num_shards = ceil(y*(1 + z*n)/x)
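In code, the rule of thumb is just a couple of lines. The numbers below are made up purely for illustration and are not Blibli's actual figures:

```python
import math

def estimate_num_shards(x, y, n, z):
    """Rule-of-thumb shard count.

    x: documents a single shard can comfortably handle
    y: documents in the collection today
    n: yearly catalogue growth as a fraction of the previous year (e.g. 0.3 for 30%)
    z: number of years the sizing should hold
    """
    return math.ceil(y * (1 + z * n) / x)

# Illustrative numbers only: 40M docs today, 30% yearly growth, a 5-year horizon,
# and ~25M docs per shard.
print(estimate_num_shards(x=25_000_000, y=40_000_000, n=0.3, z=5))  # -> 4
```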

That gives a rough idea, but to verify the numbers we projected the document count we expect over the next 5 years and ran tests with the shard count as a variable, plotting mean latency against the number of shards.

baseline vs extrapolated collection with variable # shards

Collections using routed search, as mentioned above, are easy to scale, because each search request depends only on a specific subset of shards, or on a single shard. With the distributed search approach it gets trickier: there are significantly more inter-shard calls, which consume JVM threads.

Hardware configuration

For higher availability you need more VMs to power the Solr cluster, but because the index size per shard is small, each VM can get by with far fewer resources. In our case, we found that after the migration we had reduced the overall CPU cores and memory of the entire collection compared with the single-shard setup, even though the total number of VMs went up.

For the initial capacity planning we calculated the total number of cores and the total memory we had in the single-shard setup. Say we have 10 nodes, each with 4 cores and 20 GB of memory: that is 4 × 10 = 40 cores and 20 × 10 = 200 GB of memory in total. We then try to fit and rearrange the replicas within 20 nodes of 2 cores and 10 GB each. This really helped us determine the total capacity of the multi-sharded collection within a fixed resource budget. Later, moving to production, we can scale up according to our requirements.
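The same budgeting exercise, spelled out with the illustrative node counts from the paragraph above:

```python
# Single-shard baseline: 10 nodes of (4 cores, 20 GB).
baseline_nodes, cores_per_node, mem_per_node_gb = 10, 4, 20
total_cores = baseline_nodes * cores_per_node        # 40 cores
total_mem_gb = baseline_nodes * mem_per_node_gb      # 200 GB

# Target layout for the multi-shard collection: 20 smaller nodes of (2 cores, 10 GB)
# that fit inside the same overall budget.
target_nodes, target_cores, target_mem_gb = 20, 2, 10
assert target_nodes * target_cores <= total_cores     # 40 <= 40
assert target_nodes * target_mem_gb <= total_mem_gb   # 200 <= 200
print(total_cores, total_mem_gb)
```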

Replica Placement

Blibli's data center is hosted on Google Cloud Platform in the Singapore (southeast Asia) region. When replicas are arranged across different regions, there is network overhead while distributing the requests. You can refer to Google's real-time cross-region latency matrix here.

Hence, while arranging the replicas, we make sure they are evenly distributed across the nodes hosted in all the regions.

Configurations worth considering

A few things you may want to configure to avoid disruption as soon as production traffic hits your new collection:

  • ShardHandlerFactory thread pool configuration: As mentioned above, increasing the number of shards on a limited number of nodes increases inter-Solr calls, which in turn increases the number of active threads in the JVM. Here is the official documentation.
  • ShardHandlerFactory connection timeouts: The shards are now interdependent, so a performance bottleneck in one shard's replica can quickly cascade to the other shards. It therefore becomes necessary to release threads quickly by configuring appropriately low timeouts. You will find the configuration in the same documentation above.
  • Partial results: In the same situation as above, i.e. a complete or partial shard outage, Solr by default fails the whole request if any one shard returns a non-2xx status. You may want to update the Solr config or pass this param (shards.tolerant=true) with every request.
  • Single-pass mechanism: With sufficient network bandwidth, and when only a few fields are being fetched, we can enable distrib.singlePass to eliminate the GET_FIELDS phase completely, reducing the number of active JVM threads. But it may lead to other issues, like memory utilization and network bandwidth reaching their limits. Both of these query-time options appear in the sketch after this list.
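The shard handler thread pool and its timeouts live in solrconfig.xml, but the last two items can be set per request. Here is a hedged sketch against a hypothetical products collection with illustrative field names:

```python
import requests

# Hypothetical host, collection and field names. shards.tolerant=true returns partial
# results instead of failing the whole request when a shard is down, and
# distrib.singlePass=true fetches the requested fields in the first phase, skipping
# the second GET_FIELDS round trip.
resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={
        "q": "samsung",
        "fl": "id,name_t,score",
        "shards.tolerant": "true",
        "distrib.singlePass": "true",
    },
).json()

# When results are partial, Solr flags it in the response header.
print(resp["responseHeader"].get("partialResults", False))
```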

Conclusion

Blibli search saw a huge improvement when we moved to distributed search: response times dropped by a whopping 65%, and infrastructure cost came down as well. That helped us sail through Blibli's 2021 anniversary event with no hiccups.
