[Image: Not this kind of Solr]

Adventures in Solr Sharding

The Power of Powers of 2

Eric Griffith
Jan 20, 2023 · 7 min read

We’ve been using Apache Solr, an enterprise-level full-text search platform based on the Lucene search library, for over 6 years to power our applications’ full-text search and reporting aggregation. This is a story about the time that our choice of sharding strategy sent us scrambling for four days to fix a customer bug.

We have multiple Solr clusters, each running 2 to 16 shards. I've held two basic assumptions for as long as I've been working with Solr:

  1. All the documents for a single customer will be contained within a single shard (because we use the customer id as the shard key).
  2. The number of shards should always be a power of 2 (e.g. 2, 4, 8, 16…).

Of course, assumptions are made to be challenged. As we were planning our Solr upgrade the subject of our unbalanced shards came up. At that time, we had an 8-shard cluster with 2 of the shards hosting more than 50m documents and the others hosting between 24m and 38m. We were lamenting that we would have to deploy 16 shards, and they would still be unbalanced. At one point, someone (probably me) asked “why do we have to shard on powers of 2? Can’t we just have 10 shards?”

It was an offhand comment that, absent pushback, became a ten-shard cluster for our most-trafficked search instance a few months later. Everything was appropriately sized; the cluster hummed along nicely.

Our shards still grew lopsided. After 6 months the Infrastructure team decided that two of the shards needed to be split — we could either reindex to a 12-shard cluster (a process that takes 2–4 weeks for 300m+ documents) or use the SPLITSHARD functionality and get the whole process done in 2–4 days. We hadn’t widely used the SPLITSHARD command on a production cluster previously, but from the documentation we knew that it was designed to do exactly what we wanted: create two (or more) shards out of a single shard, reducing the number of documents per shard to approximately half of the total on the original.
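
For the curious, kicking off a default split is a single Collections API call. Here's a rough sketch; the collection name and Solr URL are placeholders, not our actual setup:

# A minimal sketch of a default SPLITSHARD call (the collection name
# "apps" and the Solr URL are illustrative). With no explicit ranges,
# Solr simply cuts the parent shard's hash range in half.
import requests

requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "SPLITSHARD",
        "collection": "apps",
        "shard": "shard6",
        "async": "split-shard6",  # long-running; poll with REQUESTSTATUS
    },
    timeout=30,
)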

Since the cluster performance would be impacted by the process, we split the shard on a snapshot, rather than on production. We were able to stand up the re-sharded cluster in 3 days, cutting over on a Friday afternoon.

At midday on the following Monday, our Customer Success team escalated a ticket: one of our customers on that cluster was seeing only a few dozen documents in their search rather than the tens of thousands they actually had. All of the documents were still accessible directly; they just weren't showing up in search, which pointed to an issue with our Solr indices.

Through queries to Solr's CLUSTERSTATUS API we were able to determine that the majority of the customer's data resided on one of the new child shards (shard6_1), and a very small number of documents (all inserted after shard6 was split) existed on that shard's new sibling (shard6_2).
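
For reference, here's a rough sketch of that kind of CLUSTERSTATUS check; the collection name ("apps") and URL are placeholders rather than our actual configuration:

# Sketch: inspecting shard hash ranges via the Collections API's
# CLUSTERSTATUS action (collection name "apps" is illustrative).
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": "apps", "wt": "json"},
    timeout=10,
)
shards = resp.json()["cluster"]["collections"]["apps"]["shards"]
for name, info in sorted(shards.items()):
    print(name, info["range"], info["state"])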

Eager to get things working for this customer, we initiated a reindex of their documents under the hypothesis that the SPLITSHARD command had somehow put this customer on the wrong shard and reindexing would move it appropriately. While the reindex was running, we determined that this was not happening to any other customer, both by conferring with the Customer Success team and by examining the distribution of customer documents across all the shards to confirm that no other customer had shard-spanning documents.
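
One way to do that kind of check, sketched here with placeholder names (the customer_id field and "apps" collection are illustrative), is to ask Solr for per-shard hit counts with shards.info:

# Sketch: checking whether a customer's documents span shards by asking
# for per-shard hit counts (field "customer_id" and collection "apps"
# are illustrative).
import requests

resp = requests.get(
    "http://localhost:8983/solr/apps/select",
    params={"q": "customer_id:42", "rows": 0, "shards.info": "true", "wt": "json"},
    timeout=10,
)
per_shard = resp.json()["shards.info"]
hits = {shard: info["numFound"] for shard, info in per_shard.items()}
print({shard: n for shard, n in hits.items() if n > 0})  # >1 entry => split customer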

While the reindex for that customer was running, we got reports that things were looking better and their searches were starting to show thousands of documents rather than dozens. From our side, though, we noticed that while many documents were indexing to the expected shard (shard6_2), an equal number were still indexing to its sibling (shard6_1).

Our expectation was that a customer would ALWAYS be completely contained within a single shard, since we use customer id as the shard key. We based this expectation on the documentation, which, upon re-reading, maybe didn't say what we had thought it said.

Every Solr document has a unique id which can be used to direct it to a specific location in the cluster. When the id is formatted as <shard_key>!<document_id>, Solr calculates the ring location by hashing each part (shard_key and document_id) separately using MurmurHash3, then combining the first 16 bits of the shard_key hash with the last 16 bits of the document_id hash to form the document's address. This gives an overall address space of 00000000 to FFFFFFFF, split across shards.

Our id is <customer_id>!<unique_document_id> which, when coupled with the addressing algorithm above, means that any single customer should be contained in the range

XXXX0000 to XXXXFFFF, where XXXX equals the first 16 bits of the murmur-hashed customer_id.

This implies that if you want a customer to be contained fully within a single shard, you just have to make sure that shard's address range starts and ends on a 0000/FFFF boundary, such as 8FFF0000–9CD2FFFF.
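
To make that concrete, here's a rough sketch of the address composition in Python. The mmh3 package stands in for Solr's internal MurmurHash3 implementation, so the exact values may not match what Solr computes, but the bit layout is the point:

# Rough sketch of compositeId routing: the top 16 bits come from the
# shard key (customer id) hash, the bottom 16 bits from the document id
# hash. The `mmh3` package is a stand-in for Solr's internal MurmurHash3,
# so exact values may differ from Solr's; the bit composition is what
# matters here.
import mmh3

def composite_address(customer_id: str, document_id: str) -> int:
    customer_hash = mmh3.hash(customer_id, signed=False)  # 32-bit hash
    document_hash = mmh3.hash(document_id, signed=False)
    return (customer_hash & 0xFFFF0000) | (document_hash & 0x0000FFFF)

# Every document for one customer lands somewhere in XXXX0000-XXXXFFFF,
# where XXXX is the top 16 bits of the hashed customer id.
print(f"{composite_address('customer_42', 'doc_1001'):08X}")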

Turns out, with 10 shards, that's not how things work. The initial setup was split on 0000/FFFF boundaries (we hypothesize that, during the initial setup, Solr was clever enough to allocate shard_key-safe address ranges to all the shards despite the non-power-of-2 shard count), but as soon as we tried to split one of those shards, Solr didn't have our backs. Shard6, for example, started out with the address range 80000000–9998FFFF. Splitting that exactly in half (which is what SPLITSHARD does without any other input) gives shard6_1 the range 80000000–8CCC7FFF and shard6_2 the range 8CCC8000–9998FFFF, which means the customer at 8CCCxxxx ends up split across the two shards. In the case of this one customer, the first 16 bits of the murmur-hashed id equaled, no surprise, 8CCC.
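
The arithmetic is easy to check:

# Verifying the split point: halving shard6's range puts the boundary
# inside the 8CCC customer prefix.
lower, upper = 0x80000000, 0x9998FFFF
mid = lower + (upper - lower + 1) // 2           # 0x8CCC8000
print(f"shard6_1: {lower:08X}-{mid - 1:08X}")    # 80000000-8CCC7FFF
print(f"shard6_2: {mid:08X}-{upper:08X}")        # 8CCC8000-9998FFFF
# Documents for the customer hashed to 8CCC now land on either child,
# depending on the low 16 bits of each document id's hash.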

Why is splitting a customer across shards such a bad thing, at least in our case? We use two features (one of them incorrectly) that are not entirely cross-shard friendly: document routing (speeding up searches by directing queries to the correct shard by customer id) and join queries (referencing two different document schemas in a single query, e.g. a query that references both an application and the job descriptions it is attached to). These caused problems for a couple of reasons.

  1. While the _route_ parameter on queries will, when used correctly, fan a query out across multiple shards, there is a performance cost. (Unfortunately, we were not using it correctly: we assigned “<customer_id>” to the route parameter, which could NOT search across shards. We were able to fix that by assigning the parameter as “<customer_id>!”; see the sketch after this list.)
  2. Joining schemas in a query requires the crossCollection join method, which requires some careful configuration and collection-management decisions. In our case, successful crossCollection joins for this customer resulted in query times that were over 10,000% slower (queries went from ~50ms to well over 5s).
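
Here's a rough sketch of that routing fix as a plain HTTP query (the collection and field names are placeholders, not our actual schema):

# Sketch of the _route_ fix (collection "apps" and field "customer_id"
# are illustrative). The trailing "!" tells Solr to route the query to
# every shard covering the customer's hash-range prefix, instead of
# hashing the bare value and hitting a single (possibly wrong) shard.
import requests

customer_id = "42"
params = {
    "q": f"customer_id:{customer_id}",
    "_route_": f"{customer_id}!",  # previously f"{customer_id}", which was the bug
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/apps/select", params=params, timeout=10)
print(resp.json()["response"]["numFound"])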

After enabling non-routed queries and crossCollection joins, we determined that, with our current configuration, this was not a viable approach for our production instance.

Which brings us to the one bit of good luck: our original blue cluster was still running and still receiving updates from our duplicate indexing jobs. This gave us the opportunity to roll back, repair the desyncs, and reinstate full feature availability to this customer, albeit with unbalanced and bloated shards.

This left us with a number of tasks:

  1. (Immediate) Reshard the large shards on this production cluster without splitting customers across shards (possible by explicitly defining the address ranges when running the SPLITSHARD command; see the sketch after this list).
  2. (In the next 2 quarters) Reindex all of our production clusters onto clusters with shard counts that are powers of 2.
  3. (Within 2 years) Examine our schema design, sharding, and query strategies so that we can effectively include cross-sharded customer data in all necessary queries.
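
For the first task, here's a sketch of what an explicit, customer-safe split could look like using the Collections API's ranges parameter (the collection name is a placeholder; the key point is that every boundary ends on FFFF):

# Sketch: splitting shard6 on customer-safe boundaries via the "ranges"
# parameter (collection name is illustrative). Every boundary sits on a
# XXXXFFFF/XXXX0000 edge, so no 16-bit customer prefix is cut in half;
# the 8CCC prefix stays entirely on the first child shard.
import requests

requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "SPLITSHARD",
        "collection": "apps",
        "shard": "shard6",
        "ranges": "80000000-8cccffff,8ccd0000-9998ffff",
        "async": "split-shard6-safe",
    },
    timeout=30,
)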

How does this relate to powers of 2? If you take the top value for the total address space (FFFFFFFF) and divide it by 2, it gives you 7FFFFFFF, then 3FFFFFFF, then 1FFFFFFF, 0FFFFFFF, 07FFFFFF, ….

Each hexadecimal digit represents four bits, so the customer portion of the address (the first four hex digits, i.e. 16 bits) can be divided in half only 16 times before a split has to cut into the document portion of the address. That means we can keep splitting shards until we reach 65,536 shards, each covering exactly one customer prefix, before a split would potentially spread a customer across shards.
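
A quick sanity check of that claim:

# Quick sanity check: repeatedly halving the address space, a boundary
# stays customer-safe only while the upper bound still ends in FFFF.
top = 0xFFFFFFFF
for halving in range(1, 18):
    top >>= 1
    safe = "safe" if (top & 0xFFFF) == 0xFFFF else "cuts a customer prefix"
    print(f"halving {halving:2d}: upper bound {top:08X} ({safe})")
# Halvings 1-16 are safe (7FFFFFFF ... 0000FFFF); the 17th (00007FFF)
# would land inside a customer's XXXX0000-XXXXFFFF range.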

[Figure: Example of Solr shard address spaces on a binary tree]

And that, dear reader, is why we must always shard Solr in powers of 2.

Notes:

We don’t actually have to shard in powers of two, but choosing not to means that when splitting shards we will have to manually calculate an appropriate address space for each shard rather than letting Solr choose, and that way lies madness.

Also, this doesn’t mean that clusters will always have powers-of-2 shard counts, since splitting individual shards as needed will give us any number of shards that can be represented by a binary tree.
