The OG Search Engine Database: Solr

SHIVAM SOURAV JHA
5 min read · Jan 10, 2023

Index table

  1. General information about Solr.
  2. Inverted Index
  3. Data storage across nodes.
  4. Document routing.
  5. Static use of data in Solr.
  6. BKD-Tree.
  7. Conclusion
  8. Further helpful links.

Introduction

In the age of the ELK stack (Elasticsearch-Logstash-Kibana), be someone’s Solr, a battle-tested search engine that is open-sourced and mature.

We won’t be talking about which search engine came first or which is cheaper; let’s just talk about the beauty of its existence and how it stores data.

Like this image, Solr is just sitting on top of Lucene

Well, to address the how of data storage, it's the same as how Elasticsearch stores data because it's built on Lucene and uses an inverted index.

What is an inverted index?

It is a method of data storage where every word appearing in a document is mapped to the IDs of the documents that contain it, so that we know where each word appears.

"apple":1,2,3
"ball": 3,5
This means apple appeared in documents with ID 1,2,3 and ball appeared
in documents with ID 3,5
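
To make the mapping above concrete, here is a minimal sketch in Python. It is not how Lucene implements its index; the function name build_inverted_index and the sample documents are made up for illustration, and tokenizing is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical documents chosen to reproduce the example above.
docs = {
    1: "apple",
    2: "apple pie",
    3: "apple ball",
    5: "ball game",
}

index = build_inverted_index(docs)
print(index["apple"])  # [1, 2, 3]
print(index["ball"])   # [3, 5]
```

Looking up a word is now a single dictionary access instead of a scan over every document, which is the whole point of inverting the index.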

So now that we know how data is stored in Solr, let’s talk about how beneficial it is to use Solr. One of the issues often attributed to Elasticsearch is data loss or imprecise search results. Solr performs better in that situation because of how the data is placed.

Inverted-Index community members

How is data stored across nodes?

Earlier, Solr would store data across multiple shards as a collection and copy this data onto other shards to prevent data loss. Searching was a somewhat manual affair because we had to check all the shards for our data. For storing data, Solr also had to compute which shard to send it to, and there was no load balancer support.

To address these issues, a new key player was introduced: ZooKeeper. It is a sort of librarian that tells visitors where to pick up a book from or where to keep a book, eventually giving more precision and stability (by load balancing) to Solr.

Furthermore, because of this method of data storage (with ZooKeeper in place), the nodes do not experience a split brain (a split brain occurs when two nodes each believe they are the leader at the same time).

How do I route to a document?

How do we route data to our target now that we know it is stored across shards?

This is accomplished through prefixes, which we can add in front of our document ID to determine which shard the document is placed in. For example, if the document ID is "Apple!12344", we hash the prefix "Apple" to find which shard it belongs to and then find the document with ID 12344 there; "!" separates the prefix from the rest of the ID.
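
The prefix idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function shard_for is hypothetical, and MD5 with a modulo is a stand-in for Solr's own hash function and hash ranges, so the shard numbers it produces will not match a real cluster.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard from the prefix before '!' (or from the whole ID if there is no prefix)."""
    prefix, sep, _ = doc_id.partition("!")
    route_key = prefix if sep else doc_id
    # Assumption: MD5 is used only to get a stable hash; Solr uses a different one.
    digest = hashlib.md5(route_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


# Documents sharing a prefix land on the same shard.
print(shard_for("Apple!12344", 4) == shard_for("Apple!99999", 4))  # True
```

Because only the prefix is hashed, every document of the same "Apple" group ends up together, so a query for that group can go to one shard instead of all of them.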

Talking about fetching data, we should note that Solr receives the parameters of a query (such as q, fq, and rows) as HTTP request parameters, typically in the URL's query string.
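
As a rough illustration of passing parameters along with a query, the sketch below builds a request URL for Solr's /select handler using only the Python standard library. The host and port are the usual local defaults, while the collection name "fruits", the field "name", and the helper build_select_url are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, params):
    """Build a /select URL with the query parameters encoded in the query string."""
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "fruits",                      # hypothetical collection name
    {"q": "name:apple", "rows": 10, "wt": "json"},
)
print(url)
# http://localhost:8983/solr/fruits/select?q=name%3Aapple&rows=10&wt=json
```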

You’ve reached the end shard, turn around and walk away.

Static use of data

Solr works well with static data (data that doesn’t change rapidly), so we must also understand what features this static nature brings to the table.

  • Machine Learning: We can train and extract required features on top of Solr.
  • Caching: We cache all the segments of a shard, so when one segment changes, every cache related to it is discarded.
  • Full-text search support: Solr is great at enabling full-text search queries.
  • Joins: Solr allows us to query with joins across different collections, so for nested JSON we can use joins instead of index-time parent-child handling.
  • Language analysis based on Lucene, multiple suggesters, spell checkers, rich highlighting support

Solr also returns our search results in various formats, including JSON, XML, and CSV.

Use the response as per your use case

The GeoLocation saga

Using Solr’s spatial search, we can:

  • Index points or other shapes.
  • Filter search results by a bounding box or circle or by other shapes.
  • Sort or boost scoring by the distance between points, or the relative area between rectangles.
  • Generate a 2D grid of facet count numbers for heatmap generation or point-plotting.
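
To show what "filter by a circle" and "sort by distance" mean, here is a small pure-Python sketch. It is not Solr's implementation (Solr does this server-side with its own spatial filters); the point names, the coordinates, and the helpers haversine_km and within_circle are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_circle(points, center, radius_km):
    """Return the names of points inside the circle, nearest first."""
    hits = [(haversine_km(center, p), name) for name, p in points.items()]
    return [name for dist, name in sorted(hits) if dist <= radius_km]


# Hypothetical points along the prime meridian.
points = {"a": (0.0, 0.0), "b": (0.5, 0.0), "c": (1.0, 0.0), "d": (10.0, 0.0)}
print(within_circle(points, (0.0, 0.0), 120))  # ['a', 'b', 'c']
```

This brute-force version measures the distance to every point, which is exactly the work a spatial index is meant to avoid.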

Back when Lucene 6.0 was released, it dumped the Trie data structure for searching distance and switched to the Block-K-D tree or the BKD tree. This tree structure helps us search over a range but is slow for exact searches.

B-K-D Tree

The BKD tree seems to be interesting, so let’s talk about it. A K-D tree (short for "k-dimensional tree") is a space-partitioning data structure for organizing points in a k-dimensional space.

K-D trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g., range searches and nearest neighbor searches) and creating point clouds. K-D trees are a special case of binary space partitioning trees, and the BKD tree is a variant of the K-D tree that is dynamically scalable.
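
A range search over a K-D tree can be sketched as follows. Note that this is a plain in-memory K-D tree, not the block-based BKD variant that Lucene uses; the names Node, build, and range_search and the sample points are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a K-D tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(
        points[mid],
        axis,
        build(points[:mid], depth + 1),
        build(points[mid + 1:], depth + 1),
    )


def range_search(node, lo, hi, found=None):
    """Collect every point p with lo[i] <= p[i] <= hi[i] on all axes."""
    if found is None:
        found = []
    if node is None:
        return found
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        found.append(node.point)
    # Only descend into a side that can still overlap the query box.
    if lo[node.axis] <= node.point[node.axis]:
        range_search(node.left, lo, hi, found)
    if node.point[node.axis] <= hi[node.axis]:
        range_search(node.right, lo, hi, found)
    return found


tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(sorted(range_search(tree, (3, 1), (8, 5))))  # [(5, 4), (7, 2), (8, 1)]
```

The pruning in range_search is what makes the structure attractive for range queries: whole subtrees that fall outside the box are never visited.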

but now that I read it, I will write about it

Conclusion

So this was the conversation about how Solr stores data, how its static nature helps (because of its caches and the ability to use an uninverted reader for faceting and sorting), and how geolocation search allows us to search in a better way.

Must read:

  1. Solr’s intro.
  2. Solr’s fight with Elasticsearch
  3. BKD, a dynamically scalable tree (https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf)

Do follow for more such blogs.
