Introduction to Apache Solr

Sonam Nigam
Published in TechShots
4 min read, May 19, 2019

What is Solr?

Apache Solr is an open source search platform built on a Java library called Lucene. It can improve a website's search features by providing full-text search and near real-time indexing. Lucene relies on an inverted index: instead of mapping each page to the words it contains (page->words), it maps each word to the pages that contain it (word->pages), which makes retrieving search results faster.

Features of Solr

  1. Advanced Full-Text Search.
  2. Optimized for high traffic.
  3. Near Real-Time Indexing.
  4. Easy sharding and replication with ZooKeeper.
  5. Atomic updates with optimistic concurrency.
  6. Supports pagination, sorting, facets.
  7. Supports spell checking, text highlighting and auto-suggestion.

Data Representation in Solr

Everything in Solr is stored as a document, which is a collection of fields. A collection of documents forms an index. A field is a key-value representation of a piece of data. The definition of each field (its name, field type, analyzers, and so on) is stored in the schema.xml file, and all other Solr configuration is stored in solrconfig.xml.
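To make the document model concrete, here is a minimal sketch of two documents expressed as plain key-value structures. The field names (id, title, tags) are made up for illustration and are not taken from any particular schema.

```python
import json

# Two hypothetical documents; each one is just a collection of fields.
# "tags" holds several values, so it would need a multi-valued field in the schema.
docs = [
    {"id": "doc_1", "title": "Introduction to Apache Solr", "tags": ["search", "lucene"]},
    {"id": "doc_2", "title": "Inverted indexes explained", "tags": ["search"]},
]

# Solr's JSON update handler accepts a JSON array of documents shaped like this.
payload = json.dumps(docs, indent=2)
print(payload)
```

The sketch only builds the JSON body; sending it to a running Solr instance is left out on purpose.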

Field Types

Solr expands the variety of field types available in Lucene.

  1. Primitive field types like float, double, long, date, text.

Defining “text” field.

  2. User-defined field types: Solr allows the user to define custom field types by combining tokenizers and filters.

Defining the user-defined field using filter and tokenizer.
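As a rough sketch of what such a definition could look like, the snippet below builds an add-field-type command in the shape used by Solr's Schema API. The type name text_lowercase is hypothetical; the tokenizer and filter factories are classes that ship with Solr. A fieldType element in schema.xml would express the same combination.

```python
import json

# Hypothetical custom field type: one tokenizer followed by one filter.
field_type_command = {
    "add-field-type": {
        "name": "text_lowercase",  # illustrative name, not from the article
        "class": "solr.TextField",
        "analyzer": {
            # The tokenizer splits the raw text into tokens.
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            # Filters then transform the token stream, here by lower-casing.
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    }
}

print(json.dumps(field_type_command, indent=2))
```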

Defining a Field in Solr

  • name: the name of the field.
  • type: type of field e.g. text, long
  • indexed: should the field be a part of an inverted index?
  • stored: should the original value be stored?
  • multiValued: can multiple values be assigned to the field?
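The attributes above can be sketched as a small helper that builds an add-field command in the shape used by Solr's Schema API. The helper field_definition and the field names are hypothetical; text_general is assumed to exist as a field type, as it does in Solr's default configset.

```python
import json

def field_definition(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Build an add-field command from the five attributes listed above."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,          # part of the inverted index, so searchable
            "stored": stored,            # original value kept and returned in results
            "multiValued": multi_valued, # whether the field may hold several values
        }
    }

# A searchable, retrievable, single-valued text field.
description = field_definition("description", "text_general")
# A multi-valued field, e.g. several tags per document.
tags = field_definition("tags", "text_general", multi_valued=True)

print(json.dumps(description, indent=2))
print(json.dumps(tags, indent=2))
```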

Analyzer and tokenizers

Whenever data is added to Solr, it goes through a series of transformations before it is indexed; query text goes through the same kind of analysis at search time. These transformations include lower-casing, removing stop words, stemming, and stripping white space. After analysis, the data has been converted into a series of tokens, which are then added to the index (at index time) or matched against it (at query time). This makes the search more efficient.
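The idea of an analysis chain can be illustrated with a toy version in plain Python. This is not how Solr implements analysis: the stop word list is a tiny made-up sample, and naive_stem is a crude suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop word list (an assumption, not Solr's default list).
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "in", "on"}

def naive_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemming filter."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lower-case, tokenize on non-alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Detecting the Blood Pressures"))
# prints ['detect', 'blood', 'pressure']
```

Running the same analyze function on both the indexed text and the query text is what lets "Pressures" in a document match "pressure" in a query.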

If a field is not indexed, it cannot be searched. When we display search results, users want to see the original documents, not the machine-processed tokens. This is why we use the stored attribute: it tells Solr to keep the original text alongside the index.

Working of Solr

Before moving further let’s understand how inverted indexing works.

The inverted index consists of the unique words that appear in the documents, each paired with the ids of the documents in which it appears. For example, suppose we have two documents with ids doc_1 and doc_2. Each document has a field “description” with the following values:

  • “It is used to detect the blood pressure in the body”
  • “high blood pressure may have an adverse effect on the body”

Before the data is saved to the inverted index, the content of the description field is broken down into tokens (based on the analyzers and tokenizers used). Finally, these unique tokens are saved in sorted order in the index. For the current example, each token maps to the documents that contain it: “blood”, “pressure” and “body” map to both doc_1 and doc_2, while “high” maps only to doc_2.

If we search for the term “high blood pressure”, doc_1 matches two of the query words (“blood” and “pressure”), while doc_2 matches all three.

Because doc_2 has more matches than doc_1, doc_2 would be ranked first in the results. This is how inverted indexing works.
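The worked example can be reproduced with a small sketch. It assumes a very simple analysis (lower-casing and splitting on white space, with no stop word removal) and ranks by the raw number of matching query words; Solr's real relevance scoring is more sophisticated than this count.

```python
from collections import defaultdict

docs = {
    "doc_1": "It is used to detect the blood pressure in the body",
    "doc_2": "high blood pressure may have an adverse effect on the body",
}

def tokenize(text):
    """Assumed analysis: lower-case and split on white space."""
    return text.lower().split()

def build_inverted_index(documents):
    """Map each unique token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Count matching query tokens per document, best match first."""
    hits = defaultdict(int)
    for token in tokenize(query):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

index = build_inverted_index(docs)
for token in sorted(index):  # tokens listed in sorted order
    print(token, sorted(index[token]))

print(search(index, "high blood pressure"))
# prints [('doc_2', 3), ('doc_1', 2)]
```

Note that the lookup touches only the three query tokens, never the full text of the documents, which is where the speed of an inverted index comes from.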

Pictorial representation of working of Solr

The data is stored as an inverted index in a Solr core. A core is a separate entity that stores a set of documents and runs queries against it. When a user runs a search in Solr, the search query is processed by a request handler, which hands the query string to a query parser. Solr supports a variety of query parsers, such as the Standard Query Parser, the DisMax query parser, and the Extended DisMax (eDisMax) parser. To help users narrow down the search space, Solr supports two special ways of grouping search results: faceting and clustering.
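To show how these pieces appear in a request, here is a sketch that assembles a query URL for Solr's select request handler, choosing the eDisMax parser and asking for a facet. The host, the core name articles, and the fields description and category are assumptions for illustration; the parameter names (q, defType, qf, facet, facet.field, rows) are standard Solr query parameters. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

def build_select_url(base_url, core, params):
    """Assemble a URL for the /select request handler of the given core."""
    return f"{base_url}/{core}/select?{urlencode(params)}"

params = [
    ("q", "high blood pressure"),  # the user's search terms
    ("defType", "edismax"),        # pick the Extended DisMax query parser
    ("qf", "description"),         # field(s) to search (hypothetical field)
    ("facet", "true"),             # turn faceting on
    ("facet.field", "category"),   # group result counts by this (hypothetical) field
    ("rows", "10"),                # page size
]

url = build_select_url("http://localhost:8983/solr", "articles", params)
print(url)
```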

Solr Audience

Solr is a good fit if you have:

  1. A high volume of free-form text data that needs to be searched or grouped.
  2. A need for very flexible full-text search querying, e.g. spell checking and text highlighting.

Difference between RDBMS and Solr Search Engine

To get updates on interesting articles related to tech and programming, join TechShots. It's the start of a long journey. We would love developers to be a part of this and publish blogs about any tech they like. You can also send us suggestions at techshotscommunity@gmail.com. Your feedback is very valuable.
