Search and Intelligence — 1

Introduction to Solr

In this era of Big Data, it is easy to get lost in the ocean. As powerful as massive amounts of data can be, the inability to locate something crucial in them is equally frustrating. To deal with this high-velocity, high-volume and highly unstructured data, we have seen the rise of NoSQL technologies. These technologies are use-case specific and are largely based on the pattern of saving large documents in collections and building specific indexes on top of them for blazing-fast retrieval. The case of textual search is similar. With social media, every other business getting digitalised, and natural language coming into play for an uninterrupted experience in daily life, textual search has become unavoidable; blazing-fast textual search even more so.

This article is about Apache Solr, a full-text search engine that provides an API abstraction over Apache Lucene. Apache Solr is designed to handle extremely large volumes of text-based documents. A key capability of Solr is returning results in sorted order, where the sorting is based on relevance. Apart from Solr's default relevance-ranking mechanism, we can implement our own re-ranking scheme, even one based on machine learning models.
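
As a taste of what re-ranking looks like, Solr ships with a rerank query parser that rescores the top results of the main query using a second query. A minimal sketch (the core name mycore and the fields name and city are assumptions for illustration):

curl 'http://localhost:8983/solr/mycore/select' -G \
     --data-urlencode 'q=name:rajendra' \
     --data-urlencode 'rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}' \
     --data-urlencode 'rqq=city:patna'

Here the top 1000 matches for name:rajendra are rescored, boosting documents that also match city:patna.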

Solr is written in Java. It is essentially a server, run inside a Jetty container (Tomcat can also be used). Indexing is based on Apache Lucene, and all operations can be performed through REST APIs over HTTP in JSON or XML. Ideally, there is no requirement for a Spring-template-like integration framework (though one is sometimes used); because of its RESTful nature, Solr can be integrated with applications in any language or framework. In short, a document can be indexed (added or updated) and queried over HTTP.
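
For example, indexing a document is a single HTTP call to the update endpoint of a core (the core name mycore and the fields are assumptions for illustration):

curl -X POST 'http://localhost:8983/solr/mycore/update?commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id": "1", "name": "Rajendra Prasad Singh"}]'

The same document can then be fetched back with a GET on the select endpoint, as we will see below.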

A production Solr cluster typically has a master-slave architecture. The primary database contains the real data, while searchable metadata is indexed into Solr, as shown in the figure below.

Figure: Typical Production Setup for Solr

Solr works with documents. Each document has multiple fields, and fields contain data. These fields are flat in nature, not nested (there is a concept of a “copy field”, but the copied field is not a subfield; it is an independent field containing copied data). A field can store a primitive data type such as a string, integer, float or boolean. A field can also be multivalued, e.g. it can contain a list of strings. Every field is defined with a “fieldType”, and it has to be specified whether the value is to be stored, indexed, or both. Solr should not be used as a primary data store; rather, it should be employed in read-heavy applications.
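
Field definitions live in the core's schema. A minimal sketch of what such definitions look like (the field names are assumptions; the string and text_general fieldTypes ship with Solr's default configset):

<field name="id"   type="string"       indexed="true" stored="true" required="true" />
<field name="name" type="text_general" indexed="true" stored="true" />
<!-- A multivalued field: holds a list of strings -->
<field name="tags" type="string"       indexed="true" stored="false" multiValued="true" />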

This is how Solr is used in an application: we define a core (equivalent to a collection in MongoDB); in each core we define a schema; a schema contains fields; and each field is defined with a type and certain other metadata. A query is then made on the whole core, with terms that match specific fields in the documents.
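
A query is again just an HTTP call, this time to the select endpoint (core and field names assumed as before):

curl 'http://localhost:8983/solr/mycore/select?q=name:rajendra&rows=10&wt=json'

This returns, in JSON, the ten most relevant documents whose name field matches rajendra.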

Simple expectations from Solr are as follows:

  • Return matching documents within milliseconds.
  • Take care of spelling mistakes, repeated characters and case mismatches.
  • Take care of linguistic variations in natural language.
  • Omit common words and search for specific keywords only. Omit symbols, or use synonyms of symbols, like “dollar” for “$” or “ninety” for “90”.
  • Return a cursor over the data, in case satisfactory results are missing from the first set of results; see the example query after this list.
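
The last two expectations can be sketched with a fuzzy query and Solr's cursor support. In recent Lucene query syntax, fuzziness is expressed as a maximum edit distance (~2 below), and cursorMark pagination requires a sort that includes the unique key (core and field names assumed as before):

curl 'http://localhost:8983/solr/mycore/select' -G \
     --data-urlencode 'q=name:rajendr~2' \
     --data-urlencode 'sort=score desc,id asc' \
     --data-urlencode 'cursorMark=*' \
     --data-urlencode 'rows=10'

name:rajendr~2 matches names within edit distance 2 of “rajendr”, and the nextCursorMark returned in the response is passed back as cursorMark to fetch the next page.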

Simply creating fields that match a query verbatim will not serve these purposes. For this, Solr provides Analyzers. Each field that needs to support loose matching, rather than perfect string matching, can be defined with a custom “fieldType”. This “fieldType” can then be configured with an index analyzer and a query analyzer.

For example, data indexed and stored as name: Rajendra Prasad Singh can be searched as name: rajendr prasad singh. The expectation is that even a lower-case, misspelt search query returns the most relevant result. So, while indexing, we convert the data to lower case (LowerCaseFilterFactory); our original data becomes name: rajendra prasad singh. We also add a fuzziness of 0.80 to the query, so that roughly 2 out of every 10 characters may be misspelt: documents with a name value within edit distance 2 can match (rajendr matches rajendra with an edit distance of 1).

Now let’s say the query is name: rajendra. We would still want this document returned, since the first name matches. For this, we use a tokenizer that splits the value into words before indexing. So name: Rajendra Prasad Singh is first split by the tokenizer (StandardTokenizerFactory) into name: [Rajendra, Prasad, Singh], and the lower-case filter then turns it into name: [rajendra, prasad, singh]. Now the query name: rajendra matches this field. This is how analyzers, tokenizers and filters are used to index data.

Similar processing can be applied to the query for that particular field. For instance, the query name: RAJENDRA should still match; for this we add a query analyzer containing a LowerCaseFilterFactory, so the incoming query becomes name: rajendra, i.e. it gets converted to lower case. Several very useful filters ship with Solr: BeiderMorseFilterFactory for phonetic matching, SynonymFilterFactory for similar words, and StopFilterFactory to omit common words.
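
A minimal sketch of the fieldType just described (the type name text_name is an assumption; StandardTokenizerFactory and LowerCaseFilterFactory ship with Solr):

<fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on word boundaries, then lower-case each token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The same tokenizer and filter chain is used on both sides, so indexed tokens and query tokens are normalized identically.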

Below is a more elaborate example of a custom fieldType definition, built for autocomplete-style (edge n-gram) matching:

<field name="name_edge" type="autocomplete_edge" indexed="true" stored="true" />
<fieldType name="autocomplete_edgespaced" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement=" " />
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="3"/>
<charFilter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" />
</analyzer>
      <analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement=" " />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" />
<charFilter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" />
</analyzer>
</fieldType>
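
With this fieldType in place, a prefix query against name_edge matches documents whose name begins with the typed characters (core name assumed as before):

curl 'http://localhost:8983/solr/mycore/select?q=name_edge:raj'

Because the index analyzer stored the edge n-grams of the full value (raj, raje, rajen, …), the three-character prefix raj already matches a name like Rajendra Prasad Singh.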

To be continued…