Configuring Solr for Optimum Performance

Apache Solr is a widely used search engine. Several well-known platforms use Solr; Netflix and Instagram are among them. Here at tajawal, we use both Solr and ElasticSearch in our applications. In this post, I am going to give you some tips on how to write optimized schema files. We will not be discussing the basics of Solr, and I expect you to be aware of how it works.

You can get away with defining just the fields and some defaults in the schema file, but you will not get the performance boost you need. There are certain key configurations that you have to pay attention to. In this post, I will discuss the configurations that help you get the most out of Solr in terms of performance.

Without further ado, let’s get started with what these configurations are.

1. Configure Cache

Solr caches are associated with a specific instance of an Index Searcher, a specific view of an index that doesn’t change during the lifetime of that searcher.

Configuring the caches is one of the most important steps in maximizing performance.

Configure `filterCache`:

The filter cache is used by SolrIndexSearcher for filters. It stores the set of documents matching each filter query, which gives you control over how filter queries are handled. Another benefit is that when a new searcher is opened, its caches can be prepopulated, or “autowarmed”, using data from the caches of the old searcher, so the first queries against the new searcher do not start cold. Note that autowarming only happens when autowarmCount is greater than zero. For example:

<filterCache 
class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="0"
/>

class: the SolrCache implementation to use (LRUCache or FastLRUCache)
size: the maximum number of entries in the cache
initialSize: the initial capacity (number of entries) of the cache (see java.util.HashMap)
autowarmCount: the number of entries to prepopulate from an old cache
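
To make the size attribute more concrete, here is a toy sketch in Python of how a least-recently-used cache evicts entries once it is full. This is only an illustration of the LRU policy, not Solr's implementation; the class name LRUCacheSketch and the filter strings are made up for the example.

```python
from collections import OrderedDict


class LRUCacheSketch:
    """Toy LRU cache that illustrates the `size` limit (not Solr's code)."""

    def __init__(self, size):
        self.size = size              # maximum number of entries
        self.entries = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        if key not in self.entries:
            return None
        self.hits += 1
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict least recently used

    def hit_ratio(self):
        return self.hits / self.lookups if self.lookups else 0.0


# Hypothetical filter queries mapped to the document IDs they match.
cache = LRUCacheSketch(size=2)
cache.put("category:books", {1, 4, 9})
cache.put("in_stock:true", {1, 2, 3})
cache.get("category:books")           # touch: now the most recently used
cache.put("price:[0 TO 10]", {4})     # over capacity: evicts "in_stock:true"
print(sorted(cache.entries), cache.hit_ratio())
```

If the hit ratio stays low in a real deployment, the cache is probably too small for the number of distinct filters, or the entries are being thrown away too often by commits.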

Configure `queryResultCache` & `documentCache`:

queryResultCache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested.
documentCache holds Lucene Document objects (the stored fields for each document). Since Lucene internal document IDs are transient, this cache is not auto-warmed.

You can configure both of them depending on your application. They give better performance when your use case is mostly read-only.
Consider a blog that has Posts and Comments on those posts. Posts are read far more often than they are written, so we can enable these caches for Posts.
For example:

<queryResultCache 
class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"
/>
<documentCache 
class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"
/>

If your use case is write-heavy, keep queryResultCache and documentCache disabled: these caches are discarded every time a commit opens a new searcher (for example, on every soft commit), so they rarely live long enough to help. Going back to the blog example above, we can disable these caches for Comments.

2. Configure SolrCloud

Cloud computing is very popular nowadays, and it lets you manage scalability, high availability and fault tolerance. Solr can set up a cluster of Solr servers that combines fault tolerance and high availability.

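While setting up a Solr cluster, you can configure “master” and “slave” replication. Use the “master” instance for indexing and as many slaves as you need for querying. Keep in mind that this is Solr's classic index replication, configured through the replication request handler; in SolrCloud mode proper, replication between replicas is handled for you. In the solrconfig.xml file on the master server, include the following configuration inside the master section of the replication handler: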

<str name="confFiles">
solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml
</str>

Have a look at Solr Docs for further details.

3. Configure `Commits`

To make data available for search, we have to commit it to the index. Commits can be slow when you have billions of records, so Solr gives you several options to control when data is committed. You will have to choose the option that fits your application.

“commit” or “softCommit”:

You can commit the data to the index by sending the commit=true parameter with an update request. This performs a hard commit: the Lucene index files are written to stable storage, so every index segment is guaranteed to be up to date. It can be costly when you have a lot of data.

To make the data available for search immediately, you can send the softCommit=true flag. It commits your changes to the Lucene data structures quickly but does not guarantee that the Lucene index files are written to stable storage. This is called Near Real Time (NRT) searching, a feature that boosts document visibility, since you don’t have to wait for background merges and storage (to ZooKeeper, if using SolrCloud) to finish before moving on to something else.

autoCommit:

The autoCommit setting controls how often pending updates are automatically pushed to the index. You can set a time limit or a maximum number of updated docs to trigger this commit. For a single update request, the commitWithin parameter gives you similar time-based control. You can define autoCommit in the updateHandler section of solrconfig.xml as below:

<autoCommit>
<maxDocs>20000</maxDocs>
<maxTime>50000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>

maxDocs: The number of updates since the last commit that triggers an automatic commit.
maxTime: The number of milliseconds since the oldest uncommitted update that triggers an automatic commit.
openSearcher: Whether to open a new searcher when performing a commit. If this is false, the commit will flush recent index changes to stable storage, but does not cause a new searcher to be opened to make those changes visible. The default is true.

There are also cases where you can disable autoCommit altogether. For example, if you are migrating millions of records from a different datasource to Solr, you don’t want to commit the data on every insert; even with bulk inserts, committing every 2, 4 or 6 thousand insertions will still slow down the migration. In such a case you can disable `autoCommit` completely and commit once at the end of the migration, or you can set maxTime to something large, say 3 hours (i.e. 3*60*60*1000 = 10800000 milliseconds). You can also add <maxDocs>50000000</maxDocs>, which means an auto commit happens only after 50 million documents are added. After you post all your documents, call commit once, manually or from SolrJ. It will take a while to commit, but this will be much faster overall.

Also, after you are done with your bulk import, reduce maxTime and maxDocs so that any incremental posts you make to Solr get committed much sooner.
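
The bulk-import pattern described above can be sketched in Python roughly like this. It is a minimal sketch, not a full client: the send callback is a hypothetical stand-in for whatever HTTP client you use (it would POST the JSON body to the given path of your core with a JSON content type), and the batch size of 1000 is an arbitrary choice.

```python
import json


def batches(docs, batch_size):
    """Yield consecutive slices of at most batch_size documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def bulk_index(docs, send, batch_size=1000):
    """Post docs in batches without committing, then commit once at the end.

    `send(path, body)` is a hypothetical transport callback supplied by
    the caller; this sketch never talks to a real server itself.
    """
    sent = 0
    for batch in batches(docs, batch_size):
        send("/update", json.dumps(batch))   # no commit parameter here
        sent += len(batch)
    # One explicit hard commit after everything has been posted.
    send("/update", json.dumps({"commit": {}}))
    return sent


# Record the calls instead of sending them, just to show the sequence.
calls = []
total = bulk_index(
    [{"id": str(i)} for i in range(2500)],
    lambda path, body: calls.append((path, body)),
)
print(total, len(calls))
```

With 2500 documents and batches of 1000, this makes three update calls and a single commit call, instead of committing after every batch.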

4. Configure Dynamic Fields

One of the amazing features of Apache Solr is dynamicField. It’s very handy when you have hundreds of fields and you don’t want to define all of them.

A dynamic field is just like a regular field except it has a name with a wildcard in it. When you are indexing documents, a field that does not match any explicitly defined fields can be matched with a dynamic field.
For example, suppose your schema includes a dynamic field with a name of *_i. If you attempt to index a document with a cost_i field, but no explicit cost_i field is defined in the schema, then the cost_i field will have the field type and analysis defined for *_i.
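
The lookup rule just described can be sketched in a few lines of Python. This only illustrates the idea (explicit fields first, then wildcard patterns); it is not how Solr implements it. The field names and types are hypothetical, and preferring the longest matching pattern is an assumption made for the sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicitly defined fields and dynamic field patterns.
EXPLICIT_FIELDS = {"id": "string", "title": "text_general"}
DYNAMIC_FIELDS = [("*_i", "int"), ("*_s", "string"), ("attr_*", "text_general")]


def resolve_field_type(name):
    """Return the field type for `name`, or None if nothing matches."""
    if name in EXPLICIT_FIELDS:          # explicit definitions always win
        return EXPLICIT_FIELDS[name]
    matches = [(pattern, ftype) for pattern, ftype in DYNAMIC_FIELDS
               if fnmatchcase(name, pattern)]
    if not matches:
        return None
    # Assumption for this sketch: the longest matching pattern is preferred.
    return max(matches, key=lambda match: len(match[0]))[1]


print(resolve_field_type("cost_i"))   # no explicit cost_i, falls back to *_i
print(resolve_field_type("title"))    # explicit field
print(resolve_field_type("unknown"))  # matches nothing
```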

But you have to be careful with dynamicField; don’t use it extensively, as it has drawbacks as well. If you use projection (like “abc.*.xyz.*.fieldname”) to fetch specific columns from dynamic fields, the field names have to be parsed using regular expressions, which adds parsing time whenever a query result is returned. Below is an example of creating a dynamic field.

<dynamicField 
name="*.fieldname"
type="boolean"
multiValued="true"
stored="true"
/>

Using a dynamic field means you can have an unlimited number of field-name combinations, because the name contains a wildcard. This can be costly, as Lucene allocates memory for each unique field (column) name. If you have a row with columns A, B, C, D and another row with columns E, F, C, D, Lucene will allocate 6 chunks of memory instead of 4, because there are 6 unique column names. Even with only 6 unique column names, that is 50% extra memory, which across millions of rows can be enough to crash the heap.
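
The arithmetic in this example is easy to check with a short Python snippet. The two rows are the ones from the paragraph above; the values are placeholders.

```python
def unique_field_names(docs):
    """Collect every distinct field name used across the documents."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return names


row1 = {"A": 1, "B": 2, "C": 3, "D": 4}
row2 = {"E": 5, "F": 6, "C": 7, "D": 8}

names = unique_field_names([row1, row2])
extra = (len(names) - len(row1)) / len(row1)  # growth relative to 4 columns
print(len(names), f"{extra:.0%}")
```

Running the same function over a sample of your real documents is a quick way to see how many distinct field names your dynamic fields are actually producing.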

5. Configure Indexed vs Stored Fields

Indexing a field makes it searchable: indexed="true" makes a field searchable, sortable and facetable. For example, if you have a field named test1 with indexed="true", you can search it like q=test1:foo, where foo is the value you are searching for. So set indexed="true" only on the fields you actually need to search on; the rest should be indexed="false", with stored="true" if you need them in search results. For example:

<field name="foo" type="int" stored="true" indexed="false"/>

This reduces reindexing time: on every reindex, Solr applies the filters, tokenizers and analyzers to each indexed field, which adds processing time, so the fewer indexed fields we have, the less work there is to do.

6. Configure Copy Fields

Solr provides a very nice feature called copyField, a mechanism to store a copy of multiple fields in a single field. How you use copyField depends on the scenario, but the most common one is to create a single “search” field that serves as the default query field when users or clients do not specify a field to query.

Use copyField to copy all general text fields into one text field, and use that field for searching. Searching one field instead of many keeps queries simpler and gives you better performance. For example, if you have dynamic data like ab_0_aa_1_abcd and you want to copy all the fields with the postfix _abcd to one field, you can create a copyField in schema.xml like below:

 <copyField source="*_abcd" dest="wxyz"/>

source: The name of the field to copy
dest: The name of the copy field
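
To see what this rule does to a document at index time, here is a small Python simulation. It only mimics the effect of copying matching fields into wxyz; it is not Solr code, and the sample document and its values are made up.

```python
from fnmatch import fnmatchcase


def apply_copy_field(doc, source_pattern, dest):
    """Return a copy of doc with values of matching fields gathered in dest."""
    copied = []
    for name, value in doc.items():
        if name != dest and fnmatchcase(name, source_pattern):
            # A source field may hold one value or a list of values.
            copied.extend(value if isinstance(value, list) else [value])
    result = dict(doc)
    if copied:
        existing = result.get(dest, [])
        if not isinstance(existing, list):
            existing = [existing]
        result[dest] = existing + copied
    return result


# Hypothetical document using the naming scheme from the text.
doc = {
    "id": "1",
    "ab_0_aa_1_abcd": "red",
    "ab_0_aa_2_abcd": "blue",
    "price_i": 10,
}
indexed = apply_copy_field(doc, "*_abcd", "wxyz")
print(indexed["wxyz"])
```

Because several source fields can land in the same destination, the destination ends up holding multiple values, which is why the sketch stores them as a list.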

7. Use Filter Query ‘fq’

Using the Filter Query fq parameter in a search is a very useful way to maximize performance. It defines a query that restricts the superset of documents that can be returned, without influencing the score, and it is cached independently.

Filter Query fq can be very useful for speeding up complex queries, since the queries specified with fq are cached independently of the main query. When a later query uses the same filter, there’s a cache hit, and filter results are returned quickly from the cache.

Below is an example request, written as HTTP client options, that uses a filter query:

POST
{
"form_params": {
"fq": "id:1234",
"fl": "abc cde",
"wt": "json"
},
"query": {
"q": "*:*"
}
}

The filter query parameter can also be used multiple times in a single search query. Have a look at the Solr Filter Query docs for further details.
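
As a rough sketch of using fq more than once, the Python snippet below builds a select URL with one fq parameter per filter, so that each filter can be cached on its own. The base URL, the core name posts and the status:published filter are assumptions for the example; nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, q="*:*", filters=(), fields=None):
    """Build a Solr select URL with one fq parameter per filter."""
    params = [("q", q)]
    params += [("fq", f) for f in filters]    # repeated fq parameters
    if fields:
        params.append(("fl", ",".join(fields)))
    params.append(("wt", "json"))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and filters.
url = build_select_url(
    "http://localhost:8983/solr/posts",
    filters=["id:1234", "status:published"],
    fields=["abc", "cde"],
)
print(url)
```

Keeping each condition in its own fq, rather than joining them with AND inside a single fq, lets a later query that shares only one of the filters still reuse its cached result.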

8. Use Facet Queries

Faceting in Apache Solr classifies the search results into different categories. It is very helpful for aggregation operations such as grouping by a specific field and counting. For all aggregation-specific queries you can use facets to do the aggregation out of the box, and it is a performance booster as well, because faceting is built for exactly these kinds of operations.
Below is an example facet request to Solr, written as HTTP client options.

{
"form_params": {
"fq" : "fieldName:value",
"fl" : "fieldName",
"facet" : "true",
"facet.mincount": 1,
"facet.limit" : -1,
"facet.field" : "fieldName",
"wt" : "json"
},
"query" : {
"q": "*:*"
}
}

fq: Filter Query
fl: Fields List to be returned in the result
facet: true/false to enable/disable facet counts
facet.mincount: The minimum count a facet value needs to be included; 1 excludes values with a count below 1
facet.limit: The maximum number of facet values to return; -1 means all
facet.field: The field to treat as a facet (to group the results)
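
With wt=json, the counts for a facet.field normally come back as a flat list that alternates value and count. The Python sketch below turns that list into a dictionary. The sample response is hand-written for illustration (the category field and its counts are made up), and the sketch assumes the default flat list layout.

```python
def facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the shape of a JSON facet response.
response = {
    "facet_counts": {
        "facet_fields": {
            "category": ["books", 12, "music", 7, "games", 3],
        }
    }
}

counts = facet_counts(response, "category")
print(counts)
```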

Conclusion:

Performance improvement is a critical step when bringing Solr to production. There are many tuning knobs in Solr that can help you maximize the performance of the system, and we have discussed some of them in this post: changing solrconfig.xml to use optimal configurations, updating the schema file with appropriate indexing options and field types, using filter queries (fq) as much as possible, and using appropriate cache options. But again, it all depends on your application.

And that wraps it up. I hope this post was helpful. If you would like to learn more, have a look at the Solr reference linked below.