How to Deploy Apache Solr as SolrCloud on HDFS in a Production Cluster

Harsh Jain
Apr 27, 2016 · 6 min read

Here is a high-level overview of how to deploy Solr as SolrCloud on HDFS, taking each step cautiously.

  1. Requirements
  2. Install
  3. Edit Configuration files to work with HDFS
  4. Changing to ‘Solr’ user
  5. Starting SolrCloud
  6. Adding Nodes
  7. Creating a collection
  8. Verify & Enjoy

These directions were drafted using this Lucidworks Install guide and are only meant to help avoid pitfalls and provide additional notes. Please read and understand the notes & warnings in the Lucidworks guide before beginning.


#Requirements for this guide

Solr 5.5.0 requires Java 1.7 or higher.
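Before installing, it can help to confirm the Java version on each node. The following is a minimal sketch, not an official check: the function name java_version_ok is hypothetical, and it assumes version strings shaped like the ones `java -version` prints, such as 1.7.0_79 or 11.0.2.

```shell
# Hypothetical helper: succeeds if the given Java version string is 1.7 or newer.
# Assumes strings such as 1.7.0_79 (old scheme) or 11.0.2 (new scheme).
java_version_ok() {
  major=${1%%.*}     # text before the first dot
  rest=${1#*.}       # text after the first dot
  minor=${rest%%.*}  # second component, e.g. 7 in 1.7.0_79
  if [ "$major" -gt 1 ]; then
    return 0         # 9, 11, 17, ... are all newer than 1.7
  fi
  [ "$major" -eq 1 ] && [ "$minor" -ge 7 ]
}

# Example: read the version from the local JVM, if one is installed.
if command -v java >/dev/null 2>&1; then
  found=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  if [ -n "$found" ] && java_version_ok "$found"; then
    echo "Java $found is new enough for Solr 5.5.0"
  else
    echo "Java version '$found' is too old or not recognized; Solr 5.5.0 needs 1.7 or higher"
  fi
fi
```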

HDP 2.3 or 2.4

Apache ZooKeeper — HDP 2.3 or 2.4 and Solr both use Apache ZooKeeper to manage services for the cluster. The ZooKeeper ensemble that you are using for HDP 2.3 or 2.4 can also be used by Solr.

This guide doesn’t include Kerberos requirements.


#Install Lucidworks-HDPsearch package

yum install lucidworks-hdpsearch

After installation, the HDPSearch files will be found in this directory.

/opt/lucidworks-hdpsearch 

Install only — DO NOT START SOLR SERVICE YET

The Lucidworks HDP Search package must be installed manually on each node of the cluster. Follow this guide on one machine first, then install the package and start the service on each remaining node in the cluster (see the Adding nodes section below).


#Notes on Config files for HDFS

SolrCloud provides central configuration for a cluster of Solr servers, automatic load balancing and fail-over for queries, and distributed index replication. When setting up SolrCloud, you need to modify some Solr-specific configuration files before starting Solr.

The configuration files are uploaded to ZooKeeper and managed centrally for all Solr nodes. The config files are located in:

/opt/lucidworks-hdpsearch/solr/server/solr/configsets

Personal tip: I chose the ‘data_driven_schema_configs’ configset because it allows Solr to create a schema on the fly without you first having to define it. If you have a specific schema in mind, choose from the other two configsets.

The following changes only need to be completed for the first Solr node that is started. After the first node is running, all additional nodes will get their configuration information from ZooKeeper.


#Modifying Config files for HDFS

Find the solrconfig.xml file in the configset you will customize for your first collection. Within that file, find the section for <directoryFactory>. It will most likely look like this:

<directoryFactory name="DirectoryFactory"
    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
</directoryFactory>

We will want to replace this with a different class, and define several additional properties.

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://host:port/user/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>

You can copy and paste the whole block over the existing one. The only line you will need to change is solr.hdfs.home, the address of the directory in HDFS where your indexes will be stored. It is made up of the following parts:

hdfs://: keep this exactly as is.

Hostname: the full name of your NameNode on the cluster, e.g. Namenode01.company.com

Port: the port of the RPC address of your NameNode. It can be found in Ambari under HDFS > Settings > Advanced. The default is usually 8020.

/user/solr: the directory under which your Solr indexes will be saved. This is the path you will see in the HDFS file browser. You may define it differently if you desire.
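To make the pieces above concrete, here is a small sketch that joins them into a single solr.hdfs.home value. The function name solr_hdfs_home is hypothetical, and the defaults (port 8020, path /user/solr) are simply the values used in this guide; check them against your own cluster.

```shell
# Hypothetical helper: join NameNode host, RPC port and HDFS path into the
# value expected by solr.hdfs.home.
solr_hdfs_home() {
  host=$1
  port=${2:-8020}         # usual default RPC port, per this guide
  path=${3:-/user/solr}   # directory used throughout this guide
  printf 'hdfs://%s:%s%s\n' "$host" "$port" "$path"
}

solr_hdfs_home Namenode01.company.com 8020 /user/solr
# prints hdfs://Namenode01.company.com:8020/user/solr
```

The same string is used again when starting Solr, so it is worth building it once and reusing it.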

*optional*

If your cluster uses Kerberos, please see additional information at the link Starting Solr with Kerberos before starting Solr.


#Changing to Solr user

sudo su - solr

Enter your password when prompted.

#Starting Solr

Change to the Solr install directory:

cd /opt/lucidworks-hdpsearch/solr/

The start command itself is quite simple:

bin/solr start -c (1)
-z 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 (2)
-Dsolr.directoryFactory=HdfsDirectoryFactory (3)
-Dsolr.lock.type=hdfs (4)
-Dsolr.hdfs.home=hdfs://host:port/path (5)

(1) The start command for the bin/solr script. The -c parameter tells Solr to start in SolrCloud mode.

(2) The connect string for the ZooKeeper ensemble. The addresses can be found in Ambari under ZooKeeper > Settings. The default port is 2181. We give the address of every node in the ZooKeeper ensemble so that if one is down, we can still connect as long as there is a quorum.

(3) The Solr index implementation you will use; this parameter defines how the indexes are stored on disk. In this case, we are telling Solr all indexes should be stored in HDFS.

(4) The index lock type to use. Again, we have defined hdfs to indicate the indexes will be stored in HDFS.

(5) The path to the location of the Solr indexes in HDFS. This is the same path defined in the config file above. If you changed the directory ‘/user/solr’, make sure this reflects your custom path.

IMPORTANT NOTE: If you do not specify a ZooKeeper connect string with the -z property, Solr will launch its embedded ZooKeeper. The embedded ZooKeeper is a single instance, so it provides no failover and is not meant for production use.

Note we have not defined a collection name, a configuration set, how many shards or nodes we want, etc. Those properties are defined at the collection level, and we’ll define those when we create a collection.
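The numbered callouts above are annotations, not part of the command, so the listing cannot be pasted as is. As a sketch, the helper below prints the same command on one line so it can be reviewed before running. The function name solr_start_command is hypothetical, and the ZooKeeper addresses and HDFS path are the placeholder values from this guide.

```shell
# Placeholder values from this guide; replace them with your own.
ZK_HOSTS="10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181"
HDFS_HOME="hdfs://host:port/path"

# Hypothetical helper: print the start command instead of running it.
# $1 = ZooKeeper connect string, $2 = value for solr.hdfs.home
solr_start_command() {
  printf 'bin/solr start -c -z %s -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=%s\n' "$1" "$2"
}

solr_start_command "$ZK_HOSTS" "$HDFS_HOME"
```

Once the printed line looks right, it can be run as the solr user from /opt/lucidworks-hdpsearch/solr/.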


#Adding nodes to Solr

Log in to the node and install the package:

ssh <nodeName>
yum install lucidworks-hdpsearch

Change to the solr user (see above).

Repeat the Starting Solr section above. The command and settings are exactly the same as before, so you can copy and paste them.

Repeat for each node.
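With more than a couple of nodes it is easy to lose track of which steps have been done where. The sketch below only prints a checklist of the per-node steps from this section; it does not connect to anything. The node names and the function name print_node_steps are hypothetical.

```shell
# Hypothetical node names; replace with the hosts in your cluster.
NODES="node02.company.com node03.company.com"

# Print the per-node steps from this section for each host (a dry run only).
print_node_steps() {
  for node in $1; do
    echo "ssh $node"
    echo "yum install lucidworks-hdpsearch"
    echo "sudo su - solr"
    echo "cd /opt/lucidworks-hdpsearch/solr/"
    echo "# then run the same bin/solr start command as on the first node"
  done
}

print_node_steps "$NODES"
```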


#Creating your first collection

bin/solr create -c SolrCollection (1)
-d data_driven_schema_configs (2)
-n mySolrConfigs (3)
-s 2 (4)
-rf 2 (5)

(1) The create command for the bin/solr script. In this case, the -c parameter provides the name of the collection to create.

(2) The configset to use. In this case, we’ve used the data_driven_schema_configs configset. If you modified a configset to support storing Solr indexes in HDFS, as above, you should instead use the name of the configset you modified.

(3) This will be the name of the configset uploaded to ZooKeeper. This allows the same configset to be reused and very similar configsets to be differentiated easily.

(4) The number of shards to split the collection into. The shards are physical sections of the collection’s index on nodes of the cluster.

(5) The number of replicas of each shard for the collection. Replicas are copies of the index which are used for failover and backup in case of failure of one of the main shards.
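As with the start command, the callouts have to be removed before the create command can be run. The sketch below prints it on one line and also works out how many cores the collection will occupy across the cluster, which is the number of shards multiplied by the replication factor. The function names solr_create_command and total_cores are hypothetical; the values are the ones used in this guide.

```shell
# Values used in this guide; adjust to taste.
COLLECTION="SolrCollection"
CONFIGSET="data_driven_schema_configs"
CONFIG_NAME="mySolrConfigs"
SHARDS=2
REPLICAS=2

# Hypothetical helper: print the create command instead of running it.
solr_create_command() {
  printf 'bin/solr create -c %s -d %s -n %s -s %s -rf %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical helper: shards multiplied by replication factor.
total_cores() {
  echo $(( $1 * $2 ))
}

solr_create_command "$COLLECTION" "$CONFIGSET" "$CONFIG_NAME" "$SHARDS" "$REPLICAS"
echo "Total cores: $(total_cores "$SHARDS" "$REPLICAS")"
# prints Total cores: 4
```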


#Verify & Enjoy!

http://<hostname>:8983/solr

8983 is the default port for Solr. If the page doesn’t load, check your Solr port and firewall settings.

Click on Cloud in the left navigation pane.

You should see a visual of your collection with 2 shards & 2 replicas in green.
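If you would rather check from the command line than the Admin UI, the Collections API has a CLUSTERSTATUS action that reports the state of each shard and replica. The sketch below only builds the URL. The function name cluster_status_url and the host solr01.company.com are hypothetical, and 8983 is assumed as the port.

```shell
# Hypothetical helper: build the Collections API URL that reports cluster state.
cluster_status_url() {
  host=$1
  port=${2:-8983}   # default Solr port
  printf 'http://%s:%s/solr/admin/collections?action=CLUSTERSTATUS&wt=json\n' "$host" "$port"
}

cluster_status_url solr01.company.com
# prints http://solr01.company.com:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json
```

Passing the printed URL to curl (in quotes, because of the & character) should return JSON in which every replica of the collection has a state of active.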


And that’s it!

Once again, this is meant to be a more detailed guide based on the Lucidworks install guide. I wrote this to help you avoid some pitfalls and share notes from my experience. This is by no means comprehensive and especially doesn’t include security integration.

If you have any questions or are getting funky errors feel free to leave a comment, or reach out to me on twitter @imharshj.
