SOLR Integration on AEM as a Cloud Service

Tushar Bias
Sep 21, 2022


SOLR is an enterprise-grade, secure, highly scalable, fast, open-source NoSQL search platform built on Apache Lucene. Its major features include powerful full-text search, faceted search and analytics, rich document parsing, geospatial search, extensive REST-like APIs, and parallel SQL.

Web Crawler — Option 1

To index the entire website for the first time, you can use a web crawler. Several web crawlers are available, such as Nutch, jSpider, and Heritrix, and SOLR also ships with its own simple web crawler.

If real-time content updates are not needed, the web crawler can be run on a schedule to index the content via sitemap.xml.
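As a rough illustration, the Java sketch below shows how a scheduled job could read sitemap.xml and collect the page URLs to crawl and index. The sitemap URL is a placeholder, and the actual crawler and scheduler are left out.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapReader {

    // Reads sitemap.xml and returns the <loc> URLs a scheduled crawl would visit.
    public static List<String> readPageUrls(String sitemapUrl) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new URL(sitemapUrl).openStream());

        NodeList locs = doc.getElementsByTagNameNS("*", "loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sitemap location; invoke this from a scheduler (e.g. a nightly job).
        readPageUrls("https://www.example.com/sitemap.xml").forEach(System.out::println);
    }
}
```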

Custom Application — Option 2

In this second option, you build a custom application that reads the content from the CDN (so that the content gets cached on all layers) and then passes it to SOLR for indexing. The custom application invokes an AEM servlet with a page path as a parameter, and the servlet returns the content of that page as JSON, as sketched below.
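The following sketch shows what such a servlet could look like. The servlet path, the request parameter, and the JSON fields are assumptions made for illustration, not something prescribed by AEM or this article; it also assumes Jackson is available on the classpath.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.Servlet;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ValueMap;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.apache.sling.servlets.annotations.SlingServletPaths;
import org.osgi.service.component.annotations.Component;

// Illustrative servlet: returns basic page properties as JSON for the external
// indexing application. Path, parameter name and JSON shape are assumptions.
@Component(service = Servlet.class)
@SlingServletPaths("/bin/solr/pagecontent")
public class PageContentServlet extends SlingSafeMethodsServlet {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        // Page path sent by the indexing client, e.g. /content/site/en/home
        String pagePath = request.getParameter("path");

        Resource content = request.getResourceResolver().getResource(pagePath + "/jcr:content");
        if (content == null) {
            response.sendError(SlingHttpServletResponse.SC_NOT_FOUND);
            return;
        }

        ValueMap props = content.getValueMap();
        Map<String, Object> json = new HashMap<>();
        json.put("path", pagePath);
        json.put("title", props.get("jcr:title", ""));
        json.put("description", props.get("jcr:description", ""));

        response.setContentType("application/json");
        response.setCharacterEncoding("UTF-8");
        MAPPER.writeValue(response.getWriter(), json);
    }
}
```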

Because the servlet response gets cached on the Dispatcher and the CDN, you can rely on TTL-based cache expiry to flush the cache instead of writing custom invalidation logic.
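A minimal sketch of the custom indexing application is shown below, assuming SolrJ and Jackson are on the classpath. The CDN host, SOLR URL, servlet path, and field names are placeholders; in a real setup the list of page paths would come from sitemap.xml or a queue.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexer {

    private static final String CDN_HOST = "https://www.example.com";            // placeholder
    private static final String SOLR_URL = "http://localhost:8983/solr/website"; // placeholder

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        HttpClient http = HttpClient.newHttpClient();

        try (SolrClient solr = new HttpSolrClient.Builder(SOLR_URL).build()) {
            String pagePath = "/content/site/en/home"; // would normally come from sitemap.xml

            // Fetch the page JSON through the CDN so it gets cached on every layer.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(CDN_HOST + "/bin/solr/pagecontent?path=" + pagePath))
                    .GET()
                    .build();
            String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
            JsonNode page = mapper.readTree(body);

            // Map the JSON fields onto a SOLR document and index it.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", page.get("path").asText());
            doc.addField("title", page.get("title").asText());
            doc.addField("description", page.get("description").asText());

            solr.add(doc);
            solr.commit();
        }
    }
}
```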

SOLR offers a couple more options to crawl the content:

Post Tool — Option 3

SOLR includes a simple command line tool, bin/post, for posting various types of content to a SOLR server. The bin/post tool is a Unix shell script; on Windows (non-Cygwin), the underlying SimplePostTool Java class has to be invoked directly.

This tool is meant for use by new users exploring SOLR’s capabilities and is not intended as a robust solution to be used for indexing documents into production systems.

For more details, refer to the Post Tool section of the SOLR Reference Guide.
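For illustration, a typical invocation (with a placeholder collection name) posts the example documents shipped with SOLR like this:

```
bin/post -c mycollection example/exampledocs/*.xml
```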

SOLR’s Web Crawler — Option 4

Use SOLR's own capabilities to crawl content with the SimplePostTool web crawler mode:

https://solr.apache.org/docs/5_4_1/solr-core/org/apache/solr/util/SimplePostTool.html

A very simple crawler that pulls URLs to fetch and recurses N levels deep if recursive > 0.

Links are parsed from the HTML by first getting an XHTML version using SolrCell with extractOnly, and they are followed only if they are local. The crawler pauses for a default delay of 10 seconds between fetches; this can be configured via the delay variable. This is only meant for test purposes, as it does not respect robots.txt.
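For example, a crawl that recurses two levels deep with a 10 second delay between fetches could be started as follows (collection name and URL are placeholders):

```
bin/post -c mycollection https://www.example.com/ -recursive 2 -delay 10
```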

Note: neither of these options (3 & 4) is recommended for a production setup.
