Getting Started: SOLR Indexing In AEM 6.3

Code and Theory
Jun 19, 2018

Written by Juan Ayala

Introduction

At Code & Theory, we have experience with several indexing solutions, such as SOLR, Elasticsearch and Amazon CloudSearch. We make a recommendation based on client needs, expertise and stack. On our latest AEM project, we decided to go with SOLR. The main reasons were:

  • It’s an Apache project, and AEM is built on Apache projects (e.g. Felix, Jackrabbit and Sling).
  • The Java client SolrJ and its dependencies are distributed as OSGi bundles and can be easily deployed to the Felix container.
  • It exposes a REST API over HTTP, giving us the option of querying directly from the front end or through SolrJ on the back end.

When it comes to indexing AEM content using SOLR, success rests on several factors: a good taxonomy, an extensible suite of OSGi service components, good UX to create components that leverage the indexed data, and a scalable SOLR deployment. At Code & Theory, we do all this for our clients. This how-to, however, is aimed at AEM developers and architects who want to start integrating with SOLR. We’ll use Docker to run SOLR, so within just a few minutes you’ll have a SOLR instance up and running. Shortly after that, you’ll be indexing some content. Finally, we’ll point out one little trick we used to index the textual content of a WCM Page.

Prerequisites

  • AEM 6.3 + SP2
  • Docker for your particular platform
  • Maven 3

Create AEM 6.3 Project

Create a new AEM project using the AEM Maven archetype. I am using version 13, as that is the version that creates an AEM 6.3 + SP2 project. Refer to the archetype’s README if you have another version. Run the following command:

Run SOLR In Docker

Create docker-compose.yml

Create a file called docker-compose.yml in the aem-solr folder and write the following contents into it. This creates a container from the SOLR Alpine image, creates a new collection, and stores the data on your host drive so that nothing is lost if the container shuts down. The official SOLR image on Docker Hub is really flexible.

Start SOLR

In the root of the aem-solr folder, where you created docker-compose.yml, run the command below. Then verify that SOLR is up and running by opening the web console at http://localhost:8983.

$ docker-compose up -d

Create a Dependency Content Package

We need a few dependencies that do not ship with AEM. Luckily, they are already distributed as OSGi bundles, so all we need to do is deploy them into the Felix container. We will create a separate content package to do this. Alternatively, we could embed them directly into our core bundle, but the better practice is to deploy them separately to allow for easier upgrades.

Parent pom.xml dependencyManagement updates

Locate the parent pom.xml under the aem-solr folder and add the following dependencies under the <dependencyManagement> node. Get in the habit of always specifying your dependency versions in the <dependencyManagement> section of the parent POM. It makes maintenance and upgrades easier.

Create dependencies content package Maven project

Create a folder called dependencies under the aem-solr folder. In the dependencies folder, create this pom.xml:

Update Parent pom.xml module list

Add the new dependencies project to the list of modules in the parent pom.xml:

Deploy The AEM Project

Now run mvn clean install -PautoInstallPackage -Padobe-public. Open the Felix console at http://localhost:4502/system/console/bundles and you will see that the three bundles have been deployed and started. You can view the sample content at http://localhost:4502/content/aem-solr/en.html.

Index Your First Resource

Update the core bundle’s pom.xml

Now that you will start using SolrJ in Java code, you’ll need to update the dependencies of the core bundle. Locate the core project’s pom.xml and add the following dependencies:

Create a Sling Servlet

Create a new Sling Servlet in the core bundle. For now, all it does is index the resource.
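
As a rough illustration of what indexing a resource amounts to, here is a minimal JDK-only sketch of the document such a servlet could send to SOLR. It is not the servlet from this project. With SolrJ, the equivalent would typically be a SolrInputDocument passed to a SolrClient’s add method, followed by a commit. The class name, the field names (id, title, description) and the use of the resource path as the document id are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: field names and the path-as-id choice are assumptions.
class ResourceIndexer {

    /** Maps a resource path and its properties to a JSON array holding one document. */
    static String toUpdateBody(String path, Map<String, String> props) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", path);
        fields.put("title", props.getOrDefault("jcr:title", ""));
        fields.put("description", props.getOrDefault("jcr:description", ""));

        StringBuilder json = new StringBuilder("[{");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (json.length() > 2) {
                json.append(',');
            }
            json.append('"').append(escape(field.getKey())).append("\":\"")
                .append(escape(field.getValue())).append('"');
        }
        return json.append("}]").toString();
    }

    /** POSTs the body to an update URL and returns the HTTP status code. */
    static int post(String updateUrl, String body) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Assuming the core is named aemsolr, as in the console URL used later in this walkthrough, the body could be sent with ResourceIndexer.post("http://localhost:8983/solr/aemsolr/update?commit=true", body). Whether the title and description fields are accepted depends on the schema of your core.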

Execute the Servlet & Verify SOLR Index

The servlet will respond to the following URL: http://localhost:4502/content/aem-solr/en/jcr:content.property.index.

After running it, verify the SOLR document was created. Go to the SOLR web console at http://localhost:8983/solr/#/aemsolr/query and click on the Execute Query button at the bottom of the query page. You should see your document in the list of results.
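
The same check can be made over SOLR’s HTTP API instead of the web console. The sketch below uses only the JDK to build a /select URL and read the raw response. It assumes SOLR is listening on localhost:8983 and that the core is named aemsolr, as in the console URL above. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original project.
class SolrSelect {

    /** Builds a URL-encoded /select request for one core. */
    static URI buildUri(String solrBase, String core, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return URI.create(solrBase + "/" + core + "/select?" + query);
    }

    /** Performs the GET and returns the response body as a string. */
    static String fetch(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } finally {
            conn.disconnect();
        }
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

Calling SolrSelect.fetch on a URI built with the parameters q=*:* and wt=json should return the same documents as the console’s default query. Put the parameters in a LinkedHashMap if you want them to appear in a predictable order.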

Beyond Just Indexing Properties

If you load the sample content page at http://localhost:4502/content/aem-solr/en.html, you’ll notice that it has some Lorem Ipsum content. Where and how is this content stored? To make a long story short, this page is implemented using sling:resourceSuperType="core/wcm/components/page/v2/page". Open the content in CRXDE Lite to view the structure: http://localhost:4502/crx/de/index.jsp#/content/aem-solr/en/jcr%3Acontent/root. Getting the page’s title and description was simple enough, but how do we index pages that can have an arbitrary number of child components in a responsive grid structure like the one used by the Core WCM Components? At best, it would require intimate knowledge of the taxonomy and a lot of if statements!

We had a similar situation with one of our clients. Their textual content was stored in several child components within a parsys, usually placed there by content authors. To capture the textual content without getting too deep into the taxonomy, we leveraged SlingRequestProcessor to process requests through Sling and get the rendered HTML.

Parent pom.xml dependencyManagement updates

We are going to use the Jsoup HTML parser so we can programmatically extract the textual content from the HTML we render. Locate the parent pom.xml under the aem-solr folder and add the following dependencies under the <dependencyManagement> node.

Dependencies pom.xml embeddeds update

Add the following to the <dependencies> node of the dependencies content package project:

And add the following to the <embeddeds> node:

Core pom.xml dependencies update

Add the following to the <dependencies> node of the core bundle project:

Create a Sling Servlet

In the core bundle project, create the following servlet. In this servlet we use the SlingRequestProcessor to render the resource as HTML, and the Jsoup parser to get the text-only content from the HTML. We also use SOLR’s partial updates feature to update the existing document that would have been created by the previous servlet. Otherwise we would have had to fetch it, update it and save it, or recreate it completely.
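
To make the two ideas concrete without reproducing the servlet, here is a minimal JDK-only sketch. The regex-based toText method is a crude stand-in for an HTML parser (with Jsoup, something like Jsoup.parse(html).text() would be the more robust choice), and atomicSetBody builds the JSON for a SOLR partial (atomic) update using the "set" modifier. The class name, the field name content and the use of the resource path as the id are assumptions.

```java
// Hypothetical helper; the regex stripping is a crude stand-in for an HTML parser.
class RenderedTextUpdate {

    /** Reduces rendered HTML to plain text (entities are not decoded). */
    static String toText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop script and style blocks
                .replaceAll("(?s)<[^>]+>", " ")                          // drop remaining tags
                .replaceAll("\\s+", " ")                                 // collapse whitespace
                .trim();
    }

    /** Builds an atomic update body that sets a single field on an existing document. */
    static String atomicSetBody(String id, String field, String text) {
        return "[{\"id\":\"" + escape(id) + "\",\""
                + escape(field) + "\":{\"set\":\"" + escape(text) + "\"}}]";
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The resulting body would be POSTed to the core’s /update handler with a JSON content type. Note that SOLR can only preserve the other fields during an atomic update if they are stored (or have docValues) in the schema.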

Execute the Servlet & Verify SOLR Index

The servlet will respond to the following URL: http://localhost:4502/content/aem-solr/en/jcr:content.rendering.index.

After running it, verify the SOLR document was updated. Go to the SOLR web console at http://localhost:8983/solr/#/aemsolr/query and click on the Execute Query button at the bottom of the query page. You should see your document, now including the rendered text, in the list of results.

Conclusion

The examples given used servlets as a quick way to illustrate how to index a resource. In practice, there are many ways to accomplish this. In our previous projects, we encapsulated the Resource to SolrInputDocument mapping in an AdapterFactory, with a suite of supporting OSGi service components to control what content got indexed and how. Then we adapted resources to SolrInputDocument within event handlers, workflow processes and Sling jobs. But why stop at resources? Other content we’ve indexed includes PDFs and, yes, even images. With the index data in place, our UX team designed page components to do everything from site search to recirculation of article and news pages.

You can find the completed project on GitHub.

Code and Theory

Code and Theory is a leading, technology-first creative agency. It is the only agency with a balance of 50% creatives and 50% engineers at scale.