Enhancing AEM Lucene Search: Advanced Techniques for Improved Search Functionality : Part 2

Published in Activate AEM · 14 min read · Jul 18, 2024

Hey there, AEM users!

Struggling to find the content you need on your AEM site? Ever felt like you’re searching for a needle in a haystack? We’ve all been there. But fret no more! This series of blog posts is your guide to a supercharged AEM search experience.

We’ll be diving into some cool features that will make finding content a breeze. Think lightning-fast results, real-time suggestions as you type, and even highlighted keywords to pinpoint what you need instantly.

In our previous article, we explored the magic of search suggestion servlets — those helpful prompts that appear as you type, guiding you towards the content you crave. Now, let’s shift gears and focus on the foundation of a great search experience: AEM indexing management.

Optimizing Search with AEM Indexing Management

This article equips you with the knowledge to optimize your AEM search for accuracy, efficiency, and user satisfaction. We’ll delve into the core concepts of indexing management and explore best practices to keep your search results on point, even with a massive amount of content.

Stay tuned for more!

This series is just getting started. In the coming weeks, we’ll be exploring:

Craft Compelling Search Results: Highlighting Keywords & Controlled Excerpts: Making your search results super user-friendly by highlighting keywords and showing only the most relevant info.
Empower User Searches: Synonyms, Filters & Spell Check: Expanding search capabilities to include synonyms and filters, and helping users out with those pesky typos.

Previously Covered in this series:

Search Suggestion Servlets

Enhancing AEM Lucene Search: Advanced Techniques for Improved Search Functionality- Part 1


Index Management

A fundamental aspect of Adobe Experience Manager (AEM) is its indexing management, which plays a pivotal role in optimizing content search capabilities within the platform. Effective indexing is essential for ensuring that users can quickly retrieve relevant content. AEM’s indexing management revolves around various strategies aimed at enhancing the search experience.

Key components of AEM indexing management include:

1. Relevant Properties Inclusion:

AEM indexing management prioritizes the inclusion of relevant properties in the index. By selectively indexing properties that are crucial for search relevance, AEM ensures that search results are accurate and meaningful.

Sample Index:

<indexRules jcr:primaryType="nt:unstructured">
                <nt:unstructured
                        jcr:primaryType="nt:unstructured"
                        includePropertyTypes="[String]">
                        <properties jcr:primaryType="nt:unstructured">
                                <hideInNav
                                        jcr:primaryType="nt:unstructured"
                                        index="{Boolean}false"
                                        name="hideInNav" />
                                <offTime
                                        jcr:primaryType="nt:unstructured"
                                        index="{Boolean}false"
                                        name="offTime" />
                                <onTime
                                        jcr:primaryType="nt:unstructured"
                                        index="{Boolean}false"
                                        name="onTime" />
                                <allowedTemplates
                                        jcr:primaryType="nt:unstructured"
                                        index="{Boolean}false"
                                        name="cq:allowedTemplates" />
                                <childrenOrder
                                        jcr:primaryType="nt:unstructured"
                                        index="{Boolean}false"
                                        name="cq:childrenOrder" />
                                <designPath
                                        jcr:primaryType="nt:unstructured"
                                        index="{Boolean}false"
                                        name="cq:designPath" />
                                <resourceType
                                        jcr:primaryType="nt:unstructured"
                                        name="sling:resourceType"
                                        propertyIndex="{Boolean}true"
                                        weight="{Long}0" />
                                <resourceSuperType
                                        jcr:primaryType="nt:unstructured"
                                        name="sling:resourceSuperType"
                                        propertyIndex="{Boolean}true"
                                        weight="{Long}0" />
                                <prop
                                        jcr:primaryType="nt:unstructured"
                                        analyzed="{Boolean}true"
                                        isRegexp="{Boolean}true"
                                        name="^[^\\/]*$"
                                        nodeScopeIndex="{Boolean}true"
                                        useInExcerpt="{Boolean}true" />
                                <textField
                                        jcr:primaryType="nt:unstructured"
                                        boost="{Decimal}0.01"
                                        index="{Boolean}true"
                                        name="textField" />
                        </properties>
                </nt:unstructured>
        </indexRules>

1.1 propertyIndex:

The propertyIndex attribute controls whether the index can answer property restrictions on this property, such as equality and not-null conditions. When it is set to true (as with sling:resourceType and sling:resourceSuperType in our example), the property’s values are stored in the Lucene index so that queries constraining those properties can be served from it.
Whether a property is indexed at all is controlled by a separate attribute, index, which defaults to true. Properties like hideInNav set index to false to keep them out of the index entirely.
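
As a minimal sketch of the two attributes side by side, the fragment below indexes one property for lookups and keeps another out of the index. The productCategory property and the query in the comment are assumptions made for illustration; they are not part of the sample index.

```xml
<properties
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="nt:unstructured">
    <!-- Hypothetical property: lets the index answer restrictions such as
         [productCategory] = 'shoes' in a JCR-SQL2 query -->
    <productCategory
            jcr:primaryType="nt:unstructured"
            name="productCategory"
            propertyIndex="{Boolean}true" />
    <!-- Explicitly excluded from the index -->
    <hideInNav
            jcr:primaryType="nt:unstructured"
            index="{Boolean}false"
            name="hideInNav" />
</properties>
```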

1.2 weight:

The weight attribute is set here on sling:resourceType and sling:resourceSuperType alongside propertyIndex. In Oak’s Lucene index definitions, weight does not change how results are ranked. It feeds the query engine’s cost estimate, which is what Oak uses to decide whether this index is a good candidate for a query that restricts the property.
In the provided configuration, both sling:resourceType and sling:resourceSuperType have a weight="{Long}0" setting. They are still indexed (propertyIndex="{Boolean}true"), but a weight of 0 means these properties are not considered during cost estimation. Setting the weight to 0 explicitly communicates our intent: almost every node carries a sling:resourceType, so a restriction on it alone says little about how selective a query is.
To make matches in one property count for more than matches in another when results are ranked, use boost (described next) rather than weight.

1.3 boost:

In AEM indexes built upon Lucene, boost is a factor assigned to a specific property that influences how much a fulltext match in that property contributes to the relevance score. When a search term matches in several properties, hits in properties with a higher boost value rank higher in search results. Oak applies the boost when the property is part of the node-level fulltext index (nodeScopeIndex).
In the provided configuration, the boost attribute is used with the textField property, which has a boost="{Decimal}0.01" setting. Boost values are relative: a value as small as 0.01 adds only a slight contribution to the score for matches in textField, far less than a property boosted at, say, 2.0.
Boost is defined per property, not per term. Every term that Lucene extracts from the property value during tokenization is scored with the same boost, so a boost of 0.5 on a title property applies equally to each term in the title.
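
A minimal sketch of a boosted property next to an unboosted one, assuming the standard jcr:title and jcr:description properties. The boost value and the {Double} type hint are illustrative assumptions, not settings taken from the sample index.

```xml
<properties
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="nt:unstructured">
    <!-- Fulltext matches in the title count for more than matches elsewhere -->
    <title
            jcr:primaryType="nt:unstructured"
            analyzed="{Boolean}true"
            boost="{Double}2.0"
            name="jcr:title"
            nodeScopeIndex="{Boolean}true" />
    <!-- No boost attribute: the description keeps the default contribution -->
    <description
            jcr:primaryType="nt:unstructured"
            analyzed="{Boolean}true"
            name="jcr:description"
            nodeScopeIndex="{Boolean}true" />
</properties>
```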

Understanding the Nuances of Boost and Weight

Key Differences:

Purpose: Boost changes the relevance score of fulltext matches in a property, which affects the order of results. Weight changes the query engine’s cost estimate for a restriction on a property, which affects whether the index is chosen for a query.
Effect on ranking: Boost is the setting to tune when one property should matter more than another in the result order. Weight has no direct effect on ranking.

Scenario

Imagine we have an e-commerce site with products that have properties like title, description, brand, and price. We want to prioritize searches based on title and brand while considering descriptions and price to some extent.

Boost:

Boosting Titles: We can assign a boost to the title property. Let’s say a boost of 0.5. Because boost values are relative, what matters is that the title’s boost is higher than the boost of properties such as description. Terms found in the product title then have a stronger impact on the search score, and a product whose title contains the search term will rank higher than products where the term only appears in the description.

Example: Search Term: “Running shoes”

Product A: Title: “Nike Running Shoes — Zoom Pegasus”, Description: “Comfortable running shoes…” (Boost applied to Title)
Product B: Title: “Everyday Sneakers”, Description: “These shoes are great for running…”

In this case, Product A would likely rank higher because the search term “Running shoes” is present in its boosted title.

Weight:

Weight is not a ranking control. It only feeds the cost estimate Oak uses when it decides which index should serve a query, so it cannot make brand matches outrank description matches.

Prioritizing Brands: To make brand a more important ranking factor than description or price, we give the brand property a higher boost than the others.

Example: Search Term: “Adidas shoes”

Product A: Title: “Running Shoes”, Brand: “Adidas”, Description: “…” (Higher boost for Brand)
Product B: Title: “Adidas Comfort Shoes”, Brand: “Unknown”, Description: “These are comfortable shoes…”

Here, Product A might rank higher even though Product B has “Adidas” in its title (without boost). This is because the match in the more strongly boosted brand property contributes more to the score.

If we also introduce a boost for the title property, the ranking between Product A and Product B would change. The outcome depends on the relative boost values of the title and brand properties.
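
One way the e-commerce scenario could be written down as index rules is sketched below. The property names (title, brand, description) come from the scenario, but the boost values, the {Double} type hints, and the choice of nt:unstructured as the node type are assumptions for illustration. Ranking priority is expressed with boost only; weight is left out because it concerns index selection, not ranking.

```xml
<indexRules
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
        jcr:primaryType="nt:unstructured">
    <nt:unstructured jcr:primaryType="nt:unstructured">
        <properties jcr:primaryType="nt:unstructured">
            <!-- Highest priority: matches in the product title -->
            <title
                    jcr:primaryType="nt:unstructured"
                    analyzed="{Boolean}true"
                    boost="{Double}3.0"
                    name="title"
                    nodeScopeIndex="{Boolean}true" />
            <!-- Second priority: matches in the brand -->
            <brand
                    jcr:primaryType="nt:unstructured"
                    analyzed="{Boolean}true"
                    boost="{Double}2.0"
                    name="brand"
                    nodeScopeIndex="{Boolean}true" />
            <!-- Considered, but with the least influence on ranking -->
            <description
                    jcr:primaryType="nt:unstructured"
                    analyzed="{Boolean}true"
                    boost="{Double}0.5"
                    name="description"
                    nodeScopeIndex="{Boolean}true" />
        </properties>
    </nt:unstructured>
</indexRules>
```

Since only the ratio between the values matters, it is usually easier to start with small differences and adjust after looking at real search results.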

1.4 analyzed="{Boolean}true":

The analyzed attribute controls how the property’s value is processed before being indexed. When set to true (as with the prop property in our example), Lucene runs the value through the configured analyzer. Analysis typically breaks the text into individual terms and converts them to lowercase. Stemming (reducing words to their root form) happens only if the analyzer is configured for it, as shown in the Analyzers section.
Enable analysis for properties that should be searchable by individual words, for example through jcr:contains on that property. Without it, the value is matched as a whole.

1.5 isRegexp="{Boolean}true":

The isRegexp attribute applies to the prop property in our example and is set to true. This indicates that the name attribute (defined as ^[^\\/]*$) is a regular expression matched against property names, not a literal property name.
In this case, the regular expression ^[^\\/]*$ matches any name that doesn’t contain a forward slash character (/). In practice this is a catch-all: it covers every property set directly on the node, as opposed to relative paths such as jcr:content/jcr:title. The settings on prop therefore apply to all direct properties that don’t have a more specific rule.
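
A narrower pattern works the same way. The sketch below assumes a hypothetical naming convention in which all product-related properties start with "product"; the rule name and the pattern are illustrative only.

```xml
<properties
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="nt:unstructured">
    <!-- Hypothetical rule: applies to productName, productCode, productColor, ... -->
    <productProps
            jcr:primaryType="nt:unstructured"
            isRegexp="{Boolean}true"
            name="^product.*$"
            propertyIndex="{Boolean}true" />
</properties>
```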

1.6 nodeScopeIndex="{Boolean}true":

The nodeScopeIndex attribute applies to the prop property and is set to true here. This setting adds the property’s value to the node-level fulltext index, so a fulltext query against the node as a whole, such as jcr:contains(., 'shoes'), matches when the term occurs in that property.
This is useful when users should find a node without knowing which property holds the text. On its own, nodeScopeIndex only covers properties of the node being indexed. To make searches on pages (cq:Page nodes) return results when the text is located in child nodes beneath the cq:Page node, combine it with aggregates (see section 3).

Example: With nodeScopeIndex enabled for the title property of a product node, a fulltext search for a word from the title returns that node even though the query does not name the title property.
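
A minimal sketch of such a rule for a single property, assuming jcr:title as the property and an illustrative search term in the comment:

```xml
<properties
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="nt:unstructured">
    <!-- jcr:title is analyzed and added to the node-level fulltext index,
         so a query such as
         SELECT * FROM [nt:unstructured] WHERE CONTAINS(*, 'pegasus')
         can match nodes whose title contains that word -->
    <title
            jcr:primaryType="nt:unstructured"
            analyzed="{Boolean}true"
            name="jcr:title"
            nodeScopeIndex="{Boolean}true" />
</properties>
```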

1.7 useInExcerpt="{Boolean}true":

The useInExcerpt attribute applies to the prop property and is set to true here. This setting allows Lucene to include the value of the prop property in search result excerpts.
Search results often display a short snippet of the content around the matched keywords. By enabling useInExcerpt, you’re making the prop property’s content eligible to be included in the excerpts, potentially improving the context and clarity of the search results.

2. Exclusion of Unnecessary Properties:

Conversely, unnecessary properties are excluded from indexing. This approach streamlines the indexing process and prevents cluttering of the index with irrelevant data, leading to more efficient searches and avoiding false-positive matches.

Sample Index Snippet:

 ...
        excludedPaths="[/var,/etc/replication,/etc/workflow/instances,/jcr:system]"
        includedPaths="[/content/kiran-sample]"
        ..
        queryPaths="[/content/kiran-sample]"
        type="lucene">
        .....

Let’s break down the properties in the above AEM index configuration:

2.1 excludedPaths="[/var,/etc/replication,/etc/workflow/instances,/jcr:system]":

This property specifies paths that are deliberately excluded from the indexing process. Here’s what’s excluded:

/var: This folder typically stores temporary data and configurations, which are not relevant for searching content.
/etc/replication: This path likely contains replication-related configurations that don’t need to be searched.
/etc/workflow/instances: This folder holds workflow instance data, not actual content for searching.
/jcr:system: This subtree holds repository-internal data such as version storage and node type definitions, and shouldn’t be included in searches.

2.2 includedPaths="[/content/kiran-sample]":

This property defines the specific path that will be indexed. In this case, only the content located at /content/kiran-sample will be included in the index. This ensures the index focuses on the relevant content and avoids unnecessary data.

2.3 queryPaths="[/content/kiran-sample]":

This property controls which queries may use the index. Oak only considers the index for a query whose path restriction falls under one of the queryPaths. Even though it has the same value as includedPaths in this example, there’s a subtle difference:

includedPaths determines what gets indexed.
queryPaths determines which queries, based on their path restriction, are allowed to use the index.

In essence, this index serves queries restricted to /content/kiran-sample or to any path below it. Subfolders do not need to be listed separately: a query restricted to a subfolder of /content/kiran-sample already falls under the queryPaths entry.

Imagine you have a website with product information and blog posts:

includedPaths: Set to /content/website to index both product and blog post content.
queryPaths: Set to /content/website/products only. Queries restricted to the products tree can use this index. A query over the blog posts does not qualify for it, even though blog posts are indexed, so this index will not return them.

Why Use Different Values?

Here are some reasons you might choose different values for includedPaths and queryPaths:

Restricting Search Scope: As in the example above, you might want to index more content than the index currently serves, keeping some of it for potential future use.
Security: Do not rely on queryPaths for this. It is not an access-control mechanism; query results are filtered by the user’s read permissions, and those permissions are the right tool for protecting sensitive content.
Performance Optimization: For very large indexes, a narrow queryPaths keeps the index from being selected for queries it was not designed for, which helps Oak pick a better-suited index.
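
The product and blog example could look roughly like the sketch below. The node name websiteLucene and the paths are hypothetical, and the async and compatVersion values are common settings assumed here rather than taken from the sample.

```xml
<websiteLucene
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="oak:QueryIndexDefinition"
        async="async"
        compatVersion="{Long}2"
        includedPaths="[/content/website]"
        queryPaths="[/content/website/products]"
        type="lucene">
    <!-- includedPaths: products and blog posts are both indexed.
         queryPaths: only queries restricted to the products tree use this index.
         indexRules, aggregates and analyzers are omitted for brevity. -->
</websiteLucene>
```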

Optimizing our excludedPaths

The provided sample excludes common system folders. Here’s an important point:

It’s generally more efficient to use includedPaths to specify the content you want to index. Once includedPaths is set, everything outside those paths, including system folders like /var and /etc, is left out of this index automatically. This approach ensures clarity and avoids the need to manage a potentially lengthy list of excluded paths, especially if your content structure is extensive.

However, excludedPaths can still be useful in specific scenarios. For instance, if you have a subtree under one of the includedPaths (like /content/kiran-sample/draft-pages) that you want to keep out of the index, you can use excludedPaths to achieve that granularity.
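
A minimal sketch of that combination, using the paths mentioned above. The node name kiranSampleLucene and the async and compatVersion values are assumptions for illustration.

```xml
<kiranSampleLucene
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="oak:QueryIndexDefinition"
        async="async"
        compatVersion="{Long}2"
        excludedPaths="[/content/kiran-sample/draft-pages]"
        includedPaths="[/content/kiran-sample]"
        queryPaths="[/content/kiran-sample]"
        type="lucene">
    <!-- Everything under /content/kiran-sample is indexed,
         except the draft-pages subtree -->
</kiranSampleLucene>
```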

3. Aggregates

Aggregates pull the indexed content of child nodes into the fulltext index of a parent node, so that related content can be found through that parent. By defining aggregates, AEM lets a search match a page or component as a whole, even when its text is spread across several nodes.

Sample Index Snippet:

<aggregates jcr:primaryType="nt:unstructured">
  <nt:unstructured jcr:primaryType="nt:unstructured">
    <include0
        jcr:primaryType="nt:unstructured"
        path="textPar" />
    <include1
        jcr:primaryType="nt:unstructured"
        path="textPar/*" />
    <include2
        jcr:primaryType="nt:unstructured"
        path="textPar/*/*" />
    <include3
        jcr:primaryType="nt:unstructured"
        path="textPar/*/*/*" />
    <include4
        jcr:primaryType="nt:unstructured"
        path="textPar/*/*/*/*" />
  </nt:unstructured>
</aggregates>

Understanding Aggregation with textPar:

In AEM indexes built upon Lucene, aggregation pulls the indexed content of the textPar child node and of its descendant nodes, up to four levels deep, into the fulltext index of the parent node. This is beneficial when the text of a piece of content is spread across a hierarchy of child nodes, as it is with components nested inside a paragraph system.

Here’s a breakdown of how the include elements work with textPar:

include0 with path="textPar": This includes the content of the textPar node itself in the index.
include1 with path="textPar/*": This gathers the content of any node directly under textPar. For example, if textPar has child nodes named heading and subheading, their indexed properties would be included.
include2 to include4 follow the same logic, each going one level deeper, reaching up to four levels below the initial textPar node.

Why Aggregate Properties?

There are two main reasons to configure aggregation:

Improved Search Relevance: By including content from child nodes, searches have a better chance of finding relevant content, even when the search term only occurs in a node nested under textPar.
Results at the Right Level: The search hit is the aggregating node, for example the page or component that contains textPar, rather than the nested node that happens to hold the matching text.

Important Considerations:

Extensive aggregation can increase the size of our index. Ensure you only aggregate properties that are truly valuable for searching to maintain optimal performance.
Define the appropriate depth of aggregation based on our content structure and search needs. Including values from very deep levels might not be necessary or beneficial.
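
The same mechanism is what lets a search on pages match text stored beneath the page. The sketch below assumes the usual AEM layout in which a cq:Page node has a jcr:content child; the depth of two includes is an arbitrary choice for illustration.

```xml
<aggregates
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        xmlns:cq="http://www.day.com/jcr/cq/1.0"
        jcr:primaryType="nt:unstructured">
    <!-- Rule for nodes of type cq:Page -->
    <cq:Page jcr:primaryType="nt:unstructured">
        <!-- Content of the page's jcr:content node -->
        <include0
                jcr:primaryType="nt:unstructured"
                path="jcr:content" />
        <!-- Content of the nodes directly below jcr:content -->
        <include1
                jcr:primaryType="nt:unstructured"
                path="jcr:content/*" />
    </cq:Page>
</aggregates>
```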

4. Analyzers

Analyzers process text before indexing, applying various transformations to improve search accuracy. AEM provides flexible options for configuring analyzers, allowing users to tailor the indexing process to their specific requirements.

Sample Index Snippet:

<analyzers jcr:primaryType="nt:unstructured">
                <default jcr:primaryType="nt:unstructured">
                    <charFilters jcr:primaryType="nt:unstructured">
                        <HTMLStrip jcr:primaryType="nt:unstructured"/>
                        <Mapping jcr:primaryType="nt:unstructured"/>
                    </charFilters>
                    <tokenizer
                        jcr:primaryType="nt:unstructured"
                        name="Standard"/>
                    <filters jcr:primaryType="nt:unstructured">
                        <Synonym
                            jcr:primaryType="nt:unstructured"
                            format="solr"
                            ignoreCase="true"
                            synonyms="synonyms.txt">
                            <synonyms.txt/>
                        </Synonym>
                        <LowerCase jcr:primaryType="nt:unstructured"/>
                        <Stop
                            jcr:primaryType="nt:unstructured"
                            words="stop.txt">
                            <stop.txt/>
                        </Stop>
                        <PorterStem jcr:primaryType="nt:unstructured"/>
                    </filters>
                </default>
            </analyzers>

This code snippet configures an analyzer that performs the following actions on text before it’s indexed:

Removes HTML tags and potentially other markup (HTMLStrip char filter).
Performs character remapping if necessary (Mapping char filter).
Splits the text into individual terms (Standard tokenizer).
Applies synonyms for improved search recall (Synonym filter).
Converts terms to lowercase for case-insensitive searching (LowerCase filter).
Removes stopwords for efficiency (Stop filter).
Applies stemming to match different variations of the same word (PorterStem filter).

The steps run in the order in which they appear in the configuration.

4.1 charFilters:

CharFilters preprocess text by removing or modifying specific characters. AEM supports the configuration of character filters to ensure that indexed content is clean and standardized, leading to more consistent search results.

Character filters are like text cleaners that modify the raw text before it is split into terms. They can perform various actions, including:

Markup Removal: Stripping HTML tags so that only the visible text is indexed (HTMLStrip in the sample above).
Accent Handling: Converting accented characters to their base characters (e.g., “à” becomes “a”, “ä” becomes “a”) through a character mapping (Mapping in the sample above). This improves search consistency by ensuring variations caused by accents don’t hinder results.
Symbol Replacement: Replacing or removing punctuation marks, special symbols, or other characters deemed irrelevant for searching (e.g., removing “&” or “$” from text).

Lowercasing is not done at this stage. It is handled later by the LowerCase token filter, which ensures that searches for “HELLO” or “hello” yield the same results.
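
Symbol handling of this kind can be sketched with Lucene’s pattern-based char filter. The example below replaces underscores with spaces before tokenization, so that a value like running_shoes is indexed as two words. It assumes the PatternReplace char filter can be referenced by name in the same way as HTMLStrip and Mapping; the pattern and replacement values are illustrative.

```xml
<charFilters
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="nt:unstructured">
    <!-- Strip markup first -->
    <HTMLStrip jcr:primaryType="nt:unstructured" />
    <!-- Then turn each underscore into a space (assumed attribute names:
         pattern and replacement, as used by Lucene's PatternReplace char filter) -->
    <PatternReplace
            jcr:primaryType="nt:unstructured"
            pattern="_"
            replacement=" " />
</charFilters>
```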

4.2 Tokenizers:

Tokenizers break down text into individual terms or tokens, facilitating more granular search queries. AEM allows the configuration of tokenizers to suit the linguistic characteristics of the indexed content, leading to more accurate search results.

Sample Index Snippet:

<tokenizer
                        jcr:primaryType="nt:unstructured"
                        name="Standard"/>

Lucene offers various tokenizer implementations, each with different behavior. The StandardTokenizer is a general-purpose tokenizer that follows the Unicode text segmentation rules for word boundaries. It performs the following actions on the input text:

Splits text at whitespace characters: This includes spaces, tabs, newlines, etc.

Drops punctuation and symbols: Characters that aren’t part of a word, like stray symbols or formatting marks, are discarded rather than indexed.

Eg: A run of symbols like &%$# produces no terms.

Handles alphabetic characters: Each run of letters becomes its own term, like building blocks for searching. Case is left unchanged at this stage.

Eg: “Hello World” becomes “Hello” “World” (separate words)

Handles alphanumeric strings: Words containing both letters and numbers are typically kept as single terms. However, you might need custom configurations depending on your specific needs.

Eg: A product code like “Product101” stays together, but you might need special rules for some cases.
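
When the Standard tokenizer splits too aggressively, for instance on product codes that contain punctuation, a different tokenizer can be named instead. The sketch below swaps in Lucene’s whitespace tokenizer, which splits on whitespace only. It assumes the tokenizer can be referenced as "Whitespace" in the same way "Standard" is referenced in the sample.

```xml
<analyzers
        xmlns:jcr="http://www.jcp.org/jcr/1.0"
        jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <!-- Split on whitespace only; punctuation stays inside the terms -->
        <tokenizer
                jcr:primaryType="nt:unstructured"
                name="Whitespace" />
        <filters jcr:primaryType="nt:unstructured">
            <!-- Keep matching case-insensitive -->
            <LowerCase jcr:primaryType="nt:unstructured" />
        </filters>
    </default>
</analyzers>
```

The trade-off is that punctuation attached to a word, such as a trailing comma, also stays part of the term, so this choice suits identifier-like fields better than running text.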

4.3 Analyzer Filters:

Filters (token filters) transform the stream of terms produced by the tokenizer, for example by changing, removing, or adding terms. By adding filters, you control how indexed text and query text are normalized, so that users find what they are looking for even when their wording differs slightly from the content.

Sample Index Snippet:

 <filters jcr:primaryType="nt:unstructured">
                        <Synonym
                            jcr:primaryType="nt:unstructured"
                            format="solr"
                            ignoreCase="true"
                            synonyms="synonyms.txt">
                            <synonyms.txt/>
                        </Synonym>
                        <LowerCase jcr:primaryType="nt:unstructured"/>
                        <Stop
                            jcr:primaryType="nt:unstructured"
                            words="stop.txt">
                            <stop.txt/>
                        </Stop>
                        <PorterStem jcr:primaryType="nt:unstructured"/>
                    </filters>

In essence, this code snippet configures Lucene to do the following, in this order:

Use synonyms to improve search recall by matching content that uses synonymous terms.
Convert terms to lowercase for case-insensitive searching.
Exclude stopwords to reduce index size and improve search efficiency.
Apply Porter stemming to match different variations of the same word.

We will look at synonyms and stopwords shortly. First, let’s break down the LowerCase filter and the Porter stemmer.

4.3.1 LowerCase Filter:

The <LowerCase> element ensures all terms are converted to lowercase before further processing. This helps improve search matching as users might use different capitalization styles in their queries.

4.3.2 Porter Stemmer:

The <PorterStem> element applies the Porter Stemming algorithm. This algorithm reduces words to their root form (e.g., “running” becomes “run”). This can improve search recall by matching different variations of the same word.

Configuration of Synonyms and Stopwords:

AEM allows the configuration of synonyms and stopwords, which enriches the search experience by accounting for variations in language usage and excluding common, non-descriptive terms.

AEM indexes leverage Apache Lucene for searching, and Lucene offers functionalities to manage synonyms and stopwords during the indexing process. Here’s a breakdown of how you can configure them:

4.3.3 Synonyms:

Synonyms are words or phrases that have the same or similar meanings. By including synonyms in our index, you can improve search recall by allowing searches to match content that uses synonymous terms.

AEM doesn’t list synonyms inline in the index configuration. Instead, the Synonym filter refers to a synonym file, which is stored as a child node beneath the filter node (the <synonyms.txt/> element in the sample). Here’s how to configure synonyms:

Create a Synonym File: Create a plain text file with the .txt extension. On each line of the file, list the terms that should be treated as equivalent, separated by commas.

For example:

car, automobile
buy, purchase

Reference the Synonym File in the Index: Within the <analyzers> section of our AEM index configuration, locate the <default> analyzer. Here, you’ll find a section named <filters>. Add a <Synonym> element with the following attributes:

format="solr" (specifies the synonym file format)

ignoreCase="true" (optional, instructs Lucene to ignore case when applying synonyms)

synonyms="synonyms.txt" (the name of your synonym file, stored under the <Synonym> node)

4.3.4 Stopwords:

Stopwords are very common words (e.g., “the”, “a”, “an”) that are often excluded from searches as they don’t significantly contribute to the meaning of a query. Removing stopwords reduces the index size and can improve search efficiency.

Similar to synonyms, the Stop filter refers to a stopword file stored as a child node beneath the filter node. Here’s how to manage stopwords:

Create a Stopword File: Create a plain text file with the .txt extension. List each stopword you want to exclude on a separate line.

Eg:

the
a
an

Reference the Stopword File in the Index: Within the <analyzers> section and <default> analyzer of your AEM index configuration, locate the <filters> section. Add a <Stop> element with the following attribute:
words="stopwords.txt" (the name of your stopword file, stored under the <Stop> node)

Sample Index Snippet:

```
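<filters jcr:primaryType="nt:unstructured">
    <!-- stopwords.txt is stored as a child node of the Stop filter -->
    <Stop jcr:primaryType="nt:unstructured" words="stopwords.txt">
        <stopwords.txt/>
    </Stop>
</filters>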
```

Important Considerations:

By default, AEM doesn’t include pre-configured synonym or stopword files. You’ll need to create and manage these files yourself based on your specific requirements.
Choosing the right stopwords depends on your content type and search needs. You might need to experiment to find the optimal balance between efficiency and comprehensiveness.
Overuse of synonyms can lead to irrelevant matches in searches. Ensure your synonym list is relevant and well-maintained.

Hopefully this article has equipped you with the knowledge to optimize your AEM index management and deliver a superior search experience for your users.

Happy searching! And stay tuned for the next article on “Craft Compelling Search Results: Highlighting Keywords & Controlled Excerpts”.

Written by Kiran Mayee N