Elasticsearch Autocomplete Email Analyzer

Andrew Dieken
5 min read · Jun 3, 2024



I recently worked on a web application that required robust user search by email. While analyzing email fields to support search is common practice and well documented, I noticed a gap when it comes to autocomplete. This article walks through creating and using an Elasticsearch autocomplete email analyzer that meets three main search criteria:

  1. Full email address: The user copies and pastes the entire email address into the search bar.
  2. Email domain: The user searches for all users associated with a specific email domain.
  3. Partial email address (Autocomplete): The user starts typing an email address from left to right, and suggestions are provided as they type.

Supporting user search by email is essential for many applications, but effective autocomplete adds an extra layer of complexity. Let’s delve into how to build an Elasticsearch analyzer that handles all three cases, with a particular focus on autocomplete.

Understanding the Search Criteria

Full Email Address Search: A user pastes “john.doe@example.com” into the search bar. The search should directly match this email address.

Email Domain Search: A user types “example.com” to find all email addresses belonging to this domain, such as “john.doe@example.com” and “jane.doe@example.com”.

Partial Email Address Search (Autocomplete): Autocomplete provides search suggestions as the user types. For instance, as the user types “jo”, the search should suggest email addresses like “john.doe@example.com” or “joe.smith@example.com”.

Side Note on Elasticsearch Text Analysis

Elasticsearch text analysis consists of three stages:

  1. Character filter(s): Transform the original text by adding, removing, or changing characters, e.g. stripping HTML elements.
  2. Tokenizer: Breaks the text into individual tokens (usually individual words), e.g. on whitespace.
  3. Token filter(s): Add, remove, or change tokens, e.g. lowercasing.
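As a rough illustration (not Elasticsearch’s actual implementation), the three stages compose like a function pipeline. The helper names here (`strip_html`, `whitespace_tokenize`, `lowercase_filter`) are made up for the sketch:

```python
import re

def strip_html(text):
    # character filter: transform the raw text before tokenization
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenize(text):
    # tokenizer: break the text into individual tokens
    return text.split()

def lowercase_filter(tokens):
    # token filter: transform the token stream
    return [t.lower() for t in tokens]

def analyze(text):
    return lowercase_filter(whitespace_tokenize(strip_html(text)))

print(analyze("<b>Hello</b> World"))  # ['hello', 'world']
```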

Analyzing Email Fields for Full Email and Domain Search

To support full email and domain searches, we need to create a custom analyzer with specific components:

Character Filters

None, as raw email text doesn’t need any preprocessing prior to tokenization.

Tokenizer

We’ll use the Elasticsearch uax_url_email tokenizer:

A tokenizer of type uax_url_email which works exactly like the standard tokenizer but tokenizes emails and URLs as single tokens.

Using “john.doe@example.com” as an example, a single token will be emitted:

[
{"token": "john.doe@example.com"}
]

Token Filters (Applied in Order)

  1. Pattern Capture Token Filter: We’ll create a custom pattern capture token filter:

The pattern_capture token filter emits a token for every capture group in the regular expression. Patterns are not anchored to the beginning and end of the string, so each pattern can match multiple times, and matches are allowed to overlap.

This token filter will generate tokens based on each criterion we expect our users to search for. All we need is a single regular expression, ([^@]+). We’ll also make sure the original email address is preserved.

💡 Note: Feel free to add or remove regular expressions as you’d like, e.g. (\p{L}+), (\d+), @(.+). I found this single pattern covered all my cases.

Continuing with our example, three tokens will be emitted:

[
{"token": "john.doe@example.com"},
{"token": "john.doe"},
{"token": "example.com"}
]
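As a sanity check outside Elasticsearch, the pattern-capture behavior can be approximated with Python’s `re` module (a sketch of the filter’s behavior, not the actual Lucene implementation):

```python
import re

def pattern_capture(token, patterns, preserve_original=True):
    # emit a token for every capture-group match of every pattern
    tokens = [token] if preserve_original else []
    for pattern in patterns:
        tokens.extend(m.group(1) for m in re.finditer(pattern, token))
    return tokens

print(pattern_capture("john.doe@example.com", [r"([^@]+)"]))
# ['john.doe@example.com', 'john.doe', 'example.com']
```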

2. Lowercase Token Filter: We’ll use the Elasticsearch lowercase token filter:

Changes token text to lowercase.

This won’t produce any additional tokens. For our example, the tokens won’t change:

[
{"token": "john.doe@example.com"},
{"token": "john.doe"},
{"token": "example.com"}
]

3. Unique Token Filter: We’ll use the Elasticsearch unique token filter:

Removes duplicate tokens.

This removes any duplicate tokens, saving storage space. For our example, the tokens won’t change:

[
{"token": "john.doe@example.com"},
{"token": "john.doe"},
{"token": "example.com"}
]

Here’s our analyzer definition with the custom token filter:

{
  "analysis": {
    "filter": {
      "email_token_filter": {
        "type": "pattern_capture",
        "preserve_original": true,
        "patterns": ["([^@]+)"]
      }
    },
    "analyzer": {
      "email_analyzer": {
        "type": "custom",
        "tokenizer": "uax_url_email",
        "filter": [
          "email_token_filter",
          "lowercase",
          "unique"
        ]
      }
    }
  }
}

Adding Autocomplete Search

The third search criterion can be accomplished with the addition of a single custom token filter. This token filter will be applied directly after the custom pattern capture token filter we’ve already created, and before the lowercase and unique filters, so that duplicate prefixes generated from overlapping tokens are removed.

We’ll create a custom edge n-gram token filter:

Forms an n-gram of a specified length from the beginning of a token.

Applying this edge n-gram filter will produce partial tokens from the existing three tokens, ranging in length from 2 to 20, allowing us to support autocomplete search by matching against partial user input.

💡 Note: Since we’ll be searching against the full original email address, I found 20 characters works well as it allows the user to provide enough characters for unique results while keeping the total number of tokens to a minimum.
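The filter’s behavior can be sketched in Python (an approximation using the same min_gram of 2 and max_gram of 20 we’ll configure):

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    # emit prefixes of the token from min_gram up to max_gram characters
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams("john.doe"))
# ['jo', 'joh', 'john', 'john.', 'john.d', 'john.do', 'john.doe']
```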

Continuing with our example, 36 tokens will be emitted:

[
{ "token": "jo" },
{ "token": "joh" },
{ "token": "john" },
{ "token": "john." },
{ "token": "john.d" },
{ "token": "john.do" },
{ "token": "john.doe" },
{ "token": "john.doe@" },
{ "token": "john.doe@e" },
{ "token": "john.doe@ex" },
{ "token": "john.doe@exa" },
{ "token": "john.doe@exam" },
{ "token": "john.doe@examp" },
{ "token": "john.doe@exampl" },
{ "token": "john.doe@example" },
{ "token": "john.doe@example." },
{ "token": "john.doe@example.c" },
{ "token": "john.doe@example.co" },
{ "token": "john.doe@example.com" },
{ "token": "jo" },
{ "token": "joh" },
{ "token": "john" },
{ "token": "john." },
{ "token": "john.d" },
{ "token": "john.do" },
{ "token": "john.doe" },
{ "token": "ex" },
{ "token": "exa" },
{ "token": "exam" },
{ "token": "examp" },
{ "token": "exampl" },
{ "token": "example" },
{ "token": "example." },
{ "token": "example.c" },
{ "token": "example.co" },
{ "token": "example.com" }
]

After applying the lowercase and unique token filters, 29 tokens are emitted:

[
{"token": "jo"},
{"token": "joh"},
{"token": "john"},
{"token": "john."},
{"token": "john.d"},
{"token": "john.do"},
{"token": "john.doe"},
{"token": "john.doe@"},
{"token": "john.doe@e"},
{"token": "john.doe@ex"},
{"token": "john.doe@exa"},
{"token": "john.doe@exam"},
{"token": "john.doe@examp"},
{"token": "john.doe@exampl"},
{"token": "john.doe@example"},
{"token": "john.doe@example."},
{"token": "john.doe@example.c"},
{"token": "john.doe@example.co"},
{"token": "john.doe@example.com"},
{"token": "ex"},
{"token": "exa"},
{"token": "exam"},
{"token": "examp"},
{"token": "exampl"},
{"token": "example"},
{"token": "example."},
{"token": "example.c"},
{"token": "example.co"},
{"token": "example.com"}
]

Here’s our final analyzer definition with all custom token filters:

{
  "analysis": {
    "filter": {
      "email_token_filter": {
        "type": "pattern_capture",
        "preserve_original": true,
        "patterns": ["([^@]+)"]
      },
      "edge_ngram_token_filter": {
        "type": "edge_ngram",
        "min_gram": 2,
        "max_gram": 20
      }
    },
    "analyzer": {
      "email_analyzer": {
        "type": "custom",
        "tokenizer": "uax_url_email",
        "filter": [
          "email_token_filter",
          "edge_ngram_token_filter",
          "lowercase",
          "unique"
        ]
      }
    }
  }
}
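To sanity-check the token counts above, the whole chain can be simulated end to end (a rough Python sketch under the same settings, not the real analyzer):

```python
import re

def analyze_email(email, min_gram=2, max_gram=20):
    # 1. uax_url_email tokenizer: the email comes out as a single token
    tokens = [email]
    # 2. pattern_capture with preserve_original and the ([^@]+) pattern
    captured = list(tokens)
    for t in tokens:
        captured.extend(m.group(1) for m in re.finditer(r"([^@]+)", t))
    # 3. edge_ngram: prefixes of each token, min_gram..max_gram chars
    grams = [t[:n] for t in captured
             for n in range(min_gram, min(len(t), max_gram) + 1)]
    # 4. lowercase, then 5. unique (order-preserving de-duplication)
    return list(dict.fromkeys(g.lower() for g in grams))

print(len(analyze_email("john.doe@example.com")))  # 29
```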

With email fields analyzed using the custom analyzer we defined, you’ll be able to support searches by:

  • Partial email address: Users can search by a part of the email address, e.g. joh, john.do, john.doe@exam.
  • Domain: Users can search by the domain part of the email, e.g., example.com.
  • Full address: Users can search by the complete email address, e.g., john.doe@example.com.

To implement this, the email field on your index (e.g. account) would be configured as follows:

{
  "account": {
    "mappings": {
      "properties": {
        "email": {
          "type": "text",
          "analyzer": "email_analyzer",
          "fields": {
            "raw": {
              "type": "text",
              "analyzer": "keyword"
            }
          }
        }
      }
    }
  }
}

This mapping ensures that the email field is analyzed using the email_analyzer we defined earlier, allowing for efficient searches across usernames, domains, and partial and full email addresses.
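For completeness, the analyzer settings and the mapping are typically combined in a single index-creation request. Assuming the index is named account, the request might look like this:

```json
PUT /account
{
  "settings": {
    "analysis": {
      "filter": {
        "email_token_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["([^@]+)"]
        },
        "edge_ngram_token_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "email_token_filter",
            "edge_ngram_token_filter",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email_analyzer",
        "fields": {
          "raw": {
            "type": "text",
            "analyzer": "keyword"
          }
        }
      }
    }
  }
}
```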

An example query to search for a user by email username might look like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "john.doe",
            "fields": ["email", "email.raw"],
            "type": "best_fields"
          }
        }
      ]
    }
  }
}

This query leverages the multi_match query type to search across both the analyzed email field and the raw field, ensuring comprehensive results whether the user searches for a partial email address or the full email address.
