Searching email addresses in Elasticsearch
You may have heard of Apache James, the enterprise mail server that we use in the OpenPaaS project to provide the mail feature. The reason behind this choice is obvious: James is complete, stable, secure and extensible. And guess what, Linagora is one of the main contributors to James ✌️.
One of the many features we have contributed is the search feature built on top of Elasticsearch, a popular open source search engine. It worked perfectly until recently, when my boss came to complain about invalid search results: when he searched for emails by his email address, he got results that did not relate to this email address at all!
Actually, my boss was wrong. Let me explain: the results did relate to his email address, because they all had the same domain as his email address. But he was right to complain, because those results were not what he wanted. What he wanted were the emails containing his email address.
So what’s happened?
To understand what happened under the hood, let’s go one step deeper into what Elasticsearch does when we search for an email address.
The search engine has the concept of a tokenizer, which receives a stream of characters and breaks it up into individual tokens. This happens at both index time and search time, which means that when I index this sentence:
My email address is john@domain.com
It will be split into the following terms:
[My, email, address, is, john, domain.com]
Now when I search for “alice@domain.com”, the query is split into “alice” and “domain.com”. As you can see, the index terms and the search terms have one term in common, “domain.com”, so the above sentence matches!
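The overlap can be reproduced with a toy sketch, taking john@domain.com as the indexed address. The function `toy_standard_tokenize` is a hypothetical stand-in that only splits on whitespace and “@”; it is not Elasticsearch’s actual standard tokenizer, just enough to show why the two term lists intersect.

```python
import re
from typing import List


def toy_standard_tokenize(text: str) -> List[str]:
    """Rough stand-in for the standard tokenizer: split on whitespace and '@'."""
    return [token for token in re.split(r"[\s@]+", text) if token]


# Terms produced at index time and at search time by the same toy tokenizer.
indexed_terms = toy_standard_tokenize("My email address is john@domain.com")
query_terms = toy_standard_tokenize("alice@domain.com")

# Any shared term is enough for the sentence to be returned as a result.
shared_terms = set(indexed_terms) & set(query_terms)

print(indexed_terms)  # ['My', 'email', 'address', 'is', 'john', 'domain.com']
print(query_terms)    # ['alice', 'domain.com']
print(shared_terms)   # {'domain.com'}
```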
To fix this, we must prevent Elasticsearch from splitting the email address, but how, and at which step? Let’s see in the next section.
Elasticsearch, plz stop splitting email addresses
Fortunately, Elasticsearch does not restrict us from using other algorithms to tokenize our sentence. We need an algorithm that does not split our email address but still splits the other parts of the sentence:
[My, email, address, is, john@domain.com]
We need to apply this algorithm at search time too, so that when I search for “alice@domain.com”, Elasticsearch does not split my search query but looks for results containing that full email address. The above sentence will then not be included in the search results, since it does not contain Alice’s email address.
The solution seems to work, but let’s check another case: what if I want to search for “john”? The above sentence should match, but it does not, because the only term indexed for the address is “john@domain.com”. Urgggg!!! OK, so the index must also have “john”, and “domain.com” too, so that it works when I want to search for all emails in the same domain.
The desired terms must look like:
[My, email, address, is, john, domain.com, john@domain.com]
Of course, we must keep the full email address in the search query, otherwise we will again face the problem we are trying to solve in this article.
“Let’s write that algorithm,” you say. But wait, Elasticsearch can do it for you.
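To make the target behaviour concrete, here is a toy sketch with john@domain.com as the indexed address. The helpers `toy_index_terms`, `toy_query_terms` and `matches` are hypothetical names invented for illustration, and the matching rule (any shared term counts as a hit) is a simplification, not how Elasticsearch scores results.

```python
import re
from typing import List


def toy_index_terms(text: str) -> List[str]:
    """Index-time terms: the split parts plus each full email address."""
    split_terms = [term for term in re.split(r"[\s@]+", text) if term]
    full_emails = [word for word in text.split() if "@" in word]
    return split_terms + full_emails


def toy_query_terms(query: str) -> List[str]:
    """Search-time terms: split on whitespace only, so an address stays whole."""
    return query.split()


def matches(text: str, query: str) -> bool:
    """Simplified matching: true when the index and query share a term."""
    return bool(set(toy_index_terms(text)) & set(toy_query_terms(query)))


sentence = "My email address is john@domain.com"
print(toy_index_terms(sentence))
# ['My', 'email', 'address', 'is', 'john', 'domain.com', 'john@domain.com']
print(matches(sentence, "john@domain.com"))   # True
print(matches(sentence, "john"))              # True
print(matches(sentence, "domain.com"))        # True
print(matches(sentence, "alice@domain.com"))  # False
```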
Elasticsearch, can you do your job?
“What? Elasticsearch does it for me?” you ask, and I answer: yes.
Elasticsearch provides a number of built-in tokenizers. Normally you use the standard one, which splits the email address. The tokenizer you want to use is uax_url_email, which does not split the email address. To use this tokenizer, you need to define a custom analyzer.
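A minimal sketch of such an analyzer definition, written as a Python dictionary that mirrors the JSON body of the index settings. The analyzer name `custom_analyzer` is a placeholder chosen for illustration, and the `lowercase` filter is an assumption, added so that terms are lowercased the way the standard analyzer lowercases them.

```python
import json

# Index settings declaring a custom analyzer built on the uax_url_email tokenizer.
# "custom_analyzer" is a hypothetical name; the lowercase filter is an assumption.
analysis_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# Print the JSON form that would be sent to Elasticsearch.
print(json.dumps(analysis_settings, indent=2))
```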
There is a problem: the uax_url_email tokenizer does not split the email address, so on its own it does not give us the desired terms listed above. To deal with this, we use the multi-fields feature provided by Elasticsearch, which allows us to index the same field in different ways.
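A multi-field mapping along these lines could look as follows. This is a sketch under assumptions: it uses the typeless mapping layout and the `text` field type of recent Elasticsearch versions, and `custom_analyzer` is a placeholder name for the analyzer built on uax_url_email. The field names “content” and “custom” follow the article.

```python
import json

# The "content" field is indexed twice: once with the default (standard)
# analyzer, and once as the sub-field "content.custom" with the custom analyzer.
content_mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",  # no analyzer given, so the standard one applies
                "fields": {
                    "custom": {
                        "type": "text",
                        "analyzer": "custom_analyzer",  # hypothetical name
                    }
                },
            }
        }
    }
}

print(json.dumps(content_mapping, indent=2))
```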
Now the “content” field uses the standard analyzer, while the “content.custom” field uses our custom analyzer. So when we search for “john”, it matches the sentence in “content”, and when we search for “john@domain.com”, it matches the sentence in “content.custom”.
We also need to tell Elasticsearch to use our custom analyzer at search time, to prevent it from splitting the email address in the search query.
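One way to express this is the `search_analyzer` mapping parameter. The sketch below assumes it is set on the “content” field, so that a query against “content” keeps the address whole while the indexed text is still split; the exact placement is an assumption, as is the analyzer name `custom_analyzer`. A field that only declares `analyzer`, like “content.custom” here, uses that same analyzer at search time by default.

```python
import json

# Field definition for "content": split addresses at index time, but keep
# them whole at search time. The placement of search_analyzer is an assumption.
content_field = {
    "type": "text",
    "analyzer": "standard",                # index time: the address is split
    "search_analyzer": "custom_analyzer",  # search time: the address stays whole
    "fields": {
        # The sub-field uses the custom analyzer at both index and search time.
        "custom": {"type": "text", "analyzer": "custom_analyzer"}
    },
}

print(json.dumps(content_field, indent=2))
```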
The final index setup combines the custom analyzer, the multi-field mapping and the search-time analyzer.
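Putting the pieces together, a complete index body might look like the sketch below. It is an illustration under assumptions rather than the exact setup used in James: the analyzer name `custom_analyzer`, the `lowercase` filter, the typeless mapping layout and the placement of `search_analyzer` are all choices made for this example.

```python
import json

# Full index body: analysis settings plus the multi-field mapping.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_analyzer": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "uax_url_email",  # keeps email addresses whole
                    "filter": ["lowercase"],       # assumption, see lead-in
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "custom_analyzer",
                "fields": {
                    "custom": {"type": "text", "analyzer": "custom_analyzer"}
                },
            }
        }
    },
}

# Serialize to the JSON that would form the body of the index creation request.
index_body_json = json.dumps(index_body, indent=2)
print(index_body_json)
```

This JSON would be sent as the request body when the index is created.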
Elasticsearch is a great search engine: it is simple enough to be understood quickly and powerful enough to solve our real-life problems. My colleague told me that “if you understand analyzer and mapping, you understand Elasticsearch”. Indeed, it did not take long to understand those concepts, and we could quickly fix the email search problem.