Synonyms in Elasticsearch

Published in Version 1 · 8 min read · Mar 22, 2022

Synonym — a word that means exactly or nearly the same as another word.

In this short post, we will run through a demo implementing synonyms for forenames in Elasticsearch, for example, matching people named “Anne” when we search for “Ann”.

We will also cover a few Elasticsearch basics.

tl;dr

How to use synonyms in Elasticsearch:

Get or create a CSV file with one group of synonyms per line, e.g. beverly,bev
Upload the CSV file to the config directory on ALL your Elasticsearch nodes.
Reference the synonym file in your index settings as a token filter in an analyser.
Add the analyser to any searches where you want to implement these synonyms.

Elastic Synonyms — Introduction

In search applications, synonyms are not a “nice to have”. Users rightly expect that a search for a person named “Anne” will also return people named “Ann”, and that a search for “Raj” will also find “Rajendra”.

If you want to run the demo, you’ll need a running Elasticsearch instance with at least one node and a Kibana server. We’ll use the Kibana development console throughout and interact with the Elasticsearch configuration through a Linux terminal.

If you don’t have an Elastic cluster, the Elastic Docker images are a good place to start.

Step 1 — create an Elasticsearch Index & add some data

An index in Elasticsearch is roughly analogous to a database: it is a container for related data.

We will create a small index & add some data about people & their favourite colours using the Elasticsearch dev console.

The code below adds six records to a new index called ppl_idx. Note that the index is created implicitly when we add the first record. Just copy & paste the code into the dev console & execute…

POST ppl_idx/_doc
{
  "firstName": "Mike",
  "middleName": "Dave",
  "lastName": "Smith",
  "favouriteColour": "yellow"
}

POST ppl_idx/_doc
{
  "firstName": "Vanessa",
  "middleName": "",
  "lastName": "Jones",
  "favouriteColour": "white"
}

POST ppl_idx/_doc
{
  "firstName": "Venkatesh",
  "middleName": "Rajendra",
  "lastName": "Kumar",
  "favouriteColour": "blue"
}

POST ppl_idx/_doc
{
  "firstName": "David",
  "middleName": "Dennis",
  "lastName": "Dunstable",
  "favouriteColour": "yellow"
}

POST ppl_idx/_doc
{
  "firstName": "Nessa",
  "middleName": "Cloe",
  "lastName": "White",
  "favouriteColour": "Teal"
}

POST ppl_idx/_doc
{
  "firstName": "Rachel",
  "middleName": "Lisa",
  "lastName": "Topaz",
  "favouriteColour": "blue"
}
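
As an aside, six separate POSTs are fine for a demo, but Elasticsearch also has a _bulk endpoint that takes newline-delimited JSON: an action line followed by a document line for each record. Here is a minimal Python sketch that builds such a body from the same six people. The function name build_bulk_body is my own invention for illustration, and actually sending the body (for example, pasting it under POST _bulk in the dev console) is left out.

```python
import json

# The same six people as in the requests above.
PEOPLE = [
    {"firstName": "Mike", "middleName": "Dave", "lastName": "Smith", "favouriteColour": "yellow"},
    {"firstName": "Vanessa", "middleName": "", "lastName": "Jones", "favouriteColour": "white"},
    {"firstName": "Venkatesh", "middleName": "Rajendra", "lastName": "Kumar", "favouriteColour": "blue"},
    {"firstName": "David", "middleName": "Dennis", "lastName": "Dunstable", "favouriteColour": "yellow"},
    {"firstName": "Nessa", "middleName": "Cloe", "lastName": "White", "favouriteColour": "Teal"},
    {"firstName": "Rachel", "middleName": "Lisa", "lastName": "Topaz", "favouriteColour": "blue"},
]


def build_bulk_body(index_name, docs):
    """Hypothetical helper: return an NDJSON body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document into index_name with an auto-generated id.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body("ppl_idx", PEOPLE), end="")
```

Each record becomes two lines, so the six people produce a twelve-line body.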

We can query the Index to check the data has loaded OK:

GET ppl_idx/_search
{
  "query": {
    "match_all": {}
  }
}

This call to the Elasticsearch API uses the _search endpoint and simply returns all records. If all is well, we will see six hits and the records returned as JSON:

Search command in the left pane, results in the right-hand pane

Step 2 — try some searches

Now let's search for a few people. We’ll be a bit clever and use Elastic’s multi_match query to search more than one field, looking in both first and middle names for anyone called “david”. Searches on these text fields are case-insensitive by default, in case you were wondering…

GET ppl_idx/_search
{
  "query": {
    "multi_match": {
      "query": "david", 
      "fuzziness": "auto",
      "fields": ["firstName","middleName^0.8"]
    }
  }
}

Also note the ^0.8 next to “middleName”. This is a simple boost: matches on middle names contribute only 80% of the score that forename matches do, so forename hits will tend to appear before middle-name hits in our results.

fuzziness is also set to auto. This allows Elasticsearch to perform a fuzzy match and catch search terms that nearly match, such as typos.

In this case, this will return just one record — David Dennis Dunstable, all the D’s!

BUT there was a record added for “Mike Dave Smith”; users would expect a search tool to understand that Dave and David are the same thing and return both records in the search result.
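
Why didn’t fuzziness catch “Dave”? As I understand the documented defaults, auto allows no edits for terms of 1 to 2 characters, one edit for terms of 3 to 5 characters, and two edits for anything longer. The sketch below uses a plain Levenshtein distance as a stand-in for Elasticsearch’s own edit-distance matching (an assumption: the real thing also counts a swap of two adjacent letters as a single edit), and both function names are made up for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Edits allowed for a term under the default auto thresholds (3 and 6)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


if __name__ == "__main__":
    # "david" has five letters, so only one edit is allowed, but "dave" is two edits away.
    print(levenshtein("david", "dave"), auto_fuzziness("david"))  # 2 1
```

So a typo like “davd” would still find David, but “Dave” is simply too far away. Fuzziness handles typos, not nicknames.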

Onto synonyms…

Step 3 — create a synonyms csv and upload it to Elasticsearch

So we now need to tell Elasticsearch about forename synonyms. We added the following names to our index, several of which could have synonyms:

Mike Dave Smith
Vanessa Jones
Venkatesh Rajendra Kumar
David Dennis Dunstable
Nessa Cloe White
Rachel Lisa Topaz

There are several synonyms here, e.g. “Venkatesh” may be abbreviated to “Venkat”, and Vanessa and Nessa are also synonymous. We can list these in a simple file and save it as a CSV; I’ll call mine names.csv:

vanessa,nessa
venkatesh,venkat
rajendra,raj
mike,michael,mick,micky
david,dave
rachel,rach
dennis,den

Note that we can have as many synonyms on a line as we like, e.g. “mike” has four entries.
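
To make the file format concrete, here is a rough Python sketch of how a comma-separated line can be read as a group of equivalent terms, where every term in the group maps to the whole group. This is only my mental model of the lookup, not Elasticsearch’s actual code, and load_synonym_groups is a hypothetical helper name.

```python
from collections import defaultdict

# The contents of names.csv from above, inlined so the sketch is self-contained.
NAMES_CSV = """\
vanessa,nessa
venkatesh,venkat
rajendra,raj
mike,michael,mick,micky
david,dave
rachel,rach
dennis,den
"""


def load_synonym_groups(text):
    """Hypothetical helper: map each term to the set of terms on its line."""
    lookup = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        group = {term.strip() for term in line.split(",") if term.strip()}
        for term in group:
            lookup[term] |= group
    return dict(lookup)


if __name__ == "__main__":
    lookup = load_synonym_groups(NAMES_CSV)
    print(sorted(lookup["dave"]))  # ['dave', 'david']
```

Note that the lookup is case-sensitive: “Dave” with a capital D is not a key, which is why lowercasing matters later on.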

Now we upload the csv file to the config directory on each Elasticsearch node:

names.csv synonyms file in the elastic config directory

Now we need to create an Elasticsearch analyser to implement our synonym list…

What is an Elasticsearch Analyser?

At a high level, an analyser controls how Elasticsearch breaks text into terms, both when indexing documents and when analysing search text. For example, you could add an analyser with an “ASCII folding” token filter, which would convert accented characters like “déjà vu” into their ASCII equivalents: “deja vu”.

So a search for “deja vu” could still match text containing the accented form. Note that an analyser can contain multiple character & token filters, each performing a different transformation. This Elastic blog post goes into more detail.
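
If you want a feel for what folding does, the idea can be approximated in a few lines of Python using Unicode decomposition. This is a sketch of the concept only: the function name fold_to_ascii is made up, and the real Elasticsearch filter handles many characters (ß or ø, for example) that this simple approach leaves untouched.

```python
import unicodedata


def fold_to_ascii(text):
    """Rough approximation of ASCII folding: strip combining accent marks."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_to_ascii("déjà vu"))  # deja vu
```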

Onward with our example….

Step 4 - Create an Analyser in Elastic

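We need an analyser to reference our synonyms file. The following call to Elasticsearch defines a new analyser called names_analyzer in the settings of our index, ppl_idx. Note that PUT /ppl_idx is the create-index call, so it will fail with a resource_already_exists_exception if ppl_idx still exists from Step 1. Either delete the index, run this call and then re-add the data, or close the existing index, send the same analysis settings to PUT /ppl_idx/_settings and reopen it.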

PUT /ppl_idx
{
  "settings": {
    "analysis": {
      "analyzer": {
        "names_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "names_synonyms"
          ]
        }
      },
      "filter": {
        "names_synonyms": {
          "type": "synonym",
          "synonyms_path": "names.csv",
          "updateable": true
        }
      }
    }
  }
}

So what have we done here? Let’s break it down…..

PUT /ppl_idx

Simple enough: a REST call using the HTTP PUT method. The payload is the following JSON document….

"settings": {
    "analysis": {
      "analyzer": {

These elements tell Elastic we want to make a settings change to the Analysis setting, specifically an analyser…

        "names_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "names_synonyms"

Here we define our analyser and give it a name: names_analyzer

We then set the tokenizer, which tells Elastic how to break text down into discrete parts (terms). In our case, the standard tokenizer will be just fine. See more about Tokenizers here.

Next, we tell the analyser which token filter(s) to use. We first apply the built-in lowercase filter to set all search text to lower case. This matters because our subsequent synonym filter performs an exact match, and names.csv is all in lower case.

Then, we also specify a token filter called names_synonyms, which is defined next…

"filter": {
        "names_synonyms": {
          "type": "synonym",
          "synonyms_path": "names.csv",
          "updateable": true

Nearly there! This is our names_synonyms filter previously referenced in the analyser. type tells Elastic the behaviour to perform. Here it is synonym: look up all words with the same meaning (e.g. rachel, rach) and add them all to the search.

synonyms_path tells Elastic where to find the list of search terms to match when checking for synonyms. The path is relative to the config directory.

updateable: when set to true, this allows us to update our synonym file and reload the analyser (via the _reload_search_analyzers API) without needing to take our index offline. An updateable filter can only be used in an analyser applied at search time, which is how we use it here.

We’re now ready to try our synonym analyser….

Step 5 — Implement our Synonym Analyser in a Search

Right! So all the pieces are in place. We just need to make a small modification to our search payload to use our new analyser, by adding "analyzer": "names_analyzer" to the query.

We’ll do the same “david” search as at the beginning of the post, but hopefully, we’ll get two matches this time — adding “Mike Dave Smith” as well…

GET ppl_idx/_search
{
  "query": {
    "multi_match": {
      "query": "david", 
      "fuzziness": "auto",
      "fields": ["firstName","middleName^0.8"],
      "analyzer": "names_analyzer"
    }
  }
}

Running the query we now get two hits and both David and Mike Dave are returned, AND both have the same favourite colour! Who knew?

Step 5.1 — Testing Synonyms

There is another quick way to test a new analyser as well: the _analyze endpoint.

We can test whether the raj synonym is working by sending “hello raj” as the text to ppl_idx/_analyze, with the analyzer parameter set to names_analyzer. We can then see how Elastic breaks the search down & handles tokens….

The result tokens show two Alphanumeric tokens for our search text and our synonym lookup “rajendra”, as a third token from our names.csv file. All three will be matched in a search.
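
If you don’t have a cluster to hand, the chain can be imitated in plain Python to see why those three tokens come out. This is a toy model under my own assumptions: a regex stands in for the standard tokenizer, a small dictionary stands in for names.csv, the token order may differ from what Elasticsearch reports, and the analyse function is hypothetical.

```python
import re

# Hypothetical in-memory stand-in for the rajendra,raj line of names.csv.
SYNONYMS = {"raj": ["rajendra"], "rajendra": ["raj"]}


def analyse(text, synonyms=SYNONYMS, lowercase=True):
    """Toy version of names_analyzer: tokenize, lowercase, then add synonyms."""
    tokens = []
    # A crude tokenizer: split on anything that is not a word character.
    for position, word in enumerate(re.findall(r"\w+", text)):
        term = word.lower() if lowercase else word
        tokens.append({"token": term, "type": "<ALPHANUM>", "position": position})
        # The lookup is an exact match, so the case of the term matters here.
        for alternative in synonyms.get(term, []):
            # A synonym shares the position of the word it was derived from.
            tokens.append({"token": alternative, "type": "SYNONYM", "position": position})
    return tokens


if __name__ == "__main__":
    for token in analyse("Hello Raj"):
        print(token)
```

Calling analyse("Hello Raj", lowercase=False) returns only two tokens, because “Raj” with a capital R never matches the lower-case entry. That is the reason the lowercase filter sits in front of the synonym filter.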

Conclusion

In this post, we created a small search index in Elasticsearch containing data on names & favourite colours :-)

We then uploaded a synonym lookup file for forenames & added it to an analyser as a token filter. We then added the analyser to our index.

Finally, we modified our search request to use this analyser, so results were returned not just for the searched name but for any synonyms of that name. Searching for “David” returned anyone called “Dave” as well.

Final Thoughts — Bias & Clusters

It is never simple, is it?

A lookup file for synonyms has a real potential for bias in our search system. A lookup file that is restricted to western names, or that is of higher quality for female names, can result in searches biased towards these groups.

Another approach may be to consider phonetic search, where words are matched based on sounding similar. This should work very well with our “anne” vs. “ann” example, but would it work with shortened versions of names?

The Elasticsearch Phonetic Analysis Plugin implements this behaviour. If I get a chance to try it I’ll be sure to write a post regarding my findings.

Also, if we have a multi-node Elasticsearch cluster, we need to keep the synonyms lookup file identical across all nodes — any differences will mean the same search potentially returning different results each time!
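
One cheap way to catch drift between nodes is to compare a checksum of each node’s copy of the file. The sketch below assumes you have already fetched the file contents from each node somehow (that part is deliberately left out), and the node names and the group_nodes_by_digest helper are made up for illustration.

```python
import hashlib


def file_digest(content):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def group_nodes_by_digest(files_by_node):
    """Hypothetical helper: group node names by the digest of their synonym file."""
    groups = {}
    for node, content in files_by_node.items():
        groups.setdefault(file_digest(content), []).append(node)
    return groups


if __name__ == "__main__":
    # Made-up node names and contents; node-3 has drifted from the others.
    copies = {
        "node-1": b"david,dave\n",
        "node-2": b"david,dave\n",
        "node-3": b"david,dave,davey\n",
    }
    groups = group_nodes_by_digest(copies)
    if len(groups) > 1:
        print("Synonym files differ:", sorted(groups.values()))
```

If every node ends up in the same group, the files are byte-for-byte identical.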

Final, Final word

I’m not claiming to be an expert on Elasticsearch, but I struggled to find clear documentation to get synonyms working, and this post is a mash of many documents and posts I read. I hope it helps a few people.

If you do find ways to do this better/easier, please drop a comment below. I’d love to learn more.

About the Author:

Mike Knee is an Azure Data Developer here at Version 1.

Written by Mike K