Elasticsearch — Setting up a synonyms search

Hey folks! Throughout this post, we will see how to configure a basic synonyms search using Elasticsearch (:

Regardless of your level of expertise with Elasticsearch, you should be able to follow this basic implementation.

To keep this post short, we will go through the most basic implementation. I may write more about this topic in the future. Let’s start! (:

Our Problem

Imagine that we are a Brazilian e-commerce site selling laptops. We look at our search statistics and realize our users are searching for the word notebook, and they get angry because they don’t get any results. That’s so bad :(

After some investigation, we noticed that in Brazil notebook is a synonym for laptop, and all our competitors have updated their systems to handle that. It’s our turn to fix it, increase our sales and make our customers happy again :)

With this in mind, let’s start setting up the Elasticsearch environment.

Setting up the environment

We aren’t covering the basic usage of Elasticsearch here. I’m using Docker to start the service and run it.

To start the Elasticsearch cluster run:

$ docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:6.0.0

We should see something like this on the console:

...
[2017-11-24T21:19:48,835][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [3FHsLdj] publish_address {172.17.0.2:9200}, bound_addresses {0.0.0.0:9200}
[2017-11-24T21:19:48,836][INFO ][o.e.n.Node ] [3FHsLdj] started
[2017-11-24T21:19:48,923][INFO ][o.e.g.GatewayService ] [3FHsLdj] recovered [0] indices into cluster_state

Note: if you get the max virtual memory areas error, you can run sudo sysctl -w vm.max_map_count=262144. I also recommend searching for this error to understand the workaround.

Our Elasticsearch cluster is running. To make sure it’s working properly, we can open http://localhost:9200/ and we should see something like this:

{
name: "3FHsLdj",
cluster_name: "docker-cluster",
cluster_uuid: "d3TUN9siQiWAnziLqK3K7w",
version: {
number: "6.0.0",
build_hash: "8f0685b",
build_date: "2017-11-10T18:41:22.859Z",
build_snapshot: false,
lucene_version: "7.0.1",
minimum_wire_compatibility_version: "5.6.0",
minimum_index_compatibility_version: "5.0.0"
},
tagline: "You Know, for Search"
}

Or we can call the REST API of our cluster:

$ curl -XGET 'localhost:9200/_cat/health?v&pretty'
epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1511623576 15:26:16  docker-cluster yellow 1          1         2      2   0    0    2        0             -                  50.0%
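
If we want to check this output from a script instead of reading it by eye, the table can be parsed in a few lines. This is only a minimal sketch in Python: the helper name parse_cat_table is made up for this post, the sample text is the output shown above, and it assumes no column value contains spaces.

```python
def parse_cat_table(text):
    """Parse the whitespace-separated table printed by a _cat API called with ?v.

    Assumption: no value contains whitespace, which holds for the sample below.
    """
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    # Pair each header name with the value in the same column.
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Sample copied from the _cat/health call above.
sample = """\
epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1511623576 15:26:16  docker-cluster yellow 1 1 2 2 0 0 2 0 - 50.0%
"""

health = parse_cat_table(sample)[0]
print(health["cluster"], health["status"], health["active_shards_percent"])
```

A yellow status is normal for a single-node setup like ours, since replica shards have no second node to live on.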

Yeey!! Our cluster is working as expected, let’s create our index and set it up =)

Creating our index

To create an index, we call the API using the HTTP PUT method. Let’s create an index called laptops:

$ curl -XPUT 'localhost:9200/laptops?pretty'
{
"acknowledged":true,
"shards_acknowledged":true,
"index":"laptops"
}

Our index was successfully created, it’s time to insert some data!

If you need to understand basic concepts such as an index, see the Elasticsearch documentation.

Inserting data in our index

We are moving forward, huh? That’s great!!! Shall we insert some data into our index? Yaaaz!!!

$ curl -XPUT 'localhost:9200/laptops/doc/1?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X1 i7 8gb RAM "
}
'

After inserting this doc we will get a result similar to this:

{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}

Great! Let’s insert more data…

$ curl -XPUT 'localhost:9200/laptops/doc/2?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X2 i5 4gb RAM "
}
'
$ curl -XPUT 'localhost:9200/laptops/doc/3?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X3 i3 2gb RAM "
}
'
$ curl -XPUT 'localhost:9200/laptops/doc/4?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop Z1 i7 6gb RAM "
}
'

Cool! We have 4 laptops in our Elasticsearch index, let’s search for them :D
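
By the way, the four insert requests could probably be sent in a single call through the _bulk API, which takes newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. Here is a small Python sketch that only builds that body. The helper name build_bulk_body is made up for this post, and I’m assuming the body would be POSTed to localhost:9200/laptops/doc/_bulk with the Content-Type application/x-ndjson.

```python
import json

# The same four documents we inserted one by one.
laptops = {
    "1": "Laptop X1 i7 8gb RAM",
    "2": "Laptop X2 i5 4gb RAM",
    "3": "Laptop X3 i3 2gb RAM",
    "4": "Laptop Z1 i7 6gb RAM",
}


def build_bulk_body(docs):
    """Build a newline-delimited JSON body: an action line, then the document."""
    lines = []
    for doc_id, title in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps({"title": title}))
    # The bulk format expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(laptops)
print(body, end="")
```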

The notebook treasure map

As I said, in Brazil we use notebook as a synonym for laptop, and when people search for notebook they usually expect a laptop as a result. Crazy, huh? Yaz, we are crazy :D

Ok, first let’s test the basic search request:

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "title": "notebook i7 8gb" } }
}
'

Using a match with notebook i7 8gb we will get two results:

{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "4",
"_score" : 0.6931472,
"_source" : {
"title" : "Laptop Z1 i7 6gb RAM "
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM "
}
}
]
}
}

We got these results because the match query combines its terms with OR by default: any document containing i7 or 8gb matches, while notebook matches nothing. As users, we expect to get only the laptops that satisfy all three terms, so let’s change our query to return a document only when all 3 requirements match.

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "title": "notebook" } },
{ "match": { "title": "i7" } },
{ "match": { "title": "8gb" } }
]
}
}
}
'

At this point we hit the synonym problem. No results were returned for this search:

{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

Why is this happening? We now require all 3 terms (notebook, i7 and 8gb) to match, and no document in our index contains the term notebook. So we need to configure our index to treat notebook as a synonym for laptop.
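
To see why the first query returned two documents and the second returned none, here is a toy model in plain Python. It is not how Elasticsearch works internally (there is no scoring and no real analysis, just lowercase word sets), and the function names match_any and match_all are made up for this post.

```python
# The four titles we indexed, keyed by document id.
titles = {
    1: "Laptop X1 i7 8gb RAM",
    2: "Laptop X2 i5 4gb RAM",
    3: "Laptop X3 i3 2gb RAM",
    4: "Laptop Z1 i7 6gb RAM",
}


def tokens(text):
    """Very rough stand-in for analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


def match_any(query, docs):
    """Like a single match query: a document hits if it shares ANY term."""
    wanted = tokens(query)
    return sorted(doc_id for doc_id, title in docs.items() if wanted & tokens(title))


def match_all(query, docs):
    """Like our bool/must query: a document hits only if it has EVERY term."""
    wanted = tokens(query)
    return sorted(doc_id for doc_id, title in docs.items() if wanted <= tokens(title))


print(match_any("notebook i7 8gb", titles))  # [1, 4]
print(match_all("notebook i7 8gb", titles))  # []
```

Documents 1 and 4 both contain i7, so the OR version finds them. Nothing contains notebook, so the AND version finds nothing.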

Configuring our index

The most basic configuration defines the synonyms directly in the index creation request. Because the analyzer of an existing field can’t be changed in place, and existing documents would need to be re-indexed anyway, the simplest first step is to delete our index:

$ curl -XDELETE 'localhost:9200/laptops/?pretty'
{
"acknowledged" : true
}

Now we can recreate our index with the right analyzer and filter:

$ curl -XPUT 'localhost:9200/laptops/?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index" : {
      "analysis" : {
        "filter" : {
          "synonym_filter" : {
            "type" : "synonym",
            "synonyms" : [
              "laptop, notebook"
            ]
          }
        },
        "analyzer" : {
          "synonym_analyzer" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "synonym_filter"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "synonym_analyzer"
        }
      }
    }
  }
}'

Note: we can add as many synonyms as we want. For simplicity, we are using just notebook and laptop.
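
Each entry in the synonyms list is a rule written in the Solr synonym format. A comma-separated rule like ours makes the terms equivalent, and a rule with => maps the terms on the left to the terms on the right. The Python sketch below is only an illustration of how such rules can be read: the helper names are made up for this post, the ultrabook rule is a hypothetical example that is not part of our index, and real multi-word synonyms are more subtle than this.

```python
def parse_rule(rule):
    """Return (inputs, outputs) for one Solr-style synonym rule."""
    if "=>" in rule:
        # Explicit mapping: terms on the left are replaced by terms on the right.
        left, right = rule.split("=>", 1)
        inputs = [t.strip() for t in left.split(",") if t.strip()]
        outputs = [t.strip() for t in right.split(",") if t.strip()]
    else:
        # Equivalent terms: every term expands to the whole group.
        inputs = [t.strip() for t in rule.split(",") if t.strip()]
        outputs = list(inputs)
    return inputs, outputs


def build_table(rules):
    """Build a lookup table: term -> list of terms it should become."""
    table = {}
    for rule in rules:
        inputs, outputs = parse_rule(rule)
        for term in inputs:
            expansions = table.setdefault(term, [])
            for out in outputs:
                if out not in expansions:
                    expansions.append(out)
    return table


# "ultrabook => laptop" is a hypothetical rule, only here to show the syntax.
table = build_table(["laptop, notebook", "ultrabook => laptop"])
print(table)
```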

Ok, let’s understand what we configured:

[...]
"filter" : {
  "synonym_filter" : {
    "type" : "synonym",
    "synonyms" : [
      "laptop, notebook"
    ]
  }
}

[...]

First, we created a filter called synonym_filter of type synonym, with laptop and notebook listed as equivalent terms in the synonyms list. It will be used by our analyzer:

[...]
"analyzer" : {
  "synonym_analyzer" : {
    "tokenizer" : "standard",
    "filter" : ["lowercase", "synonym_filter"]
  }
}
[...]

We created an analyzer called synonym_analyzer. This analyzer uses the standard tokenizer and two filters: the lowercase filter converts all tokens to lowercase, and the synonym_filter introduces the synonyms into the token stream.

Ok, we have our analyzer defined. Now we need to map it to our field to make it possible to search by synonyms:
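
The pipeline is easier to picture with a toy version. The Python sketch below only imitates the three steps (tokenize, lowercase, add synonyms): the regex is a rough stand-in for the standard tokenizer, the function names are made up for this post, and real Elasticsearch puts a synonym at the same position as the original token instead of appending it to a flat list.

```python
import re

# Lookup table equivalent to our single rule "laptop, notebook".
SYNONYMS = {
    "laptop": ["laptop", "notebook"],
    "notebook": ["laptop", "notebook"],
}


def tokenize(text):
    """Rough stand-in for the standard tokenizer: split into word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Same idea as the lowercase filter."""
    return [token.lower() for token in tokens]


def synonym_filter(tokens):
    """Replace each token by its synonym group, or keep it if it has none."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def synonym_analyzer(text):
    """Tokenizer first, then the filters in the order we configured them."""
    return synonym_filter(lowercase(tokenize(text)))


print(synonym_analyzer("Laptop X1 i7 8gb RAM"))
# ['laptop', 'notebook', 'x1', 'i7', '8gb', 'ram']
```

The order of the filters matters: our synonyms are written in lowercase, so Laptop has to be lowercased before the synonym filter sees it. To inspect the real tokens, the index’s _analyze endpoint should accept an analyzer name and a text.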

"mappings": {
  "doc": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}

This configuration will map our title field to use our synonym_analyzer.

Cool! Now we have our basic settings in place, let’s re-index our data and test our simple implementation :D

Testing

Let’s insert our data again and try to search:

$ curl -XPUT 'localhost:9200/laptops/doc/1?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X1 i7 8gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/2?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X2 i5 4gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/3?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X3 i3 2gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/4?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop Z1 i7 6gb RAM"}'

Ok, once we have our data we can start searching for it. First, we will search for just notebook :

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "title": "notebook" } }
}
'

And we get… ALL RESULTS!!!!

{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.41501677,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 0.41501677,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "3",
"_score" : 0.41501677,
"_source" : {
"title" : "Laptop X3 i3 2gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "2",
"_score" : 0.26302126,
"_source" : {
"title" : "Laptop X2 i5 4gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "4",
"_score" : 0.26302126,
"_source" : {
"title" : "Laptop Z1 i7 6gb RAM"
}
}

]
}
}

YEEEY!!! Let’s test to get a more specific result again, matching notebook, i7, and 8gb :

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "title": "notebook" } },
{ "match": { "title": "i7" } },
{ "match": { "title": "8gb" } }
]
}
}
}
'

BA DUM TSSS!!!!!!!! We got just one result:

{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0324807,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0324807,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM"
}
}

]
}
}

WOOWW!!! Our synonym search is working!!!! A.W.E.S.O.M.E!!!!
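
If you want to convince yourself of why both searches now behave this way, here is a toy re-creation in plain Python. It is just a sketch under simplifying assumptions: a tiny inverted index, no scoring, and made-up function names (analyze, build_index, search_must) that have nothing to do with the Elasticsearch API.

```python
import re
from collections import defaultdict

# Lookup table equivalent to our single rule "laptop, notebook".
SYNONYMS = {
    "laptop": ["laptop", "notebook"],
    "notebook": ["laptop", "notebook"],
}

titles = {
    1: "Laptop X1 i7 8gb RAM",
    2: "Laptop X2 i5 4gb RAM",
    3: "Laptop X3 i3 2gb RAM",
    4: "Laptop Z1 i7 6gb RAM",
}


def analyze(text):
    """Toy analyzer: split into words, lowercase, then expand synonyms."""
    expanded = []
    for token in re.findall(r"\w+", text):
        token = token.lower()
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in analyze(title):
            index[term].add(doc_id)
    return index


def search_must(index, *clauses):
    """Every clause must hit; inside a clause, any of its terms is enough."""
    result = None
    for clause in clauses:
        hits = set()
        for term in analyze(clause):
            hits |= index.get(term, set())
        result = hits if result is None else result & hits
    return sorted(result or [])


index = build_index(titles)
print(search_must(index, "notebook"))               # [1, 2, 3, 4]
print(search_must(index, "notebook", "i7", "8gb"))  # [1]
```

Because the same analysis runs on the titles and on the query, notebook and laptop end up as the same terms on both sides, and that is the whole trick.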

Wrapping up

There are other ways to configure the synonym filter. The most common are defining the synonyms directly in the index configuration, as we did here, or using the synonyms_path attribute to point to a text file holding our synonyms. The latter deserves its own post.

The main advice here is to play around and try to understand how it works. A synonym configuration can be really beneficial for an e-commerce system, for instance.
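
Just to give an idea of the file-based variant, here is a small Python sketch that builds the synonyms file content and the matching settings body. Treat it as an assumption-heavy illustration: the helper names are made up for this post, and analysis/synonyms.txt is a hypothetical path, which Elasticsearch should resolve relative to its config directory.

```python
import json


def synonyms_file_text(groups):
    """Write one comma-separated rule per line, as in the inline list."""
    return "\n".join(", ".join(group) for group in groups) + "\n"


def settings_with_synonyms_path(path):
    """Same analyzer as before, but the filter reads its rules from a file."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "synonym_filter": {
                            "type": "synonym",
                            "synonyms_path": path,
                        }
                    },
                    "analyzer": {
                        "synonym_analyzer": {
                            "tokenizer": "standard",
                            "filter": ["lowercase", "synonym_filter"],
                        }
                    },
                }
            }
        }
    }


file_text = synonyms_file_text([["laptop", "notebook"]])
# Hypothetical path, assumed to be relative to the Elasticsearch config directory.
body = settings_with_synonyms_path("analysis/synonyms.txt")

print(file_text, end="")
print(json.dumps(body, indent=2))
```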

That’s it for today, hope you enjoyed our simple filter and had fun playing with me (:

See you!
