Elasticsearch: mapping and analyzers

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements.

Elasticsearch is reputed to be schema-less, which means you can quickly insert documents into an index without specifying a schema beforehand. This differs from traditional relational databases, where you have to explicitly specify tables, fields and field types.

Under the hood, though, Elasticsearch enforces a schema called a mapping, which describes the fields in the documents along with their data types and how they should be indexed by Lucene.

If you do not specify an explicit mapping, Elasticsearch will generate one for you based on the first document you insert. The mapping will also be automatically updated as new fields are added to the documents. Let’s see how it works.

Dynamic mapping

Let’s index some user profiles in Elasticsearch. Each user profile has a name, a date_of_birth and a description. The date_of_birth is formatted as a timestamp in milliseconds.
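The indexing request itself is not shown here, so below is a minimal sketch in Python. It assumes Elasticsearch 7 or later (typeless `_doc` endpoint) listening on `http://localhost:9200`. The index name `users`, the document id `1` and the description text are hypothetical; the name and the timestamp are the ones used later in this article. The sketch only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name


def index_request(doc_id, document):
    """Build PUT /<index>/_doc/<id>, which creates or replaces one document."""
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


profile = {
    "name": "Bobby Kennedy",
    "date_of_birth": 553941680000,  # timestamp in milliseconds
    # Hypothetical text: the original description is not shown.
    "description": "Passionate about search engines and open source software",
}
request = index_request(1, profile)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```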

Let’s verify that our document has actually been created:
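One way to check is to fetch the document by id. The sketch below assumes the same hypothetical setup (Elasticsearch 7 or later on `http://localhost:9200`, index `users`, document id `1`) and relies on the `found` flag that the get-document API returns. The helper names are illustrative.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name


def get_request(doc_id):
    """Build GET /<index>/_doc/<id>, which fetches one document by id."""
    return urllib.request.Request(f"{ES_URL}/{INDEX}/_doc/{doc_id}", method="GET")


def was_created(response_body):
    """Return True when a parsed get-document response reports the document."""
    return bool(response_body.get("found"))


request = get_request(1)
# To send it (a missing document answers with HTTP 404, which urlopen
# raises as urllib.error.HTTPError):
#   import json
#   with urllib.request.urlopen(request) as response:
#       print(was_created(json.load(response)))
```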

Let’s inspect the mapping that has been created dynamically by Elasticsearch:
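The mapping can be read back with the get-mapping API. Here is a small Python sketch, again assuming Elasticsearch 7 or later on `http://localhost:9200` and a hypothetical `users` index. The `field_types` helper is illustrative: it reduces the typeless response format to a plain field-to-type dictionary, which is easier to scan.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name


def mapping_request():
    """Build GET /<index>/_mapping."""
    return urllib.request.Request(f"{ES_URL}/{INDEX}/_mapping", method="GET")


def field_types(mapping_response, index=INDEX):
    """Reduce a parsed typeless (7+) mapping response to {field: type}."""
    properties = mapping_response[index]["mappings"].get("properties", {})
    # Object fields carry nested "properties" instead of a "type".
    return {name: spec.get("type", "object") for name, spec in properties.items()}


request = mapping_request()
# To send it:
#   import json
#   with urllib.request.urlopen(request) as response:
#       print(field_types(json.load(response)))
```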

Elasticsearch has automatically inferred the type of each field based on a set of simple rules.
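To make those rules concrete, here is a simplified Python imitation of the default behaviour. It is not Elasticsearch code: the `infer_type` function and the date pattern are illustrative stand-ins that loosely follow the documented defaults (whole numbers become `long`, floating-point numbers `float`, strings `text` unless they look like dates).

```python
import re

# Simplified stand-in for the default date-detection formats.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T.*)?$")


def infer_type(value):
    """Guess the field type dynamic mapping would pick for a JSON value."""
    if value is None:
        return None  # null values do not add a field
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        return infer_type(next((v for v in value if v is not None), None))
    if isinstance(value, str):
        # Real text fields also get a "keyword" sub-field by default.
        return "date" if ISO_DATE.match(value) else "text"
    raise TypeError(f"not a JSON value: {value!r}")


profile = {
    "name": "Bobby Kennedy",
    "date_of_birth": 553941680000,
    # Hypothetical text: the original description is not shown.
    "description": "Passionate about search engines and open source software",
}
print({field: infer_type(value) for field, value in profile.items()})
```

Note that the millisecond timestamp is just a whole number as far as these rules are concerned.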

Explicit mapping

Dynamic mapping is great, but it can lead to unexpected search results.

Let’s suppose we want to search for all users born in the 80s. Our user Bobby Kennedy was born at epoch 553941680000 (in milliseconds), which corresponds to 1987-07-22. He should pop up in the results. We can issue a date range query on the date_of_birth field:
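The original query is not shown, so the body below is one plausible form of it, with the bounds given as years (an assumption). The sketch first double-checks the timestamp conversion, then prints a body that could be posted to the `_search` endpoint of a hypothetical `users` index.

```python
import json
from datetime import datetime, timezone

date_of_birth_ms = 553941680000
born = datetime.fromtimestamp(date_of_birth_ms / 1000, tz=timezone.utc)
print(born.date().isoformat())  # 1987-07-22

# Range query over the 80s; the year bounds are an assumption.
range_query = {
    "query": {
        "range": {
            "date_of_birth": {"gte": "1980", "lte": "1989"}
        }
    }
}
# Body for POST http://localhost:9200/users/_search (hypothetical index name)
print(json.dumps(range_query, indent=2))
```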

No hits?

If you had looked closely at the mapping, you wouldn’t be surprised. Our date_of_birth field has a long type because we did not tell Elasticsearch to store it as a date. This is where explicit mapping comes into play. Let’s delete our index, set a mapping with an explicit date type for date_of_birth, and reindex our document.
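The three steps can be sketched in Python as follows, assuming Elasticsearch 7 or later (typeless mappings) on `http://localhost:9200`. The index name `users`, the document id and the description text are hypothetical. The default format of the `date` type accepts milliseconds since the epoch, so the document itself does not need to change.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name


def build(method, path, body=None):
    """Build a request, attaching a JSON body when one is given."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {} if body is None else {"Content-Type": "application/json"}
    return urllib.request.Request(
        f"{ES_URL}{path}", data=data, headers=headers, method=method
    )


# Only date_of_birth is mapped explicitly; other fields stay dynamic.
explicit_mapping = {"mappings": {"properties": {"date_of_birth": {"type": "date"}}}}
profile = {
    "name": "Bobby Kennedy",
    "date_of_birth": 553941680000,
    # Hypothetical text: the original description is not shown.
    "description": "Passionate about search engines and open source software",
}
steps = [
    build("DELETE", f"/{INDEX}"),                 # 1. drop the old index
    build("PUT", f"/{INDEX}", explicit_mapping),  # 2. recreate it with the mapping
    build("PUT", f"/{INDEX}/_doc/1", profile),    # 3. index the document again
]
for step in steps:
    print(step.get_method(), step.full_url)
# To run them in order: for step in steps: urllib.request.urlopen(step)
```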

You cannot change the type of a field that is already mapped. To change it, you have to create a new index with the right mapping and reindex your documents.

Note that you can set the mapping only partially. Mappings for other fields will be dynamic.

Let’s reissue our range query:
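Why would the same query behave differently now? A date field stores its values as milliseconds since the epoch and parses the query bounds as dates. The Python sketch below reproduces that comparison by hand. It assumes UTC and the year bounds 1980 and 1989 (the original query is not shown), with the upper bound extended to the last millisecond of 1989, as the range query documentation describes for `lte`.

```python
from datetime import datetime, timezone


def year_start_ms(year):
    """Milliseconds since the epoch for January 1st of the given year (UTC)."""
    return int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


date_of_birth = 553941680000

# Date semantics: the bounds are years, converted to milliseconds.
lower = year_start_ms(1980)
upper = year_start_ms(1990) - 1  # last millisecond of 1989
print(lower <= date_of_birth <= upper)  # True

# Number semantics: the bounds are compared to the raw value as-is.
print(1980 <= date_of_birth <= 1989)  # False
```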

Hooray, it worked!

You know more about your data than Elasticsearch does

Analyzers

Our description field has been indexed as type text. The text type is suited to full-text values such as descriptions, email bodies, etc. Text values are passed through an analyzer that converts the original string into a list of individual terms that can be indexed. The default analyzer simply tokenizes the string and lowercases each term.
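As a rough local illustration, the Python function below approximates that behaviour with a regular expression. It is only an approximation: the real standard analyzer splits text according to the Unicode text segmentation rules, which differ from `\w+` on some inputs. The function name and the sample sentence are hypothetical.

```python
import re


def approximate_standard_analyzer(text):
    """Split on non-word characters, then lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Hypothetical description: the original one is not shown.
description = "Passionate about search engines and open source software"
print(approximate_standard_analyzer(description))
```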

Standard analyzer

We can inspect how the analyzer works thanks to the _analyze API:
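Here is a minimal Python sketch of such a call, assuming Elasticsearch on `http://localhost:9200`. The sample text is hypothetical, and the `tokens` helper is illustrative: it pulls the token strings out of the parsed response.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def analyze_request(text, analyzer="standard"):
    """Build POST /_analyze, which runs an analyzer on a piece of text."""
    body = {"analyzer": analyzer, "text": text}
    return urllib.request.Request(
        f"{ES_URL}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens(analyze_response):
    """Extract the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Hypothetical text: the original description is not shown.
request = analyze_request("Passionate about search engines")
# To send it:
#   with urllib.request.urlopen(request) as response:
#       print(tokens(json.load(response)))
```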

Let’s verify that a search for the word “passionate” matches:
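A match query on the description field does the job. The sketch below assumes Elasticsearch on `http://localhost:9200` and a hypothetical `users` index. The `hit_count` helper is illustrative and copes with both shapes of `hits.total`: an object in version 7 and later, a bare number in older versions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name


def search_request(field, text):
    """Build POST /<index>/_search with a full-text match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_count(search_response):
    """Return the total number of hits from a parsed search response."""
    total = search_response["hits"]["total"]
    # 7+ reports {"value": n, "relation": ...}; older versions a bare integer.
    return total["value"] if isinstance(total, dict) else total


request = search_request("description", "passionate")
# To send it:
#   with urllib.request.urlopen(request) as response:
#       print(hit_count(json.load(response)))
```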

This is great, but we would like Elasticsearch to be tolerant of the different inflected forms of a word. In our example, the document should also match a search for the word passion. This can be achieved with a language analyzer that stems the tokens. This process is called stemming: it reduces the inflected forms of a word to a common root so that they are indexed as the same term. Ex: passionate → passion, foxes → fox, jumped → jump, etc.
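To see why this makes the search match, here is a deliberately naive Python sketch. The suffix rules are a toy invented for this illustration and are far cruder than the stemmer a real language analyzer uses. The point is only that the document and the query go through the same analysis, so both end up with the same term.

```python
import re

# Toy suffix rules, just enough for the examples above (not a real stemmer).
TOY_SUFFIXES = ("ate", "es", "ed", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token


def analyze(text):
    """Lowercase, split into alphanumeric tokens, then stem each token."""
    return [toy_stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


def matches(query, document):
    """A document matches when it shares at least one analyzed term."""
    return bool(set(analyze(query)) & set(analyze(document)))


# Hypothetical description: the original one is not shown.
description = "Passionate about search engines and open source software"
print([toy_stem(word) for word in ("passionate", "foxes", "jumped")])
print(matches("passion", description))
```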

English analyzer

Let’s remove the index, apply the english analyzer to our description field, and reindex our document.
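A possible sketch in Python, assuming Elasticsearch 7 or later (typeless mappings) on `http://localhost:9200`. The index name `users`, the document id and the description text are hypothetical. Since the index is recreated from scratch, the sketch also carries over the date type for date_of_birth, which is an assumption about the intended mapping.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name


def build(method, path, body=None):
    """Build a request, attaching a JSON body when one is given."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {} if body is None else {"Content-Type": "application/json"}
    return urllib.request.Request(
        f"{ES_URL}{path}", data=data, headers=headers, method=method
    )


index_settings = {
    "mappings": {
        "properties": {
            # Assumption: keep the explicit date type from the previous section.
            "date_of_birth": {"type": "date"},
            # "english" is one of the built-in language analyzers.
            "description": {"type": "text", "analyzer": "english"},
        }
    }
}
profile = {
    "name": "Bobby Kennedy",
    "date_of_birth": 553941680000,
    # Hypothetical text: the original description is not shown.
    "description": "Passionate about search engines and open source software",
}
steps = [
    build("DELETE", f"/{INDEX}"),
    build("PUT", f"/{INDEX}", index_settings),
    build("PUT", f"/{INDEX}/_doc/1", profile),
]
# To run them in order: for step in steps: urllib.request.urlopen(step)
```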

Let’s verify that a search for the word passion matches:
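The query is the same kind of match query as before, this time with the term passion. In this Python sketch (Elasticsearch on `http://localhost:9200` and the `users` index are assumptions), the illustrative `hit_names` helper lists the name of every returned profile.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "users"                   # assumption: hypothetical index name

# Match query on the description field for the word "passion".
query = {"query": {"match": {"description": "passion"}}}
request = urllib.request.Request(
    f"{ES_URL}/{INDEX}/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def hit_names(search_response):
    """List the name field of every hit in a parsed search response."""
    return [hit["_source"]["name"] for hit in search_response["hits"]["hits"]]


# To send it:
#   with urllib.request.urlopen(request) as response:
#       print(hit_names(json.load(response)))
```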

Analysis is very flexible: you can even define your own analyzer. There are also plenty of analysis plugins, both official and community-supported. A very interesting one is the phonetic analysis plugin, which converts tokens into their phonetic equivalent using Soundex, Metaphone and other codecs.