Problem of the Japanese language in Elasticsearch


Elasticsearch is a search engine supporting full-text search for large amounts of data. It’s based on the open-source Lucene library. So much for theory. In most cases, you use the Latin alphabet in your search entries. But what if you need to carry out a full-text search for Arabic or Japanese script, or for other languages that don’t use Roman characters?

In this article, I’m going to show you what problems I encountered while working on a project for the Japanese market and how I solved them. This example will demonstrate that the standard analyzers provided by Elasticsearch cannot correctly handle logographic writing systems.

I’ve spent the last couple of months participating in the creation of an app for the Japanese market. When creating the mapping for the index, I used the standard analyzer for text fields, assuming that it would handle Japanese characters. That was a mistake. The first development tests proved me wrong: I tested the application by entering single nouns, and what I got was a malfunctioning application.

Tokenization and keyword analysis in Elasticsearch

In Elasticsearch, analyzers work in two steps.

The first one is tokenization: the process of breaking a stream of text up into words, e.g. the sentence:

The quick brown fox jumped over the lazy dog

will be divided into individual words:

[The, quick, brown, fox, jumped, over, the, lazy, dog]

The second step is normalization, whose exact behaviour depends on the analyzer used. In the example above, I’ve used simple words without any other signs or numbers. But what happens if the text also contains non-alphanumeric characters? Let’s check it out. The two most commonly used analyzers in Elasticsearch are the standard and the simple analyzer. Here’s some information about them.

Standard analyzer:

  • is the default tool used for text fields,
  • removes most punctuation marks,
  • changes upper case into lower case.

Now I’m going to test how the standard analyzer handles a more complex sentence: Set the shape to semi-transparent by calling set_trans(5)

The result of the analysis of the sentence above is:

[set, the, shape, to, semi, transparent, by, calling, set_trans, 5]

As you can see, punctuation has been removed and uppercase letters have been converted into lowercase ones, while set_trans has survived as a single token and the number 5 has been kept.
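If you want to reproduce these results yourself, a request along these lines can be sent to the _analyze endpoint of the Elasticsearch API (the host and port below are simply the defaults of a local instance):

# assuming a local Elasticsearch instance on the default port
curl -XGET 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "standard",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}'

Changing standard to simple in the analyzer field runs the same sentence through the simple analyzer described next.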

Simple analyzer:

  • removes anything that is not a letter,
  • changes upper case into lower case.

So, for the same sentence, the result of the analysis will be the following:

[set, the, shape, to, semi, transparent, by, calling, set, trans]

This time set_trans has been split at the underscore and the number 5 has been dropped entirely.

Japanese words analysis problem

I’m going to try to analyze the Japanese sentence for The sushi is delicious using the simple and standard analyzers. The sentence in Japanese looks as follows:

寿司がおいしいね.

To do this, I will use the _analyze endpoint available in the Elasticsearch API.

A request along these lines does the job, starting with the standard analyzer:
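curl -XGET 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "standard",
  "text": "寿司がおいしいね"
}'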

The analysis by means of the standard analyzer results in breaking the sentence into an array of single characters: [寿, 司, が, お, い, し, い, ね]. And that is wrong.

Next, I’m going to test the simple analyzer’s work, sending the same request with only the analyzer field changed:
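curl -XGET 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "simple",
  "text": "寿司がおいしいね"
}'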

The result is a single-element array: [寿司がおいしいね]. And that is wrong too.

As you can see in the token field, both analyzers have analyzed the sentence incorrectly.

The first one has divided the text into single characters, and the other has interpreted it as a single word. Neither of the basic analyzers has been able to process the sentence well. To solve that problem, you need to use the Kuromoji plugin, which brings Japanese morphological analysis to Elasticsearch.

Kuromoji plugin installation

The installation is really simple. Just run the following command from the Elasticsearch directory:

sudo bin/elasticsearch-plugin install analysis-kuromoji

Kuromoji plugin

After installing the Kuromoji plugin, I ran the test again, also using the _analyze endpoint.

The request is much the same as before, this time using the kuromoji analyzer provided by the plugin:
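curl -XGET 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "kuromoji",
  "text": "寿司がおいしいね"
}'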

This time the text has been correctly interpreted and analyzed by Elasticsearch. The result is an array consisting of two separate tokens: [寿司], [がおいしいね].

Conclusion

As you can see from my examples, the standard analyzers provided by Elasticsearch cannot deal with the Japanese writing system. The dedicated Kuromoji plugin is a good solution to that problem. Since it is simple to install and operate, it’s easy to adopt. The only thing you need to keep in mind is to set the kuromoji analyzer in the index mapping for text fields, so that Elasticsearch can analyze the text correctly; a sketch of such a mapping follows below. For the sake of simplicity, I’ve used the dedicated _analyze endpoint in the examples presented in this article. I’ve used Elasticsearch version 5.1.1.
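As an illustration, a minimal mapping that assigns the kuromoji analyzer to a text field might look along these lines (the index, type and field names are only examples, not taken from the project described above):

# the index, type and field names below are only examples
curl -XPUT 'localhost:9200/articles?pretty' -d'
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "kuromoji"
        }
      }
    }
  }
}'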

Originally published at xsolve.software on March 14, 2017.
