Practical First Steps to NLP in Japanese

Sharat Chinnapa
Published in The HumAIn Blog · May 3, 2021 · 5 min read

Recently we’ve been hiring people from India to work on NLP with us. In the interviews I took them out of their comfort zone by presenting a dataset that was primarily Japanese text.

For a native English speaker who does not know Japanese, there are significant barriers to entry when working with Japanese text. Right off the bat, people struggle with these two areas:

  1. Not being able to read the letters hampers their intuitive judgement
  2. There is no obvious way to split words apart in the text

For anyone who is planning to work long term on Japanese NLP, learning the language is a very good investment — even just knowing the basic letters will pay off.

If you want to have lovely literary models, definitely read to the end.

In this article I’m going to go over some basics and some practical workarounds to get started.

  1. Translation Tools are your Friend
  2. Tokenization Pitfalls
  3. Making Word Embeddings

I have prepared two notebooks to illustrate tokenization and word embedding creation, which are linked at the end of the article.

Translation Tools are your Friend

It always surprises me how many people don’t go around translating things indiscriminately. As Data Scientists, we do rely on our intuition to get us to ask the correct questions. Not being able to read the data is a significant barrier to this intuition. Translation tools are far from perfect but they can go some distance in helping your intuition grasp the problem.

Here are some options:

・Google Translate (website)
・Rikaikun (Chrome extension)
・DeepL (website and downloadable app)

Whichever you use, the idea is that the tool should allow you to quickly look up things and get the general idea of what you’re dealing with. I have started to favour DeepL recently because it can be hot-keyed (pressing cmd+c twice looks up whatever is selected on the desktop app).

Tokenization Pitfalls

Character level tokenization

Japanese is a continuous script: it does not use spaces to separate words. As a result, the first step of many a project is to split the text into tokens yourself. One school of thought is to create tokens at the character level. It is a good solution, and it might work for your specific problem.

The main benefit here is that it’s very easy to do:

list("Your Japanese text here")  # yes, that's it.

The thought behind it is that Japanese characters are “like words”. For example, “猫” is “cat” and “犬” is “dog” (use a translation tool to check if that’s true!). This holds to some extent, as even compound words like “東京” (Tokyo) can often be parsed at the character level, in this case “East” + “Capital”. However, unlike Chinese, Japanese has 3 scripts in total, and all three are often mixed together in naturally occurring text. The examples above all feature “Kanji”, whereas the other two scripts, Hiragana and Katakana, do not carry meaning at the character level. Some words, like “新しい” (atarashii: “new”), even mix Kanji and Hiragana within a single word. Hence, while character-level tokenization is easier, it’s often worth going through the trouble of using word-level tokenization.

Wait…what is a word?

There are a whole host of libraries to accomplish word level tokenization. The main pitfall here is that different libraries are going to tokenize things differently. Yes… in other words there is disagreement on what a “word” is.

The root of this trouble is compound words. Especially in the Kanji script, just as two characters with character-level meaning can be joined together to make a word, two words with standalone meanings can also be joined to form other words.

For example:

“小学” means “elementary school”

“小学生” means “elementary school student”, made from “小学” + “生” (where “生” means “life”, and about 20 other things too). Alternatively, it can be argued that “小学生” comes from “小” (small) + “学生” (student), which also sounds reasonable.

The debate then arises as to whether to tokenize “小学生” as 1 token or as 2, and if it is 2 tokens, which two?

Indeed, two popular tokenization libraries, Nagisa and MeCab, can disagree in exactly this way with their vanilla settings.
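You can see this for yourself in a few lines. The sketch below assumes the nagisa and mecab-python3 packages are installed; the exact splits you get depend on the dictionary your MeCab install ships with, so treat the output as illustrative rather than definitive.

import MeCab
import nagisa

text = "小学生"

# MeCab in "wakati" mode prints space-separated tokens
mecab_tagger = MeCab.Tagger("-Owakati")
print(mecab_tagger.parse(text).split())

# Nagisa with its default model
print(nagisa.tagging(text).words)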

As you can see, this can quickly become a deep dark rabbit hole that someone not exceedingly proficient in Japanese linguistics would like to stay very very far away from.

The trap is easy to avoid in practice once you’re aware of it. Just use whichever tokenization policy works best for your dataset. But how does one decide what works best? As a first pass, I find that looking at the quality of the nearest neighbours for the tokens in your dataset is a quick way to build that intuition. For this, we need to make word embeddings.

Making Word Embeddings

Word embeddings are n-dimensional mathematical representations of words that we can use for NLP. The idea is that representing words this way can create clusters of “meaning”, such that similar words are grouped together in the space and, to some extent, relationships between words can be computed mathematically.

The famous example being:

"king" - "man" + "woman" = "queen"

FastText is a great source of pre-trained word embeddings for multiple languages, and we can use it here.
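If you use the official fasttext Python package, the pretrained Japanese vectors can be downloaded and loaded as in the sketch below (the model file is large, so the first run takes a while). This is the ft object used in the snippets that follow.

import fasttext
import fasttext.util

# Download the pretrained Japanese vectors once, then load them from disk.
fasttext.util.download_model('ja', if_exists='ignore')  # fetches cc.ja.300.bin
ft = fasttext.load_model('cc.ja.300.bin')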

Your tokenization library and your word embeddings should ideally work well together, and fastText was trained using MeCab, so that would generally be the first choice of tokenizer to use. However, it is important to note that depending on your dataset and the dictionary you’re using with MeCab, it might still give you suboptimal results in certain cases.

You can generally validate that by tokenizing some input data and printing out the 3 nearest neighbours for each token. The similarity score will tell you if the model knows the word well (well-understood words usually seem to have nearest neighbours at around 0.9 similarity or more); cross-checking the translation of the token and its nearest neighbours with a translation tool is also a must. Here’s how that works:

ft.get_nearest_neighbors('東京', k=3)
# ideal case: "Tokyo" (nearest neighbours are Osaka, Yokohama etc.)
=> Output:
[(0.9795494079589844, '大阪'), (0.966786801815033, '横浜'), (0.9433713555335999, '京都')]

ft.get_nearest_neighbors('東京都', k=3)
# "Tokyo-city" as a single token makes no sense to the model
# (nearest neighbours are "Inspire", "Future" etc.)
=> Output:
[(0.38261038064956665, 'インスパイア'), (0.37217432260513306, 'ヒューチャー'), (0.3658350706100464, 'ダイアンレイン')]
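To run this check over a whole sample rather than one token at a time, a short loop is enough. The sketch below assumes mecab-python3 and the pretrained model loaded earlier; the sample sentence is my own, not from a real dataset.

import MeCab
import fasttext

ft = fasttext.load_model('cc.ja.300.bin')  # pretrained Japanese vectors (see above)
tagger = MeCab.Tagger('-Owakati')          # wakati mode: space-separated tokens

sample = '東京都の小学生が犬と猫を飼っている'  # illustrative sentence only
for token in tagger.parse(sample).split():
    print(token, ft.get_nearest_neighbors(token, k=3))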

If your tokenization logic splits words at a level of compounding that your model is not tuned to, it’s going to be problematic. So either make sure the two play well together, or fine-tune the fastText model using your tokenizer.
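One practical route for the latter is to train embeddings directly on your own corpus, pre-tokenized with the same tokenizer you use everywhere else (this trains from scratch rather than adjusting the pretrained model). A minimal sketch; the corpus filename is a hypothetical placeholder for a file with one space-separated, pre-tokenized sentence per line.

import fasttext

# 'my_corpus.tokenized.txt' is hypothetical: your own data, one sentence per line,
# already space-separated by the tokenizer you use downstream.
model = fasttext.train_unsupervised('my_corpus.tokenized.txt', model='skipgram', dim=300)

# Sanity-check the result the same way as before.
print(model.get_nearest_neighbors('小学生', k=3))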

Resources

Here are two notebooks that contain code for the basic setup and let you experiment with the ideas outlined in this blog post.

Nagisa and FastText

MeCab and FastText

Hope you find this useful, and all the best on your NLP journey!

Sharat Chinnapa
The HumAIn Blog

Programmer, writer, dancer, learning how to make the world a better place at HumAIn.