How to classify Japanese text with fastText

Tatsuya Yokoyama
5 min read · Mar 13, 2018


Introduction

Text classification is one of the important topics in machine learning because it powers article categorization, spam filtering, language identification, sentiment analysis, and so on.

In this post, I would like to introduce how to classify Japanese text with fastText. We will build a classifier that automatically sorts Japanese text into three categories: Politics, Technology and Sports. Then, when someone browses web pages such as news articles, blog posts or product pages, we can use this classifier to recommend other pages in the same category. If it works well, users save the time and effort of finding other pages they might like to browse. I believe that would be a great UX.

What is fastText?

FastText is an open source library for efficient text representation and classification, developed by Facebook. According to their announcement, fastText offers "faster, better text classification". Although deep neural networks achieve very good performance, they can be slow to train and test, which limits how they can be used. fastText addresses this by using a hierarchical softmax instead of a flat one over the output classes, which reduces the cost of computing class probabilities from O(k) to roughly O(log k) for k classes. You can watch the following YouTube video (from 4:22) to learn more about it.

Steps

I will explain how to do it step by step:

  1. Get started
  2. Prepare training data
  3. Pre-process
  4. Train model
  5. Evaluate

Step 1 Get started

We use a Python interface for fastText. If Cython is not installed, you need to install it before installing the fasttext package.

$ pip install cython
$ pip install fasttext

Then we need to install MeCab, an open source text segmentation library originally developed by Taku Kudo (工藤 拓).

As you know, Japanese text is not separated by spaces, e.g. "今日はいい天気ですね", unlike English, e.g. "It is sunny today.", so we have to separate it into words before feeding it to fastText, as shown below. Separating Japanese text this way is called 'Wakati-gaki' (分かち書き).

Before Wakati-gaki:
今日はいい天気ですね
After Wakati-gaki:
今日
は
いい
天気
です
ね
We usually do Wakati-gaki with MeCab, so install MeCab and a dictionary for it.

$ brew install mecab mecab-ipadic

We also use mecab-python3, a Python wrapper for MeCab. Install it as follows:

$ pip install mecab-python3
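
As a quick check that MeCab works, here is a minimal Wakati-gaki sketch with mecab-python3; the -Owakati option tells MeCab to print space-separated tokens:

import MeCab

tagger = MeCab.Tagger('-Owakati')
print(tagger.parse('今日はいい天気ですね').strip())
# 今日 は いい 天気 です ね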

Step 2 Prepare training data

To get training data labeled with the three categories (Politics, Technology and Sports), I scraped some Japanese news sites and collected 1,000 articles per category, i.e. 3,000 in total.

  • Number of records: 3,000
  • Articles per category: 1,000
  • Categories: Politics, Technology, Sports

Each line of the text file contains a list of labels, followed by the corresponding content. By default, fastText recognizes labels of content by the prefix __label__ . We save the data as article.train and will use it in step 4.
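
For illustration, a few lines of article.train look like this after the pre-processing in step 3 (the label format is real, but the tokens here are made-up examples rather than rows from my data set):

__label__politics 首相 国会 予算 審議 与党
__label__technology Apple iPhone 特許 耐水 充電
__label__sports 投手 先発 試合 勝利 打線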

If you would like to know how to crawl web sites, just let me know. I might explain it in another post.

Step 3 Pre-process

As mentioned in step 1, we have to split Japanese text into an array of words before training. After that, we have to normalize the words, e.g. unify orthographic variants. To do this I used the approach shown in neologd/mecab-ipadic-neologd, a customized system dictionary for MeCab.

import MeCab
import re
import unicodedata


class Normalizer:
    def unicode_normalize(self, cls, s):
        pt = re.compile('([{}]+)'.format(cls))

        def norm(c):
            return unicodedata.normalize('NFKC', c) if pt.match(c) else c

        s = ''.join(norm(x) for x in re.split(pt, s))
        s = re.sub('－', '-', s)
        return s

    def remove_extra_spaces(self, s):
        s = re.sub('[ 　]+', ' ', s)
        blocks = ''.join(('\u4E00-\u9FFF',  # CJK UNIFIED IDEOGRAPHS
                          '\u3040-\u309F',  # HIRAGANA
                          '\u30A0-\u30FF',  # KATAKANA
                          '\u3000-\u303F',  # CJK SYMBOLS AND PUNCTUATION
                          '\uFF00-\uFFEF'   # HALFWIDTH AND FULLWIDTH FORMS
                          ))
        basic_latin = '\u0000-\u007F'

        def remove_space_between(cls1, cls2, s):
            p = re.compile('([{}]) ([{}])'.format(cls1, cls2))
            while p.search(s):
                s = p.sub(r'\1\2', s)
            return s

        s = remove_space_between(blocks, blocks, s)
        s = remove_space_between(blocks, basic_latin, s)
        s = remove_space_between(basic_latin, blocks, s)
        return s

    def normalize_neologd(self, s):
        def maketrans(f, t):
            return {ord(x): ord(y) for x, y in zip(f, t)}

        s = s.strip()
        s = self.unicode_normalize('０-９Ａ-Ｚａ-ｚ｡-ﾟ', s)
        s = re.sub('[˗֊‐‑‒–⁃⁻₋−]+', '-', s)  # normalize hyphens
        s = re.sub('[﹣－ｰ—―─━ー]+', 'ー', s)  # normalize choonpus
        s = re.sub('[~∼∾〜〰～]', '', s)  # remove tildes
        s = re.sub('[0-9]', '', s)  # remove digits (little signal for classification)
        s = s.translate(
            maketrans('!"#$%&\'()*+,-./:;<=>?@[¥]^_`{|}~｡､･｢｣',
                      '！”＃＄％＆’（）＊＋，－．／：；＜＝＞？＠［￥］＾＿｀｛｜｝〜。、・「」'))
        s = re.sub('[!-/:-@[-`{-~]', '', s)  # remove half-width symbols
        s = re.sub('[■□◆◇◯“…【】『』！”＃＄％＆’（）＊＋，－．／：；＜＝＞？＠［￥］＾＿｀｛｜｝〜。、・「」]', '', s)  # remove full-width symbols
        s = self.remove_extra_spaces(s)
        s = self.unicode_normalize('！”＃＄％＆’（）＊＋，－．／：；＜＞？＠［￥］＾＿｀｛｜｝〜', s)  # keep ＝,・,「,」
        s = re.sub('[’]', '\'', s)
        s = re.sub('[”]', '"', s)
        s = s.replace('\n', '').replace('\r', '')
        return s

    def get_nouns(self, text):
        tagger = MeCab.Tagger('')
        tagger.parse('')  # work around the mecab-python3 empty-surface issue
        node = tagger.parseToNode(self.normalize_neologd(text))
        nouns = []
        while node:
            if node.feature.split(',')[0] == '名詞':  # keep nouns only
                nouns.append(node.surface)
            node = node.next
        return ' '.join(nouns)

I recommend using the normalizer above. We can call it as follows:

normalizer = Normalizer()
normalizer.normalize_neologd('word')
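
For example, the normalizer collapses full-width characters and stray spaces (the first input below is a test case from the neologd wiki), and get_nouns keeps only the nouns of the normalized text:

print(normalizer.normalize_neologd('ＰＲＭＬ　　副　読　本'))
# => PRML副読本
print(normalizer.get_nouns('今日はいい天気ですね'))
# => 今日 天気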

We usually evaluate a model after training. To do this, we split the 3,000 records into 2,000 training records and 1,000 test records (this assumes the DataFrame has been shuffled, so that each split contains all three categories). And since fastText loads data from files, we write each record to one of two files, article.train and article.test.

# df is the pandas DataFrame of scraped articles with 'label' and 'content' columns
f_train = open('article.train', 'w')
f_test = open('article.test', 'w')
normalizer = Normalizer()
for index, row in df.iterrows():
    line = row['label'] + ' ' + normalizer.get_nouns(row['content']) + '\n'
    if index < 2000:
        f_train.write(line)
    else:
        f_test.write(line)
f_train.close()
f_test.close()

Step 4 Train model

We are now ready to train a model. The supervised function takes two arguments. The first is a file containing a label and content per line, as shown in step 2. The second is the output model name: training on article.train generates model.bin and model.vec. model.bin is a binary file containing the parameters of the model and is used to predict labels. model.vec is a text file containing the word vectors, which capture features of the text. Training finishes in a few seconds.

import fasttext
classifier = fasttext.supervised('article.train', 'model')
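
The defaults work well here, but the 0.8.x fasttext wrapper also exposes fastText's hyperparameters as keyword arguments. A minimal sketch, assuming that wrapper version (the values below are illustrative, not tuned):

classifier = fasttext.supervised('article.train', 'model',
                                 label_prefix='__label__',  # same prefix as in step 2
                                 epoch=10,                  # passes over the training data
                                 lr=0.5,                    # learning rate
                                 word_ngrams=2)             # add bigram features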

Step 5 Evaluate

Finally, we have our classifier. Before predicting anything, let's evaluate it by computing the precision on the test set with the classifier.test function.

result = classifier.test('article.test')
print('P@1:', result.precision)
print('Number of examples:', result.nexamples)
# P@1: 0.854
# Number of examples: 1000

Of course, we can also predict labels of new Japanese text with this classifier, as follows.

# A news snippet: Apple reportedly filed a patent for a waterproof gasket on the
# Lightning connector, which would keep iPhones water-resistant while charging.
text = 'Appleが、Lightning端子に耐水パッキンを追加し、充電中の耐水性能を確保できる技術の特許を申請していたことが明らかになりました。iPhoneなどの充電中にも耐水性能が追加できるようになると期待できます。'
labels = classifier.predict([normalizer.get_nouns(text)])
print(labels)
# [['technology']]
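
If you also want to know how confident each prediction is, the same wrapper provides predict_proba, which returns the k best labels with their probabilities (the numbers below are illustrative):

labels = classifier.predict_proba([normalizer.get_nouns(text)], k=3)
print(labels)
# e.g. [[('technology', 0.87), ('sports', 0.08), ('politics', 0.05)]]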

Summary

I introduced how to classify Japanese text with fastText. Classifying Japanese takes more steps than classifying English because of Wakati-gaki, but many Japanese NLP researchers and engineers have already built tools for handling Japanese, which makes it straightforward today.

And fastText enables us to classify text fast, so we can build a Japanese text classifier easily. If you'd like, try it!
