How GumGum developed our named entity recognition (NER) system for Japanese texts.
Written by k-shin Nishiyama on March 13, 2019
I am Keishin, a member of the Natural Language Processing (NLP) team at GumGum. My team works on a variety of NLP problems, such as text classification, keyword ranking, text extraction from HTML, and more. In this post, I will share how we developed our named entity recognition (NER) system for Japanese text.
NAMED ENTITY RECOGNITION (NER)
Named entities are phrases that contain the names of real-world entities such as persons, organizations, and locations. NER is the task of extracting those phrases and classifying them into predefined categories such as person, location, and organization. The following is an example from CoNLL 2003.
[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .
In this example, “U.N.” is a named entity that belongs to the category ORG (ORGANIZATION), “Ekeus” belongs to PER (PERSON), and “Baghdad” belongs to LOC (LOCATION). Given the sentence “U.N. official Ekeus heads for Baghdad”, an NER system extracts those 3 named entities and classifies them into the appropriate entity categories. We use 4 entity categories: PER (PERSON), LOC (LOCATION), ORG (ORGANIZATION), and MISC (MISCELLANEOUS). MISC contains product names, event names, and any other entity names that don’t fit into the other 3 categories.
We started the Japanese NER project by creating a dataset. First, we annotated about 2,500 articles from the Livedoor news corpus. We picked this corpus because its contents are actual web articles from different news categories, and we wanted our NER system to work well on web articles from different sources. The following is an example of our annotation on the Livedoor corpus.
After testing our initial model trained on the Livedoor data, we found that we needed more training data for better performance. Because Japanese text contains a large number of distinct characters, especially Kanji (Chinese characters), out-of-vocabulary problems occur frequently even when the model inputs are characters. To solve that problem, we had to create a bigger training corpus. We borrowed an idea from the paper Fine-Grained Entity Recognition and created a training set from Wikipedia articles by using internal links and infobox information. In Wikipedia dumps, a phrase that links to another Wikipedia article is surrounded by “[[” and “]]”. We determined whether a link points to a named entity by checking the linked article. However, we cannot simply check and annotate every single article in Wikipedia. Instead, we mapped infobox tags to entity categories and checked the tags in each article to determine whether the page is about a named entity.
(a) Infobox in the Toyota page.
(b) Markov chain Monte Carlo methods are primarily used for calculating numerical approximations of multi-dimensional integrals, for example in Bayesian statistics, … In Bayesian statistics, … (from Markov chain Monte Carlo)
(a) is the initial part of the infobox tag, “ company”, which is widely used in company pages to create an infobox. We mapped this company infobox tag to the ORG entity type and did the same for about 300 infobox tags. Generally, Wikipedia places an internal link only the first time a given term appears in an article, and not after that; see Bayesian statistics in (b). So, we decided to use only the first 4 sentences of each article, to reduce the number of named entities that appear without internal links. (c) is a sample sentence from a Wikipedia article and (d) is an annotated version of it. Because [[ベンチャー]] and [[株式市場]] in (c) are not named entities, we do not mark them and treat them as normal text in (d). (e) shows the final annotation in IOB2 tag format, with one tag per character.
(c) [[NASDAQ]]は、[[アメリカ合衆国]]にある世界最大の新興企業（[[ベンチャー]]）向け[[株式市場]]である。
(d) [[ORG NASDAQ]]は、[[LOC アメリカ合衆国]]にある世界最大の新興企業（ベンチャー）向け株式市場である。
(e) B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-ORG, O, O, B-LOC, I-LOC, I-LOC, I-LOC, I-LOC, I-LOC, I-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O
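The conversion from (c) to (e) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: `TITLE_TO_CATEGORY` is a hypothetical lookup standing in for the result of the infobox check, and the function name and regular expression are made up for this example.

```python
import re

# Hypothetical lookup from a linked article title to an entity category.
# In practice this would come from the infobox-tag mapping; the entries
# here are illustrative only.
TITLE_TO_CATEGORY = {
    "NASDAQ": "ORG",
    "アメリカ合衆国": "LOC",
}

# Matches [[title]] and [[title|surface text]].
LINK_PATTERN = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def wiki_sentence_to_iob2(sentence, title_to_category):
    """Return (plain_text, tags) with one IOB2 tag per character."""
    chars = []
    tags = []
    position = 0
    for match in LINK_PATTERN.finditer(sentence):
        # Text before the link is outside any entity.
        before = sentence[position:match.start()]
        chars.extend(before)
        tags.extend("O" for _ in before)

        title = match.group(1)
        surface = match.group(2) or title
        category = title_to_category.get(title)
        chars.extend(surface)
        if category is None:
            # Linked page is not a named entity: keep as normal text.
            tags.extend("O" for _ in surface)
        else:
            tags.append("B-" + category)
            tags.extend("I-" + category for _ in surface[1:])
        position = match.end()

    # Remaining text after the last link.
    rest = sentence[position:]
    chars.extend(rest)
    tags.extend("O" for _ in rest)
    return "".join(chars), tags


sentence = "[[NASDAQ]]は、[[アメリカ合衆国]]にある世界最大の新興企業（[[ベンチャー]]）向け[[株式市場]]である。"
text, tags = wiki_sentence_to_iob2(sentence, TITLE_TO_CATEGORY)
```

Here `text` is the sentence with the link markup removed, and `tags` holds one tag per character of `text`, starting with B-ORG for "N" and B-LOC for "ア".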
The model we developed is similar to LSTM-CNNs-CRF, but it takes only characters as input, not words. This is because Japanese writing does not put spaces between words, so tokenization (morphological analysis) is not deterministic. The biggest problem this causes for NER is that word-based models cannot correctly handle cases where word and entity boundaries do not match. Word-based models assume that each word receives one tag and that word boundaries always match entity boundaries. (f) is a case where word and entity boundaries match, while (g) is a mismatched case. Given the tokens in (g), a word-based model can never extract “Ivanka” as a named entity.
(f) イヴァンカ氏 (Ms. Ivanka) → イヴァンカ, 氏 (Ivanka and Ms.)
(g) イヴァンカ氏 (Ms. Ivanka) → イヴァン, カ氏 (Ivan and Fahrenheit)
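The difference between (f) and (g) comes down to whether the entity starts and ends on token boundaries. The small check below is a sketch of that idea; the function name is hypothetical, and for simplicity it only looks at the first occurrence of the entity in the text.

```python
def entity_is_recoverable(tokens, entity):
    """True if `entity` starts and ends exactly on token boundaries.

    A word-based tagger assigns one tag per token, so it can only
    output entities whose span is a union of whole tokens.
    """
    text = "".join(tokens)
    start = text.find(entity)  # first occurrence only (simplification)
    if start < 0:
        return False
    end = start + len(entity)

    # Collect every character offset at which a token starts or ends.
    boundaries = {0}
    offset = 0
    for token in tokens:
        offset += len(token)
        boundaries.add(offset)
    return start in boundaries and end in boundaries


matched = entity_is_recoverable(["イヴァンカ", "氏"], "イヴァンカ")      # case (f)
mismatched = entity_is_recoverable(["イヴァン", "カ氏"], "イヴァンカ")   # case (g)
```

In case (f) the check returns `True`; in case (g) it returns `False`, because the entity ends in the middle of the token カ氏. A character-based model does not have this limitation, since every character gets its own tag.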
The following image is an overview of our NER model. It extracts information about surrounding characters with CNNs and sentence-wide information with a BiLSTM. A linear-chain CRF is added on top to account for label dependencies. Before feeding text to our model, we split it into sentences by using MeCab with a user dictionary. Our model processes one sentence at a time.
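To give a feel for what "label dependencies" means, here is a minimal pure-Python sketch of Viterbi decoding, the standard way to find the best label sequence in a linear-chain CRF. It is not our model code: all scores are made up, the per-character scores stand in for whatever the CNN and BiLSTM layers would produce, and start and end transitions are left out for brevity.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(labels)
    # best[j] is the best score of any path ending in label j so far.
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        new_best = []
        pointers = []
        for j in range(n_labels):
            candidates = [best[i] + transitions[i][j] for i in range(n_labels)]
            i_best = max(range(n_labels), key=candidates.__getitem__)
            new_best.append(candidates[i_best] + scores[j])
            pointers.append(i_best)
        best = new_best
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    last = max(range(n_labels), key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return [labels[i] for i in path]


labels = ["B", "I", "O"]
# Made-up per-character scores; the middle character slightly prefers "I".
emissions = [[0.1, 0.2, 2.0], [1.0, 1.2, 0.0], [0.0, 0.1, 2.0]]

no_constraints = [[0.0] * 3 for _ in range(3)]
# Penalize the invalid transition O -> I (an entity must start with B).
with_constraints = [[0.0] * 3 for _ in range(3)]
with_constraints[2][1] = -1e4

independent = viterbi_decode(emissions, no_constraints, labels)
constrained = viterbi_decode(emissions, with_constraints, labels)
```

With all transition scores at zero, each character is labeled independently and the result is the invalid sequence O, I, O. Once the O → I transition is penalized, the decoder returns O, B, O instead. In a trained CRF the transition scores are learned rather than set by hand.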
The following table shows the tag-level performance of our model measured on the annotated Livedoor corpus. The model is first trained on the Wikipedia data and then fine-tuned on 70% of the Livedoor data. The scores in the table are calculated on the remaining 30% of the Livedoor data. We also decided to merge the tags into B, I, and O tags to minimize the effect of the extremely imbalanced label distribution. Because we merged the tags, a separate model then classifies the extracted entities into the 4 entity categories, which is the same approach as in the Fine-Grained Entity Recognition paper. Our segmentation model achieved F1 scores of 0.88 and 0.89 on the B and I tags, respectively.
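The tag merging and a tag-level score can be sketched as follows. This assumes the scores are counted per character over aligned gold and predicted tag sequences, which is one plausible reading of "tag-level"; the function names and the toy sequences are hypothetical.

```python
def merge_tag(tag):
    """Collapse 'B-ORG' -> 'B', 'I-LOC' -> 'I', and keep 'O' as is."""
    return tag.split("-", 1)[0]


def tag_f1(gold, predicted, tag):
    """Precision, recall and F1 for one tag over aligned tag sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == tag and p == tag for g, p in pairs)
    fp = sum(g != tag and p == tag for g, p in pairs)
    fn = sum(g == tag and p != tag for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: five characters with category-specific gold tags.
gold = [merge_tag(t) for t in ["B-ORG", "I-ORG", "O", "B-LOC", "I-LOC"]]
predicted = ["B", "I", "O", "O", "I"]  # the second entity start is missed

precision_b, recall_b, f1_b = tag_f1(gold, predicted, "B")
```

In this toy case the B tag has precision 1.0 and recall 0.5, because one of the two entity starts is missed.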