Handling out-of-vocabulary problem in Indonesian named entity recognition

Kemal Kurniawan
Kata.ai Tech Blog · Oct 24, 2018

This is a blog post for our paper Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts. This post only contains a few interesting findings from the paper. Interested readers should consult the full paper for more.

Introduction

At Kata.ai, we have a neural network model for sequence labeling tasks such as named-entity recognition (NER). In this task, we are interested in automatically detecting named entities (e.g., locations, person names) in a sentence. For instance, given the sentence “Berapa perbedaan poin Valentino Rossi dengan Jorge Lorenzo?” (What is the point difference between Valentino Rossi and Jorge Lorenzo?), a NER system should recognize that “Valentino Rossi” and “Jorge Lorenzo” are person names. NER is crucial for conversational agents because it can detect, for example, the user’s name, or the origin and destination cities in a flight booking chatbot.

Two example sentences with highlighted named entities.

Typical neural network models for a task like NER accept a sequence of word embeddings — vector representations of the words in a sentence — as their input. The values of these embeddings are usually learned during training. As a consequence, such models can only learn the embeddings of words that occur in the training data. At test time (i.e., when the model is deployed to production), there will be words that never occurred in the training data. These words are called out-of-vocabulary (OOV) words, and they pose a problem: since OOV words never occur in the training data, the model never learns their embeddings, and thus cannot represent them as input. Note that this OOV problem is especially apparent in the conversational domain (e.g., chatbots) due to the creative writing style.

A common remedy to this problem is to replace rare words (e.g., occur only once) in the training data with a special token for unknown words — say <UNK> — and learn its embedding during training. Then, at test time, all OOV words in the input sentence are replaced with this <UNK> token before feeding their word embeddings as input to the model. Therefore, all OOV words get a single vector representation.
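This replace-with-&lt;UNK&gt; recipe can be sketched in a few lines of Python. The vocabulary, the example Indonesian words, and the frequency threshold below are made up for illustration:

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(train_sentences, min_count=2):
    """Keep words occurring at least min_count times in the training data;
    rarer words will map to the <UNK> token."""
    counts = Counter(w for sent in train_sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add(UNK)
    return vocab

def replace_oov(sentence, vocab):
    """Replace any word outside the vocabulary with <UNK>."""
    return [w if w in vocab else UNK for w in sentence]

train = [["berapa", "harga", "tiket"],
         ["berapa", "harga", "kamar"],
         ["pesan", "tiket"]]
vocab = build_vocab(train, min_count=2)  # rare words "kamar", "pesan" are dropped
print(replace_oov(["berapa", "harga", "makanan"], vocab))
# ['berapa', 'harga', '<UNK>']
```

During training, the model then learns one embedding for `<UNK>` alongside the embeddings of the in-vocabulary words, and every OOV word at test time reuses that single embedding.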

The above approach to handling the OOV problem breaks down when the OOV rate (i.e., the percentage of OOV words in the input) is high, because the model would “see” most words as <UNK>. As an extreme example, if all words in the input sentence are OOV, then all the model sees is <UNK>. It is unreasonable to expect the model to recognize entities in such a case; all words look the same! The approach also ignores that words can share similar affixes/characters, which can be exploited to better estimate their embeddings. Fortunately, there is a rich literature addressing this problem, and one example is Rei et al. (2016). They proposed combining the word embedding with a composition of its character embeddings — vector representations of the characters in a word — and using that combination as the input instead. Intuitively, working with characters mitigates the OOV problem because, ideally, the training data is large enough to contain all possible characters. There would then be no OOV characters, which means the embedding of an OOV word can be approximated by the composition of its character embeddings. They tested their method on sequence labeling tasks such as POS tagging and NER for English and achieved good results. Thus, we decided to test their approach on our Indonesian NER dataset. Note that we won’t explain Rei et al.’s method in this post. Interested readers should consult their original paper for details.
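To make the intuition concrete: Rei et al. compose character embeddings with a bidirectional LSTM, but the idea can be illustrated with a much simpler stand-in. The sketch below uses toy random embedding tables and a plain mean over character embeddings (not their actual model; the dimensions and vocabulary are invented) to show why an OOV word still gets a usable, word-specific representation:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_WORD, DIM_CHAR = 8, 4

# Toy embedding tables; in practice these values are learned during training.
word_emb = {w: rng.normal(size=DIM_WORD) for w in ["berapa", "harga", "<UNK>"]}
char_emb = {c: rng.normal(size=DIM_CHAR) for c in "abcdefghijklmnopqrstuvwxyz"}

def represent(word):
    """Concatenate the word embedding (falling back to <UNK> for OOV words)
    with a character-level composition. A mean over character embeddings
    stands in for Rei et al.'s character BiLSTM; assumes lowercase ASCII."""
    w = word_emb.get(word, word_emb["<UNK>"])
    chars = np.mean([char_emb[c] for c in word if c in char_emb], axis=0)
    return np.concatenate([w, chars])

# Both words are OOV, so their word parts are identical (<UNK>), but their
# character parts differ: the model can still tell them apart.
v1, v2 = represent("makanan"), represent("minuman")
```

The concatenation variant in the paper combines the two parts exactly like this final `np.concatenate`; the attention variant instead learns a gating between them.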

Methodology

In this section, we explain a bit about how the experiments are done. This section won’t include all the details.

Dataset

We used two datasets: SMALL-TALK and TASK-ORIENTED. SMALL-TALK contains conversations of our users interacting with our chatbot Jemma. TASK-ORIENTED was obtained from the YesBoss service and mostly contains imperative sentences such as ordering food or booking a flight.

Sample sentences from each dataset.

We split each dataset into three parts: training, development, and test sets. The table below shows the statistics of both datasets, including the OOV rate: the percentage of unique words that do not occur in the training set.
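Under that definition, computing the OOV rate of a split is straightforward. The sentences below are invented just to exercise the function:

```python
def oov_rate(train_sentences, eval_sentences):
    """Percentage of unique words in the evaluation set that never
    occur in the training set."""
    train_vocab = {w for sent in train_sentences for w in sent}
    eval_vocab = {w for sent in eval_sentences for w in sent}
    oov = eval_vocab - train_vocab
    return 100.0 * len(oov) / len(eval_vocab)

train = [["berapa", "harga", "tiket"], ["pesan", "tiket"]]
test = [["berapa", "harga", "kamar"], ["pesan", "makanan"]]
print(f"{oov_rate(train, test):.1f}%")
# 2 of the 5 unique test words ("kamar", "makanan") are OOV -> 40.0%
```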

Sentence length (L), number of sentences (N), and OOV rate (O) of each dataset.

Comparisons

As comparisons, we used a very simple model that memorizes the mapping from word to entity tag on the training set (Memo). At test time, it outputs the tag it has memorized for each word it encounters. We also used conditional random fields (CRF) as another comparison, as CRF is a very common non-neural model for sequence labeling tasks like NER. As for Rei et al.’s model, we compared all three variants: word embedding-only (Word), concatenation (Concat), and attention (Attn). Concatenation and attention refer to how the composition of character embeddings is combined with the word embedding.
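The Memo baseline fits in a few lines. The details below (picking the most frequent tag for ambiguous words, and emitting the “O” outside-entity tag for unseen words) are one reasonable reading of the baseline, not necessarily the exact choices in the paper:

```python
from collections import Counter, defaultdict

class Memo:
    """Baseline that memorizes, for each training word, its most frequent tag."""
    def __init__(self):
        self.tag_counts = defaultdict(Counter)

    def fit(self, sentences, tag_seqs):
        for words, tags in zip(sentences, tag_seqs):
            for w, t in zip(words, tags):
                self.tag_counts[w][t] += 1

    def predict(self, words):
        # Words never seen in training get the "O" (outside any entity) tag.
        return [self.tag_counts[w].most_common(1)[0][0] if w in self.tag_counts
                else "O" for w in words]

memo = Memo()
memo.fit([["tiket", "ke", "jakarta"]], [["O", "O", "B-LOC"]])
print(memo.predict(["tiket", "ke", "bandung"]))
# ['O', 'O', 'O']  ("bandung" was never seen, so it is tagged O)
```

This makes the baseline’s weakness obvious: by construction it can never tag an OOV word as an entity, which is exactly why it suffers at high OOV rates.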

Evaluation

Evaluation is done with the CoNLL evaluation scheme; that is, only exact matches are counted. From these matches, the F1 score is then computed. If you’re not familiar with the F1 score, think of it as accuracy: the higher, the better.
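For readers who want the scheme spelled out: a predicted entity counts only if both its span and its type exactly match a gold entity, and the F1 score is the harmonic mean of precision and recall over these matches. The sketch below represents entities as `(start, end, type)` tuples (a simplification of the official conlleval script, which works on BIO-tagged tokens):

```python
def f1_exact(gold_entities, pred_entities):
    """Entity-level F1: a prediction counts only if its span AND type
    exactly match a gold entity (CoNLL-style exact matching)."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)  # true positives: exact matches
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(3, 5, "PER"), (6, 8, "PER")]
pred = [(3, 5, "PER"), (6, 7, "PER")]  # second span only partially matches
print(f"{f1_exact(gold, pred):.2f}")
# only 1 exact match -> precision = recall = 0.5 -> F1 = 0.50
```

Note how unforgiving exact matching is: getting one boundary token wrong scores the same as missing the entity entirely.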

Results and Discussion

The table below reports the F1 score of each model on the test set of each dataset.

Test F1 score of each model.

From the table, we see that all neural models are better than the non-neural ones. We also see that the word embedding-only model is consistently worse than both the concatenation and attention models, suggesting that character embeddings indeed help.

Varying out-of-vocabulary rate

To test our hypothesis that character embeddings help when the OOV rate is high, we conducted another experiment in which we varied the OOV rate. We compared the three variants of the neural model; the results are reported below.

Test F1 score for each neural model evaluated on both datasets when OOV rate is varied.

From the figures, we see that on both datasets, the models that incorporate character embeddings, either by concatenation or attention, outperform the word embedding-only model when the OOV rate is high. Their performance is very stable even when the OOV rate is as high as 90%, whereas the word embedding-only model fails completely. This finding strongly suggests that incorporating character embeddings is indeed effective in mitigating the OOV problem.

Conclusion

The high OOV rate in conversational texts poses a challenging problem for a NER system in a conversational agent. However, neural network models that incorporate character embeddings can mitigate the issue effectively: we found that such models still perform well even when the OOV rate is as high as 90%. The character models are implemented in our NL Studio in Kata Platform to handle our chatbot conversations.
