[NLP] Four Ways to Tokenize Chinese Documents

With some empirical evidence in support of sub-word segmentation techniques

tl;dr: If you’re starting a new Chinese NLP project, use SentencePiece for tokenization (you can optionally do word segmentation before that). Use the unigram language model algorithm unless you’re really short on computing resources, especially at prediction time (in that case, use the byte pair encoding algorithm).

This is a follow-up to a previous post about the ongoing Chinese NLP project, and serves as a sort of monthly update on the project’s progress.

Again, the tokenization techniques introduced here can also be applied to other languages that do not use spaces to separate words. However, the experiment results on this Chinese dataset do not necessarily extend to other languages, so you might want to run your own experiments. We’ll probably provide an easier interface to train and compare different techniques in this project later.

The Empirical Evidence

Let’s cut to the chase and show you the experiment results first. Don’t worry if you feel confused by this. You can skip this section and come back later.

(20181118 Correction: The metric used in the final two columns was actually MSE, not RMSE.)

The results came from sentiment analysis models trained with a quite noisy dataset (introduced in the previous post). This is only one dataset, so please take it with a grain of salt.

Character-level and word-level models clearly underperform compared to the sub-word models (BPE and unigram). The unigram language model technique is only slightly better than the byte pair encoding (BPE) technique; the difference between the two can probably be explained by noise.

Sub-word segmentation was done deterministically, so we did not utilize one of the unigram technique’s strengths: sub-word regularization.

The small dataset is sampled from the full dataset, with 15,000 training examples per class. A validation subset one third the size of the training subset was used. The small and full datasets share the same test subset.

(Comparing log losses/perplexities from the language models wouldn’t make sense, because different tokenization techniques yield different vocabularies, and hence different target sizes.)

Experiment Notebooks

The notebooks are inside the notebooks folder of this project (tag 0.0.3).

The current state of the project contains a lot of redundancy, and a large portion of the code is buried inside notebooks. It’s really embarrassing. We’ll be doing some refactoring in the next development cycle.

Now let’s begin to discuss these four ways of tokenization:

1. Character as a Token

Treat each (in our case, Unicode) character as an individual token. This is the technique used in the previous post.

Pros: This one requires the least preprocessing, and it reflects the fact that each Chinese character has its own meaning.

Cons: The model needs to take the relative positions of characters into account, so bag-of-words/characters features won’t have much predictive power. The model also needs to learn longer-term dependencies than with any other method, and must be more complex in order to learn words.
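As a minimal illustration (plain Python, not the project’s actual pipeline), character-level tokenization amounts to splitting a string into its Unicode characters:

```python
# Character-level tokenization: each Unicode character is one token.
def char_tokenize(text: str) -> list:
    """Split a sentence into individual Unicode characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("我爱自然语言处理"))
# → ['我', '爱', '自', '然', '语', '言', '处', '理']
```

Note that Python 3 strings are sequences of Unicode code points, so no decoding logic is needed for Chinese text.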

2. Word as a Token

Do word segmentation beforehand, and treat each word as a token. Because it works naturally with bag-of-words models, AFAIK this is the most widely used method in Chinese NLP projects.

We used the THULAC analyzer and segmented the entire corpus before tokenization. A slower but more popular tool is jieba.

Pros: It’s how humans read Chinese sentences, and after segmentation the corpus is basically the same as an English one. Most models that work for English corpora can now be applied to this Chinese corpus.

Cons: The correctness of segmentation cannot be guaranteed. Many segmentation tools now claim 90%+ accuracy, but the corpora they were evaluated on tend to be formal writing; few training and evaluation datasets of much noisier casual conversation and writing are publicly available. Also, this method cannot handle unseen words and performs poorly on rare words.
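To see why segmentation is ambiguous, here is a toy dictionary-based segmenter using greedy forward maximum matching. (This is only an illustrative sketch with a made-up vocabulary; THULAC and jieba use statistical models, not this greedy scheme.) The classic example 北京大学生 (“Beijing university students”) gets greedily mis-segmented as 北京大学 (“Peking University”) + 生:

```python
# Toy forward-maximum-matching segmenter (illustrative only).
VOCAB = {"北京", "大学", "北京大学", "生", "学生", "大学生"}  # hypothetical dictionary
MAX_LEN = max(len(w) for w in VOCAB)

def fmm_segment(text: str) -> list:
    """Greedily match the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            # Fall back to a single character when nothing matches.
            if text[i:j] in VOCAB or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(fmm_segment("北京大学生"))  # → ['北京大学', '生'] — greedy, arguably wrong
```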

3. Something in between — Byte-Pair Encoding

Bojanowski et al. (2016) introduced subword units into word embeddings [1]. It represents a word with several character n-grams, which can overlap in the original word. For example, the word <where> can be represented by these n-grams: <wh, whe, her, ere, re>.

What Sennrich et al. (2016) proposed is something simpler that requires no modification of the model structure [2]. It adapts Byte Pair Encoding (BPE) (Gage, 1994), a data compression technique [3], for word segmentation. It first splits the whole sentence into individual characters, then iteratively merges the most frequent pair of adjacent characters or character sequences into larger character sequences.

It’s a blend of character- and word-level encoding: the most common words are represented as single tokens, while others are represented as combinations of several non-overlapping tokens. This helps the model handle rare and unseen words.

In the example above, the word <lowest> will be segmented as <low e s t ·>, where “·” is the special end-of-word symbol.
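The merge-learning procedure can be sketched in a few lines of plain Python, following the reference algorithm in Sennrich et al. (2016) (here “·” stands in for the end-of-word symbol, and the word counts are made up):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Word counts from a toy corpus, pre-split into characters plus "·".
vocab = {"l o w ·": 5, "l o w e r ·": 2, "n e w e s t ·": 6, "w i d e s t ·": 3}
for _ in range(10):  # learn 10 merge operations
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    vocab = merge_pair(pairs.most_common(1)[0][0], vocab)
print(vocab)
```

After these merges, frequent words like <newest> collapse into single tokens, while rarer ones like <wider> would stay split into several subword pieces.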

To put it into practice: SentencePiece from Google (not an official Google product) provides high-performance BPE segmentation and a nice Python module.

You can check the README of SentencePiece, as well as this script from our project, to learn how to fit a BPE model and perform segmentation.

Because the BPE algorithm is very simple, some projects choose to implement it themselves, e.g. the OpenAI generative pre-training model [4].

4. Something in between — Unigram Language Model

BPE is based on a greedy and deterministic symbol replacement, which cannot provide multiple segmentations with probabilities [5].

Kudo (2018) proposed a new subword segmentation algorithm based on a unigram language model [5]. It can be seen as a probabilistic mixture of character, subword, and word segmentations.

One important assumption is that each subword occurs independently, which is highly unlikely, because the occurrences of some subwords should be highly correlated. But this assumption allows us to formulate the probability of a subword sequence simply as the product of the subword occurrence probabilities:
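In the paper’s notation [5], for a subword sequence x = (x₁, …, x_M) the formula is:

```latex
P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i),
\qquad \forall i \ x_i \in \mathcal{V},
\qquad \sum_{x \in \mathcal{V}} p(x) = 1
```

where 𝒱 is a pre-determined vocabulary of subwords.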

And because those subword occurrence probabilities are hidden variables, it uses the EM algorithm that maximize the following marginal likelihood:
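With D the corpus, X⁽ˢ⁾ the s-th sentence, and S(X) the set of segmentation candidates for X, the marginal likelihood from the paper [5] is:

```latex
\mathcal{L} = \sum_{s=1}^{|D|} \log\!\left( P\!\left(X^{(s)}\right) \right)
            = \sum_{s=1}^{|D|} \log\!\left( \sum_{\mathbf{x} \in S\left(X^{(s)}\right)} P(\mathbf{x}) \right)
```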

If we read it from right to left, we’ll see that it first calculates the sum of probabilities of all segmentation candidates, takes log, and then sums the results from all sentences together.

(Honestly, I haven’t fully understood this model yet. It seems that it’s not learning the actual subword occurrence probabilities, since those could be estimated just by counting. I couldn’t find an intuitive answer to what is actually learned by maximizing the marginal likelihood. Maybe it is related to my suspicion that segmentation candidates are being truncated or sampled, since they grow exponentially with sentence length.)

Now we can do subword regularization by picking the l-best segmentations according to the conditional probability P(x|X) and sampling from them at training time (and optionally, at testing time).

Even when used deterministically, the unigram algorithm empirically produces better sub-words than BPE (thanks to Jeremy Howard for the tip; the tweet in question was referring to SentencePiece, which uses the unigram algorithm by default).

Pre-tokenization of Subword Algorithms

In the experiments, word segmentation was performed before the BPE or unigram algorithm. In doing so, the encoding becomes essentially the same as for English.

But it doesn’t have to be. SentencePiece is efficient enough that we can feed it raw sentences and let it figure out how to segment them without the intervention of a third-party word analyzer:

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese, Japanese and Korean where no explicit spaces exist between words.

This can be helpful both in terms of pipeline complexity and internal noise. As we mentioned, word analyzers can be problematic, especially with corpora that differ from the training data of those analyzers. Additionally, subword regularization could be more effective if we allow it to incorporate the uncertainty in word segmentation.

I haven’t tried using SentencePiece without pre-tokenization yet, but will do so in the next few development cycles.


Thanks for reading. The project did not progress as fast as planned, partly because I was on the road for half of September. While organizing the experiment results, I thought a blog post introducing these approaches could be helpful to people. Please feel free to leave constructive criticism, or give this post some claps if you feel it helped you. Thanks again!


  1. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors with Subword Information. [link]
  2. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. [link]
  3. Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2), 23–38.
  4. Radford, A., & Salimans, T. (2018). Improving Language Understanding by Generative Pre-Training. [link]
  5. Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. [link]