Chinese Word Segmentation

The challenges of splitting Chinese sentences into words

Wai Ching Leung
5 min read · Jan 4, 2022

One of the standard preprocessing steps for downstream NLP tasks is tokenization or segmentation. This step is basically splitting raw text into words. In English, this is a rather simple task because words are separated by a space. There are some slightly more complicated cases such as splitting possessive affixes from possessors or splitting a contracted word into two words — but still, they are easy to handle.

In Chinese, we usually call this task segmentation. As some of you may already know, words are not separated by a space in Chinese, and this has caused a huge headache in word segmentation. One of the issues is that without the help of space, ambiguities arise because characters, which are the building blocks of words in Chinese, can combine with other characters in different ways to form different words; characters alone can also be words.

To further clarify, consider this example:

“你们研究所有十个图书馆”

Interpretation 1: 你们(“you”)/研究(“to study”)/所有(“all”)/十(“ten”)/个(classifier)/图书馆(“library”)

Interpretation 2: 你们(“you”)/研究所(“institute”)/有(“to have”)/十(“ten”)/个(classifier)/图书馆(“library”)

For humans, deriving the appropriate segmentation is easy: we only need to know the context of the conversation. For computers, however, it is not a trivial task, because deriving context requires parsing the surrounding sentences. And how do we guarantee the right parsing of the surrounding sentences if we can’t guarantee that they are segmented correctly?
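To make the ambiguity concrete, here is a toy sketch (my own illustration, not any library’s actual algorithm): classic greedy forward vs. backward maximum matching over a hypothetical mini-dictionary can produce exactly the two readings above, depending only on which end of the sentence you scan from.

```python
# Hypothetical mini-dictionary for the example sentence.
VOCAB = {"你们", "研究", "研究所", "所有", "有", "十", "个", "图书馆"}
MAX_LEN = max(len(w) for w in VOCAB)

def forward_mm(text):
    """Greedily match the longest dictionary word starting from the left."""
    i, out = 0, []
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            # Fall back to a single character if nothing matches.
            if text[i:j] in VOCAB or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

def backward_mm(text):
    """Greedily match the longest dictionary word ending at the right."""
    i, out = len(text), []
    while i > 0:
        for j in range(max(0, i - MAX_LEN), i):
            if text[j:i] in VOCAB or j == i - 1:
                out.append(text[j:i])
                i = j
                break
    return out[::-1]

sentence = "你们研究所有十个图书馆"
print(forward_mm(sentence))   # ['你们', '研究所', '有', '十', '个', '图书馆']
print(backward_mm(sentence))  # ['你们', '研究', '所有', '十', '个', '图书馆']
```

The forward pass eagerly grabs 研究所 (“institute”) and yields interpretation 2, while the backward pass yields interpretation 1 — with the same dictionary.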

Another challenge is that linguists don’t always agree on how words should be segmented. For example, the term “生物学”, which means “(the study of) biology”, can be kept as one single word or segmented into two units: “生物” (“living organisms”) and “学” (“study of”).

It all comes down to the definition of what a word is — and yes, there have been debates over this definition.

To readers: what does a word mean to you? And how would you segment “生物学”?

I used these two examples just to give you a (bitter) taste of what Chinese segmentation is like — hopefully by now, you understand why this task is not as simple as tokenization in English.

Chinese Word Segmentation with LAC, Jieba, Stanza and SnowNLP

Luckily, there are multiple off-the-shelf pre-trained libraries that we can use for Chinese segmentation. I played around with four commonly used libraries for Chinese segmentation, LAC, Jieba, Stanza, and SnowNLP, to see how they segment these two sentences:

(A headline from a New York Times article dated 28 Dec 2021)

“我是一名ICU医生,疫情面前我们仍有理由抱有希望”

“I am an ICU doctor, we still have the reason to have hope amid the pandemic”

(A well-known Chinese proverb)

“一寸光阴一寸金,寸金难买寸光阴”

“An interval of time is worth an ounce of gold, but money cannot buy time”/ “time is precious”

Here is how we can use the libraries in Python:

LAC

LAC stands for Lexical Analysis of Chinese. It is a tool developed by Baidu.

from LAC import LAC

lac = LAC(mode="seg")  # segmentation-only mode
seg_result = lac.run(sentence)  # returns a list of words

Jieba

Jieba is a module that is specifically used for Chinese word segmentation.

import jieba

seg_result = jieba.lcut(sentence)  # returns a list
# alternatively:
# seg_result = jieba.cut(sentence)  # returns a generator

Stanza

This library was created by the Stanford NLP Group. It contains different tools for linguistic analysis such as POS tagging, lemmatization and segmentation, and it supports 66 languages, including Chinese.

import stanza

stanza.download('zh', processors='tokenize')
nlp = stanza.Pipeline('zh', processors='tokenize')
doc = nlp(sentence)

seg_result = []
for sent in doc.sentences:
    for word in sent.words:
        seg_result.append(word.text)

SnowNLP

Inspired by TextBlob, this library was created to specifically process Chinese. It also has a built-in sentiment analyser for your convenience.

from snownlp import SnowNLP
s = SnowNLP(sentence)
seg_result = s.words

Let’s take a look at the results:

For the sentence “我是一名ICU医生,疫情面前我们仍有理由抱有希望”, we can see some variations in segmenting “一名” and “抱有”.
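When segmenters disagree like this, it helps to quantify the disagreement. Here is a small stdlib-only sketch (my own helper, not part of any of the libraries above) that converts each segmentation into a set of character spans and computes the standard word-level F1 between them:

```python
def spans(words):
    """Map a word list to a set of (start, end) character spans."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def seg_f1(pred, gold):
    """Word-level F1 between two segmentations of the same text."""
    p, g = spans(pred), spans(gold)
    correct = len(p & g)
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)

# Two hypothetical segmentations differing only on 一名:
a = ["我", "是", "一", "名", "ICU", "医生"]
b = ["我", "是", "一名", "ICU", "医生"]
print(round(seg_f1(a, b), 3))  # 0.727
```

A single one-word disagreement already costs noticeable F1, which is why segmentation choices matter for downstream evaluation.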

I think “一名” should be split into two words because “一” (which means “one”) and “名” (a classifier for people) do not combine to form a different meaning in the sentence. In other words, they each retain their individual meaning.

In fact, “一名” functions much like a phrase such as “a group”. So if you agree that “a group” (as in “there is a group of students”) should be split into two words, “a” and “group”, you should also agree that “一名” should be split into “一” and “名”.

For “抱有”, which means “to have”, I think it should be considered one single word because the character “抱” (“to hug”) does not retain its meaning in the sentence. Unless… the doctor really wants to hug hope :)

Rather, “抱” combines with “有” (“to have”) to form a new word meaning “to have”. If you are wondering about the difference between “有” and “抱有”, here is a very simple explanation: although they bear similar semantic meanings, “抱有” and “有” differ in usage.

I did not expect the segmenters to perform well on the proverb because its structure and expressions are archaic and rare. For example, to express the concept of time, we usually use “时间” instead of “光阴” in everyday conversation and writing.

Some final thoughts on Chinese word segmentation for NLP

We have tested the libraries on only two sentences and have already seen considerable variation in the results. So, as you can imagine, Chinese word segmentation is quite a challenging preprocessing task. This is why I always pay close attention to segmentation when preprocessing Chinese text: I don’t want errors cascading into downstream tasks and eventually affecting model results.

Finally, I want to reiterate: just because word tokenization in English is relatively easy does not mean it is easy in Chinese, or in any other language.

These are just some thoughts I have about Chinese word segmentation. Let me know if you agree or disagree — I am always happy to discuss anything related to Chinese NLP, and am open to new ideas :). I would also love to hear your experience with other Chinese segmentation libraries/methods!

