NLP Pipeline: Word Tokenization (Part 1)

Edward Ma
5 min read · May 21, 2018

To tackle text-related problems in the Machine Learning area, tokenization is one of the common pre-processing steps. In this article, we will go through how we can handle word tokenization and sentence tokenization using three libraries: spaCy, NLTK and jieba (for Chinese words).

spaCy

Step 1: Environment Setup

Install spaCy (2.0.11)

pip install spacy==2.0.11
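spacy.load('en_core_web_sm') in the next step assumes the small English model has already been downloaded. If it has not, it can be fetched with spaCy's own download command:

python -m spacy download en_core_web_sm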

Step 2: Import library

import spacy
spacy_nlp = spacy.load('en_core_web_sm')

Step 3: First test
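The snippets below assume that article and article2 have already been assigned. Based on the output printed further down, a minimal setup might look like this:

# `article` holds the Wikipedia definition of lexical analysis quoted in the
# result below; `article2` is a short string for testing edge cases.
article = (
    "In computer science, lexical analysis, lexing or tokenization is the process of "
    "converting a sequence of characters (such as in a computer program or web page) into "
    "a sequence of tokens (strings with an assigned and thus identified meaning). A program "
    "that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though "
    "scanner is also a term for the first stage of a lexer. A lexer is generally combined "
    "with a parser, which together analyze the syntax of programming languages, web pages, "
    "and so forth."
)
article2 = 'ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456'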

doc = spacy_nlp(article)
tokens = [token.text for token in doc]
print('Original Article: %s' % (article))
print()
print(tokens)

Result:

Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

['In', 'computer', 'science', ',', 'lexical', 'analysis', ',', 'lexing', 'or', 'tokenization', 'is', 'the', 'process', 'of', 'converting', 'a', 'sequence', 'of', 'characters', '(', 'such', 'as', 'in', 'a', 'computer', 'program', 'or', 'web', 'page', ')', 'into', 'a', 'sequence', 'of', 'tokens', '(', 'strings', 'with', 'an', 'assigned', 'and', 'thus', 'identified', 'meaning', ')', '.', 'A', 'program', 'that', 'performs', 'lexical', 'analysis', 'may', 'be', 'termed', 'a', 'lexer', ',', 'tokenizer,[1', ']', 'or', 'scanner', ',', 'though', 'scanner', 'is', 'also', 'a', 'term', 'for', 'the', 'first', 'stage', 'of', 'a', 'lexer', '.', 'A', 'lexer', 'is', 'generally', 'combined', 'with', 'a', 'parser', ',', 'which', 'together', 'analyze', 'the', 'syntax', 'of', 'programming', 'languages', ',', 'web', 'pages', ',', 'and', 'so', 'forth', '.']

Step 4: Second test

print('Original Article: %s' % (article2))
print()
doc = spacy_nlp(article2)
tokens = [token.text for token in doc]
print(tokens)

Result:

Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

['ConcateStringAnd123', 'ConcateSepcialCharacter_!@', '#', '!', '@#$%^&*()_+', '0123456']

spaCy first splits the text on whitespace and then applies language-specific rules such as tokenizer exceptions, prefixes and suffixes.
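The exception rules are exposed through the tokenizer, so extra cases can be registered at runtime. A minimal sketch (the word 'dont' and its split are made up purely for illustration):

from spacy.symbols import ORTH

# Register a hypothetical exception rule: split 'dont' into two tokens.
# The ORTH values must concatenate back to the original string.
spacy_nlp.tokenizer.add_special_case('dont', [{ORTH: 'do'}, {ORTH: 'nt'}])
print([token.text for token in spacy_nlp('I dont know')])
# ['I', 'do', 'nt', 'know']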

The official example is “Let’s go to N.Y.!”; the spaCy documentation illustrates step by step how it gets tokenized: https://spacy.io/usage/spacy-101#annotations-token
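The same example can be reproduced with the pipeline loaded above; the split shown in the documentation is noted in the comment:

doc = spacy_nlp("Let's go to N.Y.!")
print([token.text for token in doc])
# Per the spaCy docs: ['Let', "'s", 'go', 'to', 'N.Y.', '!']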
NLTK

Another library is NLTK. It is the classic toolkit in the NLP area. Let’s go.

Step 1: Environment Setup

pip install nltk==3.2.5

Step 2: Import library

Load corresponding package

import nltk
print('NLTK version: %s' % (nltk.__version__))
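nltk.word_tokenize relies on the Punkt tokenizer models, which are shipped separately from the library; if they are not installed yet, they can be downloaded once:

# Download the Punkt models used by nltk.word_tokenize (first run only).
nltk.download('punkt')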

Step 3: First test

print('Original Article: %s' % (article))
print()
print(nltk.word_tokenize(article))

Result:

Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

['In', 'computer', 'science', ',', 'lexical', 'analysis', ',', 'lexing', 'or', 'tokenization', 'is', 'the', 'process', 'of', 'converting', 'a', 'sequence', 'of', 'characters', '(', 'such', 'as', 'in', 'a', 'computer', 'program', 'or', 'web', 'page', ')', 'into', 'a', 'sequence', 'of', 'tokens', '(', 'strings', 'with', 'an', 'assigned', 'and', 'thus', 'identified', 'meaning', ')', '.', 'A', 'program', 'that', 'performs', 'lexical', 'analysis', 'may', 'be', 'termed', 'a', 'lexer', ',', 'tokenizer', ',', '[', '1', ']', 'or', 'scanner', ',', 'though', 'scanner', 'is', 'also', 'a', 'term', 'for', 'the', 'first', 'stage', 'of', 'a', 'lexer', '.', 'A', 'lexer', 'is', 'generally', 'combined', 'with', 'a', 'parser', ',', 'which', 'together', 'analyze', 'the', 'syntax', 'of', 'programming', 'languages', ',', 'web', 'pages', ',', 'and', 'so', 'forth', '.']

Step 4: Second test

print('Original Article: %s' % (article2))
print()
print(nltk.word_tokenize(article2))

Result:

Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']

The behavior is a little different from spaCy. NLTK treats most special characters as separate “words”, except “_”. Of course, numbers are tokenized as well.

jieba

English word tokenization is easier compared to Chinese. If we want to tokenize Chinese words, we can use jieba.

Step 1: Environment Setup

pip install jieba==0.39

Step 2: Import library

Load corresponding package

import jieba
print('jieba Version: %s' % jieba.__version__)

Step 3: First test

We use lyrics from 崇拜, one of the famous songs by Fish Leong (梁靜茹).

article3 = '你的姿態 你的青睞 我存在在你的存在 你以為愛 就是被愛'
print('Original Article: %s' % (article3))
print()
words = jieba.cut(article3, cut_all=False)
words = [str(word) for word in words]
print(words)

Result:

Original Article: 你的姿態 你的青睞 我存在在你的存在 你以為愛 就是被愛

['你', '的', '姿態', ' ', '你', '的', '青睞', ' ', '我', '存在', '在', '你', '的', '存在', ' ', '你', '以', '為', '愛', ' ', '就是', '被', '愛']
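cut_all=False selects jieba's accurate mode, which returns the most likely segmentation. Setting cut_all=True switches to full mode, which emits every word jieba can find in the text; a quick way to compare the two modes on the same lyric line:

# Accurate mode vs. full mode on the same sentence.
print(list(jieba.cut(article3, cut_all=False)))
print(list(jieba.cut(article3, cut_all=True)))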

Step 4: Second test

article4 = '词法分析是计算机科学中将字符序列转换为标记序列的过程。进行词法分析的程序或者函数叫作词法分析器,也叫扫描器。词法分析器一般以函数的形式存在,供语法分析器调用。'
print('Original Article: %s' % (article4))
print()
words = jieba.cut(article4, cut_all=False)
words = [str(word) for word in words]
print(words)

Result:

Original Article: 词法分析是计算机科学中将字符序列转换为标记序列的过程。进行词法分析的程序或者函数叫作词法分析器,也叫扫描器。词法分析器一般以函数的形式存在,供语法分析器调用。

['词法', '分析', '是', '计算机科学', '中将', '字符', '序列', '转换', '为', '标记', '序列', '的', '过程', '。', '进行', '词法', '分析', '的', '程序', '或者', '函数', '叫作', '词法', '分析器', ',', '也', '叫', '扫描器', '。', '词法', '分析器', '一般', '以', '函数', '的', '形式', '存在', ',', '供', '语法分析', '器', '调用', '。']

jieba does a great job of tokenizing Chinese words (both Simplified and Traditional Chinese).

Conclusion

The demonstration can be found in the Jupyter Notebook.

spaCy seems to be more intelligent about tokenization, and its performance is better than NLTK’s. If you need to tokenize Chinese, jieba is a good choice for you. I also studied spaCy’s (version 2.x) Chinese language implementation: it simply wraps the jieba library. From lang/zh/__init__.py:

# copy from spaCy/lang/zh/__init__.py
class Chinese(Language):
    lang = 'zh'
    Defaults = ChineseDefaults  # override defaults

    def make_doc(self, text):
        try:
            import jieba
        except ImportError:
            raise ImportError("The Chinese tokenizer requires the Jieba library: "
                              "https://github.com/fxsjy/jieba")
        words = list(jieba.cut(text, cut_all=False))
        words = [x for x in words if x]
        return Doc(self.vocab, words=words, spaces=[False]*len(words))
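As a quick sanity check (a sketch assuming spaCy 2.x and jieba are both installed), the wrapped tokenizer can be exercised through the Chinese language class directly; since make_doc delegates to jieba.cut, the tokens should line up with the jieba output shown earlier:

from spacy.lang.zh import Chinese

zh_nlp = Chinese()
doc = zh_nlp(article4)
print([token.text for token in doc])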

On the other hand, Stanford NLP has also released a word tokenization library for multiple languages, including English and Chinese. You may visit the official website if you are interested.


Edward Ma

Focused on Natural Language Processing and Data Science Platform Architecture. https://makcedward.github.io/