Tokenization is one of the most common preprocessing steps when tackling text-related problems in machine learning. In this article, we will go through word tokenization and sentence tokenization using three libraries: spaCy, NLTK, and jieba (for Chinese text).
Step 1: Environment Setup
Install spaCy (2.0.11)
pip install spacy==2.0.11
Step 2: Import library
import spacy
spacy_nlp = spacy.load('en_core_web_sm')
Step 3: First test
doc = spacy_nlp(article)  # `article` holds the text shown under "Result" below
tokens = [token.text for token in doc]
print('Original Article: %s' % (article))
print()
print(tokens)
Result:
Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
['In', 'computer', 'science', ',', 'lexical', 'analysis', ',', 'lexing', 'or', 'tokenization', 'is', 'the', 'process', 'of', 'converting', 'a', 'sequence', 'of', 'characters', '(', 'such', 'as', 'in', 'a', 'computer', 'program', 'or', 'web', 'page', ')', 'into', 'a', 'sequence', 'of', 'tokens', '(', 'strings', 'with', 'an', 'assigned', 'and', 'thus', 'identified', 'meaning', ')', '.', 'A', 'program', 'that', 'performs', 'lexical', 'analysis', 'may', 'be', 'termed', 'a', 'lexer', ',', 'tokenizer,[1', ']', 'or', 'scanner', ',', 'though', 'scanner', 'is', 'also', 'a', 'term', 'for', 'the', 'first', 'stage', 'of', 'a', 'lexer', '.', 'A', 'lexer', 'is', 'generally', 'combined', 'with', 'a', 'parser', ',', 'which', 'together', 'analyze', 'the', 'syntax', 'of', 'programming', 'languages', ',', 'web', 'pages', ',', 'and', 'so', 'forth', '.']
Step 4: Second test
print('Original Article: %s' % (article2))
print()
doc = spacy_nlp(article2)
tokens = [token.text for token in doc]
print(tokens)
Result
Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456
['ConcateStringAnd123', 'ConcateSepcialCharacter_!@', '#', '!', '@#$%^&*()_+', '0123456']
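Besides word tokenization, spaCy can also handle the sentence tokenization mentioned in the introduction. A minimal sketch using a blank pipeline plus the rule-based sentencizer component, so no statistical model download is needed (the `add_pipe` call differs between spaCy 2.x and 3.x, hence the fallback):

```python
import spacy

# Blank English pipeline + rule-based sentencizer (no statistical model needed).
nlp = spacy.blank('en')
try:
    nlp.add_pipe('sentencizer')                    # spaCy 3.x API
except (TypeError, ValueError):
    nlp.add_pipe(nlp.create_pipe('sentencizer'))   # spaCy 2.x API

doc = nlp('Tokenization splits words. Sentence tokenization splits sentences.')
sents = [sent.text for sent in doc.sents]
print(sents)
```

With a full model such as en_core_web_sm, `doc.sents` is available directly from the parsed `doc` without adding a sentencizer.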
spaCy first splits the text on whitespace and then applies tokenization rules such as exception rules, prefixes, and suffixes.
The official example is “Let’s go to N.Y.!”, which tokenizes as ['Let', '’s', 'go', 'to', 'N.Y.', '!'].
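That whitespace-then-rules loop can be illustrated with a toy re-implementation in plain Python. The exception table and punctuation sets below are tiny hypothetical stand-ins for spaCy's real rule files, not spaCy's actual data:

```python
# Toy illustration of spaCy's tokenization loop: split on spaces,
# then check exception rules and peel off prefix/suffix punctuation.
PREFIXES = ('"', '(', '[')
SUFFIXES = ('"', ')', ']', '.', ',', '!', '?')
EXCEPTIONS = {
    "Let's": ["Let", "'s"],   # special-case rule: split the contraction
    "N.Y.": ["N.Y."],         # special-case rule: keep the abbreviation whole
}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        while True:
            if chunk in EXCEPTIONS:            # exception rules win over affix rules
                tokens.extend(prefixes)
                tokens.extend(EXCEPTIONS[chunk])
                break
            if len(chunk) > 1 and chunk.startswith(PREFIXES):
                prefixes.append(chunk[0])      # peel one prefix character
                chunk = chunk[1:]
            elif len(chunk) > 1 and chunk.endswith(SUFFIXES):
                suffixes.insert(0, chunk[-1])  # peel one suffix character
                chunk = chunk[:-1]
            else:
                tokens.extend(prefixes)
                tokens.append(chunk)
                break
        tokens.extend(suffixes)
    return tokens

print(toy_tokenize("Let's go to N.Y.!"))
```

Note how "N.Y.!" first loses its "!" suffix and the remaining "N.Y." then matches an exception rule, so the abbreviation survives intact.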
Another library is NLTK, a classic toolkit in the NLP area. Let’s go.
Step 1: Environment Setup
pip install nltk==3.2.5
Step 2: Import library
Load corresponding package
import nltk
print('NLTK version: %s' % (nltk.__version__))
Step 3: First test
print('Original Article: %s' % (article))
print()
print(nltk.word_tokenize(article))
Result:
Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

['In', 'computer', 'science', ',', 'lexical', 'analysis', ',', 'lexing', 'or', 'tokenization', 'is', 'the', 'process', 'of', 'converting', 'a', 'sequence', 'of', 'characters', '(', 'such', 'as', 'in', 'a', 'computer', 'program', 'or', 'web', 'page', ')', 'into', 'a', 'sequence', 'of', 'tokens', '(', 'strings', 'with', 'an', 'assigned', 'and', 'thus', 'identified', 'meaning', ')', '.', 'A', 'program', 'that', 'performs', 'lexical', 'analysis', 'may', 'be', 'termed', 'a', 'lexer', ',', 'tokenizer', ',', '[', '1', ']', 'or', 'scanner', ',', 'though', 'scanner', 'is', 'also', 'a', 'term', 'for', 'the', 'first', 'stage', 'of', 'a', 'lexer', '.', 'A', 'lexer', 'is', 'generally', 'combined', 'with', 'a', 'parser', ',', 'which', 'together', 'analyze', 'the', 'syntax', 'of', 'programming', 'languages', ',', 'web', 'pages', ',', 'and', 'so', 'forth', '.']
Step 4: Second test
print('Original Article: %s' % (article2))
print()
print(nltk.word_tokenize(article2))
Result:
Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456
['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']
The behavior is a little different from spaCy: NLTK treats most special characters as separate “words”, except “_”. Numbers are tokenized as well, of course.
jieba
English word tokenization is easier compared to Chinese. If we want to tokenize Chinese text, we can use jieba.
Step 1: Environment Setup
pip install jieba==0.39
Step 2: Import library
Load corresponding package
import jieba
print('jieba Version: %s' % jieba.__version__)
Step 3: First test
We use a line from the famous song 崇拜 by Fish Leong (梁靜茹):
article3 = '你的姿態 你的青睞 我存在在你的存在 你以為愛 就是被愛'
print('Original Article: %s' % (article3))
print()
words = jieba.cut(article3, cut_all=False)
words = [str(word) for word in words]
print(words)
Result:
Original Article: 你的姿態 你的青睞 我存在在你的存在 你以為愛 就是被愛
['你', '的', '姿態', ' ', '你', '的', '青睞', ' ', '我', '存在', '在', '你', '的', '存在', ' ', '你', '以', '為', '愛', ' ', '就是', '被', '愛']
Step 4: Second test
article4 = '词法分析是计算机科学中将字符序列转换为标记序列的过程。进行词法分析的程序或者函数叫作词法分析器,也叫扫描器。词法分析器一般以函数的形式存在,供语法分析器调用。'
print('Original Article: %s' % (article4))
print()
words = jieba.cut(article4, cut_all=False)
words = [str(word) for word in words]
print(words)
Result:
Original Article: 词法分析是计算机科学中将字符序列转换为标记序列的过程。进行词法分析的程序或者函数叫作词法分析器,也叫扫描器。词法分析器一般以函数的形式存在,供语法分析器调用。
['词法', '分析', '是', '计算机科学', '中将', '字符', '序列', '转换', '为', '标记', '序列', '的', '过程', '。', '进行', '词法', '分析', '的', '程序', '或者', '函数', '叫作', '词法', '分析器', ',', '也', '叫', '扫描器', '。', '词法', '分析器', '一般', '以', '函数', '的', '形式', '存在', ',', '供', '语法分析', '器', '调用', '。']
jieba does a great job of tokenizing Chinese text (both simplified and traditional Chinese).
Conclusion
The demonstration can be found in the Jupyter Notebook.
spaCy seems to be more intelligent at tokenization, and its performance is better than NLTK’s. If you need to tokenize Chinese text, jieba is a good choice. I also studied spaCy’s (version 2.x) Chinese language implementation: it simply wraps the jieba library. From lang/zh/__init__.py:
# copy from spaCy/lang/zh/__init__.py
class Chinese(Language):
    lang = 'zh'
    Defaults = ChineseDefaults  # override defaults

    def make_doc(self, text):
        try:
            import jieba
        except ImportError:
            raise ImportError("The Chinese tokenizer requires the Jieba library: "
                              "https://github.com/fxsjy/jieba")
        words = list(jieba.cut(text, cut_all=False))
        words = [x for x in words if x]
        return Doc(self.vocab, words=words, spaces=[False] * len(words))
On the other hand, Stanford NLP has also released a word tokenization library for multiple languages, including English and Chinese. You may visit the official website if you are interested.