Getting Started with the NLTK Library

Nuttapong Suptawepong

Published in Super AI Engineer

2 min read · Jan 8, 2021

Hello, my name is Nuttapong Suptawepong. I am a participant in the Super AI Engineer program, ID 22p12c0765.

In this article, I share some introductory knowledge about using the NLTK (Natural Language Toolkit) library.

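Start by installing the NLTK library. The leading ! runs the command as a shell command from a notebook cell, such as in Google Colab.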

!pip install nltk

Next, import NLTK and download its data packages.

import nltk
nltk.download("all")
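Downloading "all" fetches every corpus and model in the NLTK index, which can be slow. As a lighter alternative, the sketch below requests only the resources this article relies on. The resource list is an assumption: which identifiers are needed (for example "punkt" versus "punkt_tab", or whether "omw-1.4" is required) depends on the installed NLTK version, so the loop simply records which downloads succeeded.

```python
import nltk

# Assumed resource names for the tokenizers and the WordNet lemmatizer.
# Older NLTK releases use "punkt"; newer ones look for "punkt_tab".
RESOURCES = ["punkt", "punkt_tab", "wordnet", "omw-1.4"]

status = {}
for name in RESOURCES:
    # nltk.download reports success or failure instead of raising by default
    status[name] = bool(nltk.download(name, quiet=True))

for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'not available'}")
```

If a later step complains about a missing resource, the error message names the identifier to add to the list.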

Word Tokenization and Sentence Tokenization

Tokenization means splitting text into smaller units, either words (word tokenization) or sentences (sentence tokenization).

Start by importing the relevant functions and defining the variable text, which is used for both word tokenization and sentence tokenization.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat cardboard."

To split the text into individual words, use word_tokenize.

word_tokenize(text)

The result is a list of words. Punctuation marks become separate tokens, while the abbreviation Mr. keeps its period:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.',
 'The', 'sky', 'is', 'pinkish-blue', '.',
 'You', 'should', 'not', 'eat', 'cardboard', '.']

To split the text into sentences, use sent_tokenize.

sent_tokenize(text)

The result is a list of sentences:
['Hello Mr. Smith, how are you doing today?',
 'The weather is great and Python is awesome.',
 'The sky is pinkish-blue.',
 'You should not eat cardboard.']
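To see what the NLTK tokenizers add, it can help to compare them with a naive baseline. The sketch below uses only Python's re module and is not how NLTK works internally; both regular expressions are illustrative assumptions. The point is that simple rules trip over the abbreviation "Mr.", which the NLTK results above handle correctly.

```python
import re

text = ("Hello Mr. Smith, how are you doing today? The weather is great and "
        "Python is awesome. The sky is pinkish-blue. You should not eat cardboard.")

# Naive word split: runs of word characters (hyphenated words kept together),
# or any single character that is neither a word character nor whitespace.
naive_words = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

# Naive sentence split: break on whitespace that follows ., ? or !
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

print(naive_words[:4])       # ['Hello', 'Mr', '.', 'Smith']
print(len(naive_sentences))  # 5
print(naive_sentences[:2])   # ['Hello Mr.', 'Smith, how are you doing today?']
```

The naive version splits "Mr." into two tokens and cuts the first sentence in half, giving five pieces instead of the four sentences that sent_tokenize returns.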

Lemmatization

Lemmatization converts a word to its base (dictionary) form. For example, is, am, and are all become be after lemmatization.

Start by importing the relevant names and defining the variable list_words, which is used for lemmatization.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

list_words = ["playing", "plays", "played", "play", "is", "am", "are", "be", "goose", "geese", "mouse", "mice"]

Then create a WordNetLemmatizer object.

lemmatizer = WordNetLemmatizer()

Call lemmatizer.lemmatize to convert each word to its base form. The parameter pos=wordnet.VERB tells the lemmatizer to treat every word as a verb, so only verb forms are reduced.

[lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in list_words]

As a result, verb forms are reduced to their base form: playing, plays, and played become play, and is, am, and are become be. The remaining words are left unchanged:
['play', 'play', 'play', 'play',
 'be', 'be', 'be', 'be',
 'goose', 'geese', 'mouse', 'mice']

If the parameter is set to pos=wordnet.NOUN instead, the lemmatizer treats every word as a noun.

[lemmatizer.lemmatize(word, pos=wordnet.NOUN) for word in list_words]

As a result, noun forms are reduced to their base form: geese becomes goose and mice becomes mouse. Note that plays also becomes play, because it is read as the plural of the noun play. The verb forms playing, played, is, am, and are are left unchanged:
['playing', 'play', 'played', 'play',
 'is', 'am', 'are', 'be',
 'goose', 'goose', 'mouse', 'mouse']

All of the code in this article is available on Google Colab at https://colab.research.google.com/drive/19hLNvXMlxNcNzTNff-HwNZ1bHh__VyVf?usp=sharing
