Paragraph Vector: What Is It, Exactly?

lukkiddd

4 min read · Jan 7, 2019

When Word2Vec comes up, many people know that it lets you add and subtract word vectors, for example king - man + woman = queen.

**Example of adding and subtracting word vectors**
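The arithmetic can be sketched with a few hand-made vectors. The three-dimensional values below are hypothetical (roughly "royalty", "maleness", "femaleness") and only illustrate the idea; real Word2Vec vectors are learned from text and have far more dimensions.

```python
import numpy as np

# Hypothetical toy vectors: (royalty, maleness, femaleness).
vectors = {
    "king":   np.array([0.9, 0.9, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.9]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.7, 0.9, 0.1]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)  # do not return the input words
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)


nearest = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(nearest)
```

Excluding the input words from the candidates is a common convention, since the query vector usually stays close to the words it was built from.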

Word2Vec has its limits, though. To work at the sentence level, we might have to do one of the following:

Sum or average the vectors of all the words in the sentence, which can lose the word-order information.
Paper: Compositional Operators in Distributional Semantics
Build a parse tree from the sentence, but that only works at the sentence level.
Paper: Parsing Natural Scenes and Natural Language with Recursive Neural Networks
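The loss of word order from averaging is easy to see in a small sketch. The word vectors here are random placeholders (an assumption for illustration, not trained embeddings); the point is only that two sentences with the same words in a different order get the same average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-dimensional word vectors; real ones come from a trained model.
word_vectors = {w: rng.normal(size=4) for w in ["dog", "bites", "man"]}


def average_vector(sentence, word_vectors):
    """Represent a sentence as the mean of its word vectors."""
    return np.mean([word_vectors[w] for w in sentence.split()], axis=0)


a = average_vector("dog bites man", word_vectors)
b = average_vector("man bites dog", word_vectors)

# Same words, different order, (numerically) identical sentence vector.
same = bool(np.allclose(a, b))
print(same)
```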

This article covers the Distributed Memory Model of Paragraph Vectors. It helps solve these problems because it can be applied to sentences or articles of any length.

Outline

What Paragraph Vectors are
How to build Paragraph Vectors
Paragraph Vectors and recommender systems

1. What Are Paragraph Vectors?

Paragraph vectors, also called document vectors (Doc2Vec), were developed as an extension of Word2Vec. The paper Distributed Representations of Sentences and Documents describes two models:

A). Paragraph Vector: Distributed Memory Model

B). Paragraph Vector: Distributed Bag of Words Model

A). Paragraph Vector: Distributed Memory Model

Let's start with the first model. The Distributed Memory Model looks much like Word2Vec's Continuous Bag-of-Words (CBOW): the model predicts a word from its surrounding context.

Let's look at an example. Suppose we want to build paragraph vectors for movies from their descriptions.

Here I define a paragraph, or document, as a movie description.

Take Star Trek: if we were training Word2Vec, it would look something like the picture below.

For more on Word2Vec, see this article:

Word2Vec ทำอย่างไร (How Word2Vec Works)

lukkiddd.com

When we move to Paragraph Vector, we simply create one additional vector whose job is to remember information about all the words in that paragraph or document.

During training we slide the window of words along, just as in Word2Vec, but the document stays the same (the same movie description). For Star Trek, for example, we keep using the same vector (Star Trek) while the words move along.

Paragraph Vector — Distributed Memory Model

In the end we get a set of vectors, each representing the description of one movie.
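The sliding window described above can be sketched by listing the training examples it produces. This is only an illustration of how the inputs are paired, not a trainable model: the function name `pv_dm_examples`, the document ID `star_trek`, and the description text are all hypothetical, and the context is taken from both sides of the target word as in CBOW.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """Yield (doc_id, context_words, target_word) for one document.

    The document ID stays fixed while the word window slides along.
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield doc_id, left + right, target


# Hypothetical movie description, already tokenized.
tokens = "crew of the starship explores deep space".split()
examples = list(pv_dm_examples("star_trek", tokens, window=2))

for doc_id, context, target in examples:
    print(doc_id, context, "->", target)
```

Every example carries the same document ID, which is why the paragraph vector ends up absorbing information from the whole description.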

B). Paragraph Vector: Distributed Bag of Words Model

The other model proposed in the paper is the Distributed Bag of Words Model, which looks much like the Skip-Gram model in Word2Vec.

What the Distributed Bag of Words Model does differently is that the input is the movie's vector, and the targets are words sampled at random from that movie's description.

The difference from the Distributed Memory Model is that Distributed Bag of Words ignores word order: it only cares which words appear in the Star Trek description.
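A comparable sketch for Distributed Bag of Words pairs the document ID with randomly sampled words. It is a simplification under stated assumptions: words are sampled from the whole description rather than from a sampled text window, and `pv_dbow_examples` and the description text are hypothetical names for illustration.

```python
import random


def pv_dbow_examples(doc_id, tokens, num_samples, seed=0):
    """Return (doc_id, target_word) pairs with randomly sampled targets.

    The only input is the document ID; no context words are used,
    so the order of the words in the document does not matter.
    """
    rng = random.Random(seed)
    return [(doc_id, rng.choice(tokens)) for _ in range(num_samples)]


# Hypothetical movie description, already tokenized.
tokens = "crew of the starship explores deep space".split()
pairs = pv_dbow_examples("star_trek", tokens, num_samples=5)

for doc_id, target in pairs:
    print(doc_id, "->", target)
```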

2. How to Build Paragraph Vectors

Building paragraph vectors is not hard. An existing library such as gensim supports both models.

Example

For detailed gensim usage, see the gensim website.

gensim: topic modelling for humans

Efficient topic modelling in Python

radimrehurek.com

More usage examples

RaRe-Technologies/gensim

Topic Modelling for Humans. Contribute to RaRe-Technologies/gensim development by creating an account on GitHub.

github.com

3. Paragraph Vectors and Recommender Systems

Paragraph vectors can be used to build a recommender system. Once we have a vector for each item (for example, each movie) based on its description, we can compare the vectors to find similar items, for instance with cosine similarity. Others have published examples of this.
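The similarity lookup can be sketched as follows. The document vectors are hypothetical hand-made values standing in for trained paragraph vectors, and `recommend` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical paragraph vectors; in practice these come from a trained model.
doc_vectors = {
    "star_trek":     np.array([0.9, 0.8, 0.1]),
    "star_wars":     np.array([0.8, 0.9, 0.2]),
    "notting_hill":  np.array([0.1, 0.2, 0.9]),
    "love_actually": np.array([0.2, 0.1, 0.8]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def recommend(query_title, doc_vectors, top_n=2):
    """Rank all other items by cosine similarity to the query item."""
    query = doc_vectors[query_title]
    scores = {
        title: cosine(query, vec)
        for title, vec in doc_vectors.items()
        if title != query_title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


recommendations = recommend("star_trek", doc_vectors, top_n=2)
print(recommendations)
```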

The first example linked below builds a movie recommender system from review text.

Movie Recommender System Based on Natural Language Processing

Introduction Natural Language Processing (NLP) is rarely used in recommender systems, let alone in movie…

sites.northwestern.edu

Another example is a news recommender system for Reuters.com.

Document vectors in the wild: Building a content recommendation system for Reuters.com - Strata…

James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in…

conferences.oreilly.com

The slides cover Doc2Vec and how the team measured content similarity. They used triplet accuracy as the metric, which comes from the paper Document Embedding with Paragraph Vectors, and they also explain how they tested whether the approach was effective enough for their use case.
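As a rough sketch of the metric, assume each triplet is (anchor, similar, dissimilar), where the first two documents are known to be more related than the first and third. Triplet accuracy is then the share of triplets for which the model agrees. The vectors and triplets below are hypothetical, and this is an illustration of the idea rather than the evaluation code used in the talk or the paper.

```python
import numpy as np

# Hypothetical paragraph vectors; in practice these come from a trained model.
doc_vectors = {
    "star_trek":     np.array([0.9, 0.8, 0.1]),
    "star_wars":     np.array([0.8, 0.9, 0.2]),
    "notting_hill":  np.array([0.1, 0.2, 0.9]),
    "love_actually": np.array([0.2, 0.1, 0.8]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def triplet_accuracy(triplets, doc_vectors):
    """Fraction of triplets where the anchor is closer to the similar
    document than to the dissimilar one."""
    correct = 0
    for anchor, similar, dissimilar in triplets:
        a = doc_vectors[anchor]
        if cosine(a, doc_vectors[similar]) > cosine(a, doc_vectors[dissimilar]):
            correct += 1
    return correct / len(triplets)


# Hypothetical labeled triplets; the last one is deliberately mislabeled.
triplets = [
    ("star_trek", "star_wars", "notting_hill"),
    ("notting_hill", "love_actually", "star_wars"),
    ("star_trek", "love_actually", "star_wars"),
]

accuracy = triplet_accuracy(triplets, doc_vectors)
print(accuracy)
```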

Paragraph vectors can be used for other tasks as well, for example:

Information Retrieval: finding content relevant to a search query
Text Classification: used like a Word2Vec embedding layer, but at the document level

These examples are covered in more detail in the paper Distributed Representations of Sentences and Documents.

Bonus

Even more interesting, the paper Document Embedding with Paragraph Vectors trained Doc2Vec on Wikipedia, then took the paragraph vector for “Lady Gaga”, subtracted the word vector for “American”, and added the word vector for “Japanese”. The result is the Japanese counterpart of Lady Gaga.

The result

Summary

Paragraph vectors can be used with sentences or articles of any length.
There are two models: the Distributed Memory Model (takes word order into account) and Distributed Bag of Words (ignores word order).
Paragraph vectors can be built with the gensim library.
The resulting vectors can be compared for similarity to build a recommender system.
The resulting vectors support arithmetic operations (addition and subtraction).

Closing

If you end up using paragraph vectors for something after reading this, feel free to write about it and share. And if you are interested in Thai natural language processing, come and chat in the ThaiNLP group. See you in the next article. Goodbye! :D
