Getting You Interested in the Basic Concepts of Word2Vec

Published in DAAI

8 min read · Dec 15, 2018

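Let me start with a question: if there were a tool that could transform one role into another, would you use it? For example, if taking king, subtracting man, and adding woman gave you queen, wouldn't you find that tool interesting? This idea first appeared in Tomas Mikolov's paper Linguistic Regularities in Continuous Space Word Representations.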

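That's right! The tool is Word2Vec. In machine learning it falls under unsupervised learning, because it is trained on unlabelled text, and it comes in two main models: CBOW (Continuous Bag of Words) and Skip-gram. It is a way of producing word vectors (also called distributed representations or word embeddings) from a text corpus. In other words, it maps words from the space they currently live in to a space of a different dimensionality. And of course, its main field of application is NLP (Natural Language Processing).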

Benefits:

Dimensionality reduction
Semantic relationships between words are captured

Advantages:

1. Can measure similarity easily

2. Learned representations from huge amounts of unlabelled data

3. Linear regularities among words (king - man + woman = queen)

BUT: does require huge amounts of data

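So what should you do? Depending on your application, do your best to tune Word2Vec's hyperparameters and parameters, train on good-quality data to get high-quality word vectors, and then put them to use!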
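To make the similarity and linear-regularity points above a bit more concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are hypothetical, hand-picked values chosen only so the arithmetic is easy to follow; real Word2Vec vectors are learned from a corpus and usually have hundreds of dimensions. The helper names `cosine` and `analogy` are illustrative, not part of any library.

```python
import math

# Hypothetical toy vectors (NOT learned from data), for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", vectors)
print(result)
```

With these toy values, king - man + woman lands on the queen vector, so `analogy` picks queen over apple. With real embeddings the match is only approximate, which is why the nearest neighbour by cosine similarity is used.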

Two Basic Neural Network Models

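As the figure shows, there are two neural models you can train, CBOW and Skip-gram. Each has three layers: the first is the input layer, where the data comes in; the second is the projection layer (which mainly produces the projection matrix, i.e. the weights); and the last is the output layer. CBOW infers the target word from its surrounding context (context as input, target word as output), while Skip-gram does exactly the opposite. According to the paper Efficient Estimation of Word Representations in Vector Space, Skip-gram takes longer to train than CBOW on the same data, and in NLP applications CBOW is better suited to syntactic tasks while Skip-gram is better suited to semantic ones.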

From that context, predict the target word (Continuous Bag of Words or CBOW approach)
From the target word, predict the context it came from (Skip-gram approach)

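The author, Tomas Mikolov, wanted each word to express the meaning it carries, and came up with an idea: perhaps the meaning of a word can be represented by the words around it (its context). It's just like how the friends around you reflect what kind of person you are. Pretty philosophical, isn't it? >0< Here's an example:
Sentence 1: Dancing is my favorite activity
Sentence 2: Singing is my favorite activity
From the shared context of these two sentences, we can infer that dancing and singing are similar to some degree.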
The whole intuition behind the Word2Vec approach consists of representing a word based on its context. This means that words appearing in similar contexts will be similarly embedded.

Skip-gram Training Samples

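Let's take skip-gram as an example. Suppose we have the sentence The quick brown fox jumps over the lazy dog and a window size of 2. As shown on the right side of the figure, this produces pairs of training samples that make up the training set. The word highlighted in blue is the input, and the word it is paired with is the label; the pairs are then trained with the architecture described in the next section.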
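The pair generation described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are split on whitespace, the window is symmetric, and the function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window_size=2):
    """Build (input word, context word) pairs for skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look at up to `window_size` words on each side of the center word.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The quick brown fox jumps over the lazy dog"
pairs = skipgram_pairs(sentence.split(), window_size=2)

# Show the pairs generated for the first two center words.
for pair in pairs[:5]:
    print(pair)
```

Words near the start or end of the sentence simply get fewer pairs, because the window is clipped at the sentence boundary.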

Architecture of Our Neural Network

Input Layer

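Each word in the corpus is represented in one-hot form and used as the input, so the length of the input layer equals the total number of distinct words in your corpus, denoted V here (the 10000 positions in the figure). N (the 300 neurons in the figure) is the dimension of the embedding.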

Ex: The quick brown fox jumps over the lazy dog (here V is 9, counting The and the as separate tokens)

The = [1, 0, 0, 0, 0, 0, 0, 0, 0]

quick = [0, 1, 0, 0, 0, 0, 0, 0, 0]

brown = [0, 0, 1, 0, 0, 0, 0, 0, 0]

…

….
dog = [0, 0, 0, 0, 0, 0, 0, 0, 1]
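As a small sketch of how these one-hot vectors could be built, the snippet below assigns each distinct token an index in order of first appearance. It assumes whitespace tokenization and case-sensitive tokens (so The and the are different words); `build_vocab` and `one_hot` are illustrative names.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a list of length V with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

tokens = "The quick brown fox jumps over the lazy dog".split()
vocab = build_vocab(tokens)

print(len(vocab))             # V
print(one_hot("The", vocab))
print(one_hot("dog", vocab))
```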

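So... why use one-hot vectors as the input? With a one-hot representation, every word is orthogonal to every other word (that is, the similarity between any two words is 0), and multiplying the input matrix by the projection matrix (the weights) has two big advantages: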

Each row of the weight matrix (V*N) can serve as the final trained embedding of one word
The product itself amounts to nothing more than a table lookup
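The lookup property can be checked directly. The sketch below uses plain Python lists to stay dependency-free; the weights are random placeholders (an untrained matrix), and N is set to 3 instead of 300 purely to keep the example small.

```python
import random

V, N = 9, 3  # vocabulary size and embedding dimension (N kept tiny here)

random.seed(0)
# Placeholder V x N weight matrix; in practice these values are learned.
W = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(V)]

def vec_mat(x, matrix):
    """Multiply a length-V row vector by a V x N matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(x[i] * matrix[i][j] for i in range(rows)) for j in range(cols)]

index = 3            # position of the word in the vocabulary
x = [0] * V
x[index] = 1         # one-hot input vector

projected = vec_mat(x, W)

# Multiplying by a one-hot vector just selects row `index` of W.
print(projected == W[index])
```

This is why implementations typically index into the weight matrix instead of performing the full multiplication.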

Projection Layer

The word embeddings live in this layer. Since its input has length V, the matrix here has size V*N.

Hidden Layer

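This is the layer where meaning is hidden. There is no concrete proof of what this layer means physically, but you can think of it as an adapter sitting in the middle that connects the two spaces. It comes right after the projection layer, so we can deduce that its length is N.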

Output Layer

Finally, Softmax() outputs the predicted probability for every word, so its length is V.
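Putting the layers together, a forward pass might look roughly like the sketch below. It is a simplified illustration, not the actual Word2Vec implementation: the weights are random and untrained, `W_out` (an N x V matrix from the hidden layer to the output scores) is an assumption added to complete the picture, and a plain softmax over all V words is used exactly as described above.

```python
import math
import random

V, N = 9, 3  # vocabulary size and hidden size (300 in the figure, tiny here)

random.seed(0)
# Untrained placeholder weights: W_in is V x N, W_out is N x V (assumed).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_index):
    """Input word index -> probability for each of the V vocabulary words."""
    hidden = W_in[word_index]  # one-hot input times W_in is a row lookup
    scores = [sum(hidden[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    return softmax(scores)

probs = forward(3)
print(len(probs), round(sum(probs), 6))
```

The output has one probability per vocabulary word, which matches the length V stated above; training would adjust both weight matrices so that real context words receive higher probability.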

Implementation

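The following uses publicly available Wikipedia data and gensim to train a Word2Vec model. The architecture described above is already built into the gensim library, so you only need to supply a few important parameters to try Word2Vec out. If you'd like to know more about the training process on the wiki data, see my code on GitHub.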