Word Embeddings: Text Mining Principles

Published in Data Scientists Playground, Apr 2, 2018

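Word embedding (word vector) is a text-mining technique that has become very popular in recent years. Its main purpose is to convert words into vectors; once words are vectorized, they can be used in large-scale numerical computation.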

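The process starts by one-hot encoding the text, which produces a high-dimensional, sparse matrix. The dense vector that corresponds to each one-hot index can then be learned by a deep learning model.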
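To make the one-hot step concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the numbers in `embedding_table` are made up for illustration; in a real model the table would hold learned weights. It shows that multiplying a one-hot vector by the table simply selects one row.

```python
# Toy vocabulary (hypothetical, for illustration only)
vocab = ["computer", "laptop", "banana"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Made-up 2-dimensional embedding table, one row per vocabulary word
embedding_table = [
    [0.87, 0.10],  # computer
    [0.86, 0.12],  # laptop
    [0.05, 0.90],  # banana
]

def embed(word):
    """Multiply the one-hot vector by the table; this just selects one row."""
    one_hot = one_hot_vector(word)
    dims = len(embedding_table[0])
    return [
        sum(one_hot[i] * embedding_table[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]

print(one_hot_vector("laptop"))  # [0, 1, 0]
print(embed("laptop"))           # [0.86, 0.12]
```

With a real vocabulary of tens of thousands of words, the one-hot vectors become very long and almost entirely zero, which is why frameworks store integer indices and do the row lookup directly.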

A good word embedding places "similar" words close to each other in the vector space.

For example: Vector(computer) = 0.87 / Vector(laptop) = 0.86
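One common way to measure how close two word vectors are is cosine similarity. The sketch below uses made-up 2-dimensional vectors (not taken from any trained model) just to show the computation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only
vectors = {
    "computer": [0.87, 0.10],
    "laptop": [0.86, 0.12],
    "banana": [0.05, 0.90],
}

print(cosine_similarity(vectors["computer"], vectors["laptop"]))
print(cosine_similarity(vectors["computer"], vectors["banana"]))
```

The first value should be much closer to 1 than the second, reflecting that "computer" and "laptop" point in nearly the same direction.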

In addition, you can do arithmetic on words.

Example 1: Vector(king) - Vector(man) + Vector(woman) ≈ Vector(queen) (the classic example)
Example 2: Vector(person) - Vector(friend) ≈ Vector(loner)
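The arithmetic can be sketched with hand-made vectors. The values below are invented so that the first axis roughly means "royalty" and the second "gender"; a trained embedding would have hundreds of dimensions and only approximate this behaviour.

```python
import math

# Hand-made 2-d vectors (illustrative assumption, not trained values)
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}

def analogy(a, b, c):
    """Return the word closest to Vector(a) - Vector(b) + Vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words themselves, as analogy evaluations usually do
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```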

How to obtain word embeddings:

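1. Learn the embedding together with a target task (e.g., sentiment analysis or topic classification). The embedding is learned much like any other weights in a neural network. The best-known method of learning embeddings as network weights is Word2Vec (skip-gram or CBOW).
2. Load already-trained results into the model you are about to train. Many pre-trained word embeddings are available online and can be loaded directly, for example GloVe.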
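The second approach can be sketched as follows. GloVe distributes its vectors as plain text, one word per line followed by its float values. The sample lines, the `word_index` mapping, and the helper names `load_embeddings` and `build_matrix` are all hypothetical; a real file would be opened with `open(...)` instead of `io.StringIO`.

```python
import io

def load_embeddings(lines):
    """Parse GloVe-style text: each line is a word followed by its float values."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def build_matrix(word_index, table, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, i in word_index.items():
        if word in table:
            matrix[i] = table[word]
    return matrix

# Made-up stand-in for a pre-trained embedding file
sample_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
table = load_embeddings(sample_file)

# Hypothetical word-to-index mapping produced by your own tokenizer
word_index = {"good": 1, "bad": 2, "meh": 3}
matrix = build_matrix(word_index, table, dim=3)
print(matrix)
```

A matrix built this way can then be used as the initial weights of an embedding layer, optionally frozen so that training does not change it.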

Below, we use the first approach as an example (task: sentiment analysis).

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

# Define our own texts
words = ['好棒','hen開心','雖敗猶榮','太66666','太神拉','森七七','過氣','撿角','崩潰','QQ']
# Sentiment labels: 1 = positive, 0 = negative
labels = array([1,1,1,1,1,0,0,0,0,0])
# vb_size: vocabulary size
vb_size = 50
# docs_dict: each text encoded by Keras's one_hot as a list of integer word indices
docs_dict = [one_hot(d, vb_size) for d in words]

Converting the words to a one-hot encoding
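Despite its name, Keras's `one_hot` does not return a 0/1 vector: it splits the text on whitespace and hashes each word to an integer index below the vocabulary size. The sketch below imitates that idea with an MD5 hash so the result is reproducible; `hashed_indices` is a hypothetical helper, not the Keras implementation, and its indices will not match what Keras returns.

```python
import hashlib

def hashed_indices(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

docs = ["好棒", "hen開心", "QQ"]
encoded = [hashed_indices(d, 50) for d in docs]
print(encoded)
```

Because none of these texts contains a space, each one becomes a single index. With hashing, two different words can land on the same index, which is why the vocabulary size is usually set well above the number of distinct words.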

# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(docs_dict, maxlen=max_length)
# maxlen -> bring every sequence to the same length (padding or truncating)
print(padded_docs[0])

Padding or truncating to a fixed length
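By default, `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops items from the front. A plain-Python sketch of that default behaviour (`pad_pre` is a hypothetical helper name):

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([7], 4))              # [0, 0, 0, 7]
print(pad_pre([1, 2, 3, 4, 5], 4))  # [2, 3, 4, 5]
```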

model = Sequential()

# The Embedding layer maps each word index to a point in vector space
# Ex: [[5],[20]] -> [[0.17,0.66],[0.82,0.53]]
# vb_size: vocabulary size
model.add(Embedding(vb_size, 8, input_length=max_length))
model.add(Flatten())
# binary classification
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
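To see what the three layers compute, here is a framework-free sketch of the forward pass for one padded document. The weights are random stand-ins rather than trained values, and `predict` is a hypothetical helper; only the shapes and the parameter count correspond to the model above.

```python
import math
import random

random.seed(0)
vb_size, embed_dim, max_length = 50, 8, 4

# Random stand-ins for the trained weights
embedding = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
             for _ in range(vb_size)]
dense_w = [random.uniform(-0.1, 0.1) for _ in range(max_length * embed_dim)]
dense_b = 0.0

def predict(padded_doc):
    """Embedding lookup -> Flatten -> Dense(1, sigmoid) for one padded document."""
    looked_up = [embedding[i] for i in padded_doc]   # 4 rows of 8 values
    flat = [v for row in looked_up for v in row]     # 32 values
    z = sum(w * x for w, x in zip(dense_w, flat)) + dense_b
    return 1.0 / (1.0 + math.exp(-z))

# Embedding: 50 * 8 weights; Dense: 32 weights + 1 bias
n_params = vb_size * embed_dim + (max_length * embed_dim + 1)
print(n_params)  # 433
print(predict([0, 0, 0, 7]))
```

Note that with only ten training texts and evaluation on the same data, the accuracy printed by the Keras code measures memorization, not generalization.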

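This post was a basic word embedding example. Next time, if I get the chance, I will introduce the Word2Vec methods (skip-gram and CBOW). In short, the CBOW model predicts a target word from the words surrounding it, while the skip-gram model uses the word itself to predict its surrounding words.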
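The difference between the two models shows up in how training examples are built from a sentence. The sketch below only generates the (input, output) pairs, under the assumption of a symmetric context window; it does not train anything, and the function names are made up for illustration.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """(context list, target) pairs: the neighbors jointly predict the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["the", "cat", "sat"]
print(skipgram_pairs(tokens, 1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs(tokens, 1))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```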

Thanks!

References:

How to Use Word Embedding Layers for Deep Learning with Keras - Machine Learning Mastery (machinelearningmastery.com)

Keras中文文档 (keras-cn.readthedocs.io)

erhwenkuo/deep-learning-with-keras-notebooks - Jupyter notebooks for using & learning Keras (github.com)

科技大擂台 (fgc.stpi.narl.org.tw)