Pointer-Generator-Network

任書瑋
Data Scientists Playground
9 min read · Dec 12, 2019

Here I want to record a very interesting variation on seq2seq. A traditional seq2seq model compresses the input sequence into some information vector (the encoder); at decoding time it predicts the probability of the next word using only the previously generated output tokens together with that encoder vector.
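
For reference, here is a minimal sketch of why this is limiting: in a plain attention-based seq2seq decoder, every output token must come from a softmax over a fixed vocabulary. The shapes and the TF 2.x Keras calls below are illustrative assumptions, not code from this post.

import tensorflow as tf

# Hypothetical shapes: batch = 8, decoder state and attention context both 128-dim
s_t = tf.random.normal([8, 128])       # decoder state at step t
h_star_t = tf.random.normal([8, 128])  # attention context vector at step t

# The next-word distribution is a softmax over a fixed vocabulary (say 50k words),
# so a rare or out-of-vocabulary word from the source article simply cannot be produced.
p_vocab = tf.keras.layers.Dense(50000, activation="softmax")(
    tf.concat([s_t, h_star_t], axis=-1))   # (8, 50000)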

Image source: https://arxiv.org/pdf/1704.04368.pdf

Text summarization fits the seq2seq framework very well: encode the whole article and decode its summary, a classic seq2seq task. Suppose we have a report on a basketball game:

The Houston Rockets visited the Cleveland Cavaliers today (the 12th). Facing a team second from the bottom of the Eastern Conference, star "The Beard" James Harden again broke 50 points in a single game and hit a season-high 10 three-pointers, leading his team to a 116:110 win. …

and its summary is

NBA / A barrage of 10 threes! Harden slaughters the Cavaliers

But from a word-vector point of view, the embeddings of different numbers should be very close to one another, and so should the embeddings of different team names. Suppose we change the article as follows:

10 three-pointers -> 50 three-pointers, Cleveland Cavaliers -> Detroit Pistons

This should barely affect the overall encoder vector, yet the output has to change drastically. That is very hard for a vanilla seq2seq model, so this paper proposes an interesting idea:

at the output it provides a mechanism (P_gen) that can emit a corresponding word taken from the inputs; in other words, a word of the input sequence can serve directly as the current output.

Image source: https://arxiv.org/pdf/1704.04368.pdf

As the figure above shows, when producing an output token the model first computes P_gen: with probability (1 - P_gen) it copies a word from the input sequence, and with probability P_gen it generates the next word directly. Probabilities for the same word are accumulated, as with "Argentina" in the figure above.
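
In the paper's notation, the final distribution for a word w at a single decoding step is P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of the attention weights a_i over every input position i where w_i = w. Below is a minimal single-step NumPy sketch with made-up numbers (not the paper's code) just to show the accumulation; the TF implementation later in the post does the same thing in batch with tf.scatter_nd.

import numpy as np

# One decoding step, toy numbers
vocab_size = 5
p_gen = 0.7                                      # probability of generating from the vocabulary
p_vocab = np.array([0.1, 0.2, 0.3, 0.25, 0.15])  # generator distribution over the vocabulary
src_ids = np.array([2, 4, 2])                    # word ids of the 3 input tokens (id 2 appears twice)
attn = np.array([0.5, 0.3, 0.2])                 # attention over the 3 input positions

final = p_gen * p_vocab                          # "generate" part
np.add.at(final, src_ids, (1.0 - p_gen) * attn)  # "copy" part; repeated ids are accumulated
print(final, final.sum())                        # still sums to 1.0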

The concept is simple, but how do we implement picking words from the input sequence and accumulating their probabilities in code? You can refer to the code below; I will walk through it step by step.

"""
Args:
x: A tensor with (batchSize, xl)
vocab_dists: A tensor with (batchSize, yl, numWord)
attn: The attention distributions. (batchSize, yl, xl)
contact: A tensor with (batchSize, yl, 2embeddingSize)
Returns:
final_dists: The final distributions. (batchSize, yl, numWord)
"""
vocab_size = vocab_dists.get_shape().as_list()[-1]
gens = tf.layers.dense(contact, units=1, activation=tf.sigmoid, use_bias=False) # (batch, yl, 1)
# Multiply vocab dists by p_gen and attention dists by (1-p_gen)
vocab_dists = gens * vocab_dists # (batch, yl, numWord)
attn_dists = (1-gens) * attn # (batch, yl, xl)
batch_size = tf.shape(attn_dists)[0] # batchSize
yl = tf.shape(attn_dists)[1] # yl
xl = tf.shape(attn_dists)[2] # xl
dec = tf.range(0, limit=yl) # [yl]
dec = tf.expand_dims(dec, axis=-1) # [yl, 1]
dec = tf.tile(dec, [1, xl]) # [yl, xl]
dec = tf.expand_dims(dec, axis=0) # [1, yl, xl]
dec = tf.tile(dec, [batch_size, 1, 1]) # [batchSize, yl, xl]
'''
dec for a single batch element, shape [yl, xl]:
[0     0     ... 0   ]
[1     1     ... 1   ]
[2     2     ... 2   ]
 :     :         :
[yl-1  yl-1  ... yl-1]
'''
x = tf.expand_dims(x, axis=1) # [batchSize, 1, xl]
x = tf.tile(x, [1, yl, 1]) # [batchSize, yl, xl]
'''
x for a single batch element, shape [yl, xl]:
[enc sentence word2id]
         :              (the same row of input word ids, repeated yl times)
[enc sentence word2id]
'''
indices = tf.stack([dec, x], axis=3) # [batchSize, yl, xl, 2]
'''
tf.map_fn iterates over the batch dimension, so for each batch element:
  y[0] = indices     (yl, xl, 2)
  y[1] = attn_dists  (yl, xl)
tf.scatter_nd(indices=(yl, xl, 2), updates=(yl, xl), shape=[yl, vocab_size])
indices[yk, xk, :] is a 2D coordinate in a tensor of shape [yl, vocab_size],
and the value written there is updates[yk, xk]; duplicate coordinates
(the same input word appearing more than once) are summed.
example in test.py
'''
attn_dists_projected = tf.map_fn(
fn=lambda y: tf.scatter_nd(y[0], y[1], [yl, vocab_size]),
elems=(indices, attn_dists),
dtype=tf.float32)
final_dists = attn_dists_projected + vocab_dists

vocab_dists is the distribution for generating the next word directly (the green distribution in the figure).

We compute P_gen from contact. The crucial point is that contact must have a dimension tied to yl, because each decoding step later needs its own P_gen value.

gens = tf.layers.dense(contact, units=1, activation=tf.sigmoid, use_bias=False)  # (batch, yl, 1)
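
(tf.layers.dense is the TensorFlow 1.x API; if you are on TF 2.x, which is an assumption and not part of the original post, the same projection is a Keras Dense layer:)

import tensorflow as tf

# Hypothetical shapes: batch = 8, yl = 20, 2*embeddingSize = 256
contact = tf.random.normal([8, 20, 256])
gen_layer = tf.keras.layers.Dense(units=1, activation="sigmoid", use_bias=False)
gens = gen_layer(contact)  # (8, 20, 1): one p_gen value per decoding step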

attn has shape (batchSize, yl, xl); it is the blue Attention Distribution in the figure. The yl dimension is there so that each decoding step can have its own distribution.

attn_dists = (1-gens) * attn  # (batch, yl, xl)

Next we build an index tensor that pairs the decoder step (0 to yl-1) with the word2id of each input token:

indices = tf.stack([dec, x], axis=3) # [batchSize, yl, xl, 2]

Here is an example: yl = 2, the encoder input has three words (xl = 3) whose word2id values are [4, 2, 6], batch size = 1, and vocab_size = 10.

attn_dists = tf.constant([[[1.1, 1.3, 1.4] , [1.5, 1.6, 1.7]]],dtype=tf.float32, name=None)

For this example, dec (after tiling, batch dimension included) is [[[0, 0, 0], [1, 1, 1]]], x (after tiling) is [[[4, 2, 6], [4, 2, 6]]], and indices becomes [[[[0, 4], [0, 2], [0, 6]], [[1, 4], [1, 2], [1, 6]]]], where each entry is a coordinate [decoder step, word2id].

Finally we use tf.map_fn to scatter this attn_dists distribution into the vocabulary space; here we set vocab_size = 10.
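
To make this concrete, here is a self-contained sketch that reproduces the example above and prints the projected attention distribution. It is not the post's test.py; it assumes TF 2.x eager execution.

import tensorflow as tf

# Example from the text: batch = 1, yl = 2, xl = 3, input word2id = [4, 2, 6], vocab_size = 10
x = tf.constant([[4, 2, 6]], dtype=tf.int32)                     # (1, 3)
attn_dists = tf.constant([[[1.1, 1.3, 1.4],
                           [1.5, 1.6, 1.7]]], dtype=tf.float32)  # (1, 2, 3)
batch_size, yl, xl, vocab_size = 1, 2, 3, 10

dec = tf.tile(tf.reshape(tf.range(yl), [1, yl, 1]), [batch_size, 1, xl])  # (1, 2, 3): decoder step per position
x_tiled = tf.tile(tf.expand_dims(x, axis=1), [1, yl, 1])                  # (1, 2, 3): word id per position
indices = tf.stack([dec, x_tiled], axis=3)                                # (1, 2, 3, 2): [step, word id] coordinates

attn_dists_projected = tf.map_fn(
    fn=lambda y: tf.scatter_nd(y[0], y[1], [yl, vocab_size]),
    elems=(indices, attn_dists),
    dtype=tf.float32)

print(attn_dists_projected.numpy())
# Expected (up to float formatting):
# [[[0.  0.  1.3 0.  1.1 0.  1.4 0.  0.  0. ]
#   [0.  0.  1.6 0.  1.5 0.  1.7 0.  0.  0. ]]]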

The result can then be added to vocab_dists:

vocab_dists = gens * vocab_dists  # (batch, yl, numWord)
final_dists = attn_dists_projected + vocab_dists

Reference

See, A., Liu, P. J., & Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. https://arxiv.org/abs/1704.04368