[ML] TF-IDF for NLP

Tim Wong

Published in 深思心思, Dec 23, 2019

Date: 2019-Oct-26, Author: Tim Wong

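In NLP (Natural Language Processing) we inevitably work with many documents, and for each document we count how often each keyword (feature) appears. Why count? To see which document mentions a keyword of interest (say, "weight loss") the most. "The document in which a keyword appears most often must be the one most strongly related to that keyword!" Is that really true?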

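No! (1) Look at the document's total word count first: in a document of ten thousand words, a few extra occurrences of "weight loss" are perfectly normal. (2) If the corpus contains ten thousand documents about weight loss, then "weight loss" is no longer a distinguishing keyword. In fact, be careful that such overly common words do not mislead the NLP analysis.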
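
A minimal sketch of the first pitfall, using made-up numbers (the two documents and their counts are hypothetical, not taken from a real corpus): ranking by raw count favors the long document, while ranking by the keyword's share of all words favors the short, focused one.

```python
# Hypothetical documents: (total word count, occurrences of the keyword).
docs = {
    "long_article": (10000, 5),
    "short_note": (100, 3),
}

# Ranking by raw count picks the long article.
by_raw_count = max(docs, key=lambda name: docs[name][1])

# Ranking by relative frequency (count / total words) picks the short note.
by_share = max(docs, key=lambda name: docs[name][1] / docs[name][0])

print(by_raw_count)  # long_article
print(by_share)      # short_note
```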

TF-IDF (term frequency-inverse document frequency) addresses exactly these problems.

TF (term frequency) is the number of times a keyword appears in a document. The scikit-learn example:

corpus = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

Explanation:
In document 1, [3, 0, 1], keyword-1 appears 3 times, keyword-2 appears 0 times, and keyword-3 appears once.
The corpus contains 6 documents in total.
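
As a rough illustration of where such a count matrix could come from, the sketch below counts keywords in a few tokenized documents with the standard library. The vocabulary and the documents are invented for this example; only the resulting row layout (one row per document, one column per keyword) mirrors the matrix above.

```python
from collections import Counter

# Hypothetical, already-tokenized documents and a fixed keyword vocabulary.
vocabulary = ["diet", "run", "sleep"]
documents = [
    ["diet", "diet", "diet", "sleep"],
    ["diet", "diet"],
    ["diet", "run", "run", "diet", "diet"],
]

def count_row(tokens, vocab):
    """Return how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

count_matrix = [count_row(doc, vocabulary) for doc in documents]
print(count_matrix)  # [[3, 0, 1], [2, 0, 0], [3, 2, 0]]
```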

IDF (inverse document frequency) is defined as follows:

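idf_of_a_keyword = log(N/df) + 1

where:
N = the total number of documents = 6
df = the number of documents in which the keyword appears (each document counts 1 if the keyword is present, 0 if not). For this corpus df is [6, 1, 2], i.e. keyword-1 appears in all 6 documents, keyword-2 appears in 1 document, and keyword-3 appears in 2 documents. Here log is the natural logarithm (np.log).

#------------------------
In Python, these can be computed as follows:
corpus.shape[0]         # 6
sum((corpus > 0)*1)     # [6, 1, 2]
inverse_doc_freq = np.log(corpus.shape[0]/sum((corpus > 0)*1)) + 1
#------------------------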
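
The same calculation can be sketched with NumPy's own axis-wise sum, which makes the "count documents per column" step explicit. The variable names n_docs, doc_freq and idf are chosen for this example only.

```python
import numpy as np

corpus = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = corpus.shape[0]               # N = 6 documents
doc_freq = (corpus > 0).sum(axis=0)    # df: documents containing each keyword
idf = np.log(n_docs / doc_freq) + 1    # natural log, as in the formula above

print(doc_freq)  # [6 1 2]
print(idf)       # approximately 1.0, 2.7918, 2.0986
```

Keyword-1, which appears in every document, gets the smallest weight (ln(6/6) + 1 = 1), while keyword-2, which appears in only one document, gets the largest (ln(6/1) + 1, about 2.79).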

Once we have tf and idf, multiplying them gives tf-idf.

def ele_multi(list1, list2):
    # element-wise product of two equal-length sequences
    return [i*j for i, j in zip(list1, list2)]

tf_idf = np.array([ele_multi(doc, inverse_doc_freq) for doc in corpus])
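
As an alternative sketch of this multiplication step, NumPy broadcasting can multiply every row of the count matrix by the idf vector without a helper function. This is one possible way to write it, not necessarily how the original code was meant to look.

```python
import numpy as np

corpus = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

# idf = ln(N/df) + 1, one value per keyword (column).
idf = np.log(corpus.shape[0] / (corpus > 0).sum(axis=0)) + 1

# Broadcasting: the (3,) idf vector is applied to each of the 6 rows.
tf_idf = corpus * idf

print(tf_idf[0])  # document 1: approximately 3.0, 0.0, 2.0986
```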

scikit-learn does one more thing: it applies Euclidean (L2) normalization to the feature vector of each document, i.e.:

v_norm = v/sqrt(v_1^2+v_2^2+v_3^2+...+v_n^2)

This keeps every value between 0 and 1 and also expresses how much weight a keyword carries within a document.
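
A small sketch of this normalization step with np.linalg.norm, applied row by row to the unnormalized tf-idf matrix. It assumes no document is all zeros (true for this corpus, since keyword-1 appears in every document); an all-zero row would lead to a division by zero.

```python
import numpy as np

corpus = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

idf = np.log(corpus.shape[0] / (corpus > 0).sum(axis=0)) + 1
tf_idf = corpus * idf

# Euclidean length of each row, kept as a column so it broadcasts per row.
row_norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)
normalized = tf_idf / row_norms

print(normalized[0])  # document 1: approximately 0.8194, 0.0, 0.5732
```

After this step every row has length 1, so documents of very different sizes become directly comparable.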

import numpy as np

def euclidean_dist_normalize(list_a):
    # divide every element by the vector's Euclidean (L2) length
    result = sum([i**2 for i in list_a]) ** 0.5
    return [i/result for i in list_a]

def ele_multi(list1, list2):
    # element-wise product of two equal-length sequences
    return [i*j for i, j in zip(list1, list2)]

corpus = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

# idf = ln(N/df) + 1, one value per keyword
inverse_doc_freq = np.log(corpus.shape[0]/sum((corpus > 0)*1)) + 1
# tf * idf per document, then L2-normalize each document's vector
tf_idf = np.array([euclidean_dist_normalize(ele_multi(doc, inverse_doc_freq)) for doc in corpus])
print(tf_idf)
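
To check the hand-rolled steps against the library, the sketch below runs scikit-learn's TfidfTransformer on the same count matrix. One assumption matters here: smooth_idf=False is passed so that the transformer uses idf = ln(N/df) + 1 as in this post; the default, smooth_idf=True, uses ln((1 + N)/(1 + df)) + 1 and gives slightly different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

# smooth_idf=False selects idf = ln(N/df) + 1; norm defaults to 'l2'.
transformer = TfidfTransformer(smooth_idf=False)
sk_tf_idf = transformer.fit_transform(counts).toarray()

# Manual computation: tf * idf, then L2-normalize each row.
idf = np.log(counts.shape[0] / (counts > 0).sum(axis=0)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(sk_tf_idf, manual))  # True
```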

A ball of fire sprinting at full speed

I'm Tim | timwong.ai@gmail.com
