Computing Text Similarity in Eight Steps
Putting the tf-idf method into practice
May 7, 2017
Setup and building the model
Import the text-processing library gensim and handcraft a few raw documents.
import gensim
raw_documents = ["I'm taking the show on the road.",
"My socks are a force multiplier.",
"I am the barber who cuts everyone's hair who doesn't cut their own.",
"Legend has it that the mind is a mad monkey.",
"I make my own fun."]
print("Number of documents:",len(raw_documents))
Tokenize the documents with NLTK.
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
print(gen_docs)
# [['i', "'m", 'taking', 'the', 'show', 'on', 'the', 'road', '.'], ['my', 'socks', 'are', 'a', 'force', 'multiplier', '.'], ['i', 'am', 'the', 'barber', 'who', 'cuts', 'everyone', "'s", 'hair', 'who', 'does', "n't", 'cut', 'their', 'own', '.'], ['legend', 'has', 'it', 'that', 'the', 'mind', 'is', 'a', 'mad', 'monkey', '.'], ['i', 'make', 'my', 'own', 'fun', '.']]
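If word_tokenize raises a LookupError on a fresh NLTK install, the punkt tokenizer data has probably not been downloaded yet; a one-time fix (assuming a standard NLTK setup):
import nltk
nltk.download('punkt')  # downloads the tokenizer models that word_tokenize relies on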
Build a dictionary that maps every word to an integer id.
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
# on
print(dictionary.token2id['road'])
# 0
print("Number of words in dictionary:",len(dictionary))
# Number of words in dictionary: 36
for i in range(len(dictionary)):
    print(i, dictionary[i])

0 road
1 taking
2 .
3 i
4 show
5 on
6 the
7 'm
8 force
9 socks
10 a
11 my
12 are
13 multiplier
14 cut
15 own
16 their
17 hair
18 n't
19 who
20 barber
21 cuts
22 's
23 does
24 am
25 everyone
26 mad
27 legend
28 has
29 it
30 is
31 that
32 monkey
33 mind
34 make
35 fun
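One detail worth knowing before moving on: by default doc2bow silently ignores any token that is not in the dictionary. A quick check, using the made-up out-of-vocabulary word 'unicorn':
print(dictionary.doc2bow(['socks', 'unicorn']))
# [(9, 1)]  ('unicorn' is unknown to the dictionary, so it is dropped)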
Create a bag-of-words corpus that records each document's word counts. Each entry is a (token id, count) pair; for example, (6, 2) in the first document means the word 'the' (id 6) occurs twice.
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1)], [(2, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)], [(2, 1), (3, 1), (6, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(2, 1), (6, 1), (10, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(2, 1), (3, 1), (11, 1), (15, 1), (34, 1), (35, 1)]]
Build the tf-idf model from the corpus. tf-idf stands for term frequency-inverse document frequency: term frequency is how often a word occurs within a document, and inverse document frequency measures how rare the word is across the corpus. The num_nnz printed below is not the total word count but the number of non-zero entries in the bag-of-words corpus, i.e. the number of (document, unique token) pairs.
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
# TfidfModel(num_docs=5, num_nnz=47)
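As a sanity check, num_nnz can be reproduced by counting the non-zero entries of the bag-of-words corpus (a small sketch, not part of the original walkthrough):
print(sum(len(bow) for bow in corpus))
# 47
By default gensim weights each term as tf * log2(num_docs / doc_freq) and then L2-normalizes each document vector; that is the scheme behind the weights shown further down.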
Build the similarity index on top of the tf-idf model. Done!
sims = gensim.similarities.Similarity('./',tf_idf[corpus],num_features=len(dictionary))
print(sims)
# Similarity index with 5 documents in 0 shards (stored under ./)
print(type(sims))
# <class 'gensim.similarities.docsim.Similarity'>
Testing
Create a query document and convert it to tf-idf.
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
# ['socks', 'are', 'a', 'force', 'for', 'good', '.']
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
# [(2, 1), (8, 1), (9, 1), (10, 1), (12, 1)]
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
# [(8, 0.5484803253891997), (9, 0.5484803253891997), (10, 0.31226270667960454), (12, 0.5484803253891997)]
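Two things stand out here. 'for' and 'good' already disappeared at the doc2bow step because they are not in the dictionary, and '.' ends up with a weight of zero because it appears in every document. The remaining weights can be reproduced by hand from gensim's default weighting (log2 idf with L2 normalization); a small sketch, not part of the original code:
import math

idf_rare = math.log2(5 / 1)    # 'socks', 'are' and 'force' each appear in 1 of 5 documents
idf_common = math.log2(5 / 2)  # 'a' appears in 2 of 5 documents
norm = math.sqrt(3 * idf_rare ** 2 + idf_common ** 2)
print(idf_rare / norm, idf_common / norm)
# roughly 0.54848 and 0.31226, matching the weights above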
Query the similarity index. As expected, the second document scores highest.
sims[query_doc_tf_idf]
# array([ 0. , 0.84565616, 0. , 0.06124881, 0. ], dtype=float32)
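To turn these scores into an answer, pair each score with its document and take the maximum; a short usage sketch (variable names are my own):
scores = sims[query_doc_tf_idf]
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, raw_documents[best])
# 1 My socks are a force multiplier.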