Computing Text Similarity in Eight Steps
Putting the tf-idf method into practice
May 7, 2017
Setup and building the model
Import the text-processing library gensim and handcraft a few raw documents.
import gensim
raw_documents = ["I'm taking the show on the road.",
"My socks are a force multiplier.",
"I am the barber who cuts everyone's hair who doesn't cut their own.",
"Legend has it that the mind is a mad monkey.",
"I make my own fun."]
print("Number of documents:",len(raw_documents))
Tokenize the documents with NLTK.
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
print(gen_docs)
# [['i', "'m", 'taking', 'the', 'show', 'on', 'the', 'road', '.'], ['my', 'socks', 'are', 'a', 'force', 'multiplier', '.'], ['i', 'am', 'the', 'barber', 'who', 'cuts', 'everyone', "'s", 'hair', 'who', 'does', "n't", 'cut', 'their', 'own', '.'], ['legend', 'has', 'it', 'that', 'the', 'mind', 'is', 'a', 'mad', 'monkey', '.'], ['i', 'make', 'my', 'own', 'fun', '.']]
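If word_tokenize raises a LookupError on a fresh NLTK install, the punkt tokenizer data has probably not been downloaded yet; a one-time fix (assuming a standard NLTK setup):
import nltk
nltk.download('punkt')  # downloads the tokenizer models that word_tokenize relies on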
Build a dictionary that maps every word to an integer id.
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
# on
print(dictionary.token2id['road'])
# 0
print("Number of words in dictionary:",len(dictionary))
# Number of words in dictionary: 36
for i in range(len(dictionary)):
    print(i, dictionary[i])

0 road
1 taking
2 .
3 i
4 show
5 on
6 the
7 'm
8 force
9 socks
10 a
11 my
12 are
13 multiplier
14 cut
15 own
16 their
17 hair
18 n't
19 who
20 barber
21 cuts
22 's
23 does
24 am
25 everyone
26 mad
27 legend
28 has
29 it
30 is
31 that
32 monkey
33 mind
34 make
35 fun
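One detail worth knowing before moving on: by default doc2bow silently ignores any token that is not in the dictionary. A quick check, using the made-up out-of-vocabulary word 'unicorn':
print(dictionary.doc2bow(['socks', 'unicorn']))
# [(9, 1)]  ('unicorn' is unknown to the dictionary, so it is dropped)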
Create a bag-of-words corpus that records each document's word counts. Each entry is a (token id, count) pair; for example, (6, 2) in the first document means the word 'the' (id 6) occurs twice.
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1)], [(2, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)], [(2, 1), (3, 1), (6, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(2, 1), (6, 1), (10, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(2, 1), (3, 1), (11, 1), (15, 1), (34, 1), (35, 1)]]
Build the tf-idf model from the corpus. tf-idf stands for term frequency-inverse document frequency: term frequency is how often a word occurs within a document, and inverse document frequency measures how rare the word is across the corpus. The num_nnz printed below is not the total word count but the number of non-zero entries in the bag-of-words corpus, i.e. the number of (document, unique token) pairs.
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
# TfidfModel(num_docs=5, num_nnz=47)
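As a sanity check, num_nnz can be reproduced by counting the non-zero entries of the bag-of-words corpus (a small sketch, not part of the original walkthrough):
print(sum(len(bow) for bow in corpus))
# 47
By default gensim weights each term as tf * log2(num_docs / doc_freq) and then L2-normalizes each document vector; that is the scheme behind the weights shown further down.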
Build the similarity index on top of the tf-idf model. Done!
sims = gensim.similarities.Similarity('./',tf_idf[corpus],num_features=len(dictionary))
print(sims)
# Similarity index with 5 documents in 0 shards (stored under ./)
print(type(sims))
# <class 'gensim.similarities.docsim.Similarity'>
Testing
Create a query document and convert it to tf-idf.
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
# ['socks', 'are', 'a', 'force', 'for', 'good', '.']
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
# [(2, 1), (8, 1), (9, 1), (10, 1), (12, 1)]
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
# [(8, 0.5484803253891997), (9, 0.5484803253891997), (10, 0.31226270667960454), (12, 0.5484803253891997)]
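Two things stand out here. 'for' and 'good' already disappeared at the doc2bow step because they are not in the dictionary, and '.' ends up with a weight of zero because it appears in every document. The remaining weights can be reproduced by hand from gensim's default weighting (log2 idf with L2 normalization); a small sketch, not part of the original code:
import math

idf_rare = math.log2(5 / 1)    # 'socks', 'are' and 'force' each appear in 1 of 5 documents
idf_common = math.log2(5 / 2)  # 'a' appears in 2 of 5 documents
norm = math.sqrt(3 * idf_rare ** 2 + idf_common ** 2)
print(idf_rare / norm, idf_common / norm)
# roughly 0.54848 and 0.31226, matching the weights above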
Query the similarity index. As expected, the second document scores highest.
sims[query_doc_tf_idf]
# array([ 0. , 0.84565616, 0. , 0.06124881, 0. ], dtype=float32)
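To turn these scores into an answer, pair each score with its document and take the maximum; a short usage sketch (variable names are my own):
scores = sims[query_doc_tf_idf]
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, raw_documents[best])
# 1 My socks are a force multiplier.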