Different types of Text Similarity Approaches
In tasks such as information retrieval, document clustering, word-sense disambiguation, machine translation and text summarization, it is essential to measure the similarity between words, sentences, paragraphs and documents. This post discusses the three different types of text similarity approaches: String-based, Corpus-based and Knowledge-based. Example implementations of some approaches using Python libraries are also shown.
String-Based Similarity
A string similarity or distance takes into account the degree to which two strings match with each other.
String-Based Similarity can be further classified into Character-Based Similarity Measures and Term-Based Similarity Measures.
Character-Based Similarity Measure
A common example of a Character-Based Similarity Measure is the Longest Common Substring (LCS) algorithm, which finds the maximum-length contiguous chain of characters that exists in both strings.
from difflib import SequenceMatcher

def longest_substring(str1, str2):
    # find the longest contiguous block of characters shared by both strings
    seq_match = SequenceMatcher(None, str1, str2)
    match = seq_match.find_longest_match(0, len(str1), 0, len(str2))
    if match.size != 0:
        return str1[match.a: match.a + match.size]
    return None

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
sent_1_2 = longest_substring(sent1, sent2)
print('longest substring between sent1 and sent2 : ', sent_1_2)
The output:
longest substring between sent1 and sent2 : if possible
Another example of a Character-Based Similarity Measure is the Levenshtein edit distance. It defines the distance between two strings as the minimum number of single-character operations (insertions, deletions, or substitutions) needed to transform one string into the other; the variant that additionally allows transposing two adjacent characters is known as the Damerau-Levenshtein distance.
import nltk

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
sent_1_2 = nltk.edit_distance(sent1, sent2)
print(sent_1_2, 'Edit Distance between sent1 and sent2')
The output:
22 Edit Distance between sent1 and sent2
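Under the hood, the distance can be computed with the classic dynamic-programming recurrence. Below is a minimal sketch restricted to insertions, deletions, and substitutions (the operations nltk.edit_distance counts by default), shown for illustration only:

```python
def levenshtein(s, t):
    # d[i][j] = edit distance between s[:i] and t[:j]
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```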
Term-based Similarity Measures
Cosine similarity is a measure of similarity between two vectors, computed as the cosine of the angle between them.
Euclidean distance, or L2 distance, is the square root of the sum of squared differences between corresponding elements of the two vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def compute_vectors(*strs):
    # build count vectors over the shared vocabulary of all input strings
    text = list(strs)
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def compute_cosine_sim(*strs):
    vectors = compute_vectors(*strs)
    return cosine_similarity(vectors)

def compute_euc_dis(*strs):
    vectors = compute_vectors(*strs)
    return euclidean_distances(vectors)

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
print("cosine_sim", compute_cosine_sim(sent1, sent2))
print("euclidean_dis", compute_euc_dis(sent1, sent2))
The output:
cosine_sim [[1.         0.58925565]
 [0.58925565 1.        ]]
euclidean_dis [[0.         2.64575131]
 [2.64575131 0.        ]]
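These values can be verified by hand: after CountVectorizer's tokenization the two sentences share 5 word types ("it", "help", "to", "if", "possible"), sent1 has 8 distinct word types and sent2 has 9, each occurring once:

```python
import math

# dot product of the two count vectors = number of shared word types,
# each with count 1; vector norms = sqrt of the number of types in each
dot = 5
norm1, norm2 = math.sqrt(8), math.sqrt(9)

print(dot / (norm1 * norm2))       # ~0.58925565 (cosine similarity)
print(math.sqrt(8 + 9 - 2 * dot))  # ~2.64575131 (Euclidean distance)
```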
Dice’s coefficient is defined as twice the number of terms common to the two strings, divided by the total number of terms in both strings.
Jaccard similarity is computed as the number of shared terms divided by the number of terms in the union of both strings.
The overlap coefficient divides the number of shared terms by the size of the smaller term set, so it considers two strings a full match if one is a subset of the other.
def compute_jaccard_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def compute_dice_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return 2 * float(len(c)) / (len(a) + len(b))

def compute_overlap_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / min(len(a), len(b))

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
print("jaccard: ", compute_jaccard_sim(sent1, sent2))
print("dice: ", compute_dice_sim(sent1, sent2))
print("overlap: ", compute_overlap_sim(sent1, sent2))
The output:
jaccard: 0.4166666666666667
dice: 0.5882352941176471
overlap: 0.625
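A quick hand check: the two sentences share 5 whitespace-separated tokens ("It", "help", "to", "if", "possible."), with 8 distinct tokens in sent1 and 9 in sent2, which reproduces all three scores:

```python
# shared = size of the intersection; a, b = distinct token counts
shared, a, b = 5, 8, 9

print(shared / (a + b - shared))  # 0.4166... (Jaccard)
print(2 * shared / (a + b))       # 0.5882... (Dice)
print(shared / min(a, b))         # 0.625    (overlap)
```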
Corpus-Based Similarity
Corpus-Based similarity determines the semantic similarity between words according to information gained from large corpora. Pointwise Mutual Information is an example of a corpus-based measure.
Pointwise Mutual Information - Information Retrieval (PMI-IR) is a method for computing the similarity between pairs of words. The more often two words co-occur near each other on a web page, the higher their PMI-IR similarity score.
import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

text = "this is a foo bus red car foo bus bus blue car foo bar bar red car shep bus bus blue"
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)
The output :
(('is', 'a'), 4.392317422778761)
(('this', 'is'), 4.392317422778761)
(('a', 'foo'), 2.8073549220576046)
(('car', 'shep'), 2.8073549220576046)
(('red', 'car'), 2.8073549220576046)
(('bar', 'bar'), 2.3923174227787607)
(('bar', 'red'), 2.3923174227787607)
(('car', 'foo'), 2.222392421336448)
(('shep', 'bus'), 2.0703893278913985)
(('bus', 'blue'), 2.070389327891398)
(('blue', 'car'), 1.8073549220576046)
(('foo', 'bar'), 1.8073549220576046)
(('foo', 'bus'), 1.485426827170242)
(('bus', 'red'), 1.070389327891398)
(('bus', 'bus'), 0.7484612330040363)
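These scores can be checked by hand. The text has 21 tokens, and NLTK's bigram PMI reduces to the log of the bigram count times N over the product of the two unigram counts. For ('this', 'is'), each of those counts is 1:

```python
import math

# Hand check of the ('this', 'is') score above: N = 21 tokens in the text,
# 'this' and 'is' each occur once, and the bigram ('this', 'is') occurs once.
N = 21
c_this, c_is, c_this_is = 1, 1, 1

# PMI(x, y) = log2(P(x, y) / (P(x) P(y))) = log2(c(x, y) * N / (c(x) * c(y)))
pmi = math.log2(c_this_is * N / (c_this * c_is))
print(pmi)  # 4.392317422778761, matching the NLTK score
```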
Knowledge-Based Similarity
Knowledge-Based Similarity measures the degree of similarity between words using information derived from semantic networks. WordNet is the most popular semantic network: a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept.
Resnik Similarity is the Information Content (IC) of the Least Common Subsumer (the lowest node in the hierarchy that is a hypernym of both input synsets).
Jiang-Conrath Similarity combines the IC of the Least Common Subsumer with the IC of the two input synsets: the smaller their combined IC distance, the higher the similarity.
Lin Similarity scales the IC of the Least Common Subsumer by the sum of the ICs of the two input synsets.
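The three measures differ only in how they combine the IC terms. A minimal sketch of the formulas as NLTK implements them (jcn_similarity returns the reciprocal of the Jiang-Conrath distance); the numbers passed in below are made up purely for illustration:

```python
# ic_lcs = IC of the Least Common Subsumer; ic_a, ic_b = IC of the two synsets
def resnik_sim(ic_lcs, ic_a, ic_b):
    return ic_lcs

def jcn_sim(ic_lcs, ic_a, ic_b):
    # reciprocal of the Jiang-Conrath distance IC(a) + IC(b) - 2 * IC(lcs)
    return 1.0 / (ic_a + ic_b - 2 * ic_lcs)

def lin_sim(ic_lcs, ic_a, ic_b):
    return 2.0 * ic_lcs / (ic_a + ic_b)

print(resnik_sim(4.0, 6.0, 7.0))  # 4.0
print(jcn_sim(4.0, 6.0, 7.0))     # 0.2
print(lin_sim(4.0, 6.0, 7.0))     # ~0.615
```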
# retrieving the IC of the Brown corpus
from nltk.corpus import wordnet_ic
from nltk.corpus import wordnet as wn

brown_ic = wordnet_ic.ic('ic-brown.dat')

# looking up the noun synsets for 'rat' and 'lion' using synset()
rat = wn.synset('rat.n.01')
lion = wn.synset('lion.n.01')

print("resnik: ", rat.res_similarity(lion, brown_ic))
print("jc: ", rat.jcn_similarity(lion, brown_ic))
print("lin: ", rat.lin_similarity(lion, brown_ic))
The output:
resnik: 4.665415658815678
jc: 0.08207149300038069
lin: 0.5288091238271396
References
Gomaa and Fahmy, "A Survey of Text Similarity Approaches", International Journal of Computer Applications
http://www.nltk.org/howto/wordnet.html
nlp course slides - IIT Gandhinagar (https://sites.google.com/a/iitgn.ac.in/nlp-2018/)