
Different Types of Text Similarity Approaches

In various tasks such as information retrieval, document clustering, word-sense disambiguation, machine translation and text summarization, it is essential to measure the similarity between words, sentences, paragraphs and documents. This post discusses the three main types of text similarity approaches: String-based, Corpus-based and Knowledge-based. Example implementations of some approaches, using Python libraries, are also shown.

String-Based Similarity

A string similarity or distance measure quantifies the degree to which two strings match each other.

String-Based Similarity can be further classified into Character-Based and Term-Based Similarity Measures.

Character-Based Similarity Measure

A common example is the Longest Common Substring (LCS) algorithm, which considers the maximum length of a contiguous chain of characters that exists in both strings.

from difflib import SequenceMatcher

def longestSubstring(str1, str2):
    # Find the longest contiguous block of characters shared by the two strings
    seqMatch = SequenceMatcher(None, str1, str2)
    match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
    if match.size != 0:
        return str1[match.a: match.a + match.size]
    return None

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
print('longest substring between sent1 and sent2 : ', longestSubstring(sent1, sent2))

The output:

longest substring between sent1 and sent2 : if possible

Another example of a Character-Based Similarity Measure is Levenshtein edit distance. It defines the distance between two strings as the minimum number of single-character operations (insertions, deletions, or substitutions) needed to transform one string into the other; the Damerau-Levenshtein variant additionally counts transpositions of two adjacent characters.

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
sent_1_2 = nltk.edit_distance(sent1, sent2)
print(sent_1_2, 'Edit Distance between sent1 and sent2')

The output:

22 Edit Distance between sent1 and sent2
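
To make the operation counting concrete, here is a minimal from-scratch sketch of the classic dynamic-programming recurrence behind Levenshtein distance (insertions, deletions and substitutions only); nltk.edit_distance computes essentially this:

def levenshtein(s1, s2):
    # dp[i][j] = edit distance between the first i chars of s1 and the first j chars of s2
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
print(levenshtein(sent1, sent2))  # 22, agreeing with nltk.edit_distance above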

Term-Based Similarity Measures

Cosine similarity is a measure of similarity between two vectors, computed as the cosine of the angle between them.

Euclidean distance, or L2 distance, is the square root of the sum of squared differences between corresponding elements of the two vectors.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def compute_vectors(*strs):
    # Fit a bag-of-words model on all input strings and return count vectors
    text = list(strs)
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def compute_cosine_sim(*strs):
    vectors = compute_vectors(*strs)
    return cosine_similarity(vectors)

def compute_euc_dis(*strs):
    vectors = compute_vectors(*strs)
    return euclidean_distances(vectors)

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
print("cosine_sim", compute_cosine_sim(sent1, sent2))
print("euclidean_dis", compute_euc_dis(sent1, sent2))

The output:

cosine_sim [[1.         0.58925565]
 [0.58925565 1.        ]]
euclidean_dis [[0.         2.64575131]
 [2.64575131 0.        ]]
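
These numbers can be reproduced by hand. A minimal sketch using only the standard library; for these two sentences a plain whitespace split() happens to produce the same count vectors (up to token naming) as CountVectorizer's default tokenization:

import math
from collections import Counter

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."

vocab = sorted(set(sent1.split()) | set(sent2.split()))
v1 = [Counter(sent1.split())[w] for w in vocab]
v2 = [Counter(sent2.split())[w] for w in vocab]

dot = sum(a * b for a, b in zip(v1, v2))    # 5 shared terms
norm1 = math.sqrt(sum(a * a for a in v1))   # sqrt(8)
norm2 = math.sqrt(sum(b * b for b in v2))   # sqrt(9) = 3
print(dot / (norm1 * norm2))                # 0.589..., i.e. 5 / (sqrt(8) * 3)
print(math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2))))  # 2.645... = sqrt(7)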

Dice’s coefficient is defined as twice the number of common terms in the two strings, divided by the total number of terms in both strings.

Jaccard similarity is computed as the number of shared terms over the union of all the terms in both strings.

The overlap coefficient considers two strings a full match if one is a subset of the other.

def compute_jaccard_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def compute_dice_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return 2 * float(len(c)) / (len(a) + len(b))

def compute_overlap_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / min(len(a), len(b))

sent1 = "It might help to study nlp if possible."
sent2 = "It can help to play football again if possible."
print("jaccard:", compute_jaccard_sim(sent1, sent2))
print("dice:", compute_dice_sim(sent1, sent2))
print("overlap:", compute_overlap_sim(sent1, sent2))

The output:

jaccard: 0.4166666666666667
dice: 0.5882352941176471
overlap: 0.625
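
These values are easy to verify by hand: after splitting on whitespace, the two sentences have 8 and 9 distinct tokens respectively, with 5 tokens in common, so Jaccard = 5 / (8 + 9 - 5) = 5/12 ≈ 0.417, Dice = 2 · 5 / (8 + 9) = 10/17 ≈ 0.588, and overlap = 5 / min(8, 9) = 5/8 = 0.625.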

Corpus-Based Similarity

Corpus-Based similarity determines the semantic similarity between words according to information gained from a large corpus. Pointwise Mutual Information is an example of a corpus-based similarity measure.

Pointwise Mutual Information — Information Retrieval (PMI-IR) is a method for computing the similarity between pairs of words: the more often two words co-occur near each other on a web page, the higher their PMI-IR similarity score.

import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

text = "this is a foo bus red car foo bus bus blue car foo bar bar red car shep bus bus blue"
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)

The output:

(('is', 'a'), 4.392317422778761)
(('this', 'is'), 4.392317422778761)
(('a', 'foo'), 2.8073549220576046)
(('car', 'shep'), 2.8073549220576046)
(('red', 'car'), 2.8073549220576046)
(('bar', 'bar'), 2.3923174227787607)
(('bar', 'red'), 2.3923174227787607)
(('car', 'foo'), 2.222392421336448)
(('shep', 'bus'), 2.0703893278913985)
(('bus', 'blue'), 2.070389327891398)
(('blue', 'car'), 1.8073549220576046)
(('foo', 'bar'), 1.8073549220576046)
(('foo', 'bus'), 1.485426827170242)
(('bus', 'red'), 1.070389327891398)
(('bus', 'bus'), 0.7484612330040363)
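
The scores follow the usual PMI formula, PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with probabilities estimated from raw counts over the N = 21 tokens of the toy text. A minimal sketch reproducing the top score:

import math

N = 21                    # total tokens in the toy text
count_is, count_a = 1, 1  # unigram counts of 'is' and 'a'
count_is_a = 1            # count of the bigram ('is', 'a')

pmi = math.log2((count_is_a / N) / ((count_is / N) * (count_a / N)))
print(pmi)  # 4.392317422778761, matching the score for ('is', 'a') above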

Knowledge-Based Similarity

Knowledge-Based Similarity measures the degree of similarity between words using information derived from semantic networks. WordNet is the most popular semantic network: a large lexical database of English nouns, verbs, adjectives and adverbs, in which words are grouped into sets of synonyms (synsets), each expressing a distinct concept.
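
To see how WordNet is organized, here is a minimal sketch using nltk's WordNet interface (the word 'rat' is just an illustrative choice, matching the example below):

from nltk.corpus import wordnet as wn

# All senses (synsets) of the word 'rat'
print(wn.synsets('rat'))

# Inspect one noun sense: its gloss, its synonyms, and its hypernym (parent concept)
rat = wn.synset('rat.n.01')
print(rat.definition())
print(rat.lemma_names())
print(rat.hypernyms())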

Resnik Similarity is based on the Information Content (IC) of the Least Common Subsumer, the most specific node in the hierarchy that is a hypernym of both input Synsets.

Jiang-Conrath Similarity is based on the IC of the Least Common Subsumer and that of the two input Synsets, computed as 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Lin Similarity combines the same quantities differently: 2 * IC(lcs) / (IC(s1) + IC(s2)).

# Retrieve the Information Content (IC) computed over the Brown corpus
from nltk.corpus import wordnet_ic
from nltk.corpus import wordnet as wn
brown_ic = wordnet_ic.ic('ic-brown.dat')
# Look up the noun senses 'rat' and 'lion' using synset()
rat = wn.synset('rat.n.01')
lion = wn.synset('lion.n.01')
print("resnik:", rat.res_similarity(lion, brown_ic))
print("jc:", rat.jcn_similarity(lion, brown_ic))
print("lin:", rat.lin_similarity(lion, brown_ic))

The output:

resnik: 4.665415658815678
jc: 0.08207149300038069
lin: 0.5288091238271396
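
For intuition, the three scores can be reassembled from raw IC values. A sketch, assuming nltk's information_content helper and using lowest_common_hypernyms to approximate the Least Common Subsumer (nltk's own implementation picks the subsumer with maximum IC, so results can differ slightly, and all values depend on the IC corpus used):

from nltk.corpus import wordnet as wn, wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic('ic-brown.dat')
rat, lion = wn.synset('rat.n.01'), wn.synset('lion.n.01')

lcs = rat.lowest_common_hypernyms(lion)[0]  # most specific shared hypernym
ic_rat = information_content(rat, brown_ic)
ic_lion = information_content(lion, brown_ic)
ic_lcs = information_content(lcs, brown_ic)

print("resnik:", ic_lcs)                           # IC of the subsumer
print("jc:", 1 / (ic_rat + ic_lion - 2 * ic_lcs))  # Jiang-Conrath
print("lin:", 2 * ic_lcs / (ic_rat + ic_lion))     # Lin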

References

W. H. Gomaa and A. A. Fahmy, "A Survey of Text Similarity Approaches", International Journal of Computer Applications, 2013.

http://www.nltk.org/howto/wordnet.html

NLP course slides, IIT Gandhinagar (https://sites.google.com/a/iitgn.ac.in/nlp-2018/)