Text Similarities: Estimate the degree of similarity between two texts

Adrien Sieg
Jul 4, 2018 · 27 min read

What is text similarity?

What is our winning strategy?

What do we mean by different embeddings?

- Bag of Words (BoW)
- Term Frequency - Inverse Document Frequency (TF-IDF)
- Continuous Bag of Words (CBOW) and SkipGram embeddings
- Pre-trained word embedding models:
-> Word2Vec (by Google)
-> GloVe (by Stanford)
-> fastText (by Facebook)
- Poincaré embedding
- Node2Vec embedding based on Random Walk and Graph

A very sexy approach: Knowledge-based Measures (WordNet) [Bonus]

0. Jaccard Similarity ☹☹☹:

Jaccard Similarity Principle
Jaccard Similarity Function
Why is Jaccard Similarity not efficient?
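As a quick illustration (a minimal sketch, assuming simple whitespace tokenisation), Jaccard similarity is just the size of the intersection of the two token sets divided by the size of their union:

```python
def jaccard_similarity(doc1, doc2):
    # Tokenise by whitespace and lowercase; a real pipeline would also
    # strip punctuation and possibly lemmatise.
    a = set(doc1.lower().split())
    b = set(doc2.lower().split())
    return len(a & b) / len(a | b)

print(jaccard_similarity("Obama speaks to the media in Illinois",
                         "The President greets the press in Chicago"))  # → 2/11 ≈ 0.18
```

The two sentences mean nearly the same thing, yet only "the" and "in" overlap, so the score is tiny: Jaccard sees no connection between "speaks" and "greets" or "media" and "press". That is exactly why it fails on paraphrases.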

1. K-means and Hierarchical Clustering Dendrogram ☹:

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!',
          'President greets the press in Chicago',
          'Obama speaks in Illinois']
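One way to try this (a sketch using scikit-learn and SciPy on the corpus above; TF-IDF features with Ward linkage are one common choice, not the only one):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!',
          'President greets the press in Chicago',
          'Obama speaks in Illinois']

# TF-IDF features, then hierarchical (Ward) clustering
X = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()
Z = linkage(X, method='ward')              # the data behind a dendrogram
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                              # cluster id per document
```

Because the features are surface word counts, the two "Obama / President" sentences share no content words and have no reason to land together, which is the weakness this section points at.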

2. Cosine Similarity ☹:

CountVectorizer Method + Cosine Similarity ☹
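A minimal sketch of this combination: vectorise both sentences with raw counts, then take the cosine of the angle between the count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Obama speaks to the media in Illinois",
        "The President greets the press in Chicago"]

# bag-of-words count vectors over the shared vocabulary
X = CountVectorizer().fit_transform(docs)
sim = cosine_similarity(X)[0, 1]
print(sim)   # ≈ 0.38 — only "the" and "in" contribute
```

As with Jaccard, the score comes entirely from function-word overlap, which is why this earns a ☹ on paraphrases.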

Pre-trained embeddings (such as GloVe) + Cosine Similarity 😊

Smooth Inverse Frequency
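Smooth Inverse Frequency (SIF) averages word vectors, but weights each word by a/(a + p(w)), so frequent words like "the" contribute almost nothing. A toy NumPy sketch (the word vectors and unigram probabilities below are made up for illustration; the full method also removes the first principal component across all sentence embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy word vectors and unigram probabilities (assumed for illustration)
vecs = {w: rng.normal(size=5) for w in ["sky", "blue", "the", "is"]}
p = {"sky": 0.001, "blue": 0.002, "the": 0.05, "is": 0.04}
a = 1e-3   # the smoothing constant from the SIF paper

def sif_embedding(tokens):
    # weighted average: rare content words dominate, stop words fade out
    weights = np.array([a / (a + p[t]) for t in tokens])
    M = np.array([vecs[t] for t in tokens])
    return weights @ M / len(tokens)

s = sif_embedding(["the", "sky", "is", "blue"])
```

Two SIF sentence vectors would then be compared with plain cosine similarity, as above.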

3. Latent Semantic Indexing (LSI)
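LSI factorises the term-document matrix with a truncated SVD, mapping documents into a low-dimensional latent "topic" space where similarity is computed. A sketch using scikit-learn's TruncatedSVD as a stand-in for classic LSI (the three documents are from the corpus above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The sky is blue and beautiful.",
        "Love this blue and beautiful sky!",
        "The quick brown fox jumps over the lazy dog."]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
# project documents into a 2-dimensional latent space
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
sims = cosine_similarity(Z)
print(sims)   # docs 0 and 1 end up close; doc 2 stays far away
```

Similarity is then cosine in the latent space rather than in the raw term space, which lets documents match through shared latent dimensions even without exact word overlap.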

4. Word Mover’s Distance

I have two sentences:

Removing stop words:

Equal-Weight Distributions

A minimum work flow

Flow

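The "minimum work flow" above is an optimal-transport problem: move the probability mass of one sentence's words onto the other's at minimum total embedding distance. A self-contained sketch that solves that transport problem with SciPy's linear-programming solver on tiny made-up 2-D word vectors (in practice one would use gensim's `wmdistance` with real Word2Vec vectors):

```python
import numpy as np
from scipy.optimize import linprog

def wmd(E1, d1, E2, d2):
    """Word Mover's Distance between two weighted bags of word vectors.

    E1 (n, dim), E2 (m, dim): word embeddings; d1, d2: normalised weights.
    Minimises sum_ij T_ij * C_ij subject to the marginal constraints.
    """
    n, m = len(d1), len(d2)
    C = np.linalg.norm(E1[:, None, :] - E2[None, :, :], axis=2)
    A_eq, b_eq = [], []
    for i in range(n):               # each source word ships all its mass
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1
        A_eq.append(row); b_eq.append(d1[i])
    for j in range(m):               # each target word receives its mass
        col = np.zeros(n * m); col[j::m] = 1
        A_eq.append(col); b_eq.append(d2[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# toy 2-D embeddings (made up): imagine "speaks"≈"greets", "media"≈"press"
E1 = np.array([[0.0, 1.0], [1.0, 0.0]])
E2 = np.array([[0.1, 1.0], [1.0, 0.1]])
d = wmd(E1, [0.5, 0.5], E2, [0.5, 0.5])
print(d)   # → 0.1: each word travels to its nearest neighbour
```

Because the cost lives in embedding space, synonymous words are cheap to move, which is why WMD handles the Obama/President example where Jaccard fails.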

5. LDA with Jensen-Shannon distance

Amazing ❤ https://www.kaggle.com/ktattan/lda-and-document-similarity

Beyond Cosine: A Statistical Test.
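Once LDA represents each document as a probability distribution over topics, documents can be compared with the Jensen-Shannon distance, a symmetric, bounded alternative to KL divergence. A sketch using SciPy (the topic distributions below are made up for illustration; in practice they come from a fitted LDA model):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# per-document topic distributions (assumed, as if from LDA)
doc_a = np.array([0.7, 0.2, 0.1])   # mostly topic 0
doc_b = np.array([0.6, 0.3, 0.1])   # similar topic mix
doc_c = np.array([0.1, 0.1, 0.8])   # mostly topic 2

ab = jensenshannon(doc_a, doc_b, base=2)   # small distance
ac = jensenshannon(doc_a, doc_c, base=2)   # much larger
print(ab, ac)
```

Documents are then ranked as "similar" when their topic mixtures are close, regardless of exact word overlap.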

6. Variational Auto Encoder

https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
http://blog.qure.ai/notes/using-variational-autoencoders

7. Pre-trained sentence encoders

8. Siamese Manhattan LSTM (MaLSTM)
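The "Manhattan" in MaLSTM is its scoring function: the two LSTM branches each encode a sentence into a vector h1, h2, and similarity is exp(−‖h1 − h2‖₁), which lies in (0, 1]. A NumPy sketch of just that scoring function (the shared-weight LSTM encoders themselves are omitted; the vectors below are made up):

```python
import numpy as np

def malstm_similarity(h1, h2):
    # exp of the negative L1 (Manhattan) distance between the two
    # sentence encodings; identical encodings score exactly 1.0
    return np.exp(-np.sum(np.abs(h1 - h2)))

h1 = np.array([0.2, -0.1, 0.4])
h2 = np.array([0.2, -0.1, 0.4])
same = malstm_similarity(h1, h2)                      # → 1.0
diff = malstm_similarity(h1, np.array([1.0, 1.0, 1.0]))  # smaller
print(same, diff)
```

During training the whole Siamese network is optimised so that duplicate question pairs get scores near 1 and unrelated pairs near 0.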

http://www.erogol.com/duplicate-question-detection-deep-learning/

9. A word about Knowledge-based Measures
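Knowledge-based measures score word pairs by their positions in a taxonomy such as WordNet. As a self-contained illustration (a toy hand-built is-a hierarchy, not WordNet itself), Wu-Palmer similarity is 2·depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest shared ancestor:

```python
# toy is-a hierarchy (child -> parent); a tiny stand-in for WordNet
parent = {"dog": "canine", "wolf": "canine", "canine": "mammal",
          "cat": "feline", "feline": "mammal", "mammal": "animal",
          "animal": None}

def path_to_root(w):
    path = [w]
    while parent[w] is not None:
        w = parent[w]
        path.append(w)
    return path            # e.g. dog -> canine -> mammal -> animal

def depth(w):
    return len(path_to_root(w))   # the root has depth 1

def wu_palmer(a, b):
    ancestors_a = set(path_to_root(a))
    lcs = next(w for w in path_to_root(b) if w in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("dog", "wolf"))  # → 0.75: share the deep node "canine"
print(wu_palmer("dog", "cat"))   # → 0.5:  share only "mammal"
```

NLTK's WordNet interface provides the real thing (`synset.wup_similarity`, `path_similarity`), computed over the full WordNet graph rather than a toy dictionary.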

SOURCES
