Semantic “Similar Sentences” with Your Own Dataset — NLP

Shankar Ganesh Jayaraman
2 min readApr 17, 2020

Finding similar sentences is an easy task nowadays, thanks to methods from Natural Language Processing (NLP).

Before going further, here are some basics to understand:

  • NLP is a field of computer science concerned with enabling programs to understand human language.
  • NLP is a component (or branch) of Artificial Intelligence (AI).
  • Machine Learning (ML) is a subset of AI: algorithms that build a mathematical model from sample data.

Finding a similar sentence can be done with the embedding technique in NLP. In this technique, the given sentences are first split into tokens (tokenization) and then represented as numeric vectors. The vector representation lets us find the nearest word or sentence in the trained model using cosine similarity or another distance metric.
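To make the cosine-similarity idea concrete, here is a minimal sketch in pure Python (no gensim). The three-dimensional vectors are made-up illustrations, not real embeddings; actual Doc2Vec vectors have as many dimensions as the model's vector size.

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "sentence vectors" (illustrative only).
v1 = [0.2, 0.8, 0.4]
v2 = [0.25, 0.75, 0.5]   # points in a similar direction -> score near 1.0
v3 = [-0.6, 0.1, -0.9]   # points in a different direction -> low/negative score

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))

Vectors that point in nearly the same direction score close to 1.0, which is why similar sentences end up with high similarity scores in the output later in this article.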

In this article, we will use the “gensim” Python package and its Doc2Vec model to find similar sentences.

Prerequisites on your machine:

pip install gensim
pip install nltk

Consider training the model with your own dataset. Create a file called “data.py” with the content below:

# data.py

data = [
    'Happy to help!',
    'Glad I could help!',
    'See you soon!',
    'How can I help you today?',
    'Anything I can help with today?',
    'What can I help with today?',
    'How can I help?',
    'Anything else before we wrap up?',
    'Anything else I can help you with?',
    'Anything else I can tell you?',
    'Anything else I can clear up?'
]

Create a file called “train.py” with the content below and start training with gensim:

# train.py

import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

from data import data

# word_tokenize needs NLTK's "punkt" tokenizer data.
nltk.download('punkt')

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

# Hyperparameters
max_epochs = 500
vec_size = 150
alpha = 0.03
minimum_alpha = 0.0025
reduce_alpha = 0.0002

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=minimum_alpha,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)

# Train the model, decaying the learning rate each pass.
# (model.epochs replaces the deprecated model.iter of older gensim versions.)
for epoch in range(max_epochs):
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    model.alpha = max(model.alpha - reduce_alpha, minimum_alpha)
    model.min_alpha = model.alpha

# Save the model.
model.save("similar_sentence.model")

Create a file called “similar_sentence.py”. To find a similar sentence, use the code below:

# similar_sentence.py

from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

from data import data

model = Doc2Vec.load("similar_sentence.model")

def output_sentences(most_similar):
    print('\n')
    for label, index in [('MOST', 0), ('SECOND-MOST', 1),
                         ('MEDIAN', len(most_similar) // 2),
                         ('LEAST', len(most_similar) - 1)]:
        print(u'%s %s: %s\n' % (label, most_similar[index][1],
                                data[int(most_similar[index][0])]))
    print('=====================\n')

seed_text = 'Is there anything else?'
# Tokenize the same way as during training (lowercase + word_tokenize).
tokens = word_tokenize(seed_text.lower())
vector = model.infer_vector(tokens)
# In gensim 4.x this is model.dv; older versions call it model.docvecs.
most_similar = model.dv.most_similar([vector])

output_sentences(most_similar)

Run the scripts in the following order:

python train.py
python similar_sentence.py # replace the seed_text with your sentence

Running the above with the seed sentence 'Is there anything else?' produces the output below.

# Output 

MOST 0.14213204383850098: Glad I could help!
SECOND-MOST 0.13624146580696106: Happy to help!
MEDIAN 0.05883193761110306: Anything I can help with today?
LEAST -0.039655789732933044: See you soon!

Demo time: the Google Colab notebook link below is provided for your reference. This is a very basic training setup, so the results will have low accuracy given the small dataset.

Demo: https://colab.research.google.com/drive/1W5yvUDjgidfiwIgo9ZMiztbaLyYOW2V5
