Using NLP to create suggestions

Ritwick Raj
AnitaB.org Open Source
Aug 11, 2020 · 2 min read

Natural Language Processing (NLP) is a branch of statistical learning that deals with language and its interpretation. From sentiment analysis to trend classification, NLP is widely used today to target audiences and improve the user experience.

During the last week of phase 2 of GSoC’20, I worked on generating meetup suggestions for users. Whenever we have to do any statistical analysis, be it in image processing or Natural Language Processing, the first thing we need is data.

Deciding the way out

The Portal being in its development phase, there was not much user-interaction data available for the website, which is what big companies such as Facebook or Google use to generate suggestions.

So, using deep learning models was not possible due to the lack of data. The immediate next option was to use statistical methods rather than deep learning models.

Pre-processing Data for Accurate Results

A meetup description contains various stop words and punctuation that need to be removed from the data, so that when the word tokens are created they don’t contaminate the scores. The following steps helped preprocess the data.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# description1 holds one meetup's description string
stop_words = set(stopwords.words('english'))
word_tokens1 = word_tokenize(description1)
filtered_sentence1 = [w for w in word_tokens1 if w not in stop_words]
data1 = ' '.join(filtered_sentence1)

The stop words were taken from NLTK’s built-in stop-word corpus.

Generating Keywords out of Meetup Description using TF-IDF

Using scikit-learn, I first attempted to extract the keywords from each description and mark them as tags. Then, by comparing them with the tags from other meetup descriptions, I could fetch the closest meetups. The following snippets perform these steps:

  • Creating the word vectors:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.85, stop_words=list(stop_words), max_features=10000)
word_count_vector = cv.fit_transform(docs)  # docs: the preprocessed meetup descriptions
  • Using scikit-learn’s TfidfTransformer to compute the IDF (inverse document frequency):

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
  • Computing the TF-IDF scores and extracting the top keywords (the two helper functions used here are sketched after this list):

# generate tf-idf for the given document
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

# map column indices in the vector back to actual words
feature_names = cv.get_feature_names_out()  # cv.get_feature_names() on older scikit-learn

# sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())

# extract only the top n; n here is 10
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)
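The snippet above relies on two helper functions for the sorting and extraction. A minimal sketch of what they might look like (illustrative implementations, not necessarily the Portal’s exact code):

def sort_coo(coo_matrix):
    # pair each column index with its tf-idf score, highest score first
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # map the top-n column indices back to their words and scores
    return {feature_names[idx]: round(score, 3)
            for idx, score in sorted_items[:topn]}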

And then finally, matching the tags and keywords across meetups would give the nearest results, as sketched below.
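One plausible way to do that matching, assuming each pair of meetups is scored by the overlap (Jaccard index) of their keyword sets:

def keyword_overlap(keywords_a, keywords_b):
    # keywords_a / keywords_b: dicts returned by extract_topn_from_vector
    a, b = set(keywords_a), set(keywords_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0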

The Inaccuracy in Results

The major problem with this technique was that, most of the time, the keywords from different descriptions were too distinct from one another to match.

So then I shifted to the next method of matching similar documents, using Gensim and NLTK.

The steps up to TF-IDF remain the same, but for the comparison we build Gensim’s similarity index, average out the similarity scores, and return a similarity percentage.
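Below is a minimal sketch of such a comparing function, using Gensim’s Dictionary, TfidfModel, and SparseMatrixSimilarity (the function name and the averaging detail are assumptions based on the description above):

from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize

def similarity_percentage(query_description, all_descriptions):
    # tokenize the preprocessed meetup descriptions
    docs = [word_tokenize(d.lower()) for d in all_descriptions]

    # build a dictionary, a bag-of-words corpus, and a tf-idf model over it
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    tfidf = models.TfidfModel(corpus)

    # index the tf-idf corpus for similarity queries
    index = similarities.SparseMatrixSimilarity(
        tfidf[corpus], num_features=len(dictionary))

    # score the query description against every document
    query_bow = dictionary.doc2bow(word_tokenize(query_description.lower()))
    sims = index[tfidf[query_bow]]

    # average out the similarity scores and return a percentage
    return float(sims.mean()) * 100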

This is how, using meetup descriptions, it became possible to generate suggestions.

Sources: https://github.com/raszidzie/Resemblance
