Fun with Topic Extraction

Rohit Arora · Analytics Vidhya · Aug 3, 2021

Introduction

I keep switching my experiments between Computer Vision, Natural Language Processing, and Machine Learning. Last week I found myself fiddling with Natural Language Processing, and I thought of building a simple topic extractor to pull the topics out of an input body of text.

The idea was to build something that even a 10-year-old could understand and quickly reproduce on their laptop. In my opinion, getting something to work on your own machine is far more satisfying than just skimming through an article.

PS — I also have a bonus addition to this article to turn the extracted topics into an English sentence. So, stay hooked!

Let’s get rolling!

Pre-Reading

As with most articles, I expect you to understand a few concepts (listed below) beforehand to fully appreciate the code we will write. Don’t be heartbroken if you don’t know these concepts already: our ultimate motive is to build a working model and spark an interest in Natural Language Processing, and the deeper learning can always follow later. Here’s the list (with a brief overview of each):

  • TF-IDF Vectorization: TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: how many times a word appears in a document (the term frequency) and the inverse document frequency of the word across the set of documents. We will make this concrete with a toy example right after this list.
  • Singular Value Decomposition (and Principal Component Analysis): Singular value decomposition (SVD) and principal component analysis (PCA) are two eigenvalue methods used to reduce a high-dimensional data set into fewer dimensions while retaining the important information.
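To make the TF-IDF definition concrete, here is a tiny hand-rolled version. This is a sketch for intuition only; scikit-learn’s TfidfVectorizer, which we use later, applies smoothing and normalisation, so its exact numbers will differ.

import math

docs = [
    "fintech firms offer financial services",
    "banks offer financial products",
]

def tf_idf(word, doc, docs):
    words = doc.split()
    tf = words.count(word) / len(words)             # term frequency in this document
    df = sum(1 for d in docs if word in d.split())  # number of documents containing the word
    idf = math.log(len(docs) / df)                  # inverse document frequency
    return tf * idf

print(tf_idf("fintech", docs[0], docs))    # ~0.14: distinctive to the first document
print(tf_idf("financial", docs[0], docs))  # 0.0: appears in every document, so no weight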

The Steps

To reiterate, the task we would like to accomplish is extracting the key topics from any given text. Here is the laundry list of to-dos with their corresponding code snippets where applicable:

  • Zero in on the sample text you would like to work on. I will pick the text of the Wikipedia article on “Fintech”.
  • Convert the text into a list of sentences using a sentence tokenizer. We use sent_tokenize from the NLTK package for this.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the sentence tokenizer model

preface = "enter the input text here"  # paste the Wikipedia text on "Fintech" here
preface_tokens = sent_tokenize(preface)
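For instance, on a hypothetical two-sentence input the tokenizer behaves like this:

sample = "Fintech firms offer new financial services. Many banks partner with them."
print(sent_tokenize(sample))
# ['Fintech firms offer new financial services.', 'Many banks partner with them.']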
  • Next, we will clean the input text: get rid of extra whitespace, convert words to lower case, and remove punctuation.
import re
import string

def clean_text(s):
    s = s.lower()            # lower-case everything
    s = " ".join(s.split())  # collapse extra whitespace
    s = re.sub(f'[{re.escape(string.punctuation)}]', '', s)  # strip punctuation
    return s
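The article does not show the helper being applied, but presumably each tokenized sentence is passed through it, along these lines:

preface_tokens = [clean_text(s) for s in preface_tokens]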
  • Then, we will remove the stopwords. Stopwords are English words that do not add much meaning to a sentence, for example the, he, and have; they can safely be ignored without sacrificing the meaning of the sentence.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword list

def remove_stop_words(s):
    stop_words = set(stopwords.words('english'))
    words = [w for w in s.split() if w.lower() not in stop_words]  # keep non-stopwords only
    return " ".join(words)
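Again, each sentence presumably gets the same treatment:

preface_tokens = [remove_stop_words(s) for s in preface_tokens]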
  • Next, we will reduce words to their base form, so that, ideally, agree, agreed, and agreeing all map to agree, using lemmatization. For our purpose, we will use WordNetLemmatizer from the NLTK package.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
# lemmatize word by word; passing whole sentences (as the original snippet did) leaves them unchanged
preface_tokens = [" ".join(lemmatizer.lemmatize(w) for w in s.split()) for s in preface_tokens]
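One caveat worth knowing: WordNetLemmatizer treats every word as a noun unless told otherwise, so verb forms such as agreed only collapse to agree when you pass the part of speech explicitly:

lemmatizer.lemmatize("agreed")           # 'agreed' (treated as a noun, left unchanged)
lemmatizer.lemmatize("agreed", pos="v")  # 'agree'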
  • Convert the text into numeric tokens. We need to do this because computers don’t work with raw text; they work with numbers, so we need a way to represent the text numerically. We will use the TF-IDF vectorizer for this. The other choice could have been CountVectorizer, but TF-IDF tends to work better here because it down-weights words that appear everywhere, so we will stick with it.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)  # token_pattern=None silences the unused-pattern warning
corpus_transformed = tfv.fit_transform(preface_tokens)
  • Once we have the TF-IDF matrix, we will compute word scores using Singular Value Decomposition. Based on these scores we will identify the top n words; these top-scoring words represent the topics the input text talks about. We will experiment with n = 5 to see how our output looks.
from sklearn import decomposition

# project the TF-IDF matrix onto its top singular vectors (latent semantic analysis)
svd = decomposition.TruncatedSVD(n_components=10)
corpus_svd = svd.fit(corpus_transformed)

# score each vocabulary word by its weight in the first component
feature_scores = dict(
    zip(
        tfv.get_feature_names_out(),  # get_feature_names() on scikit-learn < 1.0
        corpus_svd.components_[0]
    )
)

# keep the five highest-scoring words as our topics
topic_output = sorted(
    feature_scores, key=feature_scores.get, reverse=True
)[:5]
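A side note: TruncatedSVD gives us ten components here, and each one can be read as its own topic direction. If you are curious, you can print the top words per component to see several topic groupings instead of one:

# top 5 words for each SVD component (each component ~ one latent topic)
for i, component in enumerate(corpus_svd.components_):
    scores = dict(zip(tfv.get_feature_names_out(), component))
    print(f"Topic {i}:", sorted(scores, key=scores.get, reverse=True)[:5])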

The output (for the Wikipedia text on the term ‘Fintech’):

['financial', 'technology', 'fintech', 'companies', 'services']

I think the simple strategy we used to extract topics from the input text has worked well. Indeed, we have been able to find the relevant topics the input text talks about. But this is just a boring list of words; let’s try to spice it up a little in the bonus section below.

BONUS!!!

We have successfully extracted a list of relevant topics from the input text. Is there a way to easily convert these words into a sentence that makes sense? There is indeed a Python package for that, called keytotext.

from keytotext import pipeline

def keytosent(s):
    # s is a list of keywords, e.g. ['financial', 'technology', ...]
    nlp = pipeline("k2t-base")  # downloads the pretrained model on first use
    params = {
        "do_sample": True,
        "num_beams": 3,
        "no_repeat_ngram_size": 4,
        "early_stopping": True
    }
    return nlp(s, **params)
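keytotext is not part of NLTK or scikit-learn, so install it first with pip install keytotext. With the topic list we extracted earlier, the call would look something like this:

# feed the extracted topics into the keyword-to-text model
print(keytosent(topic_output))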

The output sentence:

Fintech is a company that provides services in the financial field of banking.

Not bad — I think keytotext did a decent job of converting a boring list of words into something that made sense.

Let’s see if you can fine-tune the parameters to further improve the output sentence. Please try it on your own and share your findings in the comments section.

I do hope that you enjoyed this article. I’ll be really grateful if you leave a rating or some feedback below.

Credits:

  1. Approaching (Almost) Any Machine Learning Problem by Abhishek Thakur
  2. Prakhar Mishra, Generating sentences from keywords using transformers in NLP: https://medium.com/mlearning-ai/generating-sentences-from-keywords-using-transformers-in-nlp-e89f4de5cf6b



Rohit works as a Dev Manager with a UK-based investment bank. He is a deep learning enthusiast conjuring up ML-based cost optimisation solutions in finance and banking.