Similarity Queries and Text Summarization in NLP

Aravind CR · The Startup · Jun 14, 2020

Before diving directly into similarity queries, it is important to know what a similarity metric is.

Similarity Metric

Similarity metrics are mathematical constructs that are particularly useful in NLP, especially in information retrieval. We can understand a metric as a function that defines the distance between each pair of elements of a set, or between vectors. We can see how this is useful: we can compare how similar 2 documents are based on that distance. If a distance function returns a lower value, the 2 documents are similar, and vice versa.

We can technically compare any 2 elements in the set; this also means that we can compare 2 sets of topics created by a topic model (if you don't know what a topic model is, please check out my previous blog, Topic modeling-LDA).

Most of us are aware of at least one distance metric: the Euclidean metric. It is a distance metric we come across in high-school mathematics, typically used to calculate the distance between two points in a 2-dimensional (XY) space: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2).

Gensim, scikit-learn, and most other ML packages recognize the importance of distance metrics and implement them out of the box.
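As a quick, made-up illustration (not from the notebook below), here is how document vectors can be compared with SciPy's distance functions; the vectors are toy topic distributions:

```python
from scipy.spatial import distance

# Made-up document vectors, e.g. topic distributions over 3 topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

# Lower distance means more similar documents.
print(distance.euclidean(doc_a, doc_b))  # small value: similar
print(distance.euclidean(doc_a, doc_c))  # large value: dissimilar

# Cosine distance (1 - cosine similarity) is another common choice in NLP.
print(distance.cosine(doc_a, doc_b))
```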

Similarity queries

The implementation below demonstrates that we have the capability to compare two documents, so it is now possible for us to set up an algorithm that extracts the most similar documents for an input query.

Index each of the documents, then search for the lowest distance values returned between the corpus and the query, and return the documents with the lowest distances; these are the most similar. Gensim has built-in structures to do this document-similarity task, as sketched below.
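As an illustration, here is a minimal sketch of that index-and-query pattern, using Gensim with a TF-IDF representation over a made-up toy corpus. Note that Gensim's MatrixSimilarity returns cosine similarity scores, where higher means more similar (rather than raw distances, where lower means more similar):

```python
from gensim import corpora, models, similarities

# Tiny made-up corpus; each document is pre-tokenized.
documents = [
    "the cat sat on the mat".split(),
    "dogs and cats are friendly pets".split(),
    "the stock market crashed today".split(),
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Any vector representation works here; TF-IDF keeps the sketch simple.
tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# Convert the query to the same representation, then rank all documents.
query_bow = dictionary.doc2bow("are cats good pets".split())
sims = index[tfidf[query_bow]]
print(sorted(enumerate(sims), key=lambda x: -x[1]))  # (doc id, score), best first
```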

The tutorial on the Gensim website performs a similar experiment, but on the Wikipedia corpus; it is a useful demonstration of how to conduct similarity queries on much larger corpora and is worth checking out if you are dealing with a very large corpus.

In the examples, we used LDA models both for the distance calculations and to generate the index for the similarities. We can, however, use any vector representation of documents to generate this; it's up to us to decide which one is most effective for our use case.

The notebook attached below contains samples for both similarity metrics and queries, with a detailed explanation of each step. Run the cells in the notebook before moving on to the next topic.

Notebook- Similarity metrics and queries

Text Summarization

Summarization is the process of distilling the most important information from a source or sources to produce an abridged version for particular users and tasks.

Text summarization is the task of condensing a piece of text into a shortened form while preserving its content and overall meaning. It is important because the amount of textual data being generated grows dramatically day by day, and it becomes difficult for people to read through large blocks of text; often they simply don't. This is where text summarization comes to the rescue and makes the reader's work easier.

The idea of text summarization is to find the subset of the data that contains the information of the entire set, though this sometimes results in a loss of information.

Main idea (a sketch of this pipeline in code follows the list) —

  • Text processing.
  • Word frequency distribution: how many times each word appears in the document.
  • Score each sentence based on the words it contains and the frequency table.
  • Build the summary by joining every sentence above a certain score limit.
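Here is a compact, self-contained sketch of that pipeline in plain Python. The sentence splitting, tokenization, and threshold are deliberately simplistic placeholders rather than a production summarizer:

```python
import re
from collections import Counter

def frequency_summarize(text, threshold=1.2):
    # 1. Text processing: split into sentences and extract lowercase words.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())

    # 2. Word frequency distribution, normalized by the most frequent word.
    freq = Counter(words)
    max_freq = max(freq.values())
    for w in freq:
        freq[w] /= max_freq

    # 3. Score each sentence: average normalized frequency of its words.
    scores = []
    for sent in sentences:
        sent_words = re.findall(r"[a-z']+", sent.lower())
        score = sum(freq[w] for w in sent_words) / len(sent_words) if sent_words else 0.0
        scores.append(score)

    # 4. Keep sentences scoring above a limit (here, 1.2x the average),
    #    joined back together in their original order.
    avg = sum(scores) / len(scores)
    return " ".join(s for s, sc in zip(sentences, scores) if sc > threshold * avg)
```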

The 2 approaches to text summarization are —

  • Extractive summarization (selects a subset of sentences/objects from the original collection).
  • Abstractive summarization (generates paraphrases; sentences in the summary might not appear in the original text).

Summarizing text

In text analysis, it is useful to summarize large bodies of text, either to get a brief overview of the text before analyzing it deeply or to identify keywords in it.

We will not be building our own text summarization pipeline, but rather focus on using the built-in summarization API which Gensim offers.

It is also important to know that Gensim doesn't create its own sentences, but rather extracts key sentences from the text we run the algorithm on (extractive summarization). The summarization is based on the TextRank algorithm (a graph-based ranking model for text processing).
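A minimal sketch of that built-in API, assuming Gensim 3.x (the gensim.summarization module was removed in Gensim 4.0), with made-up input text; summarize() expects a text of several sentences:

```python
# Requires Gensim 3.x; gensim.summarization was removed in Gensim 4.0.
from gensim.summarization import summarize, keywords

text = (
    "Text summarization condenses a document while preserving its meaning. "
    "Extractive methods select existing sentences instead of generating new ones. "
    "Gensim ranks sentences with a TextRank-style graph algorithm. "
    "The ratio parameter controls what fraction of sentences is kept. "
    "Keywords can be extracted from the same text as well."
)

print(summarize(text, ratio=0.4))  # extract roughly 40% of the sentences
print(keywords(text))              # newline-separated keywords
```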

Graph-based ranking algorithms are essentially a way of deciding the importance of a vertex within a graph, based on global information recursively drawn from the entire graph. The basic idea implemented by a graph-based ranking model is that of “voting” or “recommendation”. When one vertex links to another, it is basically casting a vote for that other vertex. The higher the number of votes cast for a vertex, the higher the importance of that vertex. Moreover, the importance of the vertex casting the vote determines how important the vote itself is, and this information is also taken into account by the ranking model. Hence, the score associated with a vertex is determined by the votes that are cast for it and by the scores of the vertices casting these votes (you can find this explanation in the research paper).
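To make the voting idea concrete, here is a small illustrative sketch (a made-up link graph, not real sentence data) using NetworkX's PageRank, which computes exactly this kind of recursive voting score:

```python
import networkx as nx

# Made-up directed graph: an edge u -> v is a "vote" cast by u for v.
G = nx.DiGraph()
G.add_edges_from([
    ("A", "B"), ("C", "B"), ("D", "B"),  # B collects many votes
    ("B", "E"),                          # a vote from the important B boosts E
    ("D", "A"),
])

scores = nx.pagerank(G)  # recursive voting: votes weighted by the voter's score
for node, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(node, round(score, 3))
```

In TextRank, the vertices are sentences and the edges are weighted by how much content two sentences share, so the highest-scoring sentences become the extractive summary.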

Implementation

Run the cells in the notebook for a clear understanding.

Notebook-Summarization

The above code sample performs extractive summarization of text. There are also deep learning approaches and pretrained models (which have achieved state-of-the-art results) for abstractive summarization of text; check out their implementations to dig deeper into the concept.

A State-of-the-Art Model for Abstractive Text Summarization by Google AI.

This is one of the recent research efforts in abstractive text summarization, where the model is pre-trained to predict masked sentences for summarization.

Paper: https://lnkd.in/gtaNRF7

GitHub: https://lnkd.in/gKJJ2c9

Read more here: https://lnkd.in/g--hiY6

  • Experiments demonstrate it achieves state-of-the-art performance on 12 downstream datasets measured by ROUGE scores.

[Figure: A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.]

Conclusion:

Throughout the blog we saw how basic mathematical and information-retrieval methods can be used to help identify how similar or dissimilar 2 documents are, and how text summarization can come in handy for various tasks (financial research, question answering and bots, medical cases, books and literature, email overload, science and R&D, patent research, helping disabled people, programming languages, automated content creation). Check out the applications of automatic summarization in the enterprise.

— — — — — — — — — — — Thank you — — — — — — — — — — —

Sometimes later becomes never. So do it now.

Keep Learning……………………
