The Keyword Quest: Exploring Automatic Keyword Extractors

Sinhasagar · Published in Accredian · 4 min read · Jun 22, 2023

Introduction

This article is in continuation of my previous article on The Keyword Quest: Techniques and Challenges, which you can check out here:

Going ahead, I will briefly explore and demonstrate the usage of spaCy, rake-nltk, SUMMA (a TextRank-based summarization library), YAKE (Yet Another Keyword Extractor), and KeyBERT for keyword retrieval.

spaCy

spaCy’s doc.ents attribute can act as a preliminary technique for keyword extraction. It provides a starting point for identifying important named entities in a document, which may be relevant for extracting important words and phrases, depending on your specific use case.

However, it’s important to note that relying solely on named entities may not capture the full range of keywords and concepts present in the document. To obtain a more comprehensive set of keywords, you might want to incorporate other techniques, such as part-of-speech tagging, noun phrase extraction, or frequency-based approaches.
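As a minimal sketch of this idea (the sample text is illustrative, and the `en_core_web_sm` pipeline must be downloaded separately with `python -m spacy download en_core_web_sm`), entity-based keyword candidates can be pulled straight from `doc.ents`:

```python
import spacy

def extract_entity_keywords(nlp, text):
    """Return the unique named-entity strings found in `text`."""
    doc = nlp(text)
    # doc.ents holds the spans detected by the pipeline's NER component
    return sorted({ent.text for ent in doc.ents})

if __name__ == "__main__":
    # Assumes the small English model is installed
    nlp = spacy.load("en_core_web_sm")
    sample = "Apple acquired a London-based startup for $1 billion in June 2023."
    print(extract_entity_keywords(nlp, sample))
```

Any pipeline with an NER component (or an `entity_ruler` with custom patterns) can be passed in place of the pretrained model.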

RAKE-NLTK

RAKE (Rapid Automatic Keyword Extraction) is an unsupervised keyword extraction algorithm designed to identify important keywords or key phrases in a given text. It aims to rapidly extract keywords by analyzing word co-occurrence patterns.

The algorithm works by first splitting the text into individual words or phrases and then building a candidate keyword list based on patterns of word co-occurrence. It assigns scores to each candidate keyword based on its frequency and the frequency of its constituent words. The algorithm also considers the degree to which a candidate keyword represents a distinct phrase by examining word boundaries and stopwords.

By analyzing the scores of candidate keywords, RAKE ranks them and selects the top keywords or key phrases as the final extracted keywords. The algorithm is relatively simple yet effective, offering a quick and automated approach for keyword extraction without requiring any training data.

SUMMA

Summa is a Python library that implements the TextRank algorithm for keyword extraction. TextRank is a graph-based approach that assigns weights to words or terms based on their importance and relationships within a text. By iteratively updating the weights, Summa identifies the most significant terms as keywords. It provides an easy way to extract keywords from a given text, aiding in information retrieval and analysis tasks.

YAKE

YAKE (Yet Another Keyword Extractor) is a keyword extraction algorithm that selects the most important keywords from a text using the statistical features of the words. It is designed to be a simple yet effective method for extracting keywords.

The YAKE algorithm follows a two-step process: candidate selection and candidate ranking.

  • Candidate selection: It splits the text into candidate phrases based on statistical features. It considers word sequences that are formed by combining adjacent words using delimiters like white space, line breaks, commas, or periods. The maximum length of the phrases can be adjusted according to the desired keyword length.
  • Candidate ranking: YAKE assigns scores to the candidate phrases based on statistical features such as word frequency, term position, and context diversity. It gives higher importance to phrases that occur more frequently, appear at the beginning of the document, and occur in diverse contexts. YAKE uses a formula that combines these features to calculate a score for each candidate phrase.

YAKE provides a customizable approach to keyword extraction, allowing users to control the number of extracted keywords and adjust the features according to their needs. It can be used for various text analysis tasks, including document summarization, information retrieval, and content analysis.

KeyBERT

KeyBERT is a simple yet powerful deep-learning-based tool for extracting keywords and phrases. It utilizes BERT embeddings and employs cosine similarity to identify sub-phrases within a document that are most similar to the document itself.

The process involves extracting document embeddings using BERT to obtain a representation at the document level. Next, word embeddings are extracted for N-Gram words or phrases. Finally, cosine similarity is utilized to determine the words or phrases that exhibit the highest similarity to the document. These highly similar words or phrases can then be considered as the most descriptive keywords for the entire document.

I used the following repository as a starting point for seeing how KeyBERT can be applied to this use case.

Conclusion

This article delved into five libraries, namely spaCy, rake-nltk, SUMMA, YAKE, and KeyBERT, that offer unique methodologies for extracting keywords and provide a diverse range of approaches to uncovering the most important terms within a given text. By exploring them, readers can gain insight into distinct approaches for unveiling the essence of a text, whether through co-occurrence analysis, statistical features, contextual similarity, or graph algorithms. With these options at their disposal, they can select the library that aligns best with their specific requirements, enabling effective and tailored keyword extraction.
