Keyword Extractor YAKE!

Aditya Mishra
May 7, 2022

One major problem that arises when dealing with textual data is summarization or, more precisely, identifying the top words that represent the whole text at hand. Extracting keywords from texts has become a challenge for individuals and organizations as information grows in complexity and size. The need to automate this task so that texts can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Despite these advances, there is a clear lack of multilingual online tools that automatically extract keywords from single documents. A lot of questions may have come to mind after reading these statements. What are the real-world applications of text summarization? What do “top words” mean here? How can we extract the words from a big chunk of text that represent the whole text semantically? These are the main questions this article will answer.

Why do we require Text Summarization?

These are some use cases where automatic text summarization can be used across the enterprise:

1. Media monitoring

The problem of information overload and “content shock” has been widely discussed. Automatic summarization presents an opportunity to condense the continuous torrent of information into smaller pieces of information.

2. Newsletters

Many weekly newsletters take the form of an introduction followed by a curated selection of relevant articles. Summarization would allow organizations to further enrich newsletters with a stream of summaries (versus a list of links), which can be a particularly convenient format on mobile.

3. Search marketing and SEO

When evaluating search queries for SEO, it is critical to have a well-rounded understanding of what your competitors are talking about in their content. This has become particularly important since Google updated its algorithm and shifted focus towards topical authority (versus keywords). Multi-document summarization can be a powerful tool to quickly analyze dozens of search results, understand shared themes and skim the most important points.

4. Internal document workflow

Large companies are constantly producing internal knowledge, which frequently gets stored and under-used in databases as unstructured data. These companies should embrace tools that let them re-use already existing knowledge. Summarization can enable analysts to quickly understand everything the company has already done in a given subject, and quickly assemble reports that incorporate different points of view.

5. Standardizing Metadata

In any company or startup, a tool is needed to automate tasks such as recommending table names, column names and many other metadata elements. This is where keyword extraction could come in handy.

What do top words mean?

Given an input text, the top words are the ones that can represent the whole text semantically, that is, capture its meaning. Let’s take an example.

Let’s say this is the input text for which we need the top words:

Kalam was elected as the 11th president of India in 2002 with the support of both the ruling Bharatiya Janata Party and the then-opposition Indian National Congress. Widely referred to as the “People’s President”, he returned to his civilian life of education, writing and public service after a single term. He was a recipient of several prestigious awards, including the Bharat Ratna, India’s highest civilian honor.

Clearly, the above text is about India’s 11th President, A.P.J. Abdul Kalam. Let’s see which words can be used to represent the whole paragraph:

Kalam

11th president of India 2002

Bharatiya Janata Party

Indian National Congress

People’s President

Bharat Ratna

So, for any piece of text there are a few words that convey the meaning of the given corpus. In the example above, just by reading these words anyone can figure out what the author is talking about: in our case, Mr. Kalam, the 11th President of India, who was supported by both the BJP and the Congress and received the Bharat Ratna.

We have seen an example, but is there any existing technology that can produce a summary like the one above?

How can we extract top words from Textual Data?

The field of Natural Language Processing has seen major advances over time. One technique and library that can be used to gather top words from textual data is YAKE (Yet Another Keyword Extractor). There are many other Python libraries as well, such as RAKE, Gensim and KeyBERT, but in this article we will focus on understanding YAKE.

Yake! is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains or languages. Unlike other approaches, Yake! does not rely on dictionaries or thesauri, nor is it trained against any corpora. Instead, it follows an unsupervised approach that builds upon features extracted from the text itself, which makes it applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and situations where access to training corpora is either limited or restricted.

YAKE is an unsupervised approach for automatic keyword extraction using text features.

Main features:

  • Unsupervised approach
  • Corpus-Independent
  • Domain and Language Independent
  • Single-Document

Usage (Python)

Let’s see how to use it in Python.

Before running the code, install the yake package:

pip install yake

Run the code below to get the required output:
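A minimal sketch of typical usage, assuming the `KeywordExtractor` class and its `lan`, `n`, `dedupLim` and `top` parameters as exposed by recent releases of the yake package. The helper name `top_keywords` and its default values are choices made here for illustration.

```python
def top_keywords(text, language="en", max_ngram_size=3,
                 dedup_threshold=0.9, num_keywords=10):
    """Return the keywords YAKE extracts from `text`, best first.

    Recent yake releases return (keyword, score) pairs, where a LOWER
    score means a MORE relevant keyword.
    """
    # Third-party dependency (pip install yake); imported inside the helper
    # so that merely defining it does not require the package.
    import yake

    extractor = yake.KeywordExtractor(
        lan=language,              # language of the text
        n=max_ngram_size,          # longest keyword, in words
        dedupLim=dedup_threshold,  # similarity limit for de-duplication
        top=num_keywords,          # how many keywords to return
    )
    return extractor.extract_keywords(text)


# Example call (requires yake to be installed):
# for keyword, score in top_keywords("Kalam was elected as the 11th president of India in 2002."):
#     print(keyword, score)
```

Passing the paragraph about Kalam from earlier in the article as `text` should give a list comparable to the hand-picked top words, although the exact keywords and scores depend on the installed version.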

Output of the code above

How does it work?

The proposed system has 6 main components:

(1) Text pre-processing

First, we apply a pre-processing step which splits the text into individual terms whenever an empty space or a special character (e.g., line breaks, brackets, comma, period, etc.) delimiter is found.
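The splitting step can be sketched with a regular expression. This is only an illustration, not YAKE's actual tokenizer: the function name `split_into_terms` is made up here, and keeping apostrophes and hyphens inside a word (so that “People’s” stays one term) is an assumption of this sketch.

```python
import re

# A term is a run of word characters, optionally joined by an internal
# apostrophe or hyphen. Everything else (spaces, line breaks, brackets,
# commas, periods, ...) acts as a delimiter.
TERM_PATTERN = re.compile(r"\w+(?:['’-]\w+)*")


def split_into_terms(text):
    """Split `text` into individual terms at spaces and special characters."""
    return TERM_PATTERN.findall(text)
```

For example, `split_into_terms("Kalam was elected (in 2002), he said.")` returns `['Kalam', 'was', 'elected', 'in', '2002', 'he', 'said']`.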

(2) Feature extraction

Second, we devise a set of five features to capture the characteristics of each individual term. These are: (A) Casing; (B) Word Positional; (C) Word Frequency; (D) Word Relatedness to Context; and (E) Word DifSentence.

The first one, Casing, reflects the casing of a word, giving more weight to words that are capitalized or written as acronyms.

Word Positional gives more value to words occurring at the beginning of a document, based on the assumption that relevant keywords tend to concentrate there.

Word Frequency indicates how often the word occurs, giving a higher score to words that occur more often.

The fourth feature, Word Relatedness to Context, computes the number of different terms that occur to the left (resp. right) side of the candidate word. The more different terms co-occur with the candidate word (on both sides), the more meaningless the candidate word is likely to be.

Finally, Word DifSentence quantifies how often a candidate word appears within different sentences. Similar to Word Frequency, Word DifSentence gives more value to words that often occur in different sentences. Both features, however, are combined with Word Relatedness to Context, meaning that the more often words occur in different sentences the better, as long as they do not occur frequently with different words on the right or left side (which would resemble the behavior of stop words).
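To make the five features more concrete, here is a rough sketch that collects the raw counts each of them builds on. It does not reproduce YAKE's actual normalised formulas; the function name, the dictionary keys and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from statistics import median


def raw_term_statistics(text):
    """Collect, per lower-cased term, the raw counts behind the five features."""
    # Naive sentence split: ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    stats = defaultdict(lambda: {
        "frequency": 0,       # Word Frequency: total occurrences
        "capitalized": 0,     # Casing: occurrences starting with an uppercase letter
        "sentence_ids": [],   # Word Positional / DifSentence: where the term occurs
        "left": set(),        # Relatedness to Context: distinct left neighbours
        "right": set(),       # Relatedness to Context: distinct right neighbours
    })

    for sentence_id, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence)
        for i, word in enumerate(words):
            entry = stats[word.lower()]
            entry["frequency"] += 1
            if word[0].isupper():
                entry["capitalized"] += 1
            entry["sentence_ids"].append(sentence_id)
            if i > 0:
                entry["left"].add(words[i - 1].lower())
            if i + 1 < len(words):
                entry["right"].add(words[i + 1].lower())

    return {
        term: {
            "frequency": entry["frequency"],
            "capitalized": entry["capitalized"],
            # A low median sentence index means the term appears early on.
            "median_sentence": median(entry["sentence_ids"]),
            "distinct_sentences": len(set(entry["sentence_ids"])),
            "distinct_left": len(entry["left"]),
            "distinct_right": len(entry["right"]),
        }
        for term, entry in stats.items()
    }
```

A stop-word-like term such as “the” typically ends up with many distinct neighbours, which is exactly the signal that Word Relatedness to Context uses to penalise it.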

(3) Individual terms score

In the third step, we heuristically combine all these features into a single measure such that each term is assigned a score S(w). This weight feeds the process of generating keywords, which takes place in the fourth step. Here, we consider a sliding window of 3-grams, thus generating contiguous sequences of 1-, 2- and 3-gram candidate keywords. Each candidate keyword is then assigned a final score S(kw), such that the smaller the score, the more meaningful the keyword. The equation below formalizes this:
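For reference, the YAKE paper by Campos et al. gives the two scores in the following form. The subscripted W symbols are shorthand for the five features described in step 2, and TF(kw) stands for the number of times the candidate keyword occurs in the text; neither notation is defined elsewhere in this article.

```latex
S(w) = \frac{W_{Rel} \cdot W_{Position}}
            {W_{Case} + \dfrac{W_{Freq}}{W_{Rel}} + \dfrac{W_{DifSentence}}{W_{Rel}}}
\qquad
S(kw) = \frac{\prod_{w \in kw} S(w)}
             {TF(kw) \cdot \left( 1 + \sum_{w \in kw} S(w) \right)}
```

A high Word Relatedness to Context raises S(w), while casing, frequency and sentence spread lower it, which matches the rule that a smaller score means a more meaningful keyword.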

S(kw) score of a candidate keyword

(4) Candidate keywords list generation
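Step 3 describes sliding a window of up to three terms over the text to produce 1-, 2- and 3-gram candidates. A minimal sketch of that idea follows; the function name and the `max_size` parameter are hypothetical, and no filtering of any kind is applied to the candidates.

```python
def candidate_keywords(terms, max_size=3):
    """Return every contiguous run of 1 to `max_size` terms, left to right."""
    candidates = []
    for start in range(len(terms)):
        for size in range(1, max_size + 1):
            # Stop growing the window once it would run past the last term.
            if start + size <= len(terms):
                candidates.append(" ".join(terms[start:start + size]))
    return candidates
```

For example, `candidate_keywords(["Bharat", "Ratna", "award"])` returns `['Bharat', 'Bharat Ratna', 'Bharat Ratna award', 'Ratna', 'Ratna award', 'award']`.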

(5) Data Deduplication
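One plausible way to remove near-duplicate candidates is sketched below, under stated assumptions: candidates are visited from best to worst score, and a candidate is dropped when it is too similar to one that has already been kept. The similarity measure (`difflib.SequenceMatcher` from the standard library) and the `threshold` value are stand-ins chosen for illustration, not taken from YAKE itself.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(scored_keywords, threshold=0.9):
    """Keep only the best-scored keyword among near-identical candidates.

    `scored_keywords` is a list of (keyword, score) pairs where a lower
    score means a better keyword.
    """
    kept = []
    # Visit candidates from best (lowest score) to worst.
    for keyword, score in sorted(scored_keywords, key=lambda pair: pair[1]):
        is_duplicate = any(
            SequenceMatcher(None, keyword.lower(), other.lower()).ratio() > threshold
            for other, _ in kept
        )
        if not is_duplicate:
            kept.append((keyword, score))
    return kept
```

Because the surviving list is already ordered by ascending score, it can be read directly as a ranking, with the most meaningful keyword first.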

(6) Ranking
