Exploring Different Keyword Extractors — Statistical Approaches
Exploring Different Keyword Extractors is an ongoing series of three blogs, of which this is the first. It provides an introduction to keyword extraction and why it is important. I also go into the details of three statistical approaches for keyword extraction. The second blog will cover four graph-based approaches, and the third will cover different evaluation metrics along with a comparison of the statistical and graph-based approaches.
The pace at which data and information are being generated makes summarizing them a challenge. According to Netcraft's January 2020 Web Server Survey, there are over 1 billion websites today, with around 380 new websites being created every minute. Millions of people contribute to this ever-increasing size of the internet, whether through blog posts, news articles, comments, forum posts, or social media publications.
This huge amount of data and information is completely unstructured; that is, it doesn't come with any annotations or descriptive text. This makes summarizing the information at hand a particularly important and challenging problem. Today, every business runs on data, and the decisions it makes are driven by that data. For example, if a business owner knows which things matter most to customers based on their reviews, that insight can inform data-driven business strategies. Keyword extraction is one such tool: it can extract keywords or keyphrases from a huge set of data (for example, customer reviews) in just seconds. The extracted keywords or keyphrases help in gaining insight into the kinds of topics being discussed in that data.
There are different techniques that can be employed to automatically extract relevant keywords from a piece of text. These techniques range from simple statistical methods, such as counting the frequency of different words appearing in the text, to fairly complex ones, such as the many graph-based approaches. In this first blog of the series, we shall cover three statistical approaches.
In statistical approaches, the basic idea is to find the score of the terms present in the document using different types of statistics calculated over a single document or across several documents. At inference time, once we have the scores, we order all the terms based on their scores and display the top n terms as important keywords. Different approaches employ different ways of calculating scores for n-grams as well.
Term Frequency (TF) is one of the simplest methods. It captures the phrases or words that appear most frequently in the document, which can be really useful if one wants to capture the most recurrent themes or topics in the document. However, it treats a document as a mere bag of words and doesn't factor in the structure and semantics of the document; for example, it would not differentiate between synonyms. Also, certain words, such as stopwords, frequently occur not only in the given document but in all the other documents as well. We therefore need something that penalizes such words, which is exactly what TF-IDF does.
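To make this concrete, here is a minimal sketch of TF-based extraction; the stopword list and the example review text are made up for illustration:

```python
from collections import Counter

# A tiny illustrative stopword list; real extractors use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "are", "of", "to"}

def tf_keywords(text, top_n=3):
    """Rank words by raw term frequency, skipping stopwords."""
    tokens = [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]

doc = ("the service was great and the service staff were friendly "
       "great food and great prices")
print(tf_keywords(doc))  # "great" and "service" rank first by raw count
```

Note how "great" dominates simply by repetition; nothing here distinguishes genuinely informative terms from merely frequent ones, which is the drawback described above.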
TF-IDF is one of the simplest and most effective methods of extracting keywords. Each term t in a document d is assigned a score based on two things. The first is the term t's frequency in the document, i.e., the Term Frequency (TF) covered previously. The second is the Inverse Document Frequency (IDF), which is based on how many other documents include the term t. IDF is what penalizes words that frequently occur both in the given document and across the entire corpus.
TF-IDF is defined as:

TF-IDF(t, d) = TF(t, d) × log(D / Dt)

Here, D denotes the total number of documents and Dt denotes the number of documents containing the term t.
In order to extract keywords using this method, we first calculate the TF-IDF score of each token present in the document. The training phase involves calculating the IDF scores for all the tokens present in the training corpus. At inference time, each token's TF is calculated and then multiplied by its IDF. For this reason, TF-IDF requires a training corpus, unlike the other statistical approaches.
We then extract all the longest n-grams containing the given token. The score for each of these n-grams is calculated by summing the TF-IDF scores of the individual tokens present in it. Finally, all the n-grams are sorted by their TF-IDF scores and the top K n-grams are returned as the top K most relevant keywords or keyphrases.
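A simplified sketch of this pipeline. For brevity it uses whitespace tokenization, and it scores every n-gram up to a fixed length rather than only the longest n-grams containing each token; the tiny corpus is invented for illustration:

```python
import math
from collections import Counter

def train_idf(corpus):
    """Training phase: IDF(t) = log(D / Dt), learned once over a corpus."""
    D = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))  # document frequency per term
    return {t: math.log(D / d_t) for t, d_t in df.items()}

def tfidf_keywords(doc, idf, top_k=2, max_n=2):
    """Inference: score each n-gram by summing its tokens' TF-IDF scores."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    token_score = {t: tf[t] * idf.get(t, 0.0) for t in tf}  # unseen terms -> 0
    scores = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            scores[gram] = sum(token_score[t] for t in tokens[i:i + n])
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [g for g, _ in ranked[:top_k]]

corpus = ["the pizza was cold", "the delivery was late", "the pizza delivery was fast"]
idf = train_idf(corpus)
print(tfidf_keywords("the pizza was cold and the pizza was stale", idf))
```

Terms like "the" and "was" appear in every training document, so their IDF is log(3/3) = 0 and they are effectively suppressed, while "cold" (rare in the corpus) scores highest.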
YAKE (Yet Another Keyword Extractor)
YAKE is a lightweight unsupervised keyword extraction approach. It relies heavily on statistical text features that are selected and computed from a single document. This is what sets it apart from the TF-IDF approach: YAKE does not require any dictionaries or external corpora, whereas, as mentioned above, TF-IDF needs a corpus in order to calculate the IDF score of each term.
For the same reasons, YAKE is also language and domain independent. Apart from not depending on an external corpus, YAKE doesn't depend on any external resources (such as WordNet or Wikipedia) or linguistic tools (such as an NER or POS tagger) other than a static list of stopwords. The other interesting thing about YAKE is that it is term-frequency independent: no conditions are set regarding the minimum term frequency or sentence frequency that a possible keyword must have. This means that a keyword might be considered significant or insignificant whether it occurs once or many times.
The algorithm for YAKE contains 5 steps:
- Text Pre-processing and candidate term identification
- Feature Extraction
- Computing Term Score
- N-gram generation and computing candidate keyword score
- Data deduplication and Ranking
1. Text Pre-processing and candidate term identification
Here the first step is sentence segmentation: the entire document is broken into sentences. The authors of YAKE used segtok, a rule-based sentence segmenter. Each sentence is then divided into chunks whenever a punctuation mark is found, and each chunk is split into tokens. Each token is then converted to lowercase and tagged with appropriate delimiters.
The example below shows a sentence being split into three chunks with all of its tokens tagged with appropriate delimiters:
Each of these terms is considered a candidate unigram term.
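A rough sketch of the chunking step. A simple regex stands in for segtok, and the delimiter tagging (uppercase/acronym annotations) is omitted for brevity; the example sentence is invented:

```python
import re

def candidate_terms(sentence):
    """Split a sentence into chunks at punctuation, then lowercase each token.
    YAKE uses the segtok segmenter; this regex is only a simple stand-in."""
    chunks = [c.strip() for c in re.split(r"[,;:.!?()]", sentence) if c.strip()]
    return [[tok.lower() for tok in chunk.split()] for chunk in chunks]

print(candidate_terms("Google is acquiring Kaggle, a platform that hosts competitions."))
```

Each inner list is one chunk; every lowercased token inside it becomes a candidate unigram term.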
2. Feature Extraction
The very first step here is to calculate some statistics for each of the annotated unigram terms. These statistics are Term Frequency (TF), Index of sentences where the term occurs (offset_sentences), Term Frequency of acronyms (TF_a) and Term Frequency of uppercase terms (TF_U). To capture the co-occurrences between a word and its neighbors found within a window of w, a Co-occurrence Matrix (co-occur) is also calculated. All of these statistics are then used to extract 5 features which are:
- TCase (Casing)
- TPos (Term Position)
- TFNorm (Term Frequency Normalization)
- TRel (Term Related To Context)
- TSent (Term Different Sentence)
2.1 TCase (Casing)
The basic intuition behind this feature is that an uppercase term (not counting the first word of a sentence) is more important than a lowercase term. Acronyms have all their letters capitalized and are therefore also considered. However, instead of counting uppercase terms and acronyms twice, the maximum of the two is taken. The feature is calculated using the Term Frequency (TF), the Term Frequency of acronyms (TF_a), i.e., the number of times the given term was marked as an acronym, and the Term Frequency of uppercase terms (TF_U), i.e., the number of times the given term was marked as an uppercase term, in the following way:
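A hedged sketch of the casing feature. The structure (maximum of TF_U and TF_a, normalized by the term's log frequency) follows the description above, but the exact normalization constant is an assumption and may differ from the paper's formula:

```python
import math

def t_case(tf, tf_upper, tf_acronym):
    """Higher when the term usually appears capitalized or as an acronym.
    The 1 + ln(TF) normalization is an assumption, not the paper's exact form."""
    return max(tf_upper, tf_acronym) / (1.0 + math.log(tf))
```

A term seen 4 times, always uppercase, scores higher than one seen 4 times but uppercase only once, which is the intended behaviour.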
2.2 TPos (Term Position)
The intuition here is that an important keyword would usually occur at the beginning of the document, while the words placed in the middle or in the end of a document won’t be as important. Although many statistical keyword extraction approaches use this feature, YAKE computes this feature in a slightly different manner.
YAKE doesn't directly use the word's position; rather, it utilizes the positions of the sentences in which the word occurs. The idea is to value words in sentences near the beginning of the document more highly than words in sentences towards the end. The equation below shows how this weight is calculated.
Consider the example shown in the figure below. Let us look at two terms, "service" and "science", which occur in the 7th, 8th, 13th, and 16th sentences and in the 1st, 2nd, and 8th sentences respectively. TPosition(service) would be 0.95 (ln(ln(3 + Median[7, 8, 13, 16]))) while TPosition(science) would be 0.47 (ln(ln(3 + Median[1, 2, 8]))).
This gives us values that increase smoothly the further towards the end of the document a word is placed. The lower the value, the higher up in the document the word occurs, and the more important it is considered.
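The computation can be reproduced directly from the example above:

```python
import math
from statistics import median

def t_pos(sentence_indices):
    """TPos = ln(ln(3 + median of the sentence indices where the term occurs))."""
    return math.log(math.log(3 + median(sentence_indices)))

print(t_pos([7, 8, 13, 16]))  # "service": occurs later, so a higher (worse) value
print(t_pos([1, 2, 8]))       # "science": occurs earlier, so a lower (better) value
```

The double logarithm is what makes the growth so gentle: sentence index 100 is not punished 10× harder than sentence index 10, only mildly more.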
2.3 TFNorm (Term Frequency Normalization)
This feature captures the frequency of a word in a document. To avoid biasing towards high frequencies in long documents, the TF of a word is normalized using the mean of the frequencies, with their standard deviation as a smoothing parameter. The mean is calculated over the TF of all the non-stopwords present in the document.
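A sketch of this normalization, assuming the denominator is the mean plus the standard deviation of the non-stopword frequencies (the exact smoothing weight in the paper may differ); the frequency list is invented:

```python
from statistics import mean, stdev

def tf_norm(tf, non_stopword_tfs):
    """Normalize a term's frequency by the mean TF of all non-stopwords,
    with their standard deviation as a smoothing term: TF / (mean + stdev)."""
    return tf / (mean(non_stopword_tfs) + stdev(non_stopword_tfs))

tfs = [1, 1, 2, 3, 5, 8]   # hypothetical non-stopword frequencies in one document
print(round(tf_norm(8, tfs), 2))  # ≈ 1.32
```

In a much longer document where all counts are inflated, the mean and deviation grow too, so the ratio stays comparable across document lengths.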
2.4 TRel (Term Related To Context)
This feature aims to capture and quantify the significance of a word based on its context. The idea is that the greater the number of different terms co-occurring with a word, the less significant that word is. In other words, this feature tries to capture the dispersion of a word with respect to its context. DL (and, analogously, DR) is defined as:
Here, |A_{t,w}| represents the number of different words that co-occur with the term within a predefined window of size w. To penalize words that occur with very high frequencies, DL and DR are multiplied by the term frequency of the word divided by the maximum term frequency (MaxTF) among all candidate words in the document.
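A sketch of this feature under the description above. The dispersion ratios (distinct neighbors over total co-occurrences on each side) and the additive constant are assumptions approximating the paper's formula, not its exact form:

```python
def t_rel(distinct_left, total_left, distinct_right, total_right, tf, max_tf):
    """TRel grows with the variety of a term's neighbors: the more dispersed
    its context, the less keyword-like the term. Constants are approximate."""
    dl = distinct_left / total_left if total_left else 0.0
    dr = distinct_right / total_right if total_right else 0.0
    return 1.0 + (dl + dr) * (tf / max_tf)
```

A stopword-like term that co-occurs with almost every other word gets dl and dr near 1 and therefore a high TRel, which later penalizes its final score.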
2.5 TSentence (Term Different Sentence)
This feature factors in the assumption that a word that appears in many different sentences has a higher chance of being important. It is computed as follows:
Here, SF(t) is the sentence frequency of the word t and #Sentence is the total number of sentences.
3. Computing Term Score
Once all the features are extracted the final term score is calculated as:
The idea behind normalizing TFNorm and TSentence with TRel is to assign a high value to words that appear frequently and in many sentences only when they are relevant with a low TRel value.
The lower the overall score, the higher the importance of the term.
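A sketch of how the five features combine into S(t). The normalization of TFNorm and TSent by TRel is as described above; the placement of TPos in the numerator and TCase in the denominator follows the YAKE paper and should be treated here as an assumption:

```python
def term_score(t_case, t_pos, tf_norm, t_rel, t_sent):
    """Single-term weight S(t): lower is better. TFNorm and TSent only help
    (lower the score) when TRel is small, i.e., when the term is relevant."""
    return (t_rel * t_pos) / (t_case + tf_norm / t_rel + t_sent / t_rel)
```

For example, holding everything else fixed, a term whose sentences appear earlier (lower TPos) receives a lower, i.e. better, score.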
4. n-gram generation and computing candidate keyword score
To generate n-grams, a sliding window of size n is moved over the text to produce contiguous sequences of words (unigrams up to n-grams). An n-gram can only be formed if the contiguous sequence of words belongs not just to the same sentence but also to the same chunk. An additional check ensures that no n-gram starts or ends with a stopword; stopwords may appear inside the n-gram, just not at the beginning or end.
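These generation rules can be sketched as follows; the stopword list is illustrative, and the function operates on one chunk at a time so the same-chunk constraint holds by construction:

```python
STOPWORDS = {"of", "the", "a", "in", "and"}

def candidate_ngrams(chunk_tokens, max_n=3):
    """Slide a window over a single chunk, keeping n-grams that neither start
    nor end with a stopword (stopwords may still appear inside)."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(chunk_tokens) - n + 1):
            gram = chunk_tokens[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                candidates.append(" ".join(gram))
    return candidates

print(candidate_ngrams(["state", "of", "the", "art"], max_n=4))
# keeps "state of the art" (internal stopwords allowed) but not "of the art"
```

Note how the phrase "state of the art" survives intact even though it contains two stopwords, while fragments starting or ending on a stopword are discarded.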
The final score of the candidate keyword is given by:
Here, kw represents a candidate keyphrase and S(kw) is its final score. The numerator is the product of the scores of the unigrams present in the phrase. Since a lower score means higher importance, this product alone might favor longer keyphrases. To account for that, the denominator sums the individual scores and weighs them by the keyword's frequency (KF(kw)). The final list of keywords is then displayed in ascending order of S(kw).
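A sketch of this final score. The structure (product over term scores in the numerator, keyword frequency times the sum in the denominator) follows the description above; the "1 +" smoothing constant in the denominator is taken from the YAKE paper and is an assumption relative to the text here:

```python
from math import prod

def keyword_score(term_scores, kf):
    """S(kw) = prod(S(t)) / (KF(kw) * (1 + sum(S(t)))); lower is better."""
    return prod(term_scores) / (kf * (1 + sum(term_scores)))
```

Multiplying several already-small (i.e., good) unigram scores would shrink the numerator towards zero for long phrases; dividing by the sum and by KF(kw) counterbalances that, so a frequent bigram of two strong terms still outranks a weak unigram.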
5. Data deduplication and ranking
In this step, all the keywords found in the previous step are compared with each other to avoid returning redundant or near-duplicate keywords. A distance-based similarity measure is used: when similar keywords are found, the one with the lowest final score is retained and the rest are removed.
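A sketch of the deduplication pass, using Python's difflib similarity ratio as a stand-in for the Levenshtein-style distance measure used by YAKE; the candidate list and threshold are illustrative:

```python
from difflib import SequenceMatcher

def deduplicate(ranked_keywords, threshold=0.8):
    """Walk the list in ascending-score order (best first) and drop any
    keyword too similar to one already kept."""
    kept = []
    for kw, score in ranked_keywords:
        if all(SequenceMatcher(None, kw, k).ratio() < threshold for k, _ in kept):
            kept.append((kw, score))
    return kept

ranked = [("keyword extraction", 0.02),   # best (lowest) score comes first
          ("keywords extraction", 0.03),  # near-duplicate, should be dropped
          ("graph methods", 0.05)]
print(deduplicate(ranked))
```

Because the input is already sorted by ascending S(kw), the first of any similar pair is always the one with the lower score, so keeping the first occurrence implements the rule described above.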
To understand the importance of each of the five features used in the algorithm, the authors conducted an ablation study following a backward-like elimination process: one by one, each individual feature is removed from the single-term weight S(t). This is done by setting the feature to zero where it is summed and to one where it is multiplied.
Here are the results of this ablation study:
- Removing frequency-related features such as KF(kw) and TFNorm had the most severe negative impact on the results when compared to the baseline with all the features.
- Removing the TCase and TPos features worsened the results to a lesser extent than removing the frequency-related features.
- Removing the TRel and TSent features seemed to improve the results in most cases.
Overall, the authors concluded that KF(kw) and TFNorm are the most important features for keyword assignment.
In this article, we covered what keyword extraction is and why it is important, and we looked at three statistical approaches. Term Frequency, the simplest of all, has the severe drawback of focusing blindly on the frequency of terms. TF-IDF alleviates this issue by incorporating the Inverse Document Frequency, scoring terms not just by how frequently they occur in a single document but also by how rarely they occur across the entire corpus. YAKE is a more recent addition to the statistical keyword extractors and employs more elaborate feature engineering to score candidate keywords. In the next article we will cover four graph-based approaches.
About Me: Graduated with a Masters in Computer Science from ASU. I am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.