Keyword Extraction Strategies for Large Document Clusters

Shrinath Suresh
Feb 21, 2024

When dealing with a vast collection of lengthy documents, keyword extraction proves highly advantageous. By extracting the top K keywords from each article, we can make the subsequent clustering process faster and more effective.

I've conducted an in-depth exploration of this subject in the video provided below:

For this task, we will use the Wikipedia dataset available on Hugging Face.

The dataset itself is substantial, totaling 72GB. Fortunately, it is split by language, so for our purpose we will download only the articles tagged as English.

We’ll proceed by loading the dataset using the datasets library. If the library isn’t installed on your system, you can do so by executing the following command:

pip install -U datasets

To load the dataset using the Hugging Face datasets library, you can use the following command:

from datasets import load_dataset
dataset = load_dataset("wikimedia/wikipedia", "20231101.en")

The corpus comprises 6.4 million articles. We will now randomly select 10,000 articles from this corpus.

import random

# randomly sample 10,000 articles from the training split
num_rows = len(dataset['train'])
random_indices = random.sample(range(num_rows), 10000)
random_rows = [dataset['train'][idx] for idx in random_indices]

Each row contains id, title, and text. We are interested only in the text column.

articles = [x["text"] for x in random_rows]

Preprocessing

Before performing keyword extraction, we will implement basic preprocessing techniques:

  1. Removal of punctuation
  2. Conversion of text to lowercase
  3. Removal of stop words

To remove punctuation, we iterate through each character of an article and drop the punctuation characters.

# remove all punctuation characters from each article
PUNCTUATION = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""

for i, article in enumerate(articles):
    article = ''.join(c for c in article if c not in PUNCTUATION)
    articles[i] = article

Use the nltk library to download the stop words

import nltk
nltk.download("stopwords")

Once the stop words are downloaded, store them in a set for fast lookup.


from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

Let’s exclude these words from our articles.

# convert to lowercase and remove stop words
for i, article in enumerate(articles):
    article = ' '.join(word for word in article.lower().split() if word not in stop)
    articles[i] = article

Now, let’s dive into the core aspect where we will conduct keyword extraction.

To transform the articles into numerical representations, we will utilize the TF-IDF vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=5
)
articles_tfidf = vectorizer.fit_transform(articles)

articles_tfidf.shape

Setting max_df to 0.5 indicates that terms occurring in more than 50% of the documents are ignored.

Setting min_df to 5 specifies that terms appearing in fewer than 5 documents of the corpus are ignored.
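To make these two thresholds concrete, here is a small illustrative sketch on a hypothetical four-document corpus (the toy corpus and the min_df=2 value are assumptions chosen only for this example; the actual run above uses min_df=5):

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, used only to illustrate the pruning behaviour
toy_docs = [
    "the city park is large",
    "the city park is old",
    "the mountain trail is steep",
    "the mountain lake is cold",
]

# max_df=0.5 drops terms appearing in more than 50% of documents ("the", "is"),
# min_df=2 drops terms appearing in fewer than 2 documents ("large", "old", ...)
toy_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)
toy_vectorizer.fit(toy_docs)
print(toy_vectorizer.get_feature_names_out())
# ['city' 'mountain' 'park']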

After transformation, the corpus takes the shape of (10000, 31280). Please note that the numbers may vary with each run due to the random selection of articles.
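If you want to verify how sparse this matrix actually is, a quick check like the one below (not part of the original walkthrough) reports the fraction of non-zero entries:

# articles_tfidf is the scipy sparse matrix returned by fit_transform above
n_docs, n_terms = articles_tfidf.shape
density = articles_tfidf.nnz / (n_docs * n_terms)
print(f"{n_docs} documents x {n_terms} terms, {articles_tfidf.nnz} non-zero entries ({density:.2%} dense)")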

Providing such a high-dimensional, sparse input to a clustering algorithm can slow it down and make it inefficient. To address this challenge, we perform keyword extraction: instead of considering all the words in a document, we keep only the top K words with the highest TF-IDF scores. I'll break this process down into 5 steps to explain it clearly.

Step 1: Generate a mapping of words to feature indices. In this step, each word in the vocabulary is linked to a feature index. To accomplish this, execute the following code:

feature_names = vectorizer.get_feature_names_out()
print(feature_names)

Randomly pick an index and see which word it corresponds to:

idx = random.randrange(len(feature_names))
print(feature_names[idx])
'city'

Step 2: Retrieve the feature indices for each article where the TF-IDF score is greater than 0.

from tqdm import tqdm

for doc_index in tqdm(range(len(articles))):
    feature_index = articles_tfidf[doc_index, :].nonzero()[1]

This step gives the feature indices of every word that actually appears in the article, i.e. every word with a non-zero TF-IDF score.

Step 3: Pair each feature index with its TF-IDF score. Our aim is to select the top K elements, and sorting by score requires this mapping.

tfidf_scores = zip(feature_index, [articles_tfidf[doc_index, x] for x in feature_index])

Step 4: We now have the feature index as the key and the TF-IDF score as the value. Let's sort it in descending order, so that the words with the highest TF-IDF scores come first.

sorted_tfidf_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)

Step 5: Now, let's extract the top keywords from the previous step. Here num_keywords is the number of keywords to keep per article (set to 50 in the full code below).

top_keywords = ' '.join([feature_names[i] for i, _ in sorted_tfidf_scores[:num_keywords]])

As we iterate through and perform this step for all the articles, let’s store the output in a list.

keywords.append(top_keywords)

Here’s the overall structure of the code based on the steps we discussed:

from tqdm import tqdm

num_keywords = 50
feature_names = vectorizer.get_feature_names_out()
keywords = []

for doc_index in tqdm(range(len(articles))):
    feature_index = articles_tfidf[doc_index, :].nonzero()[1]
    tfidf_scores = zip(feature_index, [articles_tfidf[doc_index, x] for x in feature_index])
    sorted_tfidf_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)
    top_keywords = ' '.join(feature_names[i] for i, _ in sorted_tfidf_scores[:num_keywords])
    keywords.append(top_keywords)

We will associate these keywords with their respective original articles.

import pandas as pd

df = pd.DataFrame({"original_article": articles, "top_keywords": keywords})
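Since the stated goal is to speed up clustering, here is a minimal sketch of how the extracted keywords could feed a clustering step. This is not from the original notebook; the second vectorizer, the cluster count of 20, and the variable names are assumptions made purely for illustration.

from sklearn.cluster import KMeans

# hypothetical follow-up: vectorize only the keyword strings (a much smaller vocabulary)
keyword_vectorizer = TfidfVectorizer()
keyword_matrix = keyword_vectorizer.fit_transform(keywords)

# cluster the keyword representations; n_clusters=20 is an arbitrary choice for illustration
kmeans = KMeans(n_clusters=20, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(keyword_matrix)
print(df["cluster"].value_counts().head())

Because each keyword string keeps only the top 50 terms of an article, the resulting matrix has far fewer columns than the full TF-IDF matrix, which is what makes the clustering step cheaper.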

You can check out the full code at the following URL: https://github.com/srinathmkce/TheAIGuy/blob/main/NLP/clustering/Keyword%20Extraction.ipynb
