Practical Applications of Gensim in Data Science

Harshita Aswani
3 min read · Aug 21, 2023


Textual data is abundant in today’s digital world, and extracting meaningful insights from it is a challenging task. Gensim, a popular Python library, provides a powerful toolkit for topic modeling, natural language processing (NLP), and semantic analysis. In this blog post, we will explore the practical applications of Gensim and demonstrate its usage through code examples.

Topic Modeling with Gensim

Topic modeling is a technique used to discover hidden themes or topics within a collection of documents. Gensim offers efficient implementations of popular topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). Let’s see an example of applying LDA for topic modeling:

from gensim import corpora, models

# Preprocess the documents and create a list of tokenized texts
# (preprocess_documents() is a placeholder for your own cleaning and tokenizing step)
documents = preprocess_documents()

# Create a dictionary mapping words to their numeric IDs
dictionary = corpora.Dictionary(documents)

# Create a bag-of-words representation of the documents
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train the LDA model on the corpus
lda_model = models.LdaModel(bow_corpus, num_topics=5, id2word=dictionary, passes=10)

# Print the most representative topics
topics = lda_model.print_topics(num_topics=5, num_words=10)
for topic in topics:
    print(topic)

In this example, we start by preprocessing the documents and creating a list of tokenized texts. We then create a dictionary using corpora.Dictionary() to map words to their numeric IDs. Next, we generate a bag-of-words representation of the documents using dictionary.doc2bow(). We train the LDA model on the corpus by passing the bag-of-words corpus, the number of topics to discover, the dictionary, and the number of passes. Finally, we print the most representative topics using lda_model.print_topics().

Word Embeddings with Gensim

Word embeddings are dense vector representations of words that capture semantic and syntactic information. Gensim provides an easy way to train and use word embeddings using algorithms like Word2Vec and FastText. Here’s an example of training Word2Vec embeddings:

from gensim.models import Word2Vec

# Preprocess and tokenize the sentences
sentences = preprocess_sentences()

# Train the Word2Vec model (Gensim 4.x renamed the `size` parameter to `vector_size`)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Get the vector representation of a word through the model's KeyedVectors (model.wv)
word_vector = model.wv['example']
print("Word Vector:", word_vector)

In this example, we preprocess and tokenize the sentences. We train the Word2Vec model using Word2Vec() by passing the tokenized sentences and specifying the vector size, window size, minimum word count, and the number of workers for parallel processing. Finally, we retrieve the vector representation of a word using model.wv[word]. Note that a word occurring fewer than min_count times is dropped from the vocabulary, so looking it up raises a KeyError.

Text Similarity and Document Similarity with Gensim

Gensim provides functionality for computing similarities between texts and documents using techniques like cosine similarity. This can be useful for tasks like document retrieval, recommendation systems, and clustering. Let’s see an example of computing document similarities using TF-IDF:

from gensim import corpora, matutils, models, similarities

# Preprocess the documents and create tokenized texts
documents = preprocess_documents()

# Create a dictionary and corpus
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train the TF-IDF model
tfidf_model = models.TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]

# Build a cosine-similarity index over the TF-IDF corpus
index = similarities.MatrixSimilarity(tfidf_corpus, num_features=len(dictionary))

# Convert two documents into TF-IDF vectors
doc1 = preprocess_document("example_document_1")
doc1_bow = dictionary.doc2bow(doc1)
doc1_tfidf = tfidf_model[doc1_bow]

doc2 = preprocess_document("example_document_2")
doc2_bow = dictionary.doc2bow(doc2)
doc2_tfidf = tfidf_model[doc2_bow]

# Querying the index returns the similarity of doc1 to every document in the corpus
similarities_to_corpus = index[doc1_tfidf]

# Cosine similarity between the two documents themselves
similarity = matutils.cossim(doc1_tfidf, doc2_tfidf)
print("Similarity between documents:", similarity)

In this example, we begin by preprocessing the documents and creating tokenized texts. We create a dictionary and corpus using corpora.Dictionary() and dictionary.doc2bow() respectively. Next, we train the TF-IDF model on the corpus using models.TfidfModel(), and transform the corpus into TF-IDF vectors by indexing the model with the corpus, as in tfidf_model[corpus].

To compute document similarities, we create a similarity index using similarities.MatrixSimilarity() and pass in the TF-IDF corpus. This index allows us to efficiently compute cosine similarity between documents. Finally, we compare two specific documents by preprocessing them, creating their bag-of-words representations, transforming them into TF-IDF vectors, and taking the cosine similarity of the two vectors. Querying the index with a single TF-IDF vector returns that document's similarity to every document in the indexed corpus, not a single pairwise score.

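Gensim’s similarity tools go beyond this example. similarities.SparseMatrixSimilarity and similarities.Similarity apply the same cosine index to larger corpora, while similarities.SoftCosineSimilarity and similarities.WmdSimilarity take word embeddings into account. Which approach fits best depends on your corpus size and your specific requirements.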
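
To make the idea of swapping metrics concrete, here is a plain-Python sketch that compares cosine similarity with Jaccard similarity on sparse vectors. Both functions are written here for illustration and are not Gensim APIs; the two example vectors are made up and simply follow the (id, weight) format that doc2bow() produces.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    weights1, weights2 = dict(vec1), dict(vec2)
    dot = sum(w * weights2.get(i, 0.0) for i, w in weights1.items())
    norm1 = math.sqrt(sum(w * w for w in weights1.values()))
    norm2 = math.sqrt(sum(w * w for w in weights2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def jaccard_similarity(vec1, vec2):
    """Share of ids the two vectors have in common; weights are ignored."""
    ids1 = {i for i, _ in vec1}
    ids2 = {i for i, _ in vec2}
    if not ids1 and not ids2:
        return 0.0
    return len(ids1 & ids2) / len(ids1 | ids2)


# Hypothetical vectors in the same (id, weight) format that doc2bow produces
doc_a = [(0, 1.0), (1, 2.0), (2, 1.0)]
doc_b = [(1, 2.0), (2, 1.0), (3, 3.0)]

print(round(cosine_similarity(doc_a, doc_b), 3))
print(jaccard_similarity(doc_a, doc_b))
```

Cosine similarity is sensitive to how heavily each term is weighted, while Jaccard only asks which terms are present, so the two can rank the same documents differently.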

Gensim is a powerful library that offers a wide range of capabilities for topic modeling, word embeddings, and text similarity analysis. In this blog post, we explored each of these applications through short code examples. By incorporating Gensim into your NLP projects, you can uncover hidden patterns and extract meaningful insights from textual data.

Connect with author: https://linktr.ee/harshita_aswani
