TextNet: Community Detection and Topic modelling

The Rise and fall of Topic Modelling :

Published in

Brillio Data Science

6 min readSep 20, 2022

With the increase in data digitization across various domains there is plethora of text content available to uncover insights and patterns. It is often challenging to extract the hidden knowledge and relations from these unstructured data.

Topic models is one of the most earliest and widely used techniques to identify relationship between documents based on the topics hidden (latent variables) within the text data.

Some of the variants of topic modelling and the drawbacks in existing approaches are explained below

NMF: NMF stands for Non-Negative Matrix Factorization , is a form of Topic Modelling —that recur through a corpus of documents. A corpus is composed of a set of topics embedded in its documents. A document is composed of a hierarchy of topics. A topic is composed of a hierarchy of terms.

NMF aims at reducing the dimension of corpus by decomposing the document-term matrix into two smaller matrices — the document-topic matrix (U) and the topic-term matrix (W). It assumes that the original input is made of a set of hidden variables and this is represented by each column of W matrix and all the three matrices have no negative elements.

*Source:* *https://blog.acolyer.org/2019/02/18/the-why-and-how-of-nonnegative-matrix-factorization/*

Drawbacks:

NMF fails to consider the geometric structure in the data and works well only with short text data

LDA: Latent Dirichlet Algorithm

LDA is a generative model (bag-of-words model) where we calculate word-topic probability and topic-document probability. These probabilities are calculated iteratively for each word in the document.

Source:https://images.app.goo.gl/KbwdUf8okjGg8LNw8

Drawbacks:

#Identify the number of topics prior to modelling .

#Lack of correlation between topics.

#Dynamic topic allocation for each run.

#Inability to properly choose the number of topics

NETWORK ANALYTICS:

A network refers to a structure representing a group of objects and corresponding relationships between them as nodes and edges.

NODES: Nodes are objects of interest that needs to analyzed, this includes peoples, categories , geographies etc.

EDGES: Edges represent the relationship between the chosen nodes

One of the prevalent examples across network analytics is social network analytics example, Facebook users are considered to nodes and edges would be their relationship between the users.

Some of important centrality measures in the network analytics are described below:

Degree centrality: Degree centrality of the node is the number of edges

Eigenvector centrality: Eigenvector centrality scores the nodes based on the connection to other important nodes in the network.

Betweenness centrality: This centrality measure is based on the number of shortest path that passes through the vertex

Closeness centrality: Closeness centrality indicates closeness of node to other peer nodes in the network. It is calculated as the average of the shortest path length from the node to every other node in the network.

COMMUNITY DETECTION

We impart an additional layer of intelligence to of the problem of identifying topics by relating it to the problem of finding communities in complex networks.

Community detection involves dividing a network into groups of nodes that are similar in any specific features. They are usually groups of vertices having higher probability of being connected to each other than to members of other groups, though other patterns are possible. Examples of this task include identify hidden communities involved in a money laundering network.

Some of the key algorithms used for this task are Kernighan-Lin algorithms, Spectral Clustering, Label propagation, Modularity Optimization, etc.

COMBINED APPROACH

In this post, we will learn how we could use network analytics and community detection to leverage topic modelling. This article suggests a two step approach with topic modeling and network analytics to represent texts as a network and discover communities within this network.
We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods , we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents).

IMPLEMENTATION

!pip install networkx
!pip install TextNet
!pip install keybert

The data used to model this approach is customer complaints data with features such as Product ,Sub-product , Issue ,Sub-issue .

##loading data##
# Data import
data = pd.read_csv(r"complaints.csv",index_col='Product')
data = data.sample(frac = 1)data = data.head(5000)

We are removing stop words, applying lemmatisation, and removing short words as part of data preprocessing. Preprocessed text is then passed to KeyBERT to extract the keywords from the Issues column

##data preprocessing
def clean_text(issue):
    le=WordNetLemmatizer()
    word_tokens=word_tokenize(issue)
    tokens=[le.lemmatize(w) for w in word_tokens if w not in stop_words and len(w)>3]
    cleaned_text=" ".join(tokens)
    return cleaned_text
data['clean_issue']=data['Issue'].apply(clean_text)##corpus creation from extracted keywords##
from pathlib import Path
import textnets as tn
corpus = Corpus(data['clean_issue'])##keyword extraction##
import textnets as tn
from keybert import KeyBERT
tt = []
kw_model = KeyBERT()
kw = KeywordExtractor(
            lan="en",
            n=3,
            top=50,
            windowsSize=5
        )
tt = []
for label, doc in corpus.documents.iteritems():
    print(doc)
    for term, sig in kw_model.extract_keywords(doc,keyphrase_ngram_range=(3,3)):
        print(kw_model.extract_keywords(doc))
        tt.append({"label": label, "term": term, "term_weight": 1-sig, "n": 1})

{‘label’: ‘Debt collection’,
‘term’: ‘threatened contact share’,
‘term_weight’: 0.2017,
’n’: 1},
{‘label’: ‘Debt collection’,
‘term’: ‘share information improperly’,
‘term_weight’: 0.35650000000000004,
’n’: 1},
{‘label’: ‘Debt collection’,
‘term’: ‘contact share information’,
‘term_weight’: 0.44599999999999995,
’n’: 1},
{‘label’: ‘Credit reporting, credit repair services, or other personal consumer reports’,
‘term’: ‘problem credit reporting’,
‘term_weight’: 0.1401,
’n’: 1},
{‘label’: ‘Credit reporting, credit repair services, or other personal consumer reports’,
‘term’: ‘credit reporting company’,
‘term_weight’: 0.245,
’n’: 1},
{‘label’: ‘Credit reporting, credit repair services, or other personal consumer reports’,
‘term’: ‘reporting company investigation’,
‘term_weight’: 0.34850000000000003,
’n’: 1},
{‘label’: ‘Credit reporting, credit repair services, or other personal consumer reports’,
‘term’: ‘company investigation existing’,
‘term_weight’: 0.41400000000000003,
’n’: 1},
{‘label’: ‘Credit reporting, credit repair services, or other personal consumer reports’,
‘term’: ‘investigation existing problem’,
‘term_weight’: 0.4244,
’n’: 1}

The data is broken into keywords and then passed on to a network detection algorithm called TextNet to perform the following topics.

preparing texts for network analysis
creating text networks
detecting themes or “topics” within text networks
visualizing text networks

The community_multilevel function applies the Louvain community detection algorithm to automatically uses the edge weights and determines the number of clusters within a given network.

final_df=tn.corpus.TidyText(tt).set_index(“label”)corpus = Corpus(final_df[‘term’])
net = tn.Textnet(corpus.noun_phrases(normalize=True))terms = net.project(node_type="term")# initialize the random seed before running community detection
tn.init_seed()
part = terms.graph.community_multilevel(weights="weight")print("Modularity: ", terms.graph.modularity(part, weights="weight"))terms.plot(label_nodes=True, color_clusters=part)Modularity:  0.13316231137452245

We are visualizing the network in two different formats here where one network shows how the words in the issues are connected based upon their co-appearance and another network where all the products are connected by similar complaints.

network = net.project(node_type="doc")
network.plot(label_nodes=True)

Vertices with high betweenness may have considerable influence within a network by virtue of their control over information passing between others.

network.top_betweenness()

words.top_betweenness()

TextNet: Community Detection and Topic modelling

The Rise and fall of Topic Modelling :

Written by Suchithra