(2/2) Using NLP: Demystifying Indian Parliamentary Discourse

Visheshika Baheti
4 min read · Mar 24, 2024


Continuing from Part 1 of the series, in this article we will dive deeper into the use of Latent Dirichlet Allocation (LDA).

LDA is an unsupervised learning technique that uses a probabilistic approach to assign every word, within and across documents, to a topic (a group of co-occurring words).

As an analogy:

Think of a cluster of balloons (words) floating around a room. Topics are like invisible boxes in the room. The balloons naturally gravitate towards the box (topic) with the strongest attractive force (highest probability). After everything settles, you can look at the balloons surrounding each box to understand the general theme of that box (topic). The balloons themselves aren’t labels for the boxes, but they give you clues about what’s inside.

Image from Microsoft AI Generator

Stages in practice:

Stage I. Random initialisation: for each document, every word is randomly assigned to one of the K topics (k1, k2, k3, k4, …). At this stage the words grouped under a topic, such as production, train, sky and health, are not necessarily related to each other. A minimal sketch of this step follows.
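
The snippet below is a hedged illustration of this initialisation step, assuming a toy tokenised corpus and K = 4 topics (both made up for demonstration, not the parliamentary data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy tokenised corpus (illustrative only).
documents = [
    ["parliament", "debate", "health", "budget"],
    ["railway", "train", "production", "budget"],
]
K = 4  # number of topics, fixed in advance

# Stage I: assign every word position in every document to a random topic.
topic_assignments = [rng.integers(0, K, size=len(doc)) for doc in documents]
print(topic_assignments)
```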

Stage II. Iterative Refinement: The core of LDA involves iteratively updating these topic assignments based on the following:

  • Word-Topic Probability: How likely is a word to belong to a particular topic considering its overall occurrence in documents?
  • Document-Topic Probability: How likely is a topic to be present in a particular document, considering the topics already assigned to other words within that document? This helps ensure thematic coherence within documents.
  • Co-occurrence Patterns: How frequently does a word co-occur with other words across the entire document collection? This helps identify thematic relationships between words.
  • With each iteration, the algorithm refines the topic assignment of every word, nudging it towards the topic that best reflects its thematic context (a minimal sketch of one such update appears after this list).
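
One common way to implement this refinement is collapsed Gibbs sampling, in which each word's topic is resampled in proportion to the document-topic and word-topic counts described above. The sketch below is illustrative only; the toy corpus, number of topics and hyperparameter values are assumptions, not the exact procedure or data used in this project:

```python
import numpy as np

rng = np.random.default_rng(0)

documents = [
    ["parliament", "debate", "health", "budget"],
    ["railway", "train", "production", "budget"],
]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for doc in documents for w in doc})
word_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Count matrices: document-topic, topic-word, and total words per topic.
n_dk = np.zeros((len(documents), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Stage I: random initialisation of one topic per word position.
z = []
for d, doc in enumerate(documents):
    z_d = rng.integers(0, K, size=len(doc))
    z.append(z_d)
    for w, k in zip(doc, z_d):
        n_dk[d, k] += 1
        n_kw[k, word_id[w]] += 1
        n_k[k] += 1

# Stage II: repeatedly resample each word's topic from its conditional distribution.
for _ in range(100):
    for d, doc in enumerate(documents):
        for j, w in enumerate(doc):
            k_old, wid = z[d][j], word_id[w]
            # Remove the word's current assignment from the counts.
            n_dk[d, k_old] -= 1
            n_kw[k_old, wid] -= 1
            n_k[k_old] -= 1
            # Document-topic term times word-topic term (the probabilities above).
            p = (n_dk[d] + alpha) * (n_kw[:, wid] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            z[d][j] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, wid] += 1
            n_k[k_new] += 1
```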

Stage III. Building the Master List:

  • After the iterative process converges (meaning the topic assignments stabilize), a “master list” of topics is generated.
  • Each topic in this list represents a cluster of words that frequently co-occur across documents and are likely related to the same underlying theme. These themes emerge from the data itself, not from pre-defined categories.
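
Continuing the hedged sketch from Stage II, the master list can be read directly off the converged word-topic counts; the small count matrix and vocabulary below are made-up placeholders:

```python
import numpy as np

# Illustrative word-topic count matrix (topics x vocabulary) after convergence.
vocab = ["budget", "debate", "health", "parliament", "production", "railway", "train"]
n_kw = np.array([
    [2.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # topic 0: governance-flavoured words
    [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # topic 1: infrastructure-flavoured words
])

# Stage III: for each topic, list its most probable words to form the "master list".
for k, counts in enumerate(n_kw):
    probs = counts / counts.sum()
    top = np.argsort(probs)[::-1][:3]
    print(f"Topic {k}:", [(vocab[i], round(float(probs[i]), 3)) for i in top])
```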

Note: When working with Python modules for topic modeling, it is worth exploring the options they expose so that the model fits your specific needs. For instance, you may not want the topics to be determined solely by the entire corpus of documents (as described in Stage III); I did not want that for this analysis. By experimenting with different settings and parameters, you can fine-tune the topic modeling process to better suit your analysis goals and the characteristics of your dataset.
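
As a hedged illustration, here is roughly what those knobs look like in gensim, one commonly used library (the toy corpus and parameter values are placeholders, not the settings used for the parliamentary data):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenised documents (placeholders).
docs = [
    ["parliament", "debate", "health", "budget"],
    ["railway", "train", "production", "budget"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# num_topics, alpha, eta (the beta prior) and passes are the main levers to experiment with.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    alpha="auto",   # learn an asymmetric document-topic prior instead of fixing it
    eta="auto",     # learn the topic-word prior
    passes=10,
    random_state=42,
)

for topic_id, words in lda.show_topics(num_words=5, formatted=False):
    print(topic_id, words)
```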

In theory:

Main assumptions:

  1. Each document is a mixture of a fixed number of topics (one or more), and this number is specified beforehand.
  2. The sequence/order of words is irrelevant, i.e. for the topic assignment process each document is treated as a simple “bag of words” with a frequency distribution (a short illustration follows this list).
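
To make assumption 2 concrete, here is a tiny hedged example (the sentences are invented): two sentences with different word order reduce to exactly the same bag of words.

```python
from collections import Counter

a = Counter("the house debated the budget today".split())
b = Counter("today the budget the house debated".split())
print(a == b)  # True: order is discarded, only word frequencies remain
```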

The figure below is the plate representation of the process:

Below is the description of the steps corresponding to the figure above:

  1. For all documents, α is a prior vector containing the weight of each topic in a document and β is a prior vector containing the weight of each word in a topic. Generally, α and β are pre-determined vectors with a single value (much less than 1) repeated. Values less than 1 bias the topic modeling algorithm towards a sparse distribution, so that only a few relevant topics and words are included.
  2. LDA assumes the following generative process over M documents and K topics:

— Choose Θi ∼ Dir(α), where i ∈ {1, …, M}

— Choose φk ∼ Dir(β), where k ∈ {1, …, K}

3. For each of the word positions (i, j), where i ∈ {1, …, M} and j ∈ {1, …, Ni}, with Ni the number of words in document i:

— Choose a topic zij ∼ Multinomial(Θi)

— Choose a word wij ∼ Multinomial(φ_zij)
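
As a hedged illustration, this generative story can be simulated directly; the corpus size, vocabulary size and hyperparameter values below are arbitrary choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(7)

M, K, V, N = 3, 2, 6, 8                 # documents, topics, vocabulary size, words per document
alpha, beta = np.full(K, 0.1), np.full(V, 0.01)

theta = rng.dirichlet(alpha, size=M)    # Θi ∼ Dir(α): topic mixture of each document
phi = rng.dirichlet(beta, size=K)       # φk ∼ Dir(β): word distribution of each topic

documents = []
for i in range(M):
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta[i])   # zij ∼ Multinomial(Θi)
        w = rng.choice(V, p=phi[z])     # wij ∼ Multinomial(φ_zij)
        words.append(int(w))
    documents.append(words)
print(documents)
```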

Packaging it up:

The results we get from the model are in the form:
Topic: [(‘word1’, 0.004516206), (‘word2’, 0.0037982047), (‘word3’, 0.0033419256), (‘word4’, 0.0031789257), (‘word5’, 0.0028883757)].

In practical terms, the output of an LDA model provides a structured representation of topics, each accompanied by a list of associated words and their probabilities. This output can be further processed and visualized to extract meaningful insights, such as creating word clouds or implementing search features based on word probabilities (example below).
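
For example, the word-probability pairs for a topic can be fed straight into a word cloud using the third-party wordcloud package (a hedged sketch; the topic dictionary below is just the placeholder output shown above):

```python
from wordcloud import WordCloud

# Placeholder topic output: word -> probability, as returned by the model.
topic = {
    "word1": 0.004516206,
    "word2": 0.0037982047,
    "word3": 0.0033419256,
    "word4": 0.0031789257,
    "word5": 0.0028883757,
}

cloud = WordCloud(width=600, height=400, background_color="white")
cloud.generate_from_frequencies(topic)
cloud.to_file("topic_wordcloud.png")
```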

By harnessing the power of probabilistic modeling and co-occurrence analysis, LDA enables us to navigate the vast landscape of textual data, uncovering hidden patterns and unlocking valuable knowledge within.
