Latent Dirichlet allocation via Python

dp · 2 min read · Nov 18, 2018

Simple LDA topic modeling example, based on Mallet.

Find a Corpus.

corpus = [
    'You do not want to use them. They are fine for many machine learning tasks, just not deep learning.',
    'It’s always a good idea to examine our data before we get started plotting.',
    'The problem is supervised text classification problem.',
    'Our goal is to investigate which supervised machine learning methods are best suited to solve it.'
]

Read this story to see how to use Bag of Words to decompose our corpus into a unique vocab list and a lookup that maps each term in our documents to its position within that vocab list. The vocab list will be referred to as “vocab” and the mappings variable as “doc_2_term_dict” from here on.
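A minimal sketch of that step, assuming a naive regex tokenizer and taking doc_2_term_dict to map each document index to the vocab indices of its terms (the linked story builds these in more detail):

import re

# Tokenize each document into lowercase terms.
docs = [re.findall(r"[a-z’']+", doc.lower()) for doc in corpus]

# vocab: the unique, sorted list of terms across the corpus.
vocab = sorted({term for doc in docs for term in doc})

# doc_2_term_dict: for each document index, the vocab index of each of its
# terms, in order (repeats included).
term_index = {term: i for i, term in enumerate(vocab)}
doc_2_term_dict = {d: [term_index[t] for t in doc] for d, doc in enumerate(docs)}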

Set up our global LDA settings:

## GLOBALS
K = 3
TOPICS = [i for i in range(K)]

## fudge factors
beta = .001
alpha = 1
n_iterations = 3
## END GLOBALS
  • K : the number of topics.
  • beta : fudge factor for “term to topic”.
  • alpha : fudge factor for “document to topic”.
  • n_iterations : number of iterations to perform.

Set up LDA

Randomly assign a topic to each term in each document. The initial probability distribution (p) is uniform, so the words are spread evenly across the topics.
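A sketch of that initialization, continuing from the globals above; d2t (document-to-topic counts) and t2w (topic-to-term counts) are the matrices used in the theta and phi formulas below, while topic_assignments is a name assumed here:

import numpy as np

rng = np.random.default_rng(42)

D, V = len(corpus), len(vocab)
d2t = np.zeros((D, K))            # document-to-topic counts
t2w = np.zeros((K, V))            # topic-to-term counts
p_uniform = np.full(K, 1.0 / K)   # uniform initial distribution over topics

# topic_assignments[d][i] holds the topic of the i-th term in document d.
topic_assignments = {}
for d, term_ids in doc_2_term_dict.items():
    topics = rng.choice(TOPICS, size=len(term_ids), p=p_uniform)
    topic_assignments[d] = list(topics)
    for w, z in zip(term_ids, topics):
        d2t[d, z] += 1
        t2w[z, w] += 1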

Now, improve the Topic Assignments!

Mallet uses a technique called Gibbs sampling, and this example does the same. The basic idea: for each term in each document, remove its current topic assignment from the counts and resample a new topic.
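A sketch of that resampling step for a single term, using the standard collapsed Gibbs update and the counts defined above:

def resample(d, i, w):
    """Resample the topic of the i-th term (vocab id w) in document d."""
    z_old = topic_assignments[d][i]

    # Remove the current assignment from the counts.
    d2t[d, z_old] -= 1
    t2w[z_old, w] -= 1

    # P(topic | everything else) is proportional to
    # (how much the topic likes the term) * (how much the document likes the topic).
    p = (t2w[:, w] + beta) / (t2w.sum(axis=1) + beta * len(vocab)) * (d2t[d, :] + alpha)
    p = p / p.sum()

    # Draw a new topic and add it back into the counts.
    z_new = rng.choice(TOPICS, p=p)
    topic_assignments[d][i] = z_new
    d2t[d, z_new] += 1
    t2w[z_new, w] += 1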

Finally, run Gibbs sampling for multiple iterations.
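For example, one full sweep visits every term in every document, and we repeat that n_iterations times:

for _ in range(n_iterations):
    for d, term_ids in doc_2_term_dict.items():
        for i, w in enumerate(term_ids):
            resample(d, i, w)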

Document to Topic Distribution:

How much of a document is allocated to a topic?

theta = ( (d2t + alpha).T / np.sum(d2t + alpha, axis=1) ).T

          0         1         2
0  0.590909  0.227273  0.181818
1  0.294118  0.470588  0.235294
2  0.200000  0.400000  0.400000
3  0.315789  0.210526  0.473684

Topic to Term Distribution:

How “much of a term” is allocated to a Topic?

phi = ( (t2w + beta).T / (np.sum(t2w, axis=1) + beta) ).T

          a    always       are    before
0  0.000045  0.045498  0.090950  0.000045
1  0.058879  0.000059  0.000059  0.058879
2  0.000059  0.000059  0.000059  0.000059

Top 8 words per topic:

[
  (0, ['learning', 'are', 'is', 'you', 'we', 'deep', 'for', 'good']),
  (1, ['problem', 'our', 'machine', 'they', 'started', 'not', 'it’s', 'it']),
  (2, ['to', 'supervised', 'which', 'the', 'text', 'tasks', 'suited', 'investigate'])
]
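One way to produce a listing like this (a sketch, assuming phi’s columns line up with vocab as built above; the notebook may do it differently) is to sort each topic’s row of phi and keep the eight highest-weighted terms:

phi_arr = np.asarray(phi)   # works whether phi is a DataFrame or an ndarray
top_words = [
    (k, [vocab[i] for i in np.argsort(phi_arr[k])[::-1][:8]])
    for k in TOPICS
]
print(top_words)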

The full notebook can be found here.

References:

  1. https://vimeo.com/53080123
