LDA Topic Modeling in Spark MLlib

Published in

Zero Gravity Labs

6 min readJul 14, 2017

In the past few months at Zero Gravity Labs, we have been exploring a variety of text mining methods in the context of dialogue understanding. For these experiments, we obtained and exploited hundreds of thousands of chat transcripts collected in our call center.

Latent Dirichlet Allocation (LDA) is a widely used generative statistical model that is able to discover the main topics discussed in a collection of documents. In one of our recent experiments, we studied the effectiveness and efficiency of using LDA on chat data in detecting and tracking the evolution of topics and themes over time. This blog post highlights some of what we learned throughout this investigation by explaining the basics of LDA and providing an example of LDA implementation using Spark MLlib in Scala. For privacy and security reasons, instead of the chat data, the example provided is based on publicly available tweets collected via Twitter API.

What is LDA?

LDA is an unsupervised method that models documents and topics based on Dirichlet distribution, wherein each document is considered to be a distribution over various topics and each topic is modeled as a distribution over words. Therefore, given a collection of documents, LDA outputs a set of topics, with each topic being associated with a set of words. To model the distributions, LDA also requires the number of topics (often denoted by k) as an input. For instance, the following are the topics extracted from a random set of tweets from Canadian users where k = 3:

Topic 1: great, day, happy, weekend, tonight, positive experiences
Topic 2: food, wine, beer, lunch, delicious, dining
Topic 3: home, real estate, house, tips, mortgage, real estate

In addition to k, LDA can be provided with some domain knowledge by specifying two concentration parameters called α and β. A low α value places more weight on having each document being composed of only a few dominant topics, whereas a high value of α places more weight on each document being composed of a relatively larger set of topics. Similarly, a low β value places more weight on having each topic composed of only a few dominant words. Figure 1 shows an overview of LDA inputs and output.

LDA is an iterative algorithm. In the first iteration of LDA, every word in each document is given a random topic from the set of k topics. Then, then the distribution of topics over documents and the distribution of words across topics are calculated. In the following iteration, the algorithm makes use of these current distributions and attempts to re-assign the topics to words by optimizing a set of parameters until the algorithm converges. To learn more about the mathematical details of LDA, this session of a machine learning course from the University of Colorado can be a great starting point. To get deeper in the theoretical background of LDA, the original paper by Blei, Ng, and Jordan can be the best resource. Finally, this article written by Asuncion et al. provides an overview of the optimization methods used for LDA including Expectation-Maximization (EM), which is the default implementation in Spark.

LDA Applications

In its training phase, LDA generates a list of themes and topics from a collection of documents, which can be directly used for analytical purposes. Besides, when this list is already available, one can locate a new document in the topic space (see Figure 2). For instance, given a new Twitter timeline, we will be able to infer that 50% of the tweets are about Topic 2 (food), while 50% of the tweets are about Topic 3 (real estate). By locating the documents in the space, it is also possible to understand how similar the documents are in regards to their topic of focus. For instance, we will be able to find other Twitter users that follow the same distribution of topics. In addition, LDA can be used to understand how topics evolve and shift in one document or across multiple documents over time.

Implementation in Spark

Spark MlLib offers out-of-the-box support for LDA (since Spark 1.3.0), which is built upon Spark GraphX. LDA implementation in Spark takes a collection of documents as vectors of word counts. Therefore, the first step is to load and pre-process your documents into the required format. The following snippets to pre-process the data are developed based on this blog post from Databricks.

To load the data into an RDD of String:

val sparkSession = SparkSession.builder()

.appName("LDA topic modeling") .master("local[*]").getOrCreate() val corpus: RDD[String] = sparkSession.read.json(readPath).rdd

.map(row => row.getAs("text").toString)

The following snippet tokenizes and cleans the text to only include words with more than three letters that are not the in the stopWords set:

val stopWords: Set[String] = sparkSession.read.text("stopwords.txt")

.collect().map(row => row.getString(0)).toSet

val vocabArray: Array[String] = corpus.map(_.toLowerCase.split("\\s"))

.map(_.filter(_.length > 3)

.filter(_.forall(java.lang.Character.isLetter)))

.flatMap(_.map(_ -> 1L)) .reduceByKey(_ + _) .filter(word => !stopWords.contains(word._1))

.map(_._1).collect()

Finally, the following code completes the process of converting the documents into a vector of word counts:

val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap

// Convert documents into term count vectors val documents: RDD[(Long, Vector)] =

tokenized.zipWithIndex.map { case (tokens, id) =>

val counts = new mutable.HashMap[Int, Double]() tokens.foreach { term =>

if (vocab.contains(term)) {

val idx = vocab(term) counts(idx) = counts.getOrElse(idx, 0.0) + 1.0

}

} (id, Vectors.sparse(vocab.size, counts.toSeq))

}

Once the data is prepared, LDA model can be built by specifying the following parameters:

k: The number of topics (default k = 10)
α: Dirichlet parameter for prior over documents’ distributions over topics. If your documents are made up of a few dominant topics, choose a low α.
β: Dirichlet parameter for prior over topics’ distributions over words. If your topics are made up of a few dominant words, choose a low β.
maxIterations: Limit on the number of iterations (default i = 20)
setOptimizer: The optimizer used to perform the calculation (the default value is em for EM optimizer and online for online variational Bayes LDA algorithm. Note that the em is the default).

The following builds a simple LDA model that is expected to generate three topics after running 100 iterations:

val lda = new LDA().setK(3).setMaxIterations(100) val ldaMode = lda.run(documents)

LDA model in Spark supports the following two methods:

describeTopics: Returns topics as arrays of most important terms and term weights
topicsMatrix: Returns a vocabSize by k matrix where each column is a topic

For instance, describeTopics can be used as follows to iterate through and print the terms in each topic, along with their corresponding weights:

ldaModel.describeTopics(maxTermsPerTopic = 10).foreach { case (terms, termWeights) =>

terms.zip(termWeights).foreach { case (term, weight) =>

println(f"${vocabArray(term.toInt)} [$weight%1.2f]\t")

} println()

}

So far, we pre-processed the documents and ran an LDA model. Being an iterative optimization algorithm, LDA can be computationally intensive. Thus, spark allows saving and loading the trained topics for future use:

ldaModel.save(sparkSession.sparkContext, "path") val sameModel = DistributedLDAModel.load("path")

The model loaded as a DistributedLDAModel supports a variety of other methods for the analysis of documents with respect to the trained set of topics. For instance, it enables the calculation of the top documents for each topic (topDocumentsPerTopic) or it can return the top k weighted topics for a document (topicAssignments). We can also see the distribution of topics for each document (topicDistributions). Details on the LDA model in Spark can be found in Spark API docs.

We are pleased to be able to share our work and are excited by the possibilities. We would love your feedback and to connect with others interested in this work, please get in touch. We are also looking for some other amazing Scientists to join our team so please check out our open roles.

Tarneh Khazaei is a Scientist at Zero Gravity Labs.

Connect with Taraneh or any member of our team (we are hiring) by following us on Facebook, Twitter, Instagram and LinkedIn.

LDA Topic Modeling in Spark MLlib

What is LDA?

LDA Applications

Implementation in Spark

Written by Zero Gravity Labs