Probabilistic topic models in contextual advertising

Armin Catovic
Schibsted engineering
Feb 17, 2022

As the online advertising world prepares for the impending “third-party cookie death”, advertisers and publishers alike are looking into alternative forms of advertising revenue. One viable alternative that does not rely on behavioural data is contextual advertising. The key challenge in contextual advertising, however, is facilitating a good match between an advertiser’s audience requirements and a publisher’s inventory. In this document, we discuss our approach, and lessons learned, in solving this problem using probabilistic topic models, specifically Latent Dirichlet Allocation (LDA). The LDA approach works very nicely after some minor tweaking, and our topic models today are an integral part of how we deliver contextual ads.

Introduction

Advertising, by definition, is a means of communication between a brand (also referred to as a “seller” or an “advertiser”) and potential customers. The seller’s aim is to define and target the most relevant customers for their products or services. Online advertising makes heavy use of cookies, and particularly third-party cookies, in order to target the most relevant users possible. Over the last couple of years, however, major web browsers have signalled their intent to phase out support for third-party cookies [1]. Advertisers and publishers are therefore looking at alternative, privacy-preserving ways of maintaining effective online advertising channels. One such approach is contextual advertising.

Contextual advertising is a form of targeted advertising where the content of an ad segment is directly correlated with the content of a news article or a web site. An ad segment is an abstraction that encapsulates the advertiser’s message and target audience. For example, an article about the European Union proposing a ban on new petrol and diesel cars could be a perfect place for an ad segment related to a specific brand of electric vehicles. The aim of the advertiser is to maximize the reach among its target audience. When we say “reach”, we mean that the advertisement appears on as many relevant articles as possible.

One basic approach to creating a contextual ad segment is to construct a comprehensive list of keywords, which are then matched against the publisher’s inventory. Often a partial set of keywords (or “seed words”) is prepared in advance by the advertiser. However, due to nuances in language use, such as synonymy and polysemy, and due to over-representation of certain topics and themes, these keywords alone may not map optimally, thereby limiting the target audience reach.

To alleviate this problem, we train topic models on our Norwegian and Swedish news inventory, in order to uncover latent thematic structures, and use this to help our product specialists find a more optimal mapping between ad segments and news content. We use Latent Dirichlet Allocation (LDA) [2] and more specifically Griffiths and Steyvers’ adaptation of Gibbs sampling for topic discovery [3]. The resultant topic models serve two objectives:

  • To guide our product specialists via keyword expansion — given n ≥ 1 seed words, the model infers the highest probability topics, and for each topic, suggests the highest probability keywords that can be added to the ad segment
  • To provide topic insights — our product specialists can quickly view a thematic snapshot of our inventory, and the aggregate user engagement across each theme or topic

In the subsequent sections we introduce LDA, describe our process of integration and application of LDA topic models, and discuss lessons learned.

Latent Dirichlet Allocation (LDA)

LDA [2] is a simple, yet fully consistent Bayesian model that describes the (somewhat simplified) generative story behind a text corpus. It assumes that each document may be associated with one or more topics, and that the words appearing in that document reflect the particular set of topics it addresses [3]. We say it is somewhat simplified because it makes a bag-of-words (BoW) assumption, i.e. that the order of words in each document is irrelevant to the underlying thematic structure. In LDA, we treat each topic as a probability distribution over words, viewing a document as a probabilistic mixture of these topics [3]. This is presented visually in Figure 1.

Example of a Swedish news article and its Latent Dirichlet Allocation representation.
Figure 1 — Example of a Swedish news article and its LDA representation. The article is a sparse mixture of topics θ — in this case we can clearly see that out of (potentially) a very large number of topics T, the three topics with the highest probabilities are economy, climate change, and politics. Each topic Φ, is a distribution over a fixed vocabulary, with higher probabilities assigned to specific words; for the climate change topic, the highest probability words are klimat (climate), skog (forest) and fossil (fossil). Each observed word w, contains a specific topic assignment, z.

More formally, LDA is a Bayesian network, described using a directed acyclic graph (DAG) and “plate” notation. In contrast to other document clustering approaches, e.g. taking word frequency vectors (or document embeddings) and applying K-Means clustering, LDA’s generative nature lends itself to a simple interpretation of how a document came to be. As Figure 2 shows, this generative story is as follows:

  • We first imagine we have T topics — in our case we select T by manually evaluating the model interpretability/coherence (see Evaluation), however there are also nonparametric Bayesian approaches where T can be chosen automatically during inference [4]
  • Each topic t ∈ {1, 2, …, T}, is a Multinomial distribution Φ_t over the entire vocabulary, sampled from a Dirichlet prior with a hyperparameter β
  • We then imagine that we have a text corpus consisting of D documents
  • Each document d ∈ {1, 2, …, D}, is a Multinomial distribution Θ_d over all possible topics, sampled from a Dirichlet prior with a hyperparameter α
  • For each word i ∈ {1, 2, …, N} in document d…
  • …we choose a topic assignment z_i, from a Multinomial distribution Θ_d…
  • …and we choose a word w_i, from the Multinomial distribution Φ of the chosen topic z_i
A directed acyclic graph representing our LDA model.
Figure 2 — A directed acyclic graph representing our LDA model. Each node is a random variable; shaded/grayed-out nodes represent the data we actually observe, i.e. words, while hollow nodes represent latent variables we need to infer/estimate; arrows indicate conditional dependence; plates (rectangles) indicate replicated variables.
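
This generative story is short enough to sketch directly in code. The following is a minimal, illustrative simulation (not our production pipeline); the corpus sizes and hyperparameter values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, D, N = 3, 8, 2, 6   # topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01   # symmetric Dirichlet hyperparameters

# Each topic t is a distribution Phi_t over the vocabulary, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)        # shape (T, V)

corpus = []
for d in range(D):
    # Each document is a distribution Theta_d over topics, drawn from Dirichlet(alpha)
    theta = rng.dirichlet(np.full(T, alpha))         # shape (T,)
    doc = []
    for i in range(N):
        z = rng.choice(T, p=theta)                   # topic assignment z_i ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])                  # word w_i ~ Multinomial(phi_z)
        doc.append(w)
    corpus.append(doc)
```

Small α and β encourage sparsity: each document concentrates on a few topics, and each topic on a few words, which matches the “sparse mixture” view in Figure 1.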

Method

We train separate topic models for Norwegian and Swedish. In both cases we use an identical pipeline for pre-processing and topic model inference. We train the models once per month, in order to capture the overall thematic “drift” in the news corpora, such as the evolution of the COVID-19 pandemic, changes in the global economic outlook, transient events such as the Winter Olympic Games, and so on. We use a look-back period of six months. In the following sub-sections, we describe our topic modelling pipeline in more detail.

Pre-processing and Feature Representation

The text pre-processing simply consists of lower-casing, tokenization, and filtering. A token is any contiguous sequence of at least two characters, beginning with an alphanumeric character, followed by alphanumeric characters as well as hyphens and pluses. The exact regex used to extract tokens from a string is: r"[0-9A-ZÅÄÖÆØa-zåäöæø][0-9A-ZÅÄÖÆØa-zåäöæø\-\+]+".

We only consider uni-gram tokens, i.e. single words. (Through our linguistic analysis of contextual segment keywords, we found that ~80% are composed of single words, and these are predominantly nouns; the rest are bi-grams, generally pertaining to named entities such as “Manchester United”. We leave part-of-speech tagging and named entity recognition for future work.)

Filtering consists of removing tokens based on a set of stop words, followed by token removal based on the power-law distribution of word frequencies [5]. We use separate sets of stop words for Norwegian and Swedish, where each set consists of approximately 500 words. Stop words are mainly composed of adverbs (e.g. inte/not), subjunctions (att/to), pronouns (hon/she), prepositions (av/of), particles (ut/out), conjunctions (och/and), interrogative words (vilken/which), cardinal numbers (tre/three) and ordinal numbers (tredje/third). We also include additional stop words that our product specialists consider contextually ambiguous, such as kvinna (woman), mål (goal), and skott (shot).

We then apply Zipf’s law and remove all tokens whose document frequency falls below the minimum defined by 2 * (0.02 * |D|)^(1 / log 10), where |D| corresponds to the number of documents in the corpus — in our case, approximately 50,000 articles in each language. At the end of the pre-processing pipeline, we are left with a vocabulary of approximately N = 10,000 words. Our feature representation for each document/article is then simply a sparse vector of word counts, of length N.
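
As an illustration, the pre-processing steps above can be sketched as follows. This is a simplified stand-in for our pipeline: the stop-word set here is a tiny illustrative subset, and we read the exponent 1 / log 10 in the cut-off formula as 1 / ln(10) ≈ 0.43:

```python
import math
import re
from collections import Counter

# Token regex from the pipeline: at least two characters, starting with an
# alphanumeric, then alphanumerics plus hyphens and pluses (Nordic letters included)
TOKEN_RE = re.compile(r"[0-9A-ZÅÄÖÆØa-zåäöæø][0-9A-ZÅÄÖÆØa-zåäöæø\-\+]+")

STOP_WORDS = {"och", "att", "av"}  # illustrative subset; the real lists hold ~500 words

def tokenize(text, stop_words=STOP_WORDS):
    """Lower-case, extract tokens with the regex, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in stop_words]

def min_document_frequency(num_docs):
    # Cut-off from the article: 2 * (0.02 * |D|) ** (1 / log 10),
    # interpreting log 10 as the natural log of 10 (our reading of the formula)
    return 2 * (0.02 * num_docs) ** (1 / math.log(10))

def build_vocabulary(docs):
    """Keep only tokens whose document frequency clears the Zipf-based cut-off."""
    df = Counter(token for doc in docs for token in set(tokenize(doc)))
    cutoff = min_document_frequency(len(docs))
    return {tok for tok, n in df.items() if n >= cutoff}
```

With |D| ≈ 50,000 this cut-off comes out at a document frequency of roughly 40, which is the kind of aggressive tail-trimming that leaves a vocabulary of about 10,000 words.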

Inference

Our aim is to estimate the posterior distribution of topics, proportions, and assignments, given the data (words in the articles). More specifically, we wish to infer p(φ, θ, z | w). As is the case with most real-world Bayesian models, this estimation turns out to be intractable. We therefore leverage approximation methods. A common approach to estimating topic models is using Variational Bayes (VB) [2], and in particular, using online VB [6]. However, there are several other inference approaches, including expectation propagation [7], Markov Chain Monte Carlo (MCMC) [3], as well as more recent techniques leveraging continuous word representations (embeddings), such as Gaussian LDA [8], and Embedding Topic Models (ETM) [9].

After evaluating all of these approaches, we found that MCMC methods using Gibbs sampling, and in particular the MALLET [10] implementation of the Gibbs LDA sampler, provide the most interpretable topics, both in terms of automated coherence metrics and qualitative human evaluation (see Evaluation). In our approach, we use the Gensim Python library [11], which provides a Python wrapper for MALLET (which itself is written in Java). We use the default initial hyperparameters in MALLET, which implies symmetric Dirichlet priors, and we set the number of sampling iterations to 1200, with hyperparameter optimization run every 100 sampling iterations. After careful evaluation, we fix the number of topics to T = 150.
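
For intuition, the collapsed Gibbs sampler of Griffiths and Steyvers [3] can be written down in a few dozen lines. This toy version is for illustration only; MALLET's production sampler is far more optimized, and the hyperparameter values here are placeholders:

```python
import numpy as np

def gibbs_lda(docs, T, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids in range(V)."""
    rng = np.random.default_rng(seed)
    n_tw = np.zeros((T, V))        # topic-word counts
    n_dt = np.zeros((len(docs), T))  # document-topic counts
    n_t = np.zeros(T)              # total words per topic
    z = []                         # current topic assignment per word position
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))  # random initialization
        z.append(zd)
        for w, t in zip(doc, zd):
            n_tw[t, w] += 1; n_dt[d, t] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the current assignment from the counts
                n_tw[t, w] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                # Full conditional p(z_i = t | z_-i, w), up to normalization
                p = (n_tw[:, w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                n_tw[t, w] += 1; n_dt[d, t] += 1; n_t[t] += 1
    # Posterior point estimates of the topic-word and document-topic distributions
    phi = (n_tw + beta) / (n_t[:, None] + V * beta)
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta
```

Each sweep resamples every word's topic assignment from its full conditional given all other assignments; after burn-in, the count matrices yield smoothed estimates of Φ and Θ.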

Evaluation

Automated topic model evaluation is still an open area of research. Hoyle et al. presented their findings at the NeurIPS 2021 conference, indicating that state-of-the-art neural topic modelling approaches are actually not much better (if at all) than the “old” MCMC-based LDA methods of yesteryear, once human evaluation is taken into account [12]. Nevertheless, some objective, quantitative guide is needed, at least in the initial modelling stages. In our case, we adopt the Normalized Pointwise Mutual Information (NPMI) coherence metric [13], as implemented within the Gensim library. NPMI is defined as follows:

NPMI(w_i, w_j) = log[ P(w_i, w_j) / (P(w_i) · P(w_j)) ] / (−log P(w_i, w_j))

NPMI achieves a higher (better) score if the top N words — summed over all pairs of words w_i and w_j — have a high joint probability P(w_i, w_j) compared to their marginal probabilities P(w_i) and P(w_j). For NPMI evaluation, we use a small reference corpus of 1000 randomly sampled articles, and a context window of the top-10 words. We evaluate the NPMI metric by varying the number of topics T from 100 to 1000, in steps of 50.
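
A minimal sketch of the NPMI computation, estimating probabilities from document (co-)occurrence counts; the helper names and the smoothing constant are ours for illustration, not Gensim's API:

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """Normalized pointwise mutual information for one word pair.
    p_i, p_j: marginal probabilities; p_ij: joint probability. Range: [-1, 1]."""
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / (-math.log(p_ij + eps))

def topic_coherence(top_words, doc_freq, co_doc_freq, num_docs):
    """Average NPMI over all pairs of a topic's top words,
    with probabilities estimated from document frequencies."""
    scores = []
    for a in range(len(top_words)):
        for b in range(a + 1, len(top_words)):
            wi, wj = top_words[a], top_words[b]
            p_i = doc_freq[wi] / num_docs
            p_j = doc_freq[wj] / num_docs
            p_ij = co_doc_freq.get((wi, wj), co_doc_freq.get((wj, wi), 0)) / num_docs
            scores.append(npmi(p_i, p_j, p_ij))
    return sum(scores) / len(scores)
```

Two words that always co-occur score close to +1, independent words score close to 0, and words that never co-occur tend towards −1, which makes the per-topic averages easy to compare across values of T.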

However, as pointed out by Hoyle et al., automated metrics alone are insufficient, so we also perform a series of qualitative evaluations together with our product specialists. The aim here is to evaluate both the interpretability of topics and the specificity of topic keywords. From an advertising perspective, specific keywords such as premier league are more desirable than general keywords such as ball, since the latter is contextually ambiguous, i.e. it is difficult to discern which sport we are referring to. A small sample of our topics is shown in Table 1.

Example topics (top-10 words) from our Swedish topic model.
Table 1 — Example topics (top-10 words) from our Swedish topic model; from top-to-bottom: the spread of omicron virus; Swedish cross-country team at the Winter Olympics; geopolitical tensions related to Russia and Ukraine; weather forecasts; real-estate.

Applications

Our trained topic models serve two purposes — as a way of suggesting additional keywords to be added during ad segment construction, and as a means of providing a quick snapshot of the trending themes in our news inventory. In the following sections we provide examples of such applications.

Keyword expansion

Keyword expansion (or suggestion) is a human-in-the-loop method, used to assist product specialists during ad segment creation. Given one or more keywords (entered manually by the product specialist), our model infers the most probable topic assignment, and provides the top-10 keywords currently not in use, within that topic. Figure 3 demonstrates keyword expansion in action. After entering the seed word spara (to save) — presumably in relation to a “money saving” ad campaign — the model infers a general money/economy related topic and its top keywords. The product specialist then enters aktie fonder (share funds), and the topic assignment now changes to a more specific stock market topic. The product specialist can then pick some of the suggested keywords, such as börs (stock market), aktie (stock) and ränta (interest rate), and be sure that her contextual segment will hit the most relevant content across our editorial inventory.

Keyword expansion demonstrates how the inferred topic assignments, and the corresponding highest probability keywords, evolve during segment creation.
Figure 3 — Keyword expansion demonstrates how the inferred topic assignments, and the corresponding highest probability keywords, evolve during segment creation.
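
Given a trained topic-word matrix, the core of the keyword expansion step can be approximated as follows. This is a simplified sketch with a hypothetical suggest_keywords helper; it scores topics by the likelihood they assign to the seed words, whereas the real system infers a full topic assignment:

```python
import numpy as np

def suggest_keywords(seed_words, phi, vocab, top_n=10):
    """Pick the topic most probable given the seed words, and return that
    topic's highest-probability words not already in the segment.
    phi: (T, V) topic-word matrix; vocab: list of V words."""
    index = {w: i for i, w in enumerate(vocab)}
    seed_ids = [index[w] for w in seed_words if w in index]
    # Score each topic by the log-probability it assigns to the seed words
    scores = np.log(phi[:, seed_ids]).sum(axis=1)
    best_topic = int(np.argmax(scores))
    ranked = np.argsort(phi[best_topic])[::-1]   # words by descending probability
    seeds = set(seed_ids)
    return [vocab[i] for i in ranked if i not in seeds][:top_n], best_topic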

Topic insights

Topic insights, as shown in Figure 4, are a feature where we provide the current snapshot of our editorial inventory, grouped according to their topic assignments. We also show the current engagement within each topic, in terms of millions of page-views. Since topic models were originally intended as a means of grouping large textual corpora based on their underlying thematic structure, our topic insights feature fulfills this purpose quite naturally. Product specialists can quickly gauge what our journalists are writing about, and how engaged the users are with their content. Previously, product specialists had to manually peruse news sites, to get a “feeling” for currently trending topics and themes.

Figure 4 — Current snapshot of topics across our editorial inventory. Unsurprisingly, the topic related to the spread of the omicron virus, leads in engagement metrics.

Conclusion and Lessons Learned

In today’s society, where privacy and integrity are among the most pressing issues of our time, contextual advertising is taking pole position in the cookie-less future. In the near term, while the uptake of contextual ad campaigns is increasing, it still has a long way to go compared to more intrusive behavioural ad targeting. Applying machine learning, in our case probabilistic topic models such as LDA, can significantly improve the performance of contextual advertising and help it achieve new heights. Some of the key takeaways from our work can be summarized as follows:

  • Simple probabilistic topic models such as LDA not only allow for rapid experimentation and prototyping, especially compared to more sophisticated Transformer-based models, but also provide tough-to-beat baselines — giving you a fantastic springboard for more advanced modelling work.
  • Close collaboration with your users/stakeholders, and frequent human evaluations and usability tests, are key to delivering machine learning based solutions with high utility.
  • Set aside time for discovery and validation — our motto is fail often, fail fast — machine learning has a notoriously high failure rate, and it takes time to land on a solution that works well.
  • Finally, many organizations fail to deliver machine learning solutions into production, due to ill-defined problems and unrealistic expectations. In our case, we didn’t attempt to fully automate segment creation; instead, we focused on human-in-the-loop aspects, where our goal is to simply empower our product specialists and make their job a little bit easier (coincidentally, Schibsted’s mission is “empowering people in their daily lives”).

References

[1] Chromium Blog, ‘Building a more private web: A path towards making third party cookies obsolete’, 2020. [Online]. Available: https://blog.chromium.org/2020/01/building-more-private-web-path-towards.html. [Accessed: 10 November 2021]

[2] D. M. Blei, A. Y. Ng, M. I. Jordan, ‘Latent Dirichlet Allocation’, Journal of Machine Learning Research, vol. 3, July 2003. [Online serial]. Available: https://jmlr.org/papers/volume3/blei03a/blei03a.pdf. [Accessed: 10 November 2021]

[3] T. L. Griffiths and M. Steyvers, ‘Finding scientific topics’, Proceedings of the National Academy of Sciences of the United States of America, April 2004. [Online]. Available: https://www.pnas.org/content/pnas/101/suppl_1/5228.full.pdf. [Accessed: 10 November 2021]

[4] Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, ‘Hierarchical Dirichlet Processes’, Journal of the American Statistical Association, vol. 101, 2006, pp. 1566–1581

[5] G. K. Zipf, ‘Human Behaviour and the Principle of Least Effort’, Addison-Wesley, 1949

[6] M. Hoffman, F. Bach, D. Blei, ‘Online Learning for Latent Dirichlet Allocation’, Advances in Neural Information Processing Systems (NIPS), no. 23, 2010

[7] T. Minka and J. Lafferty, ‘Expectation-propagation for the generative aspect model’, Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002

[8] R. Das, M. Zaheer, C. Dyer, ‘Gaussian LDA for Topic Models with Word Embeddings’, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015

[9] A. Dieng, F. Ruiz, D. Blei, ‘Topic modeling in embedding spaces’, Transactions of the Association for Computational Linguistics, vol. 8, 2020, pp. 439–453

[10] A. K. McCallum, ‘MALLET: A Machine Learning for Language Toolkit’, 2002. [Online]. Available: http://mallet.cs.umass.edu. [Accessed: 10 November 2021]

[11] R. Řehůřek and P. Sojka, ‘Software Framework for Topic Modelling with Large Corpora’, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, May 2010. [Online]. Available: https://is.muni.cz/publication/884893/en. [Accessed: 10 November 2021]

[12] A. Hoyle, P. Goel, A. Hian-Cheong, D. Peskov, J. Boyd-Graber, P. Resnik, ‘Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence’, Advances in Neural Information Processing Systems, vol. 34, 2021

[13] G. Bouma, ‘Normalized pointwise mutual information in collocation extraction’, Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL), 2009

PS: We’re hiring and have exciting positions in all our locations across the Nordics and Poland. Check out our open positions at https://schibsted.com/career/.
