A Talk on Contextualized Attention Embeddings by Professor Shoaib Jameel

Thupten Dukpa
Thomson Reuters Labs
3 min read · Feb 23, 2023
Image source: NLP.png (relopezbriega.github.io)

Shortly after joining TR Labs Bangalore as a Machine Learning intern, I had the wonderful opportunity to attend a research talk given by Dr. Shoaib Jameel, a senior lecturer at the University of Southampton, on the 12th of January 2023 here in Bangalore. Before the talk, we had a brief but pleasant informal chat over a cup of tea, exchanging introductions and getting to know each other.

The talk itself focused on finding a definitive answer to how pre-trained language models (like BERT) and probabilistic topic models (like LDA) form similar kinds of topical word clusters, even though the former are not explicitly designed to model latent topics.

PLMs and PTMs

Firstly, Prof. Jameel introduced two families of models: Pre-trained Language Models (PLMs), including ELMo, GPT, PaLM and Bidirectional Encoder Representations from Transformers (BERT), and Probabilistic Topic Models (PTMs), such as Latent Dirichlet Allocation (LDA).

PLMs are pre-trained on huge amounts of data, which makes them expensive and computationally demanding. In return, they reliably handle downstream applications such as document classification, text summarization and information retrieval. In the BERT model, the lower and middle layers capture the semantic and syntactic properties of text respectively, while the top layers contain the contextualized information.
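As a rough illustration of how one can inspect BERT's layer-wise representations (my own sketch, not the experimental setup from the talk), the Hugging Face transformers library exposes the hidden states of every layer:

```python
# A minimal sketch (not from the paper): inspecting BERT's per-layer
# hidden states with Hugging Face transformers.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Football is a popular sport.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding layer plus one tensor per
# encoder layer, each of shape (batch_size, sequence_length, hidden_size).
print(len(outputs.hidden_states))       # 13 for bert-base (embeddings + 12 layers)
print(outputs.hidden_states[-1].shape)  # contextualized top-layer representations
```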

On the other hand, an example of a PTM is LDA, an unsupervised model that infers latent topics for a document, where each topic is a mixture of words obtained through a probabilistic approach. In LDA, topics are represented as probability distributions over words, while in PLMs word clusters are obtained by clustering token-level embeddings. Since the two kinds of clusters show clear similarities, the question is how the PLM vectors come to contain latent topic information.
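For intuition on the PTM side of the comparison, here is a minimal scikit-learn sketch (my own illustration, not the paper's code) of how LDA represents topics as distributions over words:

```python
# A minimal sketch (my own illustration): LDA topics as probability
# distributions over words, fitted with scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

vectorizer = CountVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(X)

# Each row of lda.components_ is an (unnormalized) topic-word distribution;
# print the top words of the first few topics.
vocab = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_[:3]):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:10]]
    print(topic_id, top_words)
```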

Probe Tasks

As per previous studies, different layers of the BERT model capture different properties of text (as stated above) but do not model latent topical word clusters. Thus, the focus shifts to the attention heads in PLMs.

An attention head assigns attention weights to all word pairs, and a high attention weight for a pair indicates a strong topical correlation between those words. Similarly, in LDA, words belonging to a particular topic are assigned a higher probability. For example, (sport, football) will have a high attention weight in a PLM as well as a high probability under the same topic in a PTM.
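To make the attention-weight idea concrete, here is a hedged sketch (again my own example, not the paper's code) of pulling BERT's attention matrices out of transformers and reading off the weight between a pair of tokens:

```python
# A minimal sketch (not the paper's code): reading the attention weight
# between a pair of tokens from BERT's attention heads.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "He plays football, his favourite sport."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
i, j = tokens.index("sport"), tokens.index("football")

last_layer = outputs.attentions[-1][0]  # (heads, seq_len, seq_len)
print(last_layer[:, i, j])              # attention from "sport" to "football", per head
```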

In the first probe, a coherence score is calculated after word-level clustering of the outputs from the PLM and the PTM. For PLMs, the clustering uses the attention vectors as features, so semantically related words are grouped into the same cluster. If the coherence values of the two models are numerically comparable, it implies that both are learning semantically related content. Secondly, a high coherence value also implies an overlap of words between the two sets of clusters.
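As a rough sketch of the kind of coherence measurement involved (my own example using gensim's c_v coherence on placeholder word clusters, not necessarily the exact metric or code used in the paper):

```python
# A minimal sketch (my own example): scoring word clusters with gensim's
# c_v topic coherence against a tokenized reference corpus.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Placeholder tokenized corpus and word clusters; in the paper these would
# come from the datasets and from the PLM/PTM clustering step.
texts = [
    ["football", "match", "goal", "league", "sport"],
    ["election", "vote", "party", "government", "policy"],
    ["football", "league", "season", "coach", "team"],
]
clusters = [
    ["football", "sport", "goal", "team"],
    ["election", "government", "vote", "policy"],
]

dictionary = Dictionary(texts)
cm = CoherenceModel(topics=clusters, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())  # one aggregate score; higher = more coherent clusters
```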

The PLMs used are BERT and DistilBERT, while the PTMs are LDA and Non-negative Matrix Factorization (NMF, a linear-algebra-based model). The datasets are 20 Newsgroups, with about 18,000 documents across 20 categories, and IMDB, with 50,000 movie reviews labelled positive or negative. The clustering algorithm used is the Gaussian Mixture Model (GMM).
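A final sketch of the clustering step (assuming per-word feature vectors have already been extracted; this is not the paper's pipeline, and the features below are random placeholders):

```python
# A minimal sketch (not the paper's pipeline): soft-clustering word-level
# feature vectors, e.g. attention-based vectors, with a Gaussian Mixture Model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder per-word feature vectors; in the paper these would be the
# attention-based vectors extracted from BERT / DistilBERT.
vocab = [f"word_{i}" for i in range(500)]
word_features = rng.normal(size=(len(vocab), 32))

# Soft clustering: each word gets a probability of belonging to each cluster.
gmm = GaussianMixture(n_components=20, covariance_type="diag", random_state=0)
cluster_ids = gmm.fit_predict(word_features)
posteriors = gmm.predict_proba(word_features)

print(cluster_ids[:10])    # cluster id assigned to the first few words
print(posteriors.shape)    # (n_words, n_clusters)
```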

Conclusion

  • The attention mechanism in BERT and DistilBERT is mainly responsible for the similar results of PLMs and PTMs.
  • The contextualized layers of PLMs have the most overlapping words with the clusters from PTMs.
  • Latent topics implicitly encoded in the PLM attention vectors result in better performance than PTMs in information retrieval tasks.

It was an immense pleasure and privilege to have been in the audience for this talk, as it gave me a keen insight into Prof. Jameel’s research and its relevance to our work. I now begin to understand the importance of the confluence of academia and industry, and the talk helped me learn and improve my knowledge in this domain. As a zealous academician myself, I look forward to attending more such talks in the future and delving deeper into this field.

References

Link to the paper: arXiv:2301.04339 (arxiv.org)

Further references:

Understanding Topic Coherence Measures | by João Pedro | Towards Data Science

Gaussian Mixture Models Clustering — Explained | Kaggle
