What is my meeting about?

Deep dive into sentence embeddings and their use in keyphrase extraction (internship experience)

Daniel Skala
Slido developers blog
8 min read · Sep 29, 2021


Topics extracted from a Slido event

At Slido, there has been an ongoing effort to analyze incoming questions and build a variety of features that improve the user experience. These include key-phrase extraction, sentiment analysis, question similarity, and more.

My work focused on grouping (clustering) questions into smaller, more specific topics and finding a proper representative keyword for each of them. Take, for instance, the following set of questions:

  1. What are the current covid restrictions?
  2. Did this meeting start already?
  3. How dangerous is the new coronavirus variant?
  4. How bad are the vaccine side-effects?
  5. I am so lost here…
  6. When can I get my 2nd dose of Pfizer?

We can divide the questions into the following topics:

Covid:

What are the current covid restrictions?
How dangerous is the new coronavirus variant?

Vaccine:

How bad are the vaccine side-effects?
When can I get my 2nd dose of Pfizer?

Others:

Did this meeting start already?
I am so lost here…

These topics make sense because certain questions are about the same thing. Although every question is different, there is a fundamental similarity among some of them. To us humans, it is clear that vaccine and 2nd dose of Pfizer are more closely related than vaccine and covid restrictions. My task was to design an algorithm that solves this problem quickly and reliably.

First attempt

The Questions-by-Topics project arose at Slido’s internal hackathon, where it won first place. It was based on grouping questions by frequently occurring keywords. Although it gave satisfying results at the beginning, it was not able to group semantically similar phrases (e.g. vaccine and 2nd dose of Pfizer). Moreover, the PageRank algorithm it used has cubic time complexity, which didn’t scale well.

Hence, my first task was to get rid of the slow PageRank algorithm and replace it with an asymptotically better solution.

Dendrogram showing 4 clusters. Source: sklearn.datasets.load_digits

For this, I used hierarchical clustering with average linkage, which is part of the TopicRank algorithm in the pke library. The idea is simple: merge every point in the dataset with its closest neighbor and declare them a cluster. Merge these with their nearest clusters and repeat until everything ends up in one massive cluster. The result is a hierarchy of clusters of varying number and size. In the end, we pick a certain level of this hierarchy and display it to the user.

But how do we know which level to pick? This has to be done on a trial-and-error basis: just pick a level and check whether the clusters make sense. That is, of course, not ideal, and will be addressed later when we talk about embeddings (and HDBSCAN). However, this is not our greatest concern. What if the questions change over time?
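The idea of building the hierarchy and then cutting it at a chosen level can be sketched with scipy’s hierarchical clustering utilities. This is a toy illustration, not the pke/TopicRank code itself; the points stand in for our keyword representations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D "embeddings": two tight groups of points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),  # group A
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),  # group B
])

# Build the full merge hierarchy with average linkage.
merges = linkage(points, method="average")

# "Picking a level" = cutting the dendrogram at a distance threshold.
labels = fcluster(merges, t=1.0, criterion="distance")
```

Cutting at a different threshold `t` yields a different number of clusters, which is exactly the trial-and-error knob described above.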

Clusters generated by the faster naive solution

Slido Events are dynamic

Imagine a very big meeting with thousands of participants. The event starts and the first questions are sent to Slido. After, say, 20 questions, we should be able to split them into more specific topics. However, in the next 10 minutes the number of questions grows to 100 and our topics change: some might have vanished, new ones might have arisen, and others might have changed their priority/size. To visualize what happens to the topics over a certain period of time, I created a simulation tool that displays exactly these changes.

Simulation tool that depicts changing clusters

Red color means that a topic was lost, bold and underlined means a new cluster was born, #Q denotes the number of questions in the event, and the signs <- and -> depict a priority change. Whenever such a change occurs, the particular sentence that caused it is displayed, along with whether it was added or deleted. This tool gave us a better understanding of what the end user might see, so that we can avoid unwanted behavior, and we can also use it to experiment with other clustering approaches. However, the question of how and when exactly we want to “recluster” the meeting and update the present topics still needs to be explored further.
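The bookkeeping behind such a tool boils down to diffing two clustering snapshots. The sketch below is a simplified illustration (the real tool also tracks the individual sentences that trigger each change); topics are represented only by their keyword and question count:

```python
def diff_topics(old_topics, new_topics):
    """Compare two clustering snapshots, given as dicts mapping a
    topic's representative keyword to its number of questions."""
    lost = set(old_topics) - set(new_topics)
    born = set(new_topics) - set(old_topics)
    resized = {t: (old_topics[t], new_topics[t])
               for t in set(old_topics) & set(new_topics)
               if old_topics[t] != new_topics[t]}
    return {"lost": lost, "born": born, "resized": resized}

# Snapshot after 20 questions vs. after 100 questions (toy numbers).
snapshot_20 = {"covid": 8, "vaccine": 7}
snapshot_100 = {"vaccine": 40, "restrictions": 25}
changes = diff_topics(snapshot_20, snapshot_100)
```

Here `covid` was lost, `restrictions` was born, and `vaccine` grew from 7 to 40 questions.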

Embeddings

Clustering questions based on the lexical similarity of keywords (let’s call them candidates) might be sufficient, but we would like to do better and generate semantically similar groups. This, of course, raises the question: how does a computer know that vaccine and 2nd dose of Pfizer should be in the same cluster? One possible solution lies in using distributed representations of our candidates. We can create such representations using transformers.

A transformer is a deep learning model that can, for example, generate sentence embeddings, which help solve problems requiring a certain level of language understanding. More precisely, it is a type of neural network ‘designed to handle sequential input data, such as natural language, for tasks such as translation and text summarization’.

Okay, so how do we use it to generate sensible topics? The pipeline looks as follows:

Pipeline for semantic clustering
  1. We first extract all candidates and pass them through a language model of our choice. The result is a multi-dimensional feature vector (a long list of numbers) that ‘describes’ the candidate in a way that the transformer understands. In other words, we embed our candidates in a multi-dimensional vector space.
  2. But how do we make sense of numbers in, say, 384 dimensions? A standard approach in machine learning is dimensionality reduction. Reducing dimensions of the dataset results in data loss but important patterns (clusters) might emerge. With the use of the right algorithm, we can obtain sensible representation in lower dimensions. For more information, check UMAP.
  3. So how do we finally generate our clusters? And what is a cluster anyway? A cluster in our interpretation is a set of points that are significantly close to each other — the distances of their feature vectors are small. There are a number of clustering algorithms, ranging from the standard k-means, which merely partitions the dataset into spherical clusters, to more complex hierarchical density-based spatial clustering with noise, e.g. HDBSCAN, which I ended up studying and using. HDBSCAN is one of the most advanced clustering algorithms today; it extends the popular DBSCAN with a cluster hierarchy and automatically finds its optimal level (the problem we discussed above).
Clusters generated by HDBSCAN after UMAP down-projection into 2D
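The three pipeline steps can be sketched end to end. A real deployment would use a sentence-transformer model for the embeddings, UMAP for the down-projection, and HDBSCAN for the clustering; the sketch below substitutes synthetic 384-dimensional vectors, PCA, and scikit-learn’s DBSCAN (which HDBSCAN extends) so that it runs with scikit-learn alone:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Step 1 (stand-in): pretend these are 384-dim sentence embeddings of
# our candidates; a real pipeline would call model.encode(candidates).
embeddings, _ = make_blobs(
    n_samples=100, n_features=384, centers=3, cluster_std=1.0, random_state=42
)

# Step 2: dimensionality reduction (UMAP in the article, PCA here).
reduced = PCA(n_components=5, random_state=42).fit_transform(embeddings)

# Step 3: density-based clustering (HDBSCAN in the article, DBSCAN here).
# The label -1 marks noise points that belong to no cluster.
labels = DBSCAN(eps=5.0, min_samples=5).fit_predict(reduced)
n_clusters = len(set(labels) - {-1})
```

Swapping PCA for UMAP and DBSCAN for HDBSCAN recovers the pipeline described above, including the automatic choice of the hierarchy level.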

The “representative” problem

Good, we have our clusters. How do we find a keyword that best describes them? If we have a cluster with the candidates vaccine, vaccination, anti-vaxx, vaccines..., then we might pick our representative based on frequency or the most frequently occurring stem, in this case vaccine.

But what if we obtain a cluster with the candidates restrictions, economic crisis, covid, safety, disease, doctors...? Is this cluster about a pandemic? Perhaps about Covid-19? Or health? All candidates are unique, so frequency-based picking will not work. There are multiple descriptive keywords, but how does a computer know which to pick when it is not clear even to us humans?

This, unfortunately, remains an open problem… There are algorithms (e.g. CBOW) that, given a context, return a fitting keyword and vice versa, but this is a problem of its own and would probably require a bit more research.

Thankfully, the majority of clusters include candidates which at least share a common stem, and hence picking the representative keyword based on frequency covers most (but not all) cases.
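A frequency-based picker along these lines might look as follows. This is a sketch: the `stem()` helper here is a crude illustrative stand-in, and real code would use a proper stemmer (e.g. nltk’s SnowballStemmer):

```python
from collections import Counter

def stem(word):
    # Crude illustrative stemmer: lowercase, keep the first 6 characters.
    return word.lower()[:6]

def pick_representative(candidates):
    # Count how often each stem occurs across the cluster's candidates...
    stems = Counter(stem(c) for c in candidates)
    best_stem, _ = stems.most_common(1)[0]
    # ...and return the shortest surface form carrying the winning stem.
    return min((c for c in candidates if stem(c) == best_stem), key=len)

cluster = ["vaccine", "vaccination", "anti-vaxx", "vaccines"]
representative = pick_representative(cluster)
```

On the vaccine cluster from above, the shared stem wins three out of four candidates and the picker returns "vaccine"; on the abstract covid cluster every stem occurs once and the choice is arbitrary, which is exactly the failure mode the next section filters out.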

The “most frequent shared substring” idea

To filter out clusters with abstract candidates or candidates with no repeating stem, we can compute the relative frequency (in %) with which the representative’s stem appears among the candidates:

def compute_rel_frequency(representative_stem, candidate_list):
    # count candidates that contain the representative's stem
    frequency = sum(1 for candidate in candidate_list
                    if representative_stem in candidate)
    return frequency / len(candidate_list) * 100

Once we have the relative frequencies, we can set a threshold, say, 20%, and clusters with representatives whose relative frequency is below the threshold will not be displayed.
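Applied to the two example clusters above, the filter might look like this (a sketch; the cluster contents and the 20% threshold follow the text, while the stem strings are illustrative):

```python
def rel_frequency(representative_stem, candidates):
    # Percentage of candidates that contain the representative's stem.
    hits = sum(representative_stem in c for c in candidates)
    return hits / len(candidates) * 100

clusters = {
    "vaccin": ["vaccine", "vaccination", "anti-vaxx", "vaccines"],
    "covid": ["restrictions", "economic crisis", "covid",
              "safety", "disease", "doctors"],
}

THRESHOLD = 20  # percent
displayed = {rep: cands for rep, cands in clusters.items()
             if rel_frequency(rep, cands) >= THRESHOLD}
```

The vaccine cluster scores 75% and is kept; the abstract covid cluster scores about 17% and is hidden from the user.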

This is a nice combination of both approaches — we connect the frequency-guarantee property from the naive solution with the generalization property from embeddings to create clusters that are more general but do not deviate too much into abstraction.

Results

And that's it! Searching for ideal parameters and tuning the above-mentioned thresholds led to the desired clusters. From then on, it was about merging data frames, binding embeddings to actual key phrases, preparing data structures, and playing with heuristics like soft clustering (obtaining a vector of probabilities of cluster membership and only allowing those with high probability) or fixing the random state to improve the results. So let's have a look at some:

Clusters obtained with different parameters

In these word clouds, we can observe that the number of neighbors (a UMAP parameter), the number of dimensions, the minimal cluster size, and other parameters influence the precision and the number of our topics. Finding the best parameters is a tedious job and hence remains future work (there might be multiple satisfactory clusterings).

DEMO

As a final part of my internship, I put this all together in a Streamlit application that allows users to load meeting data, tweak parameters, and look at the clusters, their representatives, and associated questions in real-time.

Streamlit demo

Future work

Researchers from the Tutte Institute (the creators of UMAP and HDBSCAN) are working on their vectorizers library, where they provide interesting alternatives to BERT encoding, namely Information Weighted Embedding, Wasserstein Embedding, and more. It might be worth experimenting with these alternatives to see how they compare with the solution presented in this article.

Moreover, finding a descriptive representative for a set of key phrases also remains for future work since this problem does not seem to have a trivial solution.

Last words

I hope this project will be a useful contribution to Slido’s software and finds its way to production soon. Working in the Slido Data team was a great honor and I can’t say enough good things about them! I was especially surprised by their friendly, family-like, and open attitude. It was not all about the task itself; during my internship, I learned to climb, was (a small) part of the ukulele band, drove over 1000 km in under 3 days for the VltavaRun, and watched movies in the office till midnight. What a summer!

Finally, I would like to thank my supervisor Daniela Jašš for providing me with a great amount of support throughout the whole process and the Head of Data Team (and our guru) Marek Šuppa for leading us in the right direction.

If you would also like to work with these people at Slido, check out this link, and the next blog could be yours.
