From Corpus to Multi-Label Classification

#2 Useful Labels

Bryan
3 min read · Nov 18, 2023

In this article, we will combine the bottom-up and top-down approaches to yield a useful set of labels that are actually present in the data.


Working session with stakeholders

To begin the working session either you or the PM should provide a brief background on the initiative including facts about the dataset (the source, how records are created, the volume, any key data points), how the data is currently used within the organization, the value of automatically classifying it via a predictive model, and the high-level process for accomplishing this (which is essentially the articles in this series). Ask a participant who is not presenting to take notes for everyone’s reference after the session.

Following the background information, have the stakeholders present their

  • vision for how the organization intends to utilize the data including predictions generated by a model (e.g., to identify the top drivers of customer-product dissatisfaction)
  • preferred level of analysis (e.g., at the sentence level)
  • wish list of labels along with the draft definitions (see the article #1 Unexplored Corpus)

Ensure the stakeholders are well aware they’ll be asked to present; none of this should come as a surprise to them during the working session.

Next, it is your turn to present the salient clusters found in the data, specifically the large and/or meaningful ones. Here you’ll share the clusters, top keywords, and representative examples you found in the data (again, see #1 Unexplored Corpus). Highlight clusters that seem to align with labels on the stakeholders’ wish list, as well as clusters not on their list that may be of value to them. Use a whiteboard to draw the connections between clusters and candidate labels. The clusters may not align perfectly with a single label, which is fine for now. Keep in mind that the clusters themselves are not 100% pure; they contain noisy and even irrelevant examples. We just want to identify relevant patterns we’ll utilize later on.

By the end of the working session, we will have completed an exercise of combining the top-down and bottom-up perspectives on the data. We should have a set of labels that are of interest to the stakeholders, that they want to analyze over time and use to inform their decisions and actions. The stakeholders should be actively engaged in these activities and take ownership of the labels and their definitions. This is important since these are concepts emanating from a business process they own, either in part or in whole. Close the working session by summarizing what was agreed upon, namely the useful set of labels found in both the stakeholders’ wish list and the clusters from the data, and recap any outstanding questions and next steps.

Cluster the corpus (again)

Return to the data to explore the clusters again with BERTopic. At this point, I recommend curating a list of lists of keywords to run a guided topic modeling. The lists of keywords should be based on the wish list labels provided by the stakeholders, the found clusters and their keywords, and other prominent clusters in the data (to help separate noise from useful clusters). The following BERTopic methods and functionality will be helpful.

get_topic() — returns a cluster’s top keywords along with their scores.

find_topics() — returns the clusters most similar to a search term.

BERTopic(seed_topic_list=seed_topic_list) — nudges the model toward forming the clusters indicated in the list of lists of keywords.

Once you’ve converged on a viable topic model through iterative experiments, you can rename the clusters to better align with the labels discussed with stakeholders. The objective is to develop an updated, authoritative topic model that reflects our explorations and discussions. We will annotate the data using the refined topic model’s clusters and labels, a crucial step toward training and evaluating a multi-label classification model.
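Once the clusters are renamed, mapping each document’s cluster assignment to its agreed-upon label can be as simple as a lookup. A minimal sketch, where the cluster ids, label names, and assignments are all hypothetical:

```python
# Hypothetical mapping from BERTopic cluster ids to the labels agreed with stakeholders.
topic_to_label = {
    0: "Billing",
    1: "Shipping",
    2: "Product Quality",
}

# Cluster assignments as returned by topic_model.fit_transform(docs);
# -1 is BERTopic's outlier cluster.
topics = [0, 2, -1, 1, 0]

# Provisional annotations; outliers are flagged for manual review.
annotations = [topic_to_label.get(t, "Needs review") for t in topics]
print(annotations)
# → ['Billing', 'Product Quality', 'Needs review', 'Shipping', 'Billing']
```

You can also store the renamed clusters on the model itself with `topic_model.set_topic_labels(...)` so they appear in BERTopic’s summaries and visualizations.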
