From Corpus to Multi-Label Classification

#1 Unexplored Corpus

Bryan
5 min read · Sep 9, 2023

In this article we will focus on understanding the data on a high level and engaging with stakeholders early on in our effort to extract value from a new data set.


In my experience, the biggest challenge in such projects is aligning with stakeholders on the vision, requirements, and success criteria. This is easier if the project is part of a top-down mandate from leadership. However, if you are an enterprising NLP practitioner trying to address opportunities you’ve identified, you may have more work to do.

If you can include a reliable project manager early on, do it! Either way, you’ll want to engage stakeholders early and often to work through initial uncertainties and misaligned expectations.

Engage with stakeholders

In addition to exploring the data on your own, it’s important to involve stakeholders early in the process. They likely own the underlying business process and are the subject matter experts, making their input essential.

First, capture any missing details about the data. Much of this was likely documented as part of the business case, but ensure you can answer the following:

  • Where does the data come from? Is it collected internally or from an external source?
  • What is the frequency of data collection? Is the data generated in batches or ad hoc?
  • Is the data preprocessed at all?
  • What is the nature of the text data? Is it structured (e.g., database records) or unstructured (e.g., social media posts)?
  • Are there any accompanying metadata or features associated with the data (e.g., images)?
  • How large is the dataset in terms of records and size (e.g., length of the unstructured text)?

Ask the stakeholders to document:

  1. Their vision for how they will use the data and any related predictions.
  2. Their preferred level of analysis. Do they want to operate at the document, page, paragraph, sentence, word, or other level? A viable default here is the sentence, although they may be interested in multiple levels.
  3. A wish list of labels they would like to be able to analyze over time.
  4. Draft definitions for those labels. A sentence or two for each is sufficient.

The benefits of these steps are two-fold: first, you start outlining the multi-label classification task; second, you can assess how engaged the stakeholders are in the overall effort. If they seem disengaged, or if you find yourself providing most of the momentum for this initiative, have a direct conversation with each stakeholder individually to ask whether they have any questions, concerns, or conflicting priorities. Be solution-oriented here, and escalate only if you or the project manager cannot re-engage them.

Get the corpus

Given our assumption of a vast, multilingual corpus (e.g., survey responses, chat messages, product reviews), start by accessing the data and familiarizing yourself with it. Read through a few dozen records to get a feel for the content. Is the text formal or informal, long or short? Is there any boilerplate content that appears to be a simple copy/paste? At this point, manually translate individual records with Google Translate (or similar) as needed.

Translate the corpus

Decide on a target language for your corpus. Translating the corpus into a single language will simplify subsequent steps and may be necessary if your organization and/or stakeholders operate in only one language. For better or worse, English is the de facto language of much of the internet, of academic research and publications, and of many companies at the forefront of ML/NLP. As a result, there is an abundance of packages and tooling for English. Of course, packages and tooling also exist for Chinese, Spanish, German, Japanese, etc., and if you want or need to carry out your work in one of those languages and the tools exist, then do it!

For detecting and translating text that doesn’t match your target language, several options are available. You can explore free, open-source solutions like EasyNLP (refer to my notebook Translate Amazon Reviews.ipynb), langid in conjunction with LibreTranslate, or a pretrained model capable of translation such as T5 (EasyNLP is essentially a wrapper around this). Alternatively, you can opt for a paid service like Google’s Translation API.
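As a minimal sketch of the langid + LibreTranslate route, the snippet below detects each record’s language and translates anything that isn’t already in the target language. It assumes a LibreTranslate instance running locally at http://localhost:5000 (the public instance requires an API key); the helper name and sample corpus are illustrative.

    import langid
    import requests

    LIBRETRANSLATE_URL = "http://localhost:5000/translate"  # assumed local instance
    TARGET_LANG = "en"

    def translate_to_target(text):
        # Detect the language of `text` and translate it to TARGET_LANG if needed
        detected_lang, _score = langid.classify(text)
        if detected_lang == TARGET_LANG:
            return text  # already in the target language
        response = requests.post(
            LIBRETRANSLATE_URL,
            json={"q": text, "source": detected_lang, "target": TARGET_LANG, "format": "text"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["translatedText"]

    # Illustrative usage on a tiny mixed-language corpus
    corpus = ["Das Produkt ist großartig!", "Great product, fast shipping."]
    translated = [translate_to_target(doc) for doc in corpus]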

If your target language is English, you can then collect descriptive statistics on a sample of the data using textstat, including sentence counts, reading times, and various readability scores.
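A quick sketch of those statistics, assuming translated is the list of English documents from the previous step:

    import textstat

    # Descriptive statistics over a sample of the translated corpus
    sample = translated[:1000]
    for doc in sample:
        print({
            "sentences": textstat.sentence_count(doc),
            "reading_time_s": textstat.reading_time(doc),  # seconds at textstat's default speed
            "flesch_reading_ease": textstat.flesch_reading_ease(doc),
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(doc),
        })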

Cluster the corpus

With your corpus ready, the next step is to cluster and explore the data to find some initial, meaningful patterns in it. This will provide you and the stakeholders with a ‘bottom-up’ perspective on the data.

I’ve found BERTopic to be useful here. In the words of its creator, Maarten Grootendorst, it is a

topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic is easy to use, powerful, and modular, with an abundance of options available through its sub-models. Beyond translating your corpus to the target language, no preprocessing is required. And if you’re OK working with tokens from more than one language, you can skip translation altogether: set language="multilingual" and BERTopic will use the paraphrase-multilingual-MiniLM-L12-v2 embedding model, which was trained on text from over 50 languages.
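As a minimal sketch of that multilingual route, fitting BERTopic directly on an untranslated corpus (docs is assumed to be your list of raw documents):

    from bertopic import BERTopic

    # language="multilingual" loads the paraphrase-multilingual-MiniLM-L12-v2 embedding model
    topic_model = BERTopic(language="multilingual")
    topics, probs = topic_model.fit_transform(docs)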

Before diving into your corpus with BERTopic, familiarize yourself with its basics. Maarten’s introductory article and the accompanying Google Colab notebooks serve as excellent starting points.

With a solid understanding of the BERTopic basics, apply it to your corpus. Experiment and identify a set of parameters whose output aligns with your perspective on the data. Next, capture the items below to initialize your bottom-up perspective on the data (a sketch follows the list):

  • the top C clusters
  • the top K keywords in each
  • a visualization of the relative weight of the top K keywords in the top C clusters (topic_model.visualize_barchart())
  • R representative examples for each (topic_model.get_representative_docs())

Select values for C, K, R, etc. based on your corpus and scenario (i.e., higher values for a large corpus and/or a complex scenario).
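As a sketch of capturing those items, assuming topic_model has already been fit on your corpus (e.g., as in the earlier snippet); note that get_representative_docs() returns at most three documents per topic by default:

    # Illustrative values; choose C, K, R to suit your corpus and scenario
    C, K, R = 10, 10, 3

    # Top C clusters (topic -1 holds outliers, so we skip it)
    topic_info = topic_model.get_topic_info()
    top_topics = [t for t in topic_info["Topic"] if t != -1][:C]

    for topic_id in top_topics:
        # Top K keywords and R representative examples for each cluster
        keywords = [word for word, _ in topic_model.get_topic(topic_id)[:K]]
        examples = topic_model.get_representative_docs(topic_id)[:R]
        print(topic_id, keywords, examples)

    # Relative weight of the top K keywords in the top C clusters
    topic_model.visualize_barchart(top_n_topics=C, n_words=K)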

Parting Recommendations

  • Adopt an experimental approach, iterating to identify what best suits your needs.
  • Compile and consistently update a list of stopwords to refine your clustering. spaCy offers a solid foundation, but remember to include any company- or product-specific terms that might skew results (see the sketch after this list).
  • Initially, let BERTopic automatically decide the number of topics with nr_topics="auto". With time, you’ll gain an intuitive understanding of the optimal number for your dataset, and stakeholder input can further refine this.
  • Familiarize yourself with topic_model.update_topics(). It’s a handy feature that lets you tweak topics without fully refitting the model.
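A minimal sketch combining the stopword and update_topics() recommendations, assuming the fitted topic_model and docs from earlier; the company/product terms are hypothetical placeholders for your own:

    from sklearn.feature_extraction.text import CountVectorizer
    from spacy.lang.en.stop_words import STOP_WORDS

    # spaCy's default English stopwords plus hypothetical domain-specific terms
    custom_stopwords = list(STOP_WORDS) + ["acme", "acmecloud"]

    # Re-weight topic keywords without refitting the underlying model
    vectorizer = CountVectorizer(stop_words=custom_stopwords)
    topic_model.update_topics(docs, vectorizer_model=vectorizer)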
