Text Classification on Long Documents with Preprocessing

Long documents bore readers and confuse algorithms. Our ContextAggregator can help with the latter.

kavender.wang
CBI Engineering
8 min read · Mar 3, 2022


Photo by Jeffery Ho on Unsplash

Introduction

News articles contain valuable signals for market intelligence and research. At CB Insights, signals such as business relationships and funding information are primary data sources for our data science projects. As data scientists at CB Insights, we are responsible for extracting insights from these articles to build product features that help our clients make the right business decisions.

Transformers are the state-of-the-art tool for tasks like text classification, question-answering, and named entity recognition. With pre-trained models, the time required for data collection and labeling has been significantly reduced.

But it turns out that overly long documents present problems not just for us humans, but for NLP algorithms as well.

Models are typically pre-trained with shorter input sequences (e.g., <160 words for Yahoo Answers or IMDb Reviews). When fine-tuned for applications where the input sequence is longer (hundreds or thousands of words), the model struggles to understand what it should pay attention to.

Standard transformers set a limit on the max sequence length (e.g., 512 tokens) because of the quadratic complexity O(n²) of the self-attention layer. In addition, the max sequence length also caps the learned positional embeddings, so the model is unlikely to infer useful information from positions beyond what it has seen during training.
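You can see this limit directly in a model's configuration. A quick check with Hugging Face Transformers, using the roberta-base checkpoint:

```python
from transformers import AutoConfig

# The learned positional embeddings fix the maximum usable sequence length.
config = AutoConfig.from_pretrained("roberta-base")
print(config.max_position_embeddings)  # 514: 512 usable tokens plus 2 special offsets
```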

In this post, I'd like to show how we leverage signal-based preprocessing to better extract useful information from documents that are longer than the preset max sequence length. As a result, we achieve better precision and recall and improve inference speed during prediction.

Motivation

Unlike famous text classification benchmark datasets with short texts (<160 words, e.g., Yahoo Answers or IMDb Reviews [1]), the articles we deal with at CB Insights have thousands of words. The transformer tokenizer in the DataLoader either truncates long documents to the max sequence length or pads short content. While this works fine for classification based on the whole article (e.g., sentiment analysis), classifying a long document that covers multiple topics across its sections is more challenging. When multiple topics appear in a single long document, the fine-tuned model is not able to focus on the designated topic, leading to high misclassification rates.
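To make the truncation behavior concrete, here is a minimal sketch with the Hugging Face tokenizer; the article text is a made-up stand-in:

```python
from transformers import AutoTokenizer

# A made-up article far longer than the model's max sequence length.
article = " ".join(f"Sentence {i} about one of several topics." for i in range(500))

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoded = tokenizer(
    article,
    truncation=True,       # everything past max_length is silently dropped
    padding="max_length",  # shorter inputs would be padded to max_length instead
    max_length=512,
)
print(len(encoded["input_ids"]))  # 512: the tail of the article never reaches the model
```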

In practice, we fine-tuned a RoBERTa model [2] on 2,000 curated labeled examples to build our model. An example was marked positive if the given article mentioned a service provider relationship. The vital signal to identify typically states that an organization provides services such as consulting, auditing, underwriting, or legal counsel to other organizations. Such information is usually mentioned only briefly when companies release press statements about acquisitions, fundraising, and legal issues.

With a relatively small and highly skewed label distribution, we reached a 0.83 F1 score on our test set. But model performance could not be improved further, even when we applied sampling methods to address the label imbalance.

Luckily, error analysis revealed the source of the problem: about a quarter of the true positives in our test data contained overly long content and received very low predicted probabilities. Relevant signals were truncated when they appeared beyond the max sequence length. It's impossible to improve fine-tuning performance when that content is missing during training and inference.

Figure 1: A press release for TerraPay with a fundraising announcement. The sentence with service provider information is at the end of the article, beyond the 512-token limit.

Context aggregation to the rescue

The core concept of our approach is straightforward: feed the identified key sentences from a lengthy text to our pre-trained model so that the attention mechanism can gather sufficient information. We also use a preprocessor to extract contextual information around the key sentences to help the classifier make better decisions. Our approach improves the accuracy of text classification model training without having to exploit all the information in the input document.

ContextAggregator identifies key sentences and selects their contextual sentences as sentence blocks, in descending order of the key sentence weights. The contextual sentences are picked around a key sentence within a radius of the step size. All sentence blocks with an aggregated length below the max sequence length are then sorted back into their original relative order.

1. Identify key sentences with relevant signals

To capture the relational information, we identify keywords with spaCy v3's PhraseMatcher and word relations with its DependencyMatcher.

With DependencyMatcher, we add a list of dictionaries to represent each relation pattern. Each dictionary describes a token with its attributes and its relations to other tokens in the pattern. We then use PhraseMatcher to check for key phrases and inspect the subjects of all matched sentences. Since this process adds more confidence that we've identified key sentences with relevant signals, we call these strong matches.
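Below is a minimal, illustrative sketch of this setup; the pattern, lemmas, and phrases are simplified assumptions rather than our production rules:

```python
import spacy
from spacy.matcher import DependencyMatcher, PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Hypothetical relation pattern: a "provides/advises/audits" verb with a subject.
# Each dict describes one token and its relation to the anchor token.
dep_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": {"IN": ["provide", "advise", "audit"]}}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
dep_matcher = DependencyMatcher(nlp.vocab)
dep_matcher.add("SERVICE_RELATION", [dep_pattern])

# Hypothetical key phrases for the weaker, phrase-level match.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("SERVICE_PHRASE",
                   [nlp.make_doc(t) for t in ["legal counsel", "financial advisor"]])

doc = nlp("Acme LLP provides legal counsel to TerraPay on the transaction.")
print(dep_matcher(doc))     # strong match: the dependency pattern fires
print(phrase_matcher(doc))  # weaker match: a key phrase is present
```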

Even though our matcher implementation extracts signals specific to service provider information, the concept of identifying signal-based context is generalizable to other use cases.

2. Implementation details of ContextAggregator

  • A key sentence is defined as a sentence with a relevant signal
  • Scan through all content sentences to locate key sentences and record each sentence's length
  • Loop through all key sentences in descending order of signal strength
  • While the key sentences plus surrounding context < max sequence length:
    a) Identify surrounding sentences within N step-sizes before/after the key sentence
    b) Sort contextual sentences by their closeness to the key sentence and by length
    c) Break if all qualified contextual sentences have been visited
  • Sort the aggregated sentences and update the metadata of service provider signals found in the aggregated context (a minimal sketch of this loop follows below)
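The steps above translate into roughly the following Python sketch. The names are illustrative assumptions, and token counts are approximated by whitespace splitting rather than the model's real tokenizer:

```python
def aggregate_context(sentences, signal_strength, max_len=512, step_size=2):
    """sentences: sentence strings in document order.
    signal_strength: dict mapping key-sentence index -> signal weight."""
    n_tokens = [len(s.split()) for s in sentences]  # rough token-count proxy
    selected = set(signal_strength)                 # always keep the key sentences
    used = sum(n_tokens[i] for i in selected)
    # Visit key sentences from strongest to weakest signal.
    for k in sorted(signal_strength, key=signal_strength.get, reverse=True):
        lo, hi = max(0, k - step_size), min(len(sentences), k + step_size + 1)
        # Contextual candidates near the key sentence, closest and shortest first.
        for i in sorted(range(lo, hi), key=lambda i: (abs(i - k), n_tokens[i])):
            if i not in selected and used + n_tokens[i] <= max_len:
                selected.add(i)
                used += n_tokens[i]
    # Restore the original relative order before re-joining.
    return " ".join(sentences[i] for i in sorted(selected))

doc = ["Filler sentence one.", "Acme LLP advises TerraPay.",
       "More filler text here.", "Unrelated closing remark."]
print(aggregate_context(doc, {1: 0.9}, max_len=12, step_size=1))
```

In the toy call above, the key sentence at index 1 and its two neighbors fit within the budget, while the unrelated closing sentence is dropped.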

For news with short content, i.e., when the original content is already shorter than the max sequence length, the context aggregation step is skipped. For the sake of data augmentation, we considered reordering the raw content by sentence signals, but that is not necessary for classification purposes.

When the number of sentences in an article is large, and the key sentences are scattered throughout, this process can take some time. In our dataset, it took ~10 seconds on average to process one document with 100+ sentences.

3. Model evaluation

Our results confirm that ContextAggregator works for long sequences with a sprinkling of vital signals at any position, and that it speeds up the fine-tuning procedure.

We also explored the following options and their impact on the model performance:

i. Tweak predicted probability cutoff

The default cutoff for a binary classifier is 0.5. To increase prediction precision, we tried lifting it to a much higher threshold, e.g., 0.85 or 0.9.

ii. Use rule-based signals to change the predicted label

For any article with key sentences worth highlighting, we allow a precision sacrifice to gain an additional recall boost.
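As an illustration of both options, here is a toy sketch with made-up probabilities and rule flags, not our actual experiment data:

```python
import numpy as np

# Made-up predicted probabilities and rule flags, one per article.
probs = np.array([0.55, 0.92, 0.48, 0.87])
has_strong_match = np.array([False, True, True, False])  # strong pattern match found

# Option i: raise the cutoff from the default 0.5 for higher precision.
labels = (probs >= 0.9).astype(int)  # -> [0, 1, 0, 0]

# Option ii: let strong rule matches flip predictions to positive for recall.
labels[has_strong_match] = 1         # -> [0, 1, 1, 0]
print(labels)
```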

However, it turns out that with ContextAggregator surfacing the signal context, the fine-tuned classifier is already highly confident on the positive class. Adding rule-based sentence selection can increase error rates when the semantic meaning of a match differs from the ground truth. In our experiment, we found that the model's accuracy saw no further improvement from probability cutoff tweaks or post-processing rules.

On the other hand, parallelizing the first scan for key sentences across extremely long documents (>1,000 sentences) does help reduce processing time.
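One way to parallelize that first scan is spaCy's built-in multiprocessing; in this sketch a simple lemma check stands in for the full matcher logic:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Toy stand-in for an extremely long document, pre-split into sentences.
sentences = ["Acme LLP provides legal counsel to TerraPay."] * 2000

# n_process fans the pipeline out across worker processes.
docs = nlp.pipe(sentences, n_process=4, batch_size=256)
key_indices = [i for i, doc in enumerate(docs)
               if any(tok.lemma_ in {"provide", "advise", "audit"} for tok in doc)]
print(len(key_indices))  # 2000: every toy sentence carries the signal
```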

Figure 2: Service Provider Pipeline Demonstrated
  • Rule-based only: Rely on service provider pattern (strong) matches on both title and content to label an article positive, and treat no signal as negative.
  • Rule as post-processing: Use positive service provider pattern matches to relabel some negatives as the positive class.
  • Relaxed rules: A sentence counts if it has either pattern (strong) matches or key phrase (weaker) matches.

Related work

It is not memory-efficient to capture sentence-level or higher-level features directly. A natural workaround is to slice long text into smaller pieces before aggregating the processed embeddings. The RoBERT and ToBERT [3] paper describes how to slice input text into smaller pieces and aggregate segment-wise predictions with mean-pooling, max-pooling, or an LSTM or another transformer over the segments. However, the performance improvement is not noticeable, because correct labels cannot be assigned to each smaller chunk of text and long-distance attention is missing.
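The segment-slicing idea can be sketched as follows; predict_chunk is a hypothetical stand-in for a per-segment classifier:

```python
import numpy as np

def chunk_and_pool(token_ids, predict_chunk, chunk_len=512):
    """Slice a long token sequence into fixed-length chunks, score each chunk,
    then mean-pool the chunk-level predictions into one document score."""
    chunks = [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]
    return float(np.mean([predict_chunk(c) for c in chunks]))

# Toy usage: 1,500 dummy tokens scored by a dummy per-chunk classifier.
score = chunk_and_pool(list(range(1500)), predict_chunk=lambda c: len(c) / 512)
print(score)
```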

Another long-sequence model is the Longformer [4] from the Allen Institute for AI. It doesn't need to cut the input sequence to avoid memory overflow. Its local attention with dilation makes every token attend only to tokens in a vicinity defined by the window size w, with ½w on either the left or the right. Although long documents can be processed without information loss, non-task-specific truncation could still confuse the base model when a task-specific architecture is required to handle the input text.
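For reference, the Longformer checkpoint on the Hugging Face hub advertises its longer limit directly in its configuration:

```python
from transformers import AutoConfig

# Longformer accepts much longer inputs than standard BERT-style models.
config = AutoConfig.from_pretrained("allenai/longformer-base-4096")
print(config.max_position_embeddings)  # 4098: room for ~4,096 tokens
```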

The Reformer [5] model claims to push the sequence length limit to half a million tokens at once with a re-engineered architecture optimized for minimal memory requirements. However, it's not GPU-friendly, and its use for BERT-style tasks still needs verification.

Final thoughts

As demonstrated, converting long documents into a shorter context and focusing on high-signal sentences adds lift to the classification model fine-tuned on our dataset. It performs better than the general sliding-window approach or hierarchical attention networks [3], especially when the training data is small.

In this post, we covered the practical aspects of context aggregation as a preprocessing step for fine-tuning a RoBERTa model on our text classification task. Using this approach, we can process very long texts and allow the predictor to capture ground-truth signals wherever they appear. Our approach outperforms the fine-tuned RoBERTa baseline and Longformer on both the cross-validation and test sets, especially when extremely long text is fed as input.

When text classification is highly impacted by high-signal sentences and input length matters, it's better to ensure the model can learn the pattern from the context surrounding them. With the key sentences highlighted and aggregated, we also gain better interpretability of the predicted results.

Are you ready to beat us on our benchmark? Challenge accepted. Please feel free to reach out to us with your thoughts. We’d like to hear from you.

And if you're eager to apply your software engineering or data science skills to CB Insights data, please check out our current openings and join our team.
