
Occam’s Razor Meets Content Classification: Matchmaking in NLP Heaven

Research Team @ sumup.ai

SumUp Analytics
Feb 15, 2020


This post deals with document classification, where the goal is to determine the category a given document belongs to. There are several industry applications of such a capability, one being the often-mentioned automated content moderation on social-media platforms. The classification of software error logs is another important application, as is the ‘Primary’, ‘Social’ and ‘Promotions’ classification provided within Gmail and appearing as separate tabs in the user mailbox. Here, we explore how simpler, transparent and robust methods can compete with state-of-the-art black-box classification methods.

Data sets. In order to compare classification algorithms, a labeled dataset is required, ideally with multiple classes so as to gain more confidence in the performance of a given algorithm, and ideally one on which prior work has been published, thereby establishing a baseline for comparison.

Open, labeled social-media datasets with published classification work are rare, and the published results are often subject to participation bias: some participants come in with pre-trained models that benefit from supplemental data rather than training only on the sample provided (see, for instance, the discussion in Section 5 of https://arxiv.org/abs/1903.08983).

The publicly available Enron dataset contains emails from several key employees at Enron, assigned to email folders created by those employees. It is a good candidate for our exercise and is the subject of a recently published blog by Cortical.io. Within this dataset, we follow the methodology described in that blog to derive the subset of the data on which classifiers are evaluated, arriving at three mailboxes (‘kaminski-v’, ‘farmer-d’ and ‘lokay-m’) and a set of folders within each.

For each mailbox, we retain 50% of each folder as training data, 25% of each folder as validation data, and the remaining 25% as testing data. The actual document subsets belonging to each of training, validation and testing data are derived using random sampling, part of a standard cross-validation exercise.
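For concreteness, here is a minimal sketch, in Python, of the per-folder random split described above, assuming the emails of one mailbox have already been loaded into a dictionary mapping folder names to lists of documents. The function names are illustrative, not part of any particular library.

    import random

    def split_folder(docs, seed=0):
        # Randomly split one folder's documents into 50% train, 25% validation, 25% test.
        rng = random.Random(seed)
        docs = list(docs)
        rng.shuffle(docs)
        n_train, n_valid = len(docs) // 2, len(docs) // 4
        return docs[:n_train], docs[n_train:n_train + n_valid], docs[n_train + n_valid:]

    def split_mailbox(folder_docs, seed=0):
        # Apply the per-folder split across a whole mailbox, keeping the folder label with each email.
        train, valid, test = [], [], []
        for folder, docs in folder_docs.items():
            tr, va, te = split_folder(docs, seed)
            train += [(doc, folder) for doc in tr]
            valid += [(doc, folder) for doc in va]
            test += [(doc, folder) for doc in te]
        return train, valid, test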

Each folder contains anywhere from a hundred to several thousand emails. The classification task is run on an owner-by-owner basis: classifying among all ‘kaminski-v’ folders, then among all ‘farmer-d’ folders, and likewise among all ‘lokay-m’ folders. As a result, the ownership of a given email is known prior to running a classification task. This is a key aspect for the second iteration of our experiment, detailed further down in this post.

The evaluation metrics typically used in the classification field are Accuracy, Precision, Recall and F1 score. Different classification applications rely on different metrics to optimize and evaluate against. For instance, content moderation puts a strong emphasis on Recall and F1, first because there is a high cost to missing a post that should have been moderated given the nature of its content, and second because such posts are a small fraction of the total daily content. In the case of an intelligent mailbox, Accuracy may be the more relevant metric, as neither the data nor the objective function is so asymmetric.
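As a quick refresher, the short sketch below computes these four metrics with scikit-learn; the labels are placeholder folder names used purely for illustration.

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # Placeholder labels, for illustration only.
    y_true = ["personal", "projects", "personal", "resumes"]
    y_pred = ["personal", "personal", "personal", "resumes"]

    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")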

Results. We have added one model to the comparison presented in Cortical’s blog, for a total of six different methods:

· Word2vec: a simple word embedding combined with a linear classifier (a minimal sketch of this kind of baseline appears after this list).

· Doc2vec: the document-level version of word2vec, combined with a linear classifier.

· FastText: a language model and classifier developed by Facebook, relying on a shallow neural-network architecture over bags of n-gram embeddings.

· BERT: a language model developed by Google, relying on a Transformer architecture. It can be fine-tuned as a classifier.

· Cortical.io: an email-only proprietary classifier developed by Cortical.io, leveraging their Semantic Folding technology.

· Nucleus: the binary classifier and transparent language-feature selection in the proprietary platform developed and commercialized by sumup.ai. Unlike conventional classifiers, Nucleus takes a parsimonious approach to the number of features used (a parameter exposed to the user) and lets users check the retained features for sensibility and, optionally, fine-tune further through custom stopwords.
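As announced in the first item of the list, here is a minimal sketch of a word-embedding-plus-linear-classifier baseline, assuming gensim and scikit-learn are available. It illustrates the general approach only and is not the exact pipeline used in the benchmark.

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    def tokenize(text):
        return text.lower().split()

    def train_word2vec_baseline(train_pairs, vector_size=100):
        # train_pairs is a list of (email_text, folder_label) tuples.
        sentences = [tokenize(text) for text, _ in train_pairs]
        w2v = Word2Vec(sentences, vector_size=vector_size, min_count=1, seed=0)

        def embed(text):
            # Average the vectors of in-vocabulary tokens; fall back to zeros if none.
            vecs = [w2v.wv[t] for t in tokenize(text) if t in w2v.wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(vector_size)

        X = np.vstack([embed(text) for text, _ in train_pairs])
        y = [label for _, label in train_pairs]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        return clf, embed

At prediction time, a new email is embedded with the same averaging function and passed to clf.predict.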

For reference on Nucleus, we report summary metrics on the number of features that have been retained to categorize emails among the folders within each mailbox.

Summary statistics on the number of classifying features retained across all folders within each of the three Enron mailboxes, using the method Nucleus.

This is a noteworthy degree of parsimony relative to the other approaches, which use the whole feature set (~19,000 features for ‘lokay-m’ and ‘farmer-d’, and ~46,000 features for ‘kaminski-v’) and compress it with language models that project the feature set onto a smaller space of factors (typically a few hundred). These factors are the embedded representations of words according to a language model and do not have a transparent interpretation.

The table below reports the classification accuracy for each method on each of the three mailboxes taken from the Enron dataset.

Average performance (multi-class accuracy) for six classification methods evaluated on three Enron mailboxes. Nucleus calculations by SumUp, other calculations by Cortical.

The transparent and parsimonious method Nucleus is on par with the sophisticated black-box approaches and edges them out in terms of robustness on this dataset.

We perform a second iteration with Nucleus that leverages the optional custom stopword list, to provide a concrete illustration of its benefits for a user looking to embed Nucleus within their own workflow. Since the dataset is composed of emails with minimal processing, we provide, as stopwords, the names and work aliases of each of the three Enron employees whose mailboxes are used.

The aliases are composed of a last name, a location code, a division code, and possibly an additional handle found in the Enron email headers. For example: Vince Kaminski/HOU/ECT@ECT.

‘HOU’ designates the employee’s office location (here, Houston), and ‘ECT’ the division to which the employee belongs (here, Enron Capital and Trade Resources). When representing these features, punctuation and symbols are omitted.

We also add standard email boilerplate terms to that list.
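To make this concrete, the sketch below assembles such a custom stopword list by splitting the aliases on punctuation and symbols and appending a few boilerplate terms; the boilerplate terms shown are hypothetical examples, not the exact list used in this experiment.

    import re

    aliases = [
        "Vince Kaminski/HOU/ECT@ECT",
        # ... the aliases of the other two mailbox owners would be listed here
    ]

    def alias_tokens(alias):
        # Split on '/', '@' and whitespace, drop empty pieces, and lowercase.
        return [t.lower() for t in re.split(r"[/@\s]+", alias) if t]

    # Hypothetical boilerplate terms, for illustration only.
    boilerplate = ["subject", "cc", "bcc", "forwarded", "original", "message"]

    custom_stopwords = sorted({t for a in aliases for t in alias_tokens(a)} | set(boilerplate))
    print(custom_stopwords)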

Upon running Nucleus again, there is a significant decrease in the number of features retained to categorize emails among the folders within each mailbox: 10 features are sufficient for all folders but two, which need 30. This is a notable reduction in model complexity while still retaining full interpretability.

The following table reports the classification accuracy of each method on each of the three mailboxes taken from the Enron dataset, with the additional iteration on the method Nucleus.

Average performance (multi-class accuracy) for six classification methods evaluated on three Enron mailboxes, where the use of stopwords in the method Nucleus is illustrated. Nucleus calculations by SumUp, other calculations by Cortical.

The following table reports the cross-validated standard error of the performance metrics generated by the method Nucleus, using 50 draws of the triplet (training, validation, testing).

Cross-validated standard error of the classification accuracy with the method Nucleus on three Enron mailboxes.
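For reference, the sketch below shows one way such a cross-validated standard error can be computed, reusing the split_mailbox function from the earlier sketch; train_and_score is a placeholder for whichever classifier pipeline is being evaluated.

    import numpy as np

    def cross_validated_standard_error(folder_docs, train_and_score, n_draws=50):
        # Repeat the random 50/25/25 split, score the classifier on each test set,
        # and report the mean accuracy and its standard error across draws.
        scores = []
        for draw in range(n_draws):
            train, valid, test = split_mailbox(folder_docs, seed=draw)
            scores.append(train_and_score(train, valid, test))
        scores = np.array(scores)
        return scores.mean(), scores.std(ddof=1) / np.sqrt(n_draws)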

This comparative study is not a conclusion on the superiority of one document-classification approach over another so much as an illustration that higher model complexity does not ensure higher model performance. It also provides some insight into the benefits of a transparent and flexible approach in enabling users to adapt to their specific context without involving heavy compute and man-hour resources.

It is worth noting that the alternative approaches already make use of a language model to explicitly capture and associate semantically related words, whereas incorporating this within Nucleus is an area of ongoing research.

There will be a part 2 to this post upon completion of that research. In it, we will also expand the evaluation of the other aforementioned classifiers by estimating confidence intervals for their performance metrics and by comparing them on other datasets.

Let us know of your thoughts and questions in the comments!

References:

Sun, C., et al. (2019). How to fine-tune BERT for text classification? China National Conference on Chinese Computational Linguistics. Springer, Cham.

Adhikari, A., Ram, A., Tang, R., & Lin, J. (2019). DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.

Yu, S., Su, J., & Luo, D. (2019). Improving BERT-Based Text Classification With Auxiliary Sentence and Domain Knowledge. IEEE Access, 7, 176600–176612.

Chang, W. C., Yu, H. F., Zhong, K., Yang, Y., & Dhillon, I. (2019). X-BERT: eXtreme Multi-label Text Classification with BERT. arXiv preprint arXiv:1905.02331.

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.


