NAACL ’19 Notes: Practical Insights for Natural Language Processing Applications — Part II
Continuing from Part I of this blog post, we survey recent advances in several important NLP tasks: text similarity, text classification, sequence labeling, and language generation.
A NAACL ’19 paper “Correlation Coefficients and Semantic Textual Similarity” [code] questions the use of cosine similarity in the word embedding space. The core idea is to treat a word or sentence embedding as a sample of N observations of some scalar random variable, where N is the embedding size. Classical statistical correlation measures can then be applied to pairs of vectors. The authors’ empirical analysis shows that cosine similarity is practically equivalent to Pearson’s (linear) correlation coefficient for commonly used word embeddings (GloVe, FastText, word2vec). This follows from the fact that the component values observed in practice are distributed around a zero mean. In the word similarity scenario, a violation of the normality assumption makes cosine similarity especially inappropriate for GloVe vectors. For FastText and word2vec, the results of the Pearson coefficient and the rank correlation coefficients (Spearman, Kendall) are comparable. However, cosine similarity is suboptimal for sentence vectors computed as centroids of word vectors (a widely used baseline for sentence representation), even for FastText. The cause is stop word vectors behaving as outliers. Rank correlation measures are empirically preferable in this case.
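The relationship between these measures can be illustrated with a minimal NumPy sketch. The vectors below are synthetic (an assumption for illustration, not real GloVe or FastText embeddings), the helper names are ours, and the simple ranking helper does not average ties.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(u, v):
    """Pearson's r: the cosine similarity of the mean-centered vectors."""
    return cosine(u - u.mean(), v - v.mean())

def spearman(u, v):
    """Spearman's rho: Pearson's r on ranks (ties are not averaged here)."""
    def rank(x):
        return np.argsort(np.argsort(x)).astype(float)
    return pearson(rank(u), rank(v))

# Synthetic stand-ins for two 300-dimensional embeddings (hypothetical data).
rng = np.random.default_rng(0)
u = rng.normal(size=300)
v = 0.5 * u + rng.normal(size=300)

# With an exactly zero mean, cosine similarity and Pearson's r coincide.
u0, v0 = u - u.mean(), v - v.mean()
print(cosine(u0, v0), pearson(u0, v0))

# A single shared outlier component shifts Pearson much more than Spearman.
u_out, v_out = u.copy(), v.copy()
u_out[0] = v_out[0] = 50.0
print("Pearson shift: ", pearson(u_out, v_out) - pearson(u, v))
print("Spearman shift:", spearman(u_out, v_out) - spearman(u, v))
```

The second part mimics, in a very simplified way, the outlier effect attributed to stop word vectors: one extreme coordinate dominates the linear correlation while barely changing the ranks.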
“Rethinking Complex Neural Network Architectures for Document Classification” and its follow-up paper compare state-of-the-art (SOTA) document classification models on four accessible datasets (Reuters, Arxiv APD, IMDB, Yelp). The PyTorch implementations of the considered models are available in the authors’ framework Hedwig. While the fine-tuned BERT classifier shows the best results as expected, other findings are somewhat surprising. The second-best model is a simple bi-LSTM classifier, properly regularized and enhanced with max-pooling to get a document feature vector. It leaves behind some complex hierarchical architectures, such as hierarchical attention networks (HAN) or XML-CNN, questioning the need for such complexity for this task. Moreover, on the datasets (Reuters, Arxiv) with a large number of classes and relatively scarce examples, the last two models are outperformed even by standard one-vs-rest logistic regression and SVM trained on TF-IDF vectors.
“Mitigating Uncertainty in Document Classification” [code] proposes metric learning on feature representations and a dropout-based method to measure the uncertainty of a deep learning model for text classification (with possible application in high-accuracy use cases, such as the medical domain). The classifier architecture is fairly standard: a convolutional neural network (CNN) over trainable word embeddings (initialized with GloVe vectors), followed by dropout, a fully-connected layer, and a softmax layer. Metric learning is used to train the word embeddings in a way that minimizes the intra-class Euclidean distance and maximizes the inter-class Euclidean distance. The objective is expressed in terms of Sₖ, the set of instances of the k-th class; rᵢ and rⱼ, the feature representations of instances i and j; and D, the Euclidean distance.
The incorporation of metric learning can diminish the prediction variance and increase the confidence of the accurate predictions.
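To make the intra-class versus inter-class idea concrete, here is a minimal NumPy sketch. It is not the paper's exact objective: the particular combination (mean intra-class distance minus mean inter-class distance) and the function name `metric_loss` are assumptions chosen for illustration.

```python
import numpy as np

def metric_loss(features, labels):
    """Illustrative objective: mean intra-class minus mean inter-class distance.

    Assumes at least two classes and at least one class with two instances.
    """
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean D(r_i, r_j)
    same = labels[:, None] == labels[None, :]     # True where i and j share a class
    off_diag = ~np.eye(len(labels), dtype=bool)   # drop the trivial i == j pairs
    intra = dist[same & off_diag].mean()
    inter = dist[~same].mean()
    return float(intra - inter)

features = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
well_separated = metric_loss(features, np.array([0, 0, 1, 1]))
mixed_up = metric_loss(features, np.array([0, 1, 0, 1]))
print(well_separated, mixed_up)
```

A lower value means that instances of the same class sit closer together than instances of different classes, which is the geometry the metric learning term is meant to encourage.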
The dropout-based method measures the model uncertainty in terms of the information entropy of multiple dropout evaluations combined with the denoising mask operations. The output vector of predicted classes y* = (y*₁,…,y*ₖ) is obtained by applying dropout after CNN k times (k = 100 in the experiments). The entropy of this class distribution (after masking 1/3 of the most underrepresented classes to reduce the noise) is calculated as an uncertainty score. We note that the variational dropout method still raises heated theoretical discussions (e.g., see this thread on Reddit; thanks to Grigory Sapunov for pointing this out). Nonetheless, the paper authors have shown that the approach boosted the macro-F1 score from 78% to 92% by assigning 25% of the labeling work to human experts in a 20-class text classification task.
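The scoring step can be sketched as follows. This is a simplified reading of the description above, not the authors' code: the stochastic forward passes are replaced by a ready-made list of predicted class ids, and the function name and the way the masked third is chosen (least frequently predicted classes) are assumptions.

```python
import numpy as np

def dropout_uncertainty(predicted_classes, num_classes):
    """Entropy of the class histogram collected over repeated dropout passes."""
    counts = np.bincount(predicted_classes, minlength=num_classes).astype(float)
    # Denoising mask: zero out the third of the classes predicted least often.
    num_masked = num_classes // 3
    if num_masked > 0:
        least_frequent = np.argsort(counts, kind="stable")[:num_masked]
        counts[least_frequent] = 0.0
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

# Hypothetical outputs of 100 stochastic passes for two documents, 3 classes.
confident = np.array([0] * 100)
uncertain = np.array([0] * 50 + [1] * 50)
print(dropout_uncertainty(confident, 3))   # all passes agree
print(dropout_uncertainty(uncertain, 3))   # passes split between two classes
```

Documents whose score exceeds a chosen threshold would then be routed to human experts, which is the trade-off behind the reported macro-F1 gain.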
“Ranking-Based Autoencoder for Extreme Multi-label Classification” proposes a novel principled approach to extreme multi-label text classification, i.e., multi-label text classification with a massive number of labels. This task has many real-world applications. For example, we at Orb Intelligence are doing NAICS industry classification (over 2,200 hierarchical classes) of company text descriptions. This task is also characterized by semantic relationships between labels (classes are not exclusive), class imbalance, and label incompleteness.
The authors developed a new deep learning method, Rank-AE, which includes (Figure 1):
- a self-attention mechanism to learn rich representations of input texts;
- an auto-encoder to project both features and labels onto the common latent space wherein correlations between features and labels are exploited. The auto-encoder then reproduces labels by decoding;
- a margin-based ranking loss that is more effective for extreme classification settings and more robust to noisy labeling.
The loss ℒₕ(xₕ, yₕ) between the latent representations of the features and the labels is chosen as the mean squared loss. The architecture is capable of capturing inter-dependencies between labels during training. During inference, the label encoder ℇ is ignored. The reconstruction loss ℒₐₑ(y, y’) is a combination of two margin ranking losses, one for positive and one for negative labels.
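A combination of two margin ranking losses can look like the sketch below. The exact form used in Rank-AE may differ: pairing each positive label with its hardest negative (and vice versa), the margin value, and the function name are assumptions made for illustration.

```python
import numpy as np

def margin_ranking_loss(scores, targets, margin=1.0):
    """Sum of two hinge terms over positive and negative labels.

    scores:  predicted label scores y', shape (num_labels,)
    targets: binary ground truth y, shape (num_labels,), with at least
             one positive and one negative label
    """
    pos = scores[targets == 1]
    neg = scores[targets == 0]
    # violation[i, j] = max(0, margin + negative_j - positive_i)
    violation = np.maximum(0.0, margin + neg[None, :] - pos[:, None])
    loss_pos = violation.max(axis=1).sum()   # each positive vs. its hardest negative
    loss_neg = violation.max(axis=0).sum()   # each negative vs. its hardest positive
    return float(loss_pos + loss_neg)

targets = np.array([1, 1, 0, 0])
print(margin_ranking_loss(np.array([5.0, 4.0, 0.0, -1.0]), targets))  # well ranked
print(margin_ranking_loss(np.array([0.2, 0.9, 0.5, 0.1]), targets))   # violations
```

Because the loss only cares about the relative order of positive and negative labels, a single mislabeled item affects it less than it would affect a per-label cross-entropy, which is one intuition for the robustness to noisy labels.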
The exploited attention mechanism is dual (Figure 2).
First, it weighs word embeddings in the text with TF-IDF. Second, the channel attention is designed to weigh different components of a word embedding (assuming that some of them emphasize, say, the commercial sense of the term “apple” while others do so for the agricultural one). The channel attention is implemented as an excitation network (two fully-connected layers with non-linear activations). This kind of network had previously been used only for images. After applying these two attention mechanisms to the embedding matrix, average pooling is used to get a feature embedding x’. The ablation study shows that Rank-AE benefits from the margin-ranking loss on noisy datasets and from the attention on complex multi-aspect texts. The post-analysis of attention weights provided in the paper helps explain which text spans contributed to the predicted label.
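The dual attention can be sketched in a few lines of NumPy. The details here are assumptions, not the paper's specification: the excitation input is the mean over words, the activations are ReLU and sigmoid, biases are omitted, and the weight matrices `w1` and `w2` are random placeholders for trained parameters.

```python
import numpy as np

def dual_attention_pool(embeddings, tfidf, w1, w2):
    """Word-level TF-IDF weighting, channel gating, then average pooling.

    embeddings: (n_words, dim) word embedding matrix of one document
    tfidf:      (n_words,) TF-IDF weight of each word
    w1, w2:     (dim, hidden) and (hidden, dim) excitation weights
    """
    # Word attention: scale each word vector by its normalized TF-IDF weight.
    word_weights = tfidf / tfidf.sum()
    weighted = embeddings * word_weights[:, None]
    # Channel attention: two fully-connected layers yield one gate per dimension.
    squeezed = weighted.mean(axis=0)
    hidden = np.maximum(0.0, squeezed @ w1)            # ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))       # sigmoid, values in (0, 1)
    # Average pooling over words gives the feature embedding x'.
    return (weighted * gates[None, :]).mean(axis=0)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))                          # 5 words, 8 dimensions
tfidf = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
w1, w2 = rng.normal(size=(8, 4)), rng.normal(size=(4, 8))
x = dual_attention_pool(emb, tfidf, w1, w2)
print(x.shape)
```

The per-dimension gates are what allow the model to emphasize some embedding components over others, and inspecting the word weights and gates is the kind of post-analysis mentioned above.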
“Integrating Semantic Knowledge to Tackle Zero-Shot Text Classification” [code] offers a principled approach to zero-shot text classification, i.e., prediction of classes not represented in the training data. In this scenario, we still assume that we have at least the names of the unseen classes and, possibly, short descriptions, inter-class taxonomic relationships, or even semantic relationships. The approach is two-staged (Figure 3).
The first phase, coarse-grained classification, predicts if an input document comes from seen or unseen classes. The multi-class classification is broken down to multiple one-vs-rest classification tasks. The authors apply a data augmentation technique to help the classifiers be aware of the existence of unseen classes without accessing their labeled data. Then the second phase, fine-grained classification, finally specifies the class of the input document. It uses either a) a traditional multi-class classifier trained on examples of seen classes only, or b) a zero-shot binary classifier, depending on the coarse-grained prediction given by the first phase. Given a feature vector xᵢ and a class name vector c, the zero-shot classifier takes (xᵢ, c) pairs as input and learns to predict the confidence p(ŷᵢ = c|xᵢ). Feature augmentation based on semantic knowledge is used to provide additional information which relates the document and the unseen classes to generalize the zero-shot reasoning. More details on data augmentation and feature augmentation used:
1. topic translation: translating a document word-by-word from its seen class c (represented by the word vector of the class name) to a new unseen class c’ using a word analogy (the 3COSMUL method), while preserving the part-of-speech role of each translated word w (nouns ⇒ nouns, verbs ⇒ verbs, etc.). The translated documents are used to train a zero-shot classifier for a given unseen class. They are also used as negative examples for learning the binary classifiers of seen classes.
2. feature augmentation: the embedding of each word wⱼ is enhanced with two vectors:
(a) ũ(wⱼ,c) is a relationship vector that shows how the word wⱼ and the class c are related considering the relations in a general knowledge graph, such as ConceptNet.
(b) ṽ(c) is a word embedding of the class name c.
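The word-analogy step of topic translation can be sketched with the 3CosMul objective of Levy and Goldberg: choose the candidate that is similar to the target class and to the original word but dissimilar to the source class. The toy vectors, the exclusion of the query words, and the function name are assumptions for illustration. The part-of-speech constraint described above is not implemented here.

```python
import numpy as np

def three_cos_mul(word, src_class, dst_class, vectors, eps=1e-3):
    """Return the vocabulary word w' that best completes src_class : dst_class :: word : w'."""
    def unit(x):
        return x / np.linalg.norm(x)

    w, c, c_new = (unit(vectors[k]) for k in (word, src_class, dst_class))
    best, best_score = None, -np.inf
    for cand, vec in vectors.items():
        if cand in (word, src_class, dst_class):
            continue                       # skip the query words themselves
        v = unit(vec)
        # Cosines are shifted from [-1, 1] to [0, 1] so the ratio stays positive.
        sim_new, sim_w, sim_c = ((v @ a + 1.0) / 2.0 for a in (c_new, w, c))
        score = sim_new * sim_w / (sim_c + eps)
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy 4-dimensional vectors (hypothetical): [sports, politics, person, event].
vectors = {
    "sports":     np.array([1.0, 0.0, 0.0, 0.0]),
    "politics":   np.array([0.0, 1.0, 0.0, 0.0]),
    "player":     np.array([1.0, 0.0, 1.0, 0.0]),
    "politician": np.array([0.0, 1.0, 1.0, 0.0]),
    "match":      np.array([1.0, 0.0, 0.0, 1.0]),
    "election":   np.array([0.0, 1.0, 0.0, 1.0]),
}
translated = three_cos_mul("player", "sports", "politics", vectors)
print(translated)
```

With real embeddings, libraries such as gensim offer a comparable multiplicative analogy query, so a hand-written loop like this is mainly useful for understanding the scoring.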
The experiments on the DBpedia ontology dataset and the 20 Newsgroups dataset have shown that data augmentation by topic translation improved the accuracy in detecting instances from unseen classes. Moreover, feature augmentation enables knowledge transfer from seen to unseen classes for zero-shot learning. The method achieved the highest accuracy in each phase and overall compared with competitive baselines.
“Pooled Contextualized Embeddings for Named Entity Recognition” by Zalando Research takes advantage of character-level LSTM-based contextual embeddings pooled (with min/max/average) across all sentence contexts in a large corpus (Figure 4).
Such global representation features two charming properties:
• pre-training: the improved representation of rare words in underspecified contexts — it benefits from “memorizing” the representation of words (or entities) in other, presumably richer, contexts;
• downstream task training: evolution of word representations as more instances of the same word are observed in the downstream task data.
The final string embeddings are formed by concatenating the original contextual embedding and the pooled representation and are also enhanced with standard GloVe or FastText word embeddings. The experiments have shown that the pooled contextualized embeddings boost the performance of a BiLSTM-CRF tagger for multilingual named entity recognition (NER) to a new SOTA (outperforming even BERT-NER!). The model implementation is available in the Flair framework.
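The pooling idea can be sketched with a small memory that stores every contextual embedding seen for a word. This is not Flair's API: the class `PooledEmbedder`, its method names, and the toy vectors are hypothetical, and the concatenation with GloVe or FastText vectors is left out.

```python
from collections import defaultdict

import numpy as np

class PooledEmbedder:
    """Keeps a per-word memory of contextual embeddings and pools over it."""

    def __init__(self, pooling="min"):
        self.pool = {"min": np.min, "max": np.max, "mean": np.mean}[pooling]
        self.memory = defaultdict(list)

    def embed(self, word, contextual_vec):
        # Remember this occurrence, then pool over all occurrences seen so far.
        self.memory[word].append(contextual_vec)
        pooled = self.pool(np.stack(self.memory[word]), axis=0)
        # Final embedding: current contextual vector plus the pooled memory.
        return np.concatenate([contextual_vec, pooled])

embedder = PooledEmbedder(pooling="min")
first = embedder.embed("Indra", np.array([0.2, 0.9, -0.1]))
second = embedder.embed("Indra", np.array([0.5, 0.1, 0.3]))
print(first)
print(second)
```

The second call shows the "evolution" property: the pooled half of the embedding changes as more occurrences of the same word are observed, even though the contextual half depends only on the current sentence.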
“Pre-trained language model representations for language generation” [code] by Facebook AI Research explores different strategies for incorporating pre-trained representations into a seq2seq (encoder-decoder) architecture, with applications in machine translation and abstractive summarization. Both the encoder and the decoder are implemented as transformers in the Fairseq framework. The considered strategies include:
• adding contextualized word embeddings (ELMo) as input to the encoder or the decoder;
• fine-tuning: replacing the learned word embeddings in the encoder (or, separately, in the decoder) with the LM representation of the layer before softmax.
The experiments show that adding pre-trained representations is very effective for the encoder network in both setups (though at the expense of a 5x training slowdown and only a 12–14% inference slowdown). Interestingly, the quality improvements diminish when more labeled data becomes available, which is in line with the sample efficiency of pre-training discussed in Part I.
This concludes Part II. In the final Part III, we’ll give an overview of frameworks and miscellaneous effective techniques (attention and self-attention, model visualization and interpretation, adversarial learning, knowledge distillation, multi-modal learning).