NAACL ’19 Notes: Practical Insights for Natural Language Processing Applications — Part I
The 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) was held in Minneapolis, USA, June 2–7, 2019. NAACL-HLT is an A-starred peer-reviewed conference, following the Association for Computational Linguistics (ACL) conference, the main event in the world of computational linguistics, aka natural language processing (NLP). In this series of posts, inspired by this year’s NAACL, we review the most effective methods, techniques and frameworks used by academic researchers and industry engineers to achieve exceptional performance on different NLP tasks. These posts are aimed at hands-on software engineers, machine learning engineers, data scientists, and research scientists who specialize in NLP and, preferably, have a basic knowledge of modern neural networks. They are not intended as a comprehensive resource covering the whole scope of what is being done in the NLP community. Instead, we focus only on the topics that are most relevant to Orb Intelligence, such as language representation learning, transfer learning, multi-language support (Part I), text similarity, text classification, language generation, sequence labeling (Part II), frameworks, and miscellaneous techniques (Part III). At the same time, the reader may find that fairly generic ideas are quite applicable to out-of-scope NLP tasks.
Perhaps the biggest shift in modern NLP is switching from representing each feature as an independent orthogonal dimension (the so-called one-hot representation, which often comes with TF-IDF weights) to dense vector representation (see Figure 1). That is, each raw feature is embedded into a low-dimensional space (which is already convenient from an engineering perspective) and represented as a vector in that space. The main benefit of the dense representations is that the features are no longer independent and captured similarities or other mutual regularities help them generalize better (see Distributional semantics hypothesis). Another advantage is that the embeddings can be trained like other parameters of the neural network, optimizing some meaningful objective (prior to that, Latent Semantic Analysis, Brown clustering, and Latent Dirichlet Allocation were widely used for this purpose).
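To make the contrast concrete, here is a minimal sketch comparing the two representations. The three-word vocabulary and the 2-dimensional dense vectors are made up for illustration (they are not trained embeddings); the point is only that one-hot vectors are mutually orthogonal, while dense vectors can encode similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot: every word gets its own axis, so any two words are orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense: hand-picked 2-d vectors, purely illustrative (an assumption, not trained).
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

one_hot_sim = cosine(one_hot["cat"], one_hot["dog"])   # no similarity signal at all
dense_sim_close = cosine(dense["cat"], dense["dog"])   # related words end up close
dense_sim_far = cosine(dense["cat"], dense["car"])     # unrelated words stay apart
```

With one-hot vectors every pair of distinct words looks equally unrelated, so nothing learned about "cat" can carry over to "dog"; the dense space is what lets a model generalize across similar features.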
At the same time, the distributional semantics hypothesis has innate limitations:
• common sense is implicit and usually is not written down;
• word embeddings are prone to biases (e.g., ethical, social, professional);
• no grounding from modalities other than text.
The last point is fascinating because it marks the main difference between artificial neural language learners (which are slow and data-hungry) and human learners (particularly kids), who are capable of learning a language faster and from few examples. We tend to think that the line of work on multi-modal learning, i.e., combining various kinds of inputs (text, audio, image, video) during learning, is the next potential breakthrough in AI (see examples of its first steps in Part III).
Back to reality, in this Part I, we cover some essential aspects of modern language representation learning techniques.
S. Ruder, T. Wolf, S. Swayamdipta, and M. Peters gave an excellent tutorial “Transfer learning in NLP” at NAACL ’19. Transfer learning in NLP is considered a problem of automatically learning representations that transfer across tasks, domains, and languages with neural network-based methods for natural language processing (see Figure 2).
In principle, transfer learning hypothetically (and more and more empirically) outperforms supervised learning except for the scenarios where no relevant information is available, or there is already a sufficient number of available training examples.
Pre-training is the first basic step of transfer learning. The core idea of pre-training is that the internal representations trained to solve a primary NLP task with self-supervision (i.e., on unlabeled data) could be useful for other tasks as well. For example, word embeddings learned for a simple task of predicting a word in its context, such as word2vec, have now become essential building blocks in state-of-the-art NLP models. More difficult language modeling (LM) tasks, such as sentence prediction, contextual word prediction, and masked word prediction, will be reviewed below in this blog post. There also exist embeddings derived from neural machine translation (NMT), e.g., CoVe, and from autoencoding. Summarizing the current consensus on what is known about pre-training, the tutorial authors observe that:
- in general, the choice of a pre-training task and an end task is coupled, i.e., the more closely the pre-training task mimics the target task, the better the results;
- language modeling objective empirically works better than translation and autoencoding for downstream tasks:
– it’s hard enough: an LM has to compress any possible context (syntax, semantics, factual knowledge) to generalize over possible completions;
– it’s scalable: more data for pre-training and more parameters ⇒ a better LM and better word embeddings;
- pre-training improves sample efficiency, i.e., less annotated data for the end task is usually needed to achieve the same level of quality and faster convergence.
A pre-trained model needs to be adapted to the target task. This raises three essential questions:
1. how much to change the pre-trained model architecture for adaptation (architectural modifications);
2. which weights to train during adaptation and following what schedule (optimization schemes);
3. how to get more supervision signals for the target task (weak supervision, multi-tasking, and ensembling).
Firstly, we have two options for architectural modifications:
1a) Keep the pre-trained model internals unchanged. Remove a pre-training task head if not useful for a target task. Add transfer task-specific layers (randomly initialized) on top/bottom of the pre-trained model.
1b) Modify the internals. This includes adapting to a structurally different target task. For example, when moving from pre-training with a single input sequence to a task with several input sequences (translation, language generation), one may use pre-trained weights to initialize multiple layers of the target model (e.g., an LM for the initialization of encoders and decoders in MT). Another direction is task-specific modifications, such as adding skip/residual connections and attention layers. Finally, one can add adapters, or bottleneck modules, between layers of the pre-trained model. Adapters reduce the number of parameters to tune, allowing the other “heavy” layers to stay frozen during the transfer. They may contain different operations (convolution, self-attention) and are usually connected with residual connections in parallel to an existing layer; e.g., see this recent paper [code] that introduces adapter modules after the multi-head attention and feed-forward layers in the Transformer.
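As a rough illustration of the adapter idea, below is a minimal NumPy sketch of a bottleneck module with a residual connection. The class name, the sizes (768 and 64), the ReLU non-linearity, and the zero initialization of the up-projection are our assumptions for the example, not details taken from the paper mentioned above.

```python
import numpy as np

class Adapter:
    """Bottleneck adapter computing h + relu(h @ W_down) @ W_up."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection into a small bottleneck (randomly initialized).
        self.w_down = rng.normal(0.0, 0.02, size=(hidden_size, bottleneck_size))
        # Up-projection starts at zero (an assumption), so the adapter is
        # initially an identity map and does not disturb the frozen model.
        self.w_up = np.zeros((bottleneck_size, hidden_size))

    def __call__(self, h):
        z = np.maximum(h @ self.w_down, 0.0)   # ReLU inside the bottleneck
        return h + z @ self.w_up               # residual connection

    def num_parameters(self):
        return self.w_down.size + self.w_up.size

adapter = Adapter(hidden_size=768, bottleneck_size=64)
h = np.ones((2, 768))                  # stand-in for a frozen layer's output
out = adapter(h)                       # equals h before any training
full_layer_params = 768 * 768          # one dense 768x768 layer, for comparison
```

Only `w_down` and `w_up` would be trained during transfer, which in this toy configuration is about one sixth of the parameters of a single dense layer of the same width.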
Secondly, the possible decisions about optimization include:
2a) (not-)tuning of pre-trained weights. If we do not change the pre-trained weights, we end up with options like feature extraction and adapters. If the pre-trained weights change, then fine-tuning is employed. In this case, the pre-trained weights are used to initialize the parameters of the end task model. Generally speaking, if a source task and a target task are dissimilar (i.e., the source task does not include relations that are very beneficial for the target task), feature extraction is preferable in practice (see details in this paper). Transformers (e.g. BERT) are usually easier to fine-tune than LSTMs (e.g. ELMo).
2b) learning schedule. This is a decision about which weights to update, in what order, and at which rate. The motivation is to prevent overwriting of useful pre-trained knowledge (catastrophic forgetting) and to retain the transfer benefits. Useful techniques include: updating from top to bottom (usually top layers are task-specific, while bottom layers convey more general knowledge such as morphology and syntax), varying learning rates between different learning stages, and adding regularization to prevent deviation of parameters from the pre-trained region.
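The three techniques in 2b) can be sketched in a few lines of plain Python. All function names are hypothetical; the decay factor 2.6 is borrowed from ULMFiT as an example value, and the "one more layer per epoch" rule is just one possible unfreezing schedule.

```python
def discriminative_learning_rates(top_lr, num_layers, decay=2.6):
    """Per-layer learning rates; index 0 is the bottom layer, the last is the top."""
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]

def trainable_layers(epoch, num_layers):
    """Top-to-bottom unfreezing: epoch 0 trains the top layer only,
    and each following epoch unfreezes one more layer below it."""
    first_trainable = max(num_layers - 1 - epoch, 0)
    return list(range(first_trainable, num_layers))

def l2_to_pretrained(weights, pretrained, strength):
    """Penalty that grows as weights drift away from their pre-trained values."""
    return strength * sum((w - p) ** 2 for w, p in zip(weights, pretrained))

rates = discriminative_learning_rates(top_lr=0.01, num_layers=4)
unfreezing = [trainable_layers(epoch, num_layers=4) for epoch in range(5)]
penalty = l2_to_pretrained([1.0, 2.5], [1.0, 2.0], strength=0.1)
```

Lower layers get smaller learning rates and are unfrozen later, so the general knowledge they hold is changed last and least; the penalty term would simply be added to the task loss.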
Thirdly, let’s consider some ideas to get more supervision:
3a) fine-tuning of a model on a single adaptation task. For example, for a text classification task, extract a single fixed-length vector from the model (the last hidden state or a pooling of the hidden states). Project it to the classification space with an extra classifier, extending the top layer. Train with a classification objective.
3b) related datasets. Here we have:
- sequential adaptation: intermediate fine-tuning on related datasets and tasks;
- multi-task fine-tuning with related tasks: combine the loss functions of the tasks; then, at each optimization step, sample a task and a batch for training, and, in the end, fine-tune only on the target task;
- dataset slicing: use auxiliary heads that are trained only on particular subsets of the data, and automatically detect challenging subsets on which the model underperforms (see these features in Snorkel);
- semi-supervised learning: minimize the distance between predictions on original inputs and on their perturbed versions, so that the model's predictions become more consistent on unlabeled data.
3c) ensembling. This means ensembling independently fine-tuned models by combining their predictions. To get uncorrelated predictors in the ensemble, models can be trained on different tasks, dataset splits, parameter settings, and variants of pre-trained models. This direction also includes knowledge distillation (see details in Part III).
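The simplest way to combine predictions, as in 3c), is to average the class probabilities of the individual models. The sketch below assumes each model outputs a probability vector over the same classes; the three probability vectors are invented for the example.

```python
def ensemble_average(prob_lists):
    """Average the class-probability vectors predicted by several models."""
    num_models = len(prob_lists)
    num_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / num_models for c in range(num_classes)]

def predict(prob_lists):
    """Index of the class with the highest averaged probability."""
    avg = ensemble_average(prob_lists)
    return max(range(len(avg)), key=avg.__getitem__)

# Hypothetical outputs of three independently fine-tuned classifiers
# for one input and two classes.
model_probs = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
]
averaged = ensemble_average(model_probs)
label = predict(model_probs)
```

Here the first model disagrees with the other two, and the averaged distribution follows the majority. Weighted averaging or voting are common alternatives.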
A good illustration of these principles is the NAACL ’19 paper “An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models”, which introduces SiATL [code], a simple and efficient transfer learning method for text classification tasks. SiATL (see Figure 3) follows the standard approach of pre-training a language model and transferring its weights to a classifier with a task-specific layer.
To prevent the catastrophic forgetting of the language distribution, the model incorporates an auxiliary LM loss along with a classification loss. Moreover, the contribution of the auxiliary LM loss is controlled via exponential decay over training epochs, gradually shifting focus from the LM objective to the classification task. Similarly to ULMFiT (see details below), SiATL benefits from a sequential unfreezing of layers: first fine-tuning the additional parameters, then fine-tuning the pre-trained parameters without the embedding layer, and finally training all layers until convergence. The authors also chose the optimizers carefully: stochastic gradient descent (SGD) with a small learning rate is used for fine-tuning, while Adam is used for the randomly initialized LSTM and classification layers for faster training. While not beating the state of the art (SOTA), SiATL outperforms more sophisticated transfer learning approaches, particularly on small datasets.
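The decayed auxiliary loss can be written down in a few lines. This is a sketch of the idea only: the initial weight `gamma0` and the `decay_rate` are hypothetical values chosen for the example, not the settings used by the SiATL authors.

```python
import math

def lm_weight(epoch, gamma0=0.2, decay_rate=0.5):
    """Exponentially decayed weight of the auxiliary LM loss."""
    return gamma0 * math.exp(-decay_rate * epoch)

def combined_loss(task_loss, lm_loss, epoch):
    """Classification loss plus the down-weighted auxiliary LM loss."""
    return task_loss + lm_weight(epoch) * lm_loss

# The LM term matters most in the first epochs and then fades out.
weights = [lm_weight(epoch) for epoch in range(5)]
first_epoch_loss = combined_loss(task_loss=1.0, lm_loss=2.0, epoch=0)
```

Early in training the LM term keeps the shared layers close to what they learned during pre-training; as its weight shrinks, the optimization is driven almost entirely by the classification objective.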
Last but not least, we’ll mention a few sources of pre-trained models that can be used for transfer learning:
An important aspect of representation learning is the basic unit the model operates on. The authors of “One Size Does Not Fit All: Comparing NMT Representations of Different Granularities” consider four representation units: words, byte-pair encoding (BPE) units, morphological units (e.g. obtained with Morfessor), and characters. BPE splits words into symbols (a symbol is a sequence of characters, initially a single character) and then iteratively replaces the most frequent pair of adjacent symbols with a new merged symbol. BPE segmentation is very popular in neural machine translation (NMT). Morfessor splits words into morphological units, such as roots and suffixes.
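The BPE merge procedure described above is short enough to sketch in plain Python. The toy word frequencies and the `</w>` end-of-word marker are assumptions for the example (the marker is a common convention), and ties between equally frequent pairs are broken here by first occurrence, which real implementations may do differently.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged_vocab[key] = merged_vocab.get(key, 0) + freq
    return merged_vocab

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from word frequencies."""
    # Start from single characters; "</w>" marks the end of a word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
```

On this toy corpus the frequent suffix "est" is assembled step by step into a single symbol, which is how BPE ends up with units that often, but not always, resemble morphemes.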
The authors evaluate the quality of NMT-derived embeddings originating from units of different granularity when used for modeling morphology, syntax, and semantics (as opposed to end tasks such as sentiment analysis and question answering). Their approach extracts feature representations from the encoder of a trained LSTM-based NMT model and then trains a logistic regression classifier on them to make predictions for an auxiliary task.
The authors came to the following conclusions:
- the best-performing representation unit is target task-dependent;
- representations derived from subword units are better for modeling syntax (i.e., long-range dependencies);
- character-based representations are clearly better for modeling morphology;
- character-based representations are very robust to misspellings;
- using a combination of different representations often works best.
Misspelling-Tolerant Word Embeddings
Standard word2vec approaches often poorly represent malformed words and their corrected counterparts (we would usually like them to have similar embeddings), which is a severe flaw in real-world applications. While FastText incorporates character n-grams for learning word embeddings, by design, it tends to capture morphemes rather than misspellings. In “Misspelling Oblivious Word Embeddings”, Facebook AI researchers present MOE, a simple method to learn word embeddings that are resilient to misspellings.
They extend the FastText objective with a spelling correction objective Lₛ. Let M be a set of pairs (wₘ, wₑ) such that wₑ is the correct (expected) word and wₘ is its misspelling, let N be a set of negative samples, and let l(x) = log(1 + e⁻ˣ) be the logistic loss function. The scoring function ŝ is defined over input vectors of subwords as follows:
The first term raises the likelihood of wₑ given wₘ, pushing their representations closer. The spelling correction loss is optimized with standard SGD jointly with the regular FastText loss function, and the two are combined through a weighted sum. The experiments on the word similarity and word analogy tasks show that while FastText is indeed capable of capturing low edit distance misspellings, MOE is better at capturing more distant examples. The paper also comes with a nice bonus: a published dataset of more than 20M corrections collected from Facebook search query logs. The experiments were conducted on English datasets. Support for multiple languages is left for future work.
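To make the ingredients of the spelling correction objective tangible, here is a toy sketch built only from the definitions above: the logistic loss l(x), a score that sums dot products between the subword vectors of the misspelling and the vector of another word, and negative samples. The exact form of ŝ, the mixing weight `alpha`, and all vector values are our assumptions for illustration and should be checked against the paper.

```python
import math

def logistic_loss(x):
    """l(x) = log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))

def score(subword_vectors, word_vector):
    """Sum of dot products between subword vectors and a word vector (assumed form)."""
    return sum(sum(a * b for a, b in zip(v, word_vector)) for v in subword_vectors)

def spelling_loss(misspelling_subwords, expected_vec, negative_vecs):
    """Loss for one (misspelling, expected word) pair plus its negative samples."""
    positive = logistic_loss(score(misspelling_subwords, expected_vec))
    negative = sum(logistic_loss(-score(misspelling_subwords, n)) for n in negative_vecs)
    return positive + negative

def combined_objective(fasttext_loss, spell_loss, alpha):
    """Weighted sum of the two losses; alpha is a hypothetical mixing weight."""
    return (1.0 - alpha) * fasttext_loss + alpha * spell_loss

# Made-up 2-d vectors: subwords of a misspelling, a well-aligned expected word,
# a badly aligned one, and a single negative sample.
subwords = [[1.0, 0.0], [0.5, 0.5]]
aligned, opposite = [1.0, 0.5], [-1.0, -0.5]
negatives = [[-1.0, 0.0]]

loss_aligned = spelling_loss(subwords, aligned, negatives)
loss_opposite = spelling_loss(subwords, opposite, negatives)
```

The loss is small when the expected word scores high against the misspelling's subwords and the negative samples score low, which is what pulls a misspelling toward its correction during training.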
Contextual Word Embeddings
By dynamically linking words to their various contexts, contextual word embeddings provide a richer semantic and syntactic representation than traditional context-independent word embeddings. There are two effective approaches to building and re-using contextual word embeddings: feature-based (e.g., ELMo) and fine-tuning (ULMFiT, OpenAI’s GPT, and Google AI’s BERT; the latter also has a feature-based mode but is more effective when fine-tuned).
ELMo [code] pre-trains a bidirectional LSTM-based character-level language model and extracts contextual word vectors as learned combinations of hidden states (see Figure 4). For downstream tasks, these word embeddings are used as inputs without any changes (so they serve as features). Upon publication in 2018, ELMo showed state-of-the-art (SOTA) results on 6 diverse NLP tasks.
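The “learned combination of hidden states” is a softmax-weighted sum over layers, multiplied by a scalar. A minimal sketch for a single token follows; in ELMo the layer scores and the scale are learned together with the downstream model, whereas here they are fixed, made-up values and the function names are ours.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_states, layer_scores, gamma=1.0):
    """Scaled, softmax-weighted sum of the per-layer hidden states of one token."""
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]

# Toy hidden states of one token from three layers (values are made up).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give a plain average of the layers.
averaged = combine_layers(states, layer_scores=[0.0, 0.0, 0.0])
# A dominant score makes the result follow a single layer.
mostly_first = combine_layers(states, layer_scores=[100.0, 0.0, 0.0])
```

Because the weights are task-specific, a syntactic task can lean on lower layers while a semantic task can lean on higher ones, even though the pre-trained LM itself stays unchanged.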
Next, we briefly describe the best pre-training approaches that rely on fine-tuning. Unlike feature-based approaches, fine-tuning makes it possible to fit the language model to a domain-specific corpus and even to downstream tasks, while retaining the general knowledge that comes from the initial large corpus. For example, ULMFiT [code, tutorial] pre-trains Salesforce’s AWD-LSTM word-level language model (see Figure 5) and fine-tunes the trained language model in two stages with different adaptation techniques (gradual unfreezing of layers and discriminative fine-tuning with slanted triangular learning rates). ULMFiT achieved SOTA results on 6 classification datasets.
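The slanted triangular learning rate is a short linear warm-up followed by a long linear decay. The sketch below follows the schedule as we understand it from the ULMFiT paper; the default values (`lr_max=0.01`, `cut_frac=0.1`, `ratio=32`) are the ones we recall being suggested there, and the guard against a zero-length warm-up is our own addition.

```python
import math

def slanted_triangular_lr(step, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at a given step: linear increase, then slower linear decay."""
    # Number of warm-up steps; max(1, ...) avoids division by zero on tiny runs.
    cut = max(1, math.floor(total_steps * cut_frac))
    if step < cut:
        p = step / cut                                       # rising part
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))    # falling part
    # The rate moves between lr_max / ratio and lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio

lr_schedule = [slanted_triangular_lr(t, total_steps=1000) for t in range(1000)]
peak_step = lr_schedule.index(max(lr_schedule))
```

With 1000 steps the rate peaks after the first 10% of training and then decays for the remaining 90%, so the model quickly moves to a suitable region of the parameter space and then refines slowly.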
Generative Pretrained Transformer (GPT) [code] pre-trains a large 12-layer left-to-right Transformer (see Figure 6) and fine-tunes it for tasks over single sentences, sentence pairs, and multiple-choice questions. The flexibility for downstream tasks is achieved by linearizing the structure of the task into a sequence of tokens. GPT reached SOTA on 9 different NLP tasks, and its larger successor GPT-2 gained wide publicity with its shockingly good text generation.
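Linearizing a task simply means packing its structured input into one token sequence with a few special tokens. A sketch under our own naming follows: the token strings `<s>`, `$`, and `<e>` are placeholders, and the function names are hypothetical.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"   # placeholder special tokens

def linearize_classification(text_tokens):
    """Single text: start token, text, extract token."""
    return [START] + text_tokens + [EXTRACT]

def linearize_pair(first_tokens, second_tokens):
    """Sentence pair (e.g. premise and hypothesis) joined by a delimiter."""
    return [START] + first_tokens + [DELIM] + second_tokens + [EXTRACT]

def linearize_multiple_choice(context_tokens, answers):
    """One sequence per candidate answer; each would be scored independently."""
    return [linearize_pair(context_tokens, answer) for answer in answers]

pair = linearize_pair(["it", "rains"], ["the", "street", "is", "wet"])
choices = linearize_multiple_choice(["2", "+", "2", "="], [["4"], ["5"]])
```

The representation at the final (extract) position is what a small task-specific layer would read, so the same pre-trained Transformer can serve all three task types without architectural changes.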
The NAACL ’19 best paper, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” [code], describes the following approach as a further development of GPT (see Figure 7): pre-train sentence and contextual word representations (using subword units) with a masked language model and next sentence prediction. Masking makes it possible to include both left and right context during word prediction. The BERT-large model has an impressive 340M parameters and 24 layers. BERT is currently the most useful pre-training mechanism (however, the recent XLNet [code], which claims to outperform BERT on 20 NLP tasks, is worth checking out).
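The masked language model objective is easy to picture as a data-preparation step. The sketch below uses the proportions we recall from the BERT paper (15% of tokens selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged); the function name, the whitespace tokenization, and the tiny vocabulary are assumptions for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None at positions that are not predicted."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                  # the model must recover the original
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # usually: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # sometimes: a random token
            else:
                inputs.append(token)              # sometimes: keep the token as is
        else:
            labels.append(None)                   # position is not scored
            inputs.append(token)
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
```

Since the model sees the whole corrupted sentence at once, the prediction for each selected position can use words on both sides, which is the key difference from a left-to-right LM such as GPT.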
Cross-Lingual Word Embeddings
Multilingual embeddings have been demonstrated to be a promising means for enabling cross-lingual transfer in many NLP tasks. There are two main, largely orthogonal approaches to achieving this goal. The first is cross-lingual polyglot pre-training: sharing vocabulary and representations across languages by training one model on many languages. While it’s easy to implement, it often leads to under-representation of low-resource languages. Notable examples of this approach include Facebook Research’s LASER and multilingual BERT from Google AI.
The second approach is to train word embeddings independently for each language of interest and then align those monolingual word embeddings. For example, Facebook Research’s MUSE implements this approach for FastText token-level embeddings. The paper “Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing” [code] presents a method to align contextual word embeddings based on ELMo. Interestingly, the point clouds of different words (a point cloud being the set of contextual embeddings a word receives across its contexts) are well separated in practice, so the authors introduce an embedding anchor eᵢ as the centroid of the point cloud for a word i. Another interesting property of the space structure is the multi-modality of homonym point clouds (see Figure 8).
That is to say, when a word i has multiple distinct senses, one might expect the embeddings for i to reflect this by separating into multiple distinct clouds, one for each meaning. The proposed approach adapts two well-known context-independent alignment methods (Mikolov et al. (2013) and MUSE), replacing word vectors with embedding anchors:
- supervised: a word dictionary between the source and target languages is given. The problem is then reduced to the orthogonal Procrustes problem of finding an optimal orthogonal transformation W^s→t (geometrically, a rotation, possibly combined with a reflection) between the source and target embedding matrices, which has a closed-form solution W^s→t = UVᵀ, where the columns of U and V are the left and right singular vectors of the product of the source and (transposed) target embedding matrices;
- unsupervised: the dictionary is built automatically by the adversarial framework implemented in MUSE, and then the first method is applied.
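Both ingredients, anchors and the Procrustes solution, fit into a short NumPy sketch. It is an illustration under our own conventions rather than the authors' code: anchors are stored as rows, the map is applied as `source @ W`, and the SVD is therefore taken of `source.T @ target`. The toy data (random anchors and a random rotation standing in for a dictionary) is invented.

```python
import numpy as np

def anchor(contextual_vectors):
    """Embedding anchor: centroid of a word's contextual embeddings (one per row)."""
    return np.mean(contextual_vectors, axis=0)

def procrustes(source, target):
    """Orthogonal W minimizing ||source @ W - target||_F (rows are paired anchors)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)

# Anchor of a word seen in three contexts (made-up 2-d contextual embeddings).
cloud = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
centroid = anchor(cloud)

# Toy dictionary: 50 source anchors and their "translations", which are
# an exact rotation of the source space in this synthetic setup.
source_anchors = rng.normal(size=(50, 4))
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
target_anchors = source_anchors @ true_rotation

w = procrustes(source_anchors, target_anchors)
aligned = source_anchors @ w
```

In this noise-free toy case the recovered map reproduces the hidden rotation exactly; with real dictionaries the fit is only approximate, and the same `w` is then applied to every contextual embedding of the source language.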
The authors have shown that these aligned embeddings yield good word translation (including for low-resource languages, such as Kazakh) and improve significantly upon state-of-the-art zero-shot and few-shot cross-lingual dependency parsing models.
This concludes Part I. In Part II, we’ll present some recent advances in important end tasks, such as text similarity, text classification, language generation, and sequence labeling. Follow us on Medium to stay in touch. Feel free to share your experience or ask questions in the comment section.