Latest Trends in Data Scarce Machine Learning

RISHABH TRIPATHI
The Observe.AI Tech Blog
7 min read · Feb 3, 2023

EMNLP 2022 spotlighted plenty of innovative work on machine learning in data-scarce scenarios. The papers presented in this area covered topics such as few-shot learning, zero-shot learning, and prompt-based learning.

Several works followed in the footsteps of a prior approach to prompt-based learning, LMBFF, which uses demonstrations to help PLMs better understand the task. A few of them include:

→ Contrastive Demonstration Tuning for Pre-trained Language Models

  1. Liang et al. presented their work titled “Contrastive Demonstration Tuning for Pre-trained Language Models” and reported performance improvements over LMBFF.
  2. Unlike LMBFF, which filters demonstrations on the basis of semantic similarity, this work proposes concatenating virtual demonstrations in the input space.
  3. A virtual demonstration is simply a set of learnable continuous embeddings trained by optimizing a contrastive loss as part of prompt tuning (see the sketch after this list).
  4. The proposed approach shows improvement over the prior related works and is pluggable with any popular prompt-based learning technique (LMBFF, PET).
  5. The work aims to overcome the limitations of semantic-similarity-based filtering (as used by LMBFF), arguing that such filtering can be misleading because semantic similarity does not guarantee that the most informative demonstrations are prioritized.
  6. Moreover, including text-based demonstrations can push the input length beyond the model's maximum effective context length, which results in an error if the [MASK] token in the test sample's prompt falls beyond that limit.
  7. The proposed contrastive demonstration tuning, a simple model-agnostic approach for pre-trained language models, improves state-of-the-art prompt-tuning performance without requiring demonstration selection.
  8. However, since the approach builds on a pre-trained language model, it still consumes GPU resources. Besides, in few-shot settings the performance gains remain limited because the virtual demonstrations are learned from only a few training instances; retrieving relevant context from the internet to serve as “demonstrations” is worth studying as a route to more efficient NLP.
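
To make the idea concrete, here is a minimal PyTorch sketch of how virtual demonstrations could be wired in: learnable embeddings are appended to the input embeddings and trained with a supervised contrastive loss over the [MASK] representations. The shapes, module names, and loss wiring are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VirtualDemonstrations(nn.Module):
    """Learnable continuous embeddings that stand in for text demonstrations."""

    def __init__(self, num_demo_tokens: int, hidden_size: int):
        super().__init__()
        # One learnable vector per virtual demonstration token.
        self.demo = nn.Parameter(torch.randn(num_demo_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden) from the PLM's embedding layer.
        demo = self.demo.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        # Concatenate the virtual demonstrations after the prompted input.
        return torch.cat([input_embeds, demo], dim=1)

def supervised_contrastive_loss(mask_repr: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    # mask_repr: (batch, hidden) hidden states at the [MASK] position.
    z = F.normalize(mask_repr, dim=-1)
    sim = z @ z.t() / temperature                       # pairwise similarities
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Row-wise log-softmax (excluding self-similarity), averaged over positives.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")),
                                     dim=1, keepdim=True)
    return -(log_prob * pos.float()).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```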

→ IDIAPers @ Causal News Corpus 2022: Efficient Causal Relation Identification Through a Prompt-based Few-shot Approach

  1. This work was presented by Burdisso et al. titled “IDIAPers @ Causal News Corpus 2022: Efficient Causal Relation Identification Through a Prompt-based Few-shot Approach”.
  2. The proposed work is inspired by the LMBFF methodology and performs few-shot, demonstration-based classification to identify whether an event is causal or random.
  3. The key difference in this work is that the network is trained with the MLM objective without introducing any additional parameters (a minimal sketch follows this list).
  4. The proposed approach uses a few-shot configuration in which a prompt-based model is fine-tuned using only 256 instances per class.
  5. Moreover, comparisons against traditional fine-tuning techniques, ensemble approaches, and the other participating models show the potential of the proposed approach to generalize better on the posed task.
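
As a rough illustration of this style of prompt-based MLM classification, the sketch below scores a sentence by reading the MLM logits at the [MASK] position through a verbalizer; the template, verbalizer words, and base checkpoint are assumptions for the sketch, not necessarily those used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "roberta-base"  # assumption; the paper's PLM may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Verbalizer: class -> a word in the PLM vocabulary; only the first sub-token
# id of each word is kept as that class's score proxy.
verbalizer = {"causal": " causal", "random": " random"}
label_ids = [tokenizer(word, add_special_tokens=False)["input_ids"][0]
             for word in verbalizer.values()]

def class_logits(sentence: str) -> torch.Tensor:
    # Illustrative template: "<sentence> This event is <mask>."
    prompt = f"{sentence} This event is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    logits = model(**inputs).logits[0, mask_pos]        # (vocab_size,)
    return logits[label_ids]                            # one score per class

# Training applies plain cross-entropy on these logits over the 256 labelled
# instances per class, updating only the PLM's existing weights (no new heads).
loss = torch.nn.functional.cross_entropy(
    class_logits("The storm caused massive flooding.").unsqueeze(0),
    torch.tensor([0]),  # index 0 -> "causal"
)
```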

Some of the research presented at the event focused on domain adaptation with low-resource (data-scarce) machine learning. A few such works include:

→ Prompt Learning for Domain Adaptation in Task-Oriented Dialogue

  1. Sreedhar et al. presented a paper titled “Prompt Learning for Domain Adaptation in Task-Oriented Dialogue”, which casts intent classification as a generative task and rewrites intent labels in a more descriptive format (canonical forms).
  2. When using such canonical forms, generative approaches with large language models (LLMs) show promising results compared with traditional intent classification methods.
  3. The work demonstrates that the generative models generalize very well to unseen domains in prompt-based zero-shot and few-shot settings when compared with BERT-style approaches.
  4. The assumption is that shorter intent labels make it hard to generalize to newer domains, while longer intent labels tend to induce hallucination.
  5. In this work, p-tuning involves training a 2-layer LSTM alongside a frozen LLM to produce a soft prompt vector, which is concatenated with the input embeddings (see the sketch after this list). For evaluating the generated canonical forms, the ground-truth canonical forms are assumed to be given so that a nearest-neighbor search can be performed with fastText embeddings and sentence transformers.
  6. The work considers various sizes of Megatron and BERT as the PLMs. For in-domain classification, p-tuning on the training set followed by evaluation works fine, but for newer domains training was performed in two ways:
    - Zero-shot (p-tune on the source domain and run zero-shot inference on the target domain).
    - Few-shot (p-tune on the source domain, continue p-tuning on a few shots from the target domain, and then evaluate on the target domain).
  7. The work concludes that generative approaches using p-tuning work well for intent classification and are particularly promising for few-shot domain adaptation.
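
A minimal sketch of the p-tuning setup described above, assuming a 2-layer bidirectional LSTM prompt encoder whose output is prepended to the frozen LLM's input embeddings; module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class LSTMPromptEncoder(nn.Module):
    """2-layer bidirectional LSTM that maps learnable seeds to soft prompts."""

    def __init__(self, prompt_len: int, hidden_size: int):
        super().__init__()
        # Assumes an even hidden_size so the two LSTM directions sum to it.
        self.seed = nn.Parameter(torch.randn(prompt_len, hidden_size) * 0.02)
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self, batch_size: int) -> torch.Tensor:
        out, _ = self.lstm(self.seed.unsqueeze(0))   # (1, prompt_len, hidden)
        return self.mlp(out).expand(batch_size, -1, -1)

def prepend_soft_prompt(frozen_lm_embeds: torch.Tensor,
                        encoder: LSTMPromptEncoder) -> torch.Tensor:
    # frozen_lm_embeds: (batch, seq_len, hidden) from the frozen LLM's embedding
    # layer; only `encoder` receives gradient updates during p-tuning.
    prompts = encoder(frozen_lm_embeds.size(0))
    return torch.cat([prompts, frozen_lm_embeds], dim=1)
```

At inference, the frozen LLM would generate a canonical form conditioned on these soft prompts, and the prediction would then be matched to the closest ground-truth canonical form via embedding-based nearest-neighbor search, as described above.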

→ Improving the Sample Efficiency of Prompt Tuning with Domain Adaptation

  1. Guo et al. presented their work titled “Improving the Sample Efficiency of Prompt Tuning with Domain Adaptation”, a.k.a. OPTIMA, the first approach of its kind that boosts soft prompts for domain adaptation without requiring labeled target-domain data.
  2. It uses unlabeled target-domain data to address the domain shift.
  3. The underlying assumption is that a smooth decision boundary, one that is robust against adversarial perturbations, transfers well when the source and target domain distributions are similar.
  4. It is worth noting from prior work that prompt tuning requires large amounts of labeled data to learn an informative prompt vector and generally underperforms full-model tuning in data-scarce scenarios.
  5. To handle partial overlap between the data distributions, this work proposes a targeted regularization technique that encourages smooth decision boundaries only in the areas where the two domains are similar.
  6. The procedure follows a prompt-tuning strategy for text classification: first, the in-domain generalization of soft prompts is enhanced by augmenting the input with virtual perturbations; next, the perturbations are optimized to reduce the domain gap and obtain soft prompts with domain-invariant knowledge; finally, the soft prompts are used to boost few-shot learning in the target domain (see the sketch after this list).
  7. Moreover, prompt-tuning-based approaches are more sensitive to random seeds than full-model tuning, whereas OPTIMA is robust to the choice of random seed.
  8. The proposed work has some limitations as well:
    — The demonstrated regularization technique addresses the situation where the source and target domains have different data distributions.
    — When the two distributions are exactly the same, the technique degenerates to plain adversarial training.
    — When the two distributions are extremely dissimilar, the transfer is unlikely to yield performance improvements. A unified framework that automatically detects domain distances and applies the correct method may be desirable.
    — Furthermore, the perturbations have the most effect in few-shot / zero-shot settings; when the target domain has abundant labeled data, the gap between plain soft prompt tuning and OPTIMA will likely diminish.
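
For intuition, here is a hedged sketch in the spirit of the idea above: a virtual-adversarial-style consistency term computed on unlabeled target-domain inputs is added to the supervised prompt-tuning loss on the source domain, which smooths the decision boundary where the domains overlap. The perturbation procedure and weighting are simplified assumptions, not OPTIMA's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, target_embeds: torch.Tensor,
                     epsilon: float = 1e-2) -> torch.Tensor:
    # target_embeds: (batch, seq_len, hidden) input embeddings of unlabeled
    # target-domain text; `model` maps embeddings to class logits through the
    # soft prompt plus the (frozen) PLM. Both are assumptions of this sketch.
    with torch.no_grad():
        clean_probs = F.softmax(model(target_embeds), dim=-1)

    # One power-iteration-style step to estimate an adversarial direction.
    delta = torch.randn_like(target_embeds, requires_grad=True)
    adv_logp = F.log_softmax(model(target_embeds + 1e-3 * delta), dim=-1)
    kl = F.kl_div(adv_logp, clean_probs, reduction="batchmean")
    grad, = torch.autograd.grad(kl, delta)
    adv_dir = F.normalize(grad.flatten(1), dim=-1).view_as(grad)

    # Penalize prediction change along that direction (boundary smoothing).
    pert_logp = F.log_softmax(model(target_embeds + epsilon * adv_dir), dim=-1)
    return F.kl_div(pert_logp, clean_probs, reduction="batchmean")

# Per training step (alpha is an illustrative weighting, not the paper's):
# loss = F.cross_entropy(model(source_embeds), source_labels) \
#        + alpha * consistency_loss(model, target_embeds)
```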

Many papers at the event tackled NLG (Natural Language Generation) tasks using prompt-based methods. One of these works is discussed below:

→ Most NLG is Low-Resource: here’s what we can do about it

  1. Howcroft et al. presented their work titled “Most NLG is Low-Resource: here’s what we can do about it”, a study of NLG systems in low-resource settings that calls out the fact that most NLG systems are low-resource.
  2. Low resource is considered at two levels: language (languages for which data is scarce) and domain (domains for which data points are scarce).
  3. The paper calls out promising directions for NLG in low-resource settings, such as data augmentation via word substitution, paraphrasing with large language models (LLMs), or back-translation (see the sketch after this list).
  4. Prompt-based learning is another alternative that helps PLMs understand tasks in data-scarce scenarios by passing instructions to the language model; for instance, GPT-3’s in-context learning can be used to create synthetic data.
  5. If an NLG objective is composed of multiple complementary auxiliary tasks, then learning them jointly in a multi-task learning scheme can be helpful.
  6. However, the relative novelty of the topic makes it difficult to determine appropriate selection criteria for a systematic survey.
    — Usually a systematic survey would include papers based on keyword searches in academic databases, but almost no papers explicitly focus on natural language generation in low-resource settings, making it difficult to identify phrases which reliably indicate all and only the relevant works.
    — This limits the coverage of the paper, though the authors believe this limitation is a reasonable trade-off when highlighting an area requiring more attention in future work.
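
As a small illustration of one augmentation route mentioned above, the sketch below performs back-translation with off-the-shelf MarianMT checkpoints from the Hugging Face hub; the pivot language and checkpoints are one convenient choice, not something the paper prescribes.

```python
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences,
                   fwd_name="Helsinki-NLP/opus-mt-en-de",
                   bwd_name="Helsinki-NLP/opus-mt-de-en"):
    """Paraphrase English sentences by translating to German and back."""
    fwd_tok = MarianTokenizer.from_pretrained(fwd_name)
    fwd_model = MarianMTModel.from_pretrained(fwd_name)
    bwd_tok = MarianTokenizer.from_pretrained(bwd_name)
    bwd_model = MarianMTModel.from_pretrained(bwd_name)

    def translate(texts, tok, mt):
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out = mt.generate(**batch, max_length=256)
        return tok.batch_decode(out, skip_special_tokens=True)

    pivot = translate(sentences, fwd_tok, fwd_model)   # en -> de
    return translate(pivot, bwd_tok, bwd_model)        # de -> en paraphrases

augmented = back_translate(["The meeting was rescheduled because of the strike."])
```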

There were plenty of papers on prompt tuning aimed at improving performance, but only a few of the presented works addressed the efficiency of machine learning models. We discuss one such work, which proposes a training strategy that improves efficiency severalfold.

→ FPT: Improving Prompt Tuning Efficiency via Progressive Training

  1. Huang et al. presented an interesting work titled “FPT: Improving Prompt Tuning Efficiency via Progressive Training”, which demonstrates how progressive training can improve the efficiency of prompt tuning.
  2. The motivation is that fine-tuning all parameters of a PLM for every downstream task is expensive. Prompt tuning, which only learns a few virtual tokens, is an alternative; however, it converges slowly and is training-inefficient.
  3. This work proposes some promising directions to overcome the aforementioned challenges:
    — Layer dropping: Adjacent layers have similar information and can be discarded.
    — FFN Reduction: Only part of the neurons in the network is activated.
    — Compound Reduction: A combination of layer dropping and FFN Reduction.
  4. The proposed strategy starts training with a partial PLM obtained via one of the above reductions, splitting training into N stages. As training proceeds, the width and depth of the PLM are progressively restored, until the PLM reaches its original size and depth (see the sketch after this list).
  5. Moreover, during progressive training the soft prompts are also trained progressively at each stage. All three proposed directions show comparable performance while using less computation and training time, with compound reduction being the most training-efficient.
  6. Furthermore, T5 (X-large) showed a larger efficiency improvement than T5 (large), which suggests that the proposed mechanism works better with LLMs (large language models).
  7. As limitations, the paper notes that FPT requires choosing a suitable hyper-parameter for the progressive training schedule (i.e., the duration of each training stage) and cannot be directly applied to other delta-tuning methods (e.g., adapters and prefix tuning).
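
A simplified sketch of the progressive schedule described above: training starts with a shallow subset of the PLM's layers and the retained depth grows stage by stage until the full model is restored, with the soft prompt carried across stages. The even-spacing heuristic, stage count, and helper names are assumptions for illustration, not FPT's exact recipe.

```python
import torch.nn as nn

def partial_layers(full_layers: nn.ModuleList, keep: int) -> nn.ModuleList:
    # Keep `keep` layers spaced evenly across the full depth; the intuition is
    # that adjacent layers carry similar information, so gaps are tolerable.
    total = len(full_layers)
    idx = sorted({round(i * (total - 1) / max(keep - 1, 1)) for i in range(keep)})
    return nn.ModuleList([full_layers[i] for i in idx])  # shares the weights

def progressive_prompt_tuning(full_layers, soft_prompt, train_stage, num_stages=3):
    # `train_stage(layers, soft_prompt)` is a user-supplied prompt-tuning loop
    # (an assumption of this sketch); the PLM layers themselves stay frozen.
    total = len(full_layers)
    for stage in range(1, num_stages + 1):
        keep = max(1, round(total * stage / num_stages))   # grow depth per stage
        train_stage(partial_layers(full_layers, keep), soft_prompt)
    return soft_prompt
```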
