This blog post gives a glimpse of recently published research by Google on neural data augmentation using the Text-to-Text Transfer Transformer (T5) model. Data augmentation is useful in every area, whether NLP, computer vision, or any other field: when data is available, or can be generated, in large amounts, deep learning models learn patterns better and generalize well to unseen data. Data augmentation is also a popular remedy for biased or imbalanced data, approached either by duplicating examples or by using heuristics to synthesize new ones.
Recently, Google published a paper titled “Neural Data Augmentation via Example Extrapolation” (https://arxiv.org/pdf/2102.01335.pdf), which shows how synthetic examples can be generated for under-represented slices of the data. The approach outperforms other techniques, such as up-sampling the under-represented slices, on relation extraction (FewRel) and intent classification/slot-filling tasks (CLINC150 and SNIPS), improving over the state of the art. The authors call their model Example Extrapolation (Ex2). Cool! Let’s first look at the images below before diving into more theory.
The figure above gives a complete picture of how this approach trains a seq2seq model to generate data for under-represented slices. The first step is to find the under-represented slices in the data, which raises a prior question: what is a data slice? According to the paper, a slice could be the set of all examples sharing a given label, all examples in a particular language, or all examples with a particular syntactic construction.
Their model, Example Extrapolation (Ex2), makes no assumption about how the data is sliced; it is entirely up to the practitioner to slice the data in a way that exposes important but under-represented slices, which Ex2 can then target for data augmentation. The end-to-end flow goes as follows:
- Divide Dataset into slices.
- Train an example extrapolator using data from those slices.
- Use the example extrapolator to generate new synthetic data for under-represented slices of the dataset.
- Train a model on the union of the synthetic data and real data.
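The four steps above can be sketched as a toy pipeline. This is a minimal illustration with hypothetical helper names, not the paper's code; in particular, `extrapolate` here is a stand-in stub where the paper fine-tunes a T5 model.

```python
from collections import defaultdict

def slice_by_label(dataset):
    """Step 1: divide the dataset into slices (here, one slice per label)."""
    slices = defaultdict(list)
    for text, label in dataset:
        slices[label].append((text, label))
    return dict(slices)

def extrapolate(exemplars, n_new):
    """Step 3 stand-in: a real Ex2 model would generate novel examples
    conditioned on the exemplars; here we just echo marked copies."""
    return [(f"<synthetic> {text}", label)
            for text, label in (exemplars * n_new)[:n_new]]

dataset = [
    ("play some jazz", "play_music"),
    ("put on a song", "play_music"),
    ("queue up rock music", "play_music"),
    ("what's the weather", "weather"),   # under-represented slice
]

slices = slice_by_label(dataset)
# Identify few-shot slices (the size threshold here is illustrative).
few_shot = {s for s, ex in slices.items() if len(ex) < 2}
# Step 3: generate synthetic examples only for the few-shot slices.
synthetic = [e for s in few_shot for e in extrapolate(slices[s], n_new=2)]
# Step 4: train the downstream model on the union of real + synthetic data.
augmented = dataset + synthetic
print(len(augmented))  # 6
```

The key point is that only the few-shot slices receive synthetic examples; the many-shot slices pass through unchanged.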
Since under-represented slices have only a few examples each, the paper refers to them as few-shot slices (denoted F); these are the slices that receive data augmentation. The remaining slices are called many-shot slices (denoted M); they have enough data and receive no augmentation. The example extrapolator (Ex2) is trained on M only and is then used to infer new examples in F, despite never having seen any examples from F during training. It’s a very cool idea :). Note that “few-shot” here means that some slices of the data within the task have very few examples; the other notion of few-shot, where the entire task has few examples overall, is outside the scope of this paper.
Let’s say a practitioner defines a list of S slicing functions, where each function slice_s(e) is a Boolean function indicating whether example e belongs to slice s (potentially overlapping with other slices). Overall it looks like this:
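As a concrete (hypothetical) illustration, each slicing function is just a Boolean predicate over an example, and a single example may satisfy several of them:

```python
# Hypothetical slicing functions: each slice_s is a Boolean predicate
# over an example, and slices are allowed to overlap.

def slice_spanish(example):
    return example["language"] == "es"

def slice_label_weather(example):
    return example["label"] == "weather"

slicing_fns = [slice_spanish, slice_label_weather]

example = {"text": "que tiempo hace", "language": "es", "label": "weather"}
memberships = [fn(example) for fn in slicing_fns]
print(memberships)  # [True, True] -- one example can fall in several slices
```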
The training procedure of the T5 model over the many-shot slices M can be understood from the snippet below:
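In symbols (a reconstruction from the description that follows, using the notation s ∈ M for many-shot slices, e* for the target example, and e_1:K for the sampled exemplars), the training objective is roughly:

```latex
\max_{\theta} \; \sum_{s \in M} \; \sum_{e^{*} \in \mathcal{D}_s}
  \mathbb{E}_{\,e_{1:K} \sim \mathcal{D}_s \setminus \{e^{*}\}}
  \left[ \log p_{\theta}\!\left(e^{*} \mid e_{1:K}\right) \right]
```

That is, the extrapolator is trained to reconstruct a held-out example of a slice given K other examples from the same slice.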
To optimize this objective, the training procedure iterates over all training slices (s ∈ M) and every example e* in each slice. For each example, it samples K other examples e_1:K from the same slice, excluding e* itself, and then optimizes the log-likelihood of e* as output given e_1:K as input.
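This pair-construction step can be sketched as follows. The function name and the `" | "` separator are illustrative assumptions, not the paper's actual input format:

```python
import random

# Minimal sketch of assembling Ex2 training pairs: for each target
# example e*, sample K other examples from the same slice as the
# seq2seq input, with e* itself as the output.
def make_training_pairs(slice_examples, K, rng):
    pairs = []
    for i, target in enumerate(slice_examples):
        others = slice_examples[:i] + slice_examples[i + 1:]  # exclude e* itself
        exemplars = rng.sample(others, K)
        src = " | ".join(exemplars)    # input: K concatenated exemplars
        pairs.append((src, target))    # output: the held-out example e*
    return pairs

slice_s = ["book a flight", "reserve a plane ticket", "fly me to Boston"]
pairs = make_training_pairs(slice_s, K=2, rng=random.Random(0))
for src, tgt in pairs:
    assert tgt not in src.split(" | ")  # e* never appears among its own inputs
print(len(pairs))  # 3
```

Each pair then becomes one (input, output) training instance for the seq2seq model.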
The authors used the T5 model, a Text-to-Text Transfer Transformer pre-trained on a large text corpus. This pre-training gives the network a large amount of world knowledge, which is crucial for its ability to extrapolate beyond the given examples. You can read more about T5 in this blog and in the official paper.
Evaluation over standard benchmark datasets
The authors evaluated their strategy on standard tasks such as classification, slot filling, and relation extraction, using the corresponding standard benchmark datasets. They compared their model against baseline techniques and another data augmentation strategy, up-sampling (samples from few-shot slices are up-sampled to match the median frequency of the many-shot slices, removing the imbalance and improving training). Some of the results are as follows:
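For contrast, the up-sampling baseline can be sketched like this. The function name and the few-shot threshold are illustrative assumptions; only the "duplicate up to the median many-shot size" idea comes from the description above:

```python
from collections import Counter
import statistics
import random

# Sketch of the up-sampling baseline: duplicate examples from few-shot
# slices until each reaches the median size of the many-shot slices.
def upsample(slices, few_shot, rng):
    target = int(statistics.median(
        len(ex) for s, ex in slices.items() if s not in few_shot))
    out = []
    for s, examples in slices.items():
        out.extend(examples)
        if s in few_shot:
            deficit = target - len(examples)
            out.extend(rng.choices(examples, k=max(deficit, 0)))
    return out

slices = {"a": list("xxxx"), "b": list("yyyyyy"), "c": list("z")}
balanced = upsample(slices, few_shot={"c"}, rng=random.Random(0))
print(Counter(balanced)["z"])  # 5 -> slice "c" now matches the median size
```

Unlike Ex2, this only repeats existing examples; it cannot add any new information about the few-shot slices.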
Examples generated using Ex2
The authors show some examples generated by Ex2 after training on the CLINC150 dataset. In the paper, the model trained to generate examples for few-shot slices is called the Teacher model, and the downstream model trained on both the actual data and the augmented data is called the Student model. This is an analogy to distillation, where an already trained, larger teacher model is used to train a smaller student model, or where a “teacher” labels a large number of unlabeled inputs (x’s) to be consumed by a “student”. Ex2 is similar, except that the teacher does not label pre-existing x’s; it synthesizes completely new (x, y) pairs.
The paper “Neural Data Augmentation via Example Extrapolation” proposes a novel way to do data augmentation using neural models that are pre-trained on a very large corpus and carry world knowledge. These models generalize well to new, unseen examples and generate new examples that follow a distribution similar to that of the few-shot slices of the data. The technique shows improvements over baselines and standard data augmentation methods such as up-sampling.
- https://arxiv.org/pdf/2102.01335.pdf (Neural Data Augmentation via Example Extrapolation)
- https://arxiv.org/pdf/1910.10683.pdf (T5 paper)