Highlights of ACL 2022

Bokai Yu
Published in Criteo Tech Blog
Jun 6, 2022 · 5 min read

From 22nd May to 27th May, I went to the lovely city of Dublin to attend the 60th Annual Meeting of the Association for Computational Linguistics (ACL).

Overview of ACL 2022

The Convention Centre Dublin, the Conference Venue of ACL 2022

The ACL conference is a venue for exchanging research ideas and progress in the field of natural language processing (NLP).

ACL 2022 was a hybrid conference and the first in-person ACL in three years. There were about 3,200 registrations, and roughly half of the attendees came in person.

The conference program consisted of 3 parts:

  • 1 day of tutorials
  • 3 days of main conference
  • 2 days of workshops

The main conference and workshops were composed of keynote presentations and several poster sessions.

Highlights

There was too much good work presented at this conference for me to summarize it all. Due to limited time, I didn’t attend every keynote presentation, but I spent a lot of time discussing with authors at the poster sessions. I mainly focused on the tracks on Machine Learning, Multilinguality, Information Extraction, Zero-shot/Few-shot learning and other topics that caught my eye. Here are my highlights!

Tutorials

I attended two tutorials, one in the morning and the other in the afternoon.

The tutorial Learning with Limited Text Data by Diyi Yang, Ankur P. Parikh and Colin Raffel provided an overview of up-to-date approaches to data augmentation and semi-supervised learning for NLP. Two pieces of work stood out:

  • Chen et al., 2020 propose a method to generate a large number of augmented training samples by interpolating text representations in hidden space (a minimal sketch follows right after this list).
Screenshot from Chen et al., 2020
  • Du et al., 2022 introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a large pool of unlabeled data.
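
To make the hidden-space interpolation idea concrete, here is a minimal sketch in the spirit of Chen et al., 2020: two inputs are encoded up to an intermediate layer, their hidden states are mixed with a coefficient sampled from a Beta distribution, and the mixed representation continues through the rest of the model. The tiny model, the layer split and the mixing schedule are illustrative assumptions, not the authors’ exact implementation.

```python
# Minimal sketch of mixup-style interpolation of two texts in hidden space
# (in the spirit of Chen et al., 2020). Model sizes and layer choice are assumptions.
import torch
import torch.nn as nn


class HiddenMixEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lower = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.upper = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, ids_a, ids_b=None, lam=1.0):
        h = self.lower(self.embed(ids_a))
        if ids_b is not None:
            # Interpolate the two hidden states at an intermediate layer.
            h_b = self.lower(self.embed(ids_b))
            h = lam * h + (1.0 - lam) * h_b
        h = self.upper(h)
        return self.classifier(h.mean(dim=1))


# Usage: mix two labeled batches; the loss is mixed with the same lambda,
# e.g. loss = lam * CE(logits, y_a) + (1 - lam) * CE(logits, y_b).
model = HiddenMixEncoder()
ids_a = torch.randint(0, 30522, (4, 16))
ids_b = torch.randint(0, 30522, (4, 16))
lam = torch.distributions.Beta(0.4, 0.4).sample().item()
logits = model(ids_a, ids_b, lam)
```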

The takeaway: no single augmentation works best for every task, and augmentation does not always improve performance.

The other tutorial I attended was Vision-Language Pretraining: Current Trends and the Future by Aishwarya Agrawal, Damien Teney and Aida Nematzadeh. The goal of this tutorial was to give an overview of current trends in multimodal problems, particularly vision and language. In the second part of the tutorial, the presenter tried to answer three questions:

Screenshot from the tutorial
  • Is the masked region modeling loss, adapted from language models, good enough?
  • Is the cross-talk between modalities (via attention) important?
  • What makes a good pretraining dataset?

Main Conference

Some token-free NLP models are appearing

Token-free models do not rely on a learned vocabulary to map words/subwords to tokens. Instead, they operate directly on raw text.

  • Clark et al., 2022 present a neural encoder that operates directly on character sequences and significantly outperforms a multilingual BERT model on a challenging multilingual benchmark;
  • Xue et al., 2022 introduce ByT5, which processes text as raw UTF-8 bytes in both the encoder (3x deeper than the decoder) and the decoder (see the sketch below).
Screenshot from Xue et al., 2022
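
To illustrate why byte-level models need no learned vocabulary, here is a minimal sketch of the byte "tokenization" used by ByT5: the vocabulary is simply the 256 possible byte values, shifted to leave room for a few special ids. The offset of 3 mirrors the ByT5 convention (pad, eos, unk), but treat the exact ids as an assumption here.

```python
# Minimal sketch of byte-level "tokenization" à la ByT5 (Xue et al., 2022).
def text_to_byte_ids(text: str, offset: int = 3) -> list[int]:
    """Encode a string as UTF-8 byte ids; no learned vocabulary is needed."""
    return [b + offset for b in text.encode("utf-8")]


def byte_ids_to_text(ids: list[int], offset: int = 3) -> str:
    """Decode back, skipping any ids that fall in the special-token range."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8", errors="ignore")


ids = text_to_byte_ids("Dún Laoghaire")  # non-ASCII characters expand to multiple bytes
print(ids[:8], byte_ids_to_text(ids))
```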

More and more papers are analyzing prompting methods

With the appearance of extremely large models (GPT-1/2/3), the “pre-train, fine-tune” procedure is being replaced by the “pre-train, prompt” paradigm (Liu et al., 2021), and many papers accepted this year analyze this topic.

  • Lu et al., 2022 demonstrate that the order in which the in-context samples are provided can make a huge difference in performance, and this problem is present across various model sizes. To address it, they propose constructing a synthetic development set and selecting the best candidate permutations on this set based on entropy statistics (see the sketch after this list);
Screenshot from Lu et al., 2022
  • Prompting methods are used not only for language problems but also for multimodal problems: Jin et al., 2022 propose FewVLM, a sequence-to-sequence transformer model pre-trained with prefix language modeling and masked language modeling, and show significant improvements in zero-shot performance;
  • There was also a tutorial, Zero- and Few-Shot NLP with Pretrained Language Models, that covered prompt-based learning, in-context learning and other zero/few-shot approaches.
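
The sketch below shows one way to pick a prompt order with entropy statistics, in the spirit of Lu et al., 2022: enumerate permutations of the few-shot examples, run each resulting prompt over a probing set, and prefer permutations whose predicted label distribution is not collapsed onto a single label. The `predict_label_probs` function stands in for a call to a language model and, like the probing-set construction, is an assumption rather than the paper’s code.

```python
# Minimal sketch of entropy-based prompt-order selection (after Lu et al., 2022).
import itertools
import math


def format_prompt(examples):
    return "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)


def global_entropy(prompt, probing_queries, predict_label_probs):
    """Entropy of the predicted label distribution over the probing set:
    permutations that make the model collapse onto one label score low."""
    counts = {}
    for query in probing_queries:
        probs = predict_label_probs(prompt, query)  # e.g. {"positive": 0.7, "negative": 0.3}
        label = max(probs, key=probs.get)
        counts[label] = counts.get(label, 0) + 1
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def best_permutation(few_shot_examples, probing_queries, predict_label_probs):
    scored = [
        (global_entropy(format_prompt(p), probing_queries, predict_label_probs), p)
        for p in itertools.permutations(few_shot_examples)
    ]
    return max(scored, key=lambda pair: pair[0])[1]
```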

Multilingual

Multilingual and cross-lingual problems have always been a hot track, and a lot of work has been done on different aspects:

  • In terms of dataset quality, Kreutzer et al., 2022 manually analyze the quality of several multilingual datasets, especially for low-resource languages; Lee et al., 2022 find that existing datasets have not been sufficiently deduplicated and that deduplicating the training data reduces memorization by 10x (a simplified dedup sketch follows after this list);
  • Regarding new pretrained models, to name a few: Feng et al., 2022 propose a multilingual sentence embedding model covering over 109 languages, based on masked language modeling and translation language modeling; Zhou et al., 2022 propose an approach to construct a knowledge base by leveraging monolingual triples and cross-lingual links via a language modeling setup; De Cao et al., 2021 propose an effective entity linking method that predicts entities by generating their names instead of performing dot-product search among items in a knowledge base;
  • For better adaptation from one language to another, Zhang et al., 2022 explore the properties of zero-shot transfer from sentence-level to document-level machine translation; Aepli et al., 2022 find that injecting character-level noise can help improve cross-lingual transfer when two languages are closely related.
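
As a rough illustration of deduplication, the sketch below removes whole-example duplicates by hashing a normalized form of each text. Lee et al., 2022 go much further (suffix arrays for exact substring matches and MinHash for near-duplicates); this simplified version only shows the basic idea.

```python
# Simplified whole-example deduplication by hashing normalized text
# (a toy stand-in for the methods of Lee et al., 2022).
import hashlib
import re


def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(corpus):
    seen, kept = set(), []
    for doc in corpus:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


corpus = ["The cat sat.", "the  cat sat.", "A different sentence."]
print(deduplicate(corpus))  # the second item is dropped as a duplicate
```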

Workshops

I mainly attended two workshops: e-Commerce and NLP (ECNLP) and Multilingual Multimodal Learning (MML).

ECNLP

This workshop featured many papers on product attribute extraction. Among them, Fuchs et al., 2022 frame product attribute extraction as a multi-label classification problem and observe that a CNN Seq2Seq model (Gehring et al., 2017) outperforms a BERT-based model and eBay’s in-house pretrained language model. A minimal sketch of the multi-label framing follows below.
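
In the multi-label framing, each product title can activate several attribute-value labels at once, so the model outputs one sigmoid per label and is trained with binary cross-entropy. The tiny bag-of-words encoder and the label space below are illustrative assumptions, not the workshop paper’s model.

```python
# Minimal sketch of product attribute extraction as multi-label classification.
import torch
import torch.nn as nn

LABELS = ["color:black", "color:red", "material:leather", "size:xl"]  # hypothetical label space


class AttributeClassifier(nn.Module):
    def __init__(self, vocab_size=5000, dim=64, num_labels=len(LABELS)):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # simple bag-of-words encoder
        self.out = nn.Linear(dim, num_labels)

    def forward(self, token_ids, offsets):
        return self.out(self.embed(token_ids, offsets))  # raw logits, one per label


model = AttributeClassifier()
criterion = nn.BCEWithLogitsLoss()           # independent sigmoid per label
token_ids = torch.randint(0, 5000, (12,))    # two concatenated product titles
offsets = torch.tensor([0, 7])               # start index of each title
targets = torch.tensor([[1., 0., 1., 0.], [0., 1., 0., 1.]])
loss = criterion(model(token_ids, offsets), targets)
```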

MML

One of the takeaways is that more and more NLP researchers are putting effort into improving low-resource languages, either through new datasets (Adelani et al., 2021) or better pretrained language models (Alabi et al., 2022).

To conclude, it was a wonderful journey to be able to join such an amazing conference.

Last but not least, I would like to thank all the people at Criteo who supported my participation in ACL 2022, as well as all the organizers and presenters from both academia and industry around the world who contributed to ACL 2022.
