Efficient transformers: Survey of recent work

Dr. Vijay Srinivas Agneeswaran
Data Science at Microsoft
16 min read · Sep 20, 2022

By Dr. Vijay Srinivas Agneeswaran and Dr. Badri Narayana Patro

Transformers such as Bidirectional Encoder Representations from Transformers (BERT) and Turing Natural Language Generation (T-NLG) from Microsoft have recently become popular in the Machine Learning world for Natural Language Processing (NLP) tasks such as machine translation, text summarization, and question answering, and have since been applied to protein fold prediction and even image processing tasks.

In this article we build on a survey of efficient transformers [Tay 2022] to provide a slightly different characterization of the field, and we include more recent work on advanced transformers (especially work published in 2021 and 2022). Interesting research directions open up as a result, which we discuss to conclude this article.

Our survey has produced the following illustration of transformers. Although this diagram is similar to the diagram in the survey paper [Tay 2022], the transformers surveyed are all recent and their overall categorization is also quite different.

Figure 1: Venn diagram of efficient transformer models, covering model robustness, privacy, spectral complexity, approximation, computational complexity, and model compression techniques.

As we show in the diagram, the major categories include computational complexity, spectral complexity, robustness, privacy, approximation, and model compression. We review each in turn.

Computational complexity

These transformers address the O(N²) computational complexity of self-attention in various ways. One of the key issues in a transformer is its quadratic complexity with respect to the input sequence length, in both computation and memory: an N × N attention matrix must be computed for every layer and attention head. Various approaches have been tried to reduce this O(N²) complexity, including the use of caching architectures.
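
To make the quadratic cost concrete, below is a minimal PyTorch sketch of standard scaled dot-product attention (our own illustration, not code from any surveyed paper). The (N, N) score matrix it materializes is exactly the object the methods below try to avoid.

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    # q, k, v: (batch, heads, N, head_dim). The (N, N) score matrix below is
    # what makes vanilla attention quadratic in the sequence length N.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (batch, heads, N, N)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 1024, 64)
out = full_attention(q, k, v)   # the score matrix alone holds 8 * 1024 * 1024 floats
```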

The Sparse Transformer is one popular method to address this complexity: each output position computes attention weights from only a subset of input positions. If the subset has size √N, the complexity reduces to O(N·√N), allowing the model to handle longer-range dependencies.

Longformer [Beltagy 2020] uses a combination of windowed local attention (where, for a window size w, each token attends to w/2 tokens on either side rather than to the entire input) and task-motivated global attention on a few special tokens. This helps it outperform RoBERTa and other state-of-the-art transformers.
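
The windowed pattern can be illustrated with the banded mask below (a sketch only; Longformer's actual implementation uses custom kernels so the full N × N matrix is never built).

```python
import torch

def sliding_window_mask(n: int, w: int) -> torch.Tensor:
    # True where token i is allowed to attend to token j, i.e. |i - j| <= w // 2.
    # A real Longformer kernel never materializes this (n, n) matrix.
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= w // 2

print(sliding_window_mask(8, 4).int())
```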

Another effort known as BigBird [Manzil 2020] uses graph sparsification techniques. Specifically, it uses a special graph known as the Watts-Strogatz graph, which approximates a complete graph, to achieve complexity that is linear in the input sequence length. The authors show that BigBird is Turing complete under standard precision assumptions. They also evaluate BigBird on tasks that require long-range dependencies, specifically extracting genomic sequences such as DNA and predicting the resulting chromatin profile. Linformer [Sinong 2020] approximates the dot-product attention operation using a combination of linear projections and low-rank factorization.
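
A minimal sketch of the Linformer idea follows, with random matrices standing in for the learned projections from the paper: keys and values are projected from length N down to a fixed k, so the score matrix is N × k rather than N × N.

```python
import torch
import torch.nn.functional as F

def linformer_attention(q, k, v, e_proj, f_proj):
    # e_proj, f_proj: (k, N) projections applied along the key/value sequence dimension.
    k_low = e_proj @ k                                           # (batch, heads, k, d)
    v_low = f_proj @ v
    scores = q @ k_low.transpose(-2, -1) / q.size(-1) ** 0.5     # (batch, heads, N, k)
    return F.softmax(scores, dim=-1) @ v_low

n, k_dim, d = 1024, 256, 64
q = k = v = torch.randn(1, 8, n, d)
proj = torch.randn(k_dim, n) / n ** 0.5      # stand-in for the learned projections
out = linformer_attention(q, k, v, proj, proj)                   # (1, 8, 1024, 64)
```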

Many of the sparsity-based transformers above require sparse matrix multiplication operations, which are not available on all hardware architectures. They also tend to stack more attention layers to compensate for sparsity, leading to significant energy consumption. Moreover, certain operations are not easily sparsified: the softmax operation (which is widely used in recommender systems as well as for action selection in reinforcement learning) and even the multinomial probit operation are difficult to approximate with sparse computation.

Google has proposed the Performer, a generalized attention framework that can express a broad class of attention mechanisms based on different similarity measures or kernels. It implements attention with the Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, and the authors show that regular softmax attention can be approximated by a combination of exponential functions and randomized Gaussian projections. They also show that the Performer outperforms standard Transformer models on protein sequence prediction tasks, among others. Below, we give a categorization of the various transformers, including recent work in this space.
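
The sketch below illustrates the random-feature idea behind FAVOR+ in simplified form (the orthogonalization of the random projections and other refinements from the paper are omitted): mapping queries and keys through positive random features lets the attention product be reassociated so that cost grows linearly in N.

```python
import torch

def positive_random_features(x, w):
    # Simplified positive random features: phi(x) = exp(x @ w - ||x||^2 / 2) / sqrt(m),
    # with w a (d, m) Gaussian matrix, so that phi(q) . phi(k) approximates exp(q . k).
    m = w.shape[-1]
    return torch.exp(x @ w - (x ** 2).sum(-1, keepdim=True) / 2) / m ** 0.5

def performer_attention(q, k, v, w):
    q = q / q.size(-1) ** 0.25      # fold the 1/sqrt(d) softmax scaling into q and k
    k = k / k.size(-1) ** 0.25
    q_f, k_f = positive_random_features(q, w), positive_random_features(k, w)
    # Reassociate the matmuls: q_f @ (k_f^T v) costs O(N * m * d) instead of O(N^2 * d).
    kv = k_f.transpose(-2, -1) @ v                                   # (m, d)
    normalizer = q_f @ k_f.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (q_f @ kv) / (normalizer + 1e-6)

d, m = 64, 256
q = k = v = torch.randn(1, 8, 1024, d)
w = torch.randn(d, m)
out = performer_attention(q, k, v, w)    # (1, 8, 1024, 64), never builds an N x N matrix
```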

Wang et al. [Wang 2021] have proposed the Pyramid Vision Transformer (PVT) for dense prediction without convolutions. Vision transformers are difficult to port to pixel-level dense prediction tasks; PVT overcomes this with a progressive shrinking pyramid and spatial-reduction attention, enabling dense prediction without convolutions and without non-maximum suppression (for example, in object detection). PVT is evaluated on image classification, object detection, and instance and semantic segmentation tasks.

Liu et al. [Liu 2021] discuss the challenge of adapting transformers from the language domain to the vision domain: visual entities vary greatly in scale, and high-resolution images contain far more pixels than text contains words. To address this, the authors proposed the Swin Transformer [Liu 2021], a hierarchical transformer whose representation is computed with shifted windows. The shifted-window scheme computes self-attention within non-overlapping local windows while still allowing cross-window connections, which makes the computation more efficient.

Chu et al. [Chu 2021] discuss the importance of spatial attention to the transformer’s performance on various tasks. The authors propose two simple and efficient architectures, Twins-PCPVT and Twins-SVT. Twins-SVT uses spatially separable self-attention (SSSA), analogous to depth-wise separable convolutions, which combines two types of attention operations: locally-grouped self-attention (LSA) and global sub-sampled attention (GSA). LSA captures fine-grained, short-distance information, while GSA handles long-distance and global information. The authors compare Twins-PCPVT with the similar PVT architecture [Wang 2021] and Twins-SVT with the similar Swin Transformer architecture [Liu 2021].

Spectral complexity

Efficient transformers can also be designed to speed up the transformer encoder by replacing the self-attention network with linear transformations that mix input tokens. In FNet [Lee-Thorp 2022], the self-attention layer is replaced by a Fourier transform sublayer, followed by a non-linearity and a feed-forward network. Compared to BERT, this network is 80 percent faster and achieves 92 to 97 percent of BERT's performance.
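
A minimal sketch of this kind of spectral token mixing, in the spirit of FNet: a Fourier transform applied along the hidden and sequence dimensions, keeping the real part, with no attention weights at all.

```python
import torch
import torch.nn as nn

class FNetMixing(nn.Module):
    # Token mixing in the spirit of FNet (sketch): a 2D FFT over the hidden and
    # sequence dimensions replaces self-attention, and only the real part is kept.
    def forward(self, x):                 # x: (batch, seq_len, hidden)
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 128, 256)
mixed = FNetMixing()(x)                   # same shape, no attention weights to learn
```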

The Global Filter Network (GFNet) [Rao 2021] proposes depth-wise global convolution for token mixing. GFNet involves three steps: spatial token mixing via a Fast Fourier Transform (FFT), frequency gating with learnable filters, and an inverse FFT for token demixing. However, GFNet does not perform channel mixing, becomes expensive for higher-resolution images as the sequence length increases, and is not adaptive.
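
A one-dimensional sketch of the global filter idea is shown below (GFNet itself filters the 2D spatial grid of image patches): FFT over the token dimension, elementwise multiplication with a learnable frequency-domain filter, and an inverse FFT.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    # Global-filter-style token mixing (1D sketch): FFT over tokens, multiply by a
    # learnable complex filter (the "frequency gating"), then inverse FFT.
    def __init__(self, seq_len, hidden):
        super().__init__()
        freq_len = seq_len // 2 + 1       # rfft keeps roughly half the frequencies
        self.filter = nn.Parameter(torch.randn(freq_len, hidden, 2) * 0.02)

    def forward(self, x):                 # x: (batch, seq_len, hidden)
        x_freq = torch.fft.rfft(x, dim=1)
        x_freq = x_freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft(x_freq, n=x.size(1), dim=1)

x = torch.randn(2, 196, 384)              # e.g. a 14 x 14 patch grid, flattened
out = GlobalFilter(196, 384)(x)
```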

Guibas et al. [Guibas 2021] formulated token mixing as an operator-learning task that learns mappings among continuous functions in infinite-dimensional space. Li et al. [Li 2020] solve Partial Differential Equations (PDEs) using a Fourier Neural Operator (FNO), which works well in continuous domains.

Adapting FNO from PDEs to the vision domain, with its high-resolution image inputs, requires modifications to the FNO architecture. This is because high-resolution images have discontinuities due to edges and other structures, and because FNO's channel-mixing weights scale quadratically with the channel size. A block-diagonal structure is imposed on the channel-mixing weights to handle the latter issue. The authors also share weights across tokens in the MLP layers for parameter efficiency and introduce sparsity in the frequency domain using soft thresholding for better generalization. These modifications together are known as the Adaptive Fourier Neural Operator (AFNO).
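
For reference, soft thresholding is the simple shrinkage operator sketched below (a generic illustration; AFNO applies it to frequency-domain features):

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    # Shrink each entry toward zero by lam and clip: entries with |x| <= lam become
    # exactly zero, which is what induces sparsity in the frequency domain.
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

print(soft_threshold(torch.tensor([-1.2, -0.3, 0.1, 0.9]), 0.5))   # [-0.7, 0.0, 0.0, 0.4]
```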

Bai et al. [Bai 2022] have proposed the HAT method (named for High-frequency components via Adversarial Training), which perturbs high-frequency components during the training stage. HAT alters the high-frequency components of a training image by adding an adversarial perturbation and then trains the Vision Transformer (ViT) [Dosovitskiy 2020] model on the altered image, which improves performance and makes the model more robust.
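
The frequency-domain manipulation involved can be sketched as below, with random noise standing in for the adversarial perturbation computed in the actual HAT method; the cutoff radius and noise scale are illustrative only.

```python
import torch

def perturb_high_freq(img, eps=0.1, cutoff=0.25):
    # Sketch only: add noise to the high spatial frequencies of an image and leave the
    # low frequencies (within `cutoff` of the spectrum center) untouched.
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    high = dist > cutoff * min(h, w) / 2              # mask of high-frequency bins
    freq = freq + high * (eps * torch.randn_like(freq.real))
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

img = torch.rand(3, 224, 224)
perturbed = perturb_high_freq(img)
```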

Robustness

Robustness of transformers has been studied with respect to perturbations, common corruptions, distributional shift, and natural adversarial examples.

Shao et al. [Shao 2021] analyzed the robustness of transformer models under adversarial perturbations. The authors conduct experiments in white-box and transfer attack settings and observe that ViT has better adversarial robustness than Convolutional Neural Networks (CNNs). They find that ViT features contain less low-level information, which contributes to superior robustness against adversarial attacks, and they note that combining CNNs and transformers leads to better robustness than increasing the size or depth of pure transformer models. Additionally, they find that pre-training on larger datasets does not improve adversarial robustness; if anything, the opposite is true.

Bhojanapalli et al. [Bhojanapalli 2021] investigated various measures of robustness of ViT and ResNet models against adversarial examples, natural examples, and common corruptions. The authors investigated robustness to perturbations of both the input and the model itself, and observe that transformers are robust to the removal of almost any single layer.

Paul et al. [Paul 2022] studied various aspects of robust learning in ViT [Dosovitskiy 2020], CNNs, and Big Transfer (BiT) [Kolesnikov 2020] models, benchmarking the robustness of ViTs on a wide range of ImageNet datasets. Their results are summarized in Table R. Through six experiments, the authors verified that ViT has improved robustness compared to CNNs and BiT. The results of those experiments include:

  • Experiment 1: Attention is crucial for improved robustness.
  • Experiment 2: The role of pre-training is an important one.
  • Experiment 3: ViT has better robustness to image masking.
  • Experiment 4: Fourier spectrum analysis reveals low sensitivity for ViT.
  • Experiment 5: Adversarial perturbations spread more widely across the energy spectrum.
  • Experiment 6: ViT has a smoother loss landscape with respect to input perturbations.
Table R [Paul 2022]: mCEs of different models and methods on ImageNet-C (lower is better); mFRs and mT5Ds on the ImageNet-P dataset (lower is better). cAcc denotes challenge accuracy; the cAcc column shows performance on detecting vulnerable image foregrounds from the ImageNet-9 dataset. Columns 6, 7, and 8 show top-1 accuracy scores (as percentages) on the ImageNet-R, -A, and -O datasets, respectively.

ViT [Dosovitskiy 2020] models are less effective than CNNs at capturing the high-frequency components of images, as shown by Park et al. [Park 2022]. HAT [Bai 2022] resulted from a further investigation of existing transformer models from this frequency perspective: it perturbs the high-frequency components of the input image with noise generated using the RandAugment method. Wu et al. [Wu 2022] investigated the vulnerability of transformer models to adversarial examples, a weakness they share with CNNs. In CNNs this vulnerability is addressed with adversarial training, but for transformers adversarial training carries a heavy computational cost due to the quadratic complexity of self-attention. Their AGAT method makes adversarial training more efficient by removing certain patch embeddings at each layer with an attention-guided dropping strategy.

Privacy

Today, pre-trained transformer models are deployed on cloud systems. One of the main issues in cloud-based model deployment is data privacy: user data such as search history, medical records, and bank account details may be exposed. Current research focuses on preserving privacy during inference with transformer models.

The paper [Huang 2020] introduced TextHide, a federated-learning technique to preserve privacy, but this method is suited to sentence-level tasks (such as machine translation, sentiment analysis, and paraphrase generation) rather than token-level tasks (such as named entity recognition and semantic role labeling).

Similarly, the DP-finetune method [Kerrigan 2020] uses Differential Privacy (DP), which allows us to quantify the degree to which sensitive data is protected. However, training with a DP algorithm degrades model quality; this can be partly mitigated by fine-tuning a public base model on the private dataset.

Gentry [Gentry 2009] proposed fully homomorphic encryption (HE), which protects privacy by computing directly on ciphertext. However, HE supports only addition and multiplication, which is a problem for transformer-based models because of non-polynomial operations such as the GELU activation [Hendrycks 2016].

The paper [Chen 2022] proposed THE-X, which enables HE-based inference [Boemer 2019, Boemer 2020] for transformers through a series of approximations: it replaces non-polynomial operations such as the SoftMax and GELU layers with approximations, drops the pooler layer, adds layer normalization, uses knowledge-distillation techniques, and then runs the remaining HE-supported operations with an HE transformer. THE-X is evaluated with a BERT-Tiny model on GLUE [Wang 2018] and benchmarked on the CoNLL-2003 [Sang 2003] task.
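
To see why non-polynomial activations are the obstacle, the toy example below (not the approximation used in THE-X) replaces GELU with a low-degree polynomial fitted by least squares; a polynomial needs only additions and multiplications and is therefore HE-friendly.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, itself non-polynomial and hence unsupported under HE.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.linspace(-4, 4, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=3)          # degree-3 fit on [-4, 4], HE-friendly
poly_gelu = np.poly1d(coeffs)
print(np.max(np.abs(poly_gelu(xs) - gelu(xs))))   # worst-case approximation error on the range
```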

Li et al. [Li 2022] address the performance drop and high computational overhead of differentially private learning. They show these can be handled by using larger pre-trained language models (i.e., strong non-private baselines) and by fine-tuning with a procedure aligned with pre-training, using DP optimization on moderately sized corpora.

Approximation

The paper [Ruthotto 2019] was one of the first to provide a theoretical foundation for deep neural networks such as ResNets based on Partial Differential Equations (PDEs). More specifically, the authors showed that residual CNNs can be interpreted as a discretization of a space-time differential equation. Based on this theoretical characterization, they also propose new models, such as hyperbolic and parabolic CNNs, with special properties.

Residual networks have also been interpreted as Euler discretizations of Ordinary Differential Equations (ODEs). However, the Euler method is not precise: as a first-order method it suffers from truncation error. The authors of the ODE Transformer [Bei 2022] instead used a classical higher-order method (Runge-Kutta) to build a transformer block, and they demonstrated its effectiveness on three sequence-generation tasks: abstractive summarization, machine translation, and grammatical error correction. Another effort in this direction is TransEvolve [Dutta 2021], which provides an ODE-Transformer-like architecture, but one modeled on multi-particle dynamical systems.
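
As a minimal sketch of the idea (not the exact block from the ODE Transformer paper, which applies higher-order solvers to the attention and feed-forward sublayers with learned coefficients), a second-order Runge-Kutta residual block reuses the same sublayer F twice per step:

```python
import torch
import torch.nn as nn

class RK2Block(nn.Module):
    # A vanilla residual block x + F(x) is an Euler step of an ODE; here a second-order
    # Runge-Kutta (midpoint) step evaluates the same sublayer F twice for a more
    # accurate update: y = x + F(x + 0.5 * F(x)).
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f

    def forward(self, x):
        k1 = self.f(x)
        k2 = self.f(x + 0.5 * k1)
        return x + k2

block = RK2Block(nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256)))
y = block(torch.randn(8, 256))
```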

Transformers have been shown to be equivalent to universal computation engines [Kevin 2022]. The authors have proposed an architecture known as the Frozen Pretrained Transformer (FPT), which can be trained on a single modality (such as text data for language modeling) and identify abstractions (such as feature representations) that are useful across modalities. They have taken a GPT, pre-trained it on only natural language data, and fine-tuned its input and output layers along with the layer normalization parameters and positional embeddings. This has resulted in the FPT performing comparably with transformers trained completely from scratch for a variety of tasks such as protein fold prediction, numerical computation, and even image classification.

Model compression

Touvron et al. [Touvron 2021] proposed a data-efficient image transformer (DeiT) based on a distillation technique. It uses a teacher-student strategy that relies on a distillation token to ensure that the student learns from the teacher through attention.
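
A sketch of the hard-label variant of this distillation objective is shown below, assuming a student that emits separate logits from its class token and its distillation token (the variable names are ours, not DeiT's API):

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_cls_logits, student_dist_logits, teacher_logits, labels):
    # The class token is trained on the ground-truth labels and the distillation token
    # on the teacher's predicted labels; the two cross-entropies are averaged.
    ce_labels = F.cross_entropy(student_cls_logits, labels)
    ce_teacher = F.cross_entropy(student_dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * ce_labels + 0.5 * ce_teacher

logits_cls, logits_dist = torch.randn(4, 1000), torch.randn(4, 1000)
teacher_logits = torch.randn(4, 1000)
labels = torch.randint(0, 1000, (4,))
loss = hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels)
```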

Bao et al. [Bao 2021] have proposed a masked image modeling task to pre-train vision transformers. The authors propose a self-supervised vision representation model, Bidirectional Encoder representation from Image Transformers (BEiT), which follows the BERT [Kenton 2019] method developed for Natural Language Processing. In this method each image is considered from two views: one of image patches of size 16 x 16 pixels, and the other of discrete visual tokens. The original image is tokenized into visual tokens, some of the image patches are randomly masked, and the result is fed to the backbone transformer for pre-training. After pre-training, BEiT can be fine-tuned for downstream tasks.
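
The masking step can be sketched as below, assuming uniform random patch selection for brevity (BEiT itself uses blockwise masking and predicts discrete visual tokens produced by a dVAE tokenizer):

```python
import torch

def random_patch_mask(num_patches: int, mask_ratio: float = 0.4) -> torch.Tensor:
    # Choose a random subset of image patches to mask; during pre-training the model
    # predicts the visual tokens at the masked positions. (BEiT masks patches blockwise;
    # uniform sampling here keeps the illustration short.)
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

mask = random_patch_mask(14 * 14)   # a 224 x 224 image split into 16 x 16 patches
print(mask.sum().item(), "of", mask.numel(), "patches masked")
```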

Possible research directions

The above survey opens several possibilities, including the following:

  • Approximate transformers are an interesting avenue for further research, especially if we consider other numerical methods (not just Runge-Kutta), including methods for solving partial differential equations, to implement and approximate more efficient transformers.
  • Transformers may have emergent properties at scale, as is evident from Wei et al. [Wei 2022]. The transformers surveyed in this article can be scaled to see whether they can solve mathematical operations. This may result in transformers performing efficiently on Long Range Arena (LRA) benchmarks [Tay 2021a], which are becoming important in recent literature. For example, ListOps is one task in the LRA where one can expect scaled transformers with emergent properties to perform quite well. Please see our forthcoming article on benchmarking transformers.
  • A few optimizations developed for solving PDEs, such as AFNO, can be incorporated into transformers to improve performance.

References

[Alexey 2021] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.”, International Conference on Learning Representation, 2021.

[Ashish 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.

[Bai 2022] Bai, Jiawang, Liuliang Yuan, Shutao Xia, Shuicheng Yan, Zhifeng Li and W. Liu. “Improving Vision Transformers by Revisiting High-frequency Components.” Accepted to European Conference on Computer Vision 2022, available from https://arxiv.org/abs/2204.00993.

[Beltagy 2020] Beltagy, I., Peters, M. E., and Cohan, A., “Longformer: The Long-Document Transformer”, arXiv e-prints, 2020.

[Bei 2022] Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo Zhu, Xuebo Liu, Min Zhang, “ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation”, 60th Annual Meeting of the Association of Computational Linguistics (ACL) (1) 2022: 8335–835.

[Bhojanapalli 2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10231–10241 (2021).

[Boemer 2019] Boemer, F., Lao, Y., Cammarota, R., Wierzynski, C.: ngraph-HE: a graph compiler for deep learning on homomorphically encrypted data. In: Proceedings of the 16th ACM International Conference on Computing Frontiers, pp. 3–13 (2019).

[Boemer 2020] Boemer, F., Cammarota, R., Demmler, D., Schneider, T., Yalame, H.: Mp2ml: A mixed-protocol machine learning framework for private inference. In: Proceedings of the 15th International Conference on Availability, Reliability and Security, pp. 1–10 (2020).

[Chen 2022] Chen, T., Bao, H., Huang, S., Dong, L., Jiao, B., Jiang, D., Zhou, H., Li, J., Wei, F.: The-x: Privacy-preserving transformer inference with homomorphic encryption. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 3510–3520 (2022)

[Chu 2021] Chu, Xiangxiang & Tian, Zhi & Wang, Yuqing & Zhang, Bo & Ren, Haibing & Wei, Xiaolin & Xia, Huaxia & Shen, Chunhua. (2021). Twins: Revisiting Spatial Attention Design in Vision Transformers, NeurIPS 2021.

[Dosovitskiy 2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2020).

[Dutta 2021] Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti, Tanmoy Chakraborty: Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems. NeurIPS 2021: 5531–5544.

[Gentry 2009] Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, pp. 169–178 (2009)

[Guibas 2021] Guibas, J., Mardani, M., Li, Z., Tao, A., Anandkumar, A., Catanzaro, B.: Efficient token mixing for transformers via adaptive Fourier Neural operators. In: International Conference on Learning Representations (2021).

[Hendrycks 2016] Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with gaussian error linear units (2016).

[Huang 2020] Huang, Y., Song, Z., Chen, D., Li, K., Arora, S.: Texthide: Tackling data privacy in language understanding tasks. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1368–1382 (2020).

[Touvron 2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jegou, H. (2021). Training data-efficient image transformers & distillation through attention, Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research, 139:10347–10357. Available from https://proceedings.mlr.press/v139/touvron21a.html.

[Kevin 2022] Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch, “Frozen Pretrained Transformers as Universal Computation Engines,” Association for Advancement of Artificial Intelligence, AAAI 2022.

[Kerrigan 2020] Kerrigan, G., Slack, D., Tuyls, J.: Differentially private language models benefit from public pre-training. In: Proceedings of the Second Workshop on Privacy in NLP, pp. 39–45 (2020).

[Kolesnikov 2020] Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N.: Big transfer (bit): General visual representation learning. In: European Conference on Computer Vision, pp. 491–507 (2020). Springer.

[Lee-Thorp 2022] Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: FNet: Mixing tokens with Fourier Transforms. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.

[Li 2020] Li, Z., Kovachki, N.B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., Anandkumar, A., et al.: Fourier neural operator for parametric partial differential equations. In: International Conference on Learning Representations (2020).

[Li 2022] Li, X., Tramer, F., Liang, P., Hashimoto, T.: Large language models can be strong differentially private learners. In: International Conference on Learning Representations (2022).

[Liu 2021] Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. “Swin transformer: Hierarchical vision transformer using shifted windows.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. 2021.

[Manzil 2020] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big bird: transformers for longer sequences. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 1450, 17283–17297.

[Park 2022] Park, Namuk and Songkuk Kim. “How Do Vision Transformers Work?” ArXiv abs/2202.06709 (2022), proceedings of International Conference on Learning Representations (ICLR 2022).

[Paul 2022] Paul, S., Chen, P.-Y.: Vision transformers are robust learners. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 2071–2081 (2022).

[Rao 2021] Rao, Y., Zhao, W., Zhu, Z., Lu, J., Zhou, J.: Global filter networks for image classification. Advances in Neural Information Processing Systems 34, 980–993 (2021).

[Ruthotto 2019] Ruthotto, L., & Haber, E. (2019). Deep Neural Networks Motivated by Partial Differential Equations. Journal of Mathematical Imaging and Vision, 62, 352–364.

[Sang 2003] Sang, E.T.K., De Meulder, F.: Introduction to the conll-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147 (2003).

[Shao 2021] Shao, Rulin, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen and Cho-Jui Hsieh. “On the Adversarial Robustness of Vision Transformers.” ArXiv abs/2103.15670 (2021).

[Sinong 2020] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma: Linformer: Self-Attention with Linear Complexity. CoRR abs/2006.04768 (2020).

[Stanislaw 2020] Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof J. Geras: The Break-Even Point on Optimization Trajectories of Deep Neural Networks. CoRR abs/2002.09572 (2020).

[Tay 2021a] Tay, Yi, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. “Long range arena: A benchmark for efficient transformers.” International Conference on Learning Representations (2021).

[Tay 2022] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient Transformers: A Survey. ACM Computing Surveys, Just Accepted (April 2022). https://doi.org/10.1145/3530811

[Wang 2018] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: Glue: A multi-task benchmark and analysis platform for natural language understanding. In: International Conference on Learning Representations (2018).

[Wang 2021] Wang, Wenhai, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578. 2021.

[Wei 2022] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent Abilities of Large Language Models. ArXiv, abs/2206.07682.

[Wu 2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022).

[Yu 2022] Yu, D., Naik, S., Backurs, A., Gopi, S., Inan, H.A., Kamath, G., Kulkarni, J., Lee, Y.T., Manoel, A., Wutschitz, L., et al.: Differentially private fine-tuning of language models. In: International Conference on Learning Representations (2022).

[Ze 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. IEEE International Conference Computer Vision, pages 10012–10022, 2021.


Dr. Vijay Srinivas Agneeswaran
Data Science at Microsoft

Vijay is leading a team of AI/ML researchers in Microsoft Cloud+AI. He has a PhD from IIT Madras and 20+ years of R&D experience including a stint at Walmart.