Improving Inference Speeds of Transformer Models

“With great models come slower inference speeds.”

Mixed Precision Training

Figure 1: Mixed precision training iteration for a layer. Source: Mixed Precision Training
Figure 2: Histogram of activation gradient values during the training of Multibox SSD network. Source: Mixed Precision Training
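As a rough illustration of the iteration in Figure 1 (an fp32 master copy of the weights, fp16 forward/backward passes, and loss scaling so the small activation gradients visible in Figure 2 are not flushed to zero in fp16), here is a minimal NumPy sketch for a hypothetical one-parameter linear model. The model, learning rate, and loss scale are illustrative assumptions; real frameworks automate these steps (e.g. `torch.cuda.amp` in PyTorch):

```python
import numpy as np

def mixed_precision_step(w_fp32, x, y, lr=0.1, loss_scale=1024.0):
    """One mixed-precision step for a hypothetical model pred = w * x
    with squared-error loss (illustrative, not the paper's exact code)."""
    # 1. Cast the fp32 master weights to fp16 for the forward pass.
    w16 = w_fp32.astype(np.float16)
    x16 = x.astype(np.float16)
    pred = w16 * x16                                    # fp16 activations
    # 2. Backward pass in fp16, with the loss (hence gradient) scaled up
    #    so tiny gradient values stay representable in half precision.
    grad = 2.0 * (pred - y.astype(np.float16)) * x16    # dL/dw in fp16
    scaled_grad = (grad * np.float16(loss_scale)).astype(np.float16)
    # 3. Unscale in fp32 and update the fp32 master copy of the weights.
    update = scaled_grad.astype(np.float32) / loss_scale
    return w_fp32 - lr * update.mean()
```

Keeping the master weights in fp32 is what prevents small updates from being lost to fp16 rounding across many steps.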

Patience Based Early Exit

Figure 3: Patience-based Early Exit (PABEE). Image Source

Although the model becomes more “confident” in its prediction as more layers join, the actual error rate instead increases after 10 layers. This phenomenon was discovered and named “overthinking” by Kaya et al.

Figure 4: Analogy Between Overfitting and Overthinking. Image Source

Overfitting in training and overthinking in inference are naturally alike, inspiring us to adopt an approach similar to early stopping for inference.
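The exit rule itself is simple: attach a classifier to each layer and stop as soon as a fixed number (the "patience") of consecutive internal classifiers agree on the prediction. Here is a minimal sketch, assuming the per-layer classifier logits are already available; the function name and data layout are illustrative, not from the PABEE codebase:

```python
def pabee_exit(layer_logits, patience=2):
    """Return (prediction, layers_used) under patience-based early exit.

    layer_logits: list of per-class logit lists, one entry per layer's
    internal classifier, in layer order (an assumed input format).
    """
    counter, prev = 0, None
    for i, logits in enumerate(layer_logits):
        pred = max(range(len(logits)), key=logits.__getitem__)  # argmax
        # Count consecutive layers whose prediction is unchanged;
        # any disagreement resets the counter, as in the PABEE rule.
        counter = counter + 1 if pred == prev else 0
        prev = pred
        if counter >= patience:
            return pred, i + 1   # exit here, skipping the remaining layers
    return prev, len(layer_logits)  # no early exit: use the final layer
```

Larger patience values trade speed for robustness, directly mirroring the patience hyperparameter of early stopping in training.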

Knowledge Distillation