Video Swin Transformer Improves Speed-Accuracy Trade-offs, Achieves SOTA Results on Video Recognition Benchmarks

Transformer architectures are transforming computer vision. Introduced in 2020, the Vision Transformer (ViT) applies global self-attention across image patches, and the video models built on it connect patches globally across both the spatial and temporal dimensions. Transformers are increasingly replacing convolutional neural networks (CNNs) as the modelling choice for researchers in this field.