ConvNets vs Vision Transformers: Mathematical Deep Dive.

Freedom Preetham · Published in Autonomous Agents · Oct 30, 2023 · 6 min read

I keep witnessing the debate about whether Vision Transformers are as good as, or better than, CNNs. I wonder if we should also debate whether pineapples are better than watermelons, or horses better than dolphins. Many of these discussions lack specificity and often misrepresent the context.

As a backdrop, in the rapidly evolving landscape of deep learning, two architectures stand out for image “classification” tasks: Convolutional Neural Networks (ConvNets) and Vision Transformers (ViTs). While practitioners often use them interchangeably for classification, their mathematical underpinnings are distinct.

In this article, I venture deep into the mathematics of these architectures, shedding light on their functional equivalence for classification and their differences in generative tasks. I also provide a mathematical comparison of how training budgets converge or diverge depending on context.

Delving into Non-Generative Functional Equivalence

1. The Hierarchical Feature Space

ConvNets:

Given an input I and filters {F_k}, convolution is defined as:

(I * F_k)(i, j) = Σ_m Σ_n I(i + m, j + n) · F_k(m, n)

Stacking these convolutions layer by layer:

H^(l+1) = σ(H^(l) * F_k^(l) + b_k^(l))

where σ is an activation function and b_k^(l) is a bias term.

ViTs:

Tokens undergo self-attention:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

The entire sequence evolves block by block as:

Z'^(l) = MSA(LN(Z^(l−1))) + Z^(l−1)
Z^(l) = MLP(LN(Z'^(l))) + Z'^(l)

where MSA is multi-head self-attention, MLP is the feed-forward block, and LN is layer normalization.

Both architectures build up representations of their inputs layer by layer, modeling hierarchical patterns.
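To make the parallel concrete, here is a minimal PyTorch sketch, not either architecture's reference implementation; the channel counts, token counts, and dimensions are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# ConvNet: hierarchical features via stacked convolutions, H^(l+1) = sigma(H^(l) * F + b)
conv_stack = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
)

# ViT-style block: self-attention followed by an MLP, each with a residual connection
class EncoderBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        a, _ = self.attn(h, h, h)
        z = z + a                            # Z' = MSA(LN(Z)) + Z
        return z + self.mlp(self.norm2(z))   # Z  = MLP(LN(Z')) + Z'

img = torch.randn(1, 3, 32, 32)      # toy image
tokens = torch.randn(1, 16, 64)      # toy patch embeddings
print(conv_stack(img).shape, EncoderBlock()(tokens).shape)
```

In both cases, repeating the block deepens the hierarchy: later layers see the input only through the features produced by earlier ones.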

2. Injecting Non-Linearity

ConvNets:

ReLU is commonly used:

ReLU(x) = max(0, x)

ViTs:

GELU is typical in transformers:

GELU(x) = x · Φ(x)

where Φ is the cumulative distribution function of the standard normal.

These non-linearities ensure that the models can capture complex patterns.
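A quick numerical sketch of the two activations, using PyTorch's built-in implementations:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))   # hard threshold at zero: [0, 0, 0, 0.5, 2]
print(F.gelu(x))   # smooth gating by the Gaussian CDF; small negative inputs leak through slightly
```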

3. Efficiency in Parameterization for Classification

ConvNets:

Due to weight sharing, the parameter count of a convolutional layer is independent of the spatial size of the input:

Params_conv = K² · D_in · D_out (plus D_out bias terms)

where K is the filter size and D_in, D_out are the input and output depths.

ViTs:

Self-attention grows quadratically with sequence length, but linear approximations such as Linformer project the keys and values down to a fixed dimension k ≪ L, reducing the cost:

O(L² · N) → O(L · k · N)

where L is the sequence length and N is the feature dimension.

Both delineate the feature space, forming robust decision boundaries.
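As a rough sketch of what weight sharing buys, the snippet below counts parameters for a 3×3 convolution and a multi-head attention layer of matching width; the sizes are illustrative assumptions, not a benchmark:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4)

count = lambda m: sum(p.numel() for p in m.parameters())
# Conv parameters: K^2 * D_in * D_out + D_out = 9*64*64 + 64, independent of image size
print("conv params:", count(conv))
# Attention parameters: Q/K/V and output projections, ~4 * N^2 (+ biases), independent of L;
# it is the activation cost, not the parameter count, that grows with L^2
print("attention params:", count(attn))
```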

So far, we have seen that while ConvNets and Vision Transformers have distinct mathematical foundations, they exhibit remarkable functional equivalence in classification tasks. Their ability to capture hierarchical patterns and form effective decision boundaries makes both prime choices for practitioners.

Non-Generative Training Budgets: Where They Align

In non-generative tasks, primarily classification, the training budgets for both architectures exhibit remarkable similarities. Let’s explore this mathematically.

1. Computational Complexity

ConvNets:

The computational cost of a convolutional layer is:

O(K² · M · N · D_in · D_out)

where K is the filter size, M×N is the feature map size, and D_in, D_out are the input and output depths.

ViTs:

For self-attention:

O(L² · N)

where L is the sequence length and N is the feature dimension.

In practice, for large-scale datasets and deep networks, these complexities tend to converge, especially when using efficient transformer variants like Linformer or Performer.
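A back-of-the-envelope sketch of the two costs; the sizes (a 56×56×64 feature map with 3×3 filters versus 196 tokens of dimension 768) are illustrative assumptions, not measurements:

```python
# Rough multiply-accumulate counts for one layer of each kind (illustrative sizes)
K, M, N_map, D_in, D_out = 3, 56, 56, 64, 64      # conv: 3x3 filters on a 56x56x64 feature map
conv_cost = K**2 * M * N_map * D_in * D_out        # O(K^2 * M * N * D_in * D_out)

L, N_dim = 196, 768                                # ViT-Base-style: 196 tokens, 768-dim features
attn_cost = L**2 * N_dim                           # O(L^2 * N) for the attention matrix alone

print(f"conv layer  ~{conv_cost / 1e6:.0f}M MACs")
print(f"attention   ~{attn_cost / 1e6:.0f}M MACs")
```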

2. Memory Footprint

ConvNets:

Due to weight sharing, the memory required for the weights is:

Memory_conv = O(K² · D_in · D_out)

where D_in and D_out are the input and output depths; it does not grow with the spatial resolution.

ViTs:

The memory cost is:

Memory_vit = O(L²) for the attention maps, plus O(L · N) for the token activations.

Again, with efficient variants and optimizations, the memory footprints align closely for large-scale classification tasks.
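A similar back-of-the-envelope sketch for memory, again with illustrative sizes and assuming 4-byte floats:

```python
# Rough memory estimates in megabytes, assuming 4-byte floats (illustrative sizes only)
bytes_per_float = 4

# ConvNet layer weights: K^2 * D_in * D_out, independent of image resolution
K, D_in, D_out = 3, 256, 256
conv_weight_mb = K**2 * D_in * D_out * bytes_per_float / 1e6

# ViT attention activations: one L x L attention map per head, plus L x N token activations
L, N_dim, heads = 196, 768, 12
attn_act_mb = (heads * L**2 + L * N_dim) * bytes_per_float / 1e6

print(f"conv weights       ~{conv_weight_mb:.2f} MB")
print(f"attention activs.  ~{attn_act_mb:.2f} MB")
```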

As we can see, mathematically, ConvNets and Vision Transformers converge in training budget for non-generative, large-scale classification tasks.

Generative Tasks: The Divergence

1. Spatial Coherence

ConvNets:

They inherently maintain spatial coherence: the output at location (i, j) depends only on a local neighborhood of the input defined by the receptive field,

Y(i, j) = f( { I(i + m, j + n) : |m|, |n| ≤ ⌊K/2⌋ } )

ViTs:

ViTs require positional embeddings to recover spatial structure:

z_i^(0) = x_i E + e_i^pos

where x_i is the i-th image patch, E is the patch-embedding matrix, and e_i^pos is the positional embedding for position i.

While ConvNets produce naturally smooth images, ViTs may need added constraints.
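A minimal sketch of how a ViT injects that missing spatial information, patchifying an image with a strided convolution and adding learned positional embeddings (the patch size and dimensions are arbitrary for the example):

```python
import torch
import torch.nn as nn

# Patchify an image and add learned positional embeddings (a minimal sketch)
class PatchEmbed(nn.Module):
    def __init__(self, img_size=32, patch=8, dim=64):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # x_i E as a strided conv
        num_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))       # e_i^pos, learned per position

    def forward(self, img):
        tokens = self.proj(img).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return tokens + self.pos                            # without self.pos, token order is lost

print(PatchEmbed()(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 16, 64])
```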

2. Sequential Data Generation

ConvNets:

In architectures like PixelCNN, the image is generated pixel by pixel via masked convolutions:

p(x) = Π_i p(x_i | x_1, …, x_(i−1))

ViTs:

Transformers handle sequences naturally, using the same autoregressive factorization with a causal attention mask:

p(x) = Π_t p(x_t | x_1, …, x_(t−1))

ViTs have the edge in naturally generating sequences, while ConvNets need specific designs.
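The two autoregressive mechanisms can be sketched side by side: a causal attention mask for the transformer and a masked filter in the PixelCNN style (a simplified illustration, not a full generative model):

```python
import torch
import torch.nn as nn

# Causal self-attention: each token may only attend to itself and earlier tokens
T, dim = 8, 64
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = blocked

x = torch.randn(1, T, dim)
out, weights = attn(x, x, x, attn_mask=causal_mask)
print(weights[0, -1])  # last token attends over all previous positions
print(weights[0, 0])   # first token can only attend to itself

# A PixelCNN-style masked convolution does the same in 2D: zero out filter taps
# at and after the current pixel so each pixel is predicted from already-generated pixels.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
mask = torch.ones_like(conv.weight)
mask[..., 1, 1:] = 0   # block the current pixel and everything to its right
mask[..., 2, :] = 0    # block the row below
conv.weight.data *= mask
```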

3. Latent Space Dynamics

ConvNets:

In VAE structures, the encoder and decoder are stacks of convolutions, and sampling uses the reparameterization trick:

z = μ(x) + σ(x) ⊙ ε,  ε ∼ N(0, I)

ViTs:

Potential for richer latent spaces: with self-attention, every latent token can depend on every other,

z_i' = Σ_j α_ij v_j,  where α_ij = softmax_j( q_i · k_j / √d_k )

allowing global dependencies within the latent representation.

ViTs might capture more intricate latent structure thanks to their self-attention mechanism, while ConvNets may need more elaborate designs to do the same.
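For concreteness, here is a minimal sketch of a convolutional VAE encoder with the reparameterization trick; the architecture and latent size are arbitrary choices:

```python
import torch
import torch.nn as nn

# Reparameterization trick, z = mu + sigma * eps, with a convolutional encoder (a minimal sketch)
class ConvEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
            nn.Flatten(),
        )
        self.mu = nn.Linear(64 * 8 * 8, latent_dim)
        self.log_var = nn.Linear(64 * 8 * 8, latent_dim)

    def forward(self, x):
        h = self.features(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps   # z = mu + sigma * eps

z = ConvEncoder()(torch.randn(4, 3, 32, 32))
print(z.shape)  # torch.Size([4, 16])
```

A ViT-based encoder would replace the convolutional stack with patch embeddings plus encoder blocks, letting every latent dimension draw on global context at the price of the attention cost discussed above.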

As we can see, when venturing into generative domains, their inherent biases manifest distinctly.

Generative Budget Scenarios: Where Disparities Arise

When it comes to generative tasks, the training budgets for ConvNets and ViTs start to diverge significantly.

1. Spatial Coherence and Continuity

ConvNets:

The inherent spatial structure ensures locally coherent outputs. Thus, fewer training iterations might be needed to achieve high-quality images:

Iterations_conv ∝ 1 / ϵ_conv

where ϵ_conv is the convergence rate for ConvNets.

ViTs:

ViTs, lacking an inherent spatial bias, might require additional training iterations to ensure local coherence in generated images:

Iterations_vit ∝ 1 / ϵ_vit

where ϵ_vit is typically smaller than ϵ_conv due to the absence of spatial priors, implying more iterations.

2. Latent Space Exploration

ConvNets:

The exploration of the latent space in generative models like VAEs is direct, scaling with the size of the latent code:

Cost_conv ∝ Z

where Z is the dimensionality of the latent space.

ViTs:

Given the self-attention mechanism, ViTs might exhibit a richer exploration of the latent space, but at a potentially higher computational cost, since every latent token attends over the full sequence of length L (this gets offset in sequential dependencies):

Cost_vit ∝ Z · L

3. Sequential Dependencies

ConvNets:

While adaptable, ConvNets are not inherently sequential. Modeling sequential dependencies therefore requires more intricate designs and potentially longer training (this is where ViTs beat ConvNets on generative use cases):

Iterations_conv ∝ τ · δ_conv

where τ is the sequence length and δ_conv is an iteration factor.

ViTs:

Given their origin in NLP, ViTs naturally handle sequences, potentially reducing the required training iterations:

Iterations_vit ∝ τ · δ_vit

where δ_vit is typically smaller than δ_conv.

The landscape changes dramatically in generative scenarios. ConvNets, with their spatial priors, may offer advantages in image generation, while ViTs, with their global attention, might be better suited for tasks like text or multi-modal generation, probably at lower budgets. Again, this depends heavily on the context of use and the dataset size. The context matters.

Hopefully, this mathematical deep dive provides a lens to appreciate the strengths and challenges of both architectures, guiding researchers in their choices for diverse tasks based on domain and context.
