Understanding the Hidden Bias of Transformers in Machine Learning

A General Summary Without Any Math.

Freedom Preetham
Autonomous Agents
4 min read · Jul 21, 2024


Transformers have garnered significant attention for their unparalleled flexibility and performance across a multitude of tasks. Traditionally, it has been believed that these models possess weak inductive biases, implying that they do not inherently favor any specific kind of data structure and hence require vast amounts of data to learn effectively. However, recent research suggests that this might not be entirely true.

I was reading the paper “Towards Understanding Inductive Bias in Transformers: A View From Infinity,” which challenges this conventional wisdom by revealing that Transformers have a more nuanced inductive bias than previously thought. I like the mathematical treatment in the paper: the authors show that the blanket statement “Transformers have weak inductive biases” does not hold in general, and that the reality is more nuanced than that.

I still believe Transformers can generally be categorized as having weak inductive biases compared to other models, but I will save that comparison for a separate, math-based article so as not to bias you here. Nevertheless, the arguments in this paper are well backed by proofs.

Here is a non-mathematical summary of the paper for general readers; the deep dive into its math will follow in that separate article.

Inductive Bias in Machine Learning

Inductive bias refers to the set of assumptions a model makes about the data to enable it to generalize from limited examples. Models with strong inductive biases are designed with built-in assumptions that make them particularly adept at learning specific patterns or structures. Conversely, models with weak inductive biases are highly flexible and can adapt to a wide range of tasks but often require more data to achieve the same level of performance.

Key Insights from the Paper

Permutation Symmetry Bias
The paper argues that Transformers actually have a bias towards permutation-symmetric functions. That is, they are naturally inclined to favor functions or patterns that do not change when the order of the input elements (tokens) is shuffled. This is contrary to the belief that Transformers have weak inductive biases.

Put plainly, Transformers have a natural preference for patterns that remain the same even when the order of their parts changes. Imagine a list of words: “cat, dog, bird.” If you shuffle it to “dog, bird, cat,” a Transformer is predisposed to treat it as the same overall pattern. This contradicts the earlier belief that Transformers don’t prefer any specific patterns.
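To make “permutation symmetric” concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) contrasting a function that ignores token order with one that depends on it:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 "tokens", each an 8-dim embedding
perm = np.array([2, 0, 4, 1, 3])   # a fixed reshuffling of the 5 positions
shuffled = tokens[perm]            # same tokens, different order

# A permutation-symmetric function: the output ignores token order.
def bag_of_tokens(x):
    return x.mean(axis=0)

# An order-sensitive function: the output depends on each token's position.
def positional_weighted(x):
    weights = np.arange(1, len(x) + 1)[:, None]
    return (weights * x).mean(axis=0)

print(np.allclose(bag_of_tokens(tokens), bag_of_tokens(shuffled)))              # True
print(np.allclose(positional_weighted(tokens), positional_weighted(shuffled)))  # False
```

The paper’s claim, loosely, is that Transformers lean towards functions of the first kind.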

Representation Theory of Symmetric Group
The authors use mathematical tools from the representation theory of the symmetric group to show that Transformers tend to be biased towards these symmetric functions. They provide quantitative analytical predictions showing that when the dataset possesses a degree of permutation symmetry, the learnability of the functions improves.

Example: Think of a bag of building blocks. No matter how you pour them out or rearrange them, it is still the same set of blocks. Transformers can quickly recognize and learn structures that are order-free in this way.
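To give a rough feel for “degree of symmetry” in code (my own sketch, not the paper’s group-theoretic machinery), one can score how much a function’s output moves when its input is shuffled:

```python
import numpy as np

rng = np.random.default_rng(1)

def symmetry_score(f, x, n_perms=50):
    """Crude score in (0, 1]: 1 means f is unchanged by shuffling x's rows."""
    base = f(x)
    diffs = [np.linalg.norm(f(x[rng.permutation(len(x))]) - base)
             for _ in range(n_perms)]
    return 1.0 / (1.0 + np.mean(diffs))  # squash the spread into (0, 1]

x = rng.normal(size=(6, 4))
print(symmetry_score(lambda t: t.sum(axis=0), x))  # ~1.0: fully symmetric
print(symmetry_score(lambda t: t[0], x))           # < 1.0: order-sensitive
```

The paper makes this kind of idea rigorous via the representation theory of the symmetric group.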

Gaussian Process Limit
By studying Transformers in the infinitely over-parameterized Gaussian process (GP) limit, the authors show that the inductive bias can be seen as a concrete Bayesian prior. In this limit, the inductive bias of the Transformer becomes more apparent and can be analytically characterized.

Example: Imagine a Transformer as a giant library containing every possible book. Once you understand how the library is organized, you know which books are easy to reach. Similarly, characterizing the Transformer’s prior tells us which functions it will learn quickly.
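For readers who want a feel for what a “prior over functions” means, here is a generic GP sampling sketch. The RBF kernel below is a stand-in of my own choosing; the paper derives the actual kernel from the infinite-width Transformer architecture itself:

```python
import numpy as np

rng = np.random.default_rng(2)

# Generic RBF kernel as a placeholder; the paper's kernel comes from
# the infinitely over-parameterized Transformer, not from this formula.
def rbf_kernel(a, b, length_scale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

xs = np.linspace(-3, 3, 100)
K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))  # jitter for numerical stability

# Each sample is a whole function the prior considers plausible before
# seeing any data; this is the "inductive bias" made concrete.
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(samples.shape)  # (3, 100): three candidate functions under the prior
```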

Learnability and Scaling Laws
The paper presents learnability bounds and scaling laws that relate to how easily a Transformer can learn a function based on the context length and the degree of symmetry in the dataset. It shows that more symmetric functions (functions invariant to permutations) require fewer examples to learn.

Example: If you’re teaching a child to recognize shapes, they learn faster if the shapes are always the same, regardless of how they are arranged on a page. Similarly, Transformers learn shuffle-resistant patterns quickly.
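A hand-wavy way to see why symmetric functions need fewer examples (my illustration, assuming a label that is unchanged under shuffling): every permutation of a labeled input is effectively a free extra training example.

```python
from itertools import permutations
import numpy as np

rng = np.random.default_rng(3)

# If the label is permutation-invariant, one labeled example expands
# into n! equally valid training pairs at no extra labeling cost.
def augment(x, y):
    return [(x[list(p)], y) for p in permutations(range(len(x)))]

x = rng.normal(size=3)
y = x.sum()                # a permutation-invariant label
print(len(augment(x, y)))  # 6 = 3! training pairs from a single example
```

The paper’s scaling laws capture this effect quantitatively, including how it interacts with context length.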

Empirical Evidence
The authors also provide empirical evidence from the WikiText dataset, showing that natural language possesses a degree of permutation symmetry. This supports their theoretical findings and suggests that Transformers are particularly well-suited to tasks involving natural language because of this inherent symmetry bias.

Example: When reading a sentence, the meaning often remains the same even if you change the word order slightly, like “The cat sat on the mat” and “On the mat, the cat sat.” Transformers excel at picking up such patterns.
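As a crude illustration (hypothetical, not the paper’s actual WikiText measurement): order-free statistics of a sentence, like its bag of words, survive shuffling, while order-dependent statistics mostly do not.

```python
from collections import Counter
import random

random.seed(4)
sentence = "the cat sat on the mat".split()
shuffled = sentence[:]
random.shuffle(shuffled)

def unigrams(ws): return Counter(ws)              # order-free statistic
def bigrams(ws): return Counter(zip(ws, ws[1:]))  # order-sensitive statistic

print(unigrams(sentence) == unigrams(shuffled))  # True: bag of words is invariant
kept = sum((bigrams(sentence) & bigrams(shuffled)).values())
print(kept, "of", sum(bigrams(sentence).values()), "bigrams preserved")
```

The more of a dataset’s useful signal that lives in order-free statistics, the more a permutation-symmetry bias helps.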

Implications for Machine Learning

Transformers’ Bias
This paper suggests that Transformers do have an inductive bias, specifically towards permutation symmetry. This means they are not as bias-free as previously thought and have a natural tendency to favor certain types of patterns.

Practical Application
Understanding this bias can help in designing better models and training regimes that leverage this property. For instance, knowing that Transformers excel at learning symmetric patterns can influence how we preprocess data or how we structure tasks for these models.
