Understanding BigBird | Is Google BigBird gonna be the new leader in the NLP domain?

Jyoti Yadav
5 min read · Sep 12, 2020

In 2018, Google released a big fish in the market named BERT (Bidirectional Encoder Representations from Transformers). It brought state-of-the-art accuracy to NLP and NLU tasks. The distinguishing factor was bidirectional training, which proved to hold great potential for capturing a deeper sense of language context. Before BERT, models analyzed text sequences either from left to right or with combined left-to-right and right-to-left training.

Recently, Google has released another breakthrough in the domain by introducing BigBird.

What is new?

BigBird addresses the "quadratic resource requirement" of the full attention mechanism, which was the main roadblock in scaling transformers to long sequences.

It replaces the full quadratic attention mechanism with a mix of random attention, window attention, and global attention, reducing the memory requirement from quadratic to linear in the sequence length. Not only does this allow the processing of much longer sequences than BERT can handle, but the paper also shows that BigBird retains theoretical guarantees of universal approximation and Turing completeness.

What is the quadratic memory requirement?

Transformer-based models have been shown to perform much better than the alternatives, and BERT is essentially built on the full attention mechanism. Suppose the model has to process 'n' tokens. Under full attention, every token attends to every other token, so each of the 'n' query positions must be connected to all 'n' key positions. This yields an n × n attention matrix, and therefore n² computation and memory requirements.

A small illustration of the full attention mechanism
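To see where the n² comes from, here is a small NumPy sketch (not from the paper) that materializes the full attention-score matrix for a few toy sequence lengths; the memory for the score matrix alone grows quadratically with n:

```python
import numpy as np

def full_attention_scores(n: int, d: int = 64, seed: int = 0) -> np.ndarray:
    """Build the n x n score matrix that full (quadratic) attention materializes."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d))  # one query vector per token
    K = rng.standard_normal((n, d))  # one key vector per token
    return Q @ K.T / np.sqrt(d)      # every token scores against every token: n * n entries

for n in (512, 1024, 4096):
    scores = full_attention_scores(n)
    # Memory for the score matrix alone grows as n^2.
    print(f"n = {n:5d} -> matrix {scores.shape}, {scores.nbytes / 1e6:.1f} MB")
```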

This puts a hard limit on the sequence length: BERT can analyze only 512 tokens at a time. BigBird removes this quadratic dependency by introducing a combination of random attention, window attention, and global attention, which brings the memory requirement down to linear.

What are the different forms of attention mechanisms used in BigBird?

BigBird uses a combination of the following three types of attention mechanisms:

Building blocks of the attention mechanism used in BigBird. White color indicates the absence of attention. (a) random attention with r = 2, (b) sliding window attention with w = 3, (c) global attention with g = 2, (d) the combined BigBird model (https://arxiv.org/abs/2007.14062)
  1. Random mechanism (path lengths are logarithmic): With the random parameter r = 2, every token attends to 2 randomly chosen tokens, which reduces the computational complexity to O(r*n). Under full attention, information can flow from any node to any other node in a single step because the graph is fully connected; under random attention the graph is sparse, so information may need more steps to travel from one node to another, but in a random graph the path length between nodes grows only logarithmically.
  2. Window mechanism (neighbors are important): With a window parameter w, every node attends to itself and to its neighbors, roughly w/2 on each side. Since w is a constant, the complexity reduces to O(w*n).
  3. Global mechanism (star-shaped network): A small set of g global tokens attends to every token in the sequence, and every token attends back to these global tokens, forming a star-shaped connection pattern at a cost of O(g*n). A minimal sketch combining all three patterns into a single attention mask follows this list.
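As a rough illustration (not from the paper, and at the token level rather than BigBird's actual block level), the three patterns can be combined into a single Boolean mask like this:

```python
import numpy as np

def bigbird_style_mask(n: int, r: int = 2, w: int = 3, g: int = 2, seed: int = 0) -> np.ndarray:
    """Boolean n x n mask: True where a query token is allowed to attend to a key token."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # 1. Random attention: each token attends to r randomly chosen tokens.
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True

    # 2. Window attention: each token attends to itself and to its
    #    neighbours within a window of width w (w // 2 on each side).
    half = w // 2
    for i in range(n):
        mask[i, max(0, i - half):min(n, i + half + 1)] = True

    # 3. Global attention: the first g tokens attend to everything and
    #    everything attends to them (the "star" pattern).
    mask[:g, :] = True
    mask[:, :g] = True
    return mask

mask = bigbird_style_mask(n=16)
print(mask.sum(), "allowed connections out of", mask.size)  # grows linearly in n, not as n^2
```

The number of allowed connections per row stays roughly r + w + g, a constant, so the total cost scales linearly with the sequence length.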

Window and global attention are used in Longformer as well; BigBird adds random attention on top of them. Random attention adds more connections to the overall mechanism, which improves the results of the model (more attention is better).

Increase in input sequence length

To make this sparse attention as effective as full attention, the model requires more layers than a full-attention model would.

Hyperparameters for the two BigBird base models for MLM (https://arxiv.org/abs/2007.14062)

Here are the hyperparameters of BIGBIRD-ITC:

Block length: b = 64

Number of global tokens: g = 2*b

Window length: w = 3*b

Number of random tokens: r = 3*b

Therefore, each query token attends to a total of 2*b + 3*b + 3*b = 8*b = 512 tokens.

The number of hidden layers is 12, and the hidden size (the number of units within each hidden layer) is 768.

With this configuration, the architecture is able to process sequences of up to 4096 tokens.
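As a quick sanity check of the arithmetic above (plain Python, numbers taken from the hyperparameter table):

```python
b = 64                        # block length
g = 2 * b                     # number of global tokens
w = 3 * b                     # window length
r = 3 * b                     # number of random tokens

per_query_budget = g + w + r  # tokens each query attends to
seq_len = 4096                # maximum sequence length

print(per_query_budget)            # 512  (= 8 * b)
print(per_query_budget * seq_len)  # ~2.1M sparse attention entries at n = 4096
print(seq_len * seq_len)           # ~16.8M entries for full attention at n = 4096
```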

In the case of BIGBIRD-ETC, on the other hand, they use 0 random tokens and rely more heavily on global tokens. Still, random attention is the key component that distinguishes BigBird from Longformer.
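For readers who want to experiment, the Hugging Face transformers library ships a BigBird implementation (not part of the paper or this post) whose block-sparse attention exposes the same knobs; a minimal sketch, assuming transformers and PyTorch are installed:

```python
import torch
from transformers import BigBirdConfig, BigBirdModel

# Configuration loosely mirroring the BIGBIRD-ITC base setup described above.
config = BigBirdConfig(
    attention_type="block_sparse",  # sparse attention instead of full attention
    block_size=64,                  # b = 64
    num_random_blocks=3,            # 3 random blocks per query block (r = 3 * b tokens)
    num_hidden_layers=12,
    hidden_size=768,
    max_position_embeddings=4096,   # sequences up to 4096 tokens
)
model = BigBirdModel(config)

# A dummy 4096-token input, just to show that the long sequence fits.
input_ids = torch.randint(0, config.vocab_size, (1, 4096))
outputs = model(input_ids)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 4096, 768])
```

The num_random_blocks knob corresponds to the random-attention component that Longformer lacks; switching attention_type to "original_full" falls back to quadratic full attention, which is only practical for much shorter sequences.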

Results

BigBird outperforms Longformer and BERT, as shown below. In 5 out of the 7 cases, BIGBIRD-ETC (which uses 0 random tokens) performs best.

QA Dev results using Base size models. We report accuracy for WikiHop and F1 for HotpotQA, Natural Questions, and TriviaQA (https://arxiv.org/abs/2007.14062)

Compared to the standard models (presented below), BigBird holds its place and sets a new state of the art for Natural Questions Long Answer (LA), TriviaQA Verified, and WikiHop.

Fine-tuning results on Test set for QA tasks (https://arxiv.org/abs/2007.14062)

To conclude, BigBird is a sparse attention mechanism (with the introduction of random attention) that brings the quadratic memory requirement of the full attention model down to a linear one. The paper provides several theoretical proofs for this claim. However, complexities may arise because more layers are required (due to the sparse connections). And notably, the outperforming, state-of-the-art BIGBIRD-ETC model does not use random attention.
