Is ALBERT short for BERT?
Getting to know the differences between two of the most revolutionary state-of-the-art models in Natural Language Processing.
The BERT algorithm is considered a revolution in word semantic representation; it has outperformed the previously known word2vec-based models on various NLP tasks such as Text Classification, Named Entity Recognition, and Question Answering.
The original BERT (BERT-base) model is made of 12 transformer encoder layers, each containing a multi-head self-attention block.
The pretrained model has been trained on a large corpus of unlabeled text data in a self-supervised fashion using the following tasks:
1. Masked Language Model (MLM) loss: The task is "fill in the blanks," where the model uses the context words surrounding a [MASK] token to predict what the masked word should be (see the short example after this list).
2. Next Sentence Prediction (NSP) loss: For an input pair of sentences (A, B), the model estimates how likely it is that sentence B actually follows sentence A in the original text. This mechanism can also serve as a useful signal when evaluating conversational systems.
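To get a feel for the MLM task, you can ask a pretrained BERT model to fill in a masked word using the fill-mask pipeline of the transformers library; the sentence below is just an arbitrary example.

from transformers import pipeline

# Load a pretrained BERT model wrapped in a fill-in-the-blank pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the most likely words for the [MASK] position
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))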
RoBERTa and XLNet are successors of BERT that outperform the original on many benchmarks, by using more training data and a new language-modeling objective, respectively.
BERT is an expensive model in terms of memory and computation time, even on a GPU. The original BERT contains 110M parameters to be fine-tuned, which means training takes a considerable amount of time and storing the model's parameters requires a considerable amount of memory. We would therefore prefer lighter algorithms with performance comparable to BERT. In this post, we look at a recent article that introduces a new version of BERT named ALBERT. The authors of ALBERT claim that their model brings an 89% parameter reduction compared to BERT, with almost the same performance on the benchmarks. We will compare ALBERT with BERT to see whether it can be a good replacement.
The pretrained ALBERT model comes in two versions, "albert-base-v1" (not recommended) and "albert-base-v2" (recommended), which can be downloaded from the Hugging Face website that hosts all the models in the Bertology domain. You can also load the model directly in your code using the transformers module as follows:
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
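As a quick usage sketch (the example sentence is arbitrary, and the output attribute assumes a reasonably recent version of transformers), the loaded tokenizer and model can produce contextual embeddings like this:

import torch

# Tokenize a sentence and run it through the pretrained ALBERT encoder
inputs = tokenizer("ALBERT shares parameters across its encoder layers.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: shape (1, sequence_length, 768)
print(outputs.last_hidden_state.shape)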
And by using this link, you can find the model and the code for performing different tasks on the benchmark data used in the paper.
First, we look at the innovations in ALBERT, which are the reason behind the name "A Lite BERT." We then discuss the question: does ALBERT solve the memory and time consumption issues of BERT?
Innovations in ALBERT
1. Cross-layer parameter sharing is the most significant change to the BERT architecture that created ALBERT. The ALBERT architecture still has 12 transformer encoder blocks stacked on top of each other, just like the original BERT, but it initializes a single set of weights for the first encoder and reuses it for the other 11 encoders. This mechanism reduces the number of "unique" parameters, whereas the original BERT contains a separate set of parameters for every encoder (see Figure 1).
People who are familiar with the fundamentals of Deep Learning know that every layer of a neural network is responsible for capturing certain features or patterns of the data, that deeper layers learn more complicated patterns and concepts, and that, to make this happen, each layer should contain its own parameters, independent from the other layers'. One might therefore conclude that this architecture cannot outperform the BERT architecture, and as you see in the following table, the shared parameters do not improve accuracy significantly; interestingly, though, the results are almost the same as BERT's.
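To make the sharing mechanism concrete, here is a minimal PyTorch sketch; this is not ALBERT's actual implementation, and the layer sizes are only illustrative, but it shows how one set of weights can be applied at every depth:

import torch.nn as nn

# BERT-style stack: 12 encoder layers, each with its own parameters
bert_style = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(12)]
)

# ALBERT-style stack: a single set of parameters reused at every layer
shared_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12)

def albert_style_forward(x, num_layers=12):
    for _ in range(num_layers):
        x = shared_layer(x)  # the same weights are applied at each depth
    return x

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(bert_style) // count(shared_layer))  # 12x more unique parameters in the BERT-style stack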
2. Embedding Factorization: The embedding size in BERT is equal to the size of the hidden layer (768 in the original BERT). ALBERT adds a smaller layer between the vocabulary and the hidden layer, decomposing the embedding matrix of size |V|x|H| (between the vocabulary of size |V| and a hidden layer of size |H|) into two smaller matrices of size |V|x|E| and |E|x|H|. This idea reduces the number of parameters between the vocabulary and the first hidden layer from O(|V|x|H|) to O(|V|x|E| + |E|x|H|), where |E| is the size of the new embedding layer between the hidden layer and the vocabulary (see Figure 2).
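A back-of-the-envelope calculation shows the size of this saving; the numbers below assume ALBERT-base's reported configuration (a vocabulary of 30,000 tokens, hidden size 768, and embedding size 128):

V, H, E = 30000, 768, 128          # vocabulary size, hidden size, embedding size

bert_embedding = V * H             # one |V|x|H| matrix: 23,040,000 parameters
albert_embedding = V * E + E * H   # |V|x|E| plus |E|x|H|: 3,938,304 parameters

print(bert_embedding, albert_embedding)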
3. Sentence Order Prediction (SOP) predicts "+1" for consecutive pairs of sentences from the same document and "-1" if the order of the sentences is swapped or the sentences come from separate documents. The idea is to replace the NSP loss with the SOP loss. SOP shifts the objective from the topic prediction that NSP largely relies on toward coherence prediction. As you see in Table 2, ALBERT with SOP slightly surpasses the NSP variant on four benchmarks, particularly on Stanford Question Answering (SQuAD).
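Based on the description above, SOP training pairs could be constructed roughly as in the following hypothetical sketch, where two consecutive segments of the same document are either kept in order or swapped:

import random

def make_sop_example(segment_a, segment_b):
    # segment_a and segment_b are consecutive segments from the same document
    if random.random() < 0.5:
        return (segment_a, segment_b), +1  # original order: positive example
    return (segment_b, segment_a), -1      # swapped order: negative example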
Discussion
We aimed to focus on memory usage, the time consumed training the model, and whether ALBERT fixes these issues. Table 2 shows the number of parameters in BERT versus ALBERT.
Is ALBERT significantly reducing the training time? The answer is no, because the authors only report the number of unique parameters. These parameters are still repeated 12 times, once for each encoder block, and the model performs backpropagation through all of the repeated layers. For example, if you look at ALBERT-xxlarge, it has 235M parameters, but this is the number of shared parameters, and the real number of parameters that the backpropagation process passes through is indeed 235M x 12 = 2.82B in total. The only small difference might come from the embedding matrix, which contains a lower number of parameters, as explained above.
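You can check what is actually being counted by asking PyTorch for the number of unique parameters in each pretrained model; the printed figures are approximate:

from transformers import AlbertModel, BertModel

albert = AlbertModel.from_pretrained("albert-base-v2")
bert = BertModel.from_pretrained("bert-base-uncased")

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"ALBERT-base unique parameters: {count(albert) / 1e6:.0f}M")  # roughly 12M
print(f"BERT-base unique parameters:   {count(bert) / 1e6:.0f}M")    # roughly 110M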
Is ALBERT reducing the memory issues? The answer is yes, since the parameter-sharing part of ALBERT lets us store only one transformer block instead of 12. Therefore, the size of the stored model is much smaller than the original BERT's.
The Last Word
While I was reading the abstract, I thought this could be a breakthrough in transformers that lets us have lighter models with almost the same performance as BERT. Still, after going through further details, it is not clear to me whether this kind of repeated transformer works well on real problems and can capture the different concepts of text data with only one set of repeated parameters. Fine-tuning might become much harder in less synthetic settings.
ALBERT is a proof of concept that brings promising results, and I still appreciate this direction and believe similar solutions might become more important in the future. If these kinds of ideas can be applied at a larger scale in industry, they would make perfect compact models for small embedded devices like medical devices, phones, and Internet of Things hardware, which require lighter models in terms of memory allocation.