BERT is a method of pre-training language representations: we train a general-purpose “language understanding” model on a large text corpus (BooksCorpus and Wikipedia), and then use that model for the downstream NLP tasks we care about (fine-tuning)¹⁴.
Models pre-trained with BERT achieved better-than-human performance on SQuAD 1.1 and lead on SQuAD 2.0³. BERT relies on massive compute for pre-training: 4 days on 4 to 16 Cloud TPUs, while pre-training on 8 GPUs would take 40–70 days, i.e. it is not feasible. BERT fine-tuning tasks also require large amounts of processing power, which makes it less attractive and practical for all but very specific tasks¹⁸. Typical uses are fine-tuning BERT for a particular task, or feature extraction.
BERT generates multiple contextual, bidirectional representations per word, as opposed to its predecessors (word2vec, GloVe), which assign each word a single fixed vector.
BERT proposes a new training objective: the “masked language model” (MLM)¹³. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
The Transformer is simpler and more parallelizable (GPU friendly), i.e. faster, than an RNN: it uses only straightforward matrix multiplication and a simple feed-forward neural network of a few layers, with no recurrence and no weight sharing. BERT implements only the Transformer encoder part¹⁶:
A BERT sentence classification demo is available for free on a Colab Cloud TPU. The BERT language model is fine-tuned for the MRPC task (semantic equivalence of sentence pairs).
For example, if input sentences are:
Ranko Mosic is one of the world foremost experts in Natural Language Processing arena. In a world where there aren’t that many NLP experts, Ranko is the one.
The model will conclude that these two sentences are equivalent (label = 1).
The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor’s to supply the deficiency.
The sample sentence above is WordPiece-tokenized⁵ (after initial basic tokenization: lowercasing all tokens and splitting on punctuation) into:
['the', 'chips', 'from', 'his', 'wood', 'pile', 'refused', 'to', 'kind', '##le', 'a', 'fire', 'to', 'dry', 'his', 'bed', '-', 'clothes', ',', 'and', 'he', 'had', 'rec', '##ours', '##e', 'to', 'a', 'more', 'provide', '##nt', 'neighbor', "'", 's', 'to', 'supply', 'the', 'deficiency', '.']
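The greedy longest-match-first splitting that produces pieces such as 'kind' + '##le' can be sketched in a few lines. This is a minimal illustration, not the code from BERT's tokenization.py: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for this example, and the real vocabulary file is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one already-lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"the", "chips", "kind", "##le", "rec", "##ours", "##e",
             "provide", "##nt", "un", "##aff", "##able"}

print(wordpiece_tokenize("kindle", toy_vocab))
print(wordpiece_tokenize("recourse", toy_vocab))
print(wordpiece_tokenize("unaffable", toy_vocab))
```

Because the longest candidate is tried first, 'kindle' becomes 'kind' + '##le' only when 'kindle' itself is missing from the vocabulary.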
Below is the layout of a final record written to the output file. 15% of tokens are randomly masked⁶; a segment id is added (0 or 1, i.e. A or B) and the record is padded to 128, the max segment length; segments/sentences can have content from different actual sentences; sentences are randomly shuffled and a randomized next_sentence label is added.
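To make the masking step concrete, here is a rough sketch of how roughly 15% of positions could be picked and altered. The 80% / 10% / 10% split (mask token, random token, unchanged) follows the BERT paper; the function name `mask_tokens`, the fixed seed and the three-word `vocab_words` list are assumptions made for this example, and the real data-generation script also handles details that are omitted here.

```python
import random


def mask_tokens(tokens, vocab_words, masked_lm_prob=0.15, seed=12345):
    """Return (masked copy of tokens, {position: original token})."""
    rng = random.Random(seed)
    output = list(tokens)
    # Special tokens are never selected for masking.
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = max(1, int(round(len(tokens) * masked_lm_prob)))
    labels = {}
    for index in candidates[:num_to_predict]:
        labels[index] = tokens[index]  # what the model must predict
        roll = rng.random()
        if roll < 0.8:
            output[index] = "[MASK]"                 # 80%: mask token
        elif roll < 0.9:
            output[index] = rng.choice(vocab_words)  # 10%: random token
        # remaining 10%: the original token is left in place
    return output, labels


tokens = ["[CLS]", "the", "chips", "from", "his", "wood", "pile", "[SEP]",
          "refused", "to", "kind", "##le", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab_words=["fire", "dry", "bed"])
print(masked)
print(labels)
```

Keeping some selected tokens unchanged or swapping in a random token means the model cannot rely on always seeing the mask token at the positions it has to predict.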
Embedding starts⁷ with a randomly initialized embedding_table (modeling.py); its shape is (30522, 768), i.e. (vocab_size, embedding vector size):
BERT uses tf.nn.embedding_lookup(embedding_table, input_ids) to match each input token_id (input_id) with its initial random 768-dimensional embedding:
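Conceptually, the lookup is just row indexing into the table. The sketch below mimics it in plain Python with toy sizes (100 x 8 instead of 30522 x 768); the helper name `embedding_lookup`, the seed and the 0.02 standard deviation are assumptions for illustration, not values read from the article's code.

```python
import random

vocab_size, hidden_size = 100, 8  # toy sizes; the article's table is (30522, 768)
rng = random.Random(0)

# One randomly initialized row per vocabulary entry.
embedding_table = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                   for _ in range(vocab_size)]


def embedding_lookup(table, input_ids):
    """Replace every token id with its row of the table."""
    return [[table[token_id] for token_id in sequence] for sequence in input_ids]


input_ids = [[5, 17, 5], [42, 0, 99]]  # batch of 2 sequences, 3 token ids each
embedded = embedding_lookup(embedding_table, input_ids)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

The same token id always maps to the same row, so at this stage the vectors carry no context yet; context only enters in the attention layers.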
Multi-head attention starts with the attention mask (1.0 for positions we want to attend to and 0.0 for masked positions); the procedure below returns Tensor(“bert/encoder/mul:0”, shape=(8, 128, 128)).
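As a rough illustration of where a (batch, seq, seq) mask comes from, the sketch below expands a per-token padding mask into one row per "from" position. It uses plain lists and toy sizes rather than tensors; the function name `create_attention_mask` is made up here, and the behaviour (every position may attend from anywhere, but only to real tokens) is my reading of the approach, so treat it as an assumption.

```python
def create_attention_mask(input_mask):
    """input_mask: [batch][seq_len] with 1 for real tokens and 0 for padding.

    Returns [batch][seq_len][seq_len] floats, 1.0 where attention is allowed.
    """
    attention_mask = []
    for row in input_mask:
        to_mask = [float(flag) for flag in row]
        # Every "from" position shares the same row: which "to" positions are real.
        attention_mask.append([list(to_mask) for _ in row])
    return attention_mask


# Batch of 2 sequences padded to length 3 (toy sizes; the article uses 8 x 128).
mask = create_attention_mask([[1, 1, 0], [1, 0, 0]])
print(mask[0])
```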
Below is a call to transformer_model:
The BERT Transformer has a configurable (bert_config.json) number of self-attention heads (it is self-attention because from_tensor and to_tensor are the same: layer_input with shape (1024, 768)):
from_tensor and to_tensor are transformed into query_layer, key_layer and value_layer via tf.layers.dense¹⁰:
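A dense layer without activation is a matrix multiplication plus a bias, applied with a separate weight matrix for queries, keys and values. The following is a minimal pure-Python sketch with toy sizes and hand-picked weights; the helper `dense` and all the numbers are illustrative assumptions, not the TensorFlow implementation.

```python
def dense(inputs, kernel, bias):
    """outputs = inputs * kernel + bias, with no activation.

    inputs: [n][in_dim], kernel: [in_dim][out_dim], bias: [out_dim].
    """
    return [
        [sum(x * w for x, w in zip(row, column)) + b
         for column, b in zip(zip(*kernel), bias)]
        for row in inputs
    ]


# Two tokens with 3 features each (the article's layer_input is (1024, 768)).
layer_input = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]

# Hypothetical weights: one kernel and bias per projection.
query_kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key_kernel = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
value_kernel = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

query_layer = dense(layer_input, query_kernel, [0.5, -0.5])
key_layer = dense(layer_input, key_kernel, [0.0, 0.0])
value_layer = dense(layer_input, value_kernel, [0.0, 0.0])
print(query_layer)
```

All three projections read the same layer_input, which is what makes this self-attention.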
This is the core moment (Scaled Dot-Product Attention in Figure 2 below): the dot-product similarity between query and key (the attention, i.e. attention_score) is calculated:
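The score computation can be sketched for a single head as follows: dot products between each query and every key, scaled by 1/sqrt(size_per_head), then a softmax over the keys. The function name `softmax_attention` is made up for this example, and the way the mask is applied (adding a large negative number to masked positions before the softmax) is an assumption about the implementation, stated here for illustration.

```python
import math


def softmax_attention(query, key, mask_row=None):
    """Scaled dot-product scores followed by a softmax over the keys.

    query, key: [seq_len][size_per_head]; mask_row: [seq_len] of 1.0 / 0.0.
    """
    scale = 1.0 / math.sqrt(len(query[0]))
    all_probs = []
    for q in query:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        if mask_row is not None:
            # Masked positions get a large negative score, so softmax gives ~0.
            scores = [s + (1.0 - m) * -10000.0 for s, m in zip(scores, mask_row)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        all_probs.append([e / total for e in exps])
    return all_probs


query = [[1.0, 0.0], [0.0, 1.0]]
key = [[1.0, 0.0], [0.0, 1.0]]
probs = softmax_attention(query, key)
masked_probs = softmax_attention(query, key, mask_row=[1.0, 0.0])
print(probs)
print(masked_probs)
```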
Standard dropout is applied (with keep probability 1.0 - 0.1 = 0.9):
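For readers unfamiliar with dropout, a minimal sketch of the usual "inverted" form: each value is zeroed with probability 0.1 and the survivors are scaled by 1/0.9 so the expected sum is unchanged. The helper name `dropout` and the sample numbers are illustrative only.

```python
import random


def dropout(values, dropout_prob, rng):
    """Zero each value with probability dropout_prob; rescale the rest."""
    keep_prob = 1.0 - dropout_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
attention_row = [0.1, 0.2, 0.3, 0.4]
dropped = dropout(attention_row, dropout_prob=0.1, rng=rng)
print(dropped)
```

Dropout is only active during training; at inference time the values pass through unchanged.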
Finally, the scaled dot-product attention output is the matrix multiplication of attention_probs and value_layer:
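A small worked example of that last multiplication, with hand-picked numbers (assumed purely for illustration): each output row is a weighted average of the value vectors, with weights taken from the matching row of attention_probs.

```python
def matmul(a, b):
    """Plain matrix product of a [n][k] and b [k][m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


attention_probs = [[0.75, 0.25],   # token 0 attends mostly to itself
                   [0.5, 0.5]]     # token 1 attends equally to both tokens
value_layer = [[1.0, 0.0, 2.0],
               [3.0, 4.0, 0.0]]

context_layer = matmul(attention_probs, value_layer)
print(context_layer)  # [[1.5, 1.0, 1.5], [2.0, 2.0, 1.0]]
```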
We have now completed computation for a single attention layer¹¹.
Next comes the feed-forward part, which is split into three layers of neural nets. Only the intermediate step has a gelu activation function; the outer layers feature dropout and layer normalization¹² (for faster training). layer_outputs with shape (1024, 768) are appended to the all_layers list (each layer_output is one of the Nx layers in Figure 1 above):
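A simplified sketch of the feed-forward step for a single token vector: expand with a gelu-activated dense layer, project back to the hidden size, add the input back (residual connection) and apply layer normalization. Several things here are assumptions for illustration: dropout is left out, layer normalization has no learned scale or shift, gelu uses the exact erf form, and the weights and epsilon are arbitrary toy values.

```python
import math


def gelu(x):
    """Gaussian Error Linear Unit, exact form."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vector, epsilon=1e-12):
    """Normalize to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]


def dense(vector, kernel, bias):
    """vector * kernel + bias for one vector; kernel is [in_dim][out_dim]."""
    return [sum(x * w for x, w in zip(vector, col)) + b
            for col, b in zip(zip(*kernel), bias)]


def feed_forward(attention_output, w_in, b_in, w_out, b_out):
    intermediate = [gelu(h) for h in dense(attention_output, w_in, b_in)]
    projected = dense(intermediate, w_out, b_out)
    # Residual connection followed by layer normalization.
    return layer_norm([p + a for p, a in zip(projected, attention_output)])


# Toy sizes: hidden size 2, intermediate size 4.
w_in = [[0.5, -0.5, 1.0, 0.0], [0.0, 1.0, -1.0, 0.5]]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
layer_output = feed_forward([1.0, -1.0], w_in, [0.0] * 4, w_out, [0.0] * 2)
print(layer_output)
```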
layer_outputs are finally brought back to their original shape, (32, 128, 768):
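The reshape itself only regroups rows: a flat (batch_size * seq_length, hidden) matrix is cut back into one block of seq_length rows per example. A toy sketch with plain lists, where the helper name `reshape_to_3d` and the sizes are chosen for this example:

```python
def reshape_to_3d(matrix, batch_size, seq_length):
    """Turn [batch_size * seq_length][hidden] into [batch_size][seq_length][hidden]."""
    if len(matrix) != batch_size * seq_length:
        raise ValueError("row count does not match batch_size * seq_length")
    return [matrix[b * seq_length:(b + 1) * seq_length] for b in range(batch_size)]


# 2 examples x 3 tokens, hidden size 2, stored as 6 flat rows.
flat = [[float(i), float(i) + 0.5] for i in range(6)]
restored = reshape_to_3d(flat, batch_size=2, seq_length=3)
print(restored[1][0])  # first token of the second example: [3.0, 3.5]
```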
¹ The NY Times wrote about BERT. In a nutshell, BERT is a humongous encoder: it features a state-of-the-art contextual representation of a huge text corpus: Wikipedia/BooksCorpus -> BERT -> word encodings (model, i.e. weights).
..“encoder-only” models like BERT are designed to produce a single prediction per input token or a single prediction for an entire input sequence. This makes them applicable for classification or span prediction tasks but not for generative tasks like translation or abstractive summarization
The Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by…
⁵ Tokenizes a piece of text into its word pieces. For example, “unaffable” = [“un”, “##aff”, “##able”]; uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. Sentences are randomly shuffled. Each token is assigned a token_id (unique across all segments); for example, the token_id of ‘wood’ is 3536.
⁶ Please refer to page 6 of the BERT paper for more details on why and how masking is done.
⁷ Next, we get the embedding for each word in the sequence. Each word of the sequence is mapped to a emb_dim dimensional vector that the model will learn during training. You can think about it as a vector look-up for each token. The elements of those vectors are treated as model parameters and are optimized with back-propagation just like any other weights ( Dissecting Bert )
¹⁰ This layer implements the operation (a standard NN layer): outputs = activation(inputs * kernel + bias), where activation is the activation function passed as the activation argument (if not None), kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only if use_bias is True).
¹¹ Multi-head attention layers are concatenated:
attention_output = tf.concat(attention_heads, axis=-1)
¹³ This is an example of self-supervised learning
¹⁴ Real-life BERT-based applications are: Google search improvement (Oct 25, 2019 update: I am guessing BERT is used in a supervisory role for search results reranking: “In a recent talk at Google Berlin, Jacob Devlin described how Google are (sic) using his BERT architectures internally. The models are too large to serve in production, but they can be used to supervise a smaller production model”); sentiment analysis; classification.
¹⁵ Interview with BERT first author Jacob Devlin
¹⁶ GPT-2 uses Transformer decoders.
¹⁷ The input sequence is split into vectorized tokens; logically, each token is a query that is correlated with the rest of the tokens, the keys (and their corresponding values).
¹⁸ General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train.