Google BERT — Pre-Training and Fine-Tuning for NLP Tasks
The recently released BERT paper and code generated a lot of excitement in the ML/NLP community¹. BERT is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus ( BooksCorpus and Wikipedia ), and then use that model for downstream NLP tasks we care about ( fine tuning ), such as question answering ( SQuAD ).
Models pre-trained with BERT achieved better-than-human performance on SQuAD 1.1 and lead on SQuAD 2.0³. BERT relies on massive compute for pre-training ( 4 days on 4 to 16 Cloud TPUs; pre-training on 8 GPUs would take 40–70 days ). Typical uses are fine-tuning BERT for a particular task, or using it for feature extraction.
BERT naturally builds on ELMo, continuing in the language-model direction ( as opposed to shallow word-embedding approaches like word2vec and GloVe ). It generates contextual, bidirectional representations⁴.
BERT proposes a new training objective: the “masked language model” ( MLM ). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of each masked word based only on its context.
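A minimal sketch of this objective ( the function name is mine; the real implementation additionally keeps some selected tokens unchanged or substitutes random ones, 80/10/10, per the paper ):

```python
import random

random.seed(0)
MASK_ID = 103   # id of the [MASK] token in BERT's uncased vocab

def mask_tokens(token_ids, mask_prob=0.15, mask_id=MASK_ID):
    """Randomly replace ~mask_prob of the tokens with [MASK]; labels keep the original ids."""
    masked = list(token_ids)
    labels = [-1] * len(token_ids)      # -1 = not masked, no loss computed at this position
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok             # the model must recover this vocabulary id
            masked[i] = mask_id
    return masked, labels
```

The model only incurs loss at the masked positions; all other positions carry the dummy label.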
The basic BERT building block is the Transformer ( as opposed to RNN-based options like BiLSTM ). Central to the Transformer is the notion of attention — dynamically highlighting the key, value pairs most similar to a query vector.
The Transformer² is simpler and more parallelizable ( GPU friendly ), i.e. faster than an RNN — it uses only straightforward matrix multiplication and a simple few-layer feed-forward neural network, with no recurrence and no weight sharing:
A BERT sentence classification demo is available for free on a Colab Cloud TPU. The generic BERT model is there fine-tuned for the MRPC task ( determining whether sentence pairs are semantically equivalent ).
For example, if the input sentences are:
Ranko Mosic is one of the world’s foremost experts in the Natural Language Processing arena. In a world where there aren’t that many NLP experts, Ranko is the one.
The model will conclude these two sentences are equivalent ( label = 1 ).
We now proceed with a more detailed analysis⁴ that includes code snippets and variable dumps.
The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor’s to supply the deficiency.
The above sample sentence is WordPiece-tokenized⁵ ( following initial basic tokenization — conversion of all tokens to lower case and punctuation splitting ) into:
[‘the’, ‘chips’, ‘from’, ‘his’, ‘wood’, ‘pile’, ‘refused’, ‘to’, ‘kind’, ‘##le’, ‘a’, ‘fire’, ‘to’, ‘dry’, ‘his’, ‘bed’, ‘-’, ‘clothes’, ‘,’, ‘and’, ‘he’, ‘had’, ‘rec’, ‘##ours’, ‘##e’, ‘to’, ‘a’, ‘more’, ‘provide’, ‘##nt’, ‘neighbor’, “‘“, ‘s’, ‘to’, ‘supply’, ‘the’, ‘deficiency’, ‘.’]
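The greedy longest-match-first algorithm behind WordPiece ( described in footnote ⁵ ) can be sketched as follows — a toy vocabulary stands in for the real 30522-entry one:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first: take the longest vocab prefix, then continue with '##' pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub        # non-initial pieces carry the '##' continuation marker
            if sub in vocab:
                cur = sub
                break
            end -= 1                    # shrink the candidate until it is in the vocabulary
        if cur is None:
            return ["[UNK]"]            # no piece matches at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "kind", "##le"}
print(wordpiece("unaffable", vocab))   # ['un', '##aff', '##able']
print(wordpiece("kindle", vocab))      # ['kind', '##le']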
Below is the layout of a final record written to the output file. 15% of tokens are randomly masked⁶; a segmentation id is added ( 0 or 1, i.e. A or B, padded to 128 — the max segment length; segments/sentences can contain content from different actual sentences ); sentences are randomly shuffled and a randomized next_sentence label is added.
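A sketch of how one such record is assembled — the helper name is mine, and this only mirrors create_pretraining_data.py at a high level ( masking is handled separately, as above ):

```python
def build_record(tokens_a, tokens_b, is_random_next, max_seq_length=128):
    """Assemble one training record: [CLS] A [SEP] B [SEP], segment ids, zero padding."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)  # A = 0, B = 1
    pad = max_seq_length - len(tokens)
    input_mask = [1] * len(tokens) + [0] * pad   # real tokens vs. padding
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    next_sentence_label = 1 if is_random_next else 0
    return tokens, segment_ids, input_mask, next_sentence_label
```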
Embedding starts⁷ with a randomly initialized embedding_table ( modeling.py ); its shape is (30522, 768), i.e. ( vocab_size, embedding vector size ):
BERT uses tf.nn.embedding_lookup(embedding_table, input_ids) to match each input token_id ( input_id ) with an initial random 768-dimensional embedding:
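What embedding_lookup does is just row indexing into the table; a NumPy sketch with the shapes from the text ( the token ids are illustrative ):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 30522, 768
# BERT initializes with a truncated normal, stddev 0.02
embedding_table = rng.normal(scale=0.02, size=(vocab_size, hidden))

input_ids = np.array([[1996, 9295, 2013]])   # ( batch, seq_len ); illustrative ids
embeddings = embedding_table[input_ids]      # same result as tf.nn.embedding_lookup
print(embeddings.shape)                      # (1, 3, 768)
```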
The next step is to modify the initial random embedding with positional encoding. The BERT GitHub Python code does not implement positional embedding updates as described in section 3.5 of Attention Is All You Need⁸.
It starts with token_type_table:
token_type_table.shape is (2, 768) — two randomly generated 768-dimensional vectors:
[[0.00760975899 -0.0229291413 -0.0163771752…]…]
The initial batch of token_type_ids ( segment ids of shape (8, 128) ) is flattened into a single 1024-element vector, converted into a one-hot representation, matrix-multiplied with token_type_table and added ( matrix addition ) to the output:
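The one-hot matmul is just a table lookup in disguise; a NumPy sketch of the equivalence, with the shapes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
token_type_table = rng.normal(size=(2, 768))        # one 768-dim vector per segment id

token_type_ids = rng.integers(0, 2, size=(8, 128))  # ( batch, seq_len )
flat = token_type_ids.reshape(-1)                   # 1024 segment ids
one_hot = np.eye(2)[flat]                           # (1024, 2)
token_type_embeddings = one_hot @ token_type_table  # (1024, 768)

# identical to direct row indexing into the table
assert np.allclose(token_type_embeddings, token_type_table[flat])
```

The code takes the one-hot route because, for a tiny vocabulary of 2, the matmul is faster on TPU than a gather.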
The output is further salted with randomly initialized positional embeddings⁹:
We now move on to the next part — the Encoder block, Multi-Head Attention:
The first step is to create the attention mask ( attention_mask is 1.0 for positions we want to attend to and 0.0 for masked positions ) — the procedure below returns Tensor(“bert/encoder/mul:0”, shape=(8, 128, 128)):
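The (8, 128, 128) mask is built by broadcasting the per-token input mask across all query positions; a NumPy sketch of what create_attention_mask_from_input_mask computes:

```python
import numpy as np

batch, seq_len = 8, 128
input_mask = np.ones((batch, seq_len), dtype=np.float32)
input_mask[:, 100:] = 0.0     # pretend the last 28 positions of each sequence are padding

# every query position may attend to each non-padded key position
attention_mask = np.ones((batch, seq_len, 1), dtype=np.float32) * input_mask[:, None, :]
print(attention_mask.shape)   # (8, 128, 128)
```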
Below is the call to transformer_model:
The BERT Transformer has a configurable ( bert_config.json ) number of self-attention heads ( it is self-attention because from_tensor and to_tensor are the same — layer_input, with shape (1024, 768) ):
from_tensor and to_tensor are transformed into query_layer, key_layer and value_layer via tf.layers.dense¹⁰:
This is the core moment ( Scaled Dot-Product Attention in Figure 2 below ) — the dot-product similarity ( attention, i.e. attention_score ) between query and key is calculated and softmaxed — converted into probabilities ( attention probabilities that add up to 1 ):
A standard dropout is applied ( with keep probability 1.0 – 0.1 = 0.9 ):
Finally, scaled dot-product attention is the matrix multiplication of attention_probs and value_layer:
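The whole step, from the dense Q/K/V projections through softmax to the final matmul, sketched in NumPy ( single head with head size 64 as in the paper; dropout and the dense-layer biases omitted ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, hidden, head_size = 128, 768, 64

layer_input = rng.normal(size=(seq_len, hidden))
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(hidden, head_size)) for _ in range(3))

query_layer = layer_input @ Wq    # what tf.layers.dense produces ( bias omitted )
key_layer   = layer_input @ Wk
value_layer = layer_input @ Wv

# similarity of every query with every key, scaled by sqrt(head_size)
attention_scores = query_layer @ key_layer.T / np.sqrt(head_size)
attention_probs = softmax(attention_scores)    # each row sums to 1
context_layer = attention_probs @ value_layer  # (128, 64)
```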
We have now completed computation for a single attention layer¹¹.
Next comes the feed-forward part, split into three dense layers; only the intermediate step has the gelu activation function, while the outer layers feature dropout and layer normalization¹² ( for faster training ). layer_outputs of shape (1024, 768) are appended to the all_layers list ( each layer_output is one of the Nx layers in Figure 1 above ):
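A sketch of this feed-forward sub-block ( intermediate size 3072 as in BERT-base; dropout omitted, layer norm simplified to have no learned scale/bias ):

```python
import numpy as np

def gelu(x):
    # BERT's tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-12):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
hidden, intermediate = 768, 3072
attention_output = rng.normal(size=(1024, hidden))

W1 = rng.normal(scale=0.02, size=(hidden, intermediate))
W2 = rng.normal(scale=0.02, size=(intermediate, hidden))

intermediate_output = gelu(attention_output @ W1)   # only this step is gelu-activated
# project back down, add the residual connection, then normalize
layer_output = layer_norm(intermediate_output @ W2 + attention_output)
print(layer_output.shape)   # (1024, 768)
```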
layer_outputs are finally brought back to their original shape ( (32, 128, 768) ):
The green marked area shows the Transformer encoder part we have covered so far. We now move on to the red-arrow-marked entry into the Transformer decoder part ( both encoder and decoder also feature BERT-specific token masking ):
get_masked_lm_output is a function call whose inputs are:
- bert_config parameters
- model.get_sequence_output(); this is the last layer of the output from the call to transformer_model above
self.sequence_output = self.all_encoder_layers[-1]
- model.get_embedding_table() is the token embedding table ( explained above )
- masked_lm_positions, masked_lm_ids, masked_lm_weights are tensors of shape (8, 20) that contain masked token positions, ids and weights
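Before the masked-LM loss, the hidden vectors at the masked positions are gathered out of sequence_output; a NumPy sketch of what gather_indexes computes, with the shapes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, max_preds = 8, 128, 768, 20

sequence_output = rng.normal(size=(batch, seq_len, hidden))
masked_lm_positions = rng.integers(0, seq_len, size=(batch, max_preds))

# flatten batch and sequence dims, then pick out the masked positions
flat_offsets = (np.arange(batch) * seq_len)[:, None]        # (8, 1)
flat_positions = (masked_lm_positions + flat_offsets).reshape(-1)
flat_sequence = sequence_output.reshape(batch * seq_len, hidden)
masked_output = flat_sequence[flat_positions]               # (160, 768)
```

These 160 vectors are then projected against the embedding table to produce vocabulary logits for the masked-LM loss.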
¹ The NY Times wrote about BERT.
² A quick way to try Transformer is Google’s Tensor2Tensor library of models and datasets.
³ Stanford Question Answering Dataset ( SQuAD ) is a new reading comprehension dataset, consisting of questions posed by… ( rajpurkar.github.io )
⁵ Tokenizes a piece of text into its word pieces; for example, “unaffable” = [“un”, “##aff”, “##able”]. Uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. Sentences are randomly shuffled. Each token is assigned a token_id ( unique across all segments ); for example, the ‘wood’ token_id is 3536.
⁶ Please refer to page 6 of the BERT paper for more details on why and how masking is done.
⁷ Next, we get the embedding for each word in the sequence. Each word of the sequence is mapped to an emb_dim-dimensional vector that the model will learn during training. You can think of it as a vector look-up for each token. The elements of those vectors are treated as model parameters and are optimized with back-propagation just like any other weights ( Dissecting Bert ).
⁸ We assume the PE described below is implemented in the BERT C++ code.
⁹ This is obviously not a proper PE implementation, as no learning occurs and the PE is reinitialized on each loop.
¹⁰ This layer implements the operation ( a standard NN layer ): outputs = activation(inputs * kernel + bias), where activation is the activation function passed as the activation argument ( if not None ), kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer ( only if use_bias is True ).
¹¹ Multi-head attention outputs should be concatenated: “On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2” ( Attention Is All You Need ). This code exists in the GitHub BERT Python code, but is never executed:
attention_output = tf.concat(attention_heads, axis=-1)
Parallel multi-head execution is the Transformer’s major point. The GitHub code is all sequential, though ( parallelization is perhaps achieved at the deployment step? ).
¹² layer_norm is implemented via tf.contrib, which will be removed from the core TF 2.0 build process: