Google BERT — Pre-Training and Fine-Tuning for NLP Tasks

The recently released BERT paper and code generated a lot of excitement in the ML/NLP community¹. BERT is a method of pre-training language representations: a general-purpose “language understanding” model is trained on a large text corpus (BooksCorpus and Wikipedia) and then fine-tuned for the downstream NLP tasks we care about (like question answering on SQuAD).

Models pre-trained with BERT achieved better-than-human performance on SQuAD 1.1 and lead the SQuAD 2.0 leaderboard³. BERT relies on massive compute for pre-training (4 days on 4 to 16 Cloud TPUs; pre-training on 8 GPUs would take 40–70 days). Typical uses are fine-tuning BERT for a particular task or using it for feature extraction.

BERT naturally builds on ELMo, continuing in the language-model direction (as opposed to shallow word-embedding approaches such as word2vec and GloVe). It generates contextual, bidirectional representations⁴.

BERT proposes a new training objective: the “masked language model” (MLM). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.

The basic BERT building block is the Transformer (as opposed to RNN-based options like a BiLSTM). Central to the Transformer is the notion of attention: key/value pairs are dynamically weighted by how similar their keys are to a query vector.
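For reference, scaled dot-product attention as defined in Attention Is All You Need is

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

i.e. each query in Q is compared against every key in K, the similarities are normalized with a softmax, and the resulting probabilities are used to take a weighted average of the values V.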

The Transformer² is simpler, more parallelizable (GPU friendly) and therefore faster than an RNN: it uses only straightforward matrix multiplications and small few-layer feed-forward neural networks, with no recurrence and no weight sharing:

A BERT sentence-classification demo is available for free on a Colab Cloud TPU. There, the generic BERT model is fine-tuned for the MRPC task (determining whether two sentences are semantically equivalent).

For example, if input sentences are:

Ranko Mosic is one of the world’s foremost experts in the Natural Language Processing arena. In a world where there aren’t that many NLP experts, Ranko is the one.

The model will conclude that these two sentences are equivalent (label = 1).


We now proceed with a more detailed analysis⁴ that includes code snippets and variable dumps.

Where embeddings, tokenization and numericalization occur⁴ (from Dissecting BERT)

The BERT pre-training example tokenizes sentences from sample_text.txt, matching tokens against the contents of vocab.txt and assigning them ids.

The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor’s to supply the deficiency.

The sample sentence above is WordPiece-tokenized⁵ (following initial basic tokenization: lower-casing and splitting on punctuation) into:

['the', 'chips', 'from', 'his', 'wood', 'pile', 'refused', 'to', 'kind', '##le', 'a', 'fire', 'to', 'dry', 'his', 'bed', '-', 'clothes', ',', 'and', 'he', 'had', 'rec', '##ours', '##e', 'to', 'a', 'more', 'provide', '##nt', 'neighbor', "'", 's', 'to', 'supply', 'the', 'deficiency', '.']
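A minimal sketch of reproducing this with tokenization.py from the BERT repository; the local vocab.txt path is an assumption (the file ships with the released uncased checkpoints):

import tokenization  # tokenization.py from the google-research/bert repository

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
text = ("The chips from his wood pile refused to kindle a fire to dry his "
        "bed-clothes, and he had recourse to a more provident neighbor's to "
        "supply the deficiency.")
tokens = tokenizer.tokenize(text)                     # basic + WordPiece tokenization
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # e.g. 'wood' -> 3536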

Below is the layout of a final record written to the output file: 15% of the tokens are randomly masked⁶; a segment id (0 or 1, i.e. A or B) is added and padded to 128, the maximum segment length (segments/sentences can contain content from different actual sentences); sentences are randomly shuffled and a randomized next_sentence label is added.
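As a sketch, the feature layout of one such record, as it is parsed back in run_pretraining.py (max_seq_length = 128 and max_predictions_per_seq = 20 match the dumps in this post):

import tensorflow as tf  # TF 1.x

max_seq_length = 128          # padded segment length
max_predictions_per_seq = 20  # up to 20 masked positions per sequence

name_to_features = {
    "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
    "input_mask": tf.FixedLenFeature([max_seq_length], tf.int64),
    "segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64),   # 0 = A, 1 = B
    "masked_lm_positions": tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
    "masked_lm_ids": tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
    "masked_lm_weights": tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
    "next_sentence_labels": tf.FixedLenFeature([1], tf.int64),
}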

Embedding starts⁷ with a randomly initialized embedding_table (modeling.py); its shape is (30522, 768), i.e. (vocab_size, embedding vector size):

Randomly initialized embedding_table

BERT uses tf.nn.embedding_lookup(embedding_table, input_ids) to match each input token_id (input_id) with its initial random 768-dimensional embedding:

Part of the initial embedding for token_id 3536 (“wood”): an array of 768 random numbers with stddev 0.02
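A self-contained TF 1.x sketch of the two steps above (table creation and lookup); the example input_ids are made up, except that 3536 corresponds to “wood”:

import tensorflow as tf  # TF 1.x graph mode

vocab_size, hidden_size = 30522, 768

# randomly initialized table, stddev 0.02, as in modeling.py
embedding_table = tf.get_variable(
    "word_embeddings", shape=[vocab_size, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

input_ids = tf.constant([[3536, 2003, 1037]])   # hypothetical batch of token ids
token_embeddings = tf.nn.embedding_lookup(embedding_table, input_ids)
# token_embeddings has shape (1, 3, 768)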

The next step is to modify the initial random embeddings with segment and positional information. The BERT GitHub Python code does not implement positional encoding as described in section 3.5 of Attention Is All You Need⁸.

It starts with token_type_table:

token_type_table.shape is (2, 768): two randomly generated 768-dimensional vectors:

[[0.00760975899 -0.0229291413 -0.0163771752…]…]

The initial batch of token_type_ids (segment ids of shape (8, 128)) is flattened into a single 1024-element vector, converted into a one-hot representation, matrix-multiplied with the token_type_table and added (matrix addition) to the output:
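A sketch of that one-hot/matmul step, roughly as it appears in the embedding_postprocessor of modeling.py (TF 1.x):

import tensorflow as tf  # TF 1.x

batch_size, seq_length, width = 8, 128, 768
token_type_ids = tf.zeros([batch_size, seq_length], dtype=tf.int32)  # all segment A here
token_type_table = tf.get_variable(
    "token_type_embeddings", shape=[2, width],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

flat_ids = tf.reshape(token_type_ids, [-1])                        # (1024,)
one_hot_ids = tf.one_hot(flat_ids, depth=2)                        # (1024, 2)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)   # (1024, 768)
token_type_embeddings = tf.reshape(token_type_embeddings,
                                   [batch_size, seq_length, width])
# output += token_type_embeddings   (added elementwise to the word embeddings)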

The output is further salted with randomly initialized positional embeddings⁹:

A sample of the output (serves as the starting point for the Transformer step below)
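The positional part uses a randomly initialized⁹ table of shape (max_position_embeddings, 768); its first seq_length rows are broadcast-added over the batch. A rough sketch (modeling.py style):

import tensorflow as tf  # TF 1.x

seq_length, width, max_position_embeddings = 128, 768, 512

full_position_embeddings = tf.get_variable(
    "position_embeddings", shape=[max_position_embeddings, width],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

# keep the first seq_length rows and broadcast them over the batch dimension
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])
position_embeddings = tf.reshape(position_embeddings, [1, seq_length, width])
# output += position_embeddings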

We are now moving on to the next part, the encoder block with multi-head attention:

The first step is to create the attention mask (attention_mask is 1.0 for positions we want to attend to and 0.0 for masked positions); the procedure below returns Tensor("bert/encoder/mul:0", shape=(8, 128, 128)).

Attention Mask
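A compact sketch of how the (8, 128, 128) mask is built from the (8, 128) input mask (along the lines of create_attention_mask_from_input_mask in modeling.py):

import tensorflow as tf  # TF 1.x

batch_size, seq_length = 8, 128
input_mask = tf.ones([batch_size, seq_length], dtype=tf.int32)  # 1 = real token, 0 = padding

to_mask = tf.cast(tf.reshape(input_mask, [batch_size, 1, seq_length]), tf.float32)
broadcast_ones = tf.ones([batch_size, seq_length, 1], dtype=tf.float32)
attention_mask = broadcast_ones * to_mask   # shape (8, 128, 128)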

Below is a call to transformer_model:

The BERT Transformer has a configurable (bert_config.json) number of self-attention heads (it is self-attention because from_tensor and to_tensor are the same: layer_input with shape (1024, 768)):

from_tensor and to_tensor are transformed into query_layer, key_layer and value_layer via tf.layers.dense¹⁰:
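Roughly, with num_attention_heads = 12 and size_per_head = 64 as in BERT-Base (a sketch; the stand-in tensor replaces the real layer_input):

import tensorflow as tf  # TF 1.x

num_attention_heads, size_per_head = 12, 64     # 12 * 64 = 768
from_tensor_2d = tf.zeros([1024, 768])          # stand-in for layer_input

# self-attention: queries, keys and values are all projections of the same tensor
query_layer = tf.layers.dense(from_tensor_2d, num_attention_heads * size_per_head, name="query")
key_layer = tf.layers.dense(from_tensor_2d, num_attention_heads * size_per_head, name="key")
value_layer = tf.layers.dense(from_tensor_2d, num_attention_heads * size_per_head, name="value")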

This is the core moment (Scaled Dot-Product Attention in Figure 2 below): the dot-product similarity (attention_score) between query and key is calculated and softmaxed, i.e. converted to attention probabilities that add up to 1:

Image From Attention Is All You Need Paper

A standard dropout is applied (with keep probability 1.0 − 0.1 = 0.9):

Input tensor before and after dropout: 10% of the values are set to zero, the rest are multiplied by 1/0.9 ≈ 1.111

Finally, the scaled dot-product attention output is the matrix multiplication of attention_probs and value_layer:
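Putting the last few steps together, a simplified single-head sketch (random stand-in tensors; the real code additionally reshapes per head and adds the attention mask to the scores):

import math
import tensorflow as tf  # TF 1.x

size_per_head = 64
query_layer = tf.random_normal([128, size_per_head])   # stand-ins: one sequence, one head
key_layer = tf.random_normal([128, size_per_head])
value_layer = tf.random_normal([128, size_per_head])

attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)         # (128, 128)
attention_scores = attention_scores * (1.0 / math.sqrt(float(size_per_head)))  # scaling
attention_probs = tf.nn.softmax(attention_scores)                # each row adds up to 1
attention_probs = tf.nn.dropout(attention_probs, keep_prob=0.9)  # attention dropout
context_layer = tf.matmul(attention_probs, value_layer)          # (128, 64) attention output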

We have now completed the computation for a single attention layer¹¹.


Next comes the feed-forward part, which is split into three dense layers; only the intermediate step has the gelu activation function, while the outer layers feature dropout and layer normalization¹² (for faster training). layer_outputs with shape (1024, 768) are appended to the all_layers list (each layer_output is one of the Nx layers in Figure 1 above):

Feed Forward
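A sketch of those three dense layers (TF 1.x, modeling.py style), assuming hidden_size = 768 and intermediate_size = 3072 as in BERT-Base; the stand-in tensors replace the real attention output and layer input:

import numpy as np
import tensorflow as tf  # TF 1.x

hidden_size, intermediate_size, keep_prob = 768, 3072, 0.9

def gelu(x):
    # gelu approximation used in the BERT code
    cdf = 0.5 * (1.0 + tf.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
    return x * cdf

def layer_norm(x):
    return tf.contrib.layers.layer_norm(inputs=x, begin_norm_axis=-1, begin_params_axis=-1)

layer_input = tf.zeros([1024, hidden_size])              # stand-in for the block input
attention_heads_output = tf.zeros([1024, hidden_size])   # stand-in for the attention output

# 1) projection of the attention output, with dropout, residual and layer norm
attention_output = tf.layers.dense(attention_heads_output, hidden_size)
attention_output = tf.nn.dropout(attention_output, keep_prob=keep_prob)
attention_output = layer_norm(attention_output + layer_input)

# 2) intermediate ("inner") layer, the only one with gelu
intermediate_output = tf.layers.dense(attention_output, intermediate_size, activation=gelu)

# 3) output layer back to hidden_size, with dropout, residual and layer norm
layer_output = tf.layers.dense(intermediate_output, hidden_size)
layer_output = tf.nn.dropout(layer_output, keep_prob=keep_prob)
layer_output = layer_norm(layer_output + attention_output)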

layer_outputs are finally brought back to their original shape (32, 128, 768):


The green-marked area shows the Transformer encoder part we have covered so far. We are now moving on to the red-arrow-marked entry into the Transformer decoder part (both encoder and decoder also feature BERT-specific token masking):

get_masked_lm_output is a function call whose inputs are listed below (a sketch of the function follows the list):

  • bert_config parameters
  • model.get_sequence_output(); this is the output of the last encoder layer from the call to transformer_model above:

self.sequence_output = self.all_encoder_layers[-1]

  • model.get_embedding_table() is the token embedding table (explained above)
  • masked_lm_positions, masked_lm_ids, masked_lm_weights are tensors of shape (8, 20) that contain masked token positions, ids and weights
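A trimmed-down sketch of what get_masked_lm_output does with these inputs: gather the hidden states at the masked positions, apply a small transform, score against the (tied) embedding table and compute a weighted cross-entropy loss. The helper below is simplified (the real code also applies the configured activation, a layer norm and an output bias in the transform step):

import tensorflow as tf  # TF 1.x

def masked_lm_loss_sketch(sequence_output,      # (batch, seq_len, hidden), float32
                          embedding_table,      # (vocab_size, hidden), float32
                          masked_lm_positions,  # (batch, max_preds), int32
                          masked_lm_ids,        # (batch, max_preds), int32
                          masked_lm_weights,    # (batch, max_preds), float32
                          vocab_size=30522, hidden_size=768):
    # gather the hidden vectors at the masked positions (cf. gather_indexes)
    batch_size = tf.shape(sequence_output)[0]
    seq_length = tf.shape(sequence_output)[1]
    flat_offsets = tf.reshape(tf.range(batch_size) * seq_length, [-1, 1])
    flat_positions = tf.reshape(masked_lm_positions + flat_offsets, [-1])
    flat_sequence = tf.reshape(sequence_output, [-1, hidden_size])
    masked_hidden = tf.gather(flat_sequence, flat_positions)   # (batch*max_preds, hidden)

    # transform, then score every masked position against the embedding table
    transformed = tf.layers.dense(masked_hidden, hidden_size)
    logits = tf.matmul(transformed, embedding_table, transpose_b=True)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    # weighted negative log-likelihood over the masked positions only
    one_hot_labels = tf.one_hot(tf.reshape(masked_lm_ids, [-1]), depth=vocab_size)
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=-1)
    weights = tf.reshape(masked_lm_weights, [-1])
    return tf.reduce_sum(weights * per_example_loss) / (tf.reduce_sum(weights) + 1e-5)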

¹ The NY Times wrote about BERT.

² A quick way to try Transformer is Google’s Tensor2Tensor library of models and datasets.

³ https://twitter.com/stanfordnlp/status/1066742978381639680

⁴ Based on the BERT GitHub code and paper, Dissecting BERT, and The Illustrated BERT; work in progress

⁵ Tokenizes a piece of text into its word pieces, e.g. “unaffable” = [“un”, “##aff”, “##able”]; uses a greedy longest-match-first algorithm to perform tokenization with the given vocabulary. Sentences are randomly shuffled. Each token is assigned a token_id (unique across all segments); for example, the ‘wood’ token_id is 3536

⁶ Please refer to page 6 of the BERT paper for more details on why and how masking is done

⁷ Next, we get the embedding for each word in the sequence. Each word of the sequence is mapped to an emb_dim-dimensional vector that the model will learn during training. You can think of it as a vector look-up for each token. The elements of those vectors are treated as model parameters and are optimized with back-propagation just like any other weights ( Dissecting Bert )

⁸ We assume the PE described below is implemented in the BERT C++ code.

Positional Embedding from section 3.5 of Attention Is All You Need
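For reference, the sinusoidal encoding defined there is

PE(pos, 2i) = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

where pos is the token position and i indexes the embedding dimension.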

⁹ This is obviously not a proper PE implementation as no learning occurs and PE is reinitialized on each loop.

¹⁰ This layer implements the operation ( a standard NN layer ): outputs = activation(inputs * kernel + bias) where activation is the activation function passed as the activation argument (if not None), kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only if use_bias is True).

¹¹ Multi-head attention layers should be concatenated: “On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2” (Attention Is All You Need). This code exists in the GitHub BERT Python code, but is never executed:

attention_output = tf.concat(attention_heads, axis=-1)

Parallel multi-head execution is a major point of the Transformer. The GitHub code is all sequential though (parallelization is perhaps achieved at the deployment step?).

¹² layer_norm is implemented via tf.contrib, which will be removed from the core TF 2.0 build process:
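The wrapper in modeling.py is roughly:

import tensorflow as tf  # TF 1.x; tf.contrib has no direct counterpart in core TF 2.0

def layer_norm(input_tensor, name=None):
    """Runs layer normalization on the last dimension of the tensor."""
    return tf.contrib.layers.layer_norm(
        inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)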