Google BERT — Pre-Training and Fine-Tuning for NLP Tasks

Ranko Mosic
Nov 5, 2018 · 7 min read

The recently released BERT paper and code generated a lot of excitement in the ML/NLP community¹.

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (BooksCorpus and Wikipedia), and then use that model for the downstream NLP tasks we care about (fine-tuning)¹⁴.

Models initialized from pre-trained BERT achieved better-than-human performance on SQuAD 1.1 and took the lead on SQuAD 2.0³. BERT relies on massive compute for pre-training (4 days on 4 to 16 Cloud TPUs; pre-training on 8 GPUs would take 40–70 days, i.e. it is not feasible). Fine-tuning BERT also requires a large amount of processing power, which makes it less attractive and practical for all but very specific tasks¹⁸. Typical uses are fine-tuning BERT for a particular task or using it for feature extraction.

BERT generates contextual, bidirectional word representations (a word gets a different vector depending on the sentence it appears in), as opposed to its predecessors (word2vec, GloVe), which assign each word a single fixed vector.

BERT proposes a new training objective: the "masked language model" (MLM)¹³. The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context.

The basic BERT building block is the Transformer (as opposed to RNN-based options such as a BiLSTM). Central to the Transformer is the notion of attention, i.e. contextual co-occurrence statistics¹⁷.


The Transformer is simpler and more parallelizable (GPU-friendly), i.e. faster than an RNN: it uses only straightforward matrix multiplications and a simple few-layer feed-forward neural network, with no recurrence and no weight sharing. BERT implements only the Transformer encoder part¹⁶:

[Figure 1: the Transformer encoder, from Attention Is All You Need]

A BERT sentence-classification demo is available for free on a Colab Cloud TPU. The BERT language model is fine-tuned for the MRPC task (semantic equivalence of sentence pairs).

For example, if the input sentences are:

Ranko Mosic is one of the world's foremost experts in the Natural Language Processing arena. In a world where there aren't that many NLP experts, Ranko is the one.

The model will conclude that these two sentences are equivalent (label = 1).
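Under the hood, the two sentences are packed into a single input sequence for the classifier. Below is a minimal sketch of the token and segment layout only; it is not the demo code itself (the real demo uses BERT's FullTokenizer and run_classifier.py), and str.split() stands in for WordPiece tokenization:

# Minimal sketch: how a sentence pair is laid out for classification.
sent_a = "Ranko Mosic is one of the world's foremost experts in the Natural Language Processing arena."
sent_b = "In a world where there aren't that many NLP experts, Ranko is the one."

tokens_a = sent_a.lower().split()   # stand-in for WordPiece tokenization
tokens_b = sent_b.lower().split()

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

# The final hidden state of the [CLS] token feeds a small classification layer
# that predicts label 1 (equivalent) or 0 (not equivalent).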


[Figure: where embeddings, tokenization and numericalization occur⁴ (from Dissecting BERT)]

The BERT pre-training example tokenizes sentences from sample_text.txt, matching each token against vocab.txt and assigning it an id.

The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor’s to supply the deficiency.

The sample sentence above is WordPiece-tokenized⁵ (after initial basic tokenization: lower-casing all tokens and splitting on punctuation) into:

['the', 'chips', 'from', 'his', 'wood', 'pile', 'refused', 'to', 'kind', '##le', 'a', 'fire', 'to', 'dry', 'his', 'bed', '-', 'clothes', ',', 'and', 'he', 'had', 'rec', '##ours', '##e', 'to', 'a', 'more', 'provide', '##nt', 'neighbor', "'", 's', 'to', 'supply', 'the', 'deficiency', '.']
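Footnote⁵ describes WordPiece as a greedy longest-match-first lookup against vocab.txt. Below is a minimal sketch of that algorithm with a toy vocabulary (simplified: the real tokenization.py also handles unknown characters and a maximum input length):

def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first: repeatedly take the longest prefix found in the
    # vocabulary; continuation pieces are prefixed with '##'.
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

vocab = {"kind", "##le", "rec", "##ours", "##e"}
print(wordpiece_tokenize("kindle", vocab))    # ['kind', '##le']
print(wordpiece_tokenize("recourse", vocab))  # ['rec', '##ours', '##e']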

Below is the layout of a final record written to the output file: 15% of the tokens are randomly masked⁶; a segment id (0 or 1, i.e. A or B) is added and padded to 128, the maximum sequence length (segments/sentences can contain content from different actual sentences); sentences are randomly shuffled and a randomized next_sentence label is added.

[Figure: an example record from the pre-training output file]
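A minimal sketch of the masking step (per the paper, the real create_pretraining_data.py replaces 80% of the selected tokens with [MASK], 10% with a random token and leaves 10% unchanged; only the [MASK] case is shown here):

import random

tokens = ['the', 'chips', 'from', 'his', 'wood', 'pile', 'refused', 'to',
          'kind', '##le', 'a', 'fire', 'to', 'dry', 'his', 'bed', '-', 'clothes']

random.seed(0)
num_to_mask = max(1, round(len(tokens) * 0.15))              # 15% of the positions
masked_positions = sorted(random.sample(range(len(tokens)), num_to_mask))

masked_tokens, masked_labels = list(tokens), {}
for pos in masked_positions:
    masked_labels[pos] = tokens[pos]                          # the id the model must predict
    masked_tokens[pos] = '[MASK]'

print(masked_tokens)
print(masked_labels)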

Embedding starts⁷ with a randomly initialized embedding_table (modeling.py); its shape is (30522, 768), i.e. (vocab_size, embedding vector size):

[Figure: the randomly initialized embedding_table]
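In TensorFlow 1.x terms (the framework the BERT code uses), the table is simply a trainable variable with a truncated-normal initializer. A sketch along the lines of what modeling.py does:

import tensorflow as tf   # TensorFlow 1.x, as used by the original BERT code

embedding_table = tf.get_variable(
    name="word_embeddings",
    shape=[30522, 768],    # (vocab_size, hidden_size)
    initializer=tf.truncated_normal_initializer(stddev=0.02))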

BERT uses tf.nn.embedding_lookup(embedding_table, input_ids) to match each input token id (input_id) with its initial random 768-dimensional embedding:

[Figure: part of the initial embedding for token_id 3536 ("wood"): an array of 768 random numbers with stddev 0.02]
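Continuing the sketch above (the ids other than 3536 are made up for illustration):

input_ids = tf.constant([[3536, 7, 42]])                               # 3536 = "wood", others hypothetical
input_embeddings = tf.nn.embedding_lookup(embedding_table, input_ids)  # shape (1, 3, 768)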

The next step is to add a positional embedding to the input embedding. Positional embedding is described in section 3.5 of Attention Is All You Need.
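In the BERT code this is a second learned table that is added element-wise to the token embeddings. A simplified sketch of what embedding_postprocessor in modeling.py does (the sinusoidal formulation is the one described in the paper section referenced above):

batch_size, seq_length, hidden = 8, 128, 768
token_embeddings = tf.zeros([batch_size, seq_length, hidden])         # stand-in for the lookup result

full_position_embeddings = tf.get_variable(
    name="position_embeddings",
    shape=[512, hidden],                                              # 512 = max_position_embeddings
    initializer=tf.truncated_normal_initializer(stddev=0.02))
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])

embedding_output = token_embeddings + position_embeddings             # broadcast over the batch dimension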


Multi-head attention starts with the attention mask (1.0 for positions we want to attend to and 0.0 for masked positions); the procedure below returns Tensor("bert/encoder/mul:0", shape=(8, 128, 128)).

[Figure: the attention mask]
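Continuing the sketch, here is roughly how that (batch, seq, seq) mask is built from the per-token input_mask (what create_attention_mask_from_input_mask in modeling.py does):

input_mask = tf.ones([batch_size, seq_length], dtype=tf.int32)        # 1 = real token, 0 = padding

to_mask = tf.cast(tf.reshape(input_mask, [batch_size, 1, seq_length]), tf.float32)
broadcast_ones = tf.ones([batch_size, seq_length, 1], dtype=tf.float32)
attention_mask = broadcast_ones * to_mask                             # tensor of shape (8, 128, 128)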

Below is a call to transformer_model:
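A sketch of that call with BERT-Base hyper-parameters (in the real code these values are read from bert_config.json; this assumes the google-research/bert repo is on the path):

from modeling import transformer_model, gelu   # modeling.py from the BERT repo

all_encoder_layers = transformer_model(
    input_tensor=embedding_output,              # (batch_size, seq_length, 768)
    attention_mask=attention_mask,              # (batch_size, seq_length, seq_length)
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    intermediate_act_fn=gelu,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    do_return_all_layers=True)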

The BERT Transformer has a configurable (bert_config.json) number of self-attention heads (it is self-attention because from_tensor and to_tensor are the same tensor, layer_input, with shape (1024, 768)):

from_tensor and to_tensor are transformed into query_layer, key_layer and value_layer via tf.layers.dense¹⁰:
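A sketch of the three projections (BERT-Base: 12 heads, size_per_head = 768 / 12 = 64), continuing from the embedding sketch above:

num_attention_heads, size_per_head = 12, 64
from_tensor_2d = tf.reshape(embedding_output, [batch_size * seq_length, hidden])   # (1024, 768)

query_layer = tf.layers.dense(from_tensor_2d, num_attention_heads * size_per_head, name="query")
key_layer   = tf.layers.dense(from_tensor_2d, num_attention_heads * size_per_head, name="key")
value_layer = tf.layers.dense(from_tensor_2d, num_attention_heads * size_per_head, name="value")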

This is the core moment (Scaled Dot-Product Attention in Figure 2 below): the dot-product similarity (attention, i.e. attention_scores) between query and key is calculated:

[Figure 2: Scaled Dot-Product Attention, from Attention Is All You Need]
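A sketch of that computation, continuing from the projections above (the reshape groups the 768 columns into 12 heads of 64 dimensions each):

def transpose_for_scores(x, batch, heads, seq, width):
    # (batch*seq, heads*width) -> (batch, heads, seq, width)
    return tf.transpose(tf.reshape(x, [batch, seq, heads, width]), [0, 2, 1, 3])

query = transpose_for_scores(query_layer, batch_size, num_attention_heads, seq_length, size_per_head)
key   = transpose_for_scores(key_layer,   batch_size, num_attention_heads, seq_length, size_per_head)

attention_scores = tf.matmul(query, key, transpose_b=True)                 # (8, 12, 128, 128)
attention_scores = attention_scores / (size_per_head ** 0.5)               # scale by sqrt(d_k)
attention_scores += (1.0 - tf.expand_dims(attention_mask, 1)) * -10000.0   # push masked positions to ~0 probability
attention_probs = tf.nn.softmax(attention_scores)                          # each row sums to 1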

A standard dropout is applied (with keep probability 1.0 - 0.1 = 0.9):

[Figure: the input tensor before and after dropout: 10% of the values are set to zero, the rest are multiplied by 1/0.9 = 1.1111]
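A tiny standalone illustration of that rescaling (numpy, not the BERT code; in the model the dropout is applied to attention_probs via tf.nn.dropout):

import numpy as np

np.random.seed(0)
x = np.ones(10, dtype=np.float32)
keep_prob = 0.9
keep = (np.random.rand(10) < keep_prob).astype(np.float32)   # 1 = kept, 0 = dropped
print(x * keep / keep_prob)                                  # kept values become 1.1111..., dropped ones 0.0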

Finally, the scaled dot-product attention output is the matrix multiplication of attention_probs and value_layer:
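Continuing the sketch (in the real code the dropout above is applied to attention_probs first):

value = transpose_for_scores(value_layer, batch_size, num_attention_heads, seq_length, size_per_head)
context_layer = tf.matmul(attention_probs, value)                          # (8, 12, 128, 64)

# transpose back and merge the 12 heads into one 768-dimensional vector per token
context_layer = tf.reshape(tf.transpose(context_layer, [0, 2, 1, 3]),
                           [batch_size * seq_length, num_attention_heads * size_per_head])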

We have now completed the computation for a single attention layer¹¹.


Next comes the feed-forward part, which is split into three layers; only the intermediate step has the gelu activation function, while the outer layers feature dropout and layer normalization¹² (for faster training). layer_outputs with shape (1024, 768) are appended to the all_layers list (each layer_output is one of the Nx layers in Figure 1 above):

[Figure: the feed-forward sub-layer]
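A sketch of that sub-block, continuing from above (roughly one iteration of the layer loop in transformer_model; the attention output is first projected, dropped out and layer-normalized against the layer input):

all_layers = []

attention_output = tf.layers.dense(context_layer, 768)                               # project the concatenated heads
attention_output = tf.nn.dropout(attention_output, keep_prob=0.9)
attention_output = tf.contrib.layers.layer_norm(attention_output + from_tensor_2d)   # residual + layer norm

intermediate_output = tf.layers.dense(attention_output, 3072, activation=gelu)       # the gelu "inner" layer
layer_output = tf.layers.dense(intermediate_output, 768)                             # back to hidden size
layer_output = tf.nn.dropout(layer_output, keep_prob=0.9)
layer_output = tf.contrib.layers.layer_norm(layer_output + attention_output)         # residual + layer norm
all_layers.append(layer_output)                                                      # one of the Nx encoder layers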

layer_outputs are finally brought back to their original shape (32, 128, 768):
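Roughly what reshape_from_matrix in modeling.py does, continuing the sketch:

final_outputs = [tf.reshape(layer, [batch_size, seq_length, hidden])   # e.g. (32, 128, 768) during pre-training
                 for layer in all_layers]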


¹ The NY Times wrote about BERT. In a nutshell, BERT is a humongous encoder: it provides state-of-the-art contextual representations learned from a huge text corpus (Wikipedia/BooksCorpus -> BERT -> word encodings, i.e. the model weights).

.. “encoder-only” models like BERT are designed to produce a single prediction per input token or a single prediction for an entire input sequence. This makes them applicable for classification or span prediction tasks but not for generative tasks like translation or abstractive summarization.

³ https://twitter.com/stanfordnlp/status/1066742978381639680

⁵ Tokenizes a piece of text into its word pieces, e.g. “unaffable” = [“un”, “##aff”, “##able”]; uses a greedy longest-match-first algorithm to perform tokenization against the given vocabulary. Sentences are randomly shuffled. Each token is assigned a token_id (unique across all segments); for example, the ‘wood’ token_id is 3536.

⁶ Please refer to page 6 of the BERT paper for more details on why and how masking is done.

⁷ Next, we get the embedding for each word in the sequence. Each word of the sequence is mapped to an emb_dim-dimensional vector that the model will learn during training. You can think of it as a vector look-up for each token. The elements of those vectors are treated as model parameters and are optimized with back-propagation just like any other weights (Dissecting BERT).

¹⁰ This layer implements the operation (a standard NN layer): outputs = activation(inputs * kernel + bias), where activation is the activation function passed as the activation argument (if not None), kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only if use_bias is True).

¹¹ Multi-head attention layers are concatenated:

attention_output = tf.concat(attention_heads, axis=-1)

¹² layer_norm:

¹³ This is an example of self-supervised learning

¹⁴ Real-life BERT-based applications include Google search improvement (Oct 25, 2019 update; I am guessing BERT is used in a supervisory role for reranking search results: “In a recent talk at Google Berlin, Jacob Devlin described how Google are (sic) using his BERT architectures internally. The models are too large to serve in production, but they can be used to supervise a smaller production model”), sentiment analysis, and classification.

¹⁵ Interview with BERT first author Jacob Devlin

¹⁶ GPT-2 uses Transformer decoders.

¹⁷ The input sequence is split into vectorized tokens; logically, each token is a query that is correlated with the rest of the tokens, the keys (and their corresponding values).

¹⁸ General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train.
