Kabir Ahuja
Nov 5

Hi Antony,

Good questions, let me answer them one by one.

  1. hidden_reps gives the representations of all the tokens in the input sequence, including the [CLS] token. cls_head is obtained by taking the representation corresponding to the [CLS] token in hidden_reps and applying a fully connected layer with a tanh activation on top of it. You can use either hidden_reps[:,0] or cls_head as your sentence embedding; it shouldn’t make much of a difference (see the first sketch after this list).
  2. The shapes differ because hidden_reps is the tensor containing the representations of all 12 tokens (the first dimension is the batch size), while cls_head contains only the representation of the [CLS] token.
  3. The [CLS] token is added to the sequence to provide a single vector representation of the entire sentence. Since the architecture uses self-attention, the [CLS] token carries information from all the other tokens in the sequence. BERT is pretrained in such a way that the responsibility for classification falls on the [CLS] token (the next sentence prediction task uses the [CLS] representation for classification). This is mostly a design choice to keep a separate token for classification tasks; as I mentioned, pooling the representations of all tokens should perform similarly.
  4. Use hidden_reps when you have a word-level task, i.e. a task where you need the representation of every word, such as POS tagging, where you want to tag each word in the sequence. Use cls_head (or, alternatively, hidden_reps[:,0]) when you need a single representation of the entire sentence, which is usually the case for classification problems (see the second sketch below).
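
For concreteness, here is a minimal sketch of where these two tensors come from, assuming the Hugging Face transformers library and a recent version that returns named outputs (the model name and example sentence are just for illustration):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # The tokenizer adds [CLS] at the start and [SEP] at the end.
    inputs = tokenizer("The weather is lovely today.", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    hidden_reps = outputs.last_hidden_state  # (batch, seq_len, 768): one vector per token
    cls_head = outputs.pooler_output         # (batch, 768): FC + tanh applied to hidden_reps[:, 0]

    print(hidden_reps.shape)  # e.g. torch.Size([1, 9, 768])
    print(cls_head.shape)     # torch.Size([1, 768])

    # Either of these can serve as the sentence embedding:
    sentence_emb = hidden_reps[:, 0]  # raw [CLS] representation
    sentence_emb = cls_head           # [CLS] after the pooler layer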
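
And a quick sketch of point 4, i.e. which output you would feed into a downstream head. The heads, tag counts, and placeholder tensors here are purely illustrative stand-ins for the outputs above:

    import torch
    import torch.nn as nn

    batch_size, seq_len, hidden_size = 1, 12, 768
    hidden_reps = torch.randn(batch_size, seq_len, hidden_size)  # per-token representations
    cls_head = torch.randn(batch_size, hidden_size)              # pooled [CLS] representation

    # Word-level task (e.g. POS tagging): every token needs a prediction.
    num_tags = 17  # illustrative tag-set size
    tagging_head = nn.Linear(hidden_size, num_tags)
    tag_logits = tagging_head(hidden_reps)  # (batch, seq_len, num_tags)

    # Sentence-level task (e.g. sentiment): one prediction per sequence.
    num_classes = 2
    clf_head = nn.Linear(hidden_size, num_classes)
    sentence_logits = clf_head(cls_head)  # (batch, num_classes)

    # Pooling all token representations is a common alternative to [CLS].
    mean_pooled = hidden_reps.mean(dim=1)  # (batch, hidden_size)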

I hope this clarifies your doubts. Let me know if anything is still unclear, I will be happy to help.

Kabir Ahuja