Kabir Ahuja
Nov 6
1. The vector representation of the sentence can be computed as:

# Feed the input to the BERT model to obtain contextualized representations
cont_reps, _ = self.bert_layer(seq, attention_mask=attn_masks)

# Obtain the representation of the [CLS] token
cls_rep = cont_reps[:, 0]

This will serve as a single vector representation for your sentence: a 768-dimensional vector for each sentence. Remember, we do not want its dimension to be 12 (the number of tokens); the sentence-level representation should be independent of the number of words present in it.
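The shape behavior described above can be checked with a quick sketch. Here random tensors stand in for BERT's actual output, so no model is needed; the point is just that indexing out the [CLS] position removes the sequence-length dimension:

```python
import torch

# Simulated BERT outputs of shape (batch, seq_len, hidden) with two
# different sequence lengths but the same hidden size of 768.
for seq_len in (12, 50):
    cont_reps = torch.randn(4, seq_len, 768)  # stand-in for contextualized reps
    cls_rep = cont_reps[:, 0]                 # [CLS] vector for each sentence
    print(cls_rep.shape)                      # (4, 768) regardless of seq_len
```

Either way, each sentence is represented by one 768-dimensional vector.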

2. As I mentioned in my previous comment, cls_head is computed by feeding hidden_reps[:, 0] to a fully connected layer and applying a tanh activation, so the two will not be equal. However, if you fine-tune the entire network for your problem, it makes little practical difference which of the two you use.
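The relationship in point 2 can be sketched as follows. This is a toy reproduction of BERT's pooler, assuming PyTorch; `pooler_dense` and the random tensors are illustrative stand-ins, not the trained model or its actual attribute names:

```python
import torch
import torch.nn as nn

hidden_size = 768
# Stand-in for BERT's pooler: a dense layer followed by tanh,
# applied to the hidden state at the [CLS] position.
pooler_dense = nn.Linear(hidden_size, hidden_size)

# Simulated hidden states of shape (batch, seq_len, hidden)
hidden_reps = torch.randn(4, 12, hidden_size)

cls_rep = hidden_reps[:, 0]                    # raw [CLS] vector
cls_head = torch.tanh(pooler_dense(cls_rep))   # pooled output

# Both are (batch, 768), but the values differ because of the extra
# linear transform and the tanh squashing everything into (-1, 1).
```

This is why comparing cls_head to hidden_reps[:, 0] directly shows different values, even though both derive from the same [CLS] position.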
