- The purpose of padding is to let us feed multiple sentences to the network at the same time. Say you have one sentence of length 10, another of 5, and a third of 12. To feed them together you need to fit all of them into a single tensor, so you pad every sentence to length 12 and get a tensor of shape [3, 12] that can be fed to the network. For the example in the post it is irrelevant, since we only have a single sentence; I guess that must have caused some confusion. In short, you only need [PAD] when you want embeddings for multiple sentences in a single call to the network. And yes, padding is applied at the sentence (token) level, not the dimension level.
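To make the shapes concrete, here is a minimal sketch in plain PyTorch (assuming 0 as the pad id and dummy token ids) that pads three sequences of lengths 10, 5, and 12 into a single [3, 12] tensor:

```python
import torch

# Three token-id sequences of lengths 10, 5 and 12 (dummy ids)
sentences = [list(range(1, 11)), list(range(1, 6)), list(range(1, 13))]

# Pad every sequence with 0s up to the longest length
max_len = max(len(s) for s in sentences)  # 12
padded = torch.tensor([s + [0] * (max_len - len(s)) for s in sentences])

print(padded.shape)  # torch.Size([3, 12])
```

In practice the tokenizer can do this for you, but the idea is the same: the pad ids only fill out the time dimension, never the 768-dimensional embedding axis.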
- Can you explain what you mean by normalized here? If you mean normalized by length, then the answer is yes.
- Just feed both sentences to the network and get embeddings for all the tokens in each sentence. Then average the token embeddings of each sentence to get a sentence-level embedding for each of them, and apply your similarity function to the two embeddings. It might look something like this:
# Let's say you have 2 sentences, sent1 and sent2
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")

sent1_tokens = tokenizer.tokenize(sent1)
sent2_tokens = tokenizer.tokenize(sent2)  # was sent1 before: a copy-paste bug
sent1_ids = torch.tensor(tokenizer.convert_tokens_to_ids(sent1_tokens)).unsqueeze(0)
sent2_ids = torch.tensor(tokenizer.convert_tokens_to_ids(sent2_tokens)).unsqueeze(0)

with torch.no_grad():
    hidden_reps_sent1 = bert_model(sent1_ids)[0]  # last hidden state, shape [1, l1, 768]
    hidden_reps_sent2 = bert_model(sent2_ids)[0]  # last hidden state, shape [1, l2, 768]

sent1_embed = hidden_reps_sent1.mean(dim=1).squeeze()  # shape [768]
sent2_embed = hidden_reps_sent2.mean(dim=1).squeeze()  # shape [768]
similarity = some_similarity_function(sent1_embed, sent2_embed)
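If you don't already have a similarity function in mind, cosine similarity is a common choice for comparing embeddings. A minimal sketch of `some_similarity_function` (the name is just a placeholder from the snippet above) using PyTorch:

```python
import torch
import torch.nn.functional as F

def some_similarity_function(a, b):
    # Cosine similarity between two 1-D embeddings, in [-1, 1]
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

sent1_embed = torch.randn(768)  # stand-in for a real sentence embedding
similarity = some_similarity_function(sent1_embed, sent1_embed)
print(similarity)  # ~1.0 for identical vectors
```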
Note: You might wonder why we didn't use the embedding of the [CLS] token here, i.e. hidden_reps_sent1[:, 0]. The reason is that the [CLS] token is trained to contain the information about the sequence needed for a particular classification task (next sentence prediction, to be exact). Hence it does not necessarily capture the semantic content of the complete sentence, and averaging the embeddings of all tokens should work better.
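For reference, the two pooling choices differ only in how they index the hidden states. A small sketch with a made-up hidden-state tensor (random values standing in for real BERT output):

```python
import torch

# Stand-in for BERT's last hidden state for one sentence: [1, seq_len, 768]
hidden_reps = torch.randn(1, 12, 768)

cls_embed = hidden_reps[:, 0].squeeze()       # [CLS] token only, shape [768]
mean_embed = hidden_reps.mean(dim=1).squeeze()  # average over all tokens, shape [768]

print(cls_embed.shape, mean_embed.shape)  # torch.Size([768]) torch.Size([768])
```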
I hope this answers your questions.
