Fine-tuning XLNet for sentiment extraction in TensorFlow:
Contents:
- Brief Intro to XLNet
- What is the sentiment extraction task?
- Tokenization
- Building Model
Brief Intro to XLNet:
XLNet is one of the state-of-the-art models in the field of NLP. It took over the empire from BERT by outperforming it on 20 tasks, including question answering, natural language inference, sentiment analysis, and document ranking. Unlike BERT, XLNet takes advantage of pre-training with both Auto-Regressive (AR) language modeling and Auto-Encoding (AE).
Okay, wait! What are these? Don't worry, we'll figure out below what these terms mean.
1. Auto-Regressive language modeling:
Here, we ask the model to predict the next word step by step. Let me make this clearer with an example:
Let us consider the sentence 'Unsupervised representation learning has been highly successful in the domain of natural language processing.'
During training, the model is given 'Unsupervised representation learning has' and asked to predict the next word, which here is 'been'. It doesn't stop there: the model is then given 'Unsupervised representation learning has been' and asked to predict the next word, which is now 'highly'. And this process continues.
input 1 → ‘Unsupervised representation learning has’
output 1 → ‘been’
input 2 → ‘Unsupervised representation learning has been’
output 2 → ‘highly’
.
.
.
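To make the stepwise setup concrete, here is a tiny, purely illustrative snippet that generates such (input, next word) pairs from the example sentence; the variable names are my own, not part of any library:

sentence = ('Unsupervised representation learning has been highly successful '
            'in the domain of natural language processing.').split()

# auto-regressive training pairs: predict each word from all the words before it
for i in range(4, 7):
    context, target = ' '.join(sentence[:i]), sentence[i]
    print('input  ->', context)
    print('output ->', target)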
2. Auto-Encoding:
Auto-Encoding is a bit different from what we discussed above. Here, some of the words in the sentence are masked, and the model is supposed to predict these masked words. Let me illustrate this with the same example as above.
The model is given the above sequence with some tokens masked (as shown below) and is asked to predict those words:
Unsupervised representation <MASK> has been highly <MASK> in the domain of natural language processing.
BERT's pretraining objective was only masked language modeling (which is a specific kind of auto-encoding), not autoregressive modeling, whereas XLNet uses both.
The advantage of autoregressive modeling is that when the model is predicting a word, it has access to all the preceding words, including the ones it has already predicted; with autoencoding, on the other hand, the model only has access to the non-masked tokens.
I have tried to give a very crude idea of how XLNet is different from BERT.
For more information, I would strongly recommend this awesomely written medium article.
What is sentiment extraction?
Sentiment extraction is a natural language processing task where a sentence and its corresponding sentiment (positive, negative, or neutral) are given, and the model is supposed to extract the phrase or set of words that most strongly supports the given sentiment.
How do we do that?
We simply try to predict the start and end indices of the text that strongly agrees with the given sentiment.
Now, let's get our hands dirty by learning along with the code.
Note: Here, I am not using any particular dataset for this task; I am going to provide the basic logic along with code so that you can embed this logic in your own models on your own custom datasets.
Coding models like XLNet from scratch and pretraining them is a really laborious task. A million thanks to Hugging Face Transformers; this repository makes our task much simpler. It provides pretrained transformer models and tokenizers. It also provides an XLNet model for the question answering task (quite similar to the sentiment extraction task), but we will finetune our own model here.
Tokenization:
This is the first step in almost all natural language processing tasks: we need to tokenize the sentences.
So why tokenize?
Because, unfortunately, computers are not good at dealing with words but, fortunately, they are excellent with numbers.
XLNet uses a SentencePiece tokenizer for tokenizing its text; a pretrained tokenizer for XLNet is provided in Hugging Face Transformers.
from transformers import *

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=True)
The above piece of code imports the transformers library and the pretrained XLNet tokenizer. The code we are going to discuss expects the training set as a data frame with the columns 'text', 'sentiment', and 'selected_text'.
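Just to make that format concrete, a toy version of such a data frame could be built like this (the rows are made up purely for illustration):

import pandas as pd

train_df = pd.DataFrame({
    'text': ['Paris is a wonderful place!', 'the traffic today was horrible'],
    'sentiment': ['positive', 'negative'],
    'selected_text': ['wonderful place!', 'horrible'],
})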
Now, let us see how to tokenize a sentence and obtain a single data point, so that we can later put the same logic in a loop to tokenize the whole dataset.
XLNet expects three types of inputs:
- input_ids: the tokenized inputs, padded to the maximum length, with <cls> and <sep> tokens added wherever necessary.
- attention_mask: this helps our XLNet model attend to the non-pad tokens only.
- token_type_ids: used when the input contains two different types of sequences. In our task we have two: the sentence from which text is to be extracted and its corresponding sentiment. The two can be distinguished by giving them different values in token_type_ids.
For simplicity, I am going to neglect the token type ids in this post; the performance might differ slightly if we use them.
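To get a feel for what these inputs look like, here is a quick check with the pretrained tokenizer (with a recent version of transformers the tokenizer can be called directly like this; max_length=24 is just an arbitrary choice for the demo):

enc = tokenizer('Paris is a wonderful place!', 'positive',
                max_length=24, padding='max_length', truncation=True)

print(enc['input_ids'])       # token ids for sentence + sentiment, padded to max_length
print(enc['attention_mask'])  # 1 for real tokens, 0 for padding
print(enc['token_type_ids'])  # different values for the sentence and the sentiment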
Okay! Now, let’s tokenize and obtain input ids and attention mask for a sentence.
Input sentence : ‘Paris is a wonderful place!’
sentiment : ‘Positive’
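Below is a minimal sketch of that per-sentence logic, mirroring the steps walked through next. The variable names (special, chars, offsets, start_idx, end_idx, and so on) are my own, and it assumes the text has been lower-cased and prefixed with a single space so that SentencePiece's '▁' markers line up with character positions:

import numpy as np

# ids of the special tokens, so that we don't call tokenizer.encode() for them each time
special = {t: tokenizer.convert_tokens_to_ids(t) for t in ('<sep>', '<cls>', '<pad>')}

text          = ' ' + 'paris is a wonderful place!'   # lower-cased and space-prefixed
sentiment     = 'positive'
selected_text = 'wonderful place!'

# encode without the <sep>/<cls> tokens that would otherwise be appended automatically
text_enc  = tokenizer.encode(text, add_special_tokens=False)
senti_enc = tokenizer.encode(sentiment, add_special_tokens=False)

# chars: 1 for every character that belongs to the selected text, 0 elsewhere
chars = np.zeros(len(text), dtype=int)
start_char = text.find(selected_text)
chars[start_char:start_char + len(selected_text)] = 1

# offsets: (start, end) character positions of every token ('▁' marks a preceding space)
offsets, idx = [], 0
for piece in tokenizer.convert_ids_to_tokens(text_enc):
    w = piece.replace('▁', ' ')
    offsets.append((idx, idx + len(w)))
    idx += len(w)

# tokens whose character span overlaps the selected text give the start and end indices
overlap = [i for i, (a, b) in enumerate(offsets) if chars[a:b].sum() > 0]
start_idx, end_idx = overlap[0], overlap[-1]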
Lots of code! No worries, let’s go through it part by part!
1. First, we create a dictionary mapping the special tokens to their ids, so that we can avoid calling tokenizer.encode() for them every time.
2. Next, we call tokenizer.encode() on the text and the sentiment. By default it adds the sep and cls tokens at the end, but since we need to do more here (like finding offsets), we exclude them.
Note 1: tokenizer.encode() might sometimes break a single word into pieces; for example, it breaks 'paris' into 'pari' and 's'.
Note 2: Offsets help us map back to the original sentence; they store the beginning and ending index of every token.
3. Then, we create an array chars by assigning ones to the positions covered by the selected text and zeros to the rest.
4. Next, we find the offsets of each token, so that we know the position of every token, and store them in a list.
5. Finally, we find out which tokens of the sentence encoding also fall inside the selected text (so that we can compute the start and end indices from them).
That's all! Put the above logic in a for loop to tokenize the whole dataset; a rough sketch of such a loop is shown below, and after that we will build a CNN head on top of XLNet.
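Assuming a maximum sequence length MAX_LEN and a data frame train_df with the columns described earlier (all the names here are my own), that loop could look roughly like this:

MAX_LEN = 96  # assumed maximum sequence length
n = len(train_df)

input_ids      = np.full((n, MAX_LEN), special['<pad>'], dtype='int32')
attention_mask = np.zeros((n, MAX_LEN), dtype='int32')
start_tokens   = np.zeros((n, MAX_LEN), dtype='int32')   # one-hot encoded start index
end_tokens     = np.zeros((n, MAX_LEN), dtype='int32')   # one-hot encoded end index

for k in range(n):
    # ... run the per-sentence logic above on row k of train_df to get
    #     text_enc, senti_enc, start_idx and end_idx ...
    ids = text_enc + [special['<sep>']] + senti_enc + [special['<sep>'], special['<cls>']]
    ids = ids[:MAX_LEN]
    input_ids[k, :len(ids)] = ids
    attention_mask[k, :len(ids)] = 1
    start_tokens[k, start_idx] = 1
    end_tokens[k, end_idx] = 1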
Building Model:
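Here is a minimal sketch of a CNN head on top of XLNet, using TFXLNetModel from Hugging Face together with Keras layers; the exact layer sizes and the dropout rate are assumptions of mine rather than a prescribed architecture:

import tensorflow as tf
from transformers import TFXLNetModel

MAX_LEN = 96  # must match the length used during tokenization

def build_model():
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    xlnet = TFXLNetModel.from_pretrained('xlnet-base-cased')
    hidden = xlnet(ids, attention_mask=att)[0]          # (batch, MAX_LEN, hidden_size)

    # a small convolutional head for the start index ...
    x1 = tf.keras.layers.Dropout(0.1)(hidden)
    x1 = tf.keras.layers.Conv1D(1, 2, padding='same')(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    start = tf.keras.layers.Activation('softmax')(x1)   # one probability per position

    # ... and an identical one for the end index
    x2 = tf.keras.layers.Dropout(0.1)(hidden)
    x2 = tf.keras.layers.Conv1D(1, 2, padding='same')(x2)
    x2 = tf.keras.layers.Flatten()(x2)
    end = tf.keras.layers.Activation('softmax')(x2)

    return tf.keras.Model(inputs=[ids, att], outputs=[start, end])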
The above code builds a simple CNN head on top of XLNet, taking input_ids and attention_mask as inputs and producing the start and end indices as outputs.
That’s it! we are done.
This model can now be trained by choosing an optimizer (for example, the Adam optimizer) and a loss function (generally categorical cross-entropy).
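For instance, a minimal training call along those lines, reusing the arrays built in the tokenization loop (again, just a sketch with assumed hyperparameters):

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
              loss='categorical_crossentropy')

model.fit([input_ids, attention_mask],
          [start_tokens, end_tokens],
          epochs=3, batch_size=16)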
Conclusion:
I hope you enjoyed reading this post. If you have any queries, please do post them here. To see the full code, feel free to visit this GitHub repo.