What you should know first:
This post is not finished; I will keep adding details over the following days. It is also written as an ongoing test report on BERT with fastai: I felt I would lose the courage to write a post if I waited until the whole project was finished. There may be mistakes, since I will frequently report back my findings as I go.
Target audience: people who want to integrate BERT into fastai for downstream NLP tasks.
There is a great post about how this has been done previously; please check out that prior work, as most of my understanding follows the ideas in it.
What I have changed:
1. Datablock API. The factory method is great, but you often reach a moment where some minor things need to change. As I learned on the fastai vision side, when you want a customized model architecture you often need to modify the datablock API to suit your model (such as object detection or segmentation).
2. To be added.
How to load BERT with the datablock API.
NLP is a bit different from computer vision. In images, your input is already numeric: an un-normalized stack of 3 matrices. What I want to say here is, they are numbers.
In text, you just have words. Whether the model is a CNN or an RNN, underneath it wants numbers. Therefore we have to turn strings/words into numbers.
You often hear people talking about tokenization and numericalization; these are the most important steps to prepare image-like (numeric) input for an NLP model. However, things are a little different across languages.
English is great, as you can see reading this post: it is space-separable. However, languages like Chinese are not, so if the intended work focuses on those languages, SentencePiece is often used. I will keep my work focused on English.
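To make the two steps concrete, here is a toy sketch in plain Python. This is not fastai's or BERT's actual implementation; the vocabulary and the `xxunk` fallback token are made up for illustration. Tokenization splits a string into tokens, and numericalization maps each token to an integer id via a vocabulary:

```python
def tokenize(text):
    # naive whitespace tokenization -- enough for space-separable English
    return text.lower().split()

def numericalize(tokens, vocab):
    # unknown words fall back to the id of the "xxunk" token
    return [vocab.get(tok, vocab["xxunk"]) for tok in tokens]

# a tiny made-up vocabulary
vocab = {"xxunk": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

tokens = tokenize("The movie was great")
ids = numericalize(tokens, vocab)
print(tokens)  # ['the', 'movie', 'was', 'great']
print(ids)     # [1, 2, 3, 4]
```

The output ids are what actually gets fed to the model, which is why the vocabulary used at tokenization time and at numericalization time must match.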
The great thing is, we can use BERT's own tokenizer and vocabulary.
'''Code taken from
https://medium.com/@abhikjha/fastai-integration-with-bert-a0a66b1cecbe'''
from pytorch_pretrained_bert import BertTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")  # model name as used in the referenced post
class FastaiBertTokenizer(BaseTokenizer):
    '''wrapper for fastai tokenizer'''
    def __init__(self, tokenizer, max_seq=128, **kwargs):
        self._pretrained_tokenizer = tokenizer
        self.max_seq_length = max_seq

    def __call__(self, *args, **kwargs):
        return self

    def tokenizer(self, t):
        return ["[CLS]"] + self._pretrained_tokenizer.tokenize(t)[:self.max_seq_length - 2] + ["[SEP]"]
There is nothing magic here: BERT doesn't speak our fastai language. It doesn't use fastai's bos/eos tokens; it speaks [CLS] and [SEP]. Sure, let's add them. Since these take up two token slots, the allowed sequence length has to be cut by two, which is why you see the magic number 2 in the last line.
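The wrapping logic can be sketched standalone, with a plain token list standing in for the output of BERT's WordPiece tokenizer (`wrap_for_bert` is a hypothetical helper for illustration, not part of any library):

```python
def wrap_for_bert(tokens, max_seq=128):
    # reserve 2 of the max_seq slots for the special tokens,
    # hence the truncation to max_seq - 2
    return ["[CLS]"] + tokens[:max_seq - 2] + ["[SEP]"]

tokens = "a b c d e".split()
print(wrap_for_bert(tokens, max_seq=5))       # ['[CLS]', 'a', 'b', 'c', '[SEP]']
print(len(wrap_for_bert(tokens, max_seq=5)))  # 5
```

Without the `- 2`, a sentence at the length limit would come out two tokens too long once [CLS] and [SEP] are attached.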
fastai_bert_vocab = Vocab(list(bert_tok.vocab.keys()))
fastai_tokenizer = Tokenizer(tok_func=FastaiBertTokenizer(bert_tok, max_seq=256),
                             pre_rules=[], post_rules=[])
The pre/post rules are fastai's text-processing hooks; since we are using a tokenizer fastai doesn't support natively, we leave them empty and handle things manually.
Once this is done, we are ready. The only thing we changed is how the text is processed: we use the BERT tokenizer, not a fastai one (actually fastai doesn't have its own tokenizer; it uses spaCy).
processor = [OpenFileProcessor(),
             TokenizeProcessor(tokenizer=fastai_tokenizer, include_bos=False),
             NumericalizeProcessor(max_vocab=40000)]
Feel free to use any max_vocab; running the cell below, I figured that 30000 might not be a good number.
len(fastai_bert_vocab.itos)  # outputs 30522 on the IMDB train set

data = TextList.from_folder(path/'train', vocab=fastai_bert_vocab, processor=processor)
data = data.split_by_rand_pct()
data = data.label_from_folder()
data = data.databunch(bs=64)
Here, the first task is done.
- Load BERT as encoder, create a decoder for classification.
- Fine-tune the BERT language model on a domain corpus, create a decoder for classification.
- Port GPT-2 as encoder. Test nucleus sampling.