Modeling using Hugging Face Transformers
We often model our data with scikit-learn for both supervised and unsupervised learning tasks, and we are familiar with its object-oriented design: instantiate a class, then call its methods. When I personally started using PyTorch, though, I found a design pattern that is similar to scikit-learn's but not quite the same.
To train a model with PyTorch, you have to create one class for your dataset and one for your model, each inheriting from a base class. For example, you define a class called TextDataset(Dataset), where Dataset comes from torch.utils.data. To create your model, you define a class such as Classifier(nn.Module), where nn.Module comes from torch.nn (usually imported as nn). Compared to scikit-learn, PyTorch expects you to build these classes yourself and put your own logic inside them.
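If you have never written these classes before, here is a minimal sketch of what they typically look like (a toy feed-forward classifier just for illustration, not the BERT model discussed later):

```python
import torch
from torch import nn
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Wraps already-encoded texts and their labels."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.encodings[idx], self.labels[idx]

class Classifier(nn.Module):
    """A tiny feed-forward classifier."""
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)
```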
Let’s talk about Transformers, the library developed by Hugging Face. I first noticed it when a research scientist in my LinkedIn network posted about newly added features. At the time I wasn’t planning to try the library right away; that changed later, when I became curious about the BERT architecture and wanted to implement it.
If you are new to PyTorch, start by building a few models with it, for example a text classifier or an image classifier; PyTorch has great documentation for that. Once you are comfortable training and evaluating models in PyTorch, you can pick up Transformers easily.
I started exploring Transformers from its documentation alone. I also follow the Hugging Face Twitter account, which often tweets about recent updates to their libraries and ongoing development, so it is easy for me to keep up with the changes. Have a look at the usage page in the Transformers documentation: it explains how to handle a variety of natural language processing applications, from sequence classification to neural machine translation.
In the deep learning community, we put a lot of emphasis on pre-trained models: models that have already been trained and can be reused for another, similar task, an approach also known as transfer learning. Hugging Face hosts pre-trained models from various developers and provides a platform for sharing them, so you can use them for your own task as well. Have a look at the models page to browse what is available!
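For example, pulling a pre-trained checkpoint from the hub only takes a couple of lines (bert-base-uncased is used here purely as an illustration):

```python
from transformers import AutoModel, AutoTokenizer

# Downloads the pre-trained weights and vocabulary from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
```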
After that, I wondered: what if I want to train my own model with the BERT architecture without using a pre-trained model, in other words, train it from scratch? Most blog posts and research papers discuss pre-trained models, usually with bert-base-uncased as the example. Then I remembered that PyTorch is quite different from Keras, which gives you fit and predict functions out of the box.
I began by training a text classifier in PyTorch on a Kaggle competition dataset; Real or Not? NLP with Disaster Tweets is a good one to start with. I had already used XGBoost and CatBoost and done some magical things with data cleansing and feature extraction, but the score would not go up. I figured BERT would push my leaderboard score higher, and it turned out the score rose to 83% mean F-score, good for the top 12% of the leaderboard, using bert-base-uncased!
To train the model from scratch, I create a function that generates the whole pipeline.
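A simplified sketch of such a pipeline function, assuming bert-base-uncased as the architecture and two labels, could look like this:

```python
from transformers import (AutoConfig, AutoTokenizer,
                          AutoModelForSequenceClassification)

def build_pipeline(model_name="bert-base-uncased", num_labels=2):
    # Configuration only -- no pre-trained weights are loaded for the model.
    config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
    # The tokenizer *is* pre-trained, so it ships with its own vocab.txt.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # from_config() initializes the weights randomly, i.e. training from scratch.
    model = AutoModelForSequenceClassification.from_config(config)
    return config, tokenizer, model
```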
As you know, Transformers needs three components to train your model or run inference with it. AutoConfig is used to set up the model and tokenizer configuration; you can change AutoConfig to BertConfig or any other architecture configuration available in Transformers. AutoTokenizer works the same way, except that the tokenizer here is pre-trained: I use the tokenizer from bert-base-uncased, which comes with its own vocabulary (you can inspect vocab.txt to see each token), so I did not build my own vocabulary. The last component, AutoModelForSequenceClassification, is loaded from the config because I want to train from scratch; AutoModel can likewise be changed to BertForSequenceClassification.
As a second step, I create a DisasterDataset class for loading the dataset.
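A stripped-down version of such a dataset class, assuming the tweets and labels come in as plain Python lists and that a vocab.txt file (for example the one shipped with bert-base-uncased) is available locally, might look like this:

```python
import torch
from torch.utils.data import Dataset
from tokenizers import BertWordPieceTokenizer

class DisasterDataset(Dataset):
    def __init__(self, texts, labels, vocab_file="vocab.txt", max_len=128):
        # Plain text in, no cleansing; tokenized with BertWordPiece.
        self.tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=True)
        self.tokenizer.enable_truncation(max_length=max_len)
        self.tokenizer.enable_padding(length=max_len)
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        encoding = self.tokenizer.encode(self.texts[idx])
        input_ids = torch.tensor(encoding.ids)
        attention_mask = torch.tensor(encoding.attention_mask)
        label = torch.tensor(self.labels[idx])
        return input_ids, attention_mask, label
```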
The script above builds the dataset class. I did not use any preprocessing or cleansing; I just take the plain text and tokenize it with the BertWordPiece tokenizer from the Tokenizers library.
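Putting the pieces together, a simplified training loop (assuming the model from the pipeline sketch above and DataLoaders built on top of DisasterDataset) might look like this:

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, val_loader, epochs=100, device="cuda"):
    model.to(device)
    optimizer = Adam(model.parameters(), lr=1e-4)
    scheduler = StepLR(optimizer, step_size=20, gamma=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        # ---- training ----
        model.train()
        correct, total = 0, 0
        for input_ids, attention_mask, labels in train_loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask=attention_mask).logits
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
        running_accuracy = 100.0 * correct / total

        # ---- validation ----
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                labels = labels.to(device)
                logits = model(input_ids, attention_mask=attention_mask).logits
                correct += (logits.argmax(dim=1) == labels).sum().item()
                total += labels.size(0)
        running_accuracy_val = 100.0 * correct / total

        scheduler.step()
        print(f"epoch {epoch}: train acc {running_accuracy:.1f}, "
              f"val acc {running_accuracy_val:.1f}")

        # Stop early once both accuracies pass 90%, as described below.
        if running_accuracy > 90 and running_accuracy_val > 90:
            break
```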
Finally, the script above trains the model. I use the Adam optimizer with a learning rate of 0.0001 and PyTorch's StepLR() scheduler with step_size set to 20 and gamma set to 0.01. For the criterion, I use CrossEntropyLoss(), even though the task is binary and would normally call for binary cross-entropy: the model returns a probability per class through a softmax activation, so cross-entropy over the two classes fits.
I ran the script on a Compute Engine instance with an NVIDIA K80 GPU. It was pretty fast to get a result, because I set training to stop as soon as running_accuracy > 90 and running_accuracy_val > 90, so the script never had to run the full 100 epochs to finish.
Conclusion
In the end, the BERT architecture is more useful and easier to use than ever with the Transformers library. At Jakarta Artificial Intelligence Research, an AI research community based in Jakarta, we mostly use PyTorch and Transformers to develop and run experiments for our projects. We are also trying to build our own encoder to surpass existing encoders such as the GPT and BERT architectures.
A couple of days before I wrote this article, Jakarta Artificial Intelligence Research published a preprint on paraphrase identification (also built on a Transformer) on arXiv, with impressive results. Check it out!