BERT — Multi-class Text Classification on your dataset

KD · Published in Analytics Vidhya · Nov 10, 2019
Ruins of the ancient Nalanda University in Bihar, India

I was working on multi-class text classification for one of my clients, and I wanted to evaluate my current model's accuracy against BERT sequence classification.

And that’s how all of it started for me. Given the popularity BERT enjoys, I was sure I would find some ready-made code online to compare results against my existing classification model. But, surprisingly, I could not find anything handy and ended up researching it myself.

So, to save time for others, I decided to write this article for anyone who wants to use BERT for multi-class text classification on their own dataset.

Thanks to “Hugging Face” for the transformers library, which is now available for PyTorch. You can start by installing the transformers package with pip; the pre-trained model weights are downloaded automatically on the first run of the code.

I have used bert-base-uncased as the model, so the rest of this write-up is about this pre-trained model.

In order to use pre-trained models, you need to stick to the sequence featurization that the model expects. Read the following functions in the given order to understand the vectorization steps.

Creating a dataset from your training/valid/test examples
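Here is a minimal sketch of this step, not my exact repo code: it reads each CSV split (the "texts" and "labels" columns described later in this post) into simple example objects. The file names and the InputExample fields are assumptions for illustration.

```python
import csv
from dataclasses import dataclass

@dataclass
class InputExample:
    guid: str
    text: str
    label: str

def read_examples(csv_path, split_name):
    """Load one CSV split (with 'texts' and 'labels' columns) into examples."""
    examples = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            examples.append(
                InputExample(guid=f"{split_name}-{i}",
                             text=row["texts"],
                             label=row["labels"])
            )
    return examples

train_examples = read_examples("train.csv", "train")
valid_examples = read_examples("valid.csv", "valid")
test_examples  = read_examples("test.csv", "test")
```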

Before you actually run the training exercise, it is important to understand the following function; this is the function that needs to be modified according to the model you plan to use. It is also important to use the model-specific tokenizer, so be careful.

The following featurization code is specific to the bert-base-uncased model.
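A hedged sketch of that featurization, assuming the InputExample objects from the previous snippet; the real implementation in my repo may differ in detail. The key point is that bert-base-uncased needs its own tokenizer, the [CLS]/[SEP] special tokens, padding to a fixed maximum sequence length, and integer label ids.

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# map class names to integer ids, derived from the training split
label_list = sorted({ex.label for ex in train_examples})
label_map = {label: i for i, label in enumerate(label_list)}

def featurize(examples, max_seq_length=128):
    """Turn examples into padded input ids, attention masks and label ids."""
    input_ids, attention_masks, label_ids = [], [], []
    for ex in examples:
        enc = tokenizer.encode_plus(
            ex.text,
            add_special_tokens=True,      # adds [CLS] and [SEP]
            max_length=max_seq_length,
            padding="max_length",
            truncation=True,
        )
        input_ids.append(enc["input_ids"])
        attention_masks.append(enc["attention_mask"])
        label_ids.append(label_map[ex.label])
    return (torch.tensor(input_ids),
            torch.tensor(attention_masks),
            torch.tensor(label_ids))
```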

Vectorizing the data

Once we have the vectorized dataset, we can move on to the training step. All other steps, like running for a certain number of epochs and saving the models for evaluation, remain the same.
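A minimal training sketch under the same assumptions as the snippets above; the batch size, learning rate and checkpoint naming are illustrative, not my exact settings.

```python
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import AdamW
from transformers import BertForSequenceClassification

train_ids, train_masks, train_labels = featurize(train_examples)
train_loader = DataLoader(
    TensorDataset(train_ids, train_masks, train_labels),
    batch_size=32, shuffle=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_list)
).to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(10):
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()   # cross-entropy loss over the classes
        optimizer.step()
    # keep a checkpoint per epoch for later evaluation
    model.save_pretrained(f"checkpoint-epoch-{epoch}")
```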

In my case, I had around 10,000 training and 2,000 validation sentences. I ran 10 epochs to get accuracy improvements over my existing classification model.

Running on your own data

You can directly run this code or notebook on your own dataset by arranging your data as discussed below:

  • Install the dependencies like transformers, torch, etc. (I have mentioned the list in the repo.)
  • Split your data into the usual three categories, “train, valid, and test”, and store each as a CSV file (see the sketch after this list).
  • Each CSV file should have at least two columns, named “texts” and “labels”.
  • You must have guessed that “texts” should contain your sentences and “labels” should contain their class / category.
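A hypothetical sketch of arranging raw data into the three CSV splits; the column names “texts” and “labels” match what the code expects, while the split ratios and the dummy data are only placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# replace these with your own sentences and their class labels
sentences  = [f"example sentence {i}" for i in range(100)]
categories = ["class_a" if i % 2 == 0 else "class_b" for i in range(100)]

df = pd.DataFrame({"texts": sentences, "labels": categories})

# 70% train, 15% valid, 15% test, stratified by class
train_df, rest_df = train_test_split(df, test_size=0.3,
                                     stratify=df["labels"], random_state=42)
valid_df, test_df = train_test_split(rest_df, test_size=0.5,
                                     stratify=rest_df["labels"], random_state=42)

train_df.to_csv("train.csv", index=False)
valid_df.to_csv("valid.csv", index=False)
test_df.to_csv("test.csv", index=False)
```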

Following are my hyper-parameters for the BERT evaluation. Once you are able to run with the default values and evaluate, I would suggest playing around with “per_gpu_train_batch_size”, “max_seq_length” and “learning_rate”, then rerunning and comparing the results.
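As an illustration only, the hyper-parameters follow the run_glue.py-style argument names from the transformers repo; the exact values in my repo may differ.

```python
# illustrative defaults, not the author's exact configuration
hyperparams = {
    "model_name_or_path": "bert-base-uncased",
    "max_seq_length": 128,           # try 64 or 256 depending on sentence length
    "per_gpu_train_batch_size": 32,  # lower this if you run out of GPU memory
    "learning_rate": 2e-5,           # typical BERT fine-tuning range: 2e-5 to 5e-5
    "num_train_epochs": 10,
    "warmup_steps": 0,
    "weight_decay": 0.0,
}
```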

My work is committed here; feel free to fork and comment.

Running other pre-trained models

You need to tweak my code a bit to run other available models like XLNet, GPT-2 or RoBERTa. In case you need help, do reach out to me. However, I am sure you will need to adjust the vectorization step, as mentioned above.
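A hedged sketch of swapping in another pre-trained model using the transformers Auto classes; note that each model family has its own tokenizer and special tokens, so the featurization step above must change accordingly. The model name and sequence length here are just examples.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"   # or "xlnet-base-cased", "gpt2", ...
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(label_list)
)

# the matching tokenizer handles that model's special tokens and padding
enc = tokenizer("a sample sentence", padding="max_length",
                max_length=128, truncation=True, return_tensors="pt")
print(enc["input_ids"].shape)
```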

Please clone the transformers repository; for binary or multi-class classification specifically, look into the “run_glue.py” code.
