Running Pytorch-Transformers on Custom Datasets

Nikhil Utane
Jul 24, 2019


Includes code and test results for IMDB movie review sentiment classification


I have been a long-time fastai student/user. I find it to be one of the best ways to learn about ML/DL and build SOTA models with as few resources as possible. Like many, I am grateful to Jeremy Howard, Rachel Thomas, and the entire fastai team for this awesome piece of work.

However, exactly a week ago, @huggingface released pytorch-transformers, a library of state-of-the-art pretrained models for Natural Language Processing (NLP), and it caused quite a flutter in the NLP world.

This was a BIG moment after all. Years of excruciating work by top researchers from various companies had been distilled into a simple-to-use library that people like me can use to build world-class applications.

What a time to be alive indeed !!

Now, knowing two libraries is better than one, so I decided to get familiar with it. The first thing I noticed is that while it certainly provides an easy way to run the various models (BERT, GPT, Transformer-XL, XLNet, XLM) on standard benchmarking datasets (GLUE, SQuAD), there weren’t any clear instructions on how to run them on your own dataset. So I decided to do that myself. I chose the IMDB movie reviews dataset since it makes it easy to compare against the current SOTA.

With that background, let’s jump straight into some code.

Code

GitHub links: the pytorch-transformers repo & my extension code

There’s one other thing that bothered me about the way the code is structured: code duplication. Separate scripts are written for the two example datasets even though most of the code to train and evaluate is common. While I understand the rationale behind this, which is to provide a reference implementation and therefore keep the code simple and self-contained, I would have preferred to only write a new function that processes the input data into the form expected by the underlying API for every new dataset.

I have kept the code structure as similar to the original one as possible.

run_dataset.py: Minimal changes. Here’s the diff between this and run_glue.py.

utils_dataset.py: Added a new ImdbProcessor class to represent the IMDB dataset. A similar processor needs to be added for every new dataset, as sketched below.
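
For reference, here is a minimal sketch of what such a processor could look like. It assumes the DataProcessor and InputExample classes from the examples’ utils_glue.py and the standard aclImdb folder layout; the actual ImdbProcessor in utils_dataset.py may differ in its details.

import os
from utils_glue import DataProcessor, InputExample  # assumed location of the base classes

class ImdbProcessor(DataProcessor):
    """Reads the standard aclImdb layout: (train|test)/(pos|neg)/*.txt."""

    def get_train_examples(self, data_dir):
        return self._create_examples(os.path.join(data_dir, "train"), "train")

    def get_dev_examples(self, data_dir):
        # The labelled test split is used here for evaluation.
        return self._create_examples(os.path.join(data_dir, "test"), "dev")

    def get_labels(self):
        return ["pos", "neg"]

    def _create_examples(self, split_dir, set_type):
        examples = []
        for label in self.get_labels():
            folder = os.path.join(split_dir, label)
            for i, fname in enumerate(sorted(os.listdir(folder))):
                with open(os.path.join(folder, fname), encoding="utf-8") as f:
                    text = f.read().strip()
                guid = "%s-%s-%d" % (set_type, label, i)
                examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))
        return examples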

Execution

  1. To execute pytorch-transformers on the IMDB dataset, download the above two files into a folder of your choice.
  2. Set the IMDB_DIR environment variable to point to where your IMDB dataset is present, e.g. export IMDB_DIR=~/data/aclImdb
  3. Run command:
$ python run_dataset.py --task_name imdb --do_train --do_eval --do_lower_case --data_dir $IMDB_DIR/ --model_type bert --model_name_or_path bert-base-uncased --max_seq_length 128 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /tmp/imdb_output/

Observations

I used a single Tesla V100 GPU on GCP. So far I have only experimented with the BERT model. Following is a quick summary of the results obtained during different runs.

In short, a higher max_seq_length affects the accuracy more than anything else. This is because many IMDB reviews have between 300 and 400 words. I could not test the bert-large-uncased model with a max_seq_length greater than 256 due to CUDA out-of-memory errors.

Fine-tuning of BERT Language Model

I used the unsupervised data (the train/unsup folder) from the IMDB dataset to fine-tune the language model.

BERT uses a masked language model to enable training of bidirectional representations, and also adds a next-sentence-prediction task to improve sentence-level understanding. It expects the training corpus to be in a specific format: each sentence of a document goes on its own line, with blank lines separating different documents. This allows creation of both positive and negative training pairs for next-sentence prediction. See here for a sample document.
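
To make the format concrete, a tiny made-up corpus (not taken from the actual data) with two single-review documents would look like this:

This movie was an absolute delight from start to finish.
The lead actor carries every scene he is in.
I would happily watch it again.

The plot made no sense and the pacing was painfully slow.
I walked out halfway through the film.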

In the IMDB dataset, every movie review is in a separate file, so we first need to convert the reviews into the format mentioned above. The script ‘sentence_segmentation.py’ does just that using spaCy. It also removes HTML tags as well as the quote (") and asterisk (*) characters, which were creating problems during segmentation. Additionally, there were a few lines consisting of only a single special character; these would later trigger an assertion that expects at least one valid token on every line.

The Python script can be executed as below:

$ python sentence_segmentation.py --input_dir ~/data/aclImdb/train/unsup --output_file imdb_corpus.txt
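
For illustration, here is a rough sketch of what the core of such a script could look like, assuming spaCy’s en_core_web_sm model is installed; the cleaning rules in the actual sentence_segmentation.py may differ.

import argparse
import os
import re

import spacy

nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])  # the parser provides sentence boundaries

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)             # drop HTML tags such as <br />
    return text.replace('"', " ").replace("*", " ")  # characters that confused segmentation

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", required=True)
    parser.add_argument("--output_file", required=True)
    args = parser.parse_args()

    with open(args.output_file, "w", encoding="utf-8") as out:
        for fname in sorted(os.listdir(args.input_dir)):
            with open(os.path.join(args.input_dir, fname), encoding="utf-8") as f:
                doc = nlp(clean(f.read()))
            for sent in doc.sents:        # one sentence per line
                line = sent.text.strip()
                if line:
                    out.write(line + "\n")
            out.write("\n")               # blank line between documents

if __name__ == "__main__":
    main()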

The single-character lines can be removed using the below ‘sed’ command (the pattern ^.$ matches lines containing exactly one character):

$ sed '/^.$/d' imdb_corpus.txt > imdb_corpus_1.txt

Once you have the final corpus ready, you can follow the instructions in the pytorch-transformers repo under examples/lm_finetuning (see its README) to fine-tune the language model. The commands I used are given below:

# First command to generate training data:
# ==========================================
$ python ~/huggingface/pytorch-transformers/examples/lm_finetuning/pregenerate_training_data.py --train_corpus lm_finetuning/imdb_corpus_1.txt --bert_model bert-base-uncased --do_lower_case --output_dir lm_finetuning/training --epochs_to_generate 3 --max_seq_len 128
# Second command to create the fine-tuned language model:
# =======================================================
$ python ~/huggingface/pytorch-transformers/examples/lm_finetuning/finetune_on_pregenerated.py --pregenerated_data lm_finetuning/training/ --bert_model bert-base-uncased --do_lower_case --output_dir lm_finetuning/finetuned_lm/ --epochs 3
# Finally, re-train on fine-tuned model:
# ======================================
$ python run_dataset.py --task_name imdb --do_train --do_eval --do_lower_case --data_dir $IMDB_DIR/ --model_type bert --model_name_or_path ~/nlp_projects/examples/lm_finetuning/finetuned_lm --max_seq_length 512 --learning_rate 2e-5 --num_train_epochs 4.0 --gradient_accumulation_steps=4 --output_dir /tmp/imdb_output_11/ --save_steps 1000

With this I saw the accuracy go up to 94.5%. Not bad, considering the fine-tuning was done for the bert-base model and not bert-large. I believe bert-large with a max_seq_length of 512 will give the best results, but that will require multiple GPUs along with mixed-precision mode (FP16). I could also use the training as well as the test data from the IMDB dataset for fine-tuning.

Note: I faced an issue running “finetune_on_pregenerated.py”. I later saw that the script had been updated but the master branch hadn’t. If you face any issue running it, download the latest version of the file from the repo.

Running in Mixed-Precision mode (FP16)

After a bit of a struggle, I could finally get FP16 mode working. First up, you need to install the CUDA toolkit and then make sure the path to the ‘nvcc’ utility is added to $PATH (check using nvcc --version).

After that, I faced trouble while installing apex.

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
ERROR: Command "/home/nikhil_subscribed/anaconda3/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-req-build-ft6absjv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-5uvui08_/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-ft6absjv/

To get around this, I commented out the below check in setup.py. After that the installation went through without any other issues and I could train in FP16 mode.

"""if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
"not match the version used to compile Pytorch binaries. " +
"Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
"In some cases, a minor-version mismatch will not cause later errors: " +
"https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
"You can try commenting out this check (at your own risk).")"""

Next Steps

While it is tempting to chase higher and higher accuracy numbers, and a combination of distributed training, FP16, the bert-large model, and a max_seq_length of 512 should definitely deliver them, at this point I want to focus on learning to build and deploy end-to-end applications. So next up I’ll be looking into deploying these models into production.

Further Reading

I didn’t talk about “What is a Transformer?” and the like in this post, not only because I am still wrapping my head around it myself, but also because nobody can explain it as well as Jay Alammar has done in this post.

Hope you found this article useful. If you have any questions, feel free to reach out to me in the comments below.

A BIG BIG thanks to the Huggingface team for making this remarkable library available to all !!!

Edit: If you enjoyed reading this, please see my follow-up post where I deployed the fine-tuned model (along with ULMFiT) on Google App Engine.
