GluonNLP 0.6: Closing the Gap in Reproducible Research with BERT

Eric Haibin Lin
Published in Apache MXNet
Mar 19, 2019


BERT (Bidirectional Encoder Representations from Transformers) is arguably the most notable pre-training model in natural language processing (NLP). For instance, BERT lifts the score on the GLUE benchmark, which covers 9 different NLP tasks, from 72.8 to 80.5, the biggest recent advancement[6].

Exciting as BERT is, no open source implementation has so far managed to simultaneously

  • enable scalable pre-training with GPUs;
  • reproduce results on various tasks;
  • support model exporting for deployment.

We are therefore releasing GluonNLP 0.6, which addresses these pain points by i) pre-training BERT with 8 GPUs in 6.5 days; ii) reproducing multiple natural language understanding results; and iii) streamlining deployment.

GluonNLP’s mission is to provide a one-stop shop for easily prototyping deep learning models for NLP.

We Pre-train BERT with 8 GPUs in 6.5 Days

In case you live under a rock and have not heard of BERT, here is how it works. It 1) uses stacked bidirectional transformer encoders, 2) learns parameters by masked language modeling and next sentence prediction on large corpora with self-supervision, and 3) transfers these learnt text representations to specific downstream NLP tasks with a small set of labeled data by fine-tuning.
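To make the masked language modeling step concrete, here is a minimal sketch in plain Python. It is not GluonNLP's data pipeline: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are assumptions made for illustration. The 15% masking rate and the 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged) follow the description in the BERT paper[4]. Special tokens and subword handling are left out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict, and None everywhere else. Hypothetical helper, not a
    GluonNLP API.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Simplification: pick each position independently with mask_prob.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels

tokens = "the man went to the store to buy a gallon of milk".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=0)
print(masked)
print(labels)
```

The model only receives `masked` as input and is trained to recover the non-`None` entries of `labels`, which is what lets it learn from unlabeled text.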


You may wonder: the official BERT repository has already released multiple pre-trained models for free, the result of many TPU hours, so why should we still care about pre-training BERT? Because the choice of pre-training corpus matters a great deal. As in any other transfer learning setting, the model is more likely to perform well if the pre-training data is close to the task at hand. For example, pre-training on Wikipedia may not help us do better on tweets because of the difference in language style.

We pre-trained the BERT Base model from scratch. We used an English Wikipedia data dump that contains 2.0 billion words after removing images and tables, and a books corpus dataset that contains 579.5 million words after de-duplication. With mixed precision training and gradient accumulation, pre-training the BERT Base model takes 6.5 days on 8 NVIDIA V100 (Volta) GPUs and achieves the following results on validation sets.
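For readers unfamiliar with gradient accumulation, the idea is to split a large batch into micro-batches that fit in GPU memory, sum their scaled gradients, and apply a single update. The sketch below uses a toy scalar model `y = w * x` with mean squared error in plain Python. The model, the data, and the helper names are assumptions for illustration, not the GluonNLP training loop, and mixed precision is not shown.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is scaled by 1 / num_steps so that the
    accumulated value matches the gradient of the full batch.
    """
    assert len(batch) % micro_batch_size == 0
    num_steps = len(batch) // micro_batch_size
    total = 0.0
    for k in range(num_steps):
        micro = batch[k * micro_batch_size:(k + 1) * micro_batch_size]
        total += grad_mse(w, micro) / num_steps
    return total

# Toy data generated from y = 3x + 1 (hypothetical values).
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
w = 0.5
full = grad_mse(w, data)
accum = accumulated_grad(w, data, micro_batch_size=2)
print(full, accum)
```

Both numbers agree up to floating point error, which is why accumulation lets a small number of GPUs emulate a larger batch size at the cost of more steps per update.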

We Reproduce BERT Fine-tuning for NLU Tasks with Scripts and Logs Available

Promoting reproducible research is one of the important goals of GluonNLP. We provide both training scripts and logs that replicate state-of-the-art results on RTE[6], MNLI[8], SST-2, MRPC[10], SQuAD 1.1[9] and SQuAD 2.0[11]. Our code is modularized so that BERT can be applied to many tasks within one framework.

We report F1 and exact match scores on the validation set for question answering datasets:
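As a reminder of what these two metrics measure, here is a minimal sketch modeled on the commonly used SQuAD-style evaluation: answers are lowercased, stripped of punctuation and English articles, and then compared either exactly or by token overlap. This is an illustrative approximation, not the official evaluation script or GluonNLP code. In particular, it compares against a single reference answer, whereas SQuAD evaluation takes the best score over several references.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower."))
print(f1_score("in Paris France", "Paris"))
```

Exact match is all-or-nothing, while F1 gives partial credit when the predicted span overlaps the reference, so F1 is always at least as high as exact match.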

Below are the accuracies on validation sets for the following sentence classification tasks with the BERT Base model:

We Streamline BERT Deployment

With the power of MXNet, we provide a BERT model that can be serialized into JSON format and deployed in C++, Java, Scala, and many other languages. With float16 support, we see an approximately 2x speedup for BERT inference on GPUs. We are also working on int8 quantization on CPUs.
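One reason float16 is attractive is that every value takes 2 bytes instead of 4, so less data has to be moved per operation; the price is reduced precision. The snippet below is a small standard-library sketch of that trade-off. It says nothing about how MXNet implements float16 inference, and the helper name `to_float16_and_back` is made up for illustration.

```python
import struct

def to_float16_and_back(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

bytes_fp32 = len(struct.pack("<f", 0.1))  # single precision storage size
bytes_fp16 = len(struct.pack("<e", 0.1))  # half precision storage size
roundtrip = to_float16_and_back(0.1)

print(bytes_fp32, bytes_fp16)
print(roundtrip, abs(roundtrip - 0.1))
```

The round-trip value differs slightly from 0.1, which is the kind of rounding error a float16 model has to tolerate in exchange for the smaller footprint.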

How to Get Started?

To get started with BERT using GluonNLP, visit our tutorial that walks through the code for fine-tuning BERT for sentence classification. You can also check out our BERT model zoo for BERT pre-training scripts, and fine-tuning scripts for SQuAD and GLUE benchmarks.

For other new features added in GluonNLP, please read our release notes. We are also working on BERT distributed training enhancements, and bringing GPT-2, BiDAF[12], QANet[3], BERT for NER/parsing, and many more to GluonNLP.

Happy BERTing with GluonNLP 0.6!


We thank the GluonNLP community for their great contributions: @haven-jeon @fiercex @kenjewu @imgarylai @TaoLv @Ishitori @szha @astonzhang @cgraywang


[1] Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018).

[2] Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018).

[3] Yu, Adams Wei, et al. “QANet: Combining local convolution with global self-attention for reading comprehension.” arXiv preprint arXiv:1804.09541 (2018).

[4] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[5] Sebastian Ruder. NLP’s ImageNet moment has arrived, 2018 (accessed November 1, 2018). URL

[6] Wang, Alex, et al. “GLUE: A multi-task benchmark and analysis platform for natural language understanding.” arXiv preprint arXiv:1804.07461 (2018).

[7] Liu, Xiaodong, et al. “Multi-Task Deep Neural Networks for Natural Language Understanding.” arXiv preprint arXiv:1901.11504 (2019).

[8] Williams, Adina, Nikita Nangia, and Samuel R. Bowman. “A broad-coverage challenge corpus for sentence understanding through inference.” arXiv preprint arXiv:1704.05426 (2017).

[9] Rajpurkar, Pranav, et al. “SQuAD: 100,000+ questions for machine comprehension of text.” arXiv preprint arXiv:1606.05250 (2016).

[10] Dolan, Bill, Chris Brockett, and Chris Quirk. “Microsoft research paraphrase corpus.” Retrieved March 29 (2005): 2008

[11] Rajpurkar, Pranav, Robin Jia, and Percy Liang. “Know What You Don’t Know: Unanswerable Questions for SQuAD.” arXiv preprint arXiv:1806.03822 (2018).

[12] Tuason, Ramon, Daniel Grazian, and Genki Kondo. “BiDAF Model for Question Answering.”