How to use BERT Question Answering in TensorFlow with NVIDIA GPUs

NVIDIA, published in NVIDIA AI on Nov 12, 2019

To experiment with BERT, or to learn more, consult this notebook, containing the full code.

What is BERT?

BERT, or Bidirectional Encoder Representations from Transformers, is a state-of-the-art NLP model. Think of it as a huge step forward in a complex area of Deep Learning that places even more importance on GPU infrastructure as a result of exponentially increasing compute complexity.

Detailed in this Google paper, BERT has achieved impressive results on tasks such as question answering (SQuAD v1.1) and natural language inference (MNLI). Google is now using BERT to serve search results, offering more contextually relevant results for your queries.

Drawing from the original Google implementation, we’ve created NVIDIA BERT, an optimized version that leverages mixed-precision arithmetic and tensor cores on NVIDIA Tesla V100 GPUs to enable reduced training times, while maintaining accuracy.

Here, and in the accompanying notebook, we’ll walk through a BERT example using TensorFlow and mixed-precision floating point mathematics.

Pre-Trained NVIDIA BERT in NGC

NGC provides state-of-the-art, optimized machine learning frameworks, as well as models and scripts that can help you accelerate your journey to AI and ensure that you're always using the best tools for the job.

In this example, we’ll use pre-trained BERT models from NGC. Several configurations are available, but we’ll be using the second of the following two (the FP16 variant), both of which have been trained on the SQuAD 2.0 dataset:

  1. bert_tf_v2_large_fp32_384
  2. bert_tf_v2_large_fp16_384

We then need to set a flag to ensure that we’re using the mixed-precision model. Not only will this take considerably less training time than the FP32 version, but it will do so without compromising accuracy.
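As a rough illustration, the flag can be as simple as a boolean that selects which NGC model name to use later on; the variable names below are our own, not the exact ones used in the notebook:

```python
# Hypothetical flag; the variable names here are illustrative,
# not the exact ones used in the NGC notebook.
use_mixed_precision_model = True

# Pick the matching pre-trained checkpoint name from NGC.
model_name = (
    "bert_tf_v2_large_fp16_384" if use_mixed_precision_model
    else "bert_tf_v2_large_fp32_384"
)
print(f"Using NGC model: {model_name}")
```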

To highlight the potential performance gain we can achieve using mixed-precision mode for the model, we’ve included the following benchmarks for both training and inference:

As you can see from these test results, using mixed precision not only reduces training time for a similar or better result, but it also reduces the time taken to perform inference.

With natural language question and answer systems, the time taken for the model to perform inference on unseen text is critical: people who routinely engage with such systems expect answers in a reasonable amount of time.

Using mixed precision for inference gave us a speedup of 2.74 sentences per second. This translates into more text analyzed and more users getting their results faster.

You can learn more about the performance implications of using FP16 or FP32 for training BERT on this NGC page.

Additionally, we can check whether the mixed-precision mode is being used and download the appropriate model from NGC to provide a cleaner user experience:
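A minimal sketch of what that check-and-download step could look like is below. The NGC base URL is a placeholder rather than the real download location, and the helper simply shells out to wget and unzip:

```python
import subprocess

# Placeholder base URL; the real NGC download links for these models differ.
NGC_BASE_URL = "https://example.com/ngc/models"  # hypothetical

def download_model(use_mixed_precision_model: bool) -> str:
    """Fetch the FP16 or FP32 BERT checkpoint archive, depending on the flag."""
    model_name = (
        "bert_tf_v2_large_fp16_384" if use_mixed_precision_model
        else "bert_tf_v2_large_fp32_384"
    )
    archive = f"{model_name}.zip"
    subprocess.run(["wget", "-O", archive, f"{NGC_BASE_URL}/{archive}"], check=True)
    subprocess.run(["unzip", "-o", archive, "-d", model_name], check=True)
    return model_name
```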

Beyond the pre-trained models, NGC also provides a collection of helper scripts in the model scripts registry. You can use these for dataset preprocessing, pretraining, and fine-tuning the model.

You can also use wget to grab the scripts directly from NGC and unzip them into your workspace. These scripts do a lot of the heavy lifting so that you can start using BERT faster.
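In Python, as in the notebook, that could look roughly like the following; the script archive URL is a placeholder, not the real NGC location:

```python
import subprocess
import zipfile

# Placeholder URL; substitute the actual NGC model-scripts download link.
SCRIPTS_URL = "https://example.com/ngc/model-scripts/bert_for_tensorflow.zip"  # hypothetical

subprocess.run(["wget", "-O", "bert_scripts.zip", SCRIPTS_URL], check=True)
with zipfile.ZipFile("bert_scripts.zip") as zf:
    zf.extractall("bert_scripts")  # run_squad.py and the other helpers land here
```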

BERT Inference: Question Answering

With the BERT model set up and tuned, we can now prepare to run an inference workload. This BERT model, trained on SQuAD 2.0, is ideal for Question Answering tasks. SQuAD 2.0 contains over 100,000 question-answer pairs on 500+ articles, as well as 50,000 unanswerable questions.

For this example, let’s explore the following paragraph about the Apollo program. We’ll test our BERT model by asking questions that correspond to this text:

The Apollo Program — “The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower’s administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy’s national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973–74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.”

Here’s a subset of the questions we’ll be using to test BERT:

  1. Which project put the first Americans into space?
  2. What year did the first manned Apollo flight occur?
  3. Who did the U.S. collaborate with on an Earth orbit mission in 1975?
  4. What is Apollo?
  5. How long did Project Apollo run?

You can also easily use a different paragraph for testing. Just change the notebook to provide a new context/question set relating to your new text. We’ve shown you how to do that in the notebook, but it’s a simple case of providing a JSON object with a new context and question set. Here’s an empty skeleton for reference:
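A skeleton along these lines, following the standard SQuAD JSON layout, might look like this (the placeholder strings are ours):

```python
# Skeleton SQuAD-style input; fill in your own context and questions.
squad_input = {
    "data": [
        {
            "title": "Your article title",
            "paragraphs": [
                {
                    "context": "Your paragraph of text goes here.",
                    "qas": [
                        {"question": "Your first question?", "id": "Q1"},
                        {"question": "Your second question?", "id": "Q2"},
                    ],
                }
            ],
        }
    ],
    "version": "2.0",
}
```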

Since we’re simply constructing a dictionary with a context and a list of questions, it’s also possible to have users supply their questions at runtime. The following code snippet retrieves a list of user questions and appropriately formats them so that they can be supplied for our BERT model.
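A minimal sketch of that idea is below; it collects questions interactively, wraps them in the SQuAD-style structure shown above, and writes the result to a predict file (the file name is our own choice):

```python
import json

context = "The Apollo program, also known as Project Apollo, was ..."  # your paragraph

# Collect questions from the user until they enter a blank line.
questions = []
while True:
    q = input("Enter a question (blank line to finish): ").strip()
    if not q:
        break
    questions.append(q)

# Format the context and questions as a SQuAD-style predict file for BERT.
squad_input = {
    "data": [{
        "title": "User supplied",
        "paragraphs": [{
            "context": context,
            "qas": [{"question": q, "id": f"Q{i}"}
                    for i, q in enumerate(questions, start=1)],
        }],
    }],
    "version": "2.0",
}

with open("input.json", "w") as f:
    json.dump(squad_input, f)
```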

Once we’ve set up BERT and supplied a context (a paragraph of text) and a question set, we’re ready to run our inference workload. Among the helper scripts we downloaded earlier is run_squad.py, which simplifies the process.

Now, we need to supply the following parameters to the run_squad.py script:

--bert_config_file: the JSON config file that specifies the pretrained BERT model architecture.
--vocab_file: the vocabulary file used to train the BERT model.
--init_checkpoint: the starting, or initial, checkpoint from a pretrained BERT model.
--output_dir: the write location for the output weight checkpoints.
--do_predict: the execution mode of the model (training vs. inference).
--predict_file: the file on which to perform inference.
--prediction_path: where to store the inference results.
--doc_stride: when splitting a long document into chunks, how much stride to take between chunks.
--max_seq_length: the maximum total input sequence length after WordPiece tokenization. Sequences longer than this are truncated, and shorter sequences are padded.

The notebook sets up these parameters ahead of time, so it should simply be a case of executing this code block to run inference:
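The invocation looks roughly like the following; the paths and values are placeholders for wherever you unpacked the model and scripts, not the exact ones from the notebook:

```python
import subprocess

# Placeholder paths; point them at your downloaded model, scripts, and predict file.
subprocess.run([
    "python", "run_squad.py",
    "--bert_config_file=bert_tf_v2_large_fp16_384/bert_config.json",
    "--vocab_file=bert_tf_v2_large_fp16_384/vocab.txt",
    "--init_checkpoint=bert_tf_v2_large_fp16_384/model.ckpt",
    "--output_dir=./results",
    "--do_predict=True",
    "--predict_file=input.json",
    "--prediction_path=./results/predictions",
    "--doc_stride=128",
    "--max_seq_length=384",
], check=True)
```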

Lastly, calling display_results(predict_file, output_prediction_file) will reveal BERT’s answers to our questions:
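display_results is a small helper defined in the notebook; conceptually it does something like the sketch below, assuming run_squad.py writes a standard SQuAD-style predictions file that maps question IDs to answer strings:

```python
import json

def display_results(predict_file, output_prediction_file):
    """Print each question alongside the answer BERT predicted for it (sketch)."""
    with open(predict_file) as f:
        squad_input = json.load(f)
    with open(output_prediction_file) as f:
        predictions = json.load(f)  # assumed: maps question id -> answer string

    for article in squad_input["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                print(f"Q: {qa['question']}")
                print(f"A: {predictions.get(qa['id'], '(no answer)')}\n")
```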

Conclusion

Using the pre-trained BERT model available through NGC makes powerful, real-time question-and-answer capabilities accessible to everyone. For the full code, or when you’re ready to experiment with your own Q&A system, consult this notebook. If you’d like to find out more about NGC, or get started with your own BERT model, visit the NGC Model Registry.

Authors:

Chris Parsons, Product Manager, NVIDIA

Ryan McCormick, Software Engineer, NVIDIA
