MLPerf: Getting your feet wet with benchmarking ML workloads

Ramesh Radhakrishnan
Published in Analytics Vidhya · Nov 24, 2019
Benchmark suite for measuring training and inference performance of ML hardware, software, and services

This article covers the steps involved in setting up and running one of the MLPerf training benchmarks. It should give the reader a basic understanding of how to get started with MLPerf efficiently by leveraging the work done by previous submitters. MLPerf is becoming the de-facto ML benchmark suite for experiments comparing different types of specialized infrastructure or software frameworks, and the steps below will also be useful to anyone preparing to submit their own MLPerf results. Whatever the end goal, let’s get started with executing MLPerf benchmarks on your hardware, whether it is on-premises or in the cloud.

In this first post of the series, we will cover how to run one of the MLPerf training benchmarks on a GPU server. Training a complex Deep Neural Network (DNN) model requires an enormous amount of compute resources to complete in a reasonable amount of time. A comprehensive overview of the MLPerf Training benchmark suite and its submission and review process is given in the paper published by the MLPerf community¹.

System under test (hardware environment): I use a GPU server equipped with 4 NVIDIA Tesla V100 PCIe GPU accelerators to demonstrate how to run one of the MLPerf training benchmarks.
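
As a quick sanity check (not part of the MLPerf scripts), you can list the accelerators visible on the system under test before setting anything else up:

# Lists every GPU the driver can see; on my server this shows four Tesla V100 PCIe cards
nvidia-smi -L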

Software Environment: MLPerf requires that all benchmark submissions use containers to package the software dependencies and provide scripts to download and prepare the dataset. This enables the community to easily replicate or run the benchmark on other systems. If you are not familiar with Docker, there are numerous resources and tutorials to get started down this path, such as this tutorial from the Microsoft ML team.

The benchmark can be run with the following steps:

  1. Set up Docker and software dependencies on the system under test. There are various resources on the web for this; for the GPU server I had to install Docker and nvidia-docker (see the quick check after this list). Some benchmarks may have additional setup, mentioned in their READMEs.
  2. Download the software repository for the benchmark, which includes the code, scripts and documentation necessary to run the benchmark, from the MLPerf GitHub repo: https://github.com/mlperf
  3. Download and verify the dataset using the scripts provided in the benchmark directory. This is run outside of Docker, on the system under test.
  4. Build and run the docker image, using the scripts and instructions included with each benchmark. Each benchmark will run until the target quality is reached and then stop, printing timing results; additional information will be captured in a log file.
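
With Docker and nvidia-docker installed, a minimal sanity check is to run nvidia-smi inside a container (assuming a CUDA base image; adjust the tag to match your driver and CUDA version):

# The GPU list printed inside the container should match what nvidia-smi reports on the host
nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi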

Step-by-Step Instructions to run a training benchmark

First, clone the latest MLPerf training results repository as shown below. For training benchmarks it is recommended to use one of the existing result submissions (training_results_v0.5 or training_results_v0.6) rather than the reference implementation in the mlperf/training repository, because the reference code is an alpha release and is not intended for actual performance measurements of software frameworks or hardware.

git clone https://github.com/mlperf/training_results_v0.6.git

Next, let’s explore the downloaded code and locate the scripts to download the dataset, build and run the docker container, and so on. At the top level, there are directories for each vendor submission (Google, Intel, NVIDIA, etc.) containing the code and scripts used to generate the results they submitted. We will focus on the NVIDIA submission, since we want to run the benchmark on NVIDIA GPUs.

$/home/training_results_v0.6$ ls
Alibaba CONTRIBUTING.md Fujitsu Google Intel LICENSE NVIDIA README.md
$/home/training_results_v0.6$ cd NVIDIA
$/home/training_results_v0.6/NVIDIA$ ls
benchmarks LICENSE.md README.md results systems
$/home/training_results_v0.6/NVIDIA$ cd benchmarks; ls
gnmt maskrcnn minigo resnet ssd transformer

Within the NVIDIA/benchmarks directory we see 6 different training benchmarks. Let’s pick the first benchmark, GNMT, a recurrent neural network model similar to the one from Google² that performs language translation. NVIDIA provides documentation on the software requirements, details of the dataset and the pre-processing performed, and the steps to run it on single-node and multi-node systems.

Since we are interested in running the benchmark on a single node, we will pick the submitted result for a single node (NVIDIA DGX-1) and use its documentation to run GNMT on our system.

Download and verify dataset

The scripts to download and verify the dataset are available inside the implementations directory. Run the script to download and prepare the dataset, which should take ~90 minutes depending on your network connection and requires around 1.2GB of file system space. Verify that the dataset has been downloaded correctly by executing the second script.

$/home/training_results_v0.6/NVIDIA$ cd gnmt/implementations; ls
download_dataset.sh pytorch verify_dataset.sh
$/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ bash download_dataset.sh
$/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ bash verify_dataset.sh
OK: correct data/train.tok.clean.bpe.32000.en
OK: correct data/train.tok.clean.bpe.32000.de
OK: correct data/newstest_dev.tok.clean.bpe.32000.en
OK: correct data/newstest_dev.tok.clean.bpe.32000.de
OK: correct data/newstest2014.tok.bpe.32000.en
OK: correct data/newstest2014.tok.bpe.32000.de
OK: correct data/newstest2014.de

Launch training jobs

The scripts and code to execute the training job are inside the pytorch directory. Let’s explore the files within this directory.

$/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ ls
data download_dataset.sh logs pytorch verify_dataset.sh
$/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ cd pytorch; ls -l
bind_launch.py
config_DGX1_multi.sh
config_DGX1.sh
config_DGX2_multi_16x16x32.sh
config_DGX2_multi.sh
config_DGX2.sh
Dockerfile
LICENSE
mlperf_log_utils.py
preprocess_data.py
README.md
requirements.txt
run_and_time.sh
run.sub
scripts
seq2seq
setup.py
train.py
translate.py

config_<system>.sh: Since we are executing the training job on a system with 4 GPUs, we will have to create a new config file to reflect our system configuration. If your system has 8 or 16 GPUs, you can use the existing config_DGX1.sh or config_DGX2.sh config file to launch the training job.

I created a new config file, config_SUT.sh (by copying config_DGX1.sh), and edited it to reflect my system configuration. In this case I only needed to change the number of GPUs from 8 to 4. You may also have to change the number of CPU cores and sockets to reflect the available CPU resources on your system (see the quick check after the config snippet).

$training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/pytorch$ cp config_DGX1.sh config_SUT.sh

Edit config_SUT.sh to reflect your system config:

## System config params
DGXNGPU=4
DGXSOCKETCORES=20
DGXHT=2 # HT is on is 2, HT off is 1
DGXIBDEVICES=''
DGXNSOCKET=2
BIND_LAUNCH=1
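
To pick the right socket, core and hyper-threading values for your own machine, lscpu is one way to check the host topology (a quick sanity check, not part of the MLPerf scripts):

# Socket(s), Core(s) per socket and Thread(s) per core map to DGXNSOCKET, DGXSOCKETCORES and DGXHT
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'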

Now you are ready to build the docker container and launch the training job. Replace <docker/registry> with your docker hub registry name so you can reuse the container image on other systems or in a multi-node run.

Dockerfile: This is the build file for the docker container that will be used to execute the training job.

docker build -t <docker/registry>/mlperf-nvidia:rnn_translator .
docker push <docker/registry>/mlperf-nvidia:rnn_translator

If you don’t have a docker hub account, you can keep the image on the local system and omit <docker/registry>:

docker build -t mlperf-nvidia:rnn_translator .
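
To confirm the image was built and tagged as expected, listing it is a quick check (not part of the benchmark scripts):

# List local images in the mlperf-nvidia repository; the rnn_translator tag should appear
docker images mlperf-nvidia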

Launch the container and look at its contents to verify that it includes the config_SUT.sh file:

nvidia-docker run -it --rm mlperf-nvidia:rnn_translator
root@4e944d91164e:/workspace/rnn_translator# ls -l *.sh
config_DGX1.sh
config_DGX1_multi.sh
config_DGX2.sh
config_DGX2_multi.sh
config_DGX2_multi_16x16x32.sh
config_SUT.sh
run_and_time.sh

Once you have verified that the right config file is available inside the newly built docker container, we are ready to execute the training job using the launch script run.sub, setting environment variables for the dataset location, the log directory and the config file:

DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> PULL=0 DGXSYSTEM=<config file> ./run.sub

For my test, I will be using config_SUT.sh and therefore specify DGXSYSTEM as SUT. I created a new directory, logs, to store the benchmark log files and specify its path when launching the benchmark run as shown below:

DATADIR=/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data LOGDIR=/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs DGXSYSTEM=SUT PULL=0 ./run.sub

If everything goes well, you should be off to the races: the script will execute 10 trial runs of the benchmark and store the log files in the specified directory. Since we specified 4 GPUs in the config file, all 4 GPUs are utilized for training the GNMT model.
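
While the job is running, you can watch GPU utilization from a second terminal on the host (a simple check, not part of the benchmark scripts):

# Refresh the nvidia-smi view every 5 seconds; all 4 GPUs should show activity and memory in use
watch -n 5 nvidia-smi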

The benchmark run time is reported in each log file. The results for my runs are displayed below. The average time for a run was about 90 minutes, and it took close to 15 hours to finish all 10 iterations. You can modify run.sub to limit the number of runs if you don’t wish to run all 10 iterations (see the note after the results).

$/home/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs$ grep RNN_TRANSLATOR *.log
1.log:RESULT,RNN_TRANSLATOR,,3795,nvidia,2019-11-22 09:25:23 PM
2.log:RESULT,RNN_TRANSLATOR,,4683,nvidia,2019-11-22 10:28:56 PM
3.log:RESULT,RNN_TRANSLATOR,,3807,nvidia,2019-11-22 11:47:17 PM
4.log:RESULT,RNN_TRANSLATOR,,5594,nvidia,2019-11-23 12:51:02 AM
5.log:RESULT,RNN_TRANSLATOR,,6473,nvidia,2019-11-23 02:24:33 AM
6.log:RESULT,RNN_TRANSLATOR,,5576,nvidia,2019-11-23 04:12:43 AM
7.log:RESULT,RNN_TRANSLATOR,,6484,nvidia,2019-11-23 05:45:57 AM
8.log:RESULT,RNN_TRANSLATOR,,4683,nvidia,2019-11-23 07:34:19 AM
9.log:RESULT,RNN_TRANSLATOR,,6481,nvidia,2019-11-23 08:52:40 AM
10.log:RESULT,RNN_TRANSLATOR,,5580,nvidia,2019-11-23 10:40:59 AM
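
Each RESULT line reports the elapsed run time in seconds as its fourth comma-separated field, so a quick way to average the runs is a one-liner like the following (a convenience check I use, not part of the benchmark scripts):

# Average the elapsed seconds (4th comma-separated field) across all RESULT lines
grep RNN_TRANSLATOR *.log | awk -F, '{ sum += $4; n++ } END { if (n) printf "%.0f seconds over %d runs\n", sum/n, n }'

If you only want a subset of the 10 runs, look inside run.sub for the loop that launches the repeated experiments; NVIDIA’s v0.6 launch scripts typically drive it with an experiment-count variable that you can reduce before starting the job.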

The other training benchmarks in the remaining directories (maskrcnn, minigo, resnet, ssd and transformer) can be run using similar steps: download the dataset, then build and run the docker container. You can use the MLPerf training benchmarks to compare different GPU systems or to evaluate different software frameworks. As an example, you can evaluate the impact of the storage subsystem on ML workloads using MLPerf³ or explore how to improve MLPerf benchmarking metrics⁴.

[1] MLPerf Training Benchmark, Oct 2019. https://arxiv.org/pdf/1910.01500.pdf
[2] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Oct 2016. https://arxiv.org/abs/1609.08144
[3] Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite, Wes Vaske. https://www.flashmemorysummit.com/Proceedings2019/08-08-Thursday/20190808_AIML-301-1_Vaske.pdf
[4] Metrics for Machine Learning Workload Benchmarking, Snehil Verma et al. https://researcher.watson.ibm.com/researcher/files/us-ealtman/Snehil_Metrics_for_Machine_Learning_Workload_Benchmarking.pdf
