Multi-GPU training with Keras on Onepanel.io

Joinal Ahmed · Published in Onepanel · 8 min read · Jun 3, 2019


In this blog post we will demonstrate how to train a convolutional neural network for image classification on multiple GPUs using Keras.

The MiniGoogLeNet deep learning architecture

Figure 1: The MiniGoogLeNet architecture is a small version of its bigger brother, GoogLeNet/Inception.

In Figure 1 above we can see the individual convolution (left), inception (middle), and downsample (right) modules, followed by the overall MiniGoogLeNet architecture (bottom), constructed from these building blocks. We will be using the MiniGoogLeNet architecture in our multi-GPU experiments later in this post.

The Inception module in MiniGoogLeNet is a variation of the original Inception module designed by Szegedy et al.

Now, let's proceed to implement the MiniGoogLeNet architecture in Keras + Python.

Training a deep neural network with Keras and multiple GPUs

Let’s go ahead and get started training a deep learning network using Keras and multiple GPUs.

Let's define the model first. Open up a new file, name it net.py, and insert the following code:

# import the necessary packages
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import AveragePooling2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Dropout
from keras.layers.core import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.models import Model
from keras.layers import concatenate
from keras import backend as K

class MiniGoogLeNet:
    @staticmethod
    def conv_module(x, K, kX, kY, stride, chanDim, padding="same"):
        # define a CONV => BN => RELU pattern
        x = Conv2D(K, (kX, kY), strides=stride, padding=padding)(x)
        x = BatchNormalization(axis=chanDim)(x)
        x = Activation("relu")(x)

        # return the block
        return x

    @staticmethod
    def inception_module(x, numK1x1, numK3x3, chanDim):
        # define two CONV modules, then concatenate across the
        # channel dimension
        conv_1x1 = MiniGoogLeNet.conv_module(x, numK1x1, 1, 1, (1, 1), chanDim)
        conv_3x3 = MiniGoogLeNet.conv_module(x, numK3x3, 3, 3, (1, 1), chanDim)
        x = concatenate([conv_1x1, conv_3x3], axis=chanDim)

        # return the block
        return x

    @staticmethod
    def downsample_module(x, K, chanDim):
        # define the CONV module and POOL, then concatenate
        # across the channel dimensions
        conv_3x3 = MiniGoogLeNet.conv_module(x, K, 3, 3, (2, 2),
            chanDim, padding="valid")
        pool = MaxPooling2D((3, 3), strides=(2, 2))(x)
        x = concatenate([conv_3x3, pool], axis=chanDim)

        # return the block
        return x

    @staticmethod
    def build(width, height, depth, classes):
        # initialize the input shape to be "channels last" and the
        # channels dimension itself
        inputShape = (height, width, depth)
        chanDim = -1

        # if we are using "channels first", update the input shape
        # and channels dimension
        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1

        # define the model input and first CONV module
        inputs = Input(shape=inputShape)
        x = MiniGoogLeNet.conv_module(inputs, 96, 3, 3, (1, 1), chanDim)

        # two Inception modules followed by a downsample module
        x = MiniGoogLeNet.inception_module(x, 32, 32, chanDim)
        x = MiniGoogLeNet.inception_module(x, 32, 48, chanDim)
        x = MiniGoogLeNet.downsample_module(x, 80, chanDim)

        # four Inception modules followed by a downsample module
        x = MiniGoogLeNet.inception_module(x, 112, 48, chanDim)
        x = MiniGoogLeNet.inception_module(x, 96, 64, chanDim)
        x = MiniGoogLeNet.inception_module(x, 80, 80, chanDim)
        x = MiniGoogLeNet.inception_module(x, 48, 96, chanDim)
        x = MiniGoogLeNet.downsample_module(x, 96, chanDim)

        # two Inception modules followed by global POOL and dropout
        x = MiniGoogLeNet.inception_module(x, 176, 160, chanDim)
        x = MiniGoogLeNet.inception_module(x, 176, 160, chanDim)
        x = AveragePooling2D((7, 7))(x)
        x = Dropout(0.5)(x)

        # softmax classifier
        x = Flatten()(x)
        x = Dense(classes)(x)
        x = Activation("softmax")(x)

        # create the model
        model = Model(inputs, x, name="googlenet")

        # return the constructed network architecture
        return model
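
Before moving on to the training script, a quick sanity check you can run is to build the network for CIFAR-10-sized inputs and print its layer summary. This snippet assumes net.py is on your Python path; it is not part of the training script itself:

# build MiniGoogLeNet for 32x32x3 inputs and 10 classes, then
# print the layer-by-layer summary and parameter count
from net import MiniGoogLeNet

model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)
model.summary()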

From there, open up a new file, name it train.py, and insert the following code:

# set the matplotlib backend so figures can be saved in the
# background (uncomment the lines below if you are using a headless server)
# import matplotlib
# matplotlib.use("Agg")

# import the necessary packages
from net import MiniGoogLeNet
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import LearningRateScheduler
from keras.utils import multi_gpu_model  # in older Keras: keras.utils.training_utils
from keras.optimizers import SGD
from keras.datasets import cifar10
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import argparse

Now let’s parse our command line arguments:

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-o", "--output", required=True,
    help="path to output plot")
ap.add_argument("-g", "--gpus", type=int, default=1,
    help="# of GPUs to use for training")
args = vars(ap.parse_args())

# grab the number of GPUs and store it in a convenience variable
G = args["gpus"]

We use argparse to parse one required and one optional argument:

  • --output : The path to the output plot after training is complete.
  • --gpus : The number of GPUs used for training.

From there, we initialize two important variables used to configure our training process, followed by defining poly_decay , a learning rate schedule function equivalent to Caffe’s polynomial learning rate decay:

# define the total number of epochs to train for along with the
# initial learning rate
NUM_EPOCHS = 70
INIT_LR = 5e-3

def poly_decay(epoch):
    # initialize the maximum number of epochs, base learning rate,
    # and power of the polynomial
    maxEpochs = NUM_EPOCHS
    baseLR = INIT_LR
    power = 1.0

    # compute the new learning rate based on polynomial decay
    alpha = baseLR * (1 - (epoch / float(maxEpochs))) ** power

    # return the new learning rate
    return alpha

We set NUM_EPOCHS = 70 and the initial learning rate INIT_LR = 5e-3.

From there, we define the poly_decay function, the equivalent of Caffe's polynomial learning rate decay. This function is called at the start of each epoch and returns a new, smaller learning rate, so the rate steadily decreases over training. Setting power = 1.0 reduces the polynomial decay to a simple linear decay.
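
As a quick illustration of the schedule, here are a few values it produces with NUM_EPOCHS = 70 and INIT_LR = 5e-3, computed straight from the formula above:

# sample the schedule at a few epochs to see the linear decay
for epoch in [0, 35, 69]:
    print("epoch {:2d} -> lr = {:.6f}".format(epoch, poly_decay(epoch)))

# epoch  0 -> lr = 0.005000   (the initial learning rate)
# epoch 35 -> lr = 0.002500   (halfway through training, half the rate)
# epoch 69 -> lr = 0.000071   (almost zero on the final epoch)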

Next we’ll load our training + testing data and convert the image data from integer to float:

# load the training and testing data, converting the images from
# integers to floats
print("[INFO] loading CIFAR-10 data...")
((trainX, trainY), (testX, testY)) = cifar10.load_data()
trainX = trainX.astype("float")
testX = testX.astype("float")

From there we apply mean subtraction to the data:

# apply mean subtraction to the data
mean = np.mean(trainX, axis=0)
trainX -= mean
testX -= mean

Here, we calculate the mean over all training images and then subtract that mean from each image in the training and testing sets.
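
Note that np.mean(trainX, axis=0) averages over the sample axis only, so mean is a per-pixel mean image rather than a single scalar. A quick check you can run yourself:

# trainX has shape (50000, 32, 32, 3); averaging over axis=0 leaves
# one mean value per pixel and per channel
print(trainX.shape)  # (50000, 32, 32, 3)
print(mean.shape)    # (32, 32, 3)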

Then, we perform "one-hot encoding":

# convert the labels from integers to vectors
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

One-hot encoding transforms categorical labels from a single integer to a vector so we can apply the categorical cross-entropy loss function.
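
For example, with the ten CIFAR-10 classes a label of 3 becomes a ten-element vector with a single one in position 3. A small, self-contained illustration (independent of the training script):

# a quick illustration: label 3 ("cat" in CIFAR-10) becomes a
# ten-element vector with a single one in position 3
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(np.arange(10))     # the ten CIFAR-10 class ids, 0..9
print(lb.transform([3]))  # [[0 0 0 1 0 0 0 0 0 0]]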

Next, we create a data augmenter and set of callbacks:

# construct the image generator for data augmentation and construct
# the set of callbacks
aug = ImageDataGenerator(width_shift_range=0.1,
    height_shift_range=0.1, horizontal_flip=True,
    fill_mode="nearest")
callbacks = [LearningRateScheduler(poly_decay)]

Here, aug is the image generator used for data augmentation.

Data augmentation is a method used during training in which we apply random transformations (shifts, flips, and so on) to the training images.

Because of these alterations, the network constantly sees new augmented examples; this helps it generalize better to the validation data, while perhaps performing slightly worse on the training set. In most situations this trade-off is a worthwhile one.

We also create a list of callbacks containing a single LearningRateScheduler that applies poly_decay, so the learning rate decays epoch by epoch.
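
If you would like to see what the augmenter produces, you can pull a single augmented batch out of the generator and inspect it. This is an optional check, not part of train.py:

# draw one augmented batch from the generator and check its shape
batchX, batchY = next(aug.flow(trainX, trainY, batch_size=9))
print(batchX.shape)  # (9, 32, 32, 3): nine randomly shifted/flipped images
print(batchY.shape)  # (9, 10): the matching one-hot labels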

Let’s check that GPU variable next:

# check to see if we are compiling using just a single GPU
if G <= 1:
    print("[INFO] training with 1 GPU...")
    model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)

If the GPU count is less than or equal to one, we initialize the model via the .build function; otherwise, we'll parallelize the model during training:

# otherwise, we are compiling using multiple GPUs
else:
    print("[INFO] training with {} GPUs...".format(G))

    # we'll store a copy of the model on *every* GPU and then combine
    # the results from the gradient updates on the CPU
    with tf.device("/cpu:0"):
        # initialize the model
        model = MiniGoogLeNet.build(width=32, height=32, depth=3,
            classes=10)

    # make the model parallel
    model = multi_gpu_model(model, gpus=G)

Creating a multi-GPU model in Keras requires a bit of extra code, but not much!
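
If you are unsure how many GPUs your Onepanel workspace actually exposes, one way to check from Python with the TensorFlow backend is to list the local devices. This is an optional check, not part of train.py:

# list the devices TensorFlow can see and count the ones of type "GPU"
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
gpus = [d.name for d in devices if d.device_type == "GPU"]
print("{} GPU(s) available: {}".format(len(gpus), gpus))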

# initialize the optimizer and model
print("[INFO] compiling model...")
opt = SGD(lr=INIT_LR, momentum=0.9)
model.compile(loss="categorical_crossentropy", optimizer=opt,
    metrics=["accuracy"])

# train the network
print("[INFO] training network...")
H = model.fit_generator(
    aug.flow(trainX, trainY, batch_size=64 * G),
    validation_data=(testX, testY),
    steps_per_epoch=len(trainX) // (64 * G),
    epochs=NUM_EPOCHS,
    callbacks=callbacks,
    verbose=2)

We build a Stochastic Gradient Descent (SGD) optimizer with momentum and then compile the model with that optimizer and a categorical cross-entropy loss function.

We’re now ready to train the network!

To initiate the training process, we make a call to model.fit_generator and provide the necessary arguments.

We'd like a batch size of 64 on each GPU, so that is specified by batch_size=64 * G.

Our training will continue for 70 epochs (which we specified previously).

Each batch of 64 * G images is split into G sub-batches of 64, one per GPU; the per-GPU results are combined on the CPU, and the updated weights are shared with every GPU throughout the training process.
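
To make the batch arithmetic concrete, here is what the numbers look like for the 4-GPU case (a small sketch; 50,000 is the size of the CIFAR-10 training set):

# effective batch size and steps per epoch when training with G GPUs
G = 4
batch_size = 64 * G                    # 256 images per training step
per_gpu_batch = batch_size // G        # 64 images processed by each GPU
steps_per_epoch = 50000 // batch_size  # 195 steps per epoch
print(batch_size, per_gpu_batch, steps_per_epoch)  # 256 64 195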

Now that training and testing are complete, let's plot the loss/accuracy so we can visualize the training process:

# grab the history object dictionary
H = H.history

# plot the training loss and accuracy
N = np.arange(0, len(H["loss"]))
plt.style.use("ggplot")
plt.figure()
plt.plot(N, H["loss"], label="train_loss")
plt.plot(N, H["val_loss"], label="test_loss")
plt.plot(N, H["acc"], label="train_acc")
plt.plot(N, H["val_acc"], label="test_acc")
plt.title("MiniGoogLeNet on CIFAR-10")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()

# save the figure
plt.savefig(args["output"])
plt.close()

This last block simply uses matplotlib to plot the training/testing loss and accuracy, and then saves the figure to disk.

Keras multi-GPU results

Let’s check the results of our hard work.

Let’s train on a single GPU to obtain a baseline:

python train.py --output single_gpu.png
[INFO] training with 1 GPU...
[INFO] compiling model...
[INFO] training network...

Experimental results from training and testing the MiniGoogLeNet architecture on CIFAR-10 using Keras on a single V100 GPU.

Training on a single V100 GPU on Onepanel. Each epoch took ~63 seconds with a total training time of 74m10s.

Let's train on a cluster of V100s (4 GPUs):

python train.py --output multi_gpu.png --gpus 4
[INFO] training with 4 GPUs...
[INFO] compiling model...
[INFO] training network...

Multi-GPU training results (4 V100 GPUs) using Keras and MiniGoogLeNet on the CIFAR-10 dataset. Training results are similar to the single-GPU experiment while training time was cut by ~75%.

Here you can see the quasi-linear speedup in training: using four GPUs, each epoch dropped to only ~16 seconds, and the entire network finished training in 19m3s, roughly a 3.9x speedup over the single-GPU run (74m10s vs. 19m3s).

As you can see, not only is training deep neural networks with Keras and multiple GPUs easy, it's also efficient!

Note: In this case, the single GPU experiment obtained slightly higher accuracy than the multi-GPU experiment. When training any stochastic machine learning model, there will be some variance. If you were to average these results out across hundreds of runs they would be (approximately) the same.

Summary

In this blog post we learned how to use multiple GPUs to train Keras-based deep neural networks.

Using multiple GPUs enables us to obtain quasi-linear speedups.

To validate this, we trained MiniGoogLeNet on the CIFAR-10 dataset.

Using a single GPU we were able to obtain 63 second epochs with a total training time of 74m10s.

However, by using multi-GPU training with Keras and Python we decreased training time to 16 second epochs with a total training time of 19m3s.

Try out training a model using a GPU cluster on Onepanel.io by clicking here.
