Keras + Horovod = Distributed Deep Learning on Steroids

Keras is definitely the weapon of choice when it comes to building deep learning models ( with tensorflow backend ). At SearchInk, we are solving varied problems in the field of document analysis by architecting and implementing deep learning models. One of the bigger challenges in doing this is the time taken to run each experiment. With the need for more and more experimentation to be carried out in shorter spans of time, we decided it was the right time for us to start distributed computations on the GPU for deep learning models.

We were evaluating different options on how to perform distributed GPU computing and we stumbled upon Horovod. We spent a few days evaluating the approach Uber had taken and it made perfect sense for us to be one of the early adopters of Horovod.

Things to look out when running Horovod:

Ensure that you have keras 2.0.8 and not 2.0.9 because 2.0.9 there is a known issue that makes each worker allocate all GPUs on the server instead of the GPU assigned by the local rank.

Also, when running the mpirun command, please use the following options:

$ mpirun --oversubscribe --bind-to none -np <number of GPU's>
oversubscribe - Nodes are allowed to be oversubscribed, even on a managed system, and overloading of processing elements.
bind-to none - Not binding the process to any cores
np - Run this many copies of the program on the given nodes
$ man mpirun 
For more details on the mpirun command and its option

Example With Keras Inception3 :

  1. Adding Horovod optimizer to the model
def create_inception_model(num_classes, input_shape, fully_trainable=False):
'''
A method to create inception 3 model with horovod optimizer
:param num_classes:
:param input_shape:
:param fully_trainable:
:return:
'''

base_model = InceptionV3(include_top=False, input_tensor=Input(input_shape),
weights=None, pooling='avg')
pred_name = "predictions_{}".format(num_classes)

x = base_model.output
predictions = Dense(num_classes, activation='softmax', name=pred_name)(x)

# create graph of your new model
model = Model(input=base_model.input, output=predictions)

for l in model.layers:
l.trainable = fully_trainable

# Adjust learning rate based on number of GPUs (naive approach).
opt = keras.optimizers.Adadelta()

# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
return model

2. Initializing Horovod and adding required callbacks:

#adding horovod
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
file_list = build_data(data_path)
train_size = math.ceil(0.8 * len(file_list))
train_file_list = file_list[:train_size]
test_file_list=file_list[train_size:]
nbatches_train, mod = divmod(len(train_file_list),
BATCH_SIZE)
nbatches_valid, mod = divmod(len(test_file_list),
BATCH_SIZE)
model = create_inception_model(num_classes=4, 
input_shape=(HEIGHT, WIDTH, CHANNELS),
fully_trainable=True)
cb = []
if hvd.rank()  == 0:
modelCheckPointCallBack = keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5')
cb.append(keras.callbacks.TensorBoard('./logs'))
cb.append(modelCheckPointCallBack)
cb.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))
cb.append(hvd.callbacks.MetricAverageCallback())
cb.append(hvd.callbacks.LRWarmupCallback(warmup_epochs=5, verbose=1))
cb.append(keras.callbacks.ReduceLROnPlateau(patience=3, verbose=1))
train_generator = generator_from_data_path(train_file_list,
batch_size=BATCH_SIZE,
width=WIDTH,
height=HEIGHT)
test_generator = generator_from_data_path(test_file_list,
batch_size=BATCH_SIZE,
width=WIDTH,
height=HEIGHT)
model.fit_generator(train_generator, epochs=EPOCHS,
steps_per_epoch=nbatches_train // hvd.size(),
callbacks=cb,
validation_data=test_generator,
validation_steps=3 * nbatches_valid // hvd.size(),
verbose=1)
serialize_mode(model, name='inception')

Voila! we are ready to run Inception3 on multiple GPU’s. You can check the usage of GPU’s by running:

watch nvidia-smi
Distributed GPU-Computation on 4 Tesla K80 GPU’s

In our next blog, we will be showcasing distributed GPU computation with TensorFlow using Horovod.

We would like to thank the Engineering team at Uber for making Horovod
an open source project. This is definitely great progress in the right direction.

Takeaways :

To put things into perspective, we were running an Inception3 architecture with a sample of 18 thousand documents on a 1 * 12GB Tesla K80 GPU. 
Each epoch took about 30 minutes. With Horovod and an upgraded instance with 4 * 12GB Tesla K80 GPU, reduced each epoch to about 5–6 minutes.