# Transparent Multi-GPU Training on TensorFlow with Keras

Keras should be getting a transparent data-parallel multi-GPU training capability pretty soon now, but in the meantime I thought I would share some code I wrote a month ago for doing data-parallel training without making any changes to your model definition.

As a preface, I would like to note that your model may not run any faster on multiple GPUs if you are not actually GPU-bound. This can happen, for example, when you feed data from a generator whose creation is CPU- or I/O-bound, or when your model is not particularly complex and you are memory-bound transferring data to your GPU.

In any case, once you have grabbed the `make_parallel` function from GitHub, you can turn your existing model into a multi-GPU model with a single-line change:

```python
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense

# make_parallel is the function grabbed from GitHub (see above)

model = Sequential()
model.add(Dense(4000, input_dim=8000, activation='tanh'))
model.add(Dense(2000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(300, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

# The single-line change: replicate the model across 4 GPUs
model = make_parallel(model, 4)

optimizer = keras.optimizers.Adam(lr=0.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

x = np.random.rand(131072, 8000)
y = np.random.randint(0, 2, (131072, 1))
model.fit(x, y, batch_size=2048*4)
```

The one extra thing to note is that, to get better performance, you should multiply your usual batch size by the number of GPUs you are using, so that each GPU still receives a full-sized shard of the batch.
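As a concrete example of that rule (matching the `batch_size=2048*4` in the 4-GPU run above), the batch size passed to `fit` is the per-GPU batch size times the GPU count:

```python
n_gpus = 4
per_gpu_batch = 2048                 # the batch size you would use on one GPU
batch_size = per_gpu_batch * n_gpus  # what you pass to model.fit, i.e. 8192

# make_parallel splits this back up, so each GPU still sees 2048 examples per step
assert batch_size // n_gpus == per_gpu_batch
```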

When evaluating the performance, I found that I needed pretty big models to notice a difference between single GPU performance and multi-GPU performance; Keras tells me that the model I defined above has 41,157,101 parameters, and models that were significantly smaller didn’t seem to get much of a performance boost from the multi-GPU setting.

## Internals

`make_parallel` is a relatively simple function:

- It instantiates a copy of your model on each of the N GPUs you specify
- It splits your batch into N evenly sized smaller batches
- It passes each smaller batch into the corresponding model
- It concatenates the outputs of the models
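The data flow of those four steps can be sketched in plain NumPy. Here `replica_forward` is just a stand-in for one model copy (any per-example function works); the point is the split/concatenate mechanics, not the actual Keras plumbing:

```python
import numpy as np

def split_batch(batch, n_gpus):
    # Step 2: split the batch along its leading dimension into N even shards
    return np.array_split(batch, n_gpus, axis=0)

def replica_forward(shard):
    # Stand-in for one model replica: applies the same function to every example
    return shard.sum(axis=1, keepdims=True)

x = np.random.rand(8192, 8)
shards = split_batch(x, 4)                      # four shards of shape (2048, 8)
outputs = [replica_forward(s) for s in shards]  # step 3: one forward pass per shard
y = np.concatenate(outputs, axis=0)             # step 4: concatenate the outputs

# Because each replica applies the same per-example function,
# the result matches running the whole batch through one copy
assert np.allclose(y, replica_forward(x))
```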

Where this differs from most TensorFlow multi-GPU tutorials is that we do not compute a separate loss function on each GPU and then average the results before applying the updates; we keep the single loss function defined for our model. To avoid having all of our gradients wind up on the same device, we pass the `colocate_gradients_with_ops` flag to TensorFlow, asking it to compute each gradient on the GPU where the corresponding forward operation was placed. The performance benefit doesn't seem large, though, so it's not clear how well this approach is working.
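A quick way to see why keeping the single loss is equivalent to the per-GPU-loss-then-average scheme: with evenly sized shards, the mean loss over the full batch equals the average of the per-shard mean losses. A toy check in NumPy (`per_example_loss` stands in for whatever per-example loss your model computes):

```python
import numpy as np

per_example_loss = np.random.rand(8192)        # one loss value per example
shards = np.array_split(per_example_loss, 4)   # the four per-GPU shards

single_loss = per_example_loss.mean()                # single loss over everything
averaged_loss = np.mean([s.mean() for s in shards])  # per-GPU losses, then averaged

# Identical (up to floating point) because the shards are the same size
assert np.isclose(single_loss, averaged_loss)
```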