# Multi-GPU training of Large, Sparse-Matrix on Wide #NeuralNetwork

Multi-GPU training of neural networks on TensorFlow (v0.12 as of this blog) is a pain, at least once you get off the beaten path. I found out fairly early that the SparseTensors in the contrib.learn package do not play well on GPUs. This is the tale of writing a separate utility on Keras/TensorFlow that handles very large sparse matrices on multiple GPUs.

(first published on InMobi blog here >> InMobi Tech)

I expected the contrib.learn packages to scale on GPUs, but found out that this is not the case. You can check the issue reported here >> contrib.learn.Estimator multi-gpu issue

The specific piece of code that did not work for me is as follows:

Note that when a large portion of the input data is loaded into a SparseTensor that is not available on the GPU, the purpose of a multi-GPU setup is defeated. Hence the issue was raised.

My Setup: I tested this on 3 units of Titan X Pascal (CUDA 8) on Ubuntu 16.04 with TensorFlow v0.12 (Nvidia driver version 370.28).

#### The Solution: Large, Sparse-Matrix on Multi-GPU

After several attempts to help fix the SparseTensor in the TensorFlow library, and for lack of time, I resorted to writing a sparse tensor utility myself for large sparse matrices. When I say large: I have tested the utility on 1.5 billion samples of input data, with 20+ categorical features, some of which have about 100K values per feature. I have tested this on the 3 Titan X GPUs from the setup above and am thoroughly satisfied with the results.

The link to the code is available here >> Wide Multi-GPU Network with Sparse Matrix. Feel free to use this code as you see fit, as copyleft.

There are two main parts to this code.

- A quick-hash and sparse function for categorical features
- A multi-GPU parallelization routine

#### Sparse Hash

The objective of the sparse hash is twofold, as shown in the code section:

- Create a hash look-up table for unique feature values per feature.
- Create a hash to sparse matrix conversion routine.

Hashing Code:

I use a non-cryptographic hash to speed things up on large payloads. xxHash is an extremely fast hash function with a throughput of 5.4 GB/s or above.

Control the bucket size:

I reduce the bucket size to the square root of the maximum number of possible feature values in the sample space per feature. In my experience over the years, this is a good rule of thumb that keeps hash collisions low. I need 4 such buckets per hash, one for each element of the 4-hash-tuple received from the four_bit_split_hash procedure.

The 4-bit-split (or 4-hot encoding) is a good compressor for large sparse datasets while keeping hash collisions in check. I did not see a single collision for the buckets with the 4-bit-split.
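As a rough illustration of the idea (not the code from the linked repository), here is a minimal sketch of a split hash with a square-root bucket size. Several things here are assumptions: the names `bucket_size`, `split_hash_sketch` and `build_lookup` are hypothetical, the 64-bit digest is cut into four 16-bit chunks, and `hashlib.blake2b` stands in for xxHash so that the sketch needs only the standard library.

```python
import hashlib
import math


def bucket_size(feature_length):
    """Square-root reduction of the feature's sample-space size (assumed rule)."""
    return max(1, math.ceil(math.sqrt(feature_length)))


def split_hash_sketch(value, feature_length):
    """Hash a categorical value into a tuple of four bucket indices.

    blake2b is a stand-in for xxHash here; any fast 64-bit hash would do.
    """
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    buckets = bucket_size(feature_length)
    # Cut the 8-byte digest into four 2-byte chunks and reduce each one
    # into the bucket range [0, buckets).
    return tuple(
        int.from_bytes(digest[i:i + 2], "big") % buckets
        for i in range(0, 8, 2)
    )


def build_lookup(unique_values, feature_length):
    """Hash look-up table: one 4-tuple per unique feature value."""
    return {v: split_hash_sketch(v, feature_length) for v in unique_values}


# Hypothetical feature with roughly 100K possible values.
lookup = build_lookup(["site_1", "site_2", "site_1"], 100_000)
```

With a feature length of 100K, each of the four splits needs only 317 columns, so a value is encoded in 4 × 317 columns instead of 100K one-hot columns.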

Sparse Conversion Code:

This is a simple routine that converts the hash to a sparse matrix. I use the row-wise *csr_matrix* from the *scipy.sparse* module to do the trick. You can easily replace this with a column-wise format if your downstream operations are sharded differently. I presume most neural network data is sharded row-wise on key-codes per sample space (unless you are resorting to model parallelism).

Also, the *hstack* procedure is used to concatenate the splits in the hash, each of which is converted to its own sparse array. Just for simplicity, I have also added a one-hot function.
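The conversion can be sketched roughly as follows. This is an illustration under assumptions rather than the repository code: the function names are hypothetical, and the input is assumed to be one 4-tuple of bucket indices per sample. Each split becomes its own row-wise *csr_matrix*, and *hstack* concatenates the four blocks column-wise.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def one_hot_sparse(indices, length):
    """One-hot encode integer indices as a CSR matrix, one row per sample."""
    indices = np.asarray(indices, dtype=np.int64)
    n_rows = indices.shape[0]
    ones = np.ones(n_rows, dtype=np.float32)
    return csr_matrix((ones, (np.arange(n_rows), indices)), shape=(n_rows, length))


def split_hashes_to_sparse(hash_tuples, buckets):
    """Turn 4-tuples of bucket indices into a (n_samples, 4 * buckets) CSR matrix."""
    blocks = [
        one_hot_sparse([t[split] for t in hash_tuples], buckets)
        for split in range(4)
    ]
    # Concatenate the four one-hot blocks side by side, keeping the CSR format.
    return hstack(blocks, format="csr")


# Two samples, 4 buckets per split: 16 columns, 4 non-zeros per row.
sparse_features = split_hashes_to_sparse([(0, 1, 2, 3), (3, 2, 1, 0)], buckets=4)
```

Swapping `csr_matrix` for `csc_matrix` (and `format="csc"`) would give the column-wise variant mentioned above.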

At runtime, you have to convert the sparse matrix to a dense matrix to feed it to the Keras model (unfortunately, as far as I understand, Keras does not yet support sparse features). You can do so using the following procedure.

Note that I use the *toarray()* function on the sparse matrix to convert this to dense.
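Since a dense copy of the whole matrix would rarely fit in memory at this scale, one way to do the conversion is batch by batch. The sketch below is an assumption about how that might look, not the original procedure: `dense_batches` is a hypothetical generator that slices rows out of a CSR matrix and calls *toarray()* only on the current slice.

```python
import numpy as np
from scipy.sparse import csr_matrix


def dense_batches(sparse_features, labels, batch_size):
    """Yield (dense_features, labels) pairs, densifying one row slice at a time."""
    n_rows = sparse_features.shape[0]
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Row slicing is cheap on CSR; only this slice is made dense.
        yield sparse_features[start:stop].toarray(), labels[start:stop]


# Toy data: 5 samples, 5 columns, batches of 2 rows (the last batch has 1 row).
features = csr_matrix(np.eye(5, dtype=np.float32))
targets = np.arange(5, dtype=np.float32)
batch_shapes = [x.shape for x, _ in dense_batches(features, targets, batch_size=2)]
```

Each yielded pair could then be passed to a per-batch training call on the model.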

#### Multi-GPU parallelization

The multi-GPU parallelization can be done in 3 different ways.

- Tower parallelization: Here you load the data, cost functions, gradients and regularizers all on the same GPU. This allows you to run an exact replica of the entire model on multiple GPUs, and you can then aggregate the results using a mixing routine or a simple mean.
- Data parallelization: You can split the input data into ’N’ parts row-wise (N being the number of GPUs you want to run on) and merge the resulting outputs into a single stream. I have used Keras for this exact reason, as merging is fairly simple in this library. This is similar to the tower model, except that the data on different GPUs are different sample slices of the original data space. Each GPU gets the same model parameters but a different data slice. All other neural network functions are replicas.
- Model parallelization: The idea here is to split the model parameters into different sections and spread them across different GPUs, while sending the same input dataset to the different model parameters on each GPU. I have stayed away from this for now, as it can get quite dirty and complex pretty soon, and it is hard to debug on an unstable backend (I am aware that TensorFlow has released a 1.0 release candidate. I haven’t checked its stability yet as of this blog).

Tower parallelization:

Line #24 is where I load the exact same replica of the model across multiple GPUs.

Line #27 is a Keras Lambda layer that mixes the outputs as defined in the prediction_mean function (line #15). I am just taking the mean of the outputs for now. This can be converted to a Gaussian mix, as explained in the mixture of experts section of my earlier post titled: Committee of Intelligent Machines.
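To make the mixing step concrete, here is a small framework-free sketch in NumPy. It is only an illustration of the arithmetic, not the Keras Lambda layer itself, and the name `prediction_mean_sketch` is hypothetical: every tower produces predictions of the same shape, and the mix is their element-wise mean.

```python
import numpy as np


def prediction_mean_sketch(tower_outputs):
    """Element-wise mean over the predictions of all towers.

    tower_outputs: list of arrays, each of shape (batch_size, n_outputs).
    """
    return np.mean(np.stack(tower_outputs, axis=0), axis=0)


# Two towers, each predicting for the same batch of two samples.
tower_a = np.array([[0.2], [0.4]])
tower_b = np.array([[0.4], [0.8]])
mixed = prediction_mean_sketch([tower_a, tower_b])
```

A weighted mix would replace `np.mean` with a weighted sum over the first axis.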

Data Parallelization:

On line #34, we split the batch into as many slices as the number of GPUs requested.

Line #60 merges the outputs from the different GPUs into a single output after the final layer of the model.

CAUTION: I have not been able to run this code on a linear regressor consistently without getting a ‘nan’ on the regression loss function during model evaluation (NOTE: this is a problem during evaluation and NOT during training, and it is not consistent). I believe this is a Keras issue and have raised the concern here >> Loss turns into ‘nan’ when running on GPU
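The slice-and-merge flow can be illustrated without any GPU code. The NumPy sketch below is a simplified stand-in under assumptions, not the Keras implementation: `slice_batch` and `data_parallel_forward` are hypothetical names, a single weight matrix plays the role of the shared model, and each slice is processed in a plain loop where the real code would place it on its own GPU.

```python
import numpy as np


def slice_batch(batch, n_gpus):
    """Split a batch row-wise into n_gpus slices; early slices absorb any remainder."""
    return np.array_split(batch, n_gpus, axis=0)


def data_parallel_forward(batch, weights, n_gpus):
    """Apply the same parameters to every slice, then merge the outputs row-wise."""
    outputs = [part @ weights for part in slice_batch(batch, n_gpus)]
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
batch = rng.normal(size=(7, 3))    # 7 samples, 3 features
weights = rng.normal(size=(3, 1))  # shared "model" parameters
merged = data_parallel_forward(batch, weights, n_gpus=3)
```

Because every slice sees the same parameters, the merged output matches what a single device would compute on the full batch.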

Here is a sample template to use the utility. It’s incomplete for brevity…

Line #62 in the figure creates a feature dictionary with all the features and their feature_length (to be used by the hashing and sparse functions). The feature length is the maximum sample space per feature.

Line #71 obtains the model by passing the feature dictionary along with the type of parallelization needed (defaults to Tower).

Line #73 obtains the hash lookup for all uniques.

Line #75 gets the sparse matrix and converts it to a dense matrix with categorical encoding.

Line #77 trains the model with the dense matrix.

#### Model Usage

Note that the same model architecture should be used if you are planning to save_weights and load_weights of the trained model. Also, needless to say, a change in feature lengths changes the model architecture by drastically changing the input size.

Hope this helps reduce your workloads and improve model throughput. For my own training, I could reduce 48-hour model training workloads to roughly 10–12 hours. I shall take that any day!

**Now, set those GPUs on fire, people! Figuratively…**