First DSSTNE Benchmarks TLDR: Up to Almost 15x Faster than TensorFlow

Today, I’d like to report the first official benchmarks of DSSTNE on training the MovieLens 20M view dataset. Upon its release, Amazon reported that DSSTNE is approximately 2.1x faster than an equivalent TensorFlow implementation on a single GPU of an AWS g2.8xlarge instance. The data presented today will make two further points:

  1. TensorFlow performance does not improve as efficiently as DSSTNE performance with improving GPU technology. While TensorFlow on a bare metal Tesla M40 is ~1.4x faster than on a single virtualized K520 GPU, DSSTNE is ~3.4x faster on a Tesla M40 than on said single K520 GPU. That is 3.4 / 1.4 ≈ 2.4x the speedup, so DSSTNE is roughly 2.4x as efficient at exploiting the improved processing power of the Tesla M40. I can only attribute this to TensorFlow spending more time uploading data to the GPU than DSSTNE, which maintains all of its data in a compact numpy CSR-like format for sparse data, entirely within GPU memory. Keeping data GPU-resident is one of DSSTNE’s key design features, given that Pascal GPUs and beyond will provide automagic unified memory management, which makes complex streaming code (mostly) obsolete. Finally, it is interesting to note that DSSTNE on a single virtualized K520 GPU (released in 2012) is faster than TensorFlow on a bare metal Tesla M40 (released in 2015). Or as I guess Google would say, training sparse neural networks with DSSTNE instead of TensorFlow is “Advancing Moore’s Law by 3 years(tm)” (j/k, couldn’t resist).
  2. TensorFlow does not provide the automagic model parallelism provided by DSSTNE. Because of this, we only provide multi-GPU benchmark numbers for DSSTNE for now. But to be fair to the TensorFlow team, Amazon has checked in the source code to the TensorFlow benchmark measured here so that they can provide an improved implementation if they so desire. This blog entry will be updated if they choose to do so. However, for now, on a g2.8xlarge instance, on which P2P memory copies are disabled and GPU/Device copies are extremely slow, DSSTNE quad-GPU is ~4.8x faster than single GPU TensorFlow. Meanwhile, on a bare metal Tesla M40, DSSTNE is ~14.8x faster than TensorFlow on said Tesla M40.
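
For readers unfamiliar with the CSR-like layout mentioned in point 1, here is a minimal pure-Python sketch of the general idea. It is not DSSTNE's code or API: the helper names `to_csr` and `csr_matvec` and the toy view matrix are made up for illustration. The point is only that a sparse matrix can be stored as three flat arrays, which is what makes it cheap to keep an entire sparse dataset resident in GPU memory.

```python
# Illustrative sketch only: hypothetical helpers, not DSSTNE's implementation.

def to_csr(dense_rows):
    """Convert a list of dense rows into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense_rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # the stored nonzero
                col_idx.append(j)  # the column it came from
        row_ptr.append(len(values))  # where the next row starts in values
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x, touching only stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y


# Toy user-by-movie matrix (made-up data): 1 means "this user viewed this movie".
views = [
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
]
values, col_idx, row_ptr = to_csr(views)
```

Storage here scales with the number of nonzeros plus the number of rows rather than with rows times columns, which matters a great deal when most users have seen only a tiny fraction of the catalog.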

Update 1/14/2017: Not only does TensorFlow continue to not support automagic model parallelism, but the single GPU benchmark in TensorFlow now seg faults, sigh.

Update 4/10/2018: In the subsequent year or so, not a single other major framework added support for intra-layer model parallelism or sparse backpropagation. I am fascinated by this. Model parallelism not only allows one to train and operate larger models, but as the data here indicates, one can even scale small models by exploiting increased multi-GPU memory bandwidth and FLOPs whilst minimizing communication costs relative to data parallelism.
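
To make "intra-layer model parallelism" concrete, here is a minimal pure-Python sketch of one common way to do it for a fully connected layer: split the weight matrix by output rows, let each device compute its slice of the output, then concatenate the slices. This is an assumption-laden illustration, not a description of how DSSTNE actually partitions its layers; the function names and the "devices" (plain list shards) are hypothetical.

```python
# Illustrative sketch only: "devices" are simulated as list shards on the CPU.

def matvec(W, x):
    """Dense matrix-vector product: the single-device reference."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def shard_rows(W, n_devices):
    """Split the rows of W into n_devices contiguous shards."""
    per = (len(W) + n_devices - 1) // n_devices  # ceiling division
    return [W[i * per:(i + 1) * per] for i in range(n_devices)]


def model_parallel_forward(W, x, n_devices):
    """Each shard computes its slice of the output; slices are then gathered."""
    shards = shard_rows(W, n_devices)
    partials = [matvec(shard, x) for shard in shards]  # one per simulated device
    return [v for partial in partials for v in partial]  # stands in for an all-gather


W = [[1, 0, 2],
     [0, 1, 0],
     [3, 1, 1],
     [2, 2, 2]]
x = [1, 2, 3]
y_single = matvec(W, x)
y_split = model_parallel_forward(W, x, n_devices=2)
```

In a layout like this, each device holds only its own rows of `W`, so what crosses the interconnect is activations rather than full weight gradients. That is one plausible reason the approach can need less communication than data parallelism, where every device holds a full copy of the weights and has to synchronize gradients for all of them.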




For K520 numbers, all benchmarks were run on a single g2.8xlarge AWS instance, using system memory MPI collectives to transmit data between virtualized GPUs.

For K80 numbers, all benchmarks were run bare metal.

Finally, for Tesla M40, all benchmarks were run on an Asus X99-E WS motherboard, powered by an Intel 5930K CPU, with 64GB of system memory and a 1500W power supply, essentially a DIY Digits Dev Box.

DSSTNE is available from

TensorFlow 0.8 was used for all measurements here. The TensorFlow benchmark code is available from

Finally, directions for running these benchmarks are available from