10 Python Frameworks for Parallel and Distributed Machine Learning Tasks
Python libraries that let you distribute and parallelize machine learning tasks

Modern neural network models are deep and complex, with a huge number of weights to learn, and training them is challenging: data scientists have to set up distributed training, checkpointing, and more, and even then may not reach the desired performance and convergence rate. Training large models is harder still because they easily run out of memory.
In this article, we will look at Python frameworks that allow us to distribute and parallelize deep learning models.
1. Elephas
Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale with Spark. Elephas intends to keep the simplicity and high usability of Keras, thereby allowing for fast prototyping of distributed models, which can be run on massive data sets.
Elephas currently supports a number of applications, including:
- Data-parallel training of deep learning models
- Distributed hyper-parameter optimization
- Distributed training of ensemble models

Installation
pip install elephas
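After installation, training stays close to plain Keras, with a Spark wrapper around the model. The sketch below is a minimal, hedged example based on the Elephas README: x_train and y_train are assumed NumPy arrays, and the exact imports and SparkModel arguments may vary between Elephas versions.

from pyspark import SparkContext, SparkConf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

sc = SparkContext(conf=SparkConf().setAppName("elephas-demo"))

# An ordinary Keras model
model = Sequential([Dense(64, activation="relu", input_dim=784),
                    Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Turn NumPy arrays (x_train, y_train are assumed to exist) into a Spark RDD
rdd = to_simple_rdd(sc, x_train, y_train)

# Wrap the model and train it data-parallel across the Spark cluster
spark_model = SparkModel(model, frequency="epoch", mode="asynchronous")
spark_model.fit(rdd, epochs=10, batch_size=32, validation_split=0.1)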
2. FairScale
FairScale is a PyTorch extension library for high-performance and large-scale training on one or multiple machines/nodes. This library extends basic PyTorch capabilities while adding new experimental ones.
FairScale supports:
- Parallelism
- Sharded training
- Optimization at scale
- GPU memory optimization
- GPU speed optimization
Installation
pip install fairscale
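As a taste of the sharded-training features, here is a minimal sketch of optimizer state sharding (OSS) combined with ShardedDataParallel, adapted from FairScale's examples. It assumes a GPU is available and that torch.distributed has already been initialized (e.g. via torchrun or torch.distributed.launch); module paths can differ between FairScale releases.

import torch
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

# Assumes torch.distributed.init_process_group(...) has already been called
model = torch.nn.Linear(128, 10).cuda()

# Shard optimizer state across ranks instead of replicating it on every GPU
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.01)

# Reduce each gradient only to the rank that owns its optimizer shard
model = ShardedDDP(model, optimizer)

# A regular training step
loss = model(torch.randn(32, 128).cuda()).sum()
loss.backward()
optimizer.step()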
3. TensorFlowOnSpark
By combining salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.
It enables both distributed TensorFlow training and inferencing on Spark clusters, with a goal to minimize the amount of code changes required to run existing TensorFlow programs on a shared grid.
TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on the Hadoop clusters in Yahoo's private cloud.
Installation
pip install tensorflowonspark
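The core pattern is to wrap your existing TensorFlow training function and hand it to a TFCluster, which reserves Spark executors and starts TensorFlow on them. The sketch below is illustrative and hedged: argument names and defaults follow the TensorFlowOnSpark examples but may differ across versions.

from pyspark import SparkContext, SparkConf
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Your existing TensorFlow training code runs here on each executor;
    # ctx carries the cluster spec, job name and task index for this worker.
    import tensorflow as tf
    ...

sc = SparkContext(conf=SparkConf().setAppName("tfos-demo"))
num_executors = 4

# Reserve executors, launch TensorFlow on them, then shut the cluster down
cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps=0,
                        tensorboard=False,
                        input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()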
4. DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
- Extreme-scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
- Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many more deep learning scientists can explore bigger and better models.
Installation
pip install deepspeed
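Adopting DeepSpeed mostly means letting deepspeed.initialize wrap your model and optimizer, then calling backward and step on the returned engine. Below is a minimal, hedged sketch; the JSON config file name is an assumption, and the script would be launched with the deepspeed launcher, e.g. deepspeed train.py --deepspeed_config ds_config.json.

import argparse
import torch
import deepspeed

# DeepSpeed reads its JSON config (batch size, optimizer, ZeRO stage, ...)
# from the --deepspeed_config flag added by add_config_arguments
parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

model = torch.nn.Linear(1024, 1024)
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

for step in range(10):
    x = torch.randn(8, 1024).to(model_engine.device)
    y = torch.randn(8, 1024).to(model_engine.device)
    loss = torch.nn.functional.mse_loss(model_engine(x), y)
    model_engine.backward(loss)   # DeepSpeed handles scaling and all-reduce
    model_engine.step()           # optimizer step, including ZeRO partitioning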
5. Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Once a training script has been written for scale with Horovod, it can run on a single-GPU, multiple-GPUs, or even multiple hosts without any further code changes.
In addition to being easy to use, Horovod is fast. The project's benchmark was run on 128 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network.

Installation
# To run on CPUs
pip install horovod

# To run on GPUs with NCCL
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
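The typical workflow is small and framework-agnostic: initialize Horovod, pin each process to a GPU, scale the learning rate by the number of workers, wrap the optimizer, and broadcast the initial state. A minimal PyTorch sketch, adapted from the Horovod documentation, is shown below; it would be launched with something like horovodrun -np 4 python train.py.

import torch
import horovod.torch as hvd

hvd.init()                                      # 1. start Horovod
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())     # 2. one GPU per process

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# 3. average gradients across workers with ring-allreduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# 4. make every worker start from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 128)).sum()
    loss.backward()
    optimizer.step()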
6. Mesh TensorFlow
Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying a broad class of distributed tensor computations. The purpose of Mesh TensorFlow is to formalize and implement distribution strategies for your computation graph over your hardware/processors. For example: "Split the batch over rows of processors and split the units in the hidden layer across columns of processors." Mesh TensorFlow is implemented as a layer over TensorFlow.
Installation
pip install mesh-tensorflow
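In Mesh TensorFlow you name every tensor dimension, and a separate layout later decides which named dimensions get split across which mesh axes. The sketch below only builds the logical graph and is loosely adapted from the project's README; the dimension names and sizes are illustrative, and the lowering step (choosing a mesh shape, layout rules and device placement) is omitted.

import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Named dimensions instead of anonymous axis positions
batch_dim = mtf.Dimension("batch", 100)
io_dim = mtf.Dimension("io", 28 * 28)
hidden_dim = mtf.Dimension("hidden", 1024)
classes_dim = mtf.Dimension("classes", 10)

# Ordinary TF tensors imported into the mesh
tf_images = tf.random.uniform([100, 28 * 28])
tf_labels = tf.random.uniform([100], maxval=10, dtype=tf.int32)
images = mtf.import_tf_tensor(mesh, tf_images, shape=mtf.Shape([batch_dim, io_dim]))
labels = mtf.import_tf_tensor(mesh, tf_labels, shape=mtf.Shape([batch_dim]))

# A two-layer network expressed with named dimensions
w1 = mtf.get_variable(mesh, "w1", shape=mtf.Shape([io_dim, hidden_dim]))
w2 = mtf.get_variable(mesh, "w2", shape=mtf.Shape([hidden_dim, classes_dim]))
hidden = mtf.relu(mtf.einsum([images, w1], output_shape=mtf.Shape([batch_dim, hidden_dim])))
logits = mtf.einsum([hidden, w2], output_shape=mtf.Shape([batch_dim, classes_dim]))
loss = mtf.reduce_mean(mtf.layers.softmax_cross_entropy_with_logits(
    logits, mtf.one_hot(labels, classes_dim), classes_dim))

A layout such as "batch:rows,hidden:cols" would then realize the strategy quoted above, splitting the batch over rows of processors and the hidden units over columns.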
7. BigDL
BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters. To make it easy to build Spark and BigDL applications, a high-level Analytics Zoo is provided for end-to-end analytics + AI pipelines.
- Rich deep learning support. Modelled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing (via Tensor) and high-level neural networks; in addition, users can load pre-trained Caffe or Torch models into Spark programs using BigDL.
- Extremely high performance. To achieve high performance, BigDL uses Intel MKL / Intel MKL-DNN and multi-threaded programming in each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open-source Caffe, Torch or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPUs). With the adoption of Intel DL Boost, BigDL improves inference latency and throughput significantly.
- Efficient scale-out. BigDL can efficiently scale out to perform data analytics at "Big Data scale" by leveraging Apache Spark (a lightning-fast distributed data processing framework), as well as efficient implementations of synchronous SGD and all-reduce communication on Spark.
Installation
pip install BigDL
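A small, hedged sketch of the classic BigDL Python API is shown below: a Torch-style model is built from BigDL layers and trained with BigDL's distributed Optimizer on a Spark RDD. Here train_rdd is assumed to be an RDD of BigDL Sample objects, and module paths have moved around in newer BigDL 2.x releases.

from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-demo"))
init_engine()   # initialize BigDL's engine on the Spark executors

# A Torch-style model built from BigDL layers
model = Sequential().add(Linear(784, 128)).add(ReLU()) \
                    .add(Linear(128, 10)).add(LogSoftMax())

# train_rdd (assumed to exist) is an RDD of BigDL Sample objects
optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(),
                      end_trigger=MaxEpoch(5),
                      batch_size=256)
trained_model = optimizer.optimize()   # distributed synchronous SGD on Spark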
8. Analytics Zoo
Analytics Zoo seamlessly scales TensorFlow, Keras and PyTorch to distributed big data (using Spark, Flink & Ray).
- End-to-end pipeline for applying AI models (TensorFlow, PyTorch, OpenVINO, etc.) to distributed big data
- Write TensorFlow or PyTorch inline with Spark code for distributed training and inference.
- Native deep learning (TensorFlow/Keras/PyTorch/BigDL) support in Spark ML Pipelines.
- Directly run Ray programs on big data cluster through RayOnSpark.
- Plain Java/Python APIs for (TensorFlow/PyTorch/BigDL/OpenVINO) Model Inference.
- High-level ML workflow for automating machine learning tasks
- Cluster Serving for automatically distributed (TensorFlow/PyTorch/Caffe/OpenVINO) model inference.
- Scalable AutoML for time series prediction.
- Built-in models for Recommendation, Time Series, Computer Vision and NLP applications.

Installation
pip install analytics-zoo
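Most Analytics Zoo workflows start by creating an Orca context, which sets up Spark (and optionally Ray) locally or on a cluster; TensorFlow/PyTorch training, Spark ML pipelines or RayOnSpark programs then run on top of it. The snippet below is a hedged sketch of that entry point only: argument names follow the Orca documentation but may differ between releases.

from zoo.orca import init_orca_context, stop_orca_context

# "local" runs everything in-process; "yarn-client" or "k8s" target a real cluster
sc = init_orca_context(cluster_mode="local", cores=4, memory="4g")

# ... build a TensorFlow/Keras/PyTorch model and train it with an Orca Estimator,
# or run Ray code via RayOnSpark, on top of the SparkContext created above ...

stop_orca_context()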
9. Petastorm
Petastorm is an open-source data access library developed at Uber ATG. It enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. It can also be used from pure Python code.
Installation
pip install petastorm
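Reading a Petastorm dataset is a one-liner with make_reader, and the same reader can be wrapped for PyTorch. The sketch below assumes a Parquet dataset previously materialized with Petastorm at the (hypothetical) dataset_url path.

from petastorm import make_reader
from petastorm.pytorch import DataLoader

dataset_url = "file:///tmp/hello_world_dataset"   # hypothetical dataset path

# Pure-Python iteration over the rows of the Parquet dataset
with make_reader(dataset_url) as reader:
    for row in reader:
        print(row)

# Or feed the same dataset straight into a PyTorch training loop
with DataLoader(make_reader(dataset_url), batch_size=64) as loader:
    for batch in loader:
        pass   # training step would go here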
10. Apache SINGA
Apache SINGA is an Apache Top Level Project, focusing on distributed training of deep learning and machine learning models.
- Scalability: SINGA parallelizes the training and optimizes the communication cost to improve training scalability.
- Efficiency: SINGA builds a computational graph to optimize the training speed and memory footprint.
- Usability: SINGA has a simple software stack and Python interface to improve usability.
Installation
# CPU only
conda install -c nusdbsystem -c conda-forge singa-cpu=3.1.0

# GPU with CUDA and cuDNN (CUDA driver >= 384.81 is required)
conda install -c nusdbsystem -c conda-forge singa-gpu=3.1.0
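As a quick sanity check after installation, the hedged sketch below creates tensors on a SINGA device and runs a simple element-wise operation; device.create_cuda_gpu() would be used instead with the GPU build. The helper names follow the SINGA 3.x Python API and are best confirmed against the official docs.

from singa import tensor, device

dev = device.get_default_device()   # CPU; use device.create_cuda_gpu() on a GPU build

# Create two 2x3 tensors on the chosen device
x = tensor.Tensor((2, 3), dev)
x.gaussian(0.0, 1.0)                # fill with random values
y = tensor.Tensor((2, 3), dev)
y.set_value(1.0)

z = x + y                           # element-wise ops run on the chosen device
print(tensor.to_numpy(z))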
Thank you for reading!
Any feedback and comments are greatly appreciated!