An interesting article, along with a code repo, came from Nvidia on their MegatronLM model. You can read more about it at https://nv-adlr.github.io/MegatronLM.
This is, as far as I know, the first PyTorch model parallelism implementation of a model, as I've mentioned in one of my previous posts. They have taken an interesting approach: instead of digging into the PyTorch codebase and tweaking it to support model parallelism in general, they have nicely used existing PyTorch operations to partition large tensors.
Again, this is NOT pipeline parallelism as I discussed in the GPipe post. The idea is to split a tensor that would not fit on one GPU across multiple GPUs. The GEMMs performed on these partitions are then combined with reductions. See the "Integrated Model and Batch Parallelism" paper for in-depth details.
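To make the split-then-reduce idea concrete, here is a minimal sketch (my own illustration, not Megatron's actual code) that simulates partitioning a linear layer's weight matrix across two "devices" with NumPy: each partition does a local GEMM, and summing the partial results plays the role of the all-reduce on real hardware.

```python
import numpy as np

# Simulate tensor model parallelism of a linear layer Y = X @ W
# by splitting W along its rows (the input dimension) across two
# hypothetical devices. Shapes here are illustrative.

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # activations: batch of 4, hidden size 6
W = rng.standard_normal((6, 8))   # weight matrix too "large" for one device

# Shard the inputs column-wise and the weight row-wise.
X0, X1 = X[:, :3], X[:, 3:]
W0, W1 = W[:3, :], W[3:, :]

# Each device performs its local GEMM on its shard only.
partial0 = X0 @ W0
partial1 = X1 @ W1

# The reduction (an all-reduce sum across GPUs in practice)
# accumulates the partials into the full result.
Y = partial0 + partial1

assert np.allclose(Y, X @ W)
```

On real hardware the final sum would be a collective (e.g. NCCL all-reduce) rather than a local addition, but the math is the same: the full GEMM decomposes exactly into partial GEMMs over tensor partitions.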
It's good to finally see a practical use of "real" model parallelism!