Announcing Lightning v1.5
Lightning 1.5 introduces Fault-Tolerant Training, LightningLite, Loops Customization, Lightning Tutorials, RichProgressBar, LightningCLI V2, and many more exciting features.
PyTorch Lightning v1.5 marks a significant leap of reliability to support the increasingly complex demands of the leading AI organizations and prestigious research labs that rely on Lightning to develop and deploy AI at scale.
PyTorch Lightning's ambition has never been greater as we aim to become the simplest and most flexible framework for expediting any deep learning research to production.
Following this vision, the Lightning v1.5 release has made progress in several vital directions: Improved Stability, Improved On-boarding, Advanced Flexibility, Quality of Life Improvements, and SOTA Experimental Features.
Find the complete release notes here.
Batch-Level Fault-Tolerant Training
Traditionally, training frameworks save checkpoints at the end of an epoch or after every N steps to recover in case of an accidental failure.
Lightning 1.5 extends this concept further by introducing a batch-level fault-tolerant training mechanism. When enabled and an unexpected failure occurs, Lightning users can resume a failed training from the same failed batch.
There is no need for the user to do anything beyond rerunning the script 🤯 ! In the future, this will enable Elastic Training with Lightning.
Learn more in the documentation.
Note that Fault-tolerant Training is currently an experimental feature within Lightning.
BFloat16 Support

PyTorch 1.10 introduces `torch.bfloat16` support for both CPUs and GPUs, enabling more stable training compared to native Automatic Mixed Precision (AMP) with `float16`. To enable this in PyTorch Lightning, simply do the following:
Trainer Strategy API
PyTorch Lightning v1.5 introduces a new `strategy` flag enabling a cleaner distributed training API:
- `accelerator` refers to the hardware: `"cpu"`, `"gpu"`, `"tpu"`, etc.
- `strategy` refers to how to utilize the hardware: `"ddp"`, `"ddp_spawn"`, `"deepspeed"`, etc.
- `devices` refers to how many devices of the chosen accelerator type to use.
Passing training strategies (e.g. `"ddp"`) to the `accelerator` argument has been deprecated in v1.5.0 and will be removed in v1.7.0. Please use the `strategy` argument as explained above.
`TrainingTypePlugin` will be renamed to `Strategy` in a future release.
PyTorch Lightning includes a registry that holds information about strategies and allows for the registration of new custom ones.
Additionally, you can pass your custom registered training type plugins to the `strategy` argument.
Lightning Lite | Stepping stone to Lightning
Do you want to keep complete control over your PyTorch code but face challenges with acceleration on CPU, GPUs, and TPUs, adding multi-node support, or mixed precision? Then, Lite is the right choice for you!
Here's how Lightning Lite makes adding multi-GPU training support easier than ever: you scale your code while maintaining full control of your training loop.
Once you use LightningLite, you get automatic accelerator and device discovery and can run the same code on GPUs or TPUs with plugins such as DeepSpeed ZeRO Stage 3.
Below, we have 5 MNIST examples showing how to convert from pure PyTorch to PyTorch Lightning step by step:
- This script shows how to train a simple CNN over MNIST using vanilla PyTorch.
- This script shows how to scale the previous script to enable GPU and multi-GPU training using `LightningLite`.
- This script shows how to prepare your conversion from `LightningLite` to the `LightningModule`.
- This script shows the result of the conversion to the `LightningModule` and all the benefits you get from Lightning.
- This script shows how to extract the data-related components into a `LightningDataModule`.
LightningLite enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops. Find the documentation here.
Lightning Tutorials

The Lightning 1.5 docs are new and improved and contain a new course from the University of Amsterdam (UvA) that introduces the core concepts of state-of-the-art Deep Learning and will familiarize you with Lightning's core features and ecosystem.
- Tutorial 1: Introduction to PyTorch
- Tutorial 2: Activation Functions
- Tutorial 3: Initialization and Optimization
- Tutorial 4: Inception, ResNet and DenseNet
- Tutorial 5: Transformers and Multi-Head Attention
- Tutorial 6: Basics of Graph Neural Networks
- Tutorial 7: Deep Energy-Based Generative Models
- Tutorial 8: Deep AutoEncoders
- Tutorial 9: Normalizing Flows for Image Modeling
- Tutorial 10: AutoRegressive Image Modeling
- Tutorial 11: Vision Transformers
- Tutorial 12: Meta-Learning — Learning to Learn
- Tutorial 13: Self-Supervised Contrastive Learning with SimCLR
Find the associated blog post to learn more.
Soon, PyTorch Lightning will be hosting a Lightning Tutorial Community Sprint to partner with academics from all over the world to enhance their deep learning curricula by integrating Lightning's new tutorial capabilities. Here is the issue tracking the current sprint and the associated Google Form to apply.
LightningCLI V2, No Boilerplate For Reproducible AI
Running non-trivial experiments often requires configuring many different trainer and model arguments, such as learning rates, batch sizes, numbers of epochs, data paths, data splits, and numbers of GPUs. Since most experiments are launched from the command line, these arguments need to be exposed in the training script.
Implementing command-line tools with libraries such as Python's standard `argparse` to manage hundreds of possible trainer, data, and model configurations is a huge source of boilerplate:
This often leads to basic configurations being hard-coded and inaccessible for experimentation and reuse. Additionally, most of the configuration is duplicated in the signature and argument defaults, as well as docstrings and argument help messages.
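A typical slice of that boilerplate might look like this (argument names here are illustrative):

```python
import argparse


def build_parser():
    # Every new hyperparameter must be added here, defaulted here,
    # and documented here, duplicating what the model already knows.
    parser = argparse.ArgumentParser(description="Training script")
    parser.add_argument("--learning_rate", type=float, default=1e-3, help="optimizer learning rate")
    parser.add_argument("--batch_size", type=int, default=32, help="samples per batch")
    parser.add_argument("--max_epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--data_path", type=str, default="./data", help="dataset location")
    parser.add_argument("--gpus", type=int, default=0, help="number of GPUs to use")
    return parser


args = build_parser().parse_args([])  # parse defaults; real scripts use sys.argv
```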
Here is all you need to start using the `LightningCLI`:
For Lightning v1.5, we have implemented a new notation to easily instantiate objects directly from the command line. This dramatically improves the command line experience as you can customize almost any aspect of your training by referencing only class names.
This notation works with Lightning's built-in components as well as with your own classes.
Finally, you can register your own components with the `LightningCLI` registries as follows:
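With registered components, invocations like the following become possible (the script name `trainer.py` and the model name are hypothetical):

```shell
# Select an optimizer and scheduler by class name and set their arguments
python trainer.py fit --optimizer=Adam --optimizer.lr=0.01 \
    --lr_scheduler=CosineAnnealingLR --lr_scheduler.T_max=100

# Registered models and callbacks can be referenced the same way
python trainer.py fit --model=MyRegisteredModel --trainer.callbacks=EarlyStopping
```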
Loop Customization

PyTorch Lightning was created to do the hard work for you. The Lightning Trainer automates all the mechanics of the training, validation, and test routines. To create your model, all you need to do is define the architecture and the training, validation, and test steps, and Lightning will make sure to call the right thing at the right time.
Internally, the Lightning Trainer relies on a series of nested loops to properly conduct the gradient descent optimization that applies to 90%+ of machine learning use cases. Even though Lightning provides hundreds of features, behind the scenes, it looks like this:
However, some new research use cases such as: meta-learning, active learning, cross-validation, recommendation systems, etc., require a different loop structure.
To resolve this, the Lightning Team implemented a general while-loop as a Python class, the Lightning `Loop`. Its full implementation can be found in the Lightning repository.
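In pseudo-code, the idea boils down to a `while` loop wrapped in a class (a simplified sketch, shown with a toy subclass):

```python
class Loop:
    """Simplified sketch of Lightning's Loop interface."""

    @property
    def done(self) -> bool:
        raise NotImplementedError

    def reset(self):
        pass

    def advance(self, *args, **kwargs):
        raise NotImplementedError

    def on_run_end(self):
        return None

    def run(self, *args, **kwargs):
        # The essence of every Lightning loop: reset, then advance until done.
        self.reset()
        while not self.done:
            self.advance(*args, **kwargs)
        return self.on_run_end()


class CountingLoop(Loop):
    """Toy subclass: advances a counter until it reaches a maximum."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.count = 0

    @property
    def done(self):
        return self.count >= self.max_steps

    def reset(self):
        self.count = 0

    def advance(self):
        self.count += 1

    def on_run_end(self):
        return self.count
```

Lightning's real training, evaluation, and optimizer loops are all subclasses following this pattern, which is what makes them replaceable.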
Using Loops has several advantages:
- You can replace, subclass, or wrap any loops within Lightning to customize their inner workings to your needs. This makes it possible to express any type of research with Lightning.
- The Loops are standardized and each loop can be isolated from its parent and children. With a simple loop, you might end up with more code, but when dealing with hundreds of features, this structure is the key to scaling while preserving a high level of flexibility.
- The Loop can track its state and save its state within the model checkpoint. This is used with fault-tolerant training to enable auto restart.
Find the dedicated blog post here and its documentation. In the blog post, you will learn how the community created custom loops for Active Learning, Cross-Validation, and yielding from the LightningModule training step.
CheckpointIO Plugin

As part of our commitment to extensibility, we have abstracted the checkpointing logic into a `CheckpointIO` plugin. This enables users to adapt checkpoint saving and loading to their infrastructure. Find the documentation here.
Quality of Life Improvements
Rich Progress Bar
We are excited to announce that Lightning now includes support for RichProgressBar and RichModelSummary to make the command-line training experience more visually appealing.

Rich is a Python library for rich text and beautiful formatting in the terminal.
All you have to do is pass the `RichProgressBar` callback to the Trainer, and Lightning handles the rest for you!
Both callbacks are easily extendable, allowing users to customize how the progress bar metrics and model summary table are displayed. You can finally customize it to your preferences. Here is our Green Is Good theme.
SOTA Experimental Features
Init Meta Context
Right now, there is a race to create larger and larger models that no longer fit on a single device. The current approach to scaling to trillion-parameter sizes is to shard the model, i.e., chunk its parameters, activations, and optimizer states, as described in the ZeRO-3 paper. However, instantiating large models is still complicated, as it requires all devices to be available and connected to perform the sharding.
To remedy this problem, PyTorch 1.10 introduced meta tensors. Meta tensors are like normal tensors, but they carry no data, so there is no risk of out-of-memory errors. Using meta tensors, it is possible to instantiate a meta-model and then materialize it once all devices are connected.
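For example, a tensor far larger than available RAM can be created on the meta device, since only its shape and dtype exist:

```python
import torch

# A 10-billion-element tensor: allocates no memory on the meta device.
giant = torch.empty(10_000_000_000, device="meta")

print(giant.is_meta)   # True
print(giant.shape)     # torch.Size([10000000000])
```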
This enables scaling minGPT to 45 Billion parameters with minimal code changes. Learn more here.
Inter Batch Parallelism
Inter Batch Parallelism hides the latency of host-to-device copies of input batches behind computationally intensive operations.
The associated speed-up can be significant when training a large recommendation engine with PyTorch Lightning. More information will be shared soon.
See the release notes for how to enable this experimental feature.
The Lightning Team is more than ever committed to providing the best experience possible to anyone doing optimization with PyTorch. With the PyTorch Lightning API being already stable, breaking changes will be minimal.
If you're interested in helping out with these efforts, find us on Slack!
Built by the PyTorch Lightning creators, let us introduce you to Grid.ai. Our platform enables you to scale your model training without worrying about infrastructure, just as Lightning automates the training itself.
You can get started with Grid.ai for free with just a GitHub or Google Account.
Grid.ai enables you to scale training from your laptop to the cloud without having to modify a single line of code. While Grid supports all the classic machine learning frameworks such as TensorFlow, Keras, and PyTorch, you can use any libraries you wish. Leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and the CLI makes the traditional MLOps behind model training seem invisible.