Training DeepSpeech using TorchElastic

Sean Narenthiran
Published in PyTorch
Aug 10, 2020

Reduce cost and horizontally scale deepspeech.pytorch using TorchElastic with Kubernetes.

End-to-End Speech To Text Models Using Deepspeech.pytorch

Deepspeech.pytorch provides training, evaluation and inference of End-to-End (E2E) speech-to-text models, in particular the highly popularised DeepSpeech2 architecture. Deepspeech.pytorch was developed to give users the flexibility and simplicity to scale, train and deploy their own speech recognition models, whilst maintaining a minimalist design. It is a lightweight package for research iteration and integration that fills the gap between audio research and production.

Scale Training Horizontally Using TorchElastic

Training production E2E speech-to-text models currently requires thousands of hours of labelled transcription data; recent cases exceed 50k hours of labelled audio. Training on datasets of this size requires optimised multi-GPU training and carefully chosen hyper-parameter configurations. As we move towards leveraging unlabelled audio data for speech recognition models, following the announcement of wav2vec 2.0, scaling and throughput will continue to be crucial for training larger models across larger datasets.

Multiple advancements in the field have improved training iteration times, such as improvements to cuDNN, the introduction of Automatic Mixed Precision and, in particular, multi-machine training. Many frameworks have appeared to assist with multi-machine training, such as Kubeflow, but they usually come with a vast feature set intended to replace the entire training workflow. Implementing multi-machine training from scratch requires significant engineering effort and, in our experience, rarely offers the robustness required to scale reliably. TorchElastic provides native PyTorch scaling capabilities and fits the lightweight paradigm of deepspeech.pytorch whilst giving users enough customisation and freedom. In essence, TorchElastic makes it possible to scale PyTorch training with minimal effort, saving the time otherwise spent building complex custom infrastructure and accelerating the path from research to production.
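
As a rough illustration of what this looks like inside the training script, the sketch below shows the typical per-worker setup once a job is started by the TorchElastic launcher. The helper name setup_distributed is ours, not part of deepspeech.pytorch, and the launch command itself is omitted.

```python
# Minimal sketch of per-worker distributed setup under an elastic launcher.
# The launcher sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
# for each worker, so the training script only reads them and joins the group.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed(model: torch.nn.Module) -> DDP:
    local_rank = int(os.environ["LOCAL_RANK"])  # provided by the launcher
    torch.cuda.set_device(local_rank)

    # Rank, world size and rendezvous address are read from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")

    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```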

Reduce Scaling Cost By Using Preemptible Instances

One method to reduce costs when scaling is to utilise preemptible instances: virtual machines that are not currently being used by on-demand customers and can be obtained at a substantially lower price. Comparing NVIDIA V100 prices on Google Cloud, this amounts to roughly a 3x cost saving. A DeepSpeech training run using the popular LibriSpeech dataset costs around $510 using on-demand V100s on Google Cloud; utilising preemptible instances reduces this to around $153, a significant reduction that allows for more research training cycles.
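
As a rough back-of-the-envelope check, assuming an on-demand V100 at about $2.48/hour and a preemptible V100 at about $0.74/hour (approximate us-central1 list prices), the figures above line up:

```python
# Back-of-the-envelope cost check; the per-hour prices are approximations.
on_demand_per_hour = 2.48     # approx. on-demand V100 price (USD/hour)
preemptible_per_hour = 0.74   # approx. preemptible V100 price (USD/hour)

gpu_hours = 510 / on_demand_per_hour                 # ~206 GPU-hours for the run
preemptible_cost = gpu_hours * preemptible_per_hour
print(round(preemptible_cost))                       # ~152, close to the $153 above
```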

However, due to their short life cycle, preemptible instances come with the caveat that interruptions can happen at any time, and your code needs to handle this. Depending on the training pipeline, this can be complex, as capturing the state of training for re-initialisation may mean keeping track of a large number of variables. One way to solve this is to abstract the "state" of training so it can be saved, loaded and resumed upon failures in the cluster, which also makes it simpler to track new variables in the future.

Implementing state in training deepspeech.pytorch. See in full here.

TorchElastic makes abstracting state straightforward. Following the example guidelines, the crucial functions to implement are those that save and resume state within the training code; TorchElastic handles the rest. After integration, deepspeech.pytorch handles interruptions seamlessly, resuming from previously saved state checkpoints. Deepspeech.pytorch also supports saving and loading state from a Google Cloud Storage bucket automatically, allowing us to mount read-only data drives to the node(s) and store our final models in an object store. TorchElastic removes a lot of boilerplate code, relieving us of the need to worry about distributed ranks, local GPU devices and distributed communication with other nodes.
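
A minimal sketch of such a state abstraction is shown below; the class and field names are illustrative, in the spirit of the TorchElastic examples, rather than the exact deepspeech.pytorch implementation.

```python
# Illustrative sketch of a training "state" object; names and fields are
# assumptions, not the exact deepspeech.pytorch implementation.
import torch


class TrainingState:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.epoch = 0
        self.best_wer = float("inf")

    def save(self, path: str):
        # Everything needed to resume after a preempted node is replaced.
        torch.save(
            {
                "model": self.model.state_dict(),
                "optimizer": self.optimizer.state_dict(),
                "epoch": self.epoch,
                "best_wer": self.best_wer,
            },
            path,
        )

    def load(self, path: str):
        checkpoint = torch.load(path, map_location="cpu")
        self.model.load_state_dict(checkpoint["model"])
        self.optimizer.load_state_dict(checkpoint["optimizer"])
        self.epoch = checkpoint["epoch"]
        self.best_wer = checkpoint["best_wer"]
```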

deepspeech.pytorch Training Config
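
A simplified, Hydra-style structured config might look like the sketch below; the field names and defaults are illustrative and do not mirror the deepspeech.pytorch configuration exactly.

```python
# Illustrative Hydra-style structured config; field names and defaults are
# assumptions, not the exact deepspeech.pytorch configuration.
from dataclasses import dataclass, field


@dataclass
class OptimizerConfig:
    learning_rate: float = 1.5e-4
    weight_decay: float = 1e-5


@dataclass
class CheckpointConfig:
    save_folder: str = "gs://my-bucket/checkpoints/"  # hypothetical GCS path
    load_auto_checkpoint: bool = True                 # resume from the latest saved state


@dataclass
class TrainingConfig:
    epochs: int = 70
    batch_size: int = 32  # per-GPU batch size
    precision: int = 16   # mixed precision
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
    checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
```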

We rely on Kubernetes to handle interruptions and node management. TorchElastic supplies the PyTorch distributed integration that lets us scale across GPU-enabled machines using Elastic Jobs. With the TorchElastic Kubernetes Operator (TEK), we're able to transparently run distributed deepspeech.pytorch within our GKE Kubernetes cluster.

Both fault-tolerant jobs (where nodes can fail at any moment) and node pools that change dynamically based on demand are supported. This is particularly useful when training DeepSpeech models on an auto-scaling node pool of preemptible GPU instances, whilst utilising as much of the pool as possible.

Scaling and Managing Hyper-parameters

When node pools are dynamic, stable hyper-parameters are key to handling variable-sized compute pools. AdamW is a popular adaptive learning rate algorithm that, once tuned, provides stability even when resources can be terminated or introduced at any time. When scaling to a substantial number of GPUs, fault tolerance has to be taken into consideration to ensure the pool is fully utilised and training completes despite disruptions. Many other mini-batch and learning rate scheduler hyper-parameters are also crucial, but with the recent addition of Hydra to deepspeech.pytorch, keeping track of them is straightforward. In the future, deepspeech.pytorch will support various scaling hyper-parameter configurations for users to extend.
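
As a sketch of the stability point above, each worker below configures the same AdamW optimiser and schedule regardless of how many nodes are currently in the pool; the values are placeholders rather than the tuned deepspeech.pytorch defaults.

```python
# Illustrative optimiser/scheduler setup that stays stable as workers come and
# go; the hyper-parameter values are placeholders, not tuned defaults.
import torch


def configure_optimizer(model: torch.nn.Module):
    # AdamW keeps per-worker settings identical regardless of pool size.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=1e-5)
    # Epoch-based exponential decay: the schedule itself needs no retuning
    # when the number of workers changes between epochs.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
    return optimizer, scheduler
```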

Future Steps

As dataset sizes increase and research continues to show exciting ways to leverage unlabelled audio with new architectures inspired by NLP, scaling and throughput will be key to training speech-to-text models. Deepspeech.pytorch aims to be transparent and simple, and to allow users to build on and extend the library for their use cases.

Here are a few future directions for deepspeech.pytorch:

  • Integrate TorchScript to introduce seamless production integration across Python and C++ back-ends
  • Introduce Trains by Allegro AI for Visualisation and Job Scheduling
  • Benchmark DeepSpeech using the new A2 A100 VMs available on Google Cloud, for further throughput/cost benefits
  • Move towards abstracting the model, integrating recent advancements in model architectures such as ContextNet, and additional loss functions such as the ASG criterion

To get started with training your own DeepSpeech models using TorchElastic, have a look at the k8s template and the README. Feel free to reach out with any questions or create an issue here!

Sean Narenthiran
Research Engineer at Grid AI | PyTorch Lightning