Authors: Oleksii Kuchaiev (Senior Applied Scientist, NVIDIA), Poonam Chitale (Senior Product Manager, NVIDIA)
Conversational AI is changing the way we interact with computers. It comprises three exciting areas of artificial intelligence (AI) research: automatic speech recognition (ASR), natural language processing (NLP), and speech synthesis (or text-to-speech, TTS). We aim to democratize and accelerate the progress in these areas by making it easier for researchers and practitioners to access, re-use, and build upon the latest building blocks and pre-trained models in these fields.
NVIDIA NeMo (Neural Modules) is an open-source toolkit based on PyTorch that allows you to quickly build, train, and fine-tune conversational AI models. NeMo consists of NeMo Core, which provides a common “look and feel” for all models and modules, and NeMo Collections, which are groups of domain-specific modules and models. In NeMo’s Speech collection (nemo_asr) you’ll find models and various building blocks for speech recognition, command recognition, speaker identification, speaker verification, and voice activity detection. NeMo’s NLP collection (nemo_nlp) contains models for tasks such as question answering, punctuation, named entity recognition, and many others. Finally, in NeMo’s Speech Synthesis collection (nemo_tts) you’ll find several spectrogram generators and vocoders that let you generate synthetic speech.
Voice swap example
Let’s start our introduction to NeMo with a simple prototype. In this example, we will take an audio file and replace the voice in it with a synthetic one generated by a NeMo model. Listen to it here.
Conceptually, this app demonstrates all three stages of a conversational AI system: (1) speech recognition, (2) deriving meaning or understanding what was said, and (3) generating synthetic speech as a response. If you have a GPU-enabled PyTorch version 1.6 or later, NeMo can be installed with pip:
pip install nemo_toolkit[all]==1.0.0b1
The first step of a NeMo-based application is importing the necessary collections. In this app, we’ll be using all three of them.
Collections give us access to NeMo models and we can use them to perform certain conversational AI tasks. Models are one of the key concepts in NeMo. We’ll discuss them in more detail below, but we’ll just use the ones we need for now:
Most NeMo models can be instantiated directly from the NVIDIA NGC catalog using the from_pretrained(…) function. You can also query each model class for the list of its available pre-trained weights.
As you can see from the code snippet above, we will use the QuartzNet model for speech recognition, a punctuation model based on DistilBERT, and Tacotron2 + WaveGlow models for speech synthesis. Note that NeMo’s NLP collection is compatible with the excellent Hugging Face transformers library, whose language models are often used as encoders by NeMo’s NLP models. Once all models are instantiated, they are ready to use. Here is an example of using the ASR model to transcribe an audio file and the NLP model to add punctuation to the transcribed text:
Please refer to this interactive Google Colab notebook for a complete running example. Notice how the punctuation model makes a huge difference in the quality of the generated speech. Speech generated based on the output of the punctuation model is much easier to understand than the one created directly from the ASR model’s raw output because it contains pauses and intonations in the proper places.
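To make the three-stage flow concrete without downloading any models, here is a minimal sketch in plain Python. The functions fake_asr, fake_punctuate, and fake_tts are hypothetical stand-ins invented for illustration; they are not NeMo APIs, and in the real app each stage would be a pre-trained NeMo model.

```python
from typing import Callable, List, Tuple


# Hypothetical stand-ins for the three stages. None of these names come
# from NeMo; each one only mimics the shape of the corresponding step.
def fake_asr(audio_path: str) -> str:
    """Pretend recognizer: returns lowercase text without punctuation."""
    return "hello world how are you"


def fake_punctuate(text: str) -> str:
    """Pretend punctuation model: capitalizes and adds a final period."""
    return text[0].upper() + text[1:] + "."


def fake_tts(text: str) -> List[float]:
    """Pretend synthesizer: returns one dummy audio sample per character."""
    return [0.0] * len(text)


def voice_swap(
    audio_path: str,
    asr: Callable[[str], str],
    punctuate: Callable[[str], str],
    tts: Callable[[str], List[float]],
) -> Tuple[str, List[float]]:
    """Chain the stages: audio -> raw text -> punctuated text -> audio."""
    raw_text = asr(audio_path)
    text = punctuate(raw_text)
    audio = tts(text)
    return text, audio


text, audio = voice_swap("sample.wav", fake_asr, fake_punctuate, fake_tts)
print(text)
```

Because the stages are passed in as callables, any of them can be swapped for a real model without touching the pipeline function.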
NeMo models, neural modules, and neural types
In NeMo, there are three main concepts: model, neural module, and neural type. Models are meant to be “full recipes” containing all the information necessary for training and fine-tuning. As such, they encapsulate:
- Neural network implementation — all the neural modules connected for training and evaluation.
- All necessary pre- and post-processing — tokenization, data-augmentation, etc.
- Dataset classes which can be used with this model.
- Optimization algorithm and learning rate schedule.
- Infrastructure details — such as how many GPUs, nodes and what kind of training precision should be used.
As we saw in the demo above, most models can be instantiated with particular pre-trained weights directly from the repository on the NVIDIA NGC catalog.
Deep neural networks can often be thought of as consisting of conceptual building blocks responsible for different tasks. An encoder-decoder architecture is a famous example. An encoder is tasked with learning the input representation, while a decoder is responsible for generating an output sequence based on it. In NeMo, we call these blocks “Neural Modules” (this is where the name NeMo comes from). A Neural Module (nemo.core.NeuralModule) represents a logical part of a neural network such as a language model, an encoder, a decoder, a data augmentation algorithm, a loss function, etc. Neural Modules form the basis for describing a model and the process by which that model is trained. The NeuralModule class is derived directly from torch.nn.Module, so you can use modules from NeMo collections inside your PyTorch applications. Collections have hundreds of neural modules for you to re-use in your models.
Inputs and outputs to Neural Modules are typed with Neural Types. A Neural Type is a pair that contains information about the tensor’s axis layout (similar to Named Tensor in PyTorch) and the semantics of its elements. Every Neural Module has input_types and output_types properties which describe (and help enforce) what kinds of inputs this module accepts and what kinds of outputs it returns.
Let’s consider how models, neural modules, and types interact with one another. If we peek under the hood of the forward() method of the QuartzNet model, we’ll see:
The QuartzNet Model contains preprocessor, (optionally) spectrogram augmentation, encoder, and decoder neural modules. Note that they are used exactly like you would use torch.nn.Module modules but with added type safety. Here are some of the input/output types of this model’s neural modules:
As you can see, types dictate both tensor layouts and the semantics of their elements. The preprocessor will not only check that tensors passed to it are 2-dimensional [batch, time] tensors, but will also enforce that the elements inside the tensor represent an AudioSignal. Neural Types support inheritance, which is why MelSpectrogramType output is accepted anywhere SpectrogramType is expected. Types are enforced with the help of the @typecheck decorator, and the enforcement can be turned on or off. It is an experimental feature, but we found it useful for helping users of modules use them correctly.
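The idea of a type as a (layout, semantics) pair with inheritance can be sketched in a few lines of plain Python. This is an illustrative toy, not NeMo’s implementation: ToyNeuralType and its accepts method are hypothetical names, and the ("batch", "dim", "time") layout used for spectrograms is an assumption. Only the element type names AudioSignal, SpectrogramType, and MelSpectrogramType are taken from the text.

```python
from dataclasses import dataclass
from typing import Tuple, Type


# Element semantics are modelled as a small class hierarchy, so that
# "is a kind of" can be answered with issubclass().
class ElementType:
    pass


class AudioSignal(ElementType):
    pass


class SpectrogramType(ElementType):
    pass


class MelSpectrogramType(SpectrogramType):
    pass


@dataclass(frozen=True)
class ToyNeuralType:
    """Hypothetical pair of (axis layout, element semantics)."""

    axes: Tuple[str, ...]
    elements: Type[ElementType]

    def accepts(self, other: "ToyNeuralType") -> bool:
        # Layouts must match exactly; element semantics may be a subtype.
        return self.axes == other.axes and issubclass(
            other.elements, self.elements
        )


# Assumed layouts, chosen for illustration only.
expected = ToyNeuralType(("batch", "dim", "time"), SpectrogramType)
mel = ToyNeuralType(("batch", "dim", "time"), MelSpectrogramType)
audio = ToyNeuralType(("batch", "time"), AudioSignal)

print(expected.accepts(mel))    # a mel spectrogram is a spectrogram
print(expected.accepts(audio))  # wrong layout and wrong semantics
```

The check is deliberately one-directional: a port expecting MelSpectrogramType would reject a generic SpectrogramType, since the more specific guarantee cannot be assumed.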
Training and fine-tuning with NeMo
NeMo is built for training and fine-tuning conversational AI models. While you can use “pure” PyTorch to work with NeMo’s models and modules, they are most effectively used with two other projects from the PyTorch ecosystem: PyTorch Lightning and Hydra.
A NeMo Model is derived from the PyTorch Lightning Module and can be used with Lightning’s Trainer instance. This integration with Lightning makes it very easy to train models with mixed precision using Tensor Cores and to scale training to multiple GPUs and compute nodes. For example, we scaled the training for some of the NeMo models to use 512 GPUs. Lightning also gives users many other convenient features such as logging, checkpointing, overfit checks, and others.
NeMo users can use Facebook’s Hydra to parametrize their scripts. A typical deep learning experiment can contain hundreds, if not thousands, of parameters, which is why it is handy to keep them in well-organized configuration files. NeMo models and modules use Hydra for parametrization, giving our users the flexibility and error-checking capabilities provided by Hydra.
The integration with PyTorch Lightning and Hydra makes it possible to streamline common tasks for our users. Consider the example below. It is a complete Python script which is able to take a .yaml config file and train a speech recognition model. NeMo + Lightning + Hydra standardize many things, and with only two lines changed, we can turn it into a script for training a BERT-based question answering model.
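To illustrate why keeping parameters in a structured config is convenient, here is a small standard-library sketch of dotted key=value overrides applied to a nested configuration, in the spirit of the command-line overrides Hydra supports. It is not Hydra’s API: apply_overrides and parse_value are hypothetical helpers, and the config keys (model.optim.lr, trainer.gpus, trainer.max_epochs) are made-up examples rather than NeMo’s actual schema.

```python
import copy
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Convert an override value to int or float when possible."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of config with 'a.b.c=value' overrides applied.

    Unknown keys raise KeyError, a simple form of error checking that
    catches typos in parameter names.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            if name not in node:
                raise KeyError(f"unknown config key: {dotted_key}")
            node = node[name]
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical base configuration, standing in for a loaded .yaml file.
base = {
    "model": {"optim": {"lr": 0.01}},
    "trainer": {"gpus": 1, "max_epochs": 5},
}

cfg = apply_overrides(base, ["trainer.gpus=2", "model.optim.lr=0.001"])
print(cfg)
```

The base configuration is left untouched, so the same defaults can be reused across experiments that differ only in their overrides.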
NeMo is built for people curious about Conversational AI — speech recognition, natural language processing and speech synthesis. We also put a lot of effort and compute power into creating collections of pre-trained models which would be useful for our users.
We encourage you to try NeMo. Go to our GitHub to check out interactive tutorials with NeMo. The voice swap example we discussed at the beginning of this blog post is a great place to start.
Finally, NeMo is being developed as an open-source project on GitHub and we welcome external contributions. There are many ways you can contribute, from working on code or docs to training models in new languages.
The authors would like to thank the NeMo research and engineering team as well as our partners PyTorch and PyTorch Lightning for bringing this blog post to you.