Speech Recognition System Comparison: Microsoft vs IBM 2016
A quick comparison of two very interesting articles.
This article compares the IBM 2016 speech recognition system (article: "The IBM 2016 English Conversational Telephone Speech Recognition System" by G. Saon et al., a.k.a. ISR2016) and the Microsoft 2016 speech recognition system (article: "The Microsoft 2016 Conversational Speech Recognition System" by W. Xiong et al., a.k.a. MSR2016). I have also previously written about the Microsoft paper here.
Data Extraction and Preprocessing
Data extraction: the same in both systems, using a 25 ms analysis frame and a 10 ms frame shift.
MSR2016: log-filterbank features are extracted.
- VTL: vocal tract length. Features are normalized according to the speaker's vocal tract length; for example, female vocal tracts are typically shorter than male ones.
- PLP: Perceptual Linear Prediction. A feature representation designed to suppress speaker-specific differences in voice quality.
- Mel-frequency cepstrum : a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
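As a rough illustration of the shared front end (25 ms window, 10 ms shift, log-mel filterbank features), here is a minimal numpy sketch. The sample rate, FFT size, and 40-filter bank are my own illustrative choices, not values taken from either paper:

```python
import numpy as np

def log_mel_filterbank(signal, sr=8000, frame_ms=25, shift_ms=10, n_fft=512, n_mels=40):
    """Log-mel filterbank features: 25 ms frames, 10 ms shift (toy parameters)."""
    frame_len = int(sr * frame_ms / 1000)   # 200 samples at 8 kHz
    hop = int(sr * shift_ms / 1000)         # 80 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2 + 1)

    # Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log(power @ fbank.T + 1e-10)  # (n_frames, n_mels)
```

One second of 8 kHz audio yields 98 frames of 40 log-mel values with these settings.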
Speaker Adaptation Model:
MSR2016: 100-dimension i-vector
ISR2016: 100-dimension i-vector, VTL normalization, PLP features, etc.
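In both systems, i-vector adaptation amounts to appending one fixed 100-dimensional speaker vector to every acoustic frame. A toy numpy sketch (random numbers stand in for real features and a real i-vector extractor):

```python
import numpy as np

# Hypothetical shapes: 40-dim acoustic frames, 100-dim speaker i-vector.
frames = np.random.randn(98, 40)   # acoustic features for one utterance
ivector = np.random.randn(100)     # one fixed vector per speaker

# The same i-vector is appended to every frame of that speaker's audio,
# giving the network a constant speaker-identity signal alongside the acoustics.
adapted = np.hstack([frames, np.tile(ivector, (len(frames), 1))])
print(adapted.shape)  # (98, 140)
```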
ISR2016 Acoustic Training Data:
- SwitchBoard 1: 262 hours
- Fisher: 1698 hours
- CallHome: 15 hours
MSR2016 Acoustic Training Data:
- Switchboard-1 Release 2 (LDC97S62)
- Fisher English Training Speech Part 1 Speech (LDC2004S13)
- Fisher English Training Part 2, Speech (LDC2005S13)
- 2002 Rich Transcription Broadcast News and Conversational Telephone Speech (LDC2004S11)
- NIST Meeting Pilot Corpus Speech (LDC2004S09)
MSR2016 Language Training Data: CTS transcripts from the DARPA EARS program:
- Switchboard (3M words)
- BBN Switchboard-2 transcripts (850k words)
- Fisher (21M words)
- English CallHome (200k words)
- the University of Washington conversational Web corpus (191M words)
ISR2016 Test Data (Hub5 2000):
- SwitchBoard Data: 2.1 hours with 21.4K words and 40 speakers
- CallHome Data: 1.6 hours with 21.6K words and 40 speakers
MSR2016 appears to use the same CallHome and SwitchBoard test data, though this is not specified in the article.
Acoustic Model Training
ISR2016:
- Recurrent nets with maxout activations, trained with Hessian-free sequence discriminative training
- Very deep convolutional networks, similar to VGG: cross-entropy training + Nesterov accelerated gradient (NAG)
- Bidirectional LSTM
MSR2016:
- CNN variant: VGG
- CNN variant: Residual Net (ResNet)
- CNN variant: LACE (layer-wise context expansion with attention)
- Bidirectional LSTM
MSR2016 uses cross-entropy training plus lattice-free maximum mutual information (LFMMI) training.
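The cross-entropy stage trains the network to predict the forced-alignment senone label for each frame. This is not the papers' training code, just a numpy sketch of the frame-level objective with made-up logits and labels:

```python
import numpy as np

def frame_cross_entropy(logits, targets):
    """Average per-frame cross-entropy between network outputs
    (logits, shape [frames, senones]) and senone labels."""
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of each frame's reference senone
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: 3 frames, 4 senone classes, correct class has the largest logit
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.5]])
targets = np.array([0, 1, 3])
print(frame_cross_entropy(logits, targets))
```

The sequence-level criteria (Hessian-free sequence training in ISR2016, LFMMI in MSR2016) then refine this frame-level model.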
Language Model Training
ISR2016:
- Vocabulary size is 85K
- A 4-gram model trained with modified Kneser-Ney smoothing
- Component models linearly interpolated, with weights chosen to optimize perplexity on a held-out set
- Entropy pruning
- LM rescoring: model M (a class-based exponential model) and a feed-forward neural network LM (NNLM)
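The interpolation step can be sketched as a grid search for the mixture weight that minimizes held-out perplexity. The per-word probabilities below are invented; in practice they come from the component N-gram models:

```python
import numpy as np

# Per-word probabilities assigned by two component LMs on a held-out set
# (toy numbers standing in for the real component models).
p_model_a = np.array([0.20, 0.05, 0.10, 0.30, 0.02])
p_model_b = np.array([0.10, 0.15, 0.08, 0.05, 0.12])

best_lam, best_ppl = None, float("inf")
for lam in np.linspace(0.0, 1.0, 101):        # grid search over the mixture weight
    p_mix = lam * p_model_a + (1 - lam) * p_model_b
    ppl = np.exp(-np.mean(np.log(p_mix)))     # perplexity on the held-out words
    if ppl < best_ppl:
        best_lam, best_ppl = lam, ppl

print(best_lam, best_ppl)
```

The interpolated model is never worse on the held-out set than either component alone, since the weights 0 and 1 are in the search grid.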
MSR2016: an initial decoding is done with a WFST decoder. The first-pass N-gram language model is trained and pruned with the SRILM toolkit; it has approximately 15.9 million bigrams, trigrams, and 4-grams, a vocabulary of 30,500 words, and gives a perplexity of 54 on RT-03 speech transcripts. The initial decoding produces a lattice with the pronunciation variants marked, from which 500-best lists are generated for rescoring purposes. Subsequent N-best rescoring uses an unpruned LM comprising 145 million N-grams. All N-gram LMs are estimated with a maximum entropy criterion.
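The N-best rescoring step can be sketched in miniature. Everything here is invented for illustration (the hypotheses, scores, and `lm_weight`); the real system rescores 500 hypotheses per utterance against the 145M N-gram LM:

```python
# Hypothetical N-best entries: (hypothesis, acoustic log-score, first-pass LM log-score)
nbest = [
    ("i think so", -120.3, -14.2),
    ("i think sew", -119.8, -21.7),
    ("eye think so", -121.0, -19.5),
]

def rescore(nbest, big_lm_logprob, lm_weight=12.0):
    """Replace the first-pass LM score with the larger LM's score and
    re-rank by the weighted sum of acoustic and LM scores."""
    rescored = [(hyp, ac + lm_weight * big_lm_logprob(hyp)) for hyp, ac, _ in nbest]
    return max(rescored, key=lambda x: x[1])

# Stand-in for the unpruned LM: assigns higher log-probability to the sensible words.
scores = {"i think so": -3.1, "i think sew": -9.8, "eye think so": -8.4}
best = rescore(nbest, lambda h: scores[h])
print(best[0])  # "i think so"
```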
The CUED-RNNLM toolkit is used to train and score RNN language models: a forward-predicting RNNLM and a backward RNNLM.
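One common way to use both directions, sketched here with invented per-word log-probabilities, is to interpolate the sentence scores from the forward and backward models:

```python
import math

# Hypothetical per-word log-probabilities for one hypothesis: the forward RNNLM
# predicts left-to-right, the backward RNNLM predicts the reversed word order.
forward_logps  = [-1.2, -0.8, -2.1, -0.5]
backward_logps = [-1.0, -1.1, -1.8, -0.6]

# Interpolate the two sentence-level scores with equal weight (an assumption;
# the real weights would be tuned on held-out data).
fwd_score = sum(forward_logps)
bwd_score = sum(backward_logps)
combined = 0.5 * fwd_score + 0.5 * bwd_score
print(combined)
```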