Speech Recognition System Comparison: Microsoft vs IBM 2016

Just a quick comparison between two very interesting articles.

This article compares the IBM 2016 speech recognition system (article: The IBM 2016 English Conversational Telephone Speech Recognition System by G. Saon et al., a.k.a. ISR2016) and the Microsoft 2016 speech recognition system (article: The Microsoft 2016 Conversational Speech Recognition System by W. Xiong et al., a.k.a. MSR2016). I have also previously written about the Microsoft paper here.

Data Extraction and Preprocessing

Data Extraction: the same in both systems, a 25 ms analysis frame with a 10 ms frame shift.
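Those two numbers translate directly into framing code. A minimal NumPy sketch, assuming 8 kHz telephone audio (the sample rate is my assumption, standard for CTS but not stated above):

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Slice a waveform into overlapping analysis frames.

    25 ms frames with a 10 ms shift, as both papers use; the 8 kHz
    sample rate is an illustrative assumption.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 200 samples
    shift = int(sample_rate * shift_ms / 1000)       # 80 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(8000))  # 1 second of audio
print(frames.shape)  # → (98, 200)
```

One second of audio yields 98 frames of 200 samples each, from which the per-frame features below are computed.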

Data Processing:

MSR2016: log-filterbank features are extracted


Speaker Adaptation Model:

MSR2016: 100-dimensional i-vectors

ISR2016: 100-dimensional i-vectors, VTL, PLP, etc.
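In both systems the utterance-level i-vector is appended to every frame's acoustic features before they reach the network. A toy NumPy sketch (the 40-dimensional filterbank size is an illustrative assumption, not from either paper):

```python
import numpy as np

# Hypothetical shapes: 40-dim log-filterbank frames plus one 100-dim
# i-vector that is constant across the whole utterance (per speaker).
frames = np.random.randn(98, 40)   # (time, feature)
ivector = np.random.randn(100)     # speaker embedding

# Append the same i-vector to every frame before feeding the network.
adapted = np.concatenate([frames, np.tile(ivector, (len(frames), 1))],
                         axis=1)
print(adapted.shape)  # → (98, 140)
```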

Data Source

ISR2016 Acoustic Training Data:

  • Switchboard-1: 262 hours
  • Fisher: 1698 hours
  • CallHome: 15 hours

MSR2016 Acoustic Training Data:

  • Switchboard-1 Release 2 (LDC97S62)
  • Fisher English Training Speech Part 1 Speech (LDC2004S13)
  • Fisher English Training Part 2, Speech (LDC2005S13)
  • 2002 Rich Transcription Broadcast News and Conversational Telephone Speech (LDC2004S11)
  • NIST Meeting Pilot Corpus Speech (LDC2004S09)

MSR2016 Language Training Data: CTS transcripts from the DARPA EARS program:

  • Switchboard (3M words),
  • BBN Switchboard-2 transcripts (850k),
  • Fisher (21M),
  • English CallHome (200k),
  • the University of Washington conversational Web corpus (191M).

Testing Data:

ISR2016 - Hub5 2000:

  • Switchboard data: 2.1 hours, 21.4K words, 40 speakers
  • CallHome data: 1.6 hours, 21.6K words, 40 speakers

MSR2016 appears to use the same CallHome and Switchboard testing data, but this is not specified in the article.

Acoustic Model Training

ISR2016 Acoustic Models:

  • Recurrent nets with maxout activations, trained with Hessian-free sequence-discriminative training
  • Very deep convolutional networks, similar to VGG: cross-entropy training + NAG (Nesterov accelerated gradient)
  • Bidirectional LSTM

MSR2016 Acoustic Models:

  • CNN variant: VGG
  • CNN variant: Residual Net
  • CNN variant: LACE (layer-wise context expansion with attention)
  • Bidirectional LSTM

MSR2016 uses cross-entropy training plus lattice-free maximum mutual information (LFMMI) training.
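Both systems start from frame-level cross-entropy training against senone targets before the sequence-level stage (Hessian-free sequence training for ISR2016, LFMMI for MSR2016). A minimal NumPy sketch of that frame-level objective, using a toy 4-class setup (real systems have thousands of senones):

```python
import numpy as np

def frame_cross_entropy(logits, targets):
    """Average frame-level cross-entropy against hard senone targets.

    Shapes and the 4-class toy setup are illustrative only.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of each frame's target class.
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1]])
targets = np.array([0, 1])
print(round(frame_cross_entropy(logits, targets), 3))
```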

Language Model Training

ISR2016:

  • vocabulary size is 85K
  • a 4-gram model trained with modified Kneser-Ney smoothing
  • component models linearly interpolated, with weights chosen to optimize perplexity on a held-out set
  • entropy pruning
  • LM rescoring: model M (a class-based exponential model) and a feed-forward neural network LM (NNLM)
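The interpolation step above can be sketched in a few lines: mix the component LMs' word probabilities linearly and keep the weight with the lowest held-out perplexity. The probabilities and the grid search below are toy stand-ins, not values or the optimizer from the paper:

```python
import math

def perplexity(word_probs):
    """Perplexity of a held-out set given per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

def interpolate(p1, p2, lam):
    """Linearly interpolate two component LMs word by word."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

# Toy per-word held-out probabilities from two hypothetical component LMs.
lm_a = [0.20, 0.05, 0.10, 0.30]
lm_b = [0.10, 0.15, 0.05, 0.10]

# Pick the weight that minimizes held-out perplexity (a simple grid
# search stands in for whatever optimizer the paper actually used).
best_ppl, best_lam = min(
    (perplexity(interpolate(lm_a, lm_b, lam)), lam)
    for lam in [i / 10 for i in range(11)]
)
print(best_lam, round(best_ppl, 2))  # best_lam → 0.9
```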


MSR2016 (as described in the paper):

An initial decoding is done with a WFST decoder, using the architecture described in [31]. An N-gram language model trained and pruned with the SRILM toolkit [32] is used. The first-pass LM has approximately 15.9 million bigrams, trigrams, and 4-grams, a vocabulary of 30,500 words, and gives a perplexity of 54 on RT-03 speech transcripts. The initial decoding produces a lattice with the pronunciation variants marked, from which 500-best lists are generated for rescoring purposes. Subsequent N-best rescoring uses an unpruned LM comprising 145 million N-grams. All N-gram LMs were estimated by a maximum entropy criterion as described in [33].
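The 500-best rescoring step boils down to re-ranking hypotheses after adding a stronger LM's score to the first-pass acoustic score. A toy sketch (the hypotheses, scores, and the 0.5 LM weight are all illustrative assumptions):

```python
def rescore_nbest(nbest, lm_weight=0.5):
    """Re-rank an N-best list by combined acoustic + LM log score.

    `nbest` holds (text, acoustic_logprob, lm_logprob) triples; the
    LM weight of 0.5 is an illustrative assumption, not a value from
    either paper.
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * h[2])[0]

nbest = [
    ("i think so",   -120.0, -18.0),
    ("i think sew",  -119.0, -25.0),  # better acoustically, worse LM
    ("eye think so", -125.0, -30.0),
]
print(rescore_nbest(nbest))  # → i think so
```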

MSR2016 also uses the CUED-RNNLM toolkit to train and score several RNN LMs: a forward-predicting RNNLM and a backward RNNLM.
