An In-Depth Look at iFlytek's Latest Speech Recognition System and Framework

Lei Feng Network note: The lead author, Wei Si, PhD, is a deputy director of the iFlytek Research Institute. His main research areas are speech signal processing, pattern recognition, and artificial intelligence, and he has produced a number of industry-leading research results. Zhang Shiliang, Pan Jia, and Zhang Zhijiang are researchers at the iFlytek Research Institute. Liu Cong and Wang Zhiguo are deputy directors of the iFlytek Research Institute. Editor: Zhou Jianding.

Speech is the most natural and convenient way for people to communicate, and spoken human-machine communication and interaction has long been an important research area. Automatic speech recognition (ASR) is a key technology for human-computer interaction; the problem it solves is letting computers “understand” human speech by converting it into text. After decades of development, ASR has achieved significant results. In recent years, more and more intelligent applications built on speech recognition have entered everyday life; Apple’s Siri, Microsoft’s Cortana, and iFlytek’s voice input method and Lingxi are typical examples. This article describes, from iFlytek’s perspective, how speech recognition has developed and where the technology currently stands.

We first briefly review the history of speech recognition, then introduce the current mainstream neural-network-based speech recognition systems, and finally focus on the latest advances in iFlytek's speech recognition system.

A review of key breakthroughs in speech recognition

Speech recognition research began in the 1950s, with Bell Labs as the main research institution. Early systems were simple isolated-word recognizers; in 1952, for example, Bell Labs built a ten-word recognition system. In the 1960s, Raj Reddy at CMU began pioneering work on continuous speech recognition. Progress in this period was very slow, so much so that in 1969 John Pierce of Bell Labs, in an open letter, compared speech recognition to “turning water into gasoline, extracting gold from the sea, or curing cancer”: something almost impossible to achieve. In the 1970s, large gains in computer performance, together with progress in basic pattern recognition research such as the LBG codebook generation algorithm and linear predictive coding (LPC), further advanced speech recognition.

In this period, the United States Defense Advanced Research Projects Agency (DARPA) entered the speech field and set up a speech understanding research program that included leading institutions such as BBN, CMU, SRI, and IBM. IBM and Bell Labs launched real-time, PC-based isolated-word recognition systems. The 1980s were a period of rapid development for speech recognition; the two key technologies were the maturing of hidden Markov model (HMM) theory and its applications, and the application of N-gram language models.

In this period, speech recognition moved from isolated-word systems to large-vocabulary continuous speech recognition. For example, the SPHINX system developed by Kai-Fu Lee was the first “speaker-independent continuous speech recognition system” built on statistical principles. Its core framework uses hidden Markov models to model the temporal structure of speech and Gaussian mixture models (GMMs) to model the observation probabilities of speech. For a long time afterward, the GMM-HMM framework remained the dominant framework for speech recognition systems. The 1990s were a period of maturation; the main advances were discriminative training criteria and model adaptation methods for acoustic models. The HTK toolkit released by the Cambridge speech recognition group in this period also did a great deal to advance the field. After that, progress slowed: the mainstream GMM-HMM framework stabilized, but recognition accuracy was still far from practical, and research hit a bottleneck.

The key breakthrough began in 2006, when Hinton proposed deep belief networks (DBNs). This revived research on deep neural networks (DNNs) and set off the deep learning wave. In 2009, Hinton and his student Mohamed applied deep neural networks to acoustic modeling and succeeded on TIMIT, a small-vocabulary continuous speech database. In 2011, Dong Yu and Li Deng at Microsoft Research published work applying deep neural networks to speech recognition and achieved a breakthrough on large-vocabulary continuous speech recognition tasks. With that, the dominance of the GMM-HMM framework was broken, and many researchers turned to DNN-HMM systems.

Neural-network-based speech recognition systems

Neural-network-based speech recognition systems use the framework shown in Figure 1. Compared with a conventional GMM-HMM system, the biggest change is that a deep neural network replaces the GMM as the model of the observation probability of speech. The first mainstream choice was the simplest deep network, the feedforward deep neural network (FDNN). Compared with a GMM, a DNN has three advantages: 1. using a DNN to estimate the posterior probabilities of the HMM states requires no assumption about the distribution of the speech data; 2. the DNN input can combine many kinds of features, discrete or continuous; 3. the DNN can exploit the structural information contained in adjacent speech frames.

Figure 1 Framework of a neural-network-based speech recognition system
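The role of the DNN in this hybrid setup can be illustrated with a small sketch. The network outputs state posteriors p(s|x), while HMM decoding works with observation likelihoods p(x|s); a common way to bridge the two is to divide each posterior by the state prior, which follows from Bayes' rule up to a per-frame constant. The function name and the toy numbers below are illustrative assumptions, not details of iFlytek's system.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-10):
    """Turn DNN state posteriors into scaled likelihoods for HMM decoding.

    log_posteriors: (T, S) array of log p(s | x_t) produced by the network.
    state_priors:   (S,) array of p(s), e.g. relative state frequencies
                    counted on the training alignment.
    Returns a (T, S) array equal to log p(x_t | s) up to a per-frame constant.
    """
    log_priors = np.log(np.maximum(state_priors, floor))
    return log_posteriors - log_priors  # broadcasts over the T frames

# Toy example: 2 frames, 3 HMM states (all numbers are made up).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
priors = np.array([0.5, 0.3, 0.2])
scores = scaled_log_likelihoods(np.log(posteriors), priors)
```

Rare states get a boost from this division, which compensates for the network's tendency to favour states that were frequent in training.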

Speech recognition first preprocesses the waveform: windowing, framing, and feature extraction. When training a GMM, the input can only be the features of a single frame, whereas a DNN can take several spliced frames as input, which is a key reason a DNN achieves a large performance gain over a GMM. However, speech is a complex time-varying signal with strong correlations between frames. This shows up mainly as coarticulation: several words before and after the one being spoken often influence it, so the dependency between speech frames is long-term. Splicing frames lets the network learn a certain amount of context. But because the DNN input window has a fixed length, the network learns a fixed input-to-output mapping, which makes the DNN weak at modeling long-term temporal information.
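The frame splicing (stacking each frame with its neighbours) described above can be sketched as follows. The context widths, the edge-padding strategy, and the function name are assumptions chosen for illustration; the article does not say which values iFlytek uses.

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Stack each frame with its neighbours to form a fixed-length DNN input.

    features: (T, D) array of per-frame acoustic features.
    Returns a (T, (left + 1 + right) * D) array. Edges are padded by
    repeating the first and last frame.
    """
    num_frames = features.shape[0]
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),
        features,
        np.repeat(features[-1:], right, axis=0),
    ])
    # Window k holds, for every frame t, the frame at position t - left + k.
    windows = [padded[offset:offset + num_frames]
               for offset in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2-dim features
spliced = splice_frames(feats, left=1, right=1)   # shape (6, 6)
```

The window length is fixed once `left` and `right` are chosen, which is exactly the limitation the paragraph points out: context outside the window is invisible to the network.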

Figure 2 sketch of DNN and RNN

Given the long-term correlations in speech, a natural idea is to use a neural network with stronger long-term modeling ability. Recurrent neural networks (RNNs) have therefore gradually replaced conventional DNNs as the mainstream modeling approach in recent years. As Figure 2 shows, compared with a feedforward network, an RNN adds a feedback connection on the hidden layer: part of the hidden layer's input at the current time step is the hidden layer's output from the previous time step. Through this recurrent connection the RNN can see information from all earlier time steps, which gives it memory. These properties make RNNs well suited to modeling time series. The introduction of long short-term memory (LSTM) units solved problems such as the vanishing gradient in simple RNNs and made the RNN framework practical for speech recognition, with results that surpass DNNs; it is now used in some of the more advanced industrial speech systems. Researchers have also made further improvements on top of RNNs. Figure 3 shows the current mainstream RNN-based acoustic model, which has two main parts: a deep bidirectional RNN and a connectionist temporal classification (CTC) output layer. When judging the current speech frame, the bidirectional RNN uses not only past speech information but also future speech information, which allows more accurate decisions. CTC removes the need for frame-level labels during training and enables effective “end-to-end” training.

Figure 3 mainstream speech recognition system framework based on RNN–CTC
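To make the CTC output layer more concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the most likely symbol per frame, merge consecutive repeats, and drop the blank symbol. This is why no frame-level labels are needed; many frame-level paths map to the same label sequence. The blank index, the symbol table, and the posterior values are assumptions for illustration, and production systems typically use beam search with a language model instead.

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)

def ctc_greedy_decode(frame_posteriors, blank=BLANK):
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    best_path = np.argmax(frame_posteriors, axis=1)
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(int(label))
        previous = label
    return decoded

# 6 frames over the symbols {0: blank, 1: "a", 2: "b"}; values are made up.
posteriors = np.array([
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # a (a new "a", because a blank came in between)
    [0.10, 0.20, 0.70],  # b
    [0.80, 0.10, 0.10],  # blank
])
labels = ctc_greedy_decode(posteriors)
```

The blank symbol is what allows the same label to appear twice in a row in the output, as the third and fourth frames show.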

iFlytek's latest speech recognition system

Many academic and industrial groups, in China and abroad, have mastered the RNN model and studied one or more of the techniques above. However, while each technique usually gives good results when studied on its own, combining them runs into problems. For example, the gain from combining several techniques is smaller than the sum of their individual gains. As another example, the biggest obstacle to deploying the current mainstream bidirectional RNN systems is that, in theory, the complete utterance must be available before future information can be used. This introduces a large delay, so such systems can only be used for offline tasks; for real-time voice interaction such as voice input, a bidirectional RNN is clearly unsuitable. Furthermore, because an RNN fits contextual correlations so strongly, it is more prone to overfitting than a DNN, and a locally non-robust region in the training data can more easily introduce extra recognition errors. Finally, because an RNN has a more complex structure than a DNN, training RNN models on massive data is more challenging.

iFlytek's FSMN speech recognition framework

To address these problems, iFlytek developed a new framework called the feed-forward sequential memory network (FSMN). The techniques above fit together well within this framework, and their gains add up. Notably, FSMN uses a non-recurrent feedforward structure and needs only a 180 ms delay to match the performance of a bidirectional RNN.

Figure 4(a) shows the structure of FSMN. Compared with a conventional DNN, we add a module called a “memory block” next to the hidden layer; it stores the history and future information that is useful for judging the current speech frame. Figure 4(b) shows the unrolled temporal structure of a bidirectional FSMN whose memory block stores one frame of history and one frame of future information (in a real task, the lengths of history and future memory can be adjusted as needed). As the figure shows, unlike a conventional RNN, whose memory relies on a recurrent feedback connection, the FSMN memory block is implemented with a feedforward structure. This feedforward structure has two major advantages:

First, when a bidirectional FSMN memorizes future information, it is not bound by the constraint of a conventional bidirectional RNN, which must wait for the end of the speech input before it can judge the current frame. It only needs to wait for a limited number of future frames. As noted above, our bidirectional FSMN matches the performance of a bidirectional RNN with the delay held to 180 ms;

Second, as mentioned earlier, in a simple RNN the gradient is propagated backward step by step through time during training, so it decays exponentially and vanishes. As a result, an RNN that in theory has infinitely long memory can in practice remember only a limited amount of information. FSMN, by contrast, is a memory network built on a feedforward structure unrolled in time. During training, the gradient flows back to each time step along the connection weights between the memory block and the hidden layer in Figure 4. These weights determine how much the input at each time step influences the judgment of the current speech frame, and the attenuation of the gradient on its way to any time step is constant and also trainable. FSMN thus solves the vanishing gradient problem of RNNs in a simpler way and gains long-term memory comparable to that of an LSTM.

In addition, in terms of training efficiency and stability: because FSMN is based entirely on a feedforward network, it avoids the computation wasted in RNN mini-batch training when sentences of different lengths must be zero-padded, and the feedforward structure parallelizes better, making the fullest use of GPU compute. Looking at the distribution of the memory-block weights across time steps in a converged bidirectional FSMN, we observe that the weight at the current time step is generally the largest and decays gradually to the left and right, as expected. Furthermore, FSMN can be combined with the CTC criterion to build an “end-to-end” speech recognition model.

Figure 4 FSMN structure
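The memory block can be read as a learnable tapped-delay (FIR) filter over the hidden activations. The sketch below illustrates that idea with scalar tap weights; the tap values, the zero treatment of frames outside the utterance, and the function name are assumptions for illustration and are not taken from iFlytek's implementation.

```python
import numpy as np

def fsmn_memory(hidden, history_taps, future_taps):
    """Bidirectional FSMN-style memory block as a feedforward filter over time.

    hidden:       (T, D) hidden-layer activations for one utterance.
    history_taps: weights for frames t, t-1, ..., t-N1 (index 0 = current frame).
    future_taps:  weights for frames t+1, ..., t+N2.
    Frames outside the utterance are treated as zeros. The taps are assumed
    to be shorter than the utterance.
    """
    num_frames = hidden.shape[0]
    memory = np.zeros(hidden.shape, dtype=float)
    for i, weight in enumerate(history_taps):           # look back i frames
        memory[i:] += weight * hidden[:num_frames - i]
    for j, weight in enumerate(future_taps, start=1):   # look ahead j frames
        memory[:num_frames - j] += weight * hidden[j:]
    return memory

hidden = np.array([[1.0], [2.0], [3.0], [4.0]])  # 4 frames, 1 hidden unit
memory = fsmn_memory(hidden, history_taps=[1.0, 0.5], future_taps=[0.25])
```

Because there is no recurrence, every frame can be computed in parallel, and the output for frame t is ready as soon as the last future tap is available. For a single memory block the look-ahead is therefore the number of future taps times the frame shift; how many taps and layers add up to the 180 ms quoted in the article is not stated there.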

iFlytek's DFCNN speech recognition framework

The success of FSMN gave us a valuable insight: modeling the long-term correlations in speech does not require looking at the whole sentence, nor does it necessarily require a recurrent structure. As long as a sufficiently long span of speech context is well represented, it gives the decision for the current frame enough help, and a convolutional neural network (CNN) can do this too.

CNNs were first used in speech recognition systems as early as 2012, and many researchers have worked on CNN-based speech recognition ever since, but without a major breakthrough. The main reason is that they did not break away from the conventional feedforward mindset of using a fixed-length window of spliced frames as input, so the network could not see enough speech context. Another weakness is that they treated the CNN only as a feature extractor and used very few convolutional layers, usually just one or two, which gives the network very limited expressive power. To solve these problems, and drawing on our experience developing FSMN, we developed a speech recognition framework called the deep fully convolutional neural network (DFCNN). It uses a large number of convolutional layers to model the entire utterance directly and better captures the long-term correlations in speech.

The structure of DFCNN is shown in Figure 5. It converts an utterance directly into an image and uses that as input: a Fourier transform is first applied to each speech frame, and time and frequency then serve as the two dimensions of the image. A combination of many convolutional layers and pooling layers then models the whole utterance, and the output units correspond directly to the final recognition results, such as syllables or characters. DFCNN works much like an expert phonetician who can tell what was said by “reading” the spectrogram. Many readers may at first think this sounds like science fiction, but after the analysis below we believe you will find the architecture quite natural.

Figure 5 DFCNN diagram
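The "utterance as an image" input can be sketched in a few lines: cut the waveform into overlapping frames, apply a window and a Fourier transform to each, and stack the log magnitudes so that one axis is time and the other is frequency. The 16 kHz sample rate, 25 ms frame length, 10 ms frame shift, and Hamming window are assumptions for illustration; the article does not give DFCNN's actual front-end settings.

```python
import numpy as np

def log_spectrogram(waveform, frame_length=400, frame_shift=160, eps=1e-10):
    """Return a (num_frames, frame_length // 2 + 1) log-magnitude spectrogram."""
    num_frames = 1 + (len(waveform) - frame_length) // frame_shift
    window = np.hamming(frame_length)
    frames = np.stack([
        waveform[start:start + frame_length] * window
        for start in range(0, num_frames * frame_shift, frame_shift)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # one FFT per frame
    return np.log(magnitude + eps)

# One second of a 1 kHz test tone sampled at 16 kHz (synthetic data).
sample_rate = 16000
time_axis = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * time_axis)
image = log_spectrogram(tone)  # rows are time, columns are frequency bins
```

With these settings each frequency bin covers 40 Hz, so the tone shows up as a bright horizontal line at bin 25. No filter bank is applied, which keeps the full frequency resolution that the following paragraph argues for.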

First, on the input side, conventional speech features apply various hand-designed filter banks after the Fourier transform, which loses information in the frequency domain, most noticeably in the high-frequency region. To limit computation, conventional features must also use a very large frame shift, which loses information in the time domain; this becomes more pronounced when the speaker talks fast. DFCNN takes the spectrogram directly as input, so it has a natural advantage over frameworks that use conventional speech features. Second, in terms of model structure, DFCNN differs from the way CNNs have conventionally been used in speech recognition. It borrows the network configurations that work best in image recognition: every convolutional layer uses small 3×3 kernels, and a pooling layer follows several convolutional layers, which greatly increases the expressive power of the CNN. By stacking many such convolution-pooling groups, DFCNN can see very long spans of history and future information, which lets it represent the long-term correlations of speech very well, and it is more robust than an RNN structure. Finally, on the output side, DFCNN combines well with the recently popular CTC approach to train the whole model end to end, and special structures such as its pooling layers make that end-to-end training more stable.
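How quickly stacked 3×3 convolutions and pooling layers widen the visible context can be worked out with the standard receptive-field recurrence: each layer adds (kernel − 1) × (current stride product) input frames, and each stride multiplies that product. The block layout below (two convolutions followed by one 2×2 pooling layer) is a hypothetical configuration for illustration; the article does not list DFCNN's exact layers.

```python
def receptive_field(layers):
    """Number of input frames seen by one output unit along the time axis.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # widen by the kernel, scaled by the stride so far
        jump *= stride
    return field

conv = (3, 1)  # 3x3 convolution, stride 1
pool = (2, 2)  # 2x2 pooling, stride 2
one_block = [conv, conv, pool]  # hypothetical convolution-pooling group

for num_blocks in range(1, 5):
    frames = receptive_field(one_block * num_blocks)
    print(num_blocks, "blocks ->", frames, "input frames")
```

Under these assumptions the context grows from 6 frames after one group to 76 frames after four, so depth alone, without recurrence, lets each output unit draw on a long stretch of past and future speech.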

Combined with several other techniques, iFlytek's DFCNN framework achieved a 15% performance improvement over the bidirectional RNN-CTC system, the best speech recognition framework in the industry, on a Chinese voice-message dictation task with thousands of hours of data. Combined with iFlytek's HPC platform and multi-GPU parallel acceleration, its training speed is also better than that of the bidirectional RNN-CTC system. DFCNN opens up new ground for speech recognition, and we will build further research on it. For example, bidirectional RNNs and DFCNN can both represent long spans of history and future information; whether the two representations are complementary is a question worth exploring.

Deep learning platform

iFlytek's research on the speech recognition systems above has produced very good results, but iFlytek is also aware that these deep neural networks need large amounts of data and computation to train. For example, training on 20,000 hours of speech data takes about 12,000 PFLOP of computation; on a single E5-2697 v4 CPU that would take about 116 days, which is unacceptable for speech recognition research. iFlytek therefore analyzed the computational characteristics of the algorithms and built a fast deep learning computing platform, the Deep Learning Platform.
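The figures quoted above can be checked with a few lines of arithmetic. The 12,000 PFLOP and 116 days come from the article; the cluster in the second half (64 devices sustaining 3 TFLOP/s each with perfect scaling) is a purely hypothetical example, not a description of iFlytek's hardware.

```python
total_flop = 12000e15        # about 12,000 PFLOP, from the article
cpu_days = 116               # single E5-2697 v4, from the article
seconds_per_day = 24 * 3600

# Sustained throughput implied by the two quoted numbers.
implied_cpu_flops = total_flop / (cpu_days * seconds_per_day)

def training_days(total_flop, sustained_flops):
    """Days needed to execute total_flop at a given sustained FLOP/s."""
    return total_flop / sustained_flops / seconds_per_day

# Hypothetical cluster: 64 devices at 3 TFLOP/s sustained each, ideal scaling.
hypothetical_days = training_days(total_flop, 64 * 3e12)

print(f"implied CPU throughput: {implied_cpu_flops / 1e12:.2f} TFLOP/s")
print(f"hypothetical cluster:   {hypothetical_days:.2f} days")
```

The quoted numbers imply roughly 1.2 TFLOP/s sustained on the CPU, and the hypothetical cluster would bring the same job down to under a day, which is the kind of turnaround that makes iterative research practical.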

Figure 6 deep learning platform

As shown in Figure 6, the platform has four parts. The first is the underlying infrastructure: the file system, network connections, and computing resources are chosen according to characteristics such as the volume of speech data, access bandwidth, number of accesses, amount of computation, and computation frequency. The file system is a parallel distributed file system, the network uses gigabit connections, and the computing resources are a GPU cluster housed in a dedicated, purpose-built machine room. On top of this sits the core computing engine layer, used for training and computing various models, with engines suited to CNNs, to DNNs, to FSMN/DFCNN, and so on. Because the computing engines and the infrastructure are still quite abstract for users, iFlytek lowered the barrier to use by developing platform services for resource scheduling and engine invocation; this work greatly reduces the difficulty researchers at the institute face in using cluster resources and speeds up their research. Built on these three parts, iFlytek's deep learning platform supports all of the institute's research work, such as speech recognition, speech synthesis, and handwriting recognition.

iFlytek uses GPUs as its main computing units and, based on the characteristics of the algorithms, runs large numbers of GPUs in parallel. In particular, iFlytek's parallel computing framework, which combines blockwise model-update filtering (BMUF) with elastic averaging stochastic gradient descent (EASGD), achieves near-linear speedup on 64 GPUs. This greatly improves training efficiency and accelerates deep learning research.
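As a rough illustration of the BMUF half of that framework, the sketch below shows one block-synchronisation step in its basic form: average the models the workers produced on their data blocks, treat the difference from the previous global model as a "block gradient", and smooth it with a momentum term before updating the global model. The function name, the hyperparameter values, and the toy weights are assumptions; how iFlytek fuses this with EASGD is not described in the article and is not attempted here.

```python
import numpy as np

def bmuf_update(global_weights, worker_weights, prev_delta,
                block_momentum=0.9, block_lr=1.0):
    """One block-synchronisation step of blockwise model-update filtering.

    global_weights: model every worker started this block from.
    worker_weights: list of models, one per worker, after local training.
    prev_delta:     filtered update applied at the previous synchronisation.
    Returns the new global model and the new filtered update.
    """
    averaged = np.mean(worker_weights, axis=0)
    block_gradient = averaged - global_weights
    delta = block_momentum * prev_delta + block_lr * block_gradient
    return global_weights + delta, delta

# Toy run with 2 workers and a 2-parameter model (numbers are made up).
global_w = np.zeros(2)
delta = np.zeros(2)
workers = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]
global_w, delta = bmuf_update(global_w, workers, delta)
```

With `block_momentum=0` and `block_lr=1` the step reduces to plain model averaging; the momentum term is what lets the update keep its effective step size as more workers are added.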

Closing remarks

Looking back at the history of speech recognition and at the latest advances in iFlytek's speech recognition system, we can see that technical breakthroughs are always hard-won and slow; what matters is to persist and keep working at it. Although deep neural networks have greatly improved speech recognition performance in recent years, we should not put blind faith in existing technology. One day a new technology will replace it, and iFlytek hopes to achieve further breakthroughs in speech recognition through continuous technical innovation.

Lei Feng Network note: This article is reprinted by Lei Feng Network with authorization from CSDN. To reprint it, please contact the original author.


Originally published at itedbaker.wordpress.com on October 12, 2016.