The challenges of speech recognition technology

Anryze’s Speech Recognition System is cutting-edge technology. We like to say that we are creating a system that will help computers understand people just as people understand each other.

In developing Anryze speech-to-text (STT), we have applied a range of technologies and solutions. These include:

  • Wavelet transformation. This allows us to improve recognition by reducing loss of data.
  • Recognition using a fractal code descriptor, which allows the signal to be reconstructed at any sampling frequency.
  • Multi-Objective Learning for Deep Neural Network Based Speech Enhancement. This is used to construct extended speech signals.
  • Invariant Representations, to increase robustness to acoustic variability.
  • Highway Connections in Convolutional Recurrent Deep Neural Networks. This is an extension of the CLDNN model that works by integrating connections, providing a direct stream of information from the cells of the lower layers to the cells of the upper layers.
  • Recognition with the use of distributed computing capacity. This allows us to reduce equipment costs and remain resilient to changing loads.
  • Acoustic models based on long short-term memory. This is a recurrent neural network architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs.
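
As an illustration of the last item above, the following is a minimal sketch of an LSTM-based acoustic model. The framework (PyTorch), the layer sizes and the number of senone classes are illustrative assumptions, not a description of Anryze's production model.

```python
# Minimal sketch of an LSTM-based acoustic model (illustrative only).
# Assumptions: 39-dimensional feature vectors per frame, an arbitrary
# number of senone classes, and PyTorch as the framework.
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, n_features=39, hidden_size=256, n_layers=3, n_senones=2000):
        super().__init__()
        # Stacked LSTM: gating mitigates the vanishing/exploding gradients
        # that affect plain RNNs on long acoustic sequences.
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=n_layers,
                            batch_first=True)
        self.output = nn.Linear(hidden_size, n_senones)

    def forward(self, frames):
        # frames: (batch, time, n_features)
        hidden, _ = self.lstm(frames)
        # One senone log-posterior distribution per frame.
        return self.output(hidden).log_softmax(dim=-1)

# Example: score 100 frames of 39-dimensional features.
model = LSTMAcousticModel()
posteriors = model(torch.randn(1, 100, 39))
print(posteriors.shape)  # torch.Size([1, 100, 2000])
```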

Speech is a complex phenomenon. People rarely understand how it is produced or perceived. The naive perception is often that speech is built of words, and that each word consists of phones (a phone is a distinct speech sound or gesture, a concept separate from the phoneme). The reality is very different. Speech is a dynamic process without clearly distinguished elements. A helpful way to see this is to open a recording of speech in a sound editor and look at and listen to it.

Any description of speech is to some degree probabilistic: there are no distinct boundaries between units, or between words, and speech-to-text transcription is never 100% correct. This idea is rather unusual for software developers, who typically work with deterministic systems, and it creates a number of issues specific to speech technology.

Speech structure

Speech is a continuous audio stream in which relatively stable states mix with dynamically changing ones. In this sequence of states, one can define more or less similar classes of sounds, or phones. Words are understood to be built of phones, but this is incorrect. The acoustic properties of a waveform corresponding to a phone vary markedly depending on many factors — context, speaker, style of speech, and so on. So-called ‘coarticulation’ makes phones sound very different from their ‘canonical’ representation.

Furthermore, since transitions between words are more informative than stable regions, developers often talk about diphones — parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units — different substates of a phone. Often, three or more regions of a different nature can be found within a single phone.

Recognition process

The standard way in which Anryze recognises speech is as follows. We take the waveform, split it into segments (‘utterances’) on the basis of the silences between them, and then try to interpret what is being said in each utterance. To do that we want to take all possible combinations of words and try to match them with the audio, selecting the best matching combination.
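
As a rough illustration of this segmentation step, the sketch below splits a waveform into utterances wherever the short-term energy stays below a threshold for long enough. The frame length, energy threshold and minimum silence duration are illustrative assumptions, not Anryze's actual parameters.

```python
import numpy as np

def split_into_utterances(samples, sample_rate, frame_ms=10,
                          energy_threshold=1e-4, min_silence_frames=30):
    """Crude silence-based segmentation: an utterance is a run of frames
    whose energy exceeds the threshold, separated by long enough silences."""
    samples = np.asarray(samples, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    voiced = energies > energy_threshold

    utterances, start, silence = [], None, 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                utterances.append((start * frame_len, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        utterances.append((start * frame_len, n_frames * frame_len))
    return utterances  # list of (start_sample, end_sample) pairs
```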

There are several important things to know about this matching process. Firstly, there is the matter of features. Since the number of parameters in a raw waveform is large, we aim to reduce it. The numbers are usually calculated by dividing speech into frames. Then, for each frame (typically 10 milliseconds long) we extract 39 numbers that represent the speech within it. This is called the feature vector. How these numbers are generated is a subject of active research, but in a simple case they are derived from the spectrum.
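
A common way to obtain such a 39-number feature vector is to take 13 mel-frequency cepstral coefficients per frame together with their first and second derivatives. The sketch below does this with the librosa library; the 16 kHz sampling rate is an assumption, and the 10 ms hop length follows the frame length mentioned above.

```python
import librosa
import numpy as np

def extract_feature_vectors(path):
    """Return one 39-dimensional feature vector per ~10 ms frame:
    13 MFCCs plus their first and second time derivatives."""
    samples, sample_rate = librosa.load(path, sr=16000)
    hop = int(0.010 * sample_rate)                  # 10 ms frame step
    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate,
                                n_mfcc=13, hop_length=hop)
    delta = librosa.feature.delta(mfcc)             # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)   # second derivative
    features = np.vstack([mfcc, delta, delta2])     # shape: (39, n_frames)
    return features.T                               # shape: (n_frames, 39)
```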

Secondly, there is the matter of the model. A ‘model’ describes some mathematical function that gathers the common attributes of the spoken word. In practice, the acoustic model of a senone (very broadly, a senone is a phone considered in its wider context) is a Gaussian mixture of its three states; to put it simply, it is the most probable feature vector. The model used by almost all modern speech recognition systems is the Hidden Markov Model, or HMM. This is a generic model that describes a black-box communication channel, in which the process is described as a sequence of states that transition into one another with certain probabilities. The HMM is intended to describe any sequential process, like speech, and has proven effective at decoding it. Several questions arise from the concept of the model: how well does it fit in practice; can it be improved in terms of its internal problems; and to what extent is it adaptive to changing conditions.
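
To make the state-sequence idea concrete, here is a toy Viterbi decoder over an HMM, written with NumPy. The function itself is textbook; how the emission scores are produced (Gaussian mixtures or a neural network) is left outside the sketch, and none of this reflects Anryze's internal implementation.

```python
import numpy as np

def viterbi(log_emissions, log_transitions, log_initial):
    """Most probable state sequence given per-frame emission scores.
    log_emissions: (n_frames, n_states) log-likelihood of each state per frame.
    log_transitions: (n_states, n_states) log-probabilities of state changes."""
    n_frames, n_states = log_emissions.shape
    score = np.full((n_frames, n_states), -np.inf)
    backptr = np.zeros((n_frames, n_states), dtype=int)
    score[0] = log_initial + log_emissions[0]
    for t in range(1, n_frames):
        for s in range(n_states):
            candidates = score[t - 1] + log_transitions[:, s]
            backptr[t, s] = np.argmax(candidates)
            score[t, s] = candidates[backptr[t, s]] + log_emissions[t, s]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path))
```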

Thirdly comes the matter of the matching process itself. Since it would take too long to compare all feature vectors with all models, the search is optimised in various ways. At any point we maintain the best matching variants and extend them frame by frame to produce the best matching variants for the next frame.
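
A simple way to maintain only the best matching variants is beam pruning: extend every surviving hypothesis at each frame, then keep a fixed number of the highest-scoring ones. The sketch below is purely illustrative; the phone labels and scores are invented.

```python
import heapq

def advance_beam(hypotheses, frame_scores, beam_width=8):
    """Extend every surviving hypothesis with every candidate unit for the
    current frame, then keep only the best `beam_width` of them.
    hypotheses: list of (score, path) pairs; frame_scores: {unit: log_score}."""
    extended = [
        (score + unit_score, path + [unit])
        for score, path in hypotheses
        for unit, unit_score in frame_scores.items()
    ]
    # Keep only the highest-scoring variants for the next frame.
    return heapq.nlargest(beam_width, extended, key=lambda h: h[0])

# Illustrative use: start from an empty hypothesis and process two frames.
beam = [(0.0, [])]
beam = advance_beam(beam, {"ah": -1.2, "ax": -2.5, "eh": -3.1})
beam = advance_beam(beam, {"ah": -0.9, "t": -1.4, "d": -2.0})
print(beam[0])  # best-scoring variant after two frames
```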

Speech structure models

There are three models Anryze uses for matching in speech recognition. Firstly, an acoustic model contains the acoustic properties for each senone. There are context-independent models, which contain properties (the most probable feature vectors) for each phone, and context-dependent ones, which are built from senones.
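
For context-independent and context-dependent models alike, the score of a feature vector against a senone state is typically a Gaussian-mixture likelihood. The sketch below evaluates one such diagonal-covariance mixture; the number of components and the random parameters are placeholders, not values from any real model.

```python
import numpy as np

def log_gmm_likelihood(feature_vector, weights, means, variances):
    """Log-likelihood of one 39-dimensional feature vector under a
    diagonal-covariance Gaussian mixture (e.g. one mixture per senone state)."""
    # Per-component log densities, summed over the 39 feature dimensions.
    log_densities = -0.5 * np.sum(
        np.log(2 * np.pi * variances)
        + (feature_vector - means) ** 2 / variances,
        axis=1,
    )
    # Log-sum-exp over mixture components, weighted by the mixture weights.
    return float(np.logaddexp.reduce(np.log(weights) + log_densities))

# Illustrative senone state with 4 mixture components over 39 dimensions.
rng = np.random.default_rng(0)
weights = np.full(4, 0.25)
means = rng.normal(size=(4, 39))
variances = np.ones((4, 39))
print(log_gmm_likelihood(rng.normal(size=39), weights, means, variances))
```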

Secondly, a phonetic dictionary maps words to phones. This mapping is not very precise: for example, only two or three pronunciation variants are typically noted for each word, but it is practical enough most of the time. The dictionary is not the only way to map words to phones; the mapping could also be carried out by a more complex function learned with a machine-learning algorithm.
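
In its simplest form such a dictionary is just a lookup table from words to lists of phones, with a couple of variants per word. The words and pronunciations below are invented for illustration.

```python
# A toy phonetic dictionary; the entries are illustrative, not taken
# from any real lexicon.
PHONETIC_DICTIONARY = {
    "hello": [["hh", "ah", "l", "ow"], ["hh", "eh", "l", "ow"]],
    "world": [["w", "er", "l", "d"]],
}

def pronunciations(word):
    """Return the known pronunciation variants for a word (usually
    only two or three, as noted above)."""
    return PHONETIC_DICTIONARY.get(word.lower(), [])

print(pronunciations("hello"))  # [['hh', 'ah', 'l', 'ow'], ['hh', 'eh', 'l', 'ow']]
```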

Thirdly, a language model is used to restrict the word search. It defines which words can follow previously recognised words (remember that matching is a sequential process) and significantly restricts the matching process by stripping out words that are unlikely. The most commonly used language models are n-gram language models, which contain statistics of word sequences, and finite state language models, which define speech sequences by a finite state automaton, sometimes with weights. To reach a good accuracy rate, the language model must be very successful at restricting the search space; this means it must be very good at predicting the next word. A language model usually restricts the vocabulary considered to the words it contains, which poses an issue for name recognition. To deal with this, a language model can contain smaller chunks such as subwords or even phones. Note that search space restriction is usually worse in this case, and the corresponding recognition accuracies are lower than with a word-based language model.
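
As a small illustration of the n-gram idea, the sketch below estimates bigram probabilities from a toy corpus with add-one smoothing. A real language model would be trained on far more text and use more sophisticated smoothing; the sentences here are invented.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(word | previous word) from a toy corpus with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    vocabulary_size = len(unigrams)

    def probability(previous, word):
        return (bigrams[(previous, word)] + 1) / (unigrams[previous] + vocabulary_size)

    return probability

# Illustrative corpus: "the" is a far more likely continuation of "call"
# than "office" is, so the model restricts the search accordingly.
p = train_bigram_model(["call the office", "call the bank", "close the office"])
print(p("call", "the") > p("call", "office"))  # True
```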

These three entities are combined in the recognition engine to recognise speech.