How the Sogou version of Siri was forged

At the end of 2011, Sogou formed a speech recognition team;

In June 2012, it launched its first voice search engine;

On August 3, it launched the Sogou voice engine “Companion”, which integrates three new features: slur optimization, error correction, and multi-round interaction.

Third-party test data shows that Sogou’s speech recognition performs on par with iFlytek’s, which clearly exceeds what the Sogou voice team expected when it was founded.

Sogou voice under the tailwind

Around 2012, speech recognition was not yet a popular concept, and few internet companies were rushing into it, although iFlytek had already emerged by then.

Sogou had its input method and its search products. After the input method had gone through several iterations and matured, Sogou CEO Wang Xiaochuan tried to move into the speech recognition market, but the company did not originally intend to do the research and development alone.

“iFlytek approached us at the time, aiming to combine our strengths and launch a better voice product,” said Wang Yanfeng, director of Sogou’s Speech Interaction Technology Center.

Sogou had its input method and search engine; iFlytek had its reserves of voice technology. Cooperation would have been a win-win. But the negotiations did not go well. iFlytek did agree to cooperate on a voice assistant product, with Sogou providing back-end services and iFlytek responsible for the front-end product. However, the two sides could not strike a balance of interests, and the cooperation broke down.

Wang Yanfeng said: “iFlytek hoped to gain a foothold on the internet through the input method, and we were preparing to enter the mobile internet ourselves, so we were unlikely to hand our input method advantage to someone else.”

The failed cooperation left Sogou with a second option: developing its own proprietary speech recognition product. The company soon reached an internal consensus: “The clock is ticking, so let’s get moving and do it ourselves!”

Having made up its mind, Sogou began recruiting to expand the team. But speech technology cannot be accumulated in a short time, so the first step was to choose a partner whose technical skills Sogou trusted, and that was Google. In the first half of 2012, building on the Google engine, data collection and product development progressed very quickly.

“We started on this in January, and by June we had a version of the engine online. Third-party test data showed that this version’s accuracy on maps had already surpassed Baidu’s.”

The map engine came from behind to surpass Baidu in accuracy. For a Sogou that had been in speech recognition for only half a year, this was a near-perfect result.

Nevertheless, the product still had problems. The experience had improved greatly, but there was still a gap with iFlytek, so Sogou did not test it in its input method. According to Wang Yanfeng, the map scenario demands far less of speech recognition than the input method does.

Six months later (November 2012), as data from the Sogou input method accumulated, the company abandoned the Google engine, switched to its own engine, and extended speech recognition to the input method.

With the popularity of Siri, speech recognition products also won over a large number of consumer users. In 2013, the data accumulated through the Sogou input method reached 15,000 hours. Relying on this data, deep learning, and a maturing team, Sogou’s speech recognition performance drew level with iFlytek and Baidu, followed by companies such as Unisound and AISpeech.

New beginnings: the Sogou edition of “Siri” arrives

Having a user entry point is a unique advantage for a speech recognition company.

In terms of data volume, Sogou and Baidu have a distinct advantage over other companies. But compared with brands such as iFlytek and Baidu, Sogou’s voice was rarely heard outside the industry until the release of “Companion”.

“Companion” is very important to Sogou’s voice business. It is no exaggeration to say that the two are synonymous, as Duer is to Baidu, Google Now to Google, and Siri to Apple …

According to Sogou’s official introduction, “Companion” has three features: slur optimization, error correction, and multi-round interaction. None of them is a new idea, but from a technical point of view all three carry real weight.

Slur optimization

The slur problem originates with users: when someone talks too fast, syllables get slurred or swallowed, and the machine cannot adapt to this natural way of speaking.

Accurately recognizing fast speech requires both technology and a wealth of training material. Wang Yanfeng said that a large amount of slurred speech was added to the corpus during training, and that some modeling optimizations were made specifically for slurs. This is the foundation of slur handling.

Take “Companion” as an example:

For acoustic modeling, Companion uses an LSTM+CTC model, which describes in detail both the pronunciations themselves and the differences between them;

Companion also uses a neural-network language model to revise recognition results, relying on longer history to minimize the impact of slurs on the results;

In addition, Companion screens and generates training data, adjusting the data mix to get the best slur recognition results.
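To make the CTC half of “LSTM+CTC” a little more concrete, here is a minimal sketch of greedy CTC decoding: pick the most likely label for each audio frame, merge consecutive repeats, and drop the blank symbol. Because CTC does not need a fixed number of frames per unit, the same rule works whether a syllable spans many frames or only one, as in fast speech. The label set, the frame scores, and the function name are illustrative assumptions, not Sogou’s implementation.

```python
from typing import List, Sequence

BLANK = "_"  # assumed name for the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      labels: Sequence[str]) -> List[str]:
    """Greedy CTC decoding over per-frame label scores.

    frame_scores[t][k] is the score of labels[k] at frame t
    (in a real system this would come from the LSTM output layer).
    """
    decoded: List[str] = []
    previous = None
    for scores in frame_scores:
        # Best label for this frame.
        best = labels[max(range(len(scores)), key=scores.__getitem__)]
        # Emit only when the label changes and is not the blank.
        if best != previous and best != BLANK:
            decoded.append(best)
        previous = best
    return decoded


# Hypothetical example: "ni" is held for two frames, "hao" for one.
labels = [BLANK, "ni", "hao"]
frames = [
    [0.1, 0.8, 0.1],  # ni
    [0.2, 0.7, 0.1],  # ni (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],  # hao
]
result = ctc_greedy_decode(frames, labels)
```

A production decoder would normally use beam search together with a language model rather than this greedy rule.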

Voice correction

Generally speaking, voice correction exists to patch speech recognition errors: it lets the user fix misrecognized text with natural speech, without any manual operation.

The correction process has three steps: speech recognition (recognizing the user’s correction command), semantic analysis (working out what the user intends to modify), and text revision (executing the modification command). Optimizing overall system performance is a joint optimization across these steps: speech recognition is tuned to this vertical domain, but the knowledge in its language model relies heavily on the semantic analysis module.

On top of semantic analysis, the system also needs knowledge from the input method and from search: character decomposition from the input method (for example, describing the character 章 as 立 plus 早), the massive lexicon (for example, “inkstone”), and the search knowledge graph (for example, linking Qiu Yong to Tsinghua University).
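The three-step correction flow described above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: step one (speech recognition) is skipped and the command arrives as already-recognized text, the “semantic analysis” is two hypothetical English patterns, and all names (`EditIntent`, `parse_command`, `apply_intent`, `correct`) are made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditIntent:
    action: str            # "replace" or "delete"
    target: str            # text the user wants to change
    replacement: str = ""  # new text (empty for delete)


def parse_command(command: str) -> Optional[EditIntent]:
    """Step 2 (semantic analysis): toy pattern-based intent parsing."""
    command = command.strip()
    m = re.fullmatch(r"change (.+) to (.+)", command, flags=re.IGNORECASE)
    if m:
        return EditIntent("replace", m.group(1), m.group(2))
    m = re.fullmatch(r"delete (.+)", command, flags=re.IGNORECASE)
    if m:
        return EditIntent("delete", m.group(1))
    return None  # not a correction command


def apply_intent(text: str, intent: EditIntent) -> str:
    """Step 3 (text revision): edit the last occurrence of the target."""
    idx = text.rfind(intent.target)
    if idx == -1:
        return text  # nothing to modify
    return text[:idx] + intent.replacement + text[idx + len(intent.target):]


def correct(text: str, recognized_command: str) -> str:
    """Run steps 2 and 3 on an already-recognized correction command."""
    intent = parse_command(recognized_command)
    return text if intent is None else apply_intent(text, intent)


fixed = correct("meet at the see side", "change see to sea")
```

A real system would replace the regular expressions with a trained semantic parser and would draw on the lexicon and knowledge graph to resolve ambiguous targets.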

Multi-round interaction

Multi-round dialogue has always been a hard problem in speech recognition. Many voice products claim to support multi-round interaction, but their actual performance is another matter.

If the user only issues a single command, the task is just a classification problem in machine learning, but multi-round interaction is more complex. It requires context, and user behavior is often unpredictable and produces many new behavior patterns. For the machine, this means adding more states and more transitions between states, and dynamically and continually building or adjusting the state machine based on user-generated data. This is the biggest difficulty in multi-round interaction.
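A minimal sketch of the kind of state machine described above: it keeps context across rounds and lets new transitions be added at runtime as new user behavior is observed. The class, the state names, and the intent names are hypothetical, and intent classification itself is assumed to happen elsewhere.

```python
from typing import Dict, Tuple


class DialogueStateMachine:
    """Toy multi-round dialogue tracker with a growable transition table."""

    def __init__(self, start: str) -> None:
        self.state = start
        self.context: Dict[str, str] = {}  # slots carried across rounds
        self.transitions: Dict[Tuple[str, str], str] = {}

    def add_transition(self, state: str, intent: str, next_state: str) -> None:
        # Transitions can be added at any time, e.g. when logs reveal
        # a behavior pattern the machine did not cover before.
        self.transitions[(state, intent)] = next_state

    def step(self, intent: str, **slots: str) -> str:
        # New slot values overwrite old ones; untouched slots persist.
        self.context.update(slots)
        # Unknown (state, intent) pairs leave the state unchanged.
        self.state = self.transitions.get((self.state, intent), self.state)
        return self.state


dm = DialogueStateMachine("idle")
dm.add_transition("idle", "ask_weather", "weather")
dm.add_transition("weather", "change_city", "weather")

dm.step("ask_weather", city="Beijing", day="today")  # round 1
dm.step("change_city", city="Shanghai")              # round 2 reuses "day"

# Later, a newly observed pattern is added without rebuilding the machine.
dm.add_transition("weather", "book_flight", "booking")
final_state = dm.step("book_flight")
```

The second round only mentions a city, yet the day from the first round is still available in `dm.context`, which is the role context plays in multi-round interaction.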

Multi-round interaction requires a powerful knowledge graph and technical architecture.

“Without a good knowledge graph and technical architecture, your voice product will only be a toy,” Wang Yanfeng said.

Judging from the iterations of various speech recognition products, the contest over user experience has risen to a new level. But it seems certain that future product differentiation will come not from technology but from accumulated data. How far Sogou will take “Companion” remains to be seen.


Originally published at on August 5, 2016.