Useful Tips for Text to Speech Applications | HMS ML Kit

Mustafa Sürücü
Published in Huawei Developers · Nov 17, 2020

Hi folks,

In this article, I would like to put forward useful scenarios that you can use in your Text to Speech applications developed with HMS. The Text to Speech capabilities were explained together with the Language Detection service in my last article. If you are new to this topic, I strongly recommend starting with that post, which I shared below.

Here are the use cases that we can implement to enrich our applications:

  1. Selection of timbres for different language options.
  2. Allowing users to adjust speed and volume settings.
  3. Sentence tracking synchronized with the speaker.
  4. Obtaining the audio output of the Text to Speech service.

HMS Text to Speech can convert text information into audio output in real time. Six languages are supported in the latest version of the SDK: English, Chinese, French, Spanish, German, and Italian. However, male and female timbres are only available for Chinese and English; for the other languages, only a female timbre is supported.

We can give users the opportunity to choose a male or female timbre for English and Chinese inputs. For the rest of the languages, only the female timbre is supported, so we adjust our settings accordingly.

You can find the updated version of our Interface class below.
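Since the interface only needs to pass results back to the activity, a minimal sketch could look like the following; the interface and method names here are illustrative, not necessarily the ones used in the project.

```kotlin
// Illustrative interface: the activity implements it so that the language
// detection flow and the timbre dialog can report their results back.
interface TtsResultInterface {

    // Called when the language of the input text has been detected.
    fun onLanguageDetected(languageCode: String)

    // Called when the user picks a timbre (male or female) from the dialog.
    fun onTimbreSelected(isMale: Boolean)
}
```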

In the flow of our application, the language of the input text is detected first. Then, the speaker is selected according to the language of the text.

If the input language is English or Chinese, we will open a dialog box for timbre selection on the first input. If the language is neither English nor Chinese, the speaker will be assigned as female.
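A minimal sketch of that decision, assuming the remote language detector from the previous article; openTimbreDialog() and useFemaleTimbre() are hypothetical helpers (the dialog version is sketched in the next step), and the import paths may differ slightly between SDK versions.

```kotlin
import android.util.Log
import com.huawei.hms.mlsdk.langdetect.MLLangDetectorFactory
import com.huawei.hms.mlsdk.langdetect.cloud.MLRemoteLangDetectorSetting

fun detectLanguageAndPickSpeaker(text: String) {
    val setting = MLRemoteLangDetectorSetting.Factory()
        .setTrustedThreshold(0.01f)
        .create()
    val langDetector = MLLangDetectorFactory.getInstance().getRemoteLangDetector(setting)

    langDetector.firstBestDetect(text)
        .addOnSuccessListener { languageCode ->
            when (languageCode) {
                "en", "zh" -> openTimbreDialog(languageCode) // let the user pick male or female
                else -> useFemaleTimbre(languageCode)        // only the female timbre is available
            }
        }
        .addOnFailureListener { e ->
            Log.e("TTS", "Language detection failed", e)
        }
}
```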

As you can see above, we open a dialog box for the first input of Chinese or English text. If this is not the first input, the code can detect that from the giveCurrentLng() method and allow users to change the selection without opening a dialog box.

The gender of the timbre is assigned according to the user's selection and is tracked with booleans for the following inputs.
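A sketch of that dialog, assuming it lives inside the Activity; the boolean names are hypothetical and are only meant to show how the choice is remembered for the following inputs.

```kotlin
import android.app.AlertDialog

// Hypothetical flags that remember the user's choice for the following inputs.
private var isMaleSelected = false
private var timbreChosen = false

private fun openTimbreDialog(languageCode: String) {
    val timbres = arrayOf("Female", "Male")
    AlertDialog.Builder(this)
        .setTitle("Select a timbre")
        .setItems(timbres) { dialog, which ->
            isMaleSelected = (which == 1)   // index 1 corresponds to the male timbre
            timbreChosen = true             // do not show the dialog again for this language
            dialog.dismiss()
        }
        .show()
}
```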

Dialog box for Timbre Selection

A configuration object should be defined for the Text to Speech engine. After this step, we will set the required configurations in accordance with the scenario.

We set currentLanguage here to understand whether it is necessary to choose a timbre for the next input. If the user has already made a selection for Chinese or English, we will not ask again.

Our second scenario is allowing the user to adjust the speed and volume values through the same configuration object. If you check the first method, where we detect the language, you will notice that we initialize the mlConfigs object and set the speed and volume values arranged by the user. The methods that fetch the latest values of the seek bars are shown in the sketch below.
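A sketch of that configuration, based on the MLTtsConfig API; the seek bar fields, the helper names, and the progress-to-value mapping are illustrative, and only the English and Chinese cases of the timbre scenario are shown. Check MLTtsConstants in your SDK version for the full list of speakers.

```kotlin
import android.widget.SeekBar
import com.huawei.hms.mlsdk.tts.MLTtsConfig
import com.huawei.hms.mlsdk.tts.MLTtsConstants
import com.huawei.hms.mlsdk.tts.MLTtsEngine

private lateinit var speedSeekBar: SeekBar
private lateinit var volumeSeekBar: SeekBar
private var currentLanguage = ""

private fun buildEngine(languageCode: String, isMale: Boolean): MLTtsEngine {
    currentLanguage = languageCode   // remembered so we do not ask for the timbre again

    // Only English and Chinese are shown here; the other languages would use
    // the female speaker constant of that language from MLTtsConstants.
    val isChinese = languageCode == "zh"
    val language = if (isChinese) MLTtsConstants.TTS_ZH_HANS else MLTtsConstants.TTS_EN_US
    val speaker = when {
        isChinese && isMale -> MLTtsConstants.TTS_SPEAKER_MALE_ZH
        isChinese -> MLTtsConstants.TTS_SPEAKER_FEMALE_ZH
        isMale -> MLTtsConstants.TTS_SPEAKER_MALE_EN
        else -> MLTtsConstants.TTS_SPEAKER_FEMALE_EN
    }

    val mlConfigs = MLTtsConfig()
        .setLanguage(language)
        .setPerson(speaker)
        .setSpeed(getSpeedFromSeekBar())    // value arranged by the user
        .setVolume(getVolumeFromSeekBar())  // value arranged by the user

    return MLTtsEngine(mlConfigs)
}

// Hypothetical helpers: map the seek bar progress (0..100) to the value range
// the engine expects; check the SDK documentation for the exact ranges.
private fun getSpeedFromSeekBar(): Float = speedSeekBar.progress / 50f
private fun getVolumeFromSeekBar(): Float = volumeSeekBar.progress / 50f
```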

Volume & Speed Settings

One of the most common and important features of Text to Speech applications is tracking sentences in sync with the speaker. To implement this feature, we will use the callback object of the engine.

onRangeStart() is one of the callback methods used to manage the Text to Speech engine. It reports the range of the currently played segment within the text being processed. We will create a HashMap object to map task IDs to sentences.

The speak() method of the engine creates a task ID for each input text. As we pass each sentence within the loop, every sentence is mapped to a task ID. The speak() method also triggers onRangeStart() on each turn, where we call our tracking method to highlight the currently played segment by using this mapping.

The listen() method takes the corresponding sentence as a parameter and highlights the text in sync with the currently played audio segment.
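Putting those pieces together, a sketch of the tracking flow could look like the following, assuming the code lives inside the Activity; listen() and the way the sentence is highlighted are illustrative, while the callback follows the SDK's MLTtsCallback interface.

```kotlin
import android.os.Bundle
import android.util.Pair
import com.huawei.hms.mlsdk.tts.*

private val sentenceMap = HashMap<String, String>()   // task ID -> sentence

private fun speakAll(engine: MLTtsEngine, sentences: List<String>) {
    engine.setTtsCallback(callback)
    for (sentence in sentences) {
        // speak() returns the task ID created for this piece of text.
        val taskId = engine.speak(sentence, MLTtsEngine.QUEUE_APPEND)
        sentenceMap[taskId] = sentence
    }
}

private val callback = object : MLTtsCallback {
    override fun onRangeStart(taskId: String, start: Int, end: Int) {
        // Triggered when the segment [start, end) of this task starts playing.
        sentenceMap[taskId]?.let { sentence -> runOnUiThread { listen(sentence) } }
    }

    override fun onAudioAvailable(taskId: String, fragment: MLTtsAudioFragment,
                                  offset: Int, range: Pair<Int, Int>, bundle: Bundle?) {
        // Used in the audio output scenario below.
    }

    override fun onEvent(taskId: String, eventId: Int, bundle: Bundle?) { }
    override fun onWarn(taskId: String, warn: MLTtsWarn) { }
    override fun onError(taskId: String, error: MLTtsError) { }
}

// Illustrative tracking method: highlight the sentence currently being spoken,
// e.g. by updating a TextView with a colored span.
private fun listen(sentence: String) {
    // highlightSentence(sentence)
}
```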

Sentence Tracking

The last part of the article is about obtaining the audio output of the Text to Speech service. Getting the audio data from the engine depends primarily on how you call the speak() method.

  • When you use speak(sentences[i], MLTtsEngine.QUEUE_APPEND), the built-in player of the SDK is used to play the audio fragments in queuing mode.
  • When you use speak(sentences[i], MLTtsEngine.QUEUE_APPEND or MLTtsEngine.OPEN_STREAM), the OPEN_STREAM flag triggers the onAudioAvailable() method of the callback and gives you the synthesized audio stream as an output. However, the built-in player of the SDK still plays the audio fragments, as in the first scenario.
  • When you use speak(sentences[i], MLTtsEngine.QUEUE_APPEND or MLTtsEngine.OPEN_STREAM or MLTtsEngine.EXTERNAL_PLAYBACK), you still get the synthesized audio stream from onAudioAvailable(), but the audio is not played automatically; playback is entirely under your control (see the sketch after this list).
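The three calls side by side, with the mode flags combined through Kotlin's bitwise or:

```kotlin
// 1. Queue the sentence and let the SDK's built-in player play it.
engine.speak(sentences[i], MLTtsEngine.QUEUE_APPEND)

// 2. Additionally stream the synthesized audio to onAudioAvailable();
//    the built-in player still plays the fragments.
engine.speak(sentences[i], MLTtsEngine.QUEUE_APPEND or MLTtsEngine.OPEN_STREAM)

// 3. Stream the audio to onAudioAvailable() only; nothing is played
//    automatically, playback is entirely up to you.
engine.speak(sentences[i], MLTtsEngine.QUEUE_APPEND or MLTtsEngine.OPEN_STREAM or MLTtsEngine.EXTERNAL_PLAYBACK)
```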

The onAudioAvailable() method returns an audioFragment object. We can get the items below from the audio fragment; a sketch of reading them follows the list.

  • audioData → byte[]
  • audioFormat → int
  • sampleRateInHz → int
  • channelInfo → int
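Filling in the onAudioAvailable() stub from the callback sketch above, those fields can be read like this; writeToFile() is the hypothetical helper described next.

```kotlin
override fun onAudioAvailable(taskId: String, fragment: MLTtsAudioFragment,
                              offset: Int, range: Pair<Int, Int>, bundle: Bundle?) {
    val audioData: ByteArray = fragment.audioData    // raw PCM bytes of this fragment
    val audioFormat: Int = fragment.audioFormat      // encoding of the fragment
    val sampleRate: Int = fragment.sampleRateInHz    // sample rate of the fragment
    val channels: Int = fragment.channelInfo         // channel information

    writeToFile(audioData)                           // append this fragment to our .pcm file
}
```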

In the current version, HMS Text to Speech produces PCM data to create human speech. It may support audio formats like MP3 in future releases. You can create your own audio file in your file system by saving the data as a .pcm file.

We can create a whole audio file by combining the ByteArrays together. The writeToFile() method takes each of them as a parameter and appends it to a file in PCM format, as sketched below.
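A minimal sketch of such a method, assuming it lives inside the Activity and that the file goes to the app's external files directory; the file name and location are illustrative.

```kotlin
import java.io.File
import java.io.FileOutputStream

// Appends each synthesized fragment to a single .pcm file.
private fun writeToFile(audioData: ByteArray) {
    val pcmFile = File(getExternalFilesDir(null), "tts_output.pcm")
    FileOutputStream(pcmFile, true).use { stream ->   // `true` opens the stream in append mode
        stream.write(audioData)
    }
}
```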

Now you have the audio file of your input text. It can be played with AudioTrack, as in the sketch below, or used for other use cases.
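For example, the saved PCM data could be played back with AudioTrack roughly like this; the channel and encoding values are assumptions and must match what the audio fragments actually reported (sampleRateInHz, channelInfo, audioFormat).

```kotlin
import android.media.AudioFormat
import android.media.AudioManager
import android.media.AudioTrack
import java.io.File

private fun playPcmFile(pcmFile: File, sampleRateInHz: Int) {
    val bufferSize = AudioTrack.getMinBufferSize(
        sampleRateInHz, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT)

    val audioTrack = AudioTrack(
        AudioManager.STREAM_MUSIC, sampleRateInHz,
        AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT,
        bufferSize, AudioTrack.MODE_STREAM)

    audioTrack.play()
    pcmFile.inputStream().use { input ->
        val buffer = ByteArray(bufferSize)
        var read = input.read(buffer)
        while (read > 0) {
            audioTrack.write(buffer, 0, read)   // feed the raw PCM bytes to the track
            read = input.read(buffer)
        }
    }
    audioTrack.stop()
    audioTrack.release()
}
```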

I wanted to address important details of HMS Text to Speech in this article. I hope it will make your work easier and quicker.

Thank you for reading! See you in my next articles :)

References
