Conversations with Many Siri

Qiao Lin
May 9, 2017

As part of the research for our IoT application running on Firebase cloud services, I built a voice-to-voice iOS application in three stages:

  1. Voice to Text Transcription
  2. Text to Text Translation
  3. Text to Voice Playback

A composition of existing off-the-shelf solutions, the stack comprises iOS AVSpeech and audio session recording, with a Firebase Cloud subscription making realtime requests to the Translation API.

Stepping through the interaction:

iOS Interface

User speaks to the phone

📞 Request Speech Authorization 🎤 Request Microphone Authorization 🎧 Start Audio Engine for the recording session
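The permission and recording setup above can be sketched with the Speech and AVFoundation APIs. A minimal sketch only: error handling and the recognition-task callback are trimmed, and function names like `startRecordingSession` are my own, not the app's.

```swift
import Speech
import AVFoundation

let audioEngine = AVAudioEngine()
let request = SFSpeechAudioBufferRecognitionRequest()

// 📞 + 🎤 Ask for speech-recognition and microphone permission in turn.
func requestAuthorizations(_ completion: @escaping (Bool) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        guard status == .authorized else { return completion(false) }
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            completion(granted)
        }
    }
}

// 🎧 Start the audio engine and feed microphone buffers to the recognizer.
func startRecordingSession() throws {
    let inputNode = audioEngine.inputNode
    let format = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        request.append(buffer)
    }
    audioEngine.prepare()
    try audioEngine.start()
}
```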

The audio recording is transcribed to text

The Firebase SDK sends the transcribed text, along with the input language code, to the Realtime Database
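The entry the SDK pushes might look like the following Codable sketch. The field names here are my assumptions; the post only specifies that the transcribed text and the input language code are sent.

```swift
import Foundation

// Hypothetical shape of one message entry.
struct TranscriptEntry: Codable {
    let message: String       // transcribed text
    let languageCode: String  // input language, e.g. "en-US"
    let translated: Bool      // false for the original snippet
}

// Encode an entry to the JSON the Realtime Database stores.
func encodeEntry(_ entry: TranscriptEntry) -> String? {
    let encoder = JSONEncoder()
    encoder.outputFormatting = .sortedKeys
    guard let data = try? encoder.encode(entry) else { return nil }
    return String(data: data, encoding: .utf8)
}
```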

Firebase cloud PaaS

A realtime database that pushes to all subscribers. The NoSQL structure makes search queries easier and helps avoid deep nesting, though this means some duplication. The message table is divided into language columns, where each row is entered under an auto-generated id key. Depending on the input language, the first entry is posted in the {input language} column with a translated boolean of false (indicating this is the original snippet); this triggers the hook (a cloud function) that listens for value changes and fires off HTTP requests to the Translation API.
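Under that layout, the message table might look like this. A hypothetical sketch: the key names and the auto-generated id are illustrative, and the same id key is reused across language columns.

```json
{
  "messages": {
    "en": {
      "-KqXautoGenId": { "text": "I'll be late for the meeting", "translated": false }
    },
    "fr": {
      "-KqXautoGenId": { "text": "...", "translated": true }
    }
  }
}
```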

Credit to message-translation

The cloud function listens onWrite and checks the translated value

Upon request success, the cloud function adds the translated entries to the other {target} language columns with the SAME id key and translated set to true.
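The function's branching can be sketched as pure logic, written in Swift here just for illustration (the real cloud function runs in JavaScript on Firebase, and the function and parameter names are mine):

```swift
// Decide which language columns need a translated entry for a new write.
// Only the original snippet (translated == false) triggers translation;
// entries already marked translated are skipped, otherwise every write
// the function makes would re-trigger it in a loop.
func targetsForNewEntry(sourceLanguage: String,
                        supported: [String],
                        alreadyTranslated: Bool) -> [String] {
    guard !alreadyTranslated else { return [] }
    return supported.filter { $0 != sourceLanguage }
}
```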

Writes new entry with same ID key in target language column

The triggered action in this case is translation, which can be replaced with other functionality, such as AI intent detection for actions, or deep voice sample training.

“Tell Kenichi I’ll be late for the meeting”
"message":"I’ll be late for the meeting",
“contacts":["Kenichi Takahira”],

The updated realtime database now pushes automatically to the other subscribers.

Back to the iOS interface

At the end of the postToFirebaseForTranslation function, startObserver is called

The database reference registers an observer on EventType.value changes, which returns a snapshot
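A sketch of what startObserver might do with the FirebaseDatabase SDK. The database path and the callback shape are my assumptions; only the observer-on-value-changes pattern comes from the post.

```swift
import FirebaseDatabase

// Register an observer on value changes; Firebase delivers a snapshot
// every time the watched language column is updated.
func startObserver(language: String, onChange: @escaping ([String: Any]) -> Void) {
    let ref = Database.database().reference(withPath: "messages/\(language)")
    ref.observe(.value) { snapshot in
        guard let rows = snapshot.value as? [String: Any] else { return }
        onChange(rows) // reload the table with the translated rows
    }
}
```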

Last but not least, the user gets a table of translated text, and tapping a row triggers an utterance through the AV speech synthesizer
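Playback itself is only a few lines of AVSpeechSynthesizer. A minimal sketch: if no installed voice matches the language code, the voice falls back to the system default.

```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()

// Speak one translated row with a voice matching its language code.
func speak(_ text: String, languageCode: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode) // nil falls back to the default voice
    synthesizer.speak(utterance)
}
```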

Top: the input speech transcribed into a table of target languages, with a speech control tab for playback

At the moment (May 2017) there are only 37 voices available for speech utterances, each with its own name and characteristic profile. Five of them are in English and three in Chinese, which says a lot about the adaptation of voice.
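You can count the installed voices per language yourself; the numbers above reflect iOS as of May 2017 and will differ on later releases.

```swift
import AVFoundation

// Group the available voices by their two-letter language prefix and count them.
let voiceCounts = Dictionary(grouping: AVSpeechSynthesisVoice.speechVoices()) { voice in
    String(voice.language.prefix(2))
}.mapValues { $0.count }
print(voiceCounts) // e.g. ["en": 5, "zh": 3, ...] on a 2017-era device
```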

Jamaican rap song fed into translation, text for German & English voice utterances
Japanese & Spanish Voice Utterances

To have a little fun, I fed in a Jamaican rap song (relatively hard-to-comprehend English), and the application was able to transcribe a large amount of the voice clip. Sending it off for translation, maybe we can reproduce the same rhythm and beats in other languages too.