Optimising Text-To-Speech Performance With Parallelised Sentence Streaming

Paulo Taylor
Aug 15, 2023

Conversational AI has its own set of challenges, but putting Conversational AI on a phone call adds an extra set of them, mostly because there is no UI, which makes responsiveness paramount.

I’ve been dealing with exactly that sort of challenge lately while developing Voyp, a mobile app that acts as an AI assistant and makes calls on behalf of users. Reducing latency has been one of the goals from the beginning. Most of it came down to running as much as possible on the device, including the Text-To-Speech, Speech-To-Text and Generative AI APIs.

Initially I was using Anthropic’s Claude with the Instant model, which seemed to outperform OpenAI’s ChatGPT. Since then there have been many updates on both sides, and frankly the difference between the two providers isn’t that big of a deal at this point.

The most recent optimisation on that front came not from deciding which provider to use but from exploring a feature they have in common: streaming.

One of the issues with Voyp is the time it has to wait for the response generated by the AI model. Streaming allows the response to be received incrementally, word by word. You may wonder what the point is: since there’s no UI, what are the true benefits of such an approach?

The technique, which I baptised Parallelised Sentence Streaming, allows the synthesis of text into audio to start before the full text response has been received from the generative AI model. Let’s start with an example:

Hello Matthew, I’m calling on behalf of Paulo Taylor. I would like to make a table reservation for tomorrow. Could you assist me with that please?

The first step, which I call Parallelised Sentence Synthesis, consists of splitting the paragraph into smaller sentences, synthesising each sentence in parallel, and then stitching the results together:

S1: Hello Matthew, I’m calling on behalf of Paulo Taylor.
S2: I would like to make a table reservation for tomorrow.
S3: Could you assist me with that please?

Synthesis of S1, S2 and S3 starts in parallel, and by the time S1 stops playing, S2 should be ready to play, and the same goes for S3. There are some nuances to take care of, particularly making sure everything plays back in the original order, but the general idea holds. Synthesising the whole paragraph in one go definitely takes longer than this method, so there are some interesting performance gains in this approach.
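
Here’s a minimal sketch of the idea using Kotlin coroutines. `synthesize` and `playAudio` are hypothetical stand-ins for whatever TTS engine and audio player you use; what matters is the orchestration:

```kotlin
import kotlinx.coroutines.*

// Hypothetical stand-in for a real TTS call; returns the synthesised clip.
suspend fun synthesize(sentence: String): ByteArray {
    delay(300) // simulate synthesis latency
    return sentence.encodeToByteArray()
}

// Hypothetical stand-in for an audio player that suspends until playback ends.
suspend fun playAudio(clip: ByteArray) {
    println("Playing: ${clip.decodeToString()}")
}

fun main() = runBlocking {
    val sentences = listOf(
        "Hello Matthew, I'm calling on behalf of Paulo Taylor.",
        "I would like to make a table reservation for tomorrow.",
        "Could you assist me with that please?"
    )
    // Start synthesis of S1, S2 and S3 in parallel...
    val clips = sentences.map { sentence -> async { synthesize(sentence) } }
    // ...then await and play them strictly in order, so while S1 is
    // playing, S2 and S3 are finishing in the background.
    for (clip in clips) playAudio(clip.await())
}
```

Ordering comes for free here: `async` hands back the clips in list order, and awaiting them sequentially keeps playback at S1, S2, S3 even if S3 finishes synthesising first.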

Adding this concept on top of the streaming functionality of ChatGPT or Claude saves even more milliseconds, which are precious when we’re interacting with users without a User Interface.

When streaming, the Generative AI response is typically received word by word. For the previous example it would look something like this:

Hello
Matthew,
I’m
calling
on
behalf
of
Paulo
Taylor
. ← Start S1 text synthesis to audio
I
would
like
to
make
a
table
reservation
for
tomorrow
. ← Start S2 text synthesis to audio
Could
you
assist
me
with
that
please
? ← Start S3 text synthesis to audio

This effectively means that, with this approach, the app can start playing audio responses to the other side of the phone line before it has received the full response from the AI model. As you can imagine, this improves performance substantially, saving many hundreds of milliseconds and at times up to a few seconds. This is Parallelised Sentence Streaming.
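
Here’s a sketch of the full pipeline, again in Kotlin. `tokenStream()` is a hypothetical stand-in for the streamed ChatGPT or Claude response, and `synthesize`/`playAudio` are the same kind of placeholder as before; the shape of the pipeline is the point, not any specific API:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.flow.*

suspend fun synthesize(sentence: String): ByteArray { // placeholder TTS call
    delay(300)
    return sentence.encodeToByteArray()
}

suspend fun playAudio(clip: ByteArray) { // placeholder playback
    println("Playing: ${clip.decodeToString()}")
}

// Placeholder for the model's streamed response, arriving word by word.
fun tokenStream(): Flow<String> =
    ("Hello Matthew, I'm calling on behalf of Paulo Taylor. " +
        "I would like to make a table reservation for tomorrow. " +
        "Could you assist me with that please?")
        .split(" ").asFlow().onEach { delay(50) }

// Naive end-of-sentence check; see below for why this isn't enough.
fun endsSentence(word: String) =
    word.endsWith('.') || word.endsWith('?') || word.endsWith('!')

fun main() = runBlocking {
    val clips = Channel<Deferred<ByteArray>>(Channel.UNLIMITED)

    // Player coroutine: plays clips strictly in arrival order. S1 can
    // start playing while the model is still streaming S2 and S3.
    val player = launch {
        for (clip in clips) playAudio(clip.await())
    }

    val buffer = StringBuilder()
    tokenStream().collect { word ->
        buffer.append(word).append(' ')
        if (endsSentence(word)) {
            val sentence = buffer.toString().trim()
            buffer.clear()
            clips.send(async { synthesize(sentence) }) // synthesis starts right away
        }
    }
    clips.close() // stream finished; the player drains whatever is left
    player.join()
}
```

The channel preserves the order in which sentences were dispatched, so playback stays in sequence even though synthesis runs in parallel.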

There are other considerations when detecting the end of a sentence, because simply relying on punctuation marks is not always enough. For example:

Hello Matthew, I’m calling on behalf of Mr. Paulo Taylor. I would like to make a table reservation for tomorrow at 8 p.m. Could you assist me with that please?

Splitting naively on punctuation would produce:

S1: Hello Matthew, I’m calling on behalf of Mr.
S2: Paulo Taylor.
S3: I would like to make a table reservation for tomorrow at 8 p.
S4: m.
S5: Could you assist me with that please?

So, as you can imagine, there are a few challenges that require some attention, and there are many ways to work around them; one simple option is sketched below. Still, the core concept laid out here will surely help bring down latency by a few hundred milliseconds, if not seconds, which is an important aspect when building interfaces without graphic cues, like when you’re having a conversation over the phone.
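
A smarter version of the `endsSentence` check from the earlier sketch could keep a small list of abbreviations that must not terminate a sentence. The list below is illustrative and far from exhaustive; a production splitter would need a fuller list or a proper sentence segmenter:

```kotlin
// Abbreviations that should not end a sentence. Illustrative only.
val nonTerminal = setOf("mr.", "mrs.", "ms.", "dr.", "p.", "a.", "p.m.", "a.m.")

fun endsSentence(textSoFar: String): Boolean {
    val trimmed = textSoFar.trimEnd()
    val last = trimmed.lastOrNull() ?: return false
    if (last != '.' && last != '!' && last != '?') return false
    // Don't split right after a known abbreviation.
    val lastWord = trimmed.substringAfterLast(' ').lowercase()
    return lastWord !in nonTerminal
}

fun main() {
    println(endsSentence("I'm calling on behalf of Mr."))          // false: the name is still coming
    println(endsSentence("a reservation for tomorrow at 8 p."))    // false: "p.m." is still arriving
    println(endsSentence("Could you assist me with that please?")) // true
}
```

The trade-off: because “p.m.” itself is treated as non-terminal, the sentence that follows it gets merged into the same chunk, which costs a little parallelism but beats synthesising “8 p.” and “m.” as separate fragments.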

If you want to see it in action, here’s a demo video.

Thank you for reading.
