MyGPT — or how the first ChatGPT powered Alexa Skill was born

Peter Götz
6 min read · Jul 6, 2024

This is blog post #1 of a mini-series about my Alexa Skill that uses ChatGPT to make Alexa smart.

At the beginning of 2023, the world was still hyped about the advent of ChatGPT. But for some reason, no one had made a serious effort to combine it with Alexa. That was strange, because ChatGPT was a purely language-based assistant, and Alexa is a purely voice-controlled assistant, just not as smart as ChatGPT. So I set out to fill that gap and connect the two in an Alexa Skill.

The first prototype was rather quick to build. I already had a Wikipedia Skill (written in Go) publicly available (en, de, es). All I had to do was fork it and replace the API calls to Wikimedia with calls to OpenAI. It was literally done in a day.

And wow, what a result. Alexa suddenly felt like the Star Trek computer. You could ask her arbitrary things, have actual conversations! Yes, just like all the jaw-dropping things people experience when using ChatGPT for the first time. But now you could… just talk!

And so, MyGPT was born.

If you want to try it out, before you continue to read, activate it for your country here: 🇩🇪 | 🇺🇸 | 🇨🇦 | 🇮🇳 | 🇬🇧 | 🇦🇺 | 🇫🇷 . These link directly to the successor skill that can be invoked using the name “ChatGPT”.

The MyGPT logo (later also used for its sibling skill “Chatbot powered by ChatGPT”)

Coming back to the implementation: the truth is, it wasn’t that easy. The prototype did work, yes, but it had many issues, mostly around slow responses from OpenAI and general network connection problems. To get to a 100% success rate on Alexa responses, I had to address a lot of subtle issues. The rest of this post explains them.

Some background on what an Alexa Skill is and how it works: Alexa Skills are the equivalent of apps on smartphones. Users need to activate them explicitly, and they get invoked by the user saying something like “Alexa, open My Encyclopedia”.

Here is how an Alexa Skill works: when a user asks “Alexa, ask MyGPT how tall the Eiffel Tower is”, the device sends the audio to the Alexa service. The Alexa service transcribes it (speech to text), determines that it’s a request for a specific skill, and sends it off to the skill’s endpoint (in MyGPT’s case an AWS Lambda function), passing along a JSON payload that includes the user’s utterance. The Alexa service then waits a maximum of 8s for a response from that endpoint, i.e. a JSON payload that includes the text Alexa should read to the user. If the response doesn’t come within 8s, Alexa tells the user that there was an error with the skill and ends the session.

It’s important to note here that there’s no streaming option available with Alexa. It’s a classic synchronous request/response model. Mostly, at least. But we’ll get to that in a second.

OpenAI, however, does offer a streaming API. So we can make an HTTP request to OpenAI, collect tokens until 7.5s have passed (leaving a little time buffer, just in case), cut off the connection there, and send back to the user whatever we got so far. That alone doesn’t make for a good user experience, though: we need to cut off at the end of a sentence. In Go, there is a convenient library to identify sentence boundaries: neurosnap/sentences. With that, Alexa can give the response proper intonation, because it’s made up of complete sentences. We also know when the answer is incomplete, because we cut off the response from OpenAI ourselves. When it is incomplete, we need to ask the user if they want to hear more.

That solves the problem of sessions ending with a timeout error. However, asking the user if they want to hear more is certainly a controversial design choice. Indeed, people complain about it often, wondering why the Skill doesn’t just keep talking. Unfortunately, I believe there’s no perfect solution given Alexa’s 8s limit and the limited throughput OpenAI provides. So I settled on that approach.

The experience was still not ideal for another reason: for some answers, you’d have to wait almost the full 8s, even though OpenAI had provided a partial response much earlier. Waiting 8s can feel like an eternity when talking to an assistant; this wasn’t acceptable. Some googling revealed that there is a way to “stream” responses to Alexa. It’s not really streaming, but you can send “Progressive Responses” using Alexa’s API, i.e. before providing your final answer, you can already send partial responses to the user. So I changed the implementation to not just iterate through the tokens received from OpenAI and collect them, but also to send off a partial response every second, containing all complete sentences received so far.

That worked very well! With one final small issue: for some reason, Alexa reads progressive responses in a quieter voice than the final response. It sounded really awkward, because she would seem to start shouting at you for the last piece of the response. Of course, the fix was easy: use the volume attribute in the response to turn up the progressive responses.

Median and p90 response time in milliseconds (not including progressive responses, which are even faster). The median is mostly below 2.5s, while the slowest 10% of requests can easily take 7s and above.

At this point, I felt confident that the skill was good enough to be published. So I submitted it, and now it’s available in six countries in English and German (its sibling, Chatbot powered by ChatGPT, also supports French and is available in 7 countries).

The experience was still not perfect. Every once in a while, OpenAI would be so slow that not even a single sentence was available after 1s, 2s, or 3s had passed. After a bit of experimenting, I inserted an “um… um…” in that case. However, people got annoyed by this, because it happened more often than I had expected. Eventually, I changed it to a subtle jingle.

Then there are error conditions where the HTTP connection just hangs. It took me quite a while to understand what was going on. Fortunately, there’s an intuitive configuration parameter in Go’s HTTP client called Timeout. Setting it to 7s made sure the client would back out in time, so the skill can tell the user that something is wrong and ask them if it should try again.

HTTP errors are another issue. The simplest way I found to deal with all of them is to use HashiCorp’s excellent go-retryablehttp library, set the retry wait times to very low values, and retry only twice. Only with these settings do we react fast enough to either try again or back out completely:

retryableHttpClient := retryablehttp.NewClient()

retryableHttpClient.HTTPClient.Timeout = 7 * time.Second // back out before Alexa's 8s limit
retryableHttpClient.RetryMax = 2
retryableHttpClient.RetryWaitMin = 10 * time.Millisecond
retryableHttpClient.RetryWaitMax = 50 * time.Millisecond

Finally, to make sure it really never, ever hangs for whatever reason, we wrap everything in a select statement with a timer:

chatCompletionTimer := time.NewTimer(7500 * time.Millisecond)
fullSentencesSinceLastProgressiveResponseChan := make(chan string, 1)
entireSpeechChan := make(chan string, 1)

go func() {
	// get the actual responses from ChatGPT ...
	// write to fullSentencesSinceLastProgressiveResponseChan,
	// entireSpeechChan
}()

var fullSentencesSinceLastProgressiveResponse, entireSpeech string
select {
case <-chatCompletionTimer.C:
	fullSentencesSinceLastProgressiveResponse = l.Get(r.TakingLongerToGetResponse)
	entireSpeech = fullSentencesSinceLastProgressiveResponse
case fullSentencesSinceLastProgressiveResponse = <-fullSentencesSinceLastProgressiveResponseChan:
	entireSpeech = <-entireSpeechChan
	chatCompletionTimer.Stop()
}

With these changes, I got 100% success rate on responses. No timeouts, no errors.

And that’s it for this post. In the next posts, I will talk about how I introduced a feedback feature, dealt with user frustration, chose a new skill name, added different voices, and more.

Again, if you want to try it out, activate it for your country here:

🇩🇪 | 🇺🇸 | 🇨🇦 | 🇮🇳 | 🇬🇧 | 🇦🇺 | 🇫🇷 .


Peter Götz

SDE at AWS. I breathe code. I'm blogging here privately, opinions are my own. https://petergoetz.dev