Teaching Alexa to speak with a Boston accent
The Boston Accent
The Boston accent is more than just dropping the “R” sound everywhere except before a vowel. It is the local accent of Eastern New England English. What better way to delight an entire user base than by speaking their language? To speak “Boston English,” we need a language for speech. Enter Speech Synthesis Markup Language, or SSML for short.
A Language for Speech
SSML helps us generate speech from text. A Text-to-Speech (TTS) system that supports SSML can turn marked-up text into lifelike speech. To generate natural-sounding voices, SSML can control the following:
- Prosody: the rhythm (timing) and intonation (pitch) of speech
- Breaks: pauses in speech
- Emphasis: adjusted rate and volume of speech
- Phonemes: phonetic pronunciations
Here’s some sample SSML markup:
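A minimal sketch exercising all four controls above (the sentence and attribute values are illustrative, not a specific recording):

```xml
<speak>
  <prosody rate="slow" pitch="low">Welcome to Boston.</prosody>
  <break time="500ms"/>
  It is <emphasis level="strong">wicked</emphasis> good to see you.
  You can <phoneme alphabet="ipa" ph="pɑːk">park</phoneme> the
  <phoneme alphabet="ipa" ph="kɑː">car</phoneme> in Harvard Yard.
</speak>
```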
…this SSML results in the following speech.
A Pronunciation Alphabet
Another option to control the pronunciation of text is the Pronunciation Lexicon Specification (PLS). PLS affords us the ability to create a vocabulary of terms and their respective pronunciations. The pronunciations are spelled out using an alphabet. For our purposes, we will use the International Phonetic Alphabet (IPA).
Here is a sample PLS file that we’ll use to perform a case-sensitive replacement of words (graphemes) with their IPA equivalents (phonemes).
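A short sketch of such a lexicon, with a few illustrative non-rhotic entries:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>car</grapheme>
    <phoneme>kɑː</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>park</grapheme>
    <phoneme>pɑːk</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>chowder</grapheme>
    <phoneme>ˈtʃaʊdə</phoneme>
  </lexeme>
</lexicon>
```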
To generate our synthetic Boston accent with SSML and PLS, we will use Amazon Polly. Polly is an Amazon Web Services (AWS) service that converts text into lifelike speech. Hosted in the AWS Cloud, Polly currently provides 47 voices across 24 languages. Among other things, Polly can:
- Accurately process text (e.g. acronyms, abbreviations, numbers, measurements)
- Infer context (i.e. words that are spelled the same, but pronounced differently)
- Produce highly intelligible, natural speech
Polly’s use cases range from contact centers and educational materials to entertainment and assistive technology for the blind.
PLS to Speech
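A minimal Python sketch of this step using boto3, the AWS SDK for Python (the lexicon name, voice, and file names are assumptions for illustration):

```python
def build_polly_request(text, lexicon_name="bostonAccent", voice_id="Joanna"):
    """Request parameters for Polly's SynthesizeSpeech API,
    applying our PLS lexicon to matching graphemes."""
    return {
        "Text": text,
        "TextType": "text",
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
        "LexiconNames": [lexicon_name],
    }

def synthesize_with_lexicon(text, output_path="speech.mp3"):
    """Register the lexicon once, then synthesize (requires AWS credentials)."""
    import boto3  # AWS SDK; not needed to build the request above
    polly = boto3.client("polly")
    with open("boston.pls") as f:
        polly.put_lexicon(Name="bostonAccent", Content=f.read())
    response = polly.synthesize_speech(**build_polly_request(text))
    with open(output_path, "wb") as f:
        f.write(response["AudioStream"].read())
```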
…this results in the following:
SSML to Speech
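The SSML path looks much the same, except the request declares its text type as SSML and the input is wrapped in a `<speak>` root element. A sketch, again with illustrative defaults:

```python
def wrap_speak(body):
    """Wrap an SSML fragment in the required <speak> root element."""
    return "<speak>" + body + "</speak>"

def build_ssml_request(ssml, voice_id="Joanna"):
    """Request parameters for Polly's SynthesizeSpeech API in SSML mode."""
    return {
        "Text": ssml,
        "TextType": "ssml",  # tells Polly to parse the markup, not read it aloud
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }

def synthesize_ssml(ssml_fragment, output_path="speech.mp3"):
    import boto3  # requires AWS credentials
    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_ssml_request(wrap_speak(ssml_fragment)))
    with open(output_path, "wb") as f:
        f.write(response["AudioStream"].read())
```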
…this results in the following:
Lambda → Polly → S3
Upload the following GitHub project as a .zip file to AWS Lambda: https://github.com/IvanCampos/AmazonPolly*
*update creds.json to use your AWS credentials
When this code is triggered, it will upload a Polly generated speech file to an S3 bucket.
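A minimal Python sketch of that flow (the linked project is the real implementation; the bucket and key names here are hypothetical, and the Lambda’s IAM role needs Polly and S3 permissions):

```python
BUCKET = "my-polly-output"  # hypothetical bucket name
KEY = "boston.mp3"          # hypothetical object key

def public_url(bucket, key):
    """HTTPS URL of the uploaded file (the bucket policy must allow reads)."""
    return "https://{}.s3.amazonaws.com/{}".format(bucket, key)

def lambda_handler(event, context):
    import boto3  # bundled in the AWS Lambda Python runtime
    # 1. Ask Polly for the speech audio.
    speech = boto3.client("polly").synthesize_speech(
        Text="Park the car in Harvard Yard.",
        VoiceId="Joanna",
        OutputFormat="mp3",
    )
    # 2. Upload the audio stream to S3.
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=speech["AudioStream"].read(),
        ContentType="audio/mpeg",
    )
    return {"url": public_url(BUCKET, KEY)}
```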
Alexa Skill → Lambda → S3
This S3 bucket’s MP3 file can then be read by AWS Lambda and output by an Amazon Echo with the following code:
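A sketch of the skill’s Lambda handler: it returns an Alexa response whose output speech is SSML containing an `<audio>` tag pointing at the S3 file (Alexa requires the source to be a publicly reachable HTTPS MP3; the URL below is hypothetical):

```python
def alexa_audio_response(mp3_url):
    """Alexa skill response that plays an MP3 via SSML's <audio> tag."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                "type": "SSML",
                "ssml": '<speak><audio src="{}"/></speak>'.format(mp3_url),
            },
            "shouldEndSession": True,
        },
    }

def lambda_handler(event, context):
    # Hypothetical bucket/key from the previous step.
    return alexa_audio_response(
        "https://my-polly-output.s3.amazonaws.com/boston.mp3")
```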
To have Alexa trigger this code, visit the following site to wire up your skill to the Lambda function: https://developer.amazon.com/edw/home.html#/skills
If this is your first Alexa Skill, the following documentation should help get you going: https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/overviews/steps-to-build-a-custom-skill
To test your skill inside a web browser, check out https://echosim.io
Bringing it all together
Now we can have some fun with our solution.
Feel free to mix and match everything we’ve learned so far to generate your own sentences for Alexa. Here’s an example of the Echo speaking with a wicked Boston accent:
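For instance, a sketch along those lines, combining phonemes, emphasis, and a pause (phoneme spellings are illustrative):

```xml
<speak>
  That is a <emphasis level="strong">
  <phoneme alphabet="ipa" ph="ˈwɪkɪd">wicked</phoneme></emphasis>
  <phoneme alphabet="ipa" ph="smɑːt">smart</phoneme> idea.
  <break time="400ms"/>
  Let's get some <phoneme alphabet="ipa" ph="ˈtʃaʊdə">chowder</phoneme>.
</speak>
```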
If you’re interested in learning more about voice UIs, check out the following posts: