Teaching Alexa to speak with a Boston accent

Boston, MA

The Boston Accent

The Boston accent is more than just dropping the “R” sound everywhere except before a vowel. It is the local accent of Eastern New England English. What better way to delight an entire user base than by speaking their language? To speak “Boston English”, we need a language for speech. Enter Speech Synthesis Markup Language, or SSML for short.

A Language for Speech

SSML helps us generate speech from text. A Text-to-Speech (TTS) system that supports SSML uses the markup as hints for producing lifelike speech. To generate natural-sounding voices, SSML can control the following:

  • Prosody: the rhythm (timing) and intonation (pitch) of speech
  • Breaks: pauses in speech
  • Emphasis: adjusted rate and volume of speech
  • Phonemes: phonetic pronunciations
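To make the four controls above concrete, here is a minimal sketch in JavaScript that assembles a small SSML document using one element for each. The helper name `buildBostonSsml`, the sentence, the attribute values, and the IPA transcription of "car" are all illustrative assumptions, not a definitive rendering of the accent.

```javascript
// Hypothetical helper: builds an SSML string that touches each control
// from the list above. All values are illustrative assumptions.
function buildBostonSsml() {
  return [
    "<speak>",
    // Prosody: slow the rate slightly and lower the pitch.
    '<prosody rate="95%" pitch="-5%">Park the</prosody>',
    // Phoneme: spell out a non-rhotic "car" in IPA (rough approximation).
    '<phoneme alphabet="ipa" ph="kaː">car</phoneme>',
    // Break: pause for 300 milliseconds.
    '<break time="300ms"/>',
    // Emphasis: stress one word.
    'in <emphasis level="strong">Harvard</emphasis> Yard.',
    "</speak>",
  ].join(" ");
}

console.log(buildBostonSsml());
```

Building the markup as a string keeps it easy to pass to a TTS service later as the text to synthesize.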

Here’s some sample SSML markup:

…this SSML results in the following speech.

A Pronunciation Alphabet

Another option to control the pronunciation of text is the Pronunciation Lexicon Specification (PLS). PLS affords us the ability to create a vocabulary of terms and their respective pronunciations. The pronunciations are spelled out using an alphabet. For our purposes, we will use the International Phonetic Alphabet (IPA).

Here is a sample PLS file that we’ll use to replace words (graphemes) with their IPA pronunciations (phonemes). Matching is case-sensitive.
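As a rough illustration of the format, a lexicon of this kind could be generated from a word list like so. The word list, the IPA transcriptions, and the helper names `escapeXml` and `buildLexicon` are assumptions made for this sketch, not the author's actual lexicon.

```javascript
// Hypothetical word list: graphemes mapped to rough non-rhotic IPA.
// "Harvard" is capitalized on purpose, since matching is case-sensitive.
const bostonWords = {
  Harvard: "ˈhaːvəd",
  car: "kaː",
  yard: "jaːd",
};

// Escape characters that are special in XML text nodes.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build a PLS 1.0 document with one <lexeme> per word.
function buildLexicon(words, lang = "en-US") {
  const lexemes = Object.entries(words).map(
    ([grapheme, phoneme]) =>
      "  <lexeme>\n" +
      `    <grapheme>${escapeXml(grapheme)}</grapheme>\n` +
      `    <phoneme>${escapeXml(phoneme)}</phoneme>\n` +
      "  </lexeme>"
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<lexicon version="1.0"',
    '    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"',
    `    alphabet="ipa" xml:lang="${lang}">`,
    ...lexemes,
    "</lexicon>",
  ].join("\n");
}

console.log(buildLexicon(bostonWords));
```

The `xml:lang` value should match the language of the voice you plan to use, since a lexicon is generally only applied to voices in the same language.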

Amazon Polly

To generate our synthetic Boston accent with SSML and PLS, we will use Amazon Polly. Polly is an Amazon Web Services (AWS) service that converts text into lifelike speech. Hosted in the AWS Cloud, Polly currently provides 47 voices in 24 languages.

Polly can:

  • Accurately process text (e.g. acronyms, abbreviations, numbers, measurements)
  • Infer context (e.g. distinguish words that are spelled the same but pronounced differently)
  • Produce highly intelligible, natural speech

Use cases for Polly range from contact centers, educational materials, and entertainment to assistive technology for blind users.

PLS to Speech

Using the AWS SDK for JavaScript, we can output an MP3 file of some text using our PLS file (“boston”) and Polly. We then store the MP3 file (“harvard.mp3”) on a local file system.
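A minimal sketch of that flow, assuming the AWS SDK for JavaScript v2 and its `putLexicon` and `synthesizeSpeech` calls. The function name `speakWithLexicon`, the region, and the `Joanna` voice are assumptions for illustration. The Polly client is passed in as an argument so the function can be tried against a stub.

```javascript
const fs = require("fs");

// `polly` is expected to be an AWS SDK for JavaScript v2 client, e.g.
//   const AWS = require("aws-sdk");
//   const polly = new AWS.Polly({ region: "us-east-1" }); // assumed region
async function speakWithLexicon(polly, lexiconXml, text, outFile) {
  // Store (or overwrite) the lexicon under the name "boston".
  await polly.putLexicon({ Name: "boston", Content: lexiconXml }).promise();

  // Synthesize plain text, applying the stored lexicon.
  const result = await polly
    .synthesizeSpeech({
      Text: text,
      OutputFormat: "mp3",
      VoiceId: "Joanna", // assumed voice; any voice in the lexicon's language
      LexiconNames: ["boston"],
    })
    .promise();

  // In SDK v2 on Node.js, AudioStream is a Buffer holding the MP3 bytes.
  fs.writeFileSync(outFile, result.AudioStream);
  return outFile;
}

// Usage (needs real AWS credentials, so it is left commented out):
// speakWithLexicon(polly, lexiconXml, "Park the car in Harvard Yard", "harvard.mp3");
```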

…this results in the following:

SSML to Speech

Using the AWS SDK for JavaScript, we can output an MP3 file of some SSML using Polly. We then store the MP3 file (“blinkers.mp3”) in an AWS S3 bucket.
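One way this could look, again assuming the AWS SDK for JavaScript v2. The function name `ssmlToS3`, the voice, and the bucket and key in the usage comment are placeholders. Setting `TextType: "ssml"` tells Polly to parse the input as markup rather than read the tags aloud.

```javascript
// `polly` and `s3` are expected to be AWS SDK for JavaScript v2 clients, e.g.
//   const AWS = require("aws-sdk");
//   const polly = new AWS.Polly();
//   const s3 = new AWS.S3();
async function ssmlToS3(polly, s3, ssml, bucket, key) {
  // Ask Polly to treat the input as SSML and return MP3 audio.
  const speech = await polly
    .synthesizeSpeech({
      Text: ssml,
      TextType: "ssml",
      OutputFormat: "mp3",
      VoiceId: "Joanna", // assumed voice
    })
    .promise();

  // Upload the audio bytes to the bucket under the given key.
  await s3
    .putObject({
      Bucket: bucket,
      Key: key,
      Body: speech.AudioStream,
      ContentType: "audio/mpeg",
    })
    .promise();

  return `s3://${bucket}/${key}`;
}

// Usage (needs real AWS credentials and an existing bucket):
// ssmlToS3(polly, s3, "<speak>Use ya blinkah</speak>", "my-boston-bucket", "blinkers.mp3");
```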

…this results in the following:

Lambda → Polly → S3

Upload the following GitHub project as a .zip file to AWS Lambda, after updating creds.json to use your AWS credentials: https://github.com/IvanCampos/AmazonPolly

When this code is triggered, it will upload a Polly-generated speech file to an S3 bucket.

Alexa Skill → Lambda → S3

This S3 bucket’s MP3 file can then be read by AWS Lambda and output by an Amazon Echo with the following code:
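As a hedged sketch of the idea, a Lambda handler can return an Alexa response whose SSML contains an `<audio>` tag pointing at the MP3. The bucket name, the helper names, and the assumption that the object is readable over HTTPS are all illustrative. Alexa also places its own format restrictions on audio clips, so check the Alexa SSML reference before relying on this.

```javascript
// Hypothetical helper: HTTPS URL for an object in an S3 bucket.
function audioUrl(bucket, key) {
  return `https://${bucket}.s3.amazonaws.com/${encodeURIComponent(key)}`;
}

// Build an Alexa Skills Kit response that plays the clip via SSML.
function buildAudioResponse(url) {
  return {
    version: "1.0",
    response: {
      outputSpeech: {
        type: "SSML",
        ssml: `<speak><audio src="${url}"/></speak>`,
      },
      shouldEndSession: true,
    },
  };
}

// Lambda entry point; the bucket and key are placeholders.
// In a Lambda deployment, expose it with: exports.handler = handler;
const handler = async () =>
  buildAudioResponse(audioUrl("my-boston-bucket", "blinkers.mp3"));
```

Returning the response from an `async` handler lets Lambda serialize it as the skill's reply without a callback.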

To have Alexa trigger this code, visit the following site to wire up your Lambda Skill: https://developer.amazon.com/edw/home.html#/skills

If this is your first Alexa Skill, the following documentation should help get you going: https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/overviews/steps-to-build-a-custom-skill

To test your skill inside a web browser, check out https://echosim.io

Bringing it all together

Now we can have some fun with our solution.

Solution Architecture for Boston Accent Alexa Skill

Feel free to mix and match everything we’ve learned so far to generate your own sentences for Alexa. Here’s an example of the Echo speaking with a wicked Boston accent:
