Alexa + Machine Learning = BeerBot

Lessons from engineering a voice interface in the real world

Terren Peterson
A Cloud Guru
9 min read · Nov 25, 2016


“Alexa, ask Beer Bot what beers are available at Midnight Brewery?”

“Give a man a beer, waste an hour. Teach a man to brew, and waste a lifetime.” — Bill Owen

Voice interfaces are taking off, but how advanced are they becoming? Are we at the point where they can act as automated agents, letting us put down our keyboards and have a dialog as if with a friend or trusted colleague?

I’ve set out to test these questions using Alexa by building a custom skill. BeerBot navigates the maze of microbrews to see how conversant, and how accurate, Alexa is in the real world.

The microbrewery movement of the past twenty years has led to an explosion of beer choices. A great source for tracking them is the crowdsourced beer database BreweryDB. It’s a huge collection spanning choices from every state, and its website lets you navigate it, but only with a keyboard.

BreweryDB is updated by beer fans and breweries every day; it just needs a voice interface. Fortunately, BreweryDB exposes APIs for its data, so I’ve built a custom Alexa skill called “Beer Bot” that unlocks this information.

Leveraging Alexa Machine Learning

Amazon has enabled software developers to write custom skills that are published to the Alexa platform. These custom skills act as applications that can be invoked from the voice interface, much like the basic features that come with Alexa.

Here is an example of a phrase that invokes the Beer Bot application.

Alexa, ask Beer Bot what beers are available at Midnight Brewery?

The Alexa Skills Kit establishes the framework for leveraging the powerful machine learning algorithms that Amazon has developed.

A key concept within the platform is teaching the learning algorithms what the expected outcome of a particular phrase “might” be. A simple example illustrates the pattern:

My favorite color is {COLOR_SLOT}

In Alexa terminology, the overall phrase is called an “Utterance”, and what’s in the brackets is referred to as a “Slot”. When someone states “My favorite color is blue” or “My favorite color is red”, both have the same intent; the color is just a variable that is defined in the slot. When modeling the application, the developer establishes the possible choices that might fill the slot (e.g. Blue, Red, Yellow, Pink, Green).
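
To make this concrete, here is a rough sketch of how such a model could be expressed; the intent name, slot name, and values below are illustrative, not taken from an actual skill:

```python
import json

# Illustrative intent schema for the color example. The Alexa Skills Kit
# ingests a JSON model like this along with sample utterances.
intent_schema = {
    "intents": [
        {
            "intent": "MyColorIsIntent",
            "slots": [{"name": "Color", "type": "LIST_OF_COLORS"}],
        }
    ]
}

# Values the developer supplies when defining the LIST_OF_COLORS slot type.
color_slot_values = ["blue", "red", "yellow", "pink", "green"]

# Sample utterances tie spoken phrases to an intent; {Color} marks the slot.
sample_utterances = ["MyColorIsIntent my favorite color is {Color}"]

print(json.dumps(intent_schema, indent=2))
```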

This structure is outlined by the developer as part of writing a custom skill. Once approved by the Alexa team, the model (including custom slots) is ingested into the platform, where it influences the algorithms. This type of teaching is common in machine learning and is useful for establishing patterns.

Navigation of 170k possible words in the English language

Pattern Matching

Once Alexa deciphers the spoken words, it translates them into one of these patterns, then invokes the API provided by the developer, passing along which pattern was uttered and any variables from the slots.
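
As a sketch, the request body the skill receives looks roughly like this, trimmed to the relevant fields (the intent and slot names here are illustrative):

```python
# Trimmed sketch of the JSON Alexa POSTs to the skill's endpoint after
# matching an utterance; only the fields discussed here are shown.
alexa_request = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "GetBeersIntent",  # which pattern was matched
            "slots": {
                "Brewery": {"name": "Brewery", "value": "Midnight Brewery"},
            },
        },
    },
}

intent = alexa_request["request"]["intent"]
brewery = intent["slots"]["Brewery"]["value"]  # "Midnight Brewery"
```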

There are approximately 170k words in the English language (per the most recent Oxford English Dictionary), so this modeling provides structure and patterns based on the context of the likely choices. In the example above, it is context that determines whether someone said “Blue” or “Blew”, and the slot detail establishes that context.

When authoring an Alexa Skill (the custom “app”), the challenge is establishing all the different ways a question or statement could be phrased, then building custom slots for the variables within each utterance.

That’s where the developer’s engineering comes in: taking advantage of the underlying machine learning in the Alexa platform. If the pattern matching isn’t effective, that’s a problem with the machine learning and the platform itself.

For Beer Bot, leveraging the capabilities of the platform includes building custom slots for variables like the microbrewery name, for which there are more than 5,000 different choices. This data is extracted from BreweryDB and pushed into the skill at publishing time so that the machine learning algorithms can learn the likely choices, which then get passed back to the API where the response can be authored.
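
A minimal sketch of that extraction step might look like the following; the paging parameters and response fields are my reading of the BreweryDB v2 API, and the API key is a placeholder:

```python
import requests  # third-party: pip install requests

API_KEY = "your-brewerydb-key"  # placeholder, not a real key

def fetch_brewery_names():
    """Page through BreweryDB and collect every brewery name."""
    names, page = [], 1
    while True:
        resp = requests.get(
            "https://api.brewerydb.com/v2/breweries",
            params={"key": API_KEY, "p": page},
        )
        body = resp.json()
        names += [b["name"] for b in body.get("data", [])]
        if page >= body.get("numberOfPages", 1):
            return names
        page += 1

# One name per line is the format the Alexa developer console accepts
# for custom slot values.
if __name__ == "__main__":
    with open("LIST_OF_BREWERIES.txt", "w") as f:
        f.write("\n".join(fetch_brewery_names()))
```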

Your friendly Beer Bot

Thoughtful Code Matters

Once the processing is done on Alexa, an API call is made to a micro-service developed by the author of the skill. The quality of the user interaction depends heavily on how flexibly the skill is written, and on how much effort has been put into understanding the difference between a visual/keyboard interface and a voice-driven one.
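
A minimal handler for such a micro-service might dispatch on the intent name like this; it’s a sketch of the general shape, not the skill’s actual code:

```python
# Sketch of an AWS Lambda handler for a skill endpoint. The intent name
# and helper functions are illustrative.

def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "IntentRequest":
        intent = request["intent"]
        if intent["name"] == "GetBeersIntent":
            brewery = intent["slots"]["Brewery"].get("value")
            return speak(beers_for(brewery))
    return speak("Welcome to Beer Bot. Which brewery are you interested in?")

def speak(text):
    """Wrap plain text in the response envelope Alexa expects."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

def beers_for(brewery):
    # Placeholder: look the brewery up in BreweryDB and format the list.
    return f"Here are the beers I found for {brewery}."
```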

Here are some lessons I’ve learned so far. I’m very interested to hear what others have experienced; please drop a comment.

Lesson #1 — Words get dropped when context has been established.

If we’re having a conversation about beers, in a natural dialog we tend not to be legalistic about a brewery’s exact name. For example:

What beers does Legend have?

Now technically, the full name in the beer database and custom slot is “Legend Brewing Company”, but to be natural, we should be prepared for common words like Company or even Brewing to be left off during the dialog and still be able to formulate a proper response. It’s extra code to write, but well worth it to improve the matching of the user’s intent.
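
A sketch of that tolerant matching, assuming a hand-picked list of suffix words to ignore:

```python
# Strip common trailing words so "Legend" still matches
# "Legend Brewing Company". The suffix list is an assumption.
COMMON_SUFFIXES = {"brewing", "brewery", "brewpub", "company", "co", "beer"}

def normalize(name):
    words = name.lower().replace(".", "").split()
    while words and words[-1] in COMMON_SUFFIXES:
        words.pop()
    return " ".join(words)

def find_brewery(spoken, breweries):
    target = normalize(spoken)
    return [b for b in breweries if normalize(b) == target]

# find_brewery("Legend", ["Legend Brewing Company", "Legion Brewing"])
# -> ["Legend Brewing Company"]
```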

Legend Brewing Company in Richmond, Virginia

Lesson #2 — We use shortcuts for well known locations and events.

Formal names aren’t always used, and in a dialog we assume that a certain amount of shared knowledge exists. For example, here’s a question that comes up with the bot:

What microbreweries are in Seattle?

Ideally the user would provide the city/state combination, and in a text/keyboard-driven world, we would include dropdown boxes that require both to be provided. A voice interface is more open-ended, and a well-written skill should be flexible enough to handle intent like this.

If a well-known city is provided, the skill should fill in the gaps rather than re-prompt the user for more detail.
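
One way to sketch that gap-filling, with an illustrative lookup table (the real skill presumably derives this from BreweryDB’s location data):

```python
# Map well-known cities to their states instead of re-prompting.
WELL_KNOWN_CITIES = {
    "seattle": "WA",
    "portland": "OR",
    "richmond": "VA",
}

def resolve_location(city, state=None):
    if state:
        return city, state
    inferred = WELL_KNOWN_CITIES.get(city.lower())
    if inferred:
        return city, inferred  # fill in the gap silently
    return None                # caller should re-prompt for the state
```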

Beautiful skyline view of Seattle {Washington}

Lesson #3 — We don’t always answer the exact question posed.

Taken to an extreme, this can cause friction in a human-to-human conversation, but it’s not unexpected in the right context, where we just “go with the flow”.

When navigating the 5,000+ microbreweries, the dialog often turns to finding one by geography. The narrative within the bot goes along the following lines:

Bot: What city would you like information for?

Response: What are the microbreweries in Delaware?

This is where humans tend to apply good judgment given context, and understand where a dialog goes depending on specifics. Computers tend to want an explicit response (think of filling in data to a field). So for natural interactions, the skill needs to be able to handle responses that fall within a range of outcomes.

There are only ten microbreweries listed in all of Delaware, whereas there are twenty-three in Charlotte, North Carolina. For a small state, going down to the city level isn’t really needed, and when a state name is given, the more natural response is to just provide the list rather than an error message that Delaware is not a city.
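
A sketch of that judgment call; the threshold and data structures are assumptions, not the skill’s actual values:

```python
US_STATES = {"delaware", "virginia", "north carolina"}  # abbreviated here

MAX_READABLE = 15  # roughly how many names fit in a comfortable answer

def handle_location_answer(place, by_state, by_city):
    """Accept either a city or a state when a city was asked for."""
    place = place.lower()
    if place in US_STATES:
        listing = by_state.get(place, [])
        if len(listing) <= MAX_READABLE:
            return listing  # small state: just read the whole list
        return None         # big state: re-prompt, a city really is needed
    return by_city.get(place, [])
```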

Lesson #4 — Nobody wants to hear a five-minute monologue response.

A bot can get rather verbose in a response, particularly when the information would normally appear on a screen where it could be scanned visually. Some of the larger microbreweries provide an example:

What are the beers for Great Lakes Brewing?

Given that Great Lakes makes more than one hundred different beers, the math on how long a comprehensive response would take becomes brutal.

Even when parsing out redundant words to get a verbal reading of a beer name down to around two or three seconds, the reading of one hundred beer names and styles would take as much as five minutes.

The skill currently limits how many beers are returned, which caps how long the response can run.
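
A sketch of that cap; the limit and the follow-up phrasing are illustrative:

```python
MAX_BEERS_SPOKEN = 10  # assumed limit; the actual number isn't stated

def beers_response(brewery, beers):
    """Read only the first few beers and summarize the rest."""
    spoken = ", ".join(beers[:MAX_BEERS_SPOKEN])
    text = f"{brewery} has {len(beers)} beers. Here are a few: {spoken}."
    if len(beers) > MAX_BEERS_SPOKEN:
        text += " Say 'more' to hear the rest."
    return text
```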

Some of the one hundred choices with a top Microbrewery

Can we drop our keyboards yet?

The new version of the skill was just published, so I’m looking to see how the analytics highlight usage patterns, as well as what feedback gets posted in the skill store.

In testing the custom slot, I was very impressed by how well the machine learning was able to match what was in it. For example, more than fifty different microbreweries have names starting with the word “Black”, and Alexa was able to accurately decipher each one.

Give BeerBot a Try

Here are ten choices with similar names, located in different parts of the country: Black Acre Brewing, Black Bottle Brewing, Black Couch Brewing, Black Fox Meadery, Black Lotus Brewing, Black Star Farms, Black Swan Brewpub, Black Tooth Brewery, Blackberry Farm Brewery, and Blackstone Brewing.

Just say something like “What’s on tap at Black Acre Brewing?” and see how it works for you.

Logo for Black Acre Brewing Co. from Indianapolis, Indiana

In my testing of the use case above, Alexa gets all ten correct and properly returns results for each one. Given that some of these words aren’t common in the English language, such as Meadery, Lotus, and Acre, and that some names could be split into different words while sounding the same (Blackberry vs. Black Berry), the machine learning is quite effective.

My assessment is that the learning algorithms are accurate and ready for a bot model — they just need to be correctly exploited by the software developers.

A Cloud Guru Will Drink to That!

If you are interested in developing your own Alexa custom skill, A Cloud Guru has great courses that cover all the core AWS services, including Alexa development.

For a more thorough writeup on how the BeerBot skill works, be sure to read the article on Hackster.io.

Terren Peterson is an experienced technology executive with over twenty years of experience in consulting, start-up, and large corporate environments. He is currently the VP of Cloud Engineering for the Retail and Direct Bank Business at Capital One.

Terren is currently developing interactive voice applications using the Alexa platform. He has created multiple Alexa skills. Most recently, he integrated Alexa Voice Service with a Raspberry Pi to create Roxie, the voice-activated pitching machine that won first place in the Best ASK with Raspberry Pi segment of Alexa’s Internet of Voice Challenge on Hackster.io. Terren is now experimenting with the analytics capabilities of Alexa to understand and improve skill usage.

Terren holds a Bachelor of Science in Electrical Engineering from the University of Illinois at Urbana-Champaign. He was the founder of the Digital Campus Lab for Capital One at the UIUC Research Park, and serves on the board of the Hoeft Technology & Management Program. Terren also holds both Architect and Developer AWS Certifications.
