A/B-Testing your Alexa Skill

Florian Hollandt
Published in #VoiceFirst Games
Aug 23, 2019

In this article I introduce my architecture for A/B-testing Alexa Skills, along with an open-source framework-independent proof of concept with detailed setup instructions. I hope this will enable voice developers to better align their Skills with customer preferences after launch.

Maximizing customer value

There are different possible motivations to build a voice app, such as the joy of development and earning bragging rights / developer perks, but the foremost one should be to provide value to customers. Even if the main motivation is to profit from a Skill by monetization or increasing a brand’s reach, great customer value is how to get there.
The value a voice app provides is a result of both the strength of the use case, i.e. how frequently and how much users benefit from using the Skill, and the quality of its execution, i.e. the quality of the voice user experience.
The quality of a Skill’s execution has numerous (fractal) dimensions, some of which are unique to certain verticals (like game mechanics in case of voice games). This article is not about how to optimize a particular dimension of a Skill’s execution, but about the general approach to optimizing anything.

Qualitative vs quantitative methods

So far, most of the resources on how to design a voice app focus on qualitative approaches, such as Wizard-of-Oz prototyping (in early design stages), user testing (in later design stages and potentially after launch) and monitoring reviews (after launch).
These approaches undoubtedly have their merits, especially for understanding how customers interact with a Skill. Teams with larger budgets use them routinely, but one charming thing about them is that even very small test groups can already surface UX issues.

On the other hand, quantitative approaches are about measuring user behavior, aggregating data, and then deriving actionable results. Such approaches are far less common; the most widespread one is a kind of cohort analysis in which the behavior of users before and after the launch of a new feature is compared, typically using first-party analytics data from the Alexa Developer Console.
As someone who is familiar with both data science (I studied and worked in bio-informatics) and digital marketing (I worked in email marketing before going all in on Alexa), I love the empirical beauty of A/B-testing: It’s a controlled experiment where two (or more) sets of users from the same statistical population use versions of the same app that differ in one aspect only.
A/B-tests require more resources for planning and setup, but they pay off by delivering a statistically sound basis for making data-driven design decisions.
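To illustrate what ‘statistically sound’ means in practice: once the observations of both groups are aggregated (say, how many users in each group completed a purchase), a simple two-proportion z-test can tell whether the observed difference is likely to be more than noise. The following sketch is independent of the POC, and the numbers in it are made up:

```typescript
// Minimal two-proportion z-test: did version B convert significantly
// better (or worse) than version A? The numbers below are hypothetical.
function twoProportionZTest(
  successesA: number, totalA: number,
  successesB: number, totalB: number
): number {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  // Pooled proportion under the null hypothesis "A and B convert equally"
  const pPooled = (successesA + successesB) / (totalA + totalB);
  const standardError = Math.sqrt(
    pPooled * (1 - pPooled) * (1 / totalA + 1 / totalB)
  );
  return (pB - pA) / standardError;
}

// Example: 100 of 1000 users converted with version A, 140 of 1000 with B.
// |z| > 1.96 corresponds to p < 0.05 in a two-sided test.
const z = twoProportionZTest(100, 1000, 140, 1000);
console.log(`z = ${z.toFixed(2)}, significant: ${Math.abs(z) > 1.96}`);
```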

Architecture of an A/B-test system

I had long wished to use A/B-testing when developing voice apps, but wasn’t sure how to approach it in terms of architecture and infrastructure. Having finally taken some time for it over the last few days, I can now present a scheme of an A/B-tested Alexa Skill, along with an open-source proof of concept (POC).
Before going into the details, here are some key constraints and requirements that form the boundaries of my solution:

  • There’s a routing system that consistently assigns users to versions, so that the same user always gets the same version
  • The routing system allows a variable fraction of users for the different versions (e.g. 80% get version A and 20% get version B)
  • The routing system should be easily adaptable for more than two splits (like an A/B/C-test)
  • The routing system should not be built into the business logic of the app (to keep the code clean)
  • There needs to be a way to store observations, in order to aggregate data and apply inferential statistics
  • The system should be re-usable, such that after one A/B-test is finished, the next one can start. For re-usability, it’s crucially important that users are assigned to versions independently per test (e.g. this test’s group A may not be the same as the next test’s group A). A sketch of such an assignment scheme follows below this list.
  • Optimally, it should be easy to deploy and maintain
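As a concrete illustration of the first, second and second-to-last requirement, here is a minimal sketch of how a router could deterministically assign users to versions. It is not taken from the POC, and the names (e.g. experimentId, splitFractionA) are my own placeholders:

```typescript
import { createHash } from 'crypto';

// Deterministically map a user to version 'A' or 'B' for a given experiment.
// Hashing the userId together with the experimentId gives the same user a
// stable assignment within one test, but an independent one in the next test.
export function assignVersion(
  userId: string,
  experimentId: string,
  splitFractionA = 0.8 // e.g. 80% of users get version A
): 'A' | 'B' {
  const hash = createHash('sha256')
    .update(`${experimentId}:${userId}`)
    .digest();
  // Use the first 4 bytes of the hash as a number between 0 and 1
  const bucket = hash.readUInt32BE(0) / 0xffffffff;
  return bucket < splitFractionA ? 'A' : 'B';
}

// Example: the same user always lands in the same group for this experiment
console.log(assignVersion('amzn1.ask.account.XYZ', 'welcome-prompt-test'));
```

Extending this to an A/B/C-test is simply a matter of comparing the bucket value against cumulative split fractions instead of a single threshold.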

So finally, here’s the blueprint of my A/B-testing architecture:

Blueprint of an A/B-testing system for an Alexa Skill (made with cloudcraft.co).

Let’s investigate the components first, and then the interactions:

  • First, there are the Alexa components (Alexa built-in device, Alexa Voice Service, and Alexa Skill). These are configured as usual, with the Skill’s endpoint configured with a Lambda ARN.
  • The ‘Router’ Lambda function, which serves as the routing system described above. It needs to have the Alexa Skills Kit as an allowed trigger.
  • Two separate Lambda functions, which run versions A and B of the Skill’s backend. Those do not need to have the Alexa Skills Kit as an allowed trigger.
  • A result store, which can be any database or online spreadsheet service. In this case, and because it’s how I built the POC, it’s represented by a DynamoDB table (see the sketch below this list).
  • Shared resources like a DynamoDB table for user data. This is optional, and the different versions could even use their own resources if that’s part of the A/B-test, e.g. if version A distributes content with S3 and version B with CloudFront.
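To give an impression of what goes into the result store, here is a minimal sketch of how a version’s Lambda could record an observation in DynamoDB. This is not the POC’s exact code; the table and attribute names (abTestResults, experimentId, metric) are hypothetical placeholders:

```typescript
import { DynamoDB } from 'aws-sdk';

const documentClient = new DynamoDB.DocumentClient();

// Store one observation, tagged with the version that produced it,
// so that results can later be aggregated and compared per version.
async function storeObservation(
  userId: string,
  version: 'A' | 'B',
  metric: string, // e.g. 'sessionLength' or 'purchaseCompleted'
  value: number
): Promise<void> {
  await documentClient.put({
    TableName: 'abTestResults', // hypothetical table name
    Item: {
      observationId: `${userId}-${Date.now()}`,
      experimentId: 'welcome-prompt-test',
      version,
      metric,
      value,
      timestamp: new Date().toISOString(),
    },
  }).promise();
}
```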

Knowing these components, the interactions become quite apparent. The arrows point in the direction in which the input propagates through the system, but of course the response data also flows in the opposite direction.

  • The blue arrows indicate how a spoken user utterance is transformed into a JSON-structured Alexa Skill request
  • The ‘Router’ Lambda receives the request, uses the user ID (or, for demonstration purposes in the POC, the session ID) to determine whether the request should be routed to version A or B, and sends the request payload to the corresponding Lambda function along the crimson arrows (a sketch of this forwarding step follows below this list).
  • Both versions’ Lambdas (but only one per request) apply their business logic to the request to construct a meaningful response. In the process, they may use shared resources along the orange arrows.
  • The versions’ Lambdas send their JSON-structured response back to the ‘Router’ Lambda, which in turn sends it back to the Alexa Voice Service. This happens along the reverse direction of the crimson and blue arrows.
  • The versions’ Lambdas measure pre-defined aspects of the user behavior and store these observations in the result store along the violet arrows, along with an identifier of the respective version.
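To make the routing step more tangible, here is a minimal sketch of what the ‘Router’ Lambda’s handler could look like. It is not the POC’s exact code: the version Lambdas’ function names are hypothetical placeholders, and assignVersion is the assignment sketch from further above.

```typescript
import { Lambda } from 'aws-sdk';
import { assignVersion } from './assignVersion'; // the assignment sketch from above

const lambda = new Lambda();

// The 'Router' Lambda: pick a version for this user, invoke that version's
// Lambda synchronously with the unchanged Alexa request, and pass its
// response back to the Alexa Voice Service.
export const handler = async (request: any): Promise<any> => {
  // The POC uses the session ID instead, for demonstration purposes
  const userId = request.session.user.userId;
  const version = assignVersion(userId, 'welcome-prompt-test');

  const functionName = version === 'A'
    ? 'mySkillBackendVersionA' // hypothetical function names
    : 'mySkillBackendVersionB';

  const result = await lambda.invoke({
    FunctionName: functionName,
    InvocationType: 'RequestResponse', // synchronous: wait for the response
    Payload: JSON.stringify(request),
  }).promise();

  return JSON.parse(result.Payload as string);
};
```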

This architecture satisfies the constraints and requirements from above. It comes with the additional benefit of being completely independent of the versions’ implementation, e.g. whether they’re built with an official SDK for Node.js or Python, or with the Jovo framework.

Disadvantages of this architecture

This setup has some disadvantages and costs, which I hope are offset by the value of the results, and can possibly be mitigated. Here are those that I am currently aware of:

  • Additional latency: The biggest latency comes from the respective version’s Lambda, but the router Lambda adds another couple of hundred milliseconds on top. This can become problematic if both Lambdas are cold and have to be booted up.
    One approach to mitigating this is to keep all three Lambdas ‘warm’ by sending them scheduled dummy requests (see the sketch below this list).
  • Higher consumption: Despite Lambda being very cheap and having a generous free tier, this setup more than doubles the resources consumed per request (because the router Lambda calls the respective version’s Lambda synchronously, i.e. it waits for the answer, and the developer is billed for the waiting time).
    I’m not aware of a way to mitigate this effect.
  • Higher complexity: If an error occurs in the business logic, tracking it down becomes more difficult, since it could have occurred in any of the three Lambdas.
    This can be mitigated using CloudWatch Alarms.
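Regarding the latency mitigation mentioned above, here is a minimal sketch of the warm-up approach, assuming a scheduled CloudWatch Events rule that invokes each of the three Lambdas every few minutes with a custom payload like { "warmup": true }. The payload shape and the schedule are my own assumptions, not part of the POC:

```typescript
// Inside each of the three Lambdas' handlers: short-circuit scheduled
// warm-up pings before any business logic (or downstream invocation) runs.
export const handler = async (event: any): Promise<any> => {
  if (event.warmup === true) {
    // Keeps the execution environment warm without touching the Skill logic
    return { status: 'warm' };
  }
  return handleRequest(event); // the Lambda's regular request handling
};

// Placeholder for this Lambda's actual logic (routing or Skill backend)
async function handleRequest(event: any): Promise<any> {
  return {};
}
```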

Request for comments

While I am confident that this is a functional and robust solution for A/B-testing voice apps, I am sure that there are ways in which it can be improved that I am not aware of. If anything comes to your mind, please reach out and let me know!

If you want to try it out for yourself, you can find the repository for a proof-of-concept system here, with detailed setup instructions.

Conclusion

I presented an architecture for an A/B-testing system for Alexa Skills, which has a few drawbacks, but which I think is generally suitable for production systems.
One possible extension is a way to manage different A/B-tests, to keep track of past, ongoing and planned experiments.

I hope that this setup establishes A/B-testing as a routinely used tool for voice developers to generate data-driven design decisions and ultimately build more valuable Skills for customers.

Thanks for reading! :)
