Hey Alexa, play *my* Spotify playlists

Aural space is shared, and should be treated as such

Blake Walsh
10 min read · Aug 11, 2017

Earlier this year, Walt Mossberg wrote his final weekly column, “The Disappearing Computer”, and I highly recommend giving it a read. The gist is that eventually, the interfaces we interact with will become increasingly ambient and device-agnostic, and the services we engage with daily will come to us, rather than requiring we find the appropriate app on our phone’s home screen. The best technology is that which gets out of your way and does not make its presence known.

This all sounds great in theory: the notion that our cloud-based assistants will follow us from our desktop computers to our cars to our refrigerators, ready to anticipate our every need. However, there’s one big problem that no company working in this space has adequately addressed: the problem of ownership, the who’s who of Voice User Interfaces (VUIs).

Devices and cloud services tied to accounts are deeply personal (your email account, your iPhone, etc. are strictly “yours”), but what about “smart” speakers set atop a table in a shared common space? Anyone in the room can talk to these devices, but these internet-connected cylinders are simply not well equipped to serve anyone other than their direct owners.

Echo on the left, Google Home on the right — via RoboticsTrends

My Spotify, or your Spotify?

If I say, “Alexa, play Pink Floyd” to an Echo device in a room, music will start playing from whatever the preferred music source is, whether that is the default of Amazon Prime Music or Spotify (or another alternative) the owner has switched to. The music plays from the Spotify account of the device’s primary owner, so if I am not that person, the music I just started will not show up in my Spotify app’s recent history, even though I was the one who issued the command. (Note: I can say “Alexa, Spotify Connect” to make the Echo discoverable in the Spotify app on my phone as a valid audio source, but that’s hardly elegant.)
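
To make the attribution problem concrete, here is a rough sketch of how this routing appears to behave from the outside. Everything here (the `Device` type, the field names, `handle_play_intent`) is my own illustration, not Amazon’s actual API:

```python
# Hypothetical sketch: how an Echo-style device resolves a "play" intent.
from dataclasses import dataclass

@dataclass
class Device:
    owner_id: str               # the account the device was set up with
    default_music_service: str  # e.g. "amazon_prime_music" or "spotify"

def handle_play_intent(device: Device, query: str) -> dict:
    """Route a voice command to a music service.

    Note what is missing from the inputs: the *speaker*. Playback is
    attributed to the device owner regardless of who actually spoke.
    """
    return {
        "service": device.default_music_service,
        "account": device.owner_id,  # always the owner, never the speaker
        "query": query,
    }

echo = Device(owner_id="alice", default_music_service="spotify")
print(handle_play_intent(echo, "Pink Floyd"))
# {'service': 'spotify', 'account': 'alice', 'query': 'Pink Floyd'}
# If Bob issued the command, it still lands in Alice's Spotify history.
```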

Flipping this around: if I am the owner of the device and another person starts issuing Spotify commands to my Echo, that playback history becomes part of my personal Spotify account, which Spotify may then take as a signal of my tastes. This in turn feeds into its algorithm-generated playlists such as Discover Weekly and the Daily Mixes, along with any other recommendations it makes.

(Aside: when services like Spotify claim to learn about you from your habits without providing a way to go back and indicate that you do not want a particular track/album/episode taken as a signal of interest, it can be anxiety-inducing if you want your recommendations to maintain some modicum of relevance.)

Don’t Leave Friends Unattended with Alexa

On this note, a common scenario at a gathering with friends at home: someone sees an Echo device and starts challenging it with commands such as “turn off the lights”, “buy 3,000 rolls of toilet paper”, “play William Hung”, and so on. While funny in a shallow sense, this exposes the core issue with these devices: “authority” over the voice-activated devices in your home is rooted in nothing more than being within earshot.

This is not to say that Amazon and others are standing completely still in the VUI space. The Echo Show was announced earlier this year; however, the interaction model involved is still very much single-user, per The Verge:

Amazon is still sticking to a single-user system for the Echo Show. So while the device has improved functionality in the form of calendars and taking calls, if you have multiple Echo Shows and Echo speakers, it’s still an all-or-nothing proposition. So when someone calls you over the Alexa messaging system, all your devices will ring, and anyone in your house will have access to your shopping list.

You could argue that these devices are really just a modern-day, more capable analog of the landline home telephone, in that anyone in the room could use them. Of course, this comparison breaks down on the question of ownership as it applies to capabilities beyond basic communication. The fact that friends can call you via Alexa’s calling and messaging platform is not a problem, but accessing and modifying data in your online accounts is where things get convoluted. “Alexa, what is my Capital One credit card balance?” is absolutely something that only you should be able to ask…

My work calendar, my personal calendar, or one of yours?

Much of the data that users of technology want access to includes things that are inherently personal in nature, such as one’s own calendar — as of March this year, Echo devices can rattle off events from an Office 365 account’s calendar. They can also connect to Google Calendars, as well as those on Outlook.com. However, you as a user likely care only about “what is on your calendar”, rather than “what is on your calendar according to data source X versus data source Y versus data source Z”.

In a given moment, you can be doing exactly one thing, so it stands to reason that VUIs should abstract away the complexity of our digital existence rather than require us to maintain a mental model and mapping of all the disparate internet services we use.
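
If I were sketching the abstraction I actually want, it would be something like the merge below; the source names and event shapes are assumptions for illustration, not any vendor’s real data model:

```python
# Sketch: merge events from every connected calendar source and answer
# "what's on my calendar?" as one agenda, regardless of backend.
from datetime import datetime

SOURCES = {
    "office365":   [{"start": datetime(2017, 8, 11, 9),  "title": "Standup"}],
    "google":      [{"start": datetime(2017, 8, 11, 19), "title": "Dinner with Bob"}],
    "outlook.com": [{"start": datetime(2017, 8, 11, 12), "title": "Lunch"}],
}

def whats_on_my_calendar() -> list:
    """Flatten all sources into one chronological agenda; the user never
    has to say which backend an event lives on."""
    events = [e for source in SOURCES.values() for e in source]
    events.sort(key=lambda e: e["start"])
    return [f'{e["start"]:%H:%M} {e["title"]}' for e in events]

print(whats_on_my_calendar())
# ['09:00 Standup', '12:00 Lunch', '19:00 Dinner with Bob']
```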

What if multiple people under the same roof care about where they need to be and when? If you have roommates or a significant other living with you, the fact that these ambient devices have a single designated owner makes them more complicated to deal with than the promised “internet things you can talk at”. Saying “Alexa, add dinner with Bob at 7 pm to my calendar” simply cannot be followed by a round of guesses as to which user you are, and then which of your accounts (work vs. personal, etc.) the event should be saved to. By the time you slog through that mental flowchart, you might as well have whipped your phone out of your pocket and punched in the information (or used your phone’s assistant) in a way you can be sure is correctly recorded.

Enter Google

To pivot away from talking exclusively about Amazon devices (toward which I admit a personal bias, given they are what I have the most experience with), let’s look at the Google Home by focusing on what it does differently from the Echo ecosystem.

Google Home can tell the difference between voices (up to six in a household) by effectively “fingerprinting” the way you speak. This is definitely a step in the right direction toward acknowledging that VUI devices are part of the home rather than one person’s technology arsenal, since this voice fingerprinting can, for instance, map a music playback intent to the correct account on the correct playback service. If I prefer Spotify and someone else prefers Google Play Music and either one of us says, “Play Green Day”, the Google Home can attribute that playback and its history to the right person’s account on the right service.
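
In rough Python, the difference from the Echo sketch above is this; `identify_speaker()` stands in for Google’s actual voice-matching model, and all the names here are made up:

```python
# Sketch: a voice fingerprint selects the account, which selects the service.
from typing import Optional

PROFILES = {
    "alice": {"music_service": "spotify"},
    "bob":   {"music_service": "google_play_music"},
}

def identify_speaker(audio_fingerprint: str) -> Optional[str]:
    """Stand-in for voice matching; returns a user id, or None for guests."""
    return {"fp_alice": "alice", "fp_bob": "bob"}.get(audio_fingerprint)

def play(audio_fingerprint: str, query: str) -> str:
    user = identify_speaker(audio_fingerprint)
    if user is None:
        return f"Playing '{query}' without writing history to anyone's account."
    service = PROFILES[user]["music_service"]
    return f"Playing '{query}' via {service}, logged to {user}'s history."

print(play("fp_alice", "Green Day"))  # routed to Alice's Spotify
print(play("fp_bob", "Green Day"))    # routed to Bob's Google Play Music
```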

Building on this, beyond having some number of verified household members, voice recognition could be used to wall off certain features from unrecognized voices, such as those of guests visiting your home. For example, if I am at a friend’s house, perhaps the Echo or Home device would allow me (as an unrecognized user) to play music from my host’s Spotify account (and Spotify could go the extra mile of not letting that guest playback pollute the account’s listening profile), but disallow me from asking about calendar events and credit card balances.
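
A sketch of what that walling-off might look like; the feature tiers below are my assumption, not any shipped policy:

```python
# Sketch: unrecognized voices keep benign capabilities but lose access
# to personal data.
GUEST_ALLOWED = {"play_music", "weather", "timers"}
OWNER_ONLY = {"calendar", "shopping", "credit_card_balance"}

def authorize(feature: str, recognized: bool) -> bool:
    if feature in GUEST_ALLOWED:
        return True  # anyone within earshot may use these
    return recognized and feature in OWNER_ONLY

assert authorize("play_music", recognized=False)                # guests can DJ...
assert not authorize("credit_card_balance", recognized=False)   # ...but not bank
assert authorize("calendar", recognized=True)                   # owners get it all
```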

However, Google still hasn’t addressed the fact that work and personal accounts are two halves of the same person’s identity that have yet to be reconciled. Many people use Gmail accounts for their personal email and calendaring in tandem with Google accounts provisioned by their workplace. As a user of Google services, I span a work and a personal existence, and a device that makes me pick which universe I can get information from is a fundamentally broken experience.

What about Apple? And Cortana-powered speakers?

While there are other internet-connected, voice-assistant-based devices on the market, and many more are coming, they all suffer from the same aforementioned problems.

HomePod, via Apple

Apple’s HomePod is being pitched primarily as a quality speaker for music in your home that happens to do some things via Siri, a limited subset of what Siri on your iPhone can do. Why holler commands across the room at a device you know cannot do everything your phone can, with no way of knowing (without first trying) which commands are and are not supported?

The Harman Kardon Invoke — best feature? You can use Skype without having to look at it.

As far as the “Cortana-powered” speakers on the market go, there is little reason to believe the interaction model will improve on anything we have already seen, especially since Microsoft has not created a first-party speaker as a reference device for manufacturers to emulate, the way Amazon has. Call me pessimistic, but I find it highly unlikely that the Harman Kardon Cortana speaker, when it launches, will give the others a run for their money when Amazon and Google are invested in building out their entire hardware and software stacks in unison.

Lots of work to do

The problem of ownership is certainly not the only issue with these sorts of devices; they are also generally not very good at taking vague natural-language queries and inferring the precise intent behind them. So much so that SNL did a skit on what a revision to the Amazon Echo might look like were its target audience on the older side:

If at 10 pm I say, “Alexa, wake me up at 6”, I get a verbose and annoying response: “Is that six in the morning, or six in the evening?” The Echo devices are too cautious about the flowchart of options, requiring hand-selection at every step, when a far simpler mitigation would be to notice that the command was issued at night and that the user therefore probably wants to be woken the next morning. The Echo could then say, “Your alarm has been set for 6 am tomorrow”, at which point you would know whether it was set successfully or whether you need to issue a more precise command in the odd event you actually want to be woken at 6 pm.
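
That mitigation is simple enough to sketch: interpret a bare hour as the next time that hour occurs, then confirm aloud so the user can correct it. This is my own toy implementation, not how Alexa actually works:

```python
# Toy sketch of the "simpler mitigation" for bare-hour alarm commands.
from datetime import datetime, timedelta

def resolve_alarm(hour: int, now: datetime) -> datetime:
    """Return the next occurrence of the given clock hour after `now`."""
    for days in (0, 1):
        day = now + timedelta(days=days)
        for h in (hour % 12, hour % 12 + 12):  # the am slot, then the pm slot
            candidate = day.replace(hour=h, minute=0, second=0, microsecond=0)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: some slot within 24h is in the future")

now = datetime(2017, 8, 11, 22, 0)  # command issued at 10 pm
print(resolve_alarm(6, now))        # 2017-08-12 06:00 -> "set for 6 am tomorrow"
```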

Another area for improvement is the ability to refer back to earlier lines of a conversation in a natural fashion. For example:

  • Me: “Who was the 7th president of the United States?”
  • Echo: “Andrew Jackson”
  • Me: “When was he born?” [notice, I used a pronoun since it should be clear who I am talking about]
  • Echo: “March 15, 1767” [nice, this went better than expected so far]
  • Me: “Who was president after him?” [and here is where things fall apart]
  • Echo: “Here’s some information I found on Wikipedia: *random factoid here from the year 2015*”
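
Handling that last question does not require much machinery, just remembering the entity from the previous answer. A toy sketch, with an obviously oversimplified knowledge base and matcher of my own invention:

```python
# Sketch: keep the entity from the last answer so pronouns like "he"/"him"
# in follow-up questions can be resolved against it.
KB = {
    "Andrew Jackson": {"born": "March 15, 1767", "successor": "Martin Van Buren"},
}

context = {"last_entity": None}

def ask(question: str) -> str:
    q = question.lower().rstrip("?")
    tokens = q.split()
    if "7th president" in q:
        context["last_entity"] = "Andrew Jackson"
        return "Andrew Jackson"
    entity = context["last_entity"]
    if entity and ({"he", "him"} & set(tokens)):  # pronoun -> last entity
        if "born" in tokens:
            return KB[entity]["born"]
        if "after" in tokens:
            context["last_entity"] = KB[entity]["successor"]
            return KB[entity]["successor"]
    return "Here's some information I found on Wikipedia..."  # the failure mode

print(ask("Who was the 7th president of the United States?"))  # Andrew Jackson
print(ask("When was he born?"))                                 # March 15, 1767
print(ask("Who was president after him?"))                      # Martin Van Buren
```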

This is something that can be expected to improve over time, which brings focus to one of the key advantages (and disadvantages) of these VUI home speakers: the interface is completely audio-based.

Why does this matter? It means that automatic software updates can give your devices new capabilities overnight, without having to find screen real estate for a new button for each new feature. With a traditional keyboard-and-mouse or touch interface, the number of commands that can be issued from a screen is limited by what can be shown. Apps get more complicated as you add capabilities tied to visual interfaces, whereas VUIs actually get *simpler* with more capabilities and paths.

When dealing with VUIs, you can simply ask for what you want, and this is why the idea of ambient computing in the home, car, and office is such a powerful vision.

Today, however, the set of commands that are understood is very limited, as are the types of data sources that can be wired up to meet those needs and inquiries. There is no guarantee that a command you issue will be in the lexicon of the particular hardware unit you are talking to, which quickly becomes frustrating here in 2017.

To mitigate this for the basic needs of a new user, Amazon and other manufacturers tend to ship their devices with a “Quick Start” card in the box that itemizes common commands: setting alarms, keeping to-do lists, and asking how tall celebrities are (very important, naturally; Echo devices always report this measurement in both inches and centimeters, because why not?). This is an admission on their part that these devices are not good enough to understand arbitrary commands, nor can most folks be expected to learn an invisible interface by trial and error.

The learnability of a VUI that has a limited understanding of our vocabulary and speech patterns (and of who is actually speaking!) is very much a problem today, and you, the user, are not the one who needs to get better.

Next Post: How Autonomous Driving Will Change Our Cities

Thanks for reading!

Blake <@b_t_walsh>

