The Future is Here — Life with Alexa

Vimal Bhalodia
Mar 5, 2015 · 11 min read


Three months ago (December 2014), I acquired an Amazon Echo. Yes, the promo videos were lame. Yes, I’m sure Jeff Bezos is listening to every inane conversation that happens in my apartment. And yes, it’s still unreasonably difficult to acquire one.

But I don’t care — this is the only gadget released in the past 5 years which actually makes me feel like I’m living in the future!

The ship computer from Star Trek is finally real, and her name is Alexa.

Rather than rehash all its features and capabilities, I’m just going to highlight things that I actually care about as well as my wishlist of features that would make the Echo even more awesome.

Ambient Voice Control

Prior to the Echo, I had never used voice control to interface with anything. Or more accurately, I had never preferred a voice control interface, for one simple reason — if I’m within poking distance of a widget, it’s universally faster, easier, and more convenient for me to poke it.

Two things about the Echo let me use it from anywhere — it’s always listening, and its microphone array does a great job of hearing me from anywhere in my kitchen/living room. There is absolutely no reason for me to walk over to my desk or reach into my pocket when I can just as easily speak into the ether.

Personal Assistant

Asking “what’s the killer feature for the Echo” is like asking “what’s the killer feature for a personal assistant”. There isn’t one, and it’s unlikely that there ever will be one. The Echo is a mishmash of random little capabilities that I’ve gradually integrated into my daily life.

Popped some brownies into the oven? “Alexa — set a timer for 25min” so that I remember to take them out before they burn.

Haters got me down? “Alexa — play Shake it Off by Taylor Swift”.

Want to mock my East Coast friends? “Alexa — what is the weather in New York City.”

Helping my friends prep for management consulting interviews? “Alexa — what is the population of the United States”

But to level-set expectations, the Echo is not an eager little Harvard MBA kissing my executive butt whenever I want. It’s more like a well-behaved toddler — it has a very limited understanding of the world and it may or may not always do what I tell it to.

To improve accuracy, I do actually treat it like a toddler — speaking slowly and deliberately, raising my voice, and occasionally repeating myself. FWIW, this technique works on Harvard MBAs as well.

Mobile App Anti-Feature

The Echo comes with a companion mobile app which helps with initial setup and configuration. It also serves as an extension to the UX for a number of features, most notably todo-lists, shopping lists, and general web searches. But I don’t count any of those as actual Echo features.

Every interaction with the mobile app is an anti-feature. If the user experience for any feature cannot be completed exclusively over the voice interface, then it’s useless to me.

As the Echo team releases new features, I’m keeping an eye on how many of them are truly standalone vs require use of the mobile app. The more standalone features introduced, the greater the chance that the Echo may actually become the next iPhone. But if more app-based features are introduced, then I’m selling my Echo on eBay before it goes the way of Google Glass and the Fire Phone.

Hardware Wishlist

The Echo is a very well-designed and implemented piece of hardware — it’s compact, functional, and subtle. Based on how I use it, there are only two changes I would make.

First, the built-in Echo speaker. It’s “good” — worlds better than pretty much every speaker built into any mobile device or laptop I own, and probably comparable to the speakers that come built into my television. But it honestly does not come close to even my rather cheap home stereo setup — I am all about that bass, and Meghan Trainor just sounds better coming from real speakers.

Having a standard line-out jack to hook up external speakers would be awesome. I can imagine why this might not be a trivial feature — onboard microphone noise cancellation may not be easily adaptable to random external audio setups — but still it would make the Echo much more compelling as a home audio product.

Speaking of home audio products — despite the overall lower audio quality, I listen to 100% of my music through my Echo when I’m at home. The voice interface for music control is just so natural and convenient that I’m willing to compromise on the actual audio quality.

The second change I would make is to Bluetooth pairing. Amazon very proudly advertises the ability for the Echo to pair with your phone over Bluetooth. What they don’t tell you is that it pairs as a Bluetooth speaker — major disappointment.

I mentioned this before, but I should reiterate — the microphone quality on the Echo is amazing. Incredible. Better than even the stupidly expensive conference room Polycoms and certainly worlds better than anything built into any phone. I want my Echo to connect to my phone and pretend to be a Bluetooth headset so that when someone calls me, I can say “Alexa, answer” from wherever I am and just start talking.

Software Wishlist

There are so many amazing things the Echo has the potential to do, I don’t even know where to begin. So instead, I’ll focus on just incremental extensions to its existing features — things I’d imagine could be delivered in months rather than years and built entirely out of existing technologies and already-solved problems.

First up — the todo list. It’s stupid. Or, at least, it’s stupid in its current form. At the very least, it should take advantage of the alarm/timer features and evolve into a “reminder” system.

V1 would simply be time-based reminders — “Alexa — in 25 minutes, remind me to take the brownies out of the oven” and 25 minutes later, a pleasant voice says “take the brownies out of the oven” (or more likely “make the bow tie out” — general purpose speech recognition on the Echo is still a bit hit or miss).

V2 might be the exception to my “no app-based features” rule by supporting contextual reminders or integrating with 3rd party reminder/todo apps that have contextual APIs. “Alexa — remind me to buy milk when I go to Safeway”, and next time I’m within geofenced proximity to a Safeway (or to a suitable competitor that pays Amazon lots of money), a notification pops up on my phone.

Next up — home automation integrations. Nest thermostats, Philips Hue lights, Belkin WeMo Crock-Pots — if we are early adopters of the Echo, we are probably early adopters of all these other gadgets, and they all are perfect candidates for integration because they have open APIs and simple command grammars. Plus they make for cool demos.

I would be shocked if some of these integrations are not rolled out within a few months — I mean, I already have a working Philips Hue light control setup, and that took maybe an hour of total coding time?

Seriously, after making that hack work, the thought of having to actually walk over to a physical wall switch to turn my lights on and off feels so…primitive.
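For reference, a light-control hack like this really is only a handful of lines against the Hue bridge’s local REST API. Here’s a minimal sketch — the bridge IP and API username are placeholders you’d get from your own bridge setup:

```python
import json
import urllib.request

BRIDGE_IP = "192.168.1.10"       # placeholder -- your Hue bridge's LAN address
API_USER = "your-api-username"   # placeholder -- created via the bridge's link button


def light_state_request(light_id, on, brightness=None):
    """Build the URL and JSON body for a Hue light state change."""
    url = f"http://{BRIDGE_IP}/api/{API_USER}/lights/{light_id}/state"
    body = {"on": on}
    if brightness is not None:
        body["bri"] = max(1, min(254, brightness))  # Hue brightness range is 1-254
    return url, json.dumps(body).encode()


def set_light(light_id, on, brightness=None):
    """PUT the state change to the bridge."""
    url, body = light_state_request(light_id, on, brightness)
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# set_light(1, True, brightness=200)  # turn light 1 on at ~80% brightness
```

Wire that up to the todo-list transcription trick described below and you have voice-controlled lights in an afternoon.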

Finally, the flash briefing — a regularly updated and consolidated pull of headlines and news items from around the internet. I set it up and used it maybe once? Google Now and Nest have set the user experience bar, and the flash briefing feature in its current form just doesn’t measure up. Fixing it is probably the most pie-in-the-sky item on my current wishlist.

The biggest change in flash briefing V1 would be to source data not just from general public news outlets, but also from my personal data sources — calendar, todo apps, Amazon purchases, etc. Google, Facebook, LinkedIn, Microsoft, the NSA, and a half-dozen eastern European identity theft rings are already reading my e-mail — why shouldn’t I invite Amazon to the party?

As for V2 — this is where things get interesting. I don’t want to actually ask Alexa for a flash briefing every day — I want it to take a guess at when would be an appropriate time to throw updates at me, and learn from when I tell it to hush, much like the Nest. Candidate times for updates could be guessed based not just on wall clock time, but also events triggered by other devices including fitness trackers or alarm clocks (update me 10min after I wake up), phones or BLE beacons (update me when I get home from work), and even coffee makers (update me when I drink my morning coffee). This is a risky feature — done well, it could be a magical Nest-like experience, but done poorly and we have the new Clippy.

API Wishlist

I just want an actual API. Then again, I also want a pony, and I understand that I can’t always get what I want.

V1 of the API that I want is literally the current API which the Echo webapp uses, but properly documented and using an auth system that’s slightly better than “grab cookies from your logged in echo.amazon.com browser”.

I was giving the todo feature a lot of flak earlier for how useless it is, but the truth is that it is actually the most useful feature of the Echo as it stands today. Why? Because it’s currently the best option for giving general purpose commands to the Echo and taking advantage of its speech-to-text capabilities without triggering any other action. It is what most of us use to make our cool little home automation hack demos — “Alexa — todo — turn on the lights”.

But the todo feature hack is not perfect — it has a couple of limitations which make demos less slick than they could be. First, the reply to “Alexa — todo XYZ” is always “XYZ added to your todo list”, which is awkward and unnecessary. Next, the Echo appears to have a list of global keywords, including things like “louder” and “softer”, which are interpreted as device commands and automatically cancel the todo item recognition. You can never have a todo list item like “Alexa — todo — make the lights softer” (seriously, try it!). Finally, there used to be no way for your external integrations to provide text-to-speech voice feedback through the Echo. However, the most recent release of the Echo system now includes a “say” feature, which is exactly what we need provided we can figure out the right API endpoint for it.

V2 of the API I want is basically a tweak of the V1 API that fixes enough of the annoyances to really open the floodgates of integrations. Instead of hijacking the poor todo feature, I’d propose allocating a new keyword, perhaps “try”. If “try” is the first word detected after “Alexa”, then everything after that is strictly transcribed and pushed to an API-accessible endpoint. The Echo should not provide any automatic acknowledgement or feedback, and it should not attempt to act on any potential keywords it hears after “try”. Instead, apps listening to the API endpoint would be responsible for providing user feedback in whatever manner appropriate, including using the “say” endpoint if necessary, or via a new “play” endpoint that just plays a one-off wav or mp3 snippet embedded in the request.

Personally, I think it would be very natural to say “Alexa — try to turn on the lights”. What’s particularly nice about using “try” as a namespace for external integrations is that it implies a level of uncertainty about whether or not the action will succeed, leaving all definitive command statements available for officially supported features.
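To make the “try” idea concrete, here’s a sketch of what a listener app consuming that command stream might look like. Everything here is hypothetical — no such API exists today; the registration scheme and the idea of returning a string to be pushed through the “say” endpoint are purely my assumptions:

```python
# Hypothetical client-side dispatcher for the proposed "try" command stream.
# Handlers register a command prefix; dispatch() routes a transcription to
# the first matching handler and returns the feedback string the app would
# push back through a (hypothetical) "say" endpoint.

HANDLERS = []


def handles(prefix):
    """Register a handler for transcriptions starting with a given phrase."""
    def register(fn):
        HANDLERS.append((prefix, fn))
        return fn
    return register


@handles("to turn on the lights")
def lights_on(rest):
    # ...call your home automation stack here...
    return "Lights are on."


def dispatch(transcription):
    """Route an 'Alexa -- try ...' transcription to a matching handler.

    Returns the spoken feedback, or None if nothing matched -- in which
    case the Echo itself stays silent, as proposed above.
    """
    text = transcription.lower().strip()
    for prefix, fn in HANDLERS:
        if text.startswith(prefix):
            return fn(text[len(prefix):].strip())
    return None
```

The key design point is that all feedback is the app’s responsibility — the Echo just transcribes and forwards.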

V3 of the API I want is a way of declaratively extending the grammar of Echo commands and then having that integrate with the Echo cloud magic.

How is this an improvement over the V2 API? It comes down to the inherent challenges of speech recognition systems. Current general purpose speech-to-text systems suck. Seriously — try spending an entire day not using your keyboard, and relying on Siri, Google, or any other transcription interface to execute every search you need to. Good luck getting above a 50% correctness rate, especially if you say anything remotely esoteric.

But if general purpose speech-to-text systems suck so much, how does the Echo get to toddler-level accuracy? It turns out that rather than trying to guess the single best transcription, the magical cloud makes multiple transcription guesses and acts on the best guess that happens to map to a real command based on its internal grammar.

I didn’t really have a visceral appreciation of the importance of this until I started working on my own todo-based light control hack. At first I was just using the single displayed transcription result and getting maybe 30% accuracy. But after poking around a bit, I discovered that the todo endpoint also returned a list of “N best guess” transcriptions. When I switched to iterating through these and acting on the first one that matched my light control grammar, I got closer to 80% accuracy. The difference between 30% and 80% accuracy is the difference between making YouTube demo videos and actually having a usable feature which lets me stop pawing at physical light switches like some sort of neanderthal.
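The gist of that accuracy jump fits in a few lines of Python. The grammar and the guess list below are made up for illustration — the real todo endpoint’s response format is undocumented — but the shape of the trick is exactly this:

```python
import re

# A tiny light-control grammar: the only commands this hack understands.
GRAMMAR = re.compile(r"turn the (?P<room>[\w ]+?) lights? (?P<state>on|off)\b")


def pick_command(guesses):
    """Return the first N-best transcription guess that parses, else None.

    Acting on the first guess that matches the grammar, instead of only
    the single top guess, is what turns ~30% accuracy into ~80%.
    """
    for guess in guesses:
        m = GRAMMAR.search(guess.lower())
        if m:
            return m.group("room"), m.group("state")
    return None


# A typical N-best list for a noisy utterance -- the top guess is garbled,
# but a lower-ranked guess parses cleanly.
guesses = [
    "turn the living room light son",   # top guess: no match
    "turn the living room lights on",   # second guess: matches the grammar
    "turn the living realm lights on",
]
```

Here `pick_command(guesses)` skips the garbled top guess and returns `("living room", "on")` from the second.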

Exactly how to design this sort of extensible grammar API is an open question for which I have not seen any commercially robust solutions. But that doesn’t mean they don’t exist — in fact, if the V2 API is released, I’d expect a dozen or more different projects to sprout up, all experimenting with different ways of defining and integrating modular plugins that act on the command stream. Give them a year to bake, take the best one, and adopt it as the V3 API.

OK Google, What’s Next?

I’m really quite pleased with my Amazon Echo.

It works exceptionally well, and I use it every day. Is it worth the $300 I paid for it on eBay? Meh — probably not to anyone other than me. Is it worth the $200 list price? Probably not right now, but maybe after it gains more of the features on my wishlist or has an API to let 3rd party devs fill in the gaps. Is it worth the $99 Prime price — oh heck yeah, and a perfect answer to the “I don’t know you very well but I need to get you a gift” problem.

Do I think the Echo will succeed long-term? Eh, maybe. If the team behind it really embraces the fact that ambient voice control provides not just a new UX, but also a platform for a whole new class of applications, then yeah — I’d be willing to call the Echo the new iPhone.

But until that happens, the Echo is in a precarious position — the only unique thing it really has going for it is an exceptionally good microphone, and even that advantage is just a Kickstarter away from disappearing. All it takes is one enterprising individual to glue the guts of a Polycom conference room speakerphone to the guts of a stock Android phone, mash up Tasker and OpenMic+, write a handful of integrations with the most common services/apps/devices, expose an API, and shove all of it into a cute package.

If Google themselves decide to release a Nexus Home Automation Hub with high quality ambient voice recognition, OK Google / Google Now level of intelligence, integration with the Google app/service ecosystem, and some basic Chromecast and media player capabilities, I’d switch in a heartbeat. Apple could pull off something similar, though they probably are too busy designing a 6.5" iPhone7++ to really experiment with something like this.

So until something better comes along, I’m just going to sit on my couch and boss around my cloud-based minion without lifting a finger.

Welcome to the future.
