Siri’s Quest for Humanity
by Steven Brykman
The trend with Apple products has always been to make things smaller and faster with greater storage capacity. This holds true of practically any technology product put out by any company.
But of all the Apple products I can think of, the only one that has gotten consistently and significantly larger over time (other than desktop displays) is the iPhone. It's the one time Apple got the direction wrong: iPhones started out the smallest they would ever be, and have only gotten bigger since.
I’m not saying Apple erred by going bigger. Everyone I know who bought a new iPhone says the same thing, “I love it. I can’t believe I went with such a small phone for so long. Now every time I look at an iPhone 5 I think, ‘Oh my God. Look at how small that is!’ It seems like this little teeny phone now. I can barely type on that thing. But now I don’t even need my iPad anymore.”
Regardless, what really matters isn't screen size. Samsung isn't going to gain an edge by putting an even bigger screen on its phones. In the long run, what will make all the difference is something that doesn't require a screen at all, nor even a user interface. I'm talking about vocal interaction with a "virtual assistant" like Siri or Cortana.
A key concern of technology has always been finding better ways to interact with it: from keyboard to mouse and trackpad to Google Glass to the Leap Motion controller to voice. Of these, vocal interaction is the most obvious, intuitive, and natural, the one most similar to how we interact with people in the real world. It may also be the most represented interaction modality in sci-fi, from Douglas Adams' "share and enjoy" to Kubrick and Clarke's "I know that you and Frank were planning to disconnect me, and I'm afraid that is something I cannot allow to happen." Just as more and more everyday objects are becoming internet-enabled, it's inevitable that one day it will seem perfectly normal to hold a conversation with your dishwasher: "What do you mean, I'm not using enough detergent?"
There are a number of ways to achieve more "human" interactions with our devices. First, questions that require culling internet data will need: 1) a means by which all our randomly stored and displayed data (that is, the internet) can be organized and repackaged into a usable standard format (beyond Wikipedia entries and Google search results), along with a more intelligent means of gathering this information; and 2) the ability to present this collected data back to the user in natural language.
We are already witnessing a trend towards improving information gathering and repackaging. Companies like Crimson Hexagon treat the web like a “mass focus group,” providing actionable information to corporations based upon opinions expressed on social media (though admittedly, even social media offers a certain level of formatting). The ultimate goal is to be able to ask our devices any question about anything and receive an actual answer — not just a web-search — phrased in perfect English (or whichever language the user speaks).
Some of this technology is already here. Ask Siri for a picture of President Obama and Siri delivers. Ask Siri how old President Obama is and again Siri scores, answering, "Barack Obama is 53." Impressive. But follow that up with the more ambiguous "How old is he?" and Siri fumbles, searching the web for "how old is he." There's no inclusion of "Obama" in the follow-up because Siri is not smart enough to consider context in this particular instance.
With the exception of a few specific lines of questioning, a conversation with Siri ends after one question. Siri somehow does well with directions, deriving the appropriate context when asked, "Where is the nearest Salvation Army?" followed by, "Give me directions." Both Siri and Google also do better with weather: Google Search's "smart, conversational" style allows users to ask follow-up questions to "What's the weather like?" such as "How about this weekend?" and the system understands the context. Siri does the same.
But ask Siri a seemingly simple question like, "What bands are playing in Boston tonight?" and it assumes you're asking it to identify a song: "Hmmm. I'm not familiar with that tune." Ask it again and for whatever reason it seems to grasp the meaning a little better: "OK, I found this on the web for 'What bands are playing in Boston tonight.'" But the results it returns are unsatisfactory, referring the user to websites (bostonbands.com, northshoretonight.com) rather than listing the names of the bands. Not so smart after all. Or is it just that the source isn't there? That that sort of information has yet to be packaged up into a readable format?
Cross-platform, the results are inconsistent. Ask Cortana (Windows’ voice assistant) for the tallest building in the world and it responds, “the Burj Khalifa.” Considerably better than Siri’s answer: “Here’s some information” (though Siri provides the correct result visually). Meanwhile, Google Now provides the best response of all, speaking the top three results along with their respective heights.
When it comes to the format in which an answer is delivered, again we find much cross-platform variation. Ask Google Now, "How do you say 'hello' in Spanish?" and it tells you the answer out loud. As it should. Say "repeat" and it knows to repeat the word. Add, "What about French?" and it reads the French version aloud. Siri, on the other hand, simply conducts a web search and displays the result, with no audio. Adding "What about French?" brings up a Wikipedia entry: for French's mustard!
In a Siri-Cortana battle from back in April, Pete Pachal found that “When asked about restaurants, both shot back a list of results quickly. It was only Cortana who understood follow-up questions, though — telling me how far one of the results was, doing her (sic) best to find the menu and calling the place — while Siri had no idea what I was talking about. Point Cortana.”
The current formula describing our contextual voice interactions with devices, then, is this: replace any ambiguous pronoun with the more specific noun from the previous question. Simplistic, but better than nothing.
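As a rough illustration (and emphatically not a description of how Siri, Cortana, or Google actually implement context), that substitution rule can be sketched in a few lines of Python. The pronoun list and the substring-based entity matching here are invented for the example:

```python
import re

# Pronouns we treat as "ambiguous" for this sketch. Real assistants use
# far more sophisticated coreference resolution; this list is an assumption.
AMBIGUOUS = ("he", "she", "it", "they", "one", "ones")

def resolve_followup(previous_query, followup, known_entities):
    """Replace an ambiguous pronoun in the follow-up with the most
    specific noun (here, the last known entity) from the previous question."""
    antecedent = None
    prev = previous_query.lower()
    for entity in known_entities:
        if entity.lower() in prev:
            antecedent = entity  # keep the last entity that matched
    if antecedent is None:
        # No context to carry over; an assistant would fall back to web search.
        return followup
    pattern = r"\b(" + "|".join(AMBIGUOUS) + r")\b"
    return re.sub(pattern, antecedent, followup, flags=re.IGNORECASE)

print(resolve_followup("How old is President Obama?",
                       "How old is he?",
                       ["President Obama", "Boston"]))
# -> How old is President Obama?
```

This is exactly the weakness the Obama example exposes: the rule only works if the assistant bothers to look back one turn, and it collapses as soon as the antecedent is anything subtler than the last proper noun mentioned.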
The larger picture involves the use of personal context: incorporating a user's preferences, activity, locations, past searches, and so on to produce an even richer, more rewarding exchange between man and machine. The question "Which movies are playing near me today?" yields a list of movies, but the follow-up question, "Which ones will I like?" naturally remains unanswerable (though Google Now derives the context correctly, converting the question to "Which movies will I like?").
Eventually, our devices will incorporate a wide range of personal data into their responses. By recalling which restaurants we visited and reviewed favorably in the past, for instance, they will know which to recommend in the future. The same goes for more transient events: "Which bands are playing tonight?" implies "which bands are playing tonight that I will enjoy, based upon my existing musical preferences (pulled from my iTunes library, Spotify playlists, Pandora stations, etc.)."
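A hypothetical sketch of that kind of preference filter, with invented band names and a flat set standing in for a real iTunes/Spotify/Pandora library (no actual API exists in this form):

```python
# Toy ranking: bands already in the user's library float to the top of
# tonight's listings. Data and names are invented for illustration only.

def rank_by_taste(playing_tonight, library_artists):
    """Order tonight's listings so familiar artists come first."""
    familiar = [b for b in playing_tonight if b in library_artists]
    unfamiliar = [b for b in playing_tonight if b not in library_artists]
    return familiar + unfamiliar

print(rank_by_taste(["Band A", "Band B", "Band C"], {"Band C"}))
# -> ['Band C', 'Band A', 'Band B']
```

The hard part, of course, is not the ranking but everything upstream of it: getting tonight's listings into a machine-readable format and getting consent to mine the user's listening history.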
This guy claims Cortana already does this, that it “learns as it goes along, it takes information from you…it learns your interests. What teams you follow, what sports you follow…”
One wonders if the technology will become so prescient that eventually we'll just let our devices plan our evenings: make the reservations, buy the tickets, even order the food. As a father of three kids with no time on his hands, frankly, I'm looking forward to the day this scenario is possible: "Hey Siri, set me up with dinner and a movie for Friday night: Chinese with a high hot & sour soup rating and an Apatow comedy with an emotional subtext. No farcical Lampoon crap. Also, don't make me walk more than a half-mile. Daddy's dogs are barking."
(This story originally appeared, in edited form, in Enterprise Apps Today.)