10 Years On: How Much Can You Do With Siri?
Voice recognition is good, but it’s what Siri does with it that counts
Last year, back when we were allowed to do this, I went to a restaurant with a group of friends. I forget where. There were burgers on the table. Maybe served on a piece of wood or in a paper bag. They weren’t on plates anyway. It all feels like another world now. During the meal, as words and opinions and pieces of burger flew through the air, someone said something like “seriously” or “series” or maybe even Simon. There was a muffled ping from a wrist or pocket. “I’m sorry,” Siri said, “I didn’t quite catch that.”
“I have found two uses for Siri,” my friend Dan said, “setting a timer for cooking eggs and playing the music from Frozen.” Dan has a young child, which is why he needs to be playing music from Frozen at all times. “And I eat a lot of eggs,” he added.
From time to time over the last year, I’ve thought about Siri. Aside from eggs and Frozen, how much can you actually do with it?
After the meal, I decide to try using Siri more. Immediately I get a sense of what this will entail when I remember I need to send an email when I get home. Usually, I’d write a reminder for myself, but we’re in a Siri world now.
But there’s a problem. I’m in a tube station and there are other people around. Do I really want to talk to Siri in front of them? I’ll seem mad or pretentious. Maybe, even worse, it’ll seem like attention-seeking; speaking aloud in a public space. The tube is for silent contemplation of the adverts and floor. There are, literally, unspoken rules: you don’t talk to yourself, to strangers, or to your phone unless there’s a person on the other end. Could I pretend that I’m talking to a friend of mine who just happens to be called Siri?
I walk to the end of the platform, as far away from everyone as I can get, turn into the corner, and whisper. “Hey Siri, remind me to send an email when I get home.”
There’s a ping. Siri thinks for a few seconds. “Sorry,” it replies, “there’s no home address for Home. If you’d like to add one you can do so from the contacts app.”
At this point, I discover my contacts are a mess. I’ve always half-known this, but it’s never been a problem before. I have multiple entries for the same person or old phone numbers for people who have moved jobs. When I search for, say, Dan, I get multiple entries: Dan, Dan (work), Dan (old), Dan (?). Details are split across services: some in iCloud, some in Google, some synced to my work Exchange account. It is an embarrassing mess. Sometimes, when people at work quit their jobs and remove the work email account from their phones, they find they lose their personal contacts too. Phone numbers of cousins and plumbers have all been saved to work servers without them knowing. I realize I’m no better than they are. And I run those work servers.
In my contacts, I find an entry called “Home”, with my parents’ home phone number. When I said “get home,” Siri assumed I meant this entry, rather than my actual home. I rename it to “Mum and Dad” and try Siri again.
“I don’t know your home address,” Siri says, “If you’d like to add it, you can do that from the Contacts app.” On my screen, the contacts entry pops up, showing, I note, my home address. Siri may be powered by AI, but it’s very picky AI.
Back home, I download all my contacts from Exchange, iCloud, and Google and merge them into one file. Each system generates a slightly different export, but with a bit of messing around, I tidy them up and upload them. Unfortunately, I make a rookie error. Excel automatically formats large numbers into exponential notation. My beautifully curated mobile numbers become nonsense values: 7.8E+09. Siri can’t use that to dial anyone.
I’m struck by the challenges we face in the 21st century. When we wrote phone numbers in spiral-bound address books, we didn’t have to worry that the pages would interpret the phone numbers as integers and round them to avoid floating point issues. In 2021, our lives may be more luxurious, but there are more minor inconveniences.
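The damage is easy to reproduce. What follows is a minimal Python sketch rather than a description of what Excel does internally: the phone number and the CSV text are made up, and `read_contacts` is a hypothetical helper. It shows that once a number has been reduced to one-decimal scientific notation the trailing digits cannot be recovered, and that keeping the column as text avoids the problem.

```python
import csv
import io

# A made-up mobile number with the leading zero already stripped.
phone = 7800123456

# One-decimal scientific notation, similar to what a spreadsheet may display.
as_displayed = format(phone, ".1E")

# Converting the displayed value back keeps only the first two digits.
round_tripped = int(float(as_displayed))


def read_contacts(csv_text):
    """Read contacts from CSV text, keeping every field as a string."""
    # csv.DictReader never converts values, so phone numbers stay intact.
    return list(csv.DictReader(io.StringIO(csv_text)))


sample = "name,phone\nDan,07800123456\n"
contacts = read_contacts(sample)

print(as_displayed)          # the nonsense value
print(round_tripped)         # what you get if you try to undo it
print(contacts[0]["phone"])  # the number, untouched, when treated as text
```

The design choice is simply to never let a phone number become a number: it is an identifier that happens to be made of digits.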
Once I’ve corrected all the phone numbers and loaded them back into my phone, I notice that there are extra fields I can complete. Birthdays, home addresses, partners’ names, and photos. My desire for neatness kicks in and I start adding missing data. I’m halfway through downloading profile images from LinkedIn and Facebook and adding them to my contacts when I remember this all started because I needed to send an email.
“Hey Siri,” I say, “remind me to send an email when I get home.”
Success! Siri creates a new to-do item, looks up my address, and tags it to the item so that it will trigger when I arrive at those coordinates. This is a pyrrhic victory as I am already at home. But still, I classify it as a breakthrough. Up until now, I’ve been adding things to a to-do list and then looking at the list like a Neanderthal. Now, my phone reminds me to do tasks when I am in a position to do them. Yes, I could have manually added a location to a reminder, but it’s fiddly. Siri has improved my life.
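How a reminder might decide that I have “arrived” can be sketched in a few lines. This is an illustration under assumptions, not how Apple implements it: the coordinates, the 100 metre radius, and the names `distance_m` and `should_fire` are all invented for the example. It uses the haversine formula to measure the distance between the phone and the saved address.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def should_fire(reminder, lat, lon):
    """True when the current position is inside the reminder's radius."""
    gap = distance_m(lat, lon, reminder["lat"], reminder["lon"])
    return gap <= reminder["radius_m"]


# Hypothetical reminder pinned to made-up coordinates in central London.
reminder = {"title": "Send an email", "lat": 51.5074, "lon": -0.1278, "radius_m": 100}

print(should_fire(reminder, 51.5074, -0.1278))  # standing at the saved address
print(should_fire(reminder, 51.5174, -0.1278))  # roughly a kilometre north
```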
Unfortunately, my to-do list is now split between the native Apple Reminders app and my third-party to-do list app. For the purposes of getting Siri to work, I decide to use the tools Apple wants me to. I spend the rest of the evening migrating my to-dos into Reminders. So far, using Siri has resulted in more copying data between systems than I expected.
Like seemingly all technology, we have the US military to thank for Siri. In 2003, the Defense Advanced Research Projects Agency (DARPA) ran a project called CALO (Cognitive Assistant that Learns and Organizes) to build an AI digital assistant that could reason, reflect and learn. Primarily this was a research project, but it also produced a prototype of Siri. Siri was initially released as a standalone app for iPhone, with plans to port it to other platforms, but in 2010 Apple bought the company behind it and, in 2011, built Siri into the iPhone 4S. Supposedly the S in 4S stands for Siri.
Up until now, like Dan, I’ve only used Siri for one thing: setting timers. It’s faster than opening the timer app and spinning the numbers. A quick “set a timer for three minutes” and I’m done. Thousands of cloud servers and millions of hours of development time to ensure my eggs are soft-boiled. The biggest barrier to using Siri, I find, is identifying good tasks.
“How many steps have I walked today?” I ask, assuming that as the data is stored in the Apple Health app, Siri can access it.
“I can’t answer that on your iPhone, but you can find it in the Activity app,” Siri says. “Open the Activity app?”
I ask Siri to open the Activity app.
“I can’t do that… you don’t have the Activity app installed.”
I’ve never heard of the Activity app, so I ask Siri to install it. It turns out it’s an Apple Watch app. I don’t have an Apple Watch.
“Hey Siri, open the Health app,” I try. Siri opens the Health app, and on the screen is the number of steps I’ve walked. I can get the answer but Siri can’t access it directly and instead prompts me to open an app I don’t have on a device I don’t own. While I’m prepared to switch to-do lists for the purposes of making Siri work better, I draw the line at spending £800 on a new device.
Apple has an accessibility system called “voice control” which really can do everything on your iPhone. It produces an overlay of everything tappable on the screen and lets you scroll up and down using just your voice. I could use this, but this isn’t Siri. What I want to do is speak to my phone at a higher level of abstraction, not step through every individual button I want to press.
The web of information you need to identify good Siri commands is huge. To make some seemingly simple commands work, I find myself puzzling through not just software architecture but also corporate structures and acquisitions.
A few years ago, I bought a smart thermostat to turn on heating with my phone. It uses machine learning to set the heating to the temperature I wanted last week, but I’ve largely managed to turn that off. So: can I get Siri to turn on my heating?
My thermostat is made by Nest Labs, acquired by Google in 2014. Google creates connected products, branded as Google Smart Home, which is how Nest is certified. Apple meanwhile provides HomeKit, a set of standards for smart homes that Siri can communicate with. Unfortunately, Google has not made Nest products compatible with HomeKit and Apple has not integrated Siri with Google Smart Home. Two tech behemoths at an impasse, waiting for the other to blink.
I am stuck. I have the Nest app on my phone, which I can open with Siri, but I can’t get Siri to then tap the button within the app to turn on the heating.
Siri, though, is just software that issues commands to other software. And I’m a software engineer. I can make things respond to commands. Or, more accurately: I can install software someone else has written and fiddle around with JSON files until it works.
And this is how I find Homebridge, an open-source Node.js project that acts as a bridge between two commercial empires: it allows HomeKit to talk to Nest APIs. A sort of digital emissary. You install it on your computer, start it up, and after accepting a few scary warnings about it being unsupported, you’re up and running.
Next, you need to extract a series of security keys from the Nest website that Google really doesn’t want you to have. For the adventurous, you can get the instructions online, but let’s just say as I was going through the request and response headers in the developer console in Chrome and copying values to the issueToken field in a config.json file I thought to myself: well, this just works. These are definitely some instructions I could send to elderly relatives.
But I got there in the end. I connected everything, and now I can say to Siri: “Turn the heating up two degrees”, and it works.
As long as my computer is on.
One thing Siri does very well is phone people. At least it does now I’ve tidied up my contacts. I start getting ambitious. “Make a FaceTime Audio call to my mum on speaker,” I say. And it works.
There’s a joy in stringing multiple commands together. Not just a phone call, not just a FaceTime call, a FaceTime Audio call, to my mum, and do it on speaker. I create a mental scoring system and find myself chaining commands to get high scores in my own made-up game.
There’s a moment of tension when you issue a longer command. At one point I ask Siri to phone Rob and it starts phoning Bob, an ex-colleague I haven’t spoken to for years (and don’t really get on with). At other times, Siri completely misunderstands what I’ve said and I have to start again. Bring out the world’s tiniest violins for this first-world problem, but man it’s annoying when you have to repeat the command from the start. Disproportionately annoying. I find myself thinking, “Oh fine I won’t have the heating on then,” rather than repeat myself. But when Siri gets it right and is faster than doing it manually, I feel elated. I’m proud of myself. And all I’ve done is spoken aloud a basic instruction.
So far the things I’ve tried to do with Siri are minor actions. Turn on the heating. Phone someone. Set a timer. I try getting more adventurous and using Siri to send and read messages.
Perhaps my friends are more clumsy and more pedantic than average, but a typical message from my friends looks like this:
“Do you know what time you’re going to be getting help?”
Siri only reads the last message in the conversation, which means I find it repeatedly reciting one-word messages like “you’re” or “their” or “bloody autocorrect.”
I think of Her, the 2013 film where Joaquin Phoenix falls in love with an artificial assistant. There is not much risk of me falling in love with Siri. For a start, talking to Siri is nothing like talking to a human. Its intonation is disjointed: “I don’t know how to do that… Simon,” Siri says when confused. There’s a gap as Siri processes your words and when the reply comes it is clearly a response from an API request. The things I ask Siri are functional and repetitive. Maybe I’m a heartless psychopath, but I do not feel the need to thank Siri when it starts a timer. When trying new commands, more often than not, I’m just annoyed.
An example. I’m going for a meeting in Reading, and I ask Siri how long it’ll take to get there by train.
“Sorry,” Siri says, “I can’t provide transit schedules or timetables, but if you need directions via public transit from your current location, I may be able to help.”
I thought that was what I asked, so I try rewording it. “How do I get from here to Reading by train?” I ask.
“Where would you like to go?” Siri asks.
“Reading,” I say. Reading pops up on the screen.
“Hmm…” Siri says, “let’s try that again. Where would you like to go?”
I wonder if Siri is getting confused because Reading (the place in Berkshire) is spelled the same as reading (the present participle of the verb “to read”). This is what my life has become: thinking about homographs and present participles to try to get my phone to tell me how long a train journey will take.
At other times, Siri is like an overly officious PA. When cooking dinner one day I say, “Hey Siri, set a timer for, um,” I check the packet, “Fifteen minu-”
“For how long would you like the timer set?” Siri interrupts me.
Jeez, Siri, give me a minute, I was just getting to that.
Speaking to Siri is not conversational, at least not the sort of conversations I have. What you’re doing is issuing instructions with parameters: “Add a meeting to my calendar on 4th December from 10 until 11 called Try to Get Homebridge working”. That’s not a conversation.
Siri breaks this down to:
- Type: Calendar
- Action: Create entry
- Start date/time: 2021-12-04 10:00:00
- End date/time: 2021-12-04 11:00:00
- Title: Try to Get Homebridge working
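The breakdown above can be imitated with a single regular expression. This is a minimal Python sketch and certainly not how Siri works internally: `CalendarEntry`, `parse`, and the pattern are hypothetical, the pattern understands exactly one phrasing, and the year has to be supplied because the spoken command does not include one.

```python
import re
from dataclasses import dataclass
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Hypothetical pattern for one phrasing only.
PATTERN = re.compile(
    r"add a meeting to my calendar on (?P<day>\d{1,2})(?:st|nd|rd|th)? "
    r"(?P<month>[a-z]+) from (?P<start>\d{1,2}) until (?P<end>\d{1,2}) "
    r"called (?P<title>.+)",
    re.IGNORECASE,
)


@dataclass
class CalendarEntry:
    entry_type: str
    action: str
    start: datetime
    end: datetime
    title: str


def parse(command, year):
    """Turn the spoken command into a structured entry, or None if it doesn't fit."""
    match = PATTERN.fullmatch(command.strip())
    if match is None:
        return None
    month = MONTHS.get(match["month"].lower())
    if month is None:
        return None
    day = int(match["day"])
    return CalendarEntry(
        entry_type="Calendar",
        action="Create entry",
        start=datetime(year, month, day, int(match["start"])),
        end=datetime(year, month, day, int(match["end"])),
        title=match["title"],
    )


entry = parse(
    "Add a meeting to my calendar on 4th December from 10 until 11 "
    "called Try to Get Homebridge working",
    year=2021,
)
print(entry)
```

Anything that strays from the expected wording returns `None`, which is roughly the experience of being asked to try again.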
In a sense, voice assistants are a throwback to the command line. Before we had graphical interfaces to click and drag, people typed words into black boxes on their screens. The command prompt is still around, but these days it’s mainly used by developers, people with shelves of O’Reilly books, and film producers when showing hackers.
The command line consists of commands with additional instructions called “parameters”. If you want to make a new folder, for example, on Windows, you type:
mkdir "Folder name"
You can’t make empty folders on an iPhone, but if you could, I imagine I’d say to Siri: “Make a folder called ‘Folder name’”. I’m essentially doing the same thing out loud as I’d do with a command prompt.
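To make the parallel concrete, here is a small Python sketch in which the typed command and the spoken sentence end up as the same command-plus-parameter pair. The function names and the single spoken phrasing it recognises are assumptions made for illustration, and nothing here actually creates a folder.

```python
import re
import shlex


def from_command_line(line):
    """Split a typed command into (command, parameters), honouring quotes."""
    command, *parameters = shlex.split(line)
    return command, parameters


def from_speech(utterance):
    """Map one hypothetical spoken phrasing onto the same structure."""
    match = re.fullmatch(r"make a folder called (.+)", utterance.strip(), re.IGNORECASE)
    if match is None:
        return None
    return "mkdir", [match.group(1)]


print(from_command_line('mkdir "Folder name"'))
print(from_speech("Make a folder called Folder name"))
print(from_speech("Move my apps around"))  # not a phrasing this sketch knows
```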
A graphical interface works in a different way. You don’t have to remember special commands like mkdir or know which words Siri is listening out for. The screen shows you what’s available, like a menu in a restaurant, and you click on what you want.
Over the last few days, I’ve found myself thinking things like: “I wonder if Siri can move apps around on my home screen.” But there’s no way of knowing, without trying a few different commands. In fact, even after trying a few, I don’t know for sure that Siri can’t do it, or whether I just didn’t find the right magic words. With a well-designed graphical interface, you don’t have to think like this because the buttons show what’s possible. Screens aid discoverability. Siri hides it away.
James J Gibson coined a term for this in his 1966 book The Senses Considered as Perceptual Systems: affordances. It means the way an item shows what you can do with it. A push door has a metal plate to show you can push it, a pull door has a handle to show you can pull it (in theory). Graphical interfaces offer affordances: buttons you can press, handles you can drag, bars you can scroll. Voice interfaces don’t offer affordances. Like the command prompt, you’re left with a flashing cursor to indicate you need to enter something, but only a small subset of all the possible things you could enter will do anything.
“Hey Siri, Knock-knock.”
“Knock-knock,” Siri says. I started saying knock-knock as if I were going to tell a joke, but now Siri has taken over.
“Who’s there?” I say.
“Wooden shoe,” Siri says.
“Wooden shoe who?” I say.
“Go on,” Siri says. I must have spoken too quickly, and Siri missed my answer.
“Wooden shoe who?” I try again.
“I don’t know, who?” Siri says.
I start over, and this time get to the punchline. “Wooden shoe like to know.” Siri does a robotic giggle. It pops into my mind that iPhones cost over a thousand bucks apiece.
There are a lot of Easter Eggs hidden in Siri. Hundreds. The effort Apple has put into custom responses for jokey questions hints at how Siri is used. People play with Siri or show it to their friends to amuse each other. They don’t use it for productive work. Siri is a toy.
Talking to computers is a new way of interacting with them, and for all the Amazon Alexas and Google Homes that have been sold, we’re still not used to using our voices. As a child, when I first used a computer, I remember struggling to double-click the unfamiliar device I was holding at an unnatural angle. This was part of the reason Microsoft included Minesweeper and Solitaire with early versions of Windows: to teach people to use a mouse. We thought we were finding little bombs, but we were practicing precise left and right clicks.
I’d like to say Easter Eggs in Siri teach us how to interact by voice. But they don’t. The jokes don’t show interaction principles. Saying “knock knock” doesn’t teach us what sort of things Siri can respond to, it’s a gimmick to amuse us. These aren’t principles, they’re lists. Siri giving a cheeky answer when you ask its age is tricking you into thinking it’s smarter than it is.
Ultimately, Siri is only as good as the systems it hands commands to. We call it an assistant, but really it is a voice recognition and brokerage service. It turns spoken words into text, matches keywords against an index of commands, and triggers other services. Even if Siri correctly transcribes and parses what you say, there may not be a service to pass the request to and so the request fails.
Siri is generally good at converting speech to text. At least for people without strong accents. A decade ago, voice recognition hit a plateau, but Siri (and Amazon and Google) have broken through that. If I don’t mumble, Siri transcribes my words as well as a person would. Its understanding of those words, however, is less good. Simple commands confuse it and I can see the basic logic playing out. If you say the words “knock-knock” you’re going to get a knock-knock joke, even if you say “don’t tell me a knock-knock joke”.
I play a game with myself where I try to find the simplest way of triggering tasks. For example, rather than saying:
“Make a FaceTime Audio call to my mum on speaker”
I can say:
“FaceTime Audio. Mum. Speaker.” and get the same response. The logic is rudimentary: listening for specific words and triggering actions based on them. There is no understanding of the sentence around them.
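The behaviour I’m describing is consistent with plain keyword spotting, which can be sketched as follows. This is a guess at the flavour of the logic, not Siri’s actual implementation: the index, the intent names, the contact list, and `interpret` are all hypothetical.

```python
import re

# Hypothetical index of trigger phrases; the intent names are made up.
INDEX = [
    ("knock-knock", "tell_knock_knock_joke"),
    ("facetime audio", "facetime_audio_call"),
    ("timer", "set_timer"),
]

# Made-up contact names that are picked up wherever they appear.
CONTACTS = ["mum", "dan", "rob"]


def interpret(utterance):
    """Return (intent, contact, on_speaker) using nothing but word matching."""
    # Lowercase and replace punctuation with spaces, keeping hyphens.
    text = re.sub(r"[^a-z\- ]", " ", utterance.lower())
    words = text.split()
    intent = next((name for phrase, name in INDEX if phrase in text), None)
    contact = next((c for c in CONTACTS if c in words), None)
    return intent, contact, "speaker" in words


print(interpret("Make a facetime audio call to my mum on speaker"))
print(interpret("Facetime audio. Mum. Speaker."))
print(interpret("Don't tell me a knock-knock joke"))
```

The first two calls produce the same result because the filler words are never looked at, and the third still asks for a joke because nothing in the matching knows what “don’t” means.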
Siri has been built into iPhones since 2011, but when I ask friends, few use it. If I use it in their presence they laugh awkwardly or look at me like I’ve offered a ham sandwich to a vegetarian. If strangers are present they shuffle uncomfortably and try to make it known they don’t endorse this sort of behavior. Beyond the other issues I’ve mentioned, the social costs of using Siri are high.
However, now that I’ve got used to Siri, I find more opportunities for it. You need to put in work, not only to identify use cases but to form the habit of using them. Even now, as something of a connoisseur of Siri, I get caught out.
On the way back from seeing my parents my mum asks me to text her when I’m home to let her know I got back safely.
“Hey Siri, text my mum when I get home,” I say.
Siri pings acknowledgment. “Here’s your message to mum, shall I send it?” The message, of course, says: “When I get home”.