Apple Vision Pro — Fundamental Capabilities & Use Cases

Varun Torka
Technology & Product
7 min read · Apr 9, 2024

Let’s face it, the Apple Vision Pro has been dissected by tech gurus far and wide. You’ve likely seen countless reviews and analyses already. So for you, the reader, I am not sure if any of this will be new. This is about me tickling my own curiosity and forming my point of view on the matter.

Since this is Apple, to make it worth their time, and considering the investment & long gestation period needed to bring a device like this to market, they must see a potential total addressable market (TAM) of >$10B, if not >$100B. To have a TAM of this magnitude, we need to assume they're thinking of this as a multi-purpose computing platform: a computing platform with multiple broad use cases, not just narrow specializations. More like the iPhone than more focused products like the Apple Watch or AirPods.

Features & Capabilities

So, the big question is: how do we interact with a brand-new computing platform built for a 3D world? Let’s dive into the features and capabilities that are crucial for this interface.

1. Pointing & Clicking

For a 3D space, we’re still figuring out the right set of controls. Point & click is fundamental to most graphical interfaces, and it’d be even more important in the 3D space with objects at a distance.

The pointer can work in several ways, though. It can be driven by something I do with my own body, or by something I hold. Apple seems to have gone with the former, more difficult approach, leveraging the user's eyes & fingers; Oculus with the latter.

Apple has gone for a very new approach wherein your eyes themselves are the pointer, and the click is the index finger & thumb touching each other. While I haven't had the opportunity to try this myself, it sounds very cool & very sci-fi. It might be quite intuitive as well. But, not having tried it, I can't help thinking it is also a bit unnatural. My hands constantly do stuff without my eyes monitoring every step. I'd rather the controls be my hands, with my eyes involved in planning, a few steps ahead. Here, the eyes seem tied to the specific things that need doing, which is very different from how we normally operate.
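To make this concrete: below is a minimal sketch of how an app might receive this gaze-and-pinch "click", assuming visionOS's SwiftUI & RealityKit stack, where the system resolves the gaze target and delivers it as an ordinary spatial tap. The view and entity names here are my own illustrations, not Apple's sample code.

```swift
import SwiftUI
import RealityKit

// Minimal sketch: on visionOS, gaze + pinch arrives as a spatial tap
// on whatever entity the user was looking at.
struct PointerDemoView: View {
    var body: some View {
        RealityView { content in
            // A cube the user can "look at and pinch".
            let cube = ModelEntity(
                mesh: .generateBox(size: 0.2),
                materials: [SimpleMaterial(color: .blue, isMetallic: false)]
            )
            cube.name = "cube"
            cube.components.set(InputTargetComponent())   // make it tappable
            cube.generateCollisionShapes(recursive: true) // needed for hit-testing
            content.add(cube)
        }
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    // Fires when the user looks at the entity and pinches.
                    print("Tapped: \(value.entity.name)")
                }
        )
    }
}
```

Notably, the app never sees raw gaze data; it only learns which entity was "clicked", which is how Apple keeps eye tracking private.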

Compared to a 2D screen, there's a dimension of depth here. How do we handle it? One good & intuitive way could be to have the pointer work like a laser gun fired from the hand. This is about as natural as it gets, and it lets the full dexterity of the hand be used.

2. Providing high-fidelity input

More complex manipulations, like typing and drawing, should stay as close to their real-world analogues as possible. Currently, it seems these tasks need you to use a virtual or real keyboard. The virtual keyboard doesn't seem like an intuitive experience at all, given there's no tactile feedback. And in Apple's case, you have to use your eyes to operate it. Why not just write with a stylus, as you would with pencil on paper?

The perfect device here seems to be a pointer: no bigger than a pencil, with a few different buttons for different actions. That, coupled with voice recognition, feels like it would be a powerful combination.

3. The graphical coordinate system

How is different content arranged around the user? There are a few different options here.

It could be a sphere around the user, or multiple concentric spheres. This feels a bit unsettling, but it could be the default mode for applications that focus on productivity.

Another option is for the windows to be attached to a coordinate system independent of the user. This would make it appear as if the user is in a space they can walk through. Given Apple's focus on AR, this is probably where they want to go. They seem to have done an amazing job of 'freezing' windows in the coordinate space, so they feel real. This also allows windows to be seamlessly overlaid on the real world, which can be the bedrock for quite powerful AR experiences.
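As a rough illustration of that 'freezing', here is a minimal RealityKit sketch, assuming an immersive visionOS scene: AnchorEntity(world:) pins content to a fixed position in the room, so it stays put as the user walks around. The function name and placement values are illustrative.

```swift
import SwiftUI
import RealityKit

// Minimal sketch of world-locked content on visionOS; `content` is the
// RealityViewContent from a RealityView inside an immersive space.
func addWorldLockedPanel(to content: RealityViewContent) {
    let panel = ModelEntity(
        mesh: .generatePlane(width: 0.8, height: 0.5),
        materials: [SimpleMaterial(color: .white, isMetallic: false)]
    )

    // Anchored ~1.5 m in front of the scene origin, fixed in world
    // space; the panel keeps its place as the user moves around it.
    let worldAnchor = AnchorEntity(world: [0, 1.2, -1.5])
    worldAnchor.addChild(panel)
    content.add(worldAnchor)
}
```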

4. Future: Object detection & recognition on steroids, mapping the real world

This is not part of the current release, but I feel it is just around the corner. If the OS itself can map out the world & its objects, and allow apps to superimpose information on them, that would unlock a huge number of additional use cases. This would put AR on steroids. I discuss some of these use cases below.
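Some of the plumbing for this is already visible. As a hedged sketch, visionOS's ARKit exposes a SceneReconstructionProvider that streams mesh anchors describing the surfaces around the user; whether Apple extends this to full object recognition for third-party apps is my speculation.

```swift
import ARKit

// Hedged sketch: stream a live mesh of the user's surroundings on
// visionOS (requires an immersive space and user permission).
func mapSurroundings() async throws {
    let session = ARKitSession()
    let sceneReconstruction = SceneReconstructionProvider()

    try await session.run([sceneReconstruction])

    for await update in sceneReconstruction.anchorUpdates {
        // Each MeshAnchor covers a patch of real-world geometry that
        // an app could superimpose labels or content onto.
        print("Mesh anchor \(update.anchor.id): \(update.event)")
    }
}
```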

Hardware

The computing hardware itself (CPUs, GPUs, memory, networking/wifi, power/battery, etc.) doesn't fundamentally change in what it does, but it does need to support the glasses form factor. It needs to shrink in size & support the specific kinds of processing required (graphics and the like). Then there are the fast 3D cameras required to capture the world, and the sensors required to track head, eye, and hand movement. And of course, the amazing display, which can simulate a photorealistic version of the world within an inch of the eye.

What kind of use cases could come in?

The final form of AR, where we imagine it as a pair of glasses you're wearing that seamlessly blends the real world with information laid on top, is pretty compelling. But it's also quite far away. What kind of realistic use cases can come in before then, while the actual hardware is this massive headgear you have to wear?

1. Entertainment is a no-brainer

Already noted by everyone. With the amazing technical achievement this is, you can get access to a giant screen anywhere, anytime, with privacy. Now you can binge-watch Netflix in your airplane seat feeling like you're sitting in a theatre, without disturbing your neighbor.

But this is just the start, where we're still thinking in terms of existing form factors. What's interesting to imagine is all the new form factors this can enable. A great unlock would be a way to map any real space — your home, a mall — into a database, so that it can be overlaid with artificially generated content. This would let you move through the artificially generated content without having to worry about hitting anything or hurting yourself. Imagine converting your own house to feel like you're walking through ancient ruins, or a rainforest, or an escape-room adventure. Or the nearby mall into an Indiana Jones-style treasure hunt.

It is said that usage in pornography is an early indicator of the success of any new technology, and this device seems purpose-built for it. Imagine avatars, and real-time AR combined with what generative AI is capable of. I don't condone it, but it seems inevitable.

But entertainment is not just 'entertainment'. It also means making the dreary, well, less dreary. Regular tasks like vacuuming could be turned into a game & made more fun. Exercise could be an adventure. A normal run could be a Temple Run.

2. Education

Will people use these devices to consume educational content? Some subjects are better suited than others to take advantage of this form factor.

The theoretical subjects — physics, maths, economics — have less to gain. But any subject that can take advantage of superimposing information on real-world objects could be learned on steroids. Imagine you're trying to learn French, and an app gives you English & French translations of every text you see, plus the French word for every object you see. Imagine trying to learn botany, and you could go into a garden with the name of every plant displayed next to it. Imagine trying to learn architecture, and you could walk around a city with the style & structure of each building annotated in place.
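The recognition half of the French example is already feasible with today's tooling. Here's a minimal sketch using Apple's Vision framework to pull text out of a camera frame; the translation step and the AR overlay are left out, and the language settings are just illustrative.

```swift
import Vision

// Minimal sketch: recognize text in a frame so an app could overlay
// translations next to it. Input is a CGImage from a camera feed.
func recognizeText(in image: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["fr-FR", "en-US"] // illustrative choice

    let handler = VNImageRequestHandler(cgImage: image)
    try handler.perform([request])

    // Return the top candidate string for each detected text region.
    return request.results?.compactMap {
        $0.topCandidates(1).first?.string
    } ?? []
}
```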

3. What about Productivity use cases?

Well, a lot of it would depend on the fidelity of input it supports. For example, a phone supports input via two thumbs. That's not sufficient for creation work. But with the correct input system, the use cases become interesting.

The workstation offered by IT companies already costs more than a thousand dollars, with multiple monitors & Apple MacBooks. So replacing this with a headset, which offers even larger & more flexible workstations, with complete privacy, seems like a reasonable sell. Of course, the keyboard will need to stay to enable higher-fidelity input.

The attraction in 3D modeling is obvious, whether it is for animation, architecture, or engineering systems. You could visualize & edit the model in space, and magnify & shrink it at will.

Mechanical tasks are where the greatest unlock seems possible: people doing fieldwork, agricultural work, assembly-line work, and warehouse work. The AR system can map out the world to proactively guide people where to go next, and an image-analysis system can proactively detect & highlight any defects they need to address.

4. Communication

The new FaceTime experience is getting rave reviews. I'll have to try it to believe it, but currently, I don't feel like it's going to be such a transformative experience over video calling. Also, nothing's going to replace in-person meetings for personal relationships.

But corporate meetings are another matter. I can see how brainstorming could be much more effective if remote people can still feel they are in the same room with a whiteboard.

A note on EyeSight

Why try to recreate the user’s eyes on the outside? It can only mean Apple wants users who are wearing the Vision Pro to physically interact with others who are not wearing the Pro. Maybe to enable meetings where some attendees are in person and some are remote?

Whatever it is, it feels a bit impractical & far-fetched at the moment. We'll have to wait & see.
