Q&A on ARKit

All the best questions I was asked & you might be interested in

My previous 2 posts…

Why is ARKit better than the alternatives?

Why Apple’s glasses won’t include ARKit

resulted in a bunch of interesting followup questions being asked in the comments. I collated them here together with my responses. Hopefully this will illuminate some further aspects of ARKit while I work on my next post, around building AR-native apps today for Mobile AR.

I agree with your assessment that tracking is in a commoditisation phase. It’s time to move on to the hard problems of interaction that have been plaguing AR developers for some time now. This is where I’m aiming to work :-)

Thanks Philip. Interaction is a really fascinating and challenging space to attack. The “stack” isn’t really in place yet, and choosing where to work exposes you to platform shifts underneath you. AR Interaction still needs:

  • input (hardware & modes & multi-modal AIs)
  • GUI and OS
  • Apps and use-cases

I feel it’s a bit early to commit to the GUI & Apps side of interactions, but it’s definitely a great place to start learning. From my experience (and my wife’s; she’s an AR interaction designer currently running the Design Lab at Adobe, figuring out the future of immersive design) I think that the 2nd biggest problem (after input) is how you lay out content on a 3D scene which you don’t control and have never seen before, and make it legible (black text? what if the wall is black, or it’s night time? How do you stop it occluding something important? How big should the fonts be? Should it be 2D or 3D?) etc etc

Anyway… that’s what I’d focus on, with the expectation that a real market for this stuff is either

  • right now if you are working at a platform company (Hololens, Apple, Magic Leap etc)
  • a couple of years away if you want to build apps on these platforms

Apple purchased Metaio in May 2015 — an AR company with impressive tracking technology, which was the main competition to Vuforia at the time. It seems likely that Metaio tech has also contributed to ARKit?

It doesn’t seem like a coincidence that at the WWDC where ARKit was announced, Apple also released Metal 2 with enhanced support for vision and inferencing acceleration — along with several associated layered libraries. Do you think that part of the robustness of ARKit is due to Apple leveraging Metal 2 for GPU acceleration in their implementation?

If so — then the corresponding GPU API in Android — Vulkan — should start offering vision and inferencing compute acceleration ASAP?

Hint — I think so! :) Apple has the engineering advantage of a single vendor-controlled stack — but just like Windows, Android leverages the industry to reach 85% global market share. To enable the multi-vendor Android ecosystem to effectively deploy accelerated AR — stable vision and inferencing acceleration APIs will be key. The same applies to the Web stack.

Re Apple & Metaio… I don’t have any hard insider confirmation on this, but I think the Metaio codebase would have helped with the plane detection and probably helped with the mapping/relocalization pieces of the visual tracker. FlyBy had by far the best inertial tracker on the market, and it’s this piece that makes ARKit magic (instant stereo convergence & metric scale in particular). I think Metaio’s tech will add more value as Apple builds out a fuller SLAM stack on top of the VIO.

Re Metal and how that helps… we ported a lot of our Dekko code into GPGPU on Apple’s early systems (first to do so AFAIK) and spent ridiculous amounts of time on performance. Here’s what we learned:

  • with a great inertial system & good calibration, the visual system doesn’t really have to do a lot of processing. Doing some feature detection & bundle adjustment on the GPU helped, but most of the benefit came from hardware-accelerated 3D matrix manipulation (ie general 3D math functions). So for tracking there isn’t too much benefit.
  • The real benefit came when dealing with 3D reconstruction in particular (dense reconstruction on device is still state-of-the-art today) and also when relocalizing on a big tracking map (ie when tracking over large areas)

So re your comments re Android having 85% of the market share & being able to win long term with great acceleration…. I think what matters is:

  1. Clock-synched sensor hubs have to be in all devices (this is pretty much the case for new devices now)
  2. IMU calibration/modelling needs to be “included by default” by the OEM. This definitely adds to the BOM cost & manufacturing time of the device. Still an industry challenge IMO
  3. hardware acceleration will then take over and drive the AR UX value from here on. Specifically real-time dense mono 3D reconstruction and robust wide-area localization come first, with 3D semantic scene understanding after that.
  4. power consumption will be an even bigger issue for HMDs than it is for phones, due to the weight sensitivity of the wearable affecting battery size. Whether this means a new type of CV/GPU chip (Movidius?) or specialized ASICs (Hololens HPU) or complex CPU/GPU systems (Qualcomm 835) is hard to predict right now.

BTW — re Vulkan and WebVR/WebAR in particular, I would love to talk some more with you about ways that this could come to market. I am talking with a number of OEMs and CV research teams that would look favorably on a “chromebook” style HMD which boots straight into WebVR/AR without an “OS”. Biggest/only gap in that plan is the ability to run robust consumer grade tracking & reconstruction at native performance…

I agree with your point at the end about “AR won’t be a 4x6 grid of app icons.” I think we’ll need some way of seeing many apps together at the same time; this may result from new OS UI structures, but I’m currently bullish on new “user agents” for the web (essentially, AR browsers, but for real!)

The UI for AR question fascinates me. Of all the “challenges” facing AR that one feels like The Great Unknown. My feeling is it will

  1. be mostly determined by whatever the ultimate input system ends up being
  2. be highly skeuomorphic (at least in the beginning). We found we needed to do this at Dekko & Samsung for people to figure out intuitively how to use the virtual affordances. Eg the AR “Phone” “app” probably will look like a physical phone that you pick up & dial
  3. be pretty basic & much closer to what we already understand (ie multi-touch on phones) than what most of us would like it to be

Am wondering what your thoughts are regarding the future of/timeline for mixed reality and things like LiDAR integration. I’m absolutely impatient for multiple intake sensors with higher resolutions streaming into user-friendly annotative and annotatable systems sans decimation.

I don’t think LiDAR will find its way into Mobile/HMD AR due to the power it needs. Putting aside the hardware, I think what you are after is the system being able to provide the app developer / content creator with some 3D awareness of the real world scene you are looking at, and to do that in a high-resolution way. This is referred to as a Dense 3D reconstruction, but instead of pixels we use voxels, which are just 3D pixels. Generally “dense” means there is a voxel for each cubic inch or so (or a voxel for every pixel on the 2D camera sensor). Higher resolution is possible, but for outdoor or room-scale scenes even 1-inch voxels are high enough res for a great UX.
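To make the resolution trade-off concrete, here’s a back-of-envelope sketch (my own illustrative numbers and function, not from any shipping system; real systems use sparse structures like TSDF volumes or octrees rather than a flat grid) of how voxel count and memory scale for a room-scale scene:

```python
# Back-of-envelope voxel budget for a dense reconstruction of a room.
# Illustrative only: assumes a naive dense grid with 4 bytes per voxel.

INCH = 0.0254  # metres

def voxel_budget(width_m, depth_m, height_m, voxel_size_m, bytes_per_voxel=4):
    nx = int(width_m / voxel_size_m)
    ny = int(depth_m / voxel_size_m)
    nz = int(height_m / voxel_size_m)
    count = nx * ny * nz
    return count, count * bytes_per_voxel

# A 5m x 5m x 3m room at 1-inch voxels...
count, mem = voxel_budget(5, 5, 3, 1 * INCH)
print(f"1-inch voxels: {count:,} voxels, {mem / 1e6:.0f} MB")

# ...versus coarse ~8-inch voxels (roughly Hololens-scale geometry)
count, mem = voxel_budget(5, 5, 3, 8 * INCH)
print(f"8-inch voxels: {count:,} voxels, {mem / 1e6:.2f} MB")
```

The gap (millions of voxels versus a few thousand) is a big part of why dense mono reconstruction on a phone is still so demanding.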

To provide this 3D awareness, the system needs to provide both the “Geometry” and “Semantics” of the world ie both the shape of your real couch, and some labeling that it is in fact a couch.

Dense 3D reconstructions are possible today using 2 cameras (1 regular RGB and one active Depth camera) which is how Tango & Hololens do it. However both systems only provide a simplified geometry (Hololens voxels are about 6–12 inches in size). Being able to do a dense reconstruction using just a single RGB camera is still in research and at least 12 months away from being in products. The processing power required is beyond what even the best mobile devices can handle today.

Semantic understanding of 3D scenes is also an area of active research and also 12 months or so away from products (you’ll see demos before then)

Apple’s (probable) early ‘dominance’ of smartphone-based AR doesn’t really affect non-iOS users (hence the vast majority of smartphone users). They don’t have high-end AR now and won’t for a while. So what does ARKit mean for the Android ecosystem? Will Google have to change course and offer a less costly version of Tango (without the fisheye lens)?

I honestly think the real value & benefit of ARKit is that it will show the OEMs (ie Android & HMDs) what is possible when the hardware is there, along with really helping developers move up the learning curve. Honestly an app which pushes ARKit to the limit still isn’t a very good AR app due to handheld form factor, no 3D reconstruction, poor input UX etc etc. But only thanks to ARKit are devs really able to understand what else is needed beyond great tracking.

Whether it’s Tango-lite or something else, the Android OEMs are all looking for their own versions of ARKit. It will be a war for developers to use their SDKs.

Although I’ve been looking closely at Mobile AR/SLAM for about a year (experimenting with Tango, Kudan, and ARKit), your detailed explanations, insights and great storytelling are incredibly helpful for connecting the dots together!

A couple questions:

  • Does ARKit use the second camera on the 7s device?
  • What exact algorithms are being used in ARKit/Tango to avoid VIO drift due to moving objects?

Re your questions:

  • I don’t believe it uses the second camera [edit: an Apple source privately said it does NOT use the second camera, at least in this version], though it could in theory use it for extra accuracy when creating the initial 3D map. FlyBy’s original system didn’t use the 2nd camera. As well as using IMU dead reckoning to initialize, the 2nd camera could provide a second depth estimate from stereo. But as ARKit works on mono RGB phones, it would need a special feature to support this on stereo hardware, so IMO it’s unlikely, but not impossible
  • I don’t completely understand what you mean by “drifting due to moving objects”. I’ll assume you mean: how does the system keep a virtual object registered in place while the scene itself contains moving objects (eg people walking around in the background etc)? It does this through a couple of methods:
  1. The optical system can be confused about whether the device itself or the scene is moving. Eg when you are sitting on a stationary train beside another train: when movement starts you are not immediately sure if it’s your train or the other one that is moving. This problem is eliminated by checking whether the IMU is reporting any acceleration. If not, then you are still & the scene is moving.
  2. usually only parts of the scene are moving, eg people move but the ground & buildings are still. The system will be observing points in the scene, some of which are observed repeatedly in the same place, others which aren’t where they were a moment ago. It gives each point a “reliability score” to measure whether that point can be used to reliably calculate the pose of the device. What happens is that only the points that are stable end up being used, and anything in the scene that is moving (or gets occluded by a moving object) gets ignored.

[Depth Cameras]…also don’t work outdoors as the background Infrared scatter from sunlight washes out the IR from the depth camera.

Not necessarily. See here:

“Even though we’re not sending a huge amount of photons, at short time scales, we’re sending a lot more energy to that spot than the energy sent by the sun”


Agree it’s not 100% clear cut. If you throw enough power at the emitter you can get it to work in decent sunlight. You can also use stereo IR receivers (like Intel RealSense did with one of theirs) and get decent outdoor results at range. The issue ultimately comes down to the trade-offs between power, BOM cost, and statistically how many outdoor scenarios it works in to consumer quality… so far those tradeoffs haven’t made it attractive to put a depth camera into a consumer device.

Great article! Finally someone pointed out that computer vision is going to replace depth sensors. This is not far away given the rise of synthetic data.

I’m not 100% sure that active IR sensors are going away for depth. They handle some things RGB just can’t such as dark rooms, monochrome surfaces, depth when the device is still (eg device sitting on a desk, can get 3D map of face), and IR might help with some bio tracking use-cases. These are all kind-of corner cases, but there may be enough utility in them to justify BOM and space costs to include a depth sensor in the phone, but just not firing it very often to save on battery.

I’m not really sure. But I am sure that the primary use of depth cameras today (capturing a 3D model of objects or spaces) will be done by mono RGB and a bit of CNN.

We are about to start developing an AR app for the construction industry integrating Revit models with AR. Would love your view on whether to go the ARkit or Tango route?

I’d advise to make the decision based on:

  • do you expect your customers to download an app (go ARKit) or buy a solution from you (go Tango & sell App+Phone+support+training etc)?
  • Does your content need to interact with the 3D world (ie the building plans in 3D)? Only Tango supports that today. May be some time (a year or so) before ARKit does

Not knowing anything much about your company, I’d make a simple recommendation to go with Tango, aim to sell a solution to the customer (ie for hundreds or thousands of dollars per deal) and figure out your product in the market asap. If you can’t deliver thousands of $$ in benefit you may not have a really disruptive product. When ARKit supports the tech you need, then port to that, and maybe offer a lower cost “self-service” app to increase your reach into the market.

What do you see as the eventual solution for syncing reference frames across multiple HMDs/devices? Many apps will require that many users are able to look at and interact with shared virtual objects, which would require global coordinates for the devices involved. Will this be enabled by optical sensors + complex processing alone, or will something else be required to get the relative positioning of various devices?

On a related note, do you see additional sensor platforms being integrated into future HMDs such as lidar/radar? Or is the future entirely camera-based? It seems to me that there are some insurmountable obstacles to an optical-only approach (i.e. reduced performance in dimly-lit or night environments).

Re the Multi-user challenge. We built solutions for this at Dekko and again at Samsung, so I have a good handle on what’s needed. Basically the systems need to be using absolute coordinates (ie Lat/Long), not relative coordinates (start at 0,0,0), and then the systems just share their position with each other (exactly like a multiplayer online game does — the sharing & updating of positions in real-time was solved long ago by the game industry). The hard part is getting accurate absolute coordinates via localization. There’s no single solution to this problem, and no one has solved it afaik outside the military, who have access to more accurate GPS and IMUs than consumers. Basically the system will turn on, start itself at 0,0,0, get a GPS reading (which is only accurate to 10–20m) and then start to converge the VIO and GPS systems (similar to how the optical & inertial systems converge, though it will take longer). As well as this, the system will probably grab a camera frame or two and attempt to localize using the skyline or landmarks against a GIS database, or some deep-learning-driven matching of the scene, the result of which could also improve the accuracy of your absolute position. Ultimately we need pixel-level location. I can see how it can be done today, but making it “instant” is a big challenge.
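A minimal sketch of the coordinate-sharing idea (my own illustration, not any shipping system’s API): each device tracks in its own VIO frame starting at (0,0,0), and an estimated anchor (lat, long, heading) maps local poses into shared absolute coordinates. In practice that anchor starts 10–20m off (raw GPS) and is refined continuously as the systems converge.

```python
import math

# Toy local-ENU -> lat/long conversion around a GPS anchor.
# Assumes a locally flat Earth (fine over tens of metres) and that the
# anchor position + heading have already been estimated.

EARTH_RADIUS = 6_378_137.0  # metres (WGS84 equatorial radius)

def local_to_latlong(x_east, y_north, anchor_lat, anchor_lon, heading_deg=0.0):
    # Rotate the device's local frame into true east/north by the heading.
    h = math.radians(heading_deg)
    east = x_east * math.cos(h) - y_north * math.sin(h)
    north = x_east * math.sin(h) + y_north * math.cos(h)
    # Small-angle conversion of metres to degrees at this latitude.
    dlat = math.degrees(north / EARTH_RADIUS)
    dlon = math.degrees(east / (EARTH_RADIUS * math.cos(math.radians(anchor_lat))))
    return anchor_lat + dlat, anchor_lon + dlon

# Two devices sharing the same anchor can now exchange absolute
# positions, exactly like multiplayer game state.
lat, lon = local_to_latlong(10.0, 0.0, 37.7749, -122.4194)
```

The real difficulty isn’t this transform, it’s estimating the anchor accurately; that’s the localization problem described above.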

Honestly I don’t see lidar/radar being used on HMDs due to power & weight. They will stick to VIO on device. Maybe some devices will have a depth cam used lightly. GPU will be used to support these sensors for localization & 3D reconstruction, which are themselves supported by Cloud based 3D maps and training data.

If the calibration is the key to this, why did Apple go to the trouble of doing it for the 6S/SE? Given the concurrent release of CoreML, is it not possible there is a larger ML component to ARKit (which is why it’s tied to a processor generation rather than a production date)?

re supporting the various devices, I’m not sure how much of the work would apply from one device to the next, but I know it wouldn’t mean starting from scratch each time.

Apple isn’t using Metal or CoreML for the VIO. There’s no need as the CPU processing requirements are quite low due to the accuracy of the inertial system. Apple also wants to keep GPU headroom available for games/apps (and future 3D reconstruction & rendering features)

Great article. I explored this a little also and I think that some of the tech is related to Apple’s “Focus Pixels.” I wrote about it here:

I had heard of this concept in theory but didn’t know if it had ever been applied. I can see how it could help with depth estimation, but don’t know how accurate it would be (I assume the accuracy error would vary depending on how far away the object is from the camera), nor how robust (eg does the device need to be perfectly still, so the pixels the depth is calculated for don’t change?). Being able to apply techniques like this to the tracker is certainly an advantage for the companies who can best integrate software and hardware.

I agree with the spectrum of value propositions you methodically point out, from technology to human design. I agree with your leaning toward form factor and communications as critical for mass adoption.

A couple additional high value possibilities to come :

  • Making people look up again, away from devices and toward each other. There is a potential to reverse the systemic devolution and disruption of all manner of “traditional” social interaction today. That said, folks may blissfully stream feeds and analytics over it all. Our collective device & social apps addiction is like a digital opioid crisis. That said, I fully admit I love tech today and can’t live without it. I can, however, turn it off. Unlike our kids.
  • The mid-term future holds the potential of displaying and interacting with fantasy in our everyday lives, when we want that. And I want that. People love to escape. Escapism takes many forms, and there lies the opportunity to create a remarkable, imaginative, wonder-filled world beyond the likes of anything Walt Disney himself could ever imagine (actually, he probably could… and did!). In this regard, I feel that we are on the brink of a revolution which greatly expands our collective creative imaginations. That part will be a gift to our kids.

Anyhow, Matt, we can try to game out when iterations of part stack and full stack appear and from whom and with what ultimate strategy and market in mind. But personally I am super excited that developers are finally starting to cut their teeth prototyping future uses and experience scenarios. Everyone’s work counts in this regard.

Thanks John, those are two really good points. Re the first point, I think about (but have no answers to) whether eyes-up will mean all that much when we’re still distracted. We could be talking face to face, but if I’m daydreaming then I may as well not be there (and it’s even worse for you as you *think* I’m paying attention). It’s a really interesting opportunity that AR gives us to bring some proactive design to those interactions, and I’m looking forward to the experiments that start to emerge

Your second point is one I love, and completely agree with your thinking. It’s honestly not how most of the AR industry thinks (too much talk about information overlays), but the potential of AR to bring a touch of magic to our worlds (a tiny elf that lives in the corner of my office behind the plants?), or getting my e-mail delivered by a Hogwarts owl, is just going to make our lives (literally) wonderful. It’s something we tried to make happen with our Dekko Monkey, but we were too early for the tech to support it. I always think of Roger Rabbit as a better cinematic preview of what AR could be than Terminator or Iron Man. I can’t wait. I love that the technology has come to the point where it’s now an enabler to creators

What about “compute tethering” like Apple does with the Watch currently? CPU/GPU processing for ML and vision on the iPhone, fused with 6DOF sensor data from the Glasses. The Glasses are merely a rendering target and display.

Wireless data throughput is an issue, so the Glasses would need their own GPU to do the final rendering from a compressed data stream, sent from the iPhone, of what to render.

I think your points on what is possible are all realistic from a technology point of view. I definitely think the iPhone will for a long time be the ‘external processor’ as you suggest. I just don’t see (and this is just my opinion, not based on any special knowledge) that Apple will go the way of packing the hmd with tech & sensors on day 1 from a design point of view. I think they’ll do the absolute minimum technically so that the customer is gradually taken on a journey from just wearing a “dumb” lens through to full AR. All the AR Input/UX/GUI/App design challenges are still a long way from being solved, even if the underlying systems technology can deliver the form factor and tracking etc.

Note from a tech point of view, on how much can be offloaded to the ‘remote’ phone/box: I believe from an ops-per-watt pov the pose calculations could be done locally on the HMD with a custom ASIC, but rendering would need to be done on the phone, and drawing the screen fast enough would be a real bandwidth & throughput challenge with any local wireless solutions today. We saw with Nitero that this itself is a hard problem… though a cable would address that (99% sure that Magic Leap v1 will be something along those lines)
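The throughput problem is easy to see with rough arithmetic (my own illustrative assumptions: two 1920x1080 eyes, 24-bit colour, 90 Hz, no compression):

```python
# Back-of-envelope wireless bandwidth for streaming rendered frames
# from a phone to a HMD. Assumptions above are illustrative, not any
# particular product's specs.

width, height = 1920, 1080
eyes = 2
bytes_per_pixel = 3   # 24-bit RGB
fps = 90

bits_per_second = width * height * eyes * bytes_per_pixel * 8 * fps
print(f"Uncompressed: {bits_per_second / 1e9:.1f} Gbps")
```

That lands around 9 Gbps uncompressed, well above what 802.11ad-class wireless links deliver in practice, which is why compression on the phone (and hence a decoder/GPU on the glasses), or simply a cable, comes into play.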

Digital+physical will be great, but in the meantime here are 3 advantages of simpler tabletop 3D: 1) in social/multiuser, it’s very easy and satisfying to navigate the same physical space together, 2) zooming/rotating are powerful ways to change context and are simply leaning/turning in VR/AR, 3) the tabletop world can react to you in ways that make you feel like you’re part of it (imagine Disney’s Magic Bench as a Magic Table). Granted, most software doesn’t prove these advantages yet.

Those are really good points. We experienced the benefits of social AR with our tabletop game. I don’t believe (at least from our testing) that just zooming in & controlling the camera are enough for tabletop AR. It takes some effort to hold the device up, and people are lazy. Am on the fence on that one. One other way that tabletop can work is if the phone is not only used as the lens/window to look through, but also as a controller, so you need to move the controller close to some content in order to “push” it as part of gameplay. Or similar things. All the above IMO “technically” fall under “interacting with the physical world”. I’ll cover these in more detail in my next post

In 1994 Paul Milgram and Fumio Kishino defined a mixed reality as “…anywhere between the extrema of the virtuality continuum.”

Mixed reality was also defined in Doug A. Bowman’s 2004 textbook 3D User Interfaces: Theory and Practice as “… a continuum including both VEs and AR. An environment’s position on the continuum indicates the level of “virtuality” in the environment…” Bowman was hired by Apple in early 2016.

While today MS has certainly co-opted it for marketing purposes, the term itself represents a broad and extremely interesting concept worth reading about. Words are just words, and it’s the least important aspect in the end, but I thought I’d share this with you anyway in case you didn’t know.

Getting into definitions of AR etc is usually a topic I try to avoid, and definitely avoid getting pedantic about with people until they can conceptualize the actual user experience. I’m glad more & more people are becoming aware of Milgram’s continuum (and eventually also my friend Ron Azuma’s definition of AR). Those are certainly the “correct” definitions IMO. I’m finding people are gradually figuring it out (even some marketing departments — Google is talking about spatial & immersive computing, and referring to Google Glass now as a HUD!). Doug Bowman & my SV partner Mark Billinghurst are decades-old friends, both world experts in similar domains, so you could say I’ve got a good sense of what Doug’s been up to broadly, though of course I avoided specifics in this post, and afaik he hasn’t shared any hard details of what he worked on at Apple.