Dawn of the AR Cloud

Why Google’s Cloud Anchors validate the importance of AR Cloud services and why being first doesn’t mean all that much.

Since Apple’s WWDC conference this time last year, which fired the starting gun for consumer AR with the launch of ARKit, we’ve seen every big platform announce an AR strategy: Google’s ARCore; Facebook’s camera platform; Amazon Sumerian; and Microsoft continuing to build out its Mixed Reality ecosystem. We’ve also seen thousands of developers experiment with AR Apps but very little uptake with consumers. Back in September 2017, I predicted that AR Apps would struggle for engagement without the AR Cloud, and this has certainly turned out to be the case. However, we are now witnessing the dawn of the cloud services that will unlock compelling capabilities for AR developers, but only if cloud providers get *their* UX right. It’s not about being first to market, but first to achieving a consumer-grade UX.

Does anyone remember AR before ARKit & ARCore? It technically worked, but the UX was clunky. You needed a printed marker, or to hold & move the phone carefully to get started; after that it worked pretty well. Nice demo videos were made showing the final working experience, which wowed people. The result: zero uptake. Solving the technical problem (even if quite a hard technical problem) turned out to be very different to achieving a UX that consumers could use. It wasn’t until ARKit launched that a “just works” UX for basic AR was available (and this was 10 years after Mobile SLAM was invented in the Oxford Active Vision Lab, which Victor Prisacariu, my 6D.ai cofounder, leads).

We are entering a similar time with the AR Cloud. The term came about in a September 2017 conversation between Ori Inbar and me as a way to describe a set of computer vision infrastructure problems that needed to be solved in order for AR Apps to become compelling. After a number of early startups saw the value in the term (and, more importantly, the value of solving these problems), we are now seeing the largest AR platforms start to adopt this language in recognition of the problems being critically important. I’m hearing solid rumors Google won’t be the last multi-billion-dollar company to adopt AR Cloud language in 2018.

Multi-player AR (and AR Cloud features in general) has the same challenge as basic 6DoF AR: unless the UX is nailed, early enthusiast developers will have fun building & making demo videos, but users won’t be bothered to use it. I’ve built multi-player AR systems several times over the last 10 years, and worked with UX designers on my teams to user-test the SLAM aspects of the UX quite extensively. It wasn’t that hard to figure out what the UX needed to deliver:

  1. Recognize that people won’t jump through hoops. The app shouldn’t require asking Players 2, 3, 4 etc. to “first come & stand next to me” or “type in some info”. Syncing SLAM systems needs to just work from wherever the users are standing when they want to join, i.e. from any relative angle or distance between players.
  2. Eliminate or minimize “pre-scanning”, especially if the user doesn’t understand why it’s needed, or isn’t given feedback on whether they are doing it right.
  3. Once the systems have synchronized (ie relocalized into a shared set of world coordinates) the content needs to have accurate alignment. This means both systems agree that a common virtual x,y,z point matches exactly the same point in the real world. Generally, being a couple of cm off between devices is OK in terms of user perception. However when (eventually) occlusion meshes are shared, any alignment errors are very noticeable, as content is “clipped” just before it passes behind the physical object. It’s important to note that the underlying ARCore and ARKit trackers are only accurate to about 3–5cm, so getting better alignment than that is currently impossible for any multiplayer relocalizer system.
  4. The user shouldn’t have to wait. Syncing coordinate systems should be instant and take zero clicks. Ideally instant means a fraction of a second, but as any mobile app designer will tell you, users will be patient for up to 2–3 seconds before feeling like the system is too slow.
  5. The multiplayer experience should work cross-platform, and the UX should be consistent across devices.
  6. Data stewardship matters. Stewardship refers to “the careful and responsible management of something entrusted to one’s care”, and this is the word we are using at 6D.ai when we think about AR Cloud data. Users are entrusting it to our care. This is a bigger & bigger issue as people start to understand that saved data can be used for things that weren’t explained up-front, or that it can be hacked and used criminally. However, people are also generally OK with the bargain that “I’ll share some data if I get a benefit in return”. Problems arise when companies are misleading or incompetent with respect to this bargain, rather than transparent.
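To make point 3 concrete, here is a minimal sketch of what “agreeing on a shared coordinate frame” means in code. Everything here is a toy with hypothetical names, not any real SDK API: once relocalization produces a rigid transform from Player 2’s local frame into Player 1’s, any virtual point can be mapped across, and the residual distance between where the two devices place it can be compared against the tracker’s roughly 3–5cm noise floor.

```python
import math

# Toy rigid transform (3x3 rotation + translation); hypothetical names,
# not any real SDK API.
def apply_transform(rotation, translation, point):
    """Map a point from Player 2's local frame into Player 1's frame."""
    return tuple(
        sum(rotation[row][i] * point[i] for i in range(3)) + translation[row]
        for row in range(3)
    )

def alignment_error_m(p, q):
    """Distance (meters) between where each device believes a shared
    virtual point sits in the real world."""
    return math.dist(p, q)

# A 90-degree yaw between the two devices' arbitrary start orientations,
# plus a 1 m offset along x.
rotation = [[0.0, -1.0, 0.0],
            [1.0,  0.0, 0.0],
            [0.0,  0.0, 1.0]]
translation = (1.0, 0.0, 0.0)

anchor_in_p1 = apply_transform(rotation, translation, (0.5, 0.25, 0.0))

# The underlying trackers are only ~3-5 cm accurate, so no relocalizer
# can align devices more tightly than that noise floor.
TRACKER_NOISE_M = 0.05
print(alignment_error_m(anchor_in_p1, (0.75, 0.5, 0.02)) <= TRACKER_NOISE_M)
```

A 2 cm residual, as here, sits inside the tracker noise and is imperceptible for floating content, but becomes visible the moment shared occlusion meshes clip content at slightly different places on each device.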

So… putting aside all the application-level aspects of a multi-player UI (such as the lobby buttons & selector list to choose to join the game), the SLAM-synch piece isn’t just a checkbox; it’s a UX in and of itself. If that UX doesn’t deliver on “just works”, users won’t even bother to get to the app level a second time. They will try once out of curiosity, though… which means that market observers shouldn’t pay attention to AR app downloads or registered users, but to repeat usage.

Enabling developers to build engaging AR Apps is where AR Cloud companies need to focus, by solving the hard technical problems to enable AR-First apps that are native to AR. This means (as I have learnt painfully several times) that UX comes first. Even though we are a deep-tech computer vision company, the UX of the way those computer vision systems work is what matters, not just whether they work at all.

What are Google’s “Cloud Anchors”?

At Google I/O last week, Google announced ARCore 1.2, an update which included a handful of new features, the most notable being support for AR multiplayer via a technology called “Cloud Anchors”. Everyone registered the headlines (“Multiplayer” and “iOS support”), but there wasn’t any discussion (or even any real demo) of the Cloud Anchor UX. There were some nice-looking demos of multiplayer running *after* the setup UX was completed… funny that.

A nice image from Google showing how 2 devices can see the same AR content above the vase

So how does Google do AR Multiplayer? (Note: I don’t have any special insight into the algorithms Google is using; however, I’ve built AR multiplayer systems in the past, and at a high level the steps are clear & well known. It’s deep in the relocalization algorithms themselves where the advances are taking place.) These are the high-level steps involved:

  • Player 1 starts an app, then clicks a button saying they want to “host” a game. This triggers an upload of some “visual image data” to Google’s cloud. Google is carefully non-specific about what exactly is uploaded, but the references to visual data, the processing being done in the cloud (vs on-device), and the emphasis that it is discarded over time strongly imply there is some personally identifying data (ie part of a photo from your camera) uploaded automatically to Google.
  • Player 1 also has to manually enter a “room number” to identify the anchor (map patch) amongst potentially several small anchors on the same wifi SSID. This is probably due to the physical coverage area of an anchor being reasonably small.
  • Google’s cloud then processes this uploaded visual data to create an “Anchor” which is a type of 3D point cloud (or more accurately a SLAM map which is a point cloud & some recognizable descriptions of those points, plus maybe even some keyframe images), which is stored in Google’s cloud.
  • Player 2 then comes along and asks to join a game, or “resolve” an Anchor. Player 1 needs to tell Player 2 the room number (I assume verbally) so the correct anchor is downloaded.
  • This triggers an upload of more visual descriptor data from Player 2’s phone to Google, which processes that data and tries to match the two data sets (maps). This process of matching point clouds is called “relocalization”, and it’s easy to do if the point clouds are very similar to each other & small (ie captured from the same place, and covering a small area), but gets very hard when they are different and large (ie Player 1 & Player 2 are standing apart from each other, and the physical area to be supported is large). I wrote a long post in Feb explaining why this is hard.
  • If the resolving/relocalizing is successful, a “transform matrix” is passed back to Player 2. This is just a way to tell Player 2's phone exactly where it is in 3D space in relation to Player 1, so Player 2's graphics can be rendered appropriately to look “right”, in the same physical spot where Player 1 sees them.
  • There’s no information or guidance given to Player 2 on how to improve the quality or likelihood of a successful “resolve”.
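The steps above can be sketched as a toy host/resolve service. All names here are hypothetical; this is not Google’s actual API, just the shape of the protocol, with set overlap standing in for real point-cloud relocalization:

```python
import uuid

class ToyAnchorService:
    """In-memory stand-in for the cloud side of a host/resolve flow."""

    def __init__(self):
        self._anchors = {}  # room number -> stored "map" data

    def host(self, room, visual_data):
        """Player 1 uploads visual data; the service builds an anchor."""
        anchor_id = str(uuid.uuid4())
        self._anchors[room] = {"id": anchor_id, "map": set(visual_data)}
        return anchor_id

    def resolve(self, room, visual_data):
        """Player 2 uploads their own visual data; if enough of it matches
        the stored anchor (relocalization succeeds), return a transform."""
        anchor = self._anchors.get(room)
        if anchor is None:
            return None
        overlap = len(anchor["map"] & set(visual_data)) / len(anchor["map"])
        if overlap < 0.5:  # arbitrary threshold for this toy example
            return None    # resolve fails: viewpoints too different
        # A real system returns a 6DoF transform; the identity stands in.
        return {"anchor_id": anchor["id"], "transform": "identity"}

service = ToyAnchorService()
service.host(room=42, visual_data=["f1", "f2", "f3", "f4"])

# Player 2 standing near Player 1 sees mostly the same features: success.
print(service.resolve(42, ["f1", "f2", "f3", "x"]) is not None)
# Player 2 across the room sees almost none of them: failure.
print(service.resolve(42, ["y", "z"]) is None)
```

The toy makes the UX failure mode visible: whether `resolve` succeeds depends entirely on how much of the scene Player 2’s view shares with what Player 1 scanned, which is exactly why viewing angle and distance matter so much in practice.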

So far so good. There’s nothing really surprising or technically impressive here. In fact, Google has been able to do this for several years via Tango’s ADF files (which are a form of Anchor), though it was a manual process.

What was surprising:

  1. Google waited so long, as they have had the tech to do this for years as part of Tango (as with ARCore 6DoF position tracking itself). Considering the product doesn’t appear to advance the science, this appears to be an effort by Google to get back on the marketing front foot after being caught off-guard last year by Apple’s ARKit announcement. It’s interesting to note there are rumors that peer-to-peer multiplayer will be announced by Apple at WWDC this year, so this was necessary marketing on Google’s part. It’s also interesting to hear that Apple’s multi-player will likely be a peer-to-peer system to avoid dealing with privacy (and the cloud) at all. This means a whole other set of UX compromises that Apple has to manage.
  2. The approach Google took of doing the processing in the cloud vs on-device. This obviously creates a potential privacy problem around uploading visual data. They then discard the data (not immediately, but after up to 7 days). This means no content is really persistent, and thus the multiplayer UX is constrained to a handful of use-cases. The bigger problem is that it also means the anchors don’t improve over time as they are used, and thus the UX doesn’t improve as more people use it. It also means anchors covering large physical areas aren’t supported.

One thing that really stood out to me, having worked on these problems for so many years, is that Google didn’t talk about the multiplayer UX at all, only that the technology for multiplayer exists. Demonstrating an understanding of why devs & end users have struggled with multi-player would have allowed them to show how those UX problems, not just the technology problems, have been solved.

So how does Google’s Cloud Anchor system measure up?

It’s important to point out that I can’t claim to be completely impartial with respect to this write-up, as my startup 6D.ai is building a similar service.

After spending a couple of days using Google’s Cloud Anchors, and talking with other experts who did their own testing, we were able to get a good handle on the UX and the limits of the system.

The first UX challenge, and by far the biggest one for anyone working on building solutions to these problems, is that there is no pre-existing map data for the scene, and thus “Player 1” needs to pre-scan the scene to gather image data in order to build an anchor.

Here’s what we learned:

  • First we did only a small “pre-scan”, as this helps estimate how far away you can be from a previously observed point, to check on robustness. We were disappointed to find that the system worked about as well as any other current relocalizer: Player 2 could “resolve” or relocalize successfully from about 30 degrees and up to 3 ft away from a pre-scanned area (and pre-scanned areas seemed to max out at about 1/3 of a room). Effectively this means Player 2 has to come & stand beside Player 1 to start, or Player 1 has to spend quite some time pre-scanning the scene from many angles (a minute or more walking around) before starting a game/app.
  • We were overly optimistic initially and tested using a checkerboard pattern, which is notoriously difficult for all SLAM systems due to the repetitive patterns. No surprise that this pretty much “broke” ARKit itself and also Cloud Anchors (and our system too).
  • When Player 1 spent time pre-scanning, AND Player 2 also spent some time pre-scanning before attempting to resolve, the best results were achieved, and this was what we would have hoped to see: accurate and easy multiplayer. It’s just quite difficult for both players to go through the hassle of getting to this result. We are computer vision experts, and we had to figure it out for ourselves; the system didn’t give any guidance (especially for Player 2).

Even after this experimenting, there are still a couple of things we don’t know:
- Exactly what data is uploaded, and why does it need to be discarded? Google is carefully vague here. A journalist friend described it as “twisting themselves into a privacy pretzel”.
- What about China, or private premises (like a military base)? Google cloud services are unavailable in China, and Cloud Anchors seem to depend completely on accessing Google’s cloud (ie there is no offline mode).

I’m also curious about what this “100% cloud” approach portends for the future direction of ARCore, as persistence, occlusion & semantics move closer towards public release in the next couple of years.

The Bigger Picture — Privacy and AR Cloud data…

When it comes to Google’s Cloud Anchors, visual image data is sent up to Google’s servers. It’s a reasonably safe assumption that this can potentially be reverse engineered back into personally identifiable images (Google was carefully vague in their description, so I’m assuming that’s because if it was truly anonymous they would have said so clearly).

This is the source image data which should never leave the phone, and never be saved to the phone or kept in memory. This is the type of personally identifiable visual image data that you *don’t* want to be saved or recoverable from the AR Cloud provider. Google says they do not upload the video frames, but rather descriptors of feature points (see below), which could be reverse engineered into an image.

For the future of the AR Cloud’s ability to deliver persistence & relocalization, visual image data should never leave the phone, and in fact never even be stored on the phone. My opinion is that all the necessary processing should be executed on-device in real-time. With the user’s permission, all that should be uploaded is the post-processed sparse point map & feature descriptors, which cannot be reverse engineered. An interesting challenge that we (and others) are working through is that as devices develop the ability to capture, aggregate & save dense point clouds, meshes and photorealistic textures, there is more & more value in the product the more “recognizable” the captured data is. We believe this will require new semantic approaches to 3D data segmentation & spatial identification, in order to give users appropriate levels of control over their data, and this is an area our Oxford research group is exploring.
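As a sketch of that on-device principle (all names hypothetical; SHA-256 hashing stands in for a real feature descriptor, though note that real descriptors must also remain matchable under viewpoint change, which a plain hash is not): the raw frame is consumed inside one function and never stored or transmitted, and only sparse points plus one-way descriptors are returned for upload.

```python
import hashlib

def process_frame_on_device(frame_pixels, keypoints):
    """Runs entirely on-device. The raw frame is consumed here and is
    never stored or transmitted; only sparse points plus one-way
    descriptors are returned for upload."""
    uploadable = []
    for (x, y, z) in keypoints:
        # Toy 1-pixel "patch"; a real system would describe a local region.
        pixel = frame_pixels[int(y)][int(x)]
        descriptor = hashlib.sha256(bytes([pixel])).hexdigest()
        uploadable.append({"point": (x, y, z), "descriptor": descriptor})
    return uploadable  # no imagery in the payload

# A tiny synthetic 4x4 grayscale frame standing in for a camera image.
frame = [[(r * 4 + c) % 256 for c in range(4)] for r in range(4)]
payload = process_frame_on_device(frame, [(1.0, 2.0, 0.5), (3.0, 0.0, 1.2)])

# Only points and fixed-length digests leave the function.
print(all(set(p) == {"point", "descriptor"} for p in payload))
```

The design point is scoping: because the frame only ever exists as a local argument, the uploadable payload is structurally incapable of containing source imagery.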

Here’s what a sparse point map looks like for the scene above (note our system selects semi-random sparse points, not geometric corners & edges, which cannot be meshed into a recognizable geometric space)

This is the point cloud which we save based on the office image data above

The second piece of the puzzle is the “feature descriptors” which are saved in the cloud by us & also by Google. Google has previously said that Tango ADF files, which ARCore is based on, can have their visual feature descriptors reverse engineered with deep learning back into a human-recognizable image (from Tango’s ADF documentation: “it is in principle possible to write an algorithm that can reconstruct a viewable image”). Note: I have no idea whether ARCore changed the Anchor spec from Tango’s ADF enough to change this fact, but Google has been clear that ARCore is based upon Tango, and changing the feature descriptor data structure would be a pretty fundamental change to the algorithm.

These are the feature descriptors generated for each point in the point cloud. This is as far as 6D’s cloud hosted data can be reverse-engineered, based on applying the latest science available today along with massive compute resources.

This is critical because for AR content to be truly persistent, there needs to be a persistent cloud-hosted data model of the real world. And the only way to achieve this commercially is for end-users to know that that description of the real world is private and anonymous. Additionally, I believe access to the cloud data should be restricted by requiring the user to be physically standing in the place the data mathematically describes, before the map is applied to the application.
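A minimal sketch of the coarse half of such a presence check (hypothetical names; standard haversine formula): gate map access on the requesting device being inside a geo-fenced radius of where the map was built, before any fine visual-feature match is attempted.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def may_access_map(map_lat, map_lon, fence_radius_m, device_lat, device_lon):
    """Coarse gate only; a real system would follow up with a fine
    visual-feature and sensor match to confirm physical presence."""
    return haversine_m(map_lat, map_lon, device_lat, device_lon) <= fence_radius_m

# Map built in central San Francisco with a 200 m geo-fence.
print(may_access_map(37.7749, -122.4194, 200, 37.7750, -122.4195))  # nearby
print(may_access_map(37.7749, -122.4194, 200, 37.8044, -122.2712))  # across the bay
```

The coarse check keeps remote scraping of map data off the table cheaply; the expensive visual confirmation only runs for devices already plausibly on-site.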

This reality regarding AR Cloud data creates a structural market problem for all of today’s major AR platform companies, as Google’s and Facebook’s (and others’) business models are built on applying the data they collect to better serve you ads. Platforms such as Apple & Microsoft are silos, so they won’t offer a cross-platform solution, and also won’t prioritize cloud solutions where a proprietary on-device P2P solution is possible.

The one factor that I had underestimated is that large developers & partners clearly understand the value of the data generated by their apps, and they do not want to give that data away for a big platform to monetize. They either want to bring everything in house (like Niantic is doing) or work with a smaller partner who can deliver technology parity with the big platforms (no small ask) and who can also guarantee privacy and business-model alignment. AR is seen as too important to give away the data foundations. This is a structural market advantage that AR Cloud startups have, and an encouraging sign for our foreseeable future.

As ARKit announced the dawn of AR last year, we believe Google’s Cloud Anchors are announcing the dawn of the AR Cloud. AR Apps will become far more engaging, but only if AR Cloud providers deliver a “just works” computer vision UX and address some challenging & unique privacy problems.

How do we approach spatial anchors at 6D.ai?

At 6D.ai we are thinking slightly differently than Google (and everyone else, to be honest). We believe that persistence is foundational, and you can’t have persistence without treating privacy seriously. Treating privacy seriously means that personally identifying information cannot leave the device (unless explicitly allowed by the user). This creates a much harder technical problem to solve, as it means building & searching a large SLAM map on-device, and in real-time. This is technically easy-ish to do with small maps/anchors, but very, very hard to do with large maps, where small means half a room, and large means bigger than a big house.

Fortunately we have the top AR research group from the Oxford Active Vision Lab behind 6D.ai, and we built our system on a next-generation relocalizer algorithm, taking advantage of some as-yet-unpublished research. The goal of all this was to get multi-player and persistent AR as close as possible to a “just works” user experience, where nothing needs to be explained, and an end-user’s intuition about how the AR content should behave is correct. There are no special “Host/Resolve” steps, no “manually enter a room number”, and no “trust us, your data is personally identifiable, so we throw it away… but we need it for the system to work…”.

Here’s what’s special about how 6D.ai supports maps/anchors for multi-player and persistence:

  • we do all the processing on-device and in real-time
  • anchors/maps are built and extended in the background while the app is running. Updates from all users are merged into a single anchor, vastly improving the coverage of the space.
  • the anchors/maps have no personally identifying information and are stored permanently in our cloud. Every time any 6D-powered app uses that physical space, the anchor grows and improves the coverage of that space. This minimizes and eventually eliminates any need to pre-scan a space.
  • anchors are available to all apps. Every user benefits from every other user of a 6D app.
  • we label anchors using non-personally identifying meta-data. The anchors are large enough to be non-specific, and we support multiple non-overlapping “patches” for anchors that have non-contiguous coverage of a physical space. All this is invisible to the end user (and the app developer).
  • Meta-data is either an encrypted (on-device) SSID or a large geo-fenced GPS area
  • Our map data cannot be reverse engineered into a human readable visual image
  • Map data is only available to an app that is running on a device physically inside the location where the map was built. We check both coarse (WiFi or GPS) location and precise visual feature and sensor data to confirm you are in the physical location.
  • Our anchors greatly benefit from cloud storage and merging, but there is no dependency on the cloud for the UX to work. Unlike Google’s system, we can work offline, or in a peer-to-peer environment, or a private/secure environment (or China).
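The metadata scheme above might be sketched like this (all names hypothetical; a salted SHA-256 digest stands in for whatever keyed scheme is actually used): the device derives an opaque key from the SSID locally, and only that digest ever reaches the cloud, so a breach of the stored anchors reveals no readable network names.

```python
import hashlib

APP_SALT = b"example-app-salt"  # illustrative constant, not the real scheme

def anchor_key_for_ssid(ssid):
    """Computed on-device: the cloud only ever sees this opaque digest,
    never the readable network name."""
    return hashlib.sha256(APP_SALT + ssid.encode("utf-8")).hexdigest()

def lookup(anchor_store, ssid):
    """Client-side lookup: derive the digest locally, query by digest."""
    return anchor_store.get(anchor_key_for_ssid(ssid))

# The cloud-side store is keyed only by digests.
store = {anchor_key_for_ssid("CoffeeShopWiFi"): {"points": 1234}}

print(lookup(store, "CoffeeShopWiFi") == {"points": 1234})  # same network
print(lookup(store, "OtherNetwork") is None)                # different network
```

Because the digest is deterministic, every device on the same network derives the same key and finds the same anchor, yet the stored data carries no hint of where on earth that network is.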

This is an example from last week’s 6D SDK beta release, where we show that as the map grows over time, Player 2 can automatically and instantly relocalize or resolve from any angle, with no extra scanning involved. When there is no pre-existing map data in a space (ie a cold start, or the situation for Cloud Anchors), our current performance with respect to angles/distance is about the same as Google’s today… however, we are instant, private, persistent, cover large areas & the UX is invisible & automatic… and we have a swathe of improvements coming to improve cold-start robustness (especially angles) in the coming 4–6 weeks.

6D.ai’s relocalization algorithm, to the best of our knowledge from our Oxford research lab, cannot have its map or feature descriptors reverse engineered into identifiable visual data based on any research available today. We will continue to ensure our data structures are updated as new research comes to light. Even if you hacked into our system, figured out our algorithm and applied huge compute resources to reverse engineer the data, the best you would get would be the image way above of the circular feature descriptors (which correspond to the desk scene, also above). In addition, the source wifi network ID is encrypted on the phone and only the encrypted info is stored, so the hacker at best gets the point cloud & these feature descriptors, but has no way of telling where on earth they correspond to. This also means that we can’t determine the images used to construct the 3D cloud data, even if we wanted to or were requested to by a government, etc., as the 6D cloud never sees the source images and has no way to reconstruct them.

So how do we feel about Google launching a product aimed at one of the areas we are tackling?

I’ll be honest: when I heard that Google would be launching multiplayer, I had some fears (founder paranoia). It was the first time some of the hypotheses we had founded 6D.ai upon were going to be tested. Would we have better technology, better focus on developers’ needs and a better understanding of the desired end-user experience? I’m pleased to say that on all those fronts we are looking good.


Obviously, Google has distribution power, and it’s the default option for Android devs. But the biggest market problem right now is that developers don’t know about AR multi-player at all, and Google has an ability to invest in educating the market that no startup can match. We expect that once devs start testing multi-player and start investing real time & money into multi-player apps, they will see how Cloud Anchors come up short, and see how 6D.ai solves their problems.

6D.ai is building advanced APIs for the AR Cloud. Our closed beta is rapidly taking on new developers.