Multiplayer AR — why it’s quite hard

I’ve written a bit about what makes a great smartphone AR App, and why ARKit and ARCore have solved an incredibly hard technical problem (robust 6dof inside-out tracking) and created platforms for AR to eventually reach mainstream use (still a couple of years away for broad adoption, but lots of large niches for apps today IMO). So developers are now working on climbing the learning curve from fart apps to useful apps (though my 8 year old son thinks the fart app is quite useful, thank you). The one feature I get more people asking about than any other is “multi-player”. The term multi-player is really a misnomer, as what we are referring to is the ability to share your AR experience with someone else, or many someone-elses, in real-time. So “multi-user”, “Sharing AR”, “Social AR” and “AR Communication” are just as good terms, but multi-player seems to stick right now, probably because most of the 3D AR tools come from gaming backgrounds & that’s the term gamers use. Note that you can do multi-player asynchronously, but that’s like playing chess with a pen-pal. As an aside, I can’t wait for newer tools to come to AR that align with workflows of more traditional design disciplines (architects, product designers, UX designers etc) as I think that will drive a huge boost to the utility of AR apps. But that’s for another post…

I personally believe that AR won’t really affect all our day-to-day lives until AR lets us communicate and share in new and engaging ways that have never been possible before. This type of communication needs real-time multi-player. Personally I think the gaming centric term multi-player restricts our thinking about how important these capabilities really are.

Multi-player AR has been possible for years (this is the multiplayer AR app we built at Dekko in 2011) but the relocalization UX has always been a huge obstacle

So if multi-player is the main feature people are asking for, why don’t we have it? The answer, like so much AR functionality, means diving into the computer vision technology that makes AR possible (we’ll also need low latency networking, maintenance of consistent world models, sharing audio + video, and collaborative interaction metaphors as well, but this post focusses on the computer vision challenges, which aren’t really solved yet). Multi-player AR today is somewhat like 6dof positional tracking was a few years ago. It’s not that hard to do in a crude way, but the resulting UX hurdles are too high for consumers. Getting a consumer grade multi-player UX turns out to be a hard technical problem. There are a bunch of technologies that go into enabling multi-player, but the one to pay attention to is “relocalization”. I’ll explain what that means later. The other non-intuitive aspect of multi-player is that it needs some infrastructure in the cloud in order to work “properly”. This piece is the part of the ARCloud that I’ve referred to in past posts.

What I think about when I think about the ARCloud

The term ARCloud has really caught on since my Super Ventures partner Ori and I wrote two blogs on the topic. We’ve seen it applied to a great number of “cloudy” ideas that have some AR angle to them. We certainly don’t own the term, but here’s a little more clarification about what I mean when I refer to the ARCloud, as distinct from regular Cloud services.

If you think about all the various pieces of an app that sit in the cloud, I tend to split “the cloud” horizontally and separate those services into things that are “nice to have” in the top half, and “must have” in the bottom half. The nice to have things are generally related to app and content and make it easy to build and manage apps and users.

Your AR Apps today without an ARCloud connection are like having a mobile phone that can only play Snake

The bottom half of the cloud is, for me, the interesting part. An AR system, by its very nature, is too big for a device. The world is too large to fit; it would be like trying to fit all of Google Maps and the rest of the web on your phone (or HMD). The key insight is that if you want your AR app to be able to share the experience, or work well (ie with awareness of the 3D world it exists in) in any location, the app just can’t work at all without access to these cloud services. They are as important as the Operating System APIs that let your app talk to the network drivers, or the touch screen, or disk access. AR systems need an operating system that partially lives on device, and partially lives in the cloud. Network/Cloud data services are as critical to AR apps as the network is to making mobile phone calls. Think back before smartphones… your old Nokia mobile phone without the network could still be a calculator and you could play Snake, but its usefulness was pretty limited. The network and ARCloud are going to be just as essential to AR apps. I believe we will come to view today’s ARKit/ARCore apps as the equivalent to just having offline “Nokia Snake” vs a network connected phone.

The ARCloud can be stratified into 2 layers: the nice to have cloudy pieces that help apps, and the must-have pieces without which apps don’t even work at all.

How does Multi-player AR work?

To get multiplayer to work, we need a few things set up:

  • first the two devices need to know their position relative to each other. Technically this means that they need to share a common coordinate system, and know each other’s coordinates at every video frame. The coordinate system can either be a world system (eg latitude/longitude) or they might just agree to each use the coordinates from the first device to get started. Recall that each device when it starts generally just says “wherever I am right now is my (0,0,0) coordinates”, and it tracks movement from there. My (0,0,0) is physically in a different place to your (0,0,0). In order to convert myself into your coordinates, I need to relocalize myself into your SLAM map and get my pose in your coordinates, then adjust my map accordingly. The SLAM map is all the stored data that lets me track where I am.
  • then we need to ensure for every frame, each of us knows where the other is. Each device has its own tracker which constantly updates the pose each frame. So for multi-player we need to broadcast that pose to all the other players in the game. This needs a network connection of some type, either peer-to-peer, or via a cloud service. Often there will also be some aspect of pose-prediction & smoothing going on, to account for any minor network glitches.
  • after that we would expect (this isn’t mandatory, though the UX will be badly affected without it) that any 3D understanding of the world that each device has, could be shared with other devices. This means streaming some 3D Mesh and semantic information along with the pose. eg if my device has captured a nice 3D model of a room which provides physics and occlusion capabilities, when you join my game, you should be able to make use of that already captured data, and it should be updated between devices as the game proceeds.
  • lastly there are all the “normal” things needed for an online real-time multi-user application. This includes managing user permissions, the real-time state of each user (eg if I tap “shoot” in a game, then all the other users’ apps need to be updated that I have “shot”), and managing all the various shared assets. These technical features are exactly the same for AR and for non-AR apps. The main difference is that to date they’ve really only been built for games, while AR will need them for every type of app. Fortunately all of these features have been built many times over for online and mobile MMO games, and adapting them for regular non-gaming apps isn’t very hard.
Even an app like this needs the ARCloud and “MMO” infrastructure to enable the real-time interactions.
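To make the first requirement above concrete, here is a minimal sketch of the coordinate conversion, assuming the relocalizer has already done the hard part and told player 2 where it is in player 1’s map. It is flattened to 2D (x, y, heading) to keep it short; a real system does the same thing with full 6dof poses. The names Pose2D and frame_alignment are made up for illustration and don’t come from ARKit, ARCore or any other SDK.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Rigid pose on the ground plane: x, y in metres, heading theta in radians."""
    x: float
    y: float
    theta: float

    def compose(self, other: "Pose2D") -> "Pose2D":
        """Express `other` (given relative to this pose) in this pose's parent frame."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        return Pose2D(
            self.x + c * other.x - s * other.y,
            self.y + s * other.x + c * other.y,
            self.theta + other.theta,
        )

    def inverse(self) -> "Pose2D":
        """Inverse transform: (R, t) becomes (R^T, -R^T t)."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        return Pose2D(-(c * self.x + s * self.y), s * self.x - c * self.y, -self.theta)


def frame_alignment(b_in_a: Pose2D, b_in_b: Pose2D) -> Pose2D:
    """Transform taking coordinates in B's frame to A's frame.

    b_in_a: B's pose in A's map at the moment of relocalization (assumed given).
    b_in_b: B's pose from its own tracker at that same moment.
    """
    return b_in_a.compose(b_in_b.inverse())


# Player 2 has just started (its own origin), and relocalization places it
# 2 m along x and 1 m along y in player 1's map, rotated 90 degrees.
a_from_b = frame_alignment(Pose2D(2.0, 1.0, math.pi / 2), Pose2D(0.0, 0.0, 0.0))

# Every later pose from player 2's tracker can now be broadcast in player 1's coordinates.
shared_pose = a_from_b.compose(Pose2D(1.0, 0.0, 0.0))
print(shared_pose)
```

The alignment is computed once per relocalization and then reused every frame, which is why the quality of that single relocalization result matters so much to everything that follows.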

What’s the hard part?

Imagine you are in a locked windowless room and you are given a polaroid snapshot of a city sidewalk. It shows some buildings and shop names across the street and cars, people etc. You have never been here before, it’s completely foreign to you, even the writing is foreign. Your task is to determine *exactly* where that photo was taken, with about 1cm of accuracy. You’ve got your rough latitude/longitude from GPS and only roughly know which direction you are facing, and you know GPS can be 20–40m inaccurate. All you have to go on is a pile of polaroids taken by someone else in roughly the same area recently, each marked with an exact location. This is the problem your AR system has to solve every time it is first turned on, or if it “loses tracking” by the camera being temporarily covered or if it points at something that it can’t track (a white wall, or a blue sky etc). It’s also the problem that needs to be solved if you want to join the AR game of your friend. Your photo is the live image from your device’s camera, the pile of photos is the SLAM map that you have loaded into memory (maybe copied from your friend’s device or one built in the past).

If you want to get a sense of how hard it is to relocalize, try playing the Geoguessr game https://www.geoguessr.com. This is very close to the same problem your AR system has to solve every time you turn it on

To illustrate the problem, let’s take two extreme examples. In the first case, you find a photo in the pile that looks almost exactly like the photo you have. You can easily estimate that your photo is fractionally behind & to the left of the photo in the pile, so you now have a really accurate estimate of the position your photo was taken. This is the equivalent of asking “player 2” to go and stand right beside “player 1” when player 2 starts their game. Then it’s easy for player 2’s system to figure out where it is relative to player 1, and the systems can align their coordinates (location) and the app can run happily.

In the other example, it turns out that unbeknownst to you all the photos in your pile are taken facing roughly south, while your photo faces north. There is almost nothing in common between your photo and what’s in the pile. This is the AR equivalent of trying to play a virtual board game and player 1 is on one side of the table, and player 2 sits down on the opposite side, and tries to join the game. Apart from some parts of the table itself (which you see in reverse to what’s in the pile) it is *very* hard for the systems to synchronize their maps (relocalize).

The difference between these examples illustrates why just because someone claims they can support “multi-player” AR, it probably also means that there are some significant UX compromises that a user needs to make. In my experience building multi-player AR systems since 2012, the UX challenges of the first example (requiring people to stand side by side to start) are too hard for users to overcome. They need a lot of hand-holding and explanations, and the friction is too high. Getting a consumer-grade multi-player experience means solving the 2nd case (and more).

In addition to the 2nd case above, the photos in the pile could be from vastly different distances away, under different lighting conditions (morning v afternoon shadows are reversed) or using different camera models which affect how the image looks compared to yours (that brown wall may not be the same brown in your image as mine). You also may not even have GPS available (eg indoors), so you can’t even start with a rough idea of where you might be.

The final “fun” twist to all this, is that users get bored waiting. If the relocalization process takes more than 1–2 seconds, the user generally moves the device in some way, and you have to start all over again!

Accurate & robust relocalization (in all cases) is still one of the outstanding hard problems for AR (and robots, and autonomous cars etc).

How does Relocalization work?

So how does it actually work? How are these problems being solved today? What’s coming soon?

At its core, relocalization is a very specific type of search problem. You are searching through a SLAM map, which covers a physical area, to find where your device is located in the coordinates of that map. SLAM maps usually have 2 types of data in them, a sparse point-cloud of all the trackable 3D points in that space, and a whole bunch of keyframes. A keyframe is just one frame of video captured and saved as a photo every now & then as the system runs. The system decides how many keyframes to capture based on how far the device has moved since the last keyframe, and the system designer making tradeoffs for performance. More keyframes saved means more chance of finding a match when relocalizing, but takes more storage space, and means the set of keyframes takes longer to search through.
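As a rough sketch of that keyframe tradeoff, assuming a simple “save a keyframe once the device has moved far enough” rule with a hard cap on how many are kept: the thresholds, the KeyframeSelector name and the drop-the-oldest eviction policy are all my own illustrative assumptions, not how any particular SLAM system does it.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KeyframeSelector:
    # Both values are hypothetical tuning knobs a system designer would trade off.
    min_translation_m: float = 0.25   # save a keyframe after moving this far
    max_keyframes: int = 200          # storage / search-time budget
    keyframes: list = field(default_factory=list)  # (position, image_id) tuples

    def maybe_add(self, position, image_id) -> bool:
        """Store the frame as a keyframe if the device moved far enough since the last one."""
        if self.keyframes:
            last_position, _ = self.keyframes[-1]
            if math.dist(position, last_position) < self.min_translation_m:
                return False  # too close to the previous keyframe, skip it
        self.keyframes.append((position, image_id))
        if len(self.keyframes) > self.max_keyframes:
            self.keyframes.pop(0)  # assumed policy: drop the oldest keyframe
        return True


selector = KeyframeSelector(min_translation_m=0.5, max_keyframes=100)
for frame_id, x in enumerate([0.0, 0.1, 0.3, 0.6, 0.7, 1.2]):
    selector.maybe_add((x, 0.0, 0.0), frame_id)
print(len(selector.keyframes), "keyframes kept")
```

Lowering min_translation_m gives the relocalizer more photos in the pile to match against, at the cost of more storage and a longer search.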

So the search process actually has 2 pieces. The first piece is as described above with the Polaroids example. You are comparing your current live camera image to the set of keyframes in the SLAM map. The second part is that your device has also instantly built a tiny set of 3D points of its own as soon as you turn it on based only on what it currently sees, and it searches through the SLAM sparse point-cloud for a match. This is like having a 3D jigsaw puzzle piece (the tiny point-cloud from your camera) and trying to find the match in a huge 3D jigsaw….where every piece is flat gray on both sides.
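The first piece of that search (live image vs keyframes) can be sketched roughly as below. This is a toy, assuming each image has already been boiled down to a handful of binary feature descriptors (written here as small integers) that are compared by Hamming distance; production systems use far larger descriptors and smarter indexing than this brute-force loop. The function names and thresholds are illustrative only.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def count_matches(query, candidate, max_distance=2):
    """Count query descriptors that have a close-enough descriptor in the candidate image."""
    return sum(
        1 for q in query
        if any(hamming(q, c) <= max_distance for c in candidate)
    )


def best_keyframe(query, keyframes, min_matches=3):
    """Brute-force search over keyframes; returns (keyframe_id or None, best score)."""
    best_id, best_score = None, 0
    for keyframe_id, descriptors in keyframes.items():
        score = count_matches(query, descriptors)
        if score > best_score:
            best_id, best_score = keyframe_id, score
    if best_score < min_matches:
        return None, best_score  # nothing in the pile looks enough like our photo
    return best_id, best_score


# Hypothetical SLAM map with two keyframes, one facing each way.
slam_keyframes = {
    "facing_north": [0b10101010, 0b11110000, 0b00001111, 0b11001100],
    "facing_south": [0b01010101, 0b00110011, 0b11111111, 0b00000000],
}
live_image = [0b10101011, 0b11110001, 0b00001111]  # noisy view of the north-facing scene
print(best_keyframe(live_image, slam_keyframes))
```

When the search returns None, that is the “all the photos in the pile face south and mine faces north” case: there is simply nothing to match against, however long you search.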

Here’s a simplified overview of how most of today’s SLAM systems build their SLAM map using a combination of Optical Features (sparse 3D point cloud) and a database of “keyframes”.

Due to the limited amount of time available before a user gets bored, and the modest compute power of today’s mobile devices, most of the effort in relocalization goes into reducing the size of the “search window” before having to do any type of brute-force searching through the SLAM map. Better GPS, better trackers and better sensors are all very helpful in this regard.
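One simple way to shrink the search window, sketched here under the assumption that every keyframe in the map was tagged with a lat/long when it was captured: throw away every keyframe that is further from the current GPS fix than the worst-case GPS error before doing any image matching. The dictionary layout and the 40m default are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidates_near(gps_fix, keyframes, gps_error_m=40.0):
    """Keep only keyframes whose capture location is within the GPS error radius."""
    lat, lon = gps_fix
    return [
        kf for kf in keyframes
        if haversine_m(lat, lon, kf["lat"], kf["lon"]) <= gps_error_m
    ]


# Hypothetical map: two keyframes on this block, one about a kilometre away.
slam_keyframes = [
    {"id": "kf_a", "lat": 37.7749, "lon": -122.4194},
    {"id": "kf_b", "lat": 37.7750, "lon": -122.4194},
    {"id": "kf_c", "lat": 37.7849, "lon": -122.4194},
]
nearby = candidates_near((37.7749, -122.4194), slam_keyframes)
print([kf["id"] for kf in nearby])
```

This only helps when GPS is available at all; indoors there is no fix to filter with, and the system is back to searching the whole map.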

How is it really being done today in apps?

Poorly! There are broadly 5 ways that relocalization is being done today for inside-out tracking systems (it’s easy for outside-in, like a HTC Vive, as the external lighthouse boxes give the common coordinates to all devices that they track). These ways are:

  • rely on GPS for both devices and just use lat/long as the common coordinate system. This is simple, but the common object we both want to look at will be placed in different physical locations for each phone, off by up to the amount of error in a GPS location (many meters!). This is how Pokemon Go currently supports multi-player, but because the “MMO” back-end is still quite simple, it’s actually closer to “multiple people playing the same single-player game in the same location”. This isn’t entirely accurate as once the pokemon is caught, other people can’t capture it, so there is some simple state management going on.
Here’s what happens when you rely on GPS alone for relocalization. We don’t see the object where it is “supposed” to be, and we don’t even see it in the same place on 2 different devices.
  • rely on a common physical tracking marker image (or QR code). This means we both point our phones at a marker on the table in front of us and both our apps treat the marker as the origin (0,0,0) coordinates. This means the real world and the virtual world are consistent across both phones. This works quite well, it’s just that no one will ever carry the marker around with them, so it’s a dead end for real-world use.
Here’s an app that uses a printed image that all the devices use for relocalization in order to share their coordinates
  • copy the SLAM maps between devices and ask the users to stand beside each other and have player 2 hold their phone very close to player 1. Technically this can work quite well, however the UX is just a major problem for users to overcome. This is how we did it at Dekko for Tabletop Speed.
  • Just guess. If I start my ARKit app standing in a certain place, my app will put the origin at the start coordinates. You can come along later and start your app standing in the same place, and just hope that wherever the system sets your origin is roughly in the same physical place as my origin. It’s technically much simpler than copying SLAM maps, and the UX hurdles are about the same, and the errors across our coordinate systems aren’t too noticeable if the app design isn’t too sensitive. You just have to rely on users doing the right thing….
  • Constrain the multi-player UX to be OK with low-accuracy location and asynchronous interactions. Ingress and AR treasure-hunt type games fall into this category. Achieving high-accuracy real-time interactions is the challenge. I do believe there will always be great use-cases that rely on asynchronous multi-user interactions, and it’s the job of AR UX designers to uncover these.
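To see why the GPS-only option places the same object in two different spots, here is a back-of-envelope sketch. It assumes a flat local east/north grid in meters and that each phone simply trusts its own GPS fix; the error values are made up for illustration. A phone that believes it is standing at (true position + error) will draw a world-anchored object shifted by minus that error.

```python
import math


def apparent_position(anchor_en, gps_error_en):
    """Where a phone physically draws an object anchored at anchor_en (east, north in m),
    given the phone's GPS fix is off by gps_error_en."""
    return (anchor_en[0] - gps_error_en[0], anchor_en[1] - gps_error_en[1])


def disagreement_m(error_a, error_b):
    """Distance between where two phones draw the same object."""
    return math.hypot(error_a[0] - error_b[0], error_a[1] - error_b[1])


pokemon = (10.0, 0.0)       # true anchor, 10 m east of a reference point
my_gps_error = (6.0, 0.0)   # hypothetical: my fix is 6 m too far east
your_gps_error = (-3.0, 4.0)  # hypothetical: yours is 3 m west, 4 m north

print("I see it at:", apparent_position(pokemon, my_gps_error))
print("You see it at:", apparent_position(pokemon, your_gps_error))
print("We disagree by (m):", disagreement_m(my_gps_error, your_gps_error))
```

Note the disagreement depends only on the difference between the two errors, so even two phones that are each “only” a few meters off can end up with the object on the sidewalk for one player and in the road for the other.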

It’s worth noting that all of the above solutions have existed for many years, and yet the number of real-time multi-player apps that people are using is pretty much zero… All the solutions above IMO fall into the bucket of an engineer being able to say “look it works, we do multi-player!” but end users just find it too much hassle for too little benefit.

What’s the state of the art in research (and coming soon to consumer)?

While the relocalization method described above is the most common approach, there are others that are seeing great results in the labs and should come to commercial products soon. One is using full frame neural network regression (posenet) to estimate the pose of the device. This looks like being able to get your pose accurate to about a meter or so under a wide range of conditions. Another method is to regress the pose of the camera for each pixel in the image.

Posenet is indicative of where systems are headed

Can the relocalization problem really be solved for consumers?

Yes! In fact there have been some pretty big improvements over the last 12 months based on state-of-the-art research results. Deep learning systems are giving impressive results for reducing the search window for relocalizing in large areas, or at very wide angles to the initial user. Searching a SLAM map built from dense 3D point clouds of the scene (rather than sparse point clouds used for tracking) is also enabling new relocalization algorithms that are very robust. I’ve seen confidential systems that can relocalize from any angle at very long range in real-time on mobile hardware, and support many many users simultaneously. Assuming the results seen in research carry over into commercial grade systems, then I believe this will provide the “consumer grade” solutions we expect.

These are still only partial solutions to fully solving relocalization for precise lat/long and for GPS denied environments, or parts of the world where no SLAM system has ever been before (cold-start). But I’ve seen demos that solve most of these point problems, and believe that it will just take a clever team to gradually integrate them into a complete solution. Large scale relocalization is on the verge of being primarily an engineering problem now, not a science problem.

Can’t Google or Apple just do this? Not really.

Google has demo’d a service called VPS for their discontinued Tango platform, which enabled some relocalization capabilities between devices. Sort of a shared SLAM map in the cloud. It didn’t support multi-player, but it went a ways towards solving the hard technical parts. It’s never been publicly available so I can’t say how well it worked in the real world, but the demos looked good (as all demos do). All the major AR platform companies are working on improving their relocalizers that are part of ARKit, ARCore, Hololens, Snap etc etc. This is primarily to make their tracking systems more reliable, but this work can help with multi-player also…

VPS is a good example of a cloud-hosted shared SLAM map. However it is completely tied to Google’s SLAM algorithms and data structures, and won’t be used by Apple, Microsoft or other SLAM OEMs (who would conceivably want their own systems, or partner with a neutral 3rd party).

The big problem that every major platform has with multi-player is that at best they can enable multi-player within their eco-system. ARCore to ARCore, or ARKit to ARKit and so on. This is because for cross-platform relocalization to work, there needs to be a common SLAM map on both systems. This would mean that Apple would have to give Google access to their raw SLAM data, and vice versa (plus Hololens, Magic Leap also opening up etc). While technically possible, this is a commercial bridge too far, as the key differentiators in the UX between various AR systems are largely a combination of hw+sw integration, then the SLAM mapping system capabilities.

So in the absence of all the big platforms agreeing to open all their data to each other, the options are either:

  • an independent & neutral 3rd party acts as a cross-platform relocalization service; or
  • a common open relocalization platform emerges.

My personal belief is that due to the very tight integration between the SLAM relocalization algorithms and the data structures, a dedicated system built for-purpose will outperform (from a UX aspect) a common open system for many years. This has been the case for many years in computer vision: the open platforms such as OpenCV or various open slam systems (orb slam, lsd slam etc) are great systems, but don’t provide the same level of optimized performance as focussed in-house developed systems. To date, no AR platform company I know of is running or considering running an open slam system (though many similar algorithmic techniques are applied in the optimized proprietary systems).

Note that doesn’t mean I believe open platforms don’t have a place in the ARCloud. On the contrary, I think there will be many services that will benefit from an open approach. However I don’t think as an industry we understand the large scale AR problems well enough yet in order to specifically say this system needs to be open vs that system needs to be as optimized as possible.

Relocalization != Multi-player. It’s also critical for…

This post is ostensibly about why multi-player is hard for AR, and it turns out it’s hard specifically for AR because it’s hard to make relocalization consumer-grade. There’s a whole bunch of other things to build to enable AR multi-player, which I touched on above, which could be hard to build, but are all previously solved problems. But… there are other ways that relocalization really matters, beyond just multi-player. Here’s a few:

  • the “cold start” problem: This refers to the very first time you launch an app or turn on your HMD, and it has to figure out where it is. Generally today systems don’t even bother to try & solve this, they just call wherever they start (0,0,0). Autonomous cars, cruise missiles and other systems that need to track their location obviously can’t do this, but they have a ton of extra sensors to rely on. Having the AR system relocalize as the very first thing it does, means that persistent AR apps can be built, as the coordinate system will be consistent from session to session. If you drop your pokemon at some specific coordinates yesterday, when you relocalize the next day after turning your device on, those coordinates will still be used today and the pokemon will still be there. Note that these coordinates could be unique to your system, and not necessarily absolute/global coordinates (lat/long) shared by everyone else (unless we all localize into a common global coordinate system, which is where things will ultimately end-up)
  • the absolute coordinates problem: This refers to finding your coordinates in terms of lat/long to an “AR usable” level of accuracy, which means it’s accurate to “sub-pixel” levels. Sub-pixel means that the coordinates are accurate enough that the virtual content will be drawn using the same pixels on my device, as your device if it was in the exact same physical spot. Usually sub-pixel is used for tracking to refer to jitter/judder so that the pose being accurate sub-pixel means the content doesn’t jitter when the device is still due to the pose varying. It’s also a number that doesn’t have a metric equivalent as each pixel can correspond to slightly different physical distances depending on the resolution of the device (pixel sizes) and also how far away the device is pointing (a pixel covers more physical space if you are looking a long way away). In practice having sub-pixel accuracy isn’t necessary as users can’t really tell if the content is inconsistent by a few cm between my device and yours. Getting accurate lat/long coordinates is essential for any location based commerce services (eg the virtual sign over the door needs to be over the right building), as well as navigation.
This is what you get when you don’t have accurate absolute coordinates (or a 3D mesh of the city)
  • the lost-tracking problem: the last way in which relocalization matters is that it is a key part of the tracker. While it would be nice if trackers never “lose tracking”, even the best trackers can encounter corner cases that confuse the sensors e.g. getting in a moving vehicle will confuse the IMU in a VIO system, while blank walls can confuse the camera system. When tracking is lost, the system needs to go back and compare the current sensor input to the SLAM map to relocalize so that any content is kept consistent within the current session of the app. If tracking can’t be recovered, then the coordinates are reset to (0,0,0) again and all the content is also reset.
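The point above about a pixel not having a fixed metric size can be put into numbers with a quick sketch, assuming an ideal pinhole camera. The 60 degree horizontal field of view and 1920 pixel image width are example values I picked, not the specs of any particular phone.

```python
import math


def pixel_footprint_m(distance_m: float, horizontal_fov_deg: float, image_width_px: int) -> float:
    """Approximate width of real-world surface covered by one pixel at a given distance.

    Assumes an ideal pinhole camera looking straight at a flat surface:
    visible width = 2 * d * tan(fov / 2), shared evenly across the image columns.
    """
    visible_width_m = 2.0 * distance_m * math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return visible_width_m / image_width_px


for distance in (1.0, 5.0, 20.0):
    footprint_mm = pixel_footprint_m(distance, 60.0, 1920) * 1000.0
    print(f"at {distance:>4.1f} m one pixel covers about {footprint_mm:.2f} mm")
```

Under these assumptions the footprint grows linearly with distance, so “sub-pixel” is a much stricter requirement for content on the table in front of you than for a sign on a building across the street.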

Yeah but when?

Will this remain science-fiction?

So when will end users be able to play true multi-player Pokemon? or StarWars Holochess with their friend? It can be done today, if users are OK to accept a poor quality relocalization UX. If you are OK to relocalize using a common printed marker, or rely on “my GPS gives the exact same result as your GPS” and accept that you might see the pokemon through your phone on the sidewalk, but I see the same one through my phone placed in the middle of the road… then it can be done right now. But users have generally found that UX to be at least a little bit broken (even if it “technically” works). In terms of when we will see a solid UX, I expect to see solutions come to market around Q2-Q3 2018. I know this because it’s roughly when my startup 6D.ai is expecting to have a solution ready, and I know other startups (eg Escher Reality) are working on the same problems. There’s a chance Apple or Google may release an update to ARCore or ARKit that allows “in eco-system” multi-player, but I’d be surprised if they bring something to market faster than a startup can. I hear Apple has an ARKit update planned for around April that may support vertical plane detection amongst other tweaks, while Google seems to be very focussed on bringing ARCore to market on large numbers of devices as a priority.

Wrap-up

So true multi-player is IMO the single feature that is going to boost user engagement with AR Apps (there are others, such as absolute coordinates and very large scale outdoor apps). At a minimum it should allow far more engaging non-gaming smartphone AR apps to be built… but developers still have to learn what to build, and this will take a while. Adding multi-player to a bad concept won’t make it a compelling UX.

It’s a hard technical problem to solve relocalization, especially cross-platform, to a level that “it just works” for consumers. Once that problem is solved then the rest is just replicating work that has already been done for real-time MMO gaming platforms. As I indicated in my first ARKit post, 2018 is going to be an exciting year for AR enabling infrastructure….