How is ARCore better than ARKit?
A couple of days ago, Google announced ARCore, its competitor to Apple’s ARKit. Obviously this is great news for the AR industry, but what does it really mean for developers and consumers?
Isn’t ARCore just Tango-lite?
One developer I spoke to jokingly said “I just looked at the ARCore SDK and they’ve literally renamed the Tango sdk, commented out the depth camera code and changed a compiler flag”. I suspect it’s a bit more than that, but not much more (this isn’t a bad thing!). For example, the new Web browsers that support ARCore are fantastic for developers, but they are separate to the core SDK. In my recent ARKit post I wondered why Google hadn’t released a version of Tango VIO (that didn’t need the depth camera) 12 months ago as they had all the pieces sitting there ready to go. Now they have!
This is great news, as it means ARCore is very mature and well-tested software (it’s had at least 2 years more development inside Google than ARKit had inside Apple, though buying Metaio & Flyby helped Apple catch up), and there’s a rich roadmap of features that were lined up for Tango, not all of which depend on 3D depth data, that will now find their way into ARCore.
Putting aside the naming, if you added depth camera sensor hardware to a phone which runs ARCore, you’d have a Tango phone. Now Google has a much easier path to get wide adoption of the SDK by being able to ship it on the OEM flagship phones. No one would give up a great Android phone for a worse one with AR (same as no one would give up any great phone for a Windows Mobile phone with AR, so MSFT didn’t bother, they went straight to a Head Mounted Display). Now people will buy the phone they would have bought anyway, and ARCore will be pulled along for free.
If we do consider the name, I thought it interesting that Tango had always been described along the lines of “a phone that always knows its location”. I never met a single person who was impressed by that. To me, it positioned the phone as something more aligned with Google maps and AR was an afterthought (whether that was how Google saw it is debatable). With the new name, it’s all AR all the time.
But what about all that calibration you talked about?
Here’s where things get interesting… I spoke about 3 types of hw/sw calibration that Apple did to get ARKit so rock-solid. Geometric (easy) & Photometric (hard) for the Camera, and IMU error removal (crazy hard). I also mentioned that clock-synching the sensors was even more important.
Calibration isn’t a binary “yes it’s calibrated” or “no it’s not” situation. It’s statistics and doing more & more steps of diminishing returns to get the system stable enough for the use-case. The “more calibrated” the system is, the longer (in both time & distance) it will go before the errors in pose calculations become noticeable. Here’s what I think Google has done:
- Firstly they are being very cautious about which devices they support. Initially only Samsung’s S8 (biggest seller) and Pixel (Google’s own hw design). Both of these platforms already had Google engineers working to support sensor calibration for Inside-Out tracking for Daydream VR (3dof today, working towards 6dof). Google recently had engineers working with Samsung in Korea on calibrating & tuning sensors in their next phone to fully support Daydream, so back-porting some of that work to the S8 isn’t inconceivable. So we have two devices where the cameras and IMUs are already reasonably well calibrated and clock-synched (for Daydream).
- Google had been working for much of this year to merge the Tango and Daydream SDKs. I’d heard November was the target to complete that. I imagine that here in late August much of the low level work would have been done, meaning the Tango/ARCore VIO system could take advantage of Daydream sensor integration work.
- Lastly the real benefits of calibration become visible at the outer limits of the system performance (by definition). ARKit and ARCore can both track quite well for many meters before the user notices any drift. I haven’t seen any head-to-head tests done over long times/distances, but it doesn’t really matter. Developers are still getting their heads around putting AR content immediately in front of you. Users can barely comprehend that they can freely walk around quite large distances (and there’s no content to see there anyway). So in terms of how AR applications are really being used, any differences in calibration are pretty much impossible to detect. By the time developers are pushing the boundaries of the SDKs, Google is betting there will be a new generation of devices on the market with far more tightly integrated sensor calibration done at the factory.
For example, I spoke to one of the largest IMU OEMs this week about this topic and he said that their mobile phone IMUs are only factory calibrated to a single operating temperature, in order to reduce costs. This means that the IMU hardware is tuned so it gives the fewest errors at this one temperature. As you continue to use the phone it gets hotter & this will cause the IMU to behave slightly differently than it’s calibrated for, and errors will result. This is fine for most IMU use cases (rotate from portrait to landscape mode for instance), but for VIO once the device heats up, the IMU measurements for dead-reckoning calculations become unreliable and the tracking drifts. This OEM can easily start calibrating for multiple temperature ranges if they are asked (and they will be!), meaning that’s one less source of error that Google’s ARCore VIO code has to eliminate device-type by device-type. Apple, being vertically integrated, could address these challenges much faster, while Android needs to wait for the changes to filter through an eco-system.
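To get a feel for why a small uncorrected IMU error matters for dead reckoning, here is a rough back-of-envelope sketch. It assumes a constant accelerometer bias (a simplification: a real bias wanders with temperature, and a real VIO system keeps correcting it using the camera). The function name and the numbers are made up for illustration and are not taken from any IMU datasheet or from ARKit/ARCore.

```python
def position_drift(bias_mps2: float, seconds: float, dt: float = 0.01) -> float:
    """Double-integrate a constant accelerometer bias into a position error (meters).

    This is plain Euler integration of the bias alone: no camera corrections,
    no gravity, no rotation. It only shows how the error accumulates.
    """
    velocity_error = 0.0
    position_error = 0.0
    steps = int(round(seconds / dt))
    for _ in range(steps):
        velocity_error += bias_mps2 * dt       # bias integrates into a velocity error
        position_error += velocity_error * dt  # which integrates into a position error
    return position_error


# Hypothetical bias of 0.02 m/s^2, left uncorrected for 10 s and for 20 s.
drift_10s = position_drift(0.02, 10.0)
drift_20s = position_drift(0.02, 20.0)
```

The error grows roughly with the square of elapsed time (about 0.5 x bias x t^2), so doubling the time roughly quadruples the drift. That is why a bias that is invisible when rotating from portrait to landscape can still hurt tracking.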
Both ARKit and ARCore provide a simple estimate of the natural lighting in the scene. This is one estimate for the whole scene, irrespective of whether the real world is smoothly lit with ambient light, or full of sharp spotlights. ARKit provides intensity & color temperature back to the developer, while ARCore provides either a single pixel intensity value (Android Studio API) or a shader (Unity API). Both approaches seem from early demos to give similar results. Subjectively Google’s demos look a bit better to me, but that may be because Tango developers have been working on them for much longer than ARKit has been released. However Google has already been showing (at I/O this year) what is coming soon (17:11 in this video) which is the ability to dynamically adjust virtual shadows & reflections to movements of the real world lights. This will give a huge lift in presence where we sub-consciously believe the content is “really there”.
Mapping is one area where ARCore has a clear advantage today over ARKit. Mapping is the “M” in SLAM. It refers to a data structure that the device keeps in memory, which has a bunch of information about the 3D real world scene, that the Tracker (a general term for the VIO system) can use to Localize against. Localize just means figure out where in the map I am. If I blindfolded you and dropped you in the middle of a new city with a paper map, the process you go through of looking around, then looking at the map, then looking around again until you figure out where on the map you are… that’s Localizing yourself.
At its simplest level a SLAM map is a graph of 3D points which represent a sparse point-cloud, where each point corresponds to the coordinates of an Optical Feature in the scene (eg the corner of a table). Maps usually have a bunch of extra meta-data in there as well, such as how “reliable” each point is, measured by how many recent frames have detected that feature at the same coordinates (eg a black spot on my dog would not be marked reliable because the dog moves around). Some Maps include “keyframes” which are just a single frame of video (a photo!) that is stored in the map every few seconds, and used to help the tracker match the world to the map. Other maps use a dense point-cloud which is more reliable but needs more GPU and memory. ARCore and ARKit both use sparse maps (without keyframes I think).
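As a rough illustration of the kind of data structure described above, here is a minimal sketch of a sparse map whose points carry a “reliability” score. All of the names here (MapPoint, SparseMap, observe, reliable_points) and the 0.8 threshold are hypothetical; neither ARKit nor ARCore exposes its internal map this way.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class MapPoint:
    xyz: Point3             # 3D coordinates of an optical feature (eg a table corner)
    times_expected: int = 0  # frames in which the feature should have been visible
    times_matched: int = 0   # frames in which it was re-detected at the same coordinates

    @property
    def reliability(self) -> float:
        """Fraction of expected frames in which the feature was actually matched."""
        if self.times_expected == 0:
            return 0.0
        return self.times_matched / self.times_expected


@dataclass
class SparseMap:
    points: Dict[int, MapPoint] = field(default_factory=dict)

    def observe(self, point_id: int, xyz: Point3, matched: bool) -> None:
        """Record one frame's observation of a feature, creating the point if new."""
        point = self.points.setdefault(point_id, MapPoint(xyz))
        point.times_expected += 1
        if matched:
            point.times_matched += 1

    def reliable_points(self, threshold: float = 0.8) -> List[int]:
        """IDs of points stable enough to be worth using in the pose calculation."""
        return sorted(pid for pid, p in self.points.items() if p.reliability >= threshold)
```

In this toy version, a table corner that is matched in every frame ends up with a reliability of 1.0, while a spot on a moving dog that is only matched now and then falls below the threshold and is ignored.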
So how it works is that when you launch an ARCore/ARKit app, the Tracker checks to see if there is a Map pre-(down)loaded & ready to go (there never is in v1.0 of ARCore & ARKit), so the tracker Initializes a new map by doing a stereo calculation as I described in my last post. This means we now have a nice little 3D map of just what is in the Camera’s field of view. As you start moving around, and new parts of the background scene move into the field of view, more 3D points are added to the map and it gets bigger. And bigger. And bigger. This never used to be a problem because trackers were so bad that they’d drift away unusably before the map got too big to manage. That isn’t the case anymore, and managing the Map is where much of the interesting work in SLAM is going on (along with Deep Learning & CNN AIs).
ARKit uses a “sliding window” for its Map, which just means that it only stores a variable amount of the recent past (time and distance) in the map, and throws away anything old. The assumption is that you aren’t going to ever need to relocalize against the scene from a while ago. ARCore manages a larger map, which means that the system should be more reliable.
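The “sliding window” idea can be sketched in a few lines. This is a simplified illustration that only windows on time (the description above says the real window depends on both time and distance), and the class name and the max_age_s parameter are assumptions for the example, not anything from ARKit.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class SlidingWindowMap:
    """Keeps only map points added within the last `max_age_s` seconds."""

    def __init__(self, max_age_s: float) -> None:
        self.max_age_s = max_age_s
        self._entries: Deque[Tuple[float, Any]] = deque()  # (timestamp, point), oldest first

    def add(self, timestamp: float, point: Any) -> None:
        """Add a new point, then throw away anything that has aged out of the window."""
        self._entries.append((timestamp, point))
        self._evict(now=timestamp)

    def _evict(self, now: float) -> None:
        while self._entries and now - self._entries[0][0] > self.max_age_s:
            self._entries.popleft()

    def points(self) -> List[Any]:
        return [point for _, point in self._entries]
```

The map stays small and cheap to search, but anything you looked at longer ago than the window is simply gone, so there is nothing left to relocalize against if you walk back there.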
So the punchline is that with ARCore, even if you do lose tracking, it will recover better and you won’t be impacted.
Both ARCore and ARKit also use a clever concept called “Anchors” to help make the map feel like it covers a larger physical area than it does. I saw this concept first on HoloLens, who as usual are a year or more ahead… Normally the system manages the Map completely invisibly to the user or app developer. Anchors allow the developer to tell the system “remember this piece of the map around here, don’t throw it away”. The physical size of the anchor I think is around 1m x 1m square (that’s a bit of a guess on my part; it’s probably variable depending on how many optical features the system can see. It’s enough for the system to relocalize against when this physical location is revisited by the user). The developer normally drops an anchor whenever content is placed in a physical location. Before anchors, if the user wandered away, the map around the physical location where the content should exist would get thrown away and the content would be lost. With Anchors, the content always stays where it should be, with the worst UX impact being a possible tiny glitch in the content as the system relocalizes and jumps to correct for accumulated drift (if any).
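Here is a minimal sketch of how anchors could change map pruning: old points get thrown away unless they sit close to an anchor. The function name, the 5 second age limit and the 0.5m radius (standing in for the guessed 1m x 1m patch) are all assumptions for illustration; this is not how either SDK is known to implement it.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def prune_map(
    points: List[Tuple[float, Point3]],  # (timestamp, xyz) for each map point
    anchors: List[Point3],               # positions where the app dropped an anchor
    now: float,
    max_age_s: float = 5.0,
    anchor_radius_m: float = 0.5,
) -> List[Tuple[float, Point3]]:
    """Keep points that are recent OR lie within `anchor_radius_m` of any anchor."""
    kept = []
    for timestamp, xyz in points:
        is_recent = now - timestamp <= max_age_s
        near_anchor = any(math.dist(xyz, anchor) <= anchor_radius_m for anchor in anchors)
        if is_recent or near_anchor:
            kept.append((timestamp, xyz))
    return kept
```

With no anchors this behaves like the plain sliding window. Drop an anchor where the content was placed and the small patch of map around it survives, so there is still something to relocalize against when the user comes back.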
The purpose of the Map is to help the Tracker in 2 ways: The first is that as I move my phone back & forth, the Map is built from the initial movement, and on the way back, the features detected in real-time can be compared with the saved features in the map. This helps make the tracking more stable by only using the most reliable features from the current & prior view of the scene in the pose calculation.
The second way the Map helps is by Localizing (or Recovering) Tracking. There will come a time when you cover the camera, drop your phone, or move too fast or something random happens and when the camera next sees the scene, it doesn’t match what the last update of the map thinks it should be seeing. It’s been blindfolded & dropped in a new place. This is the definition of “I’ve lost tracking” which pioneering AR developers would say about 1000 times a day over the past few years. At this point the system can do 2 things:
- just reset all the coordinate systems and start again! This is what a pure odometry system (without a map at all) does. What you experience is that all your content jumps into a new position and stays there. It’s not a good UX.
- or the system can take the set of 3D features that it does see right now, and search through the entire Map to try and find a match, which then becomes the corrected virtual position and you can keep on using the app as if nothing happened (you may see a glitch in your virtual content while tracking is lost, but it goes back to where it was when it recovers). There are two problems here: (1) as the Map gets big, this search process becomes very time & processor intensive, and the longer this process takes the more likely the user is to move again, which means the search has to start again… ; and (2) the current position of the phone never exactly matches a position the phone has been in the past, so this also increases the difficulty of the map search, and adds computation & time to the relocalization effort. So basically even with Mapping, if you move too far off the map, you are screwed and the system needs to reset and start again!
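The second option above can be sketched as a brute-force search. This toy version treats each feature as a small binary descriptor (an int) and matches by Hamming distance, which is an assumption made for the example; real relocalizers use much richer descriptors plus geometric verification of the resulting pose, and the function names and thresholds here are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def relocalize(
    observed: List[int],              # descriptors of features the camera sees right now
    map_descriptors: Dict[int, int],  # map point id -> stored descriptor
    max_distance: int = 2,
    min_matches: int = 3,
) -> Tuple[Optional[Dict[int, int]], int]:
    """Search the whole map for each observed feature.

    Returns (matches, comparisons). `matches` maps observed index -> map point id,
    or is None when too few features matched to trust the result.
    """
    matches: Dict[int, int] = {}
    comparisons = 0
    for index, descriptor in enumerate(observed):
        best_id, best_distance = None, max_distance + 1
        for point_id, stored in map_descriptors.items():
            comparisons += 1
            distance = hamming(descriptor, stored)
            if distance < best_distance:
                best_id, best_distance = point_id, distance
        if best_id is not None:
            matches[index] = best_id
    if len(matches) < min_matches:
        return None, comparisons
    return matches, comparisons
```

The comparison count is the number of observed features times the number of map points, which is problem (1) in miniature: the bigger the map, the longer each search takes. The max_distance tolerance is a nod to problem (2), since the current view never matches a stored one exactly.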
Note when I refer to a “big” map, for mobile AR, that roughly means a map covering the physical area of a very large room, or a very small apartment. Note also this means for outdoor AR we have to think about mapping in an entirely new way (subject of a forthcoming post on multi-player, permanence and why really really good Mapping & Localization is so important). There’s a nice Tango research demo of large scale mapping at 1:19 here.
Robustly relocalizing against a Map is a very very very hard problem and is not solved to a consumer UX level yet IMO, by anyone. Anyone claiming to offer multi-player or persistent AR content is going to have their UX very limited by the ability of a new phone (eg Player2) to relocalize from a cold-start into a map either created by Player 1 or downloaded from the Cloud. You’ll find Player 2 has to stand quite close to Player 1 and hold their phone in roughly the same way. This is a PITA for users. They just want to sit on the couch opposite you and turn on their phone and immediately see what you see (from the opposite side, obviously). Or to stand anywhere within a few meters of a prior position and see the “permanent” AR content left there.
Note there are app-specific workarounds for multi-player you can also try, like using a marker, or hard-coding a distant starting position for P2 etc. Technically they can work, but you still have to explain what to do to the user, and your UX could be hit or miss. There’s no magic “it just works” solution that lets you relocalize (ie join someone else’s map) in the way ARKit/ARCore make VIO Tracking “just work”.
I would love to be educated by anyone who knows differently, but this level of relocalization robustness just isn’t possible AFAIK with any of today’s relocalizers or even any that have been published in the research. It’s one of those problems like VIO, that only a very small number of people can possibly solve. I only know of one unpublished (in stealth) system that reportedly can support consumer grade robustness, and is hoping to come to market in early 2018.
The iPhone-8-keynote sized elephant in the room
I’m pretty impressed with whoever inside Google reacted so fast to ARKit and came up with the best possible spoiler to Apple’s iPhone 8 Keynote. ARCore has:
- just enough features beyond ARKit’s that Apple can’t easily claim they’re better on paper
- a few years of content experiments from Tango & Daydream that work on ARCore and are visibly more mature than what devs could build in a month or two of ARKit work
- enough OEMs in the pipeline that they can claim similar reach “real soon”
- recognition that the way most people will experience these apps (at least the marketing of the apps) is via Video / YouTube. Whatever Apple shows at their keynote will not look (on video at least) like they have more advanced technology than what’s in the ARCore videos. The “technical breakthrough” aspect of ARKit messaging will be dulled a little
All that being said though… I still expect the world to go crazy for ARKit and to believe it’s the first & best AR system anywhere. Apple is the best at marketing “new” tech and they’ve done this many times before. At least Google can now point to ARCore and say “we have that too!” to whoever will listen.
OEMs still have reservations
I get the sense that ARCore was a pretty rushed product launch, and a repackaging of existing assets (eg there’s no ARCore logo yet). I talked in my ARKit post about OEMs’ reservations toward Tango being hardware and Android lock-in. ARCore eliminates the camera stack hardware commodification concerns and Bill Of Materials cost issues with the Tango hw reference design. It looks like Google has conceded some strategic control here, though honestly I think this all happened so fast that those conversations haven’t seriously taken place yet.
But Google is insisting to OEMs that ARCore is part of Google Mobile Services (GMS) Android, which the OEMs don’t like. If AR is truly “the next platform” then it’s an opportunity to reset the power balance in the eco-system. ARCore isn’t just a new feature like some new GUI widget. No startups have been told “oh we don’t need to talk to you about SLAM anymore because Google’s solved the hw requirements problem”. The conversations around decoupling AR from GMS are still very much underway.
A lot of the resistance to GMS is that it isn’t welcome in China, which is the biggest OEMs’ biggest market. They would like one AR software solution that works globally (as would every developer and AR tool maker).
So should I build on ARCore now?
If you like Android and you have an S8 or Pixel, then yes. Do that. If you like iPhones, then don’t bother changing over. The thing developers should be focussing on is that building AR Apps that people care about is really challenging. It will be far less effort to learn how to build on ARKit or ARCore than the effort to learn what to build. Also remember the ARKit/ARCore SDKs are version 1.0. They are really basic (VIO, plane detection, basic lighting) and will get far more fully featured over the next couple of years (3D scene understanding, occlusion, multi-player, content persistence etc). It will be a constant learning curve for developers and consumers. So for now, focus on learning what is hard (what apps to build) and stick to what you know for the underlying tech (how to build it: Android, iOS Xcode etc). Once you have a handle on what makes a good app, then make a decision as to what is the best platform to launch on with regard to market reach, AR feature support, monetization etc.
Is ARCore better than ARKit?
I think as a technical solution they are very very close in capability. Effectively indistinguishable to users when it comes to the user experiences you can build today. ARKit has some tech advantages around hw/sw integration and more reliable tracking. ARCore has some advantages around mapping and more reliable recovery. Both of these advantages are mostly only noticeable by Computer Vision engineers who know what to look for.
Apple has a clear Go To Market advantage with a huge installed base of devices that immediately upgrade to the latest iOS, which will include ARKit. Apple’s users generally spend more money, so in the medium term, AR Apps should monetize better on ARKit. Android’s advantage has always been scale. It will take at least 12 months for the Android ecosystem to get all the pieces together & deals done to get ARCore supported by hardware in most new devices (maybe longer as October is usually when OEMs make 2018 product decisions, so this month is the only chance to make any changes for the next year. Lots of OEM product managers will have a tough month).
ARCore has a nice advantage in the pipeline of Tango R&D sitting on the shelf, waiting to be deployed, much of which has had at least a little bit of user/market testing. It’s going to be fun to see how fast these systems evolve over the next 12–24 months, now that the foundations are in place!
I think the constraint in the rate of AR App adoption in the market will be the ability of the developer community to figure out what to build. The tech is finally (just) good enough to build consumer products on. If you’re an AR developer and you don’t have a senior product/interaction designer on your team, I’d encourage you to prioritize hiring one!
The question of which is better, ARKit or ARCore, right now comes down to personal preferences & goals of the developer. Both systems have their strengths & weaknesses, but what’s important is that both can enable a good enough consumer experience that developers have wide open spaces (literally and metaphorically) to explore.
I always love to chat and especially learn, but not argue.
You can reach me on twitter @mattmiesnieks or email firstname.lastname@example.org