The Map and the Territory: On the Balkanization and Semiotics of Augmented Reality
This post is a distillation of thinking and conversations begun with Mark Domino and Jasper Speicher back in 2009. Despite advances in hardware, not much has changed since then.
Below I’ll make the argument that the current state of augmented reality is much like that of the early internet, and that the same issues that confronted the early web are standing in the way of AR’s potentially transformative adoption.
I’ll also try to address some of the pitfalls in the possible near-term resolutions to these issues.
There are points of inflection in the growth of technologies — ones where no one individual makes an active decision and yet, collectively, we somehow choose a path that fetters new tech with proprietary interests and limits the potential of its life-altering possibilities. I’m looking at you, landline phones, and you, high-speed internet, and you, too, airlines.
This post is based on the assumption that we are headed, regardless of ownership, for a near future with ubiquitous, always-on AR, widely adopted in all the contexts in which internet-connected smartphones now prevail. From a hardware standpoint, that might mean augmented glasses, contacts, windshields on your car, brain implants, suppositories, or what have you.
If you’re not with me on that premise, much of the below isn’t going to work for you. Maybe someday that argument will get its own post here, but to me the conclusion is sufficiently foregone that we don’t need to argue the case.
THE CURRENT STATE of THINGS [BALKANIZATION] and PARALLELS with the EARLY WEB
As it is now, AR, whether enabled through wearable hardware or delivered in ‘magic mirror’- or ‘looking glass’-style interfaces on smartphones, is deployed and consumed in a way that resembles the distribution patterns of the pre-web internet. Predominantly, consumer AR systems are designed to run only one program at a time, and for internet-enabled applications, make a connection to only one other system or data model at a time.
Correspondingly, on the content-creation side, creators are required to re-author their content once per application. If I want to create an experience for, say, the Layar AR platform, I go through their UI-based backend (already not a scalable approach) and deliver a version for that target. If I want that same experience on Aurasma, or Wikitude, or BrowsAR (I thought I made that one up, but it exists), or some other whitelabeled marker-tracking ad hoc fiefdom, I’ll need to re-scope the experience and re-author my content once per platform.
Early internet applications and even, to a degree, early applications for the web, had comparable limitations. At a certain point in the 1990s, if I wanted to, say, transfer files to another user, I’d call that user on the phone, tell them to get ready, launch some kind of socket application, dial up a connection to that user, and transfer them the file. If I then wanted to chat with them, I’d disconnect the session, call them up again, tell them to launch a chat app, close the socket, and open a chat client on my side.
Part of the power of the early web was that it enabled something closer to a write-once-run-anywhere presentation of arbitrary content. This ability began with HTML, Tim Berners-Lee’s adaptation of an already-accepted markup standard, SGML. The platform-agnostic markup of HTML enabled content-agnostic platforms, and shortly thereafter the ‘browser’ was born, beginning somewhere around the arrival of the Lynx text-based browser in 1992.*
It was this critical decoupling of content and presentation that made the web possible, and it was in this context that the push/pull of innovation and standardization took place, bringing us to the point where browsers can now run concurrent, full-featured applications written in a (somewhat) browser-agnostic fashion.
This is not to say that there aren’t careers’ worth of human talent dedicated to the art and science of tailoring content for the web’s many delivery targets — more to highlight that such a situation is many leaps beyond the current thinking surrounding the creation and presentation of AR content. Right now it’s a race to define the most compelling proprietary platform in the hopes of becoming a de facto standard. In doing so, these early entries hope to define a market around a closed system.
* Patently this is a wild oversimplification and glosses the complexity of the browser/standard relationship that exists to this day. I’m trying to make a point.
BUT THE PROBLEM IS BIGGER THAN THAT
These issues could be resolved through means similar to the ones that standardized the web and 3D graphics — standards themselves — and while the establishment of a flexible markup or open SDK for ‘AR browser’-agnostic content would be a consensus-seeking design challenge, it wouldn’t be reinventing the wheel. Indeed, some are at it already.
Wide adoption of a common markup is imaginable, and with that would come the possibility of one or many content-agnostic (and, importantly, hardware-agnostic) ‘AR browsers’, much as we have for screen-based content on the web. Not such a stretch to imagine, but again, not where things appear to be headed at the moment.
The bigger problem comes with the context-aware nature of real AR applications. Where user input to the web is limited to keyboard and mouse input, AR applications are expected to respond to stimulus much as our own minds do. AR hardware should relay to AR software and connected systems any visual input, audio input, voice commands, geolocation information, etc. from all sensors available.
But what is such a system to do with this firehose of information? Sure, you could envision an application that does some work with your location and orientation to overlay map markers on a heads-up display, but we have that today, and, lord, is it boring!
Taken in the case of the web, search provides hyperlinks to remotely-hosted content in response to text-based input. The analogy to its AR equivalent fails just after the word ‘search’. On the results side, hyperlinks won’t do (we’re not mousing about and ‘clicking’ in AR), and the input to the constantly-running ‘search’ is visual and sensor data, not carefully-composed text queries.
In the most general sense, AR search is required, as a first step, to do the same thing our brains do with just about the same information. That sounds easy, doesn’t it? It doesn’t.
TEACHING a ROBOT SEMANTICS
There are a million hard technical problems involved in teaching a machine to recognize what it’s looking at. I’m not going to enumerate them, both because I’m unqualified and because this post is not directly about machine learning or computer vision or any of those things.
It’s about the next problem, after the machine identifies a thing, or a specific instance of a thing, or place, or an individual in its user’s proximity, because that’s the issue on which the unwritten future of AR, and the jetpack you were promised, hinges.
When an AR system resolves what it’s looking at, it’s made one link in the chain of stimulus-response that leads to the system actually providing augment to your surroundings. For example, when I point my AR contact lens device at a Coke can, it’s going to take in some color information (video), possibly some depth information (3D point-cloud), some scale information, location, sound, temp, altitude, etc., and it’s going to use the information available to identify the object.
Currently, smaller-scale author-driven AR systems require content creators to input 3D models and/or 2D textures at the time of creation. The application is then ‘searching’ the input video stream for objects or textures that match something from their database of known targets, and when it identifies one, it presents the response designed by the content creator. Those responses are currently limited, for the most part, to the superimposition of video or 3D content in a pre-defined spatial relationship to the marker or object recognized.
In order to understand the problems inherent in this current system — a group of publishing platforms where the augment-response to input is determined by individual authors, one at a time, in closed systems — we need to look at an example with a lot of obvious stakeholders.
In our hypothetical future of content-agnostic AR browsers running on arbitrary hardware, what happens when we look at a Coke can? To begin with, (BIG gloss here, and more on that below), the system will recognize the can much as our minds do, up to the point of saying “here you have an instance of something that falls into the category ‘cans’ of ‘Coke’”, perhaps even with further specificity like ‘dented’, or ‘open’, or ‘manufactured in Atlanta as part of the November 2017 production run’, etc.
Importantly, this is a symbolic recognition, more abstract and more robust than the current norm of a texture search that ties a single known texture to its defined augment on a single platform.
As a recent example, Google’s newly-announced Cloud Vision API uses the machine learning approach to CV, and, much as Google Goggles (not Glass, mind you) did before it, provides identifications (or ‘labels’) for the actual content of images, as well as geolocation for landmarks, facial detection, and OCR. It stacks these identifications, with paired confidence scores, in responses accessed through a REST API. Through this service, developers can tap Google’s significant capability to identify the content of images. One could see the abilities of the recognition system expanded further, with the system providing more (and more specific) labels in its responses as time progresses, perhaps even categorizing the world’s nouns into a linked taxonomy, as with the WordNet and related ImageNet approaches.
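To make the idea of stacked identifications with paired confidence scores concrete, here is a minimal sketch of how a client might consume such a response. The field names, labels, and scores below are invented for illustration and are not a recorded response from any real service; the threshold is likewise an assumption.

```python
import json

# Hypothetical recognition response. The shape ("labels", "description",
# "score") and every value in it are illustrative assumptions.
SAMPLE_RESPONSE = json.loads("""
{
  "labels": [
    {"description": "beverage can", "score": 0.97},
    {"description": "soft drink", "score": 0.91},
    {"description": "red", "score": 0.62},
    {"description": "tin", "score": 0.41}
  ]
}
""")


def confident_labels(response, threshold=0.6):
    """Return label descriptions scoring at or above threshold, best first."""
    ranked = sorted(response["labels"], key=lambda item: item["score"], reverse=True)
    return [item["description"] for item in ranked if item["score"] >= threshold]


if __name__ == "__main__":
    print(confident_labels(SAMPLE_RESPONSE))
```

Nothing in this output says what should happen once the can is recognized; it only says, with some confidence, what the thing is.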
Regardless of the implementation, here we come to a crucial decision point: what happens next? What augment is presented after the system recognizes you’re in the presence of a Coke can? Remember — it’s the future, these systems have made their way into widespread adoption — they’re mass produced and widely used — does nothing happen? Is nothing in the way of an augment presented in response to the system recognizing a Coke can? Probably not, right?
Does the label seem to animate with a subtle and twee little Coke ad? Does the AR equivalent of a popover play in the air above the can, or does a beautiful family of healthy people in sweaters seem to gather around in photorealistic 3D and drink a toast to you with their own phantasmagorical Coke cans? Or does a Pepsi ad play in front of the can, obscuring my view of the Coke-branded label?
Can I run an AR adblocking plugin or pay for the AR equivalent of a premium service to reduce my run-ins with sponsored content and just see what, in fact, is?
Say I visit Times Square and look up at a Samsung LED billboard. Do I see the billboard as it is? Who says?
What about the faces of the people around you? The text you’re reading from a screen? The blank wall in your apartment? You get the picture.
The matter is complicated by the fact that a ‘popover’, such as it might be, would be unsafe if I were, say, operating a tablesaw, or driving a city bus, so these matters of interpretation, far from being simple in a single context, may vary wildly with any number of other complicating factors.
The issue is, in part, one of owned semantics, and, in a larger sense, networked semiotics. Getting the system to recognize the ‘sign’ is relatively trivial in comparison to the much more involved issue of deciding what is ‘signified’ in an environment fraught with issues of intellectual property and whatever we’ll call the yet-unlegislated right of corporate entities and individuals to decide what, if anything, happens, in ‘associative proximity’ to their ‘semantic identity’ and ‘affiliated physical and relational identities’.
The combination of the ownership and IP issues from meatspace with those in cyberspace will be far more than the sum of their parts. It’s gonna get ugly.
IT’S A UTILITY OR WE’RE BONED
For those of you paying attention, this issue might sound familiar. Underlying the DNS is a web-wide agreement to abide by the assigned associations managed by the ‘non-governmental’ ICANN, a US-based non-profit corporation with exclusive authority over the semantics of the URL as it relates to IP (in this case Internet Protocol, not Intellectual Property) space. While that exclusivity can be fragile at times, it’s been foundational to the neotenic Web, creating a universal text-based whitepages whose IP (Intellectual Property) issues were resolved relatively easily due to their resemblance to existing case law around the use of trademarked names.
I don’t believe trademark case law will quite so easily extend itself to the intellectual property and privacy disputes that AR will present.
‘Ownership’ of the ‘who says’ position, as with many things, resembles a utility, or a common good, and could, through primacy and lobbying and lawyering up, end up in private hands. I probably don’t need to explain that I think that would not be the best outcome. In order for that to be a possibility, though, the system would need to be designed to rely on a central ‘who says’ authority with a profit motive behind it.
That didn’t happen with the DNS only because nobody realized how very valuable sitting in ICANN’s position would be — at least, not until it was too late. That doesn’t mean it won’t happen with AR, and one could guess that Microsoft’s interest in developing AR hardware is probably not limited to the profit to be made on the gear itself. That’s never been their game, so to speak. They (and Magic Leap, from what it seems, and likely many more soon) are interested in owning the pipeline itself — everything from the means of production to the content to the means to deliver it, and you could make a similar assumption about Google’s latest offering of CV-as-service.
This brings us to the technical issues underpinning the implementation of a content-agnostic, sensor-driven, context-aware, always-on consumer-side AR system, because solving them is non-trivial, and simply doing so first might be enough to take the cup.
CLIENT SIDE: IT’S AN OS, STUPID
As Google has demonstrated, there’s not much of a difference between a browser capable of executing arbitrary code and an operating system. Much like some OSes use a window manager to make sure GUIs play nice, so would our AR OS make sure that content is presented in such a way as to respect the context, our preferences, and our human sensory bandwidth.
For example, you might be totally fine with having the front sheet of the NYT read aloud to you as you drive a car, or even perhaps having a small PIP video chat in the corner of your vision, but you really don’t want much more than that going on while speeding down the highway. Conversely, if you’re at home and you’ve padded the important corners, you might be OK with a more immersive experience that would be unsafe to run on a street corner, but you wouldn’t want popups telling you that your gas bill is due.
In terms of a not-too-granular user-defined context, there are analogies to the ‘permissions’ settings on today’s smartphones, where we can assign particular applications access to our attention and our screen real estate under specific circumstances. It’s a start, and, notably, while some web browsers offer notification and permission settings of their own, this type of feature is typically the domain of a unifying layer like an OS. The status quo of one-app-per-feature just won’t scale, especially when the system is so truly pervasive as to make comprehensive user-defined permissions unfeasible (again, see the titular map).
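As a rough sketch of what such a unifying layer might do, the snippet below filters requests for the user's attention against a per-context policy. Every name here (Augment, POLICY, BLOCKED_SOURCES, the context and channel labels) is hypothetical, and a real system would need far richer context than a single string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Augment:
    source: str   # which application is asking for attention
    channel: str  # "audio", "visual_small", or "visual_immersive"


# Hypothetical user-defined policy: output channels allowed in each context.
POLICY = {
    "driving": {"audio", "visual_small"},
    "home": {"audio", "visual_small", "visual_immersive"},
}

# Hypothetical per-context list of sources the user never wants to hear from.
BLOCKED_SOURCES = {
    "driving": {"gas_bill"},
    "home": {"gas_bill"},
}


def admit(augments, context):
    """Keep only the augments the current context permits."""
    allowed = POLICY.get(context, set())  # unknown context: show nothing
    blocked = BLOCKED_SOURCES.get(context, set())
    return [a for a in augments if a.channel in allowed and a.source not in blocked]


REQUESTS = [
    Augment("nyt", "audio"),
    Augment("video_chat", "visual_small"),
    Augment("game", "visual_immersive"),
    Augment("gas_bill", "visual_small"),
]

if __name__ == "__main__":
    for ctx in ("driving", "home"):
        print(ctx, [a.source for a in admit(REQUESTS, ctx)])
```

Defaulting to nothing in an unrecognized context is a design choice made for this sketch: silence seems the safer failure when the system cannot tell where you are.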
Problems with the output aside, what about the task of this operating system to accept and (post? disseminate? stream?) the sensor input and context information necessary to drive the many layers of applications we’ll be running on these systems? Don’t forget — we have no keyboard or mouse (God willing) — it’s all driven by the same stimulus we use ourselves to make sense of the world around us.
THIS CAN GO TWO WAYS
Much as with our perennial expectation that the quality of film CGI be years beyond what realtime systems can deliver, it will be persistently beyond user-side systems’ ability to handle the vision and lookup tasks necessary to make sense of the world around them in a way that meets our ever-increasing expectations for speed, accuracy, and breadth.
That means that our devices will offload this work over a low-latency, high-bandwidth connection. They’ll stream that stimulus and context information to a recognition instance somewhere that has a degree of statefulness for that user, probably with access to information about the user and their contacts, history, preferences, habits, etc., much as our devices and networked services do now.
[Edit — here’s a counterexample already. TBD how scalable this is. It’s 130MB for a limited recognition palette.]
Because of those limitations, up to that point, all potential approaches are identical. It’s in the nature of the returned result that the possibilities diverge, and at the heart of the matter are these issues of ownership and profit motive.
In an AR ecosystem where recognition services are provided by a non-profit or otherwise-‘neutral’ entity, that server will constantly ingest these streams of information and will extrapolate from the provided stimulus a number of discrete associations, devoid of extrapolated meaning and without context (see, for example, Kyle McDonald’s NeuralTalk and Walk):
'That's a bicycle. That's 124 Eagle Street. That's your friend Willy. That's West. That's a UPS truck. That's the street. That's the curb. That's a traffic light and it's green.'
… this collection of discrete identifications will get piped back to the hardware as some kind of markup for the world, ready for user-side applications to interpret.
That means the recognition of stimulus and its interpretation as augments through user-side code could continue to function as the web does now: in a physically-distributed network of identical associations controlled by some kind of widespread, agreed-upon authority, à la DNS. Critically, it means that the recognition and interpretation tasks are decoupled.
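A minimal sketch of that decoupling, under invented assumptions: the recognizer returns bare identifications (the record shape and the "kind" vocabulary below are made up for illustration), and independent user-side applications decide what, if anything, each one means.

```python
# Hypothetical "markup for the world": bare identifications, no meaning attached.
WORLD_MARKUP = [
    {"kind": "object", "label": "bicycle"},
    {"kind": "address", "label": "124 Eagle Street"},
    {"kind": "person", "label": "Willy"},
    {"kind": "signal", "label": "traffic light", "state": "green"},
]


def navigation_app(item):
    """A user-side app that only cares about addresses and signals."""
    if item["kind"] == "signal":
        return f"signal is {item['state']}"
    if item["kind"] == "address":
        return f"you are at {item['label']}"
    return None


def social_app(item):
    """A user-side app that only cares about people."""
    if item["kind"] == "person":
        return f"say hi to {item['label']}"
    return None


def interpret(markup, apps):
    """Offer every identification to every installed app; collect the augments."""
    augments = []
    for item in markup:
        for app in apps:
            result = app(item)
            if result is not None:
                augments.append(result)
    return augments


if __name__ == "__main__":
    print(interpret(WORLD_MARKUP, [navigation_app, social_app]))
```

The recognizer never learns which applications are installed, and the applications never see raw sensor data. In this sketch the bicycle goes unaugmented simply because no installed app claims it.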
Of course, such processing capability would come at great cost to the provider, both in development and operating expenses, and, as such, would necessitate that that provider recoup those expenses somehow. In the case of ICANN, the closest web equivalent of this service is provided through a non-profit, and that seems to be working out OK. In that system, nominal registration fees, and, more recently, new gTLD auctions, have kept the organization in the black.
The type of system we’re contemplating here won’t be quite so simple to develop or maintain as the root name server of the DNS, and, as discussed above, it’s not so simple to define what can and can’t be sold as of yet — especially by a nonprofit — so what would keep the lights on at a notional Internet Corporation for Universal Identification and Taxonomy? Even with the democratization of deep-learning tools, GPU farms aren’t cheap to operate and bandwidth isn’t free.
It’s easier to imagine a future where this recognition service is all but monopolized by a small number of Google-scale entities. Companies offering recognition-as-service would seek to vertically expand their integrations, closely tying consumer-side hardware and software with recognition services, much like the process we’ve seen with smartphones, game consoles, e-readers, etc.
Once the loop is closed, there’s nothing to stop a provider from offering up unprecedented consumer access to the highest bidder.
This could mean any number of monetizations — the presentation or prioritization of sponsored content around certain products, places, and people, for sure, and even specific contexts, like maintenance guidance when you look under your car’s hood, or product reviews in a retail setting. Think of the staggering economic power of a corporation with fully integrated control of an always-on audio-visual system literally embedded into the consciousness of millions of consumers. You can at least look away from a TV ad.
Imagine how different these applications would look under the thumb of a profit-seeking corporation than if they were the product of an unowned, decentralized ecosystem.
In a vertically integrated system it would be feasible to stream pre-rendered augments back as a flattened feed, an approach that gets its day in the sun once every few years in the world of gaming, or even web-based applications. Bruce Harris, Technical Evangelist for Microsoft’s HoloLens project, recently confirmed they’ve got that idea in back of mind, at least.
By way of further example, a bit later in the same talk, Harris confirms that Microsoft’s intent with the HoloLens platform is to route all multi-user experiences through a server (probably a Microsoft server) for synchronization. Those multi-user experiences are, I would argue, the ‘shared hallucination’ that will be fundamental to a good bit of AR’s widespread use. More on that in a later piece.
Server-side applications and pre-rendered content would further tighten vertical integration, obviating any issues around client-side reverse engineering or ‘jailbreaking’ and turning client hardware into simple dumb terminals for what would be marketed as a ‘free’ service simply festooned with commercial content.
WHAT IF WE DID IT TOGETHER
“We reject: kings, presidents, and voting. We believe in: rough consensus and running code.” — The Tao of IETF
All but the most dyed-in-the-wool libertarians would have to acknowledge that the combination of profit motive with a monopoly over a service critical to participation in the modern world (what we call a utility) can lead to bad things.
All but the leftiest among us would have to acknowledge that, speaking broadly, the more modern and complex the system, the worse it fits the type of solutions government offers. This especially with the recent trend of congressional representatives proudly touting their lack of knowledge about the plumbing of the internet. Add transport layers and syntactics into the mix and we might put the hon. representative from Iowa into a coma.
Even were a non-governmental non-profit like ICANN to standardize recognition tasks, we’d have a touchy situation fraught with its own problems around censorship, cultural norms, taboos, geopolitical conflict, and government as IP cop, with issues similar to, but far exceeding, Google’s never-ending challenges in the prickly world of maps. It would be ground zero for a never-ending fight with no higher authority to arbitrate disputes.
There’s an argument to be made that the scale of the recognition problem itself indicates that the only feasible, scalable solution will have to rely, in part, on the accretion of user input, something akin to the distributed OCR of the reCAPTCHA project. But there’s that mind projection fallacy cropping up again, as, until just recently, before the techniques we call ‘deep learning’ and ‘machine learning’ bore fruit, problems like voice recognition and computer vision seemed too complex for the means at hand.
Those early approaches involved, for the most part, attempts to break complex recognition problems down into a mélange of smaller, mostly unrelated tasks. Deep learning, while more self-directed, still needs vast, correctly-tagged data sets and reams of real-world experience to train from, and from where we’re sitting, we’d assume that means humans have to do that tagging work first — a job that may be beyond the means or ability of even the most deep-pocketed Mechanical Turk HIT.
The explosive recent success of deep learning in a set of rapidly-expanding and disparate fields should be a hint, though, that maybe the challenge of finding appropriate training material might not be a hurdle for long.
Still, even with advances enabling a modest organization to train a recognition system capable of the types of tasks we’re contemplating, data centers don’t come cheap, and short of Wikipedia, we haven’t seen many instances of viable large-scale self-organizing utility projects cropping up on the young web. It’s not even a matter of finding somebody to pay the bills, which Wikimedia manages to do — somebody has to steer the ship, and even Jimmy Wales himself hit a few icebergs driving a thing so (relatively) simple as an encyclopedia.
But let’s assume we’ve learned a bit from our recent mistakes, and you’re with me so far — what would it take to begin the work of building what might end up being a new vertical in the Internet Protocol Layer and a new utility all in one? What would this organization look like?
Is it even possible to imagine a future where this role is filled by anything but a profit-seeking corporation?
Can a distributed group, or a standards organization, or a government body, even fill this role, and could they do it first? Or better?