When Computers Understand Your World Before You Do.

Albert Dong
Inborn Experience (UX in AR/VR)
9 min read · Nov 27, 2018

2/x — This piece is part of a series of exploratory essays where I share my thoughts on topics I’ve been mulling over in the AR/MR space. Let me know if this resonates with you or if you dispute these thoughts — I’d love to chat about it! 1/x is here.

Someone introduced me to a perspective that’s been on my mind for the last couple of days.

The idea: humanity is currently in the process of birthing an alien.

Not in the biological sense, but alien in its mode of perception. Our child, the AI we are creating through code, is something that fundamentally perceives the world differently from humans.

Wild, right? Let me explain.

While both a human and a computer can take a glance at a living room and quickly find the couch, computers do not understand the concept of a couch the same way a human does.

A computer will take any given pixel cluster and compare its form against the billions of images it has been trained to recognize. Through pattern matching, it will draw a conclusion about what that pixel cluster, that object, may be.
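To make this concrete, here is a minimal sketch of that pattern-matching step using a pretrained image classifier (I’m assuming PyTorch/torchvision here, and the file name couch.jpg is hypothetical). The point is that the model can only return one of the labels it was trained on: similarity, not purpose.

```python
# A sketch of the pattern matching described above. The model can only
# output one of the 1,000 ImageNet labels it was trained on; it has no
# notion of an object's purpose, only similarity to training data.
import torch
from PIL import Image
from torchvision import models

# Load a classifier pretrained on ImageNet (a fixed set of 1,000 labels).
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()
preprocess = weights.transforms()

def label_object(image_path: str) -> str:
    """Return the single most similar known label for an image."""
    batch = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        scores = model(batch).softmax(dim=1)
    return weights.meta["categories"][scores.argmax().item()]

# An avant-garde couch that resembles nothing in the training set will
# still be forced into one of the 1,000 known categories.
print(label_object("couch.jpg"))  # hypothetical image file
```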

If an artist were to reimagine the form of a couch and place their model in an empty room, it might look like this —

As people, we would be able to identify that it is still a couch because of its perceived purpose.

A computer would struggle to do the same. It would have no understanding of this bizarre, foreign object, as it has never been trained to perceive anything like it.

Humans label objects by taking a multitude of factors — context, relevance, purpose — into account. Computers label objects by taking only one factor into account: an object’s similarity to previously identified objects.

Now let’s put this in the context of a ubiquitous augmented-reality medium through which all visual perception passes — basically AR glasses such as Magic Leap’s. Before we are even able to make sense of any object we’re seeing, the computer in our glasses will have already categorized the object and assigned it a label, the name/definition of an object, which is then disseminated across its network for other applications to process.

Perception with the aid of Head-Mounted Displays
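No such shared perception layer exists today, so what follows is a purely hypothetical sketch of what a label event, broadcast from the glasses to every subscribed application, might look like. All of the names and fields are my own invention, not any real AR SDK.

```python
# Hypothetical sketch of the dissemination step described above: the
# glasses' perception layer assigns a label once, and every subscribed
# application receives it pre-interpreted. All names here are invented.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class LabelEvent:
    object_id: str     # stable ID for the tracked pixel cluster
    label: str         # the computer's chosen name, e.g. "couch"
    confidence: float  # a similarity score, not human judgment

def broadcast(event: LabelEvent,
              subscribers: list[Callable[[LabelEvent], None]]) -> None:
    """Push the same pre-assigned label to every downstream app."""
    for on_label in subscribers:
        on_label(event)

# Two apps, one perception: both receive "couch" before the wearer has
# consciously decided what the object is.
broadcast(LabelEvent("obj-42", "couch", 0.93),
          [lambda e: print("furniture app:", e.label),
           lambda e: print("shopping app:", e.label)])
```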

So, in such a world, where your visual perception now has a new overseer, one whose knowledge and cognitive speed are orders of magnitude beyond your own, how will our interactions with the world change as a result?

I’ve been toying around with a few ideas.

Future Form of the Dictionary

Okay, first thought.

If there is no centralized database from which AR glasses extract information, then the labeling of real-world objects and ideas may differ from software to software. In this case, what determines the correctness of a definition? Who defines what a couch actually is?

While a variety of dictionaries already exist (Merriam-Webster, Dictionary.com, etc.) and are each somehow treated as the de facto norm, they supply definitions that a human then interprets to decide whether they apply to whatever that human is presently seeing.

AR glasses would black-box this process and assign a label to an object directly, removing human interpretation entirely. If the computer thinks it sees a couch, it will give it the label of couch regardless of what the human wearer thinks it is. Given the plurality of definitions for any given word, such labels will converge on the spiritual definition of the word (as used by the collective consciousness — what the masses commonly think of when they refer to a word) rather than its literal definitions.

So, when a new object is created and there is no precedent for its existence on the internet, who gives this new object its label? Is it the will of the creator, the collective consciousness of the masses, or the new player — the computer in your glasses? Can it even have multiple labels, as objects often do when viewed by people with different backgrounds, or will every object in the world be categorized under a singular name/definition?

Adding new labels to the databases used by AR glasses will be tough. For creators, this poses a major problem, as they will have to be the ones to add the labels for their new creations. A computer cannot infer an object’s label from its purpose the way humans do. It would not know that the aforementioned avant-garde couch is, indeed, a couch.

Adding new labels to one such database would require (at the very least) the following; a rough sketch of what a single entry might look like follows the list:

  1. a user-friendly interface that non-coders can use to input labels and pictures
  2. a method to discern originality — who is or isn’t the originator of such a form + who has the right to edit an object’s label
  3. a method to disseminate this update across the entire network
  4. a way to keep the labeling consistent with the cultural norms and the spiritual definitions of the word
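As a thought experiment, here is a rough sketch of what one entry in such a database might look like, mapping each field to a requirement above. Every name here is an assumption; no real system is being described.

```python
# Hypothetical data model for one label entry in a shared AR database.
# The field names and the edit policy are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class LabelEntry:
    name: str              # e.g. "couch"
    creator_id: str        # originator of the form (req. 2)
    example_images: list[str] = field(default_factory=list)  # via the non-coder UI (req. 1)
    version: int = 1       # bumped on each network-wide sync (req. 3)
    community_votes: int = 0  # proxy for the word's "spiritual" definition (req. 4)

    def can_edit(self, user_id: str) -> bool:
        """One possible policy: only the originator may edit the label."""
        return user_id == self.creator_id

entry = LabelEntry(name="couch", creator_id="artist-01",
                   example_images=["avant_garde_couch.jpg"])
print(entry.can_edit("artist-01"))  # True
```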

Now, if different databases exist for different glasses, adding these new labels to multiple databases will be that much tougher for the creator.

Not a fun time 😬

Lost-In-Translation

The previous thought assumes definitional variance within one language. Luckily, we have over 6,500 of them — which is a beautiful thing; I love learning new languages. However, our AR future may drive the majority of them to extinction.

In this future, it is possible that the visual definition of words, as defined by the set of images used to construct them, will exist in only one or two languages. As such, it will adhere only to those cultures’ understanding of the word. Given that the US and China are at the forefront of the AR race, it’s safe to assume such definitions will be constructed in English or Mandarin. For speakers of other tongues, any object will be labeled in one language by the computer and translated to the user’s native tongue before being output to them. In doing so, meaning is lost, as direct translations often skew the cultural meanings of words.

For speakers of other tongues, the problems that come with the black-boxing of interpretation multiply. Now there are two points in the visual perception process at which a comprehension error can occur: 1) when actually labeling an object, and 2) when translating that label into a foreign language without skewing its intended meaning.

A quick example here would be the English word loveseat. Loveseat is a popular term for a cozy sofa meant to seat two people who are in some form of romantic relationship. In Mandarin, the equivalent of loveseat is 双人沙发, which translates more closely to “two-person sofa”. Both refer to the same object, but the Mandarin equivalent carries none of the connotations or implications of the English word.
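A toy sketch of those two error points, using the loveseat example. Here classify() is a hypothetical stand-in for the glasses’ model, and the translation table is invented, not a real machine-translation API.

```python
# Toy sketch of the two error points described above. classify() is a
# hypothetical stand-in for the glasses' model, and the translation
# table is invented for illustration.
TRANSLATIONS = {
    # Literally "two-person sofa": the referent survives the pipeline,
    # the romantic connotation of "loveseat" does not.
    ("loveseat", "zh"): "双人沙发",
}

def classify(pixels) -> str:
    """Error point 1: the model may mislabel the object outright."""
    return "loveseat"  # assume it labeled this one correctly

def perceive(pixels, user_language: str) -> str:
    label = classify(pixels)
    # Error point 2: translation preserves the object, not the nuance.
    return TRANSLATIONS.get((label, user_language), label)

print(perceive(None, "zh"))  # -> 双人沙发, connotation lost in transit
```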

In such a future, what will be the effect of ubiquitous AR on the variance of perspectives and viewpoints, informed by cultural-linguistic differences, spread across humanity?

If English or Mandarin becomes the predominant language, then beautiful words such as sangfroid or zugzwang, which have never been fully absorbed into everyday English, will cease to exist, and we, as a species, will have lost the vocabulary to engage with those ideas. I’m worried that this technology will cause words whose meanings are captured only by their connotations in their native language and culture to be lost, or have their meanings skewed, by this digital consolidation of language.

While it is possible to construct visual dictionaries for all languages using different datasets, the difficulty of doing so properly will likely lead to the creation of only one or two “complete” dictionaries comprehensive enough to be useful. The trick here is to make it easy for speakers of minority languages to add their own words and edit existing ones to fit their own cultural interpretations. It’s a hard UX problem that’s important to solve if we are to construct a future that doesn’t erase the ideals of minority cultures.

Arbiters of Association

What’s more complex about this future dictionary isn’t so much the labels it assigns to objects; it’s the effects of the associations derived from these objects. Couches have their literal definitions but are also defined by their associations with other objects. A woman lying on a couch may be seen as relaxed, and a car driving right at you may be seen as dangerous. What follows the labeling of objects is the mapping of their intersections with other objects. Together, these are used by the computer to understand situations and support the user.

An example relevant to our current sociopolitical climate is the image of two men holding hands. Feeding a dataset scraped from the internet into a machine learning model would likely cause such an image to be labeled as brothers. However, these two people are just as likely to be husbands, as changing social norms have made public displays of gay love more acceptable. #LoveIsLove. Because a computer would have neither this context nor this cultural understanding, it would often mislabel the image and the relationship.

Two men holding hands is two men holding hands. That they are men (sex, not gender), have hands, and are holding each other’s hands are facts. But if they are interpreted to be brothers based on those factors, that goes beyond labeling the facts: the computer is drawing a conclusion in order to obtain more information with which to support you. Perhaps you want a picture of yourself; the computer may recommend you ask them to take it, since their mood is not romantic and therefore unlikely to be spoiled. If you did ask for the picture and they were actually a gay couple, then yikes, you may have just spoiled their good mood. That’s a line of reasoning that is culturally and contextually unaware, and indicative of the kinds of conclusions a computer may draw.
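To illustrate the shape of that reasoning, here is a toy sketch in which the model simply picks the relationship with the highest prior from its scraped, skewed training data. The numbers are made up for illustration.

```python
# Toy sketch of context-free inference: the model returns whichever
# relationship was most frequent in its training data. The priors
# below are invented for illustration.
RELATIONSHIP_PRIORS = {
    "two men holding hands": {"brothers": 0.6, "friends": 0.3, "couple": 0.1},
}

def infer_relationship(scene: str) -> str:
    priors = RELATIONSHIP_PRIORS[scene]
    # A statistical guess gets passed downstream as if it were a fact.
    return max(priors, key=priors.get)

print(infer_relationship("two men holding hands"))  # -> "brothers"
```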

It’s likely that these lines of reasoning will be the direct result of the interpretations of the masses, or potentially of the engineers. They will be the ones who establish the de facto norms of the computer that will live in your AR glasses. In both cases, it’s the interpretations of large, homogeneous groups that prevail. The interpretations of heterogeneous groups, which contain underrepresented minorities (racially, socially, and socioeconomically) and collectively constitute the world’s majority, face complete erasure as a result. That’s scary, which is why we need to be thinking about D&I initiatives now.

Frameworks for Innovative Thought

A final thought: what will ubiquitous AR glasses that can label objects mean for the future of innovative thought? If objects are labeled, and ideas inferred from them, by computers before humans can consciously place labels themselves, potentially monumental leaps may be preemptively categorized as incremental steps because the little computer on our heads improperly filed them under a pre-existing concept. In doing so, the computer pre-assigns and restricts the language we can use to have engaging discourse around the concept.

We are thus forced to think about the future in the terms of the past, using ill-fitting labels that prevent us from looking at a novel object in a vacuum, free from the baggage of linguistic connotations.

I know that thought is a bit convoluted but hear me out.

Take the invention of the iPad, a tablet. If it had first been released while AR glasses existed, it might have been mislabeled as simply a large iPhone. That’s not incorrect to say, but the form factor of the iPad lends itself to a device whose use cases sit completely in the realm of neither the phone nor the computer. If, at its creation, the iPad had been labeled a phone and people had thought of it as simply a large phone, we would have been unable to imagine use cases for it that are unobtainable on a phone or a laptop. We would never have gotten a powerful, portable illustration device and the Apple Pencil.

Apple knew this and purposely named it the iPad rather than the iPhone XL.

A world perceived through computers will mean that labels, the names/definitions of objects as classified by computers, will have an even greater impact on our society.

The idea of using AR glasses as a lens to better understand the world around us is an enticing one. I think it could have major benefits for our world: turning unskilled workers into semi-skilled workers in a fraction of the time, rapidly identifying lost items, and allowing disabled people to compensate for lost senses by imitating synesthesia. But there are some unsettling bits too, and we are reaching a point in this technology’s development where we need to be conscious of them. With paradigm-shifting technologies such as AR, avoiding these mass-consumer pitfalls will be key to enabling the widespread adoption and retention of a technology that has so much potential to do good for the world.
