AR and VR: Enter the Metaverse

Augmented Reality (AR), Virtual Reality (VR), and how they may interact with Web3 and the Metaverse

Devjit Kanjilal
Margin_Squeeze
7 min read · Feb 3, 2022



I sometimes forget to provide context, and last week, I forgot to mention that my article on Web3 was part one in a three-part series discussing the Metaverse.

We hear a lot about the “Metaverse” these days. If you’re like me, you probably find it exciting, since the Metaverse is an entire ecosystem of new and innovative ideas that could change how humans communicate. But it is also a little confusing. Learning about new technologies is never easy, so my goal with this three-part series is to give enough context for the Metaverse to make sense.

We learned from part one of this series that a key aim of Web3 is to de-platform (known in this context as removing “walled gardens”) because de-platforming allows creators to own and monetize their content. Ownership and freedom are critical to making the Metaverse work; otherwise, you would have to establish a new account and avatar every time you moved between platforms. A Web3 metaverse is like one giant shared game of The Sims, whereas a walled-garden metaverse is one where everyone has their own disconnected copy of The Sims.

A metaverse is a network of 3D virtual worlds focused on social connection. In futurism and science fiction, the term is often described as a hypothetical iteration of the Internet as a single, universal virtual world that is facilitated by the use of virtual and augmented reality headsets.

Now that we have an idea of what could provide universal connectivity to the Metaverse (hint: Web3), it’s also important to understand how we might interact with it. This is where Augmented Reality (AR) and Virtual Reality (VR) come into play.

This week, for part two of my series, I will explain the computer vision technology behind AR and VR, discuss its current applications, and then look at where the technology is headed.

Computer Vision Explained

Computer vision is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information from digital images, videos and other visual inputs — and take actions or make recommendations based on that information. If AI enables computers to think, computer vision enables them to see, observe and understand.

In short, computer vision involves processing, analyzing, and using logic to make sense of visual data just like we do as humans. Instead of a brain, there is an algorithm, and optical sensors replace eyes. There are four main use cases for computer vision (a small code sketch of the classification case follows the list):

  • Facial recognition: As the title aptly states, this field involves using computer systems to recognize faces and identify them.
  • Object classification: Parsing visual content and classifying the object in a photo/video under a defined category. For example: hot dog? Or not a hot dog?
  • Object identification: The step before classification. Is the object even there? For example: is there a hot dog in the image?
  • Object tracking: Following the hot dog through a scene once it has been identified and classified.
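
To make the classification use case concrete, here is a minimal “hot dog or not” sketch using a pretrained ImageNet classifier from torchvision. The image path and the hot-dog class index (934 in the standard ImageNet-1k label set) are illustrative assumptions, not part of any system discussed in this series.

```python
# A minimal "hot dog or not" sketch using a pretrained ImageNet classifier.
# Assumes torchvision >= 0.13 is installed; the hot-dog class index (934) in
# the standard ImageNet-1k label set is an assumption worth double-checking.
import torch
from torchvision import models, transforms
from PIL import Image

HOTDOG_CLASS = 934  # "hotdog, hot dog, red hot" in ImageNet-1k (assumed index)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

def is_hot_dog(path: str) -> bool:
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)   # shape: (1, 3, 224, 224)
    with torch.no_grad():
        logits = model(batch)
    return logits.argmax(dim=1).item() == HOTDOG_CLASS

print("hot dog!" if is_hot_dog("lunch.jpg") else "not a hot dog")
```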

A key hypothesis about how humans process visual stimuli is that we rely on pattern recognition. Computer vision is also based on pattern recognition, and it has become far more effective in recent years thanks to advances in artificial intelligence and growth in computing power.

To make computer vision work, we train computers on massive amounts of visual data (videos, images, etc.), label the objects in it, and let the system find patterns. For example, if we feed millions of images of hot dogs through a computer vision algorithm, the computer will analyze the images, identify the patterns common to all hot dogs, and build a model specific to hot dogs. That model can then identify, classify, and potentially even track hot dogs in new images or video. It will not recognize oranges, but it will recognize “not hot dogs”.
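
As a sketch of what “building a model specific to hot dogs” can look like in practice, here is a hedged transfer-learning example in PyTorch: a network pretrained on general images is reused, and only its final layer is retrained to separate hot dogs from everything else. The folder layout (data/train/hotdog, data/train/not_hotdog), batch size, and epoch count are illustrative assumptions, not a tuned recipe.

```python
# Sketch of training a "hot dog model" via transfer learning.
# Assumes a folder layout like data/train/hotdog and data/train/not_hotdog.
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ImageFolder labels each image by its subfolder name (hotdog / not_hotdog).
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Start from a network that already "sees" general patterns, then retrain
# only the final layer to separate hot dogs from everything else.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # 2 classes: hot dog / not

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```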

Interestingly, what we see as a hot dog is not what a computer vision model sees. A model views images and videos as a grid of pixels; in a grayscale image, each pixel is a single 8-bit number ranging from 0 (black) to 255 (white), and a color image simply stacks three such grids (red, green, and blue).

Abe Lincoln through a computer
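
You can see this “grid of numbers” view yourself with a few lines of Python; the file name below is a placeholder, and Pillow and NumPy are assumed to be installed.

```python
# The "what the computer sees" view: an image is just a grid of numbers.
import numpy as np
from PIL import Image

img = Image.open("lincoln.jpg").convert("L")   # "L" = single-channel grayscale
pixels = np.array(img)                         # 2-D array of 8-bit values

print(pixels.shape)     # (height, width)
print(pixels.dtype)     # uint8 -> each value is 0 (black) to 255 (white)
print(pixels[:5, :5])   # top-left 5x5 corner as raw numbers
```

Running this on an image like the Lincoln illustration prints exactly the kind of number grid shown above.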

Processing for computer vision is commonly done through three technologies: deep learning, convolutional neural networks (CNNs) for images, and recurrent neural networks (RNNs) for video.

Deep learning is a type of artificial intelligence structured to function like the neurons that make up the human brain, while CNNs and RNNs are architectures that break visual objects down into pixels which are tagged to represent features. Convolutions (a very painful mathematical operation a past professor once made me do by hand!) are performed on this data, resulting in predictions of what the computer system is seeing.
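
For the curious, here is roughly what that hand-cranked convolution looks like when NumPy does the arithmetic: a small kernel slides over a toy 5x5 “image”, and each output value is the sum of an element-wise multiply between the kernel and the patch underneath it. The image and kernel values are made up for illustration.

```python
# The convolution a professor makes you do by hand, done with NumPy instead.
# (Strictly, deep learning libraries compute cross-correlation, i.e. the
# kernel is not flipped, but the idea is the same.)
import numpy as np

image = np.array([
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
], dtype=float)

# Simple vertical-edge kernel (a Sobel-like filter).
kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)

out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]   # 3x3 window under the kernel
        out[i, j] = np.sum(patch * kernel)

print(out)  # large values where the image jumps from dark to bright
```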

CNNs and RNNs process data as features. They don’t process visual objects in their entirety but break them down into components. For example, a CNN processing the image of a zebra might first discern hard edges to outline the animal, then the angle of the stripes, simple shapes, colors, and so on, filling in missing information as it runs iterations of predictions. These iterations are critical checks against known data sets that narrow the model down until it processes visual stimuli about as accurately as humans do. I won’t go into the technicalities of the more complicated RNNs here, but think of them as being able to process temporal (time-series) information using different mathematical functions (activation functions), something CNNs are poor at.

How a CNN works
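
Below is a minimal, illustrative CNN in PyTorch that mirrors the description above: early convolution layers pick out edges and simple shapes, pooling shrinks the image, and a final linear layer turns the extracted features into class scores. The layer sizes are arbitrary choices for the sketch, not a reference architecture.

```python
# A tiny CNN sketch: convolution + pooling layers extract features,
# a final linear layer turns them into a class prediction.
import torch
from torch import nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # simple shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)        # (batch, 32, 56, 56) for 224x224 input
        x = x.flatten(start_dim=1)
        return self.classifier(x)   # raw scores per class

model = TinyCNN()
scores = model(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 2])
```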

What's AR and VR got to do with this?

Now that you have a general idea of how the technology works — it’s much easier to talk about AR and VR.

Augmented reality (AR) is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, sometimes across multiple sensory modalities, including visual, auditory, haptic, somatosensory and olfactory.

AR is the convergence of the virtual and physical worlds. AR consumes information from a visual stream (e.g., your phone camera), processes that stream using CNNs/RNNs, makes inferences about what it is seeing, and then lets the user interact with that data. Snapchat filters are a perfect and relatable example of AR: the deep learning models run by Snapchat recognize faces, and the user can then interact with the recognized face through a variety of different options.
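
Snapchat’s own models are proprietary, so as a stand-in here is a stripped-down version of the same idea using OpenCV’s bundled face detector: grab frames from the webcam, find faces, and draw something anchored to them. It is an illustrative sketch of the AR loop (capture, infer, interact), not Snapchat’s pipeline.

```python
# A stripped-down "Snapchat filter" loop: detect a face, anchor a drawing to it.
# Uses OpenCV's bundled Haar cascade as an illustrative stand-in for the
# deep learning face models a real AR app would run.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Stand-in "filter": draw a box where a sticker or lens would go.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("AR-ish demo", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```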

Virtual reality on the other hand is not convergence, but divergence from the physical world.

Virtual reality (VR) is a simulated experience that can be similar to or completely different from the real world.

In the case of VR, computer vision is critical to understanding the relation of the user to the virtual environment. For example, deep learning models may track the user’s gaze, the headset’s spatial location, and so on, to build a spatial understanding of the user so that the virtual reality feels more real. There may even be an AR-like aspect to interacting with the VR universe, for example moving a virtual object by grasping at the air. Meta (formerly Facebook) is investigating adding gaze tracking to create a more realistic Metaverse experience as part of the upcoming Oculus line-up.
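
As a purely hypothetical sketch of one slice of that tracking loop, the snippet below smooths noisy head-pose samples before updating the virtual camera so the rendered world does not jitter. The sensor-reading and rendering functions are placeholders, not calls into any real headset SDK.

```python
# Hypothetical VR tracking loop: smooth noisy head-pose samples before
# handing them to the renderer. All functions here are placeholders.
import numpy as np

ALPHA = 0.2  # smoothing factor: lower = steadier but laggier camera

def read_head_pose() -> np.ndarray:
    """Placeholder for a headset sample: (x, y, z, yaw, pitch, roll)."""
    true_pose = np.array([0.0, 1.6, 0.0, 30.0, -5.0, 0.0])
    return true_pose + np.random.normal(scale=0.05, size=6)  # sensor noise

def render_frame(pose: np.ndarray) -> None:
    """Placeholder for handing the pose to the renderer."""
    print("camera at", np.round(pose, 3))

smoothed = read_head_pose()
for _ in range(5):                    # a few frames of the tracking loop
    sample = read_head_pose()
    smoothed = ALPHA * sample + (1 - ALPHA) * smoothed  # exponential smoothing
    render_frame(smoothed)
```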

What do AR and VR mean for the Metaverse?

One of the most common perceptions of the Metaverse is that it will be a network of 3D virtual worlds focused on social connection. The technologies behind Web3 will likely serve as the plumbing behind making the Metaverse work (because of the openness it creates) while AR and VR may be the interaction model.

AR and VR are not simple technologies, and understanding the complexity of the computer vision behind them raises an important question: how do you build a Metaverse without platformizing the technology used to interact with it (AR and VR)?

For example, even if walled gardens are removed and there is a cross-platform Web3 utopia where everyone can interact freely and control their content (perhaps through NFTs), users of the Metaverse will still likely need to buy branded hardware to interact with it: Google and Apple are building AR gear, and Facebook owns Oculus.

More importantly, while a lot of computer vision software is open source, putting it all together is not intuitive. Much like with a Raspberry Pi, not everyone wants to tinker with building their own or settle for lower-quality hardware (clears throat… Acer).

I predict that if the Metaverse does happen, the same platforms of Apple, Google, Amazon, and Facebook that control Web2 will control the premium interaction model for the Metaverse despite it being built on Web3.

In summary, last week, we learned that Web3 aims to remove walled gardens and return ownership and monetization of content to internet users; and this week, we learned about how AR/VR can be used to deliver experiences on a Metaverse built on Web3! Next week, I will close off the series with a deeper overview of the Metaverse and some of the happenings, companies, and technologies in this space.
