Basics of stereoscopic imaging in virtual and augmented reality systems

Mahmoud Afifi
Oct 13, 2016
A stereoscopic frame from the movie Avatar (2009)

Virtual Reality (VR) immerses the user in a virtual world, while Augmented Reality (AR) augments the real world with Computer-Generated (CG) objects. Basically, both VR and AR depend on virtual objects, which either simulate the real world or are integrated into it. Despite this substantial reliance on virtual objects, VR and AR are considered multi-disciplinary fields rather than mere exercises in creating virtual objects. Both VR and AR are essentially based on computer vision techniques: in VR environments, these are mainly used to allow the user to interact with the virtual world, while in AR systems they are usually used in the registration process, where CG objects are placed steadily in the real world [1, 2].

Virtual Reality

In spite of the early emergence of VR, there is still a great deal of research in this field. Following the remarkable development of hardware devices, many VR systems have been presented. M. Rheiner [3] presented Birdly, a flight simulator controlled by the user's hands and arms to imitate birds' flight through sensory-motor coupling. The user's movements are conveyed to a software program that projects a virtual landscape onto a Head-Mounted Display (HMD) worn by the user. Another impressive virtual reality system is the indoor skydiving system [4], a virtual jump simulator that mimics skydiving in an indoor environment. The system relies on an HMD, a headphone set, and a motion capture system.

Flying the Birdly virtual reality simulator

Augmented Reality

Instead of immersing the user in a virtual world, AR supplements the real world with CG objects, and each AR system must:

1- augment the real environment with CG objects,

2- register CG and real objects with each other,

3- respond interactively in real time [1].

Thus, AR is considered a multi-disciplinary field that requires hardware equipment, tracking sensors and/or techniques, CG object rendering techniques, and User Interfaces (UIs). These requirements must satisfy the three main principles of AR systems mentioned above [1]. AR-related hardware has evolved considerably over the last two decades, starting from NaviCam, the world's first handheld, ID-recognition AR system, developed in 1994–1995, through the Glasstron (a series of optical HMDs), to Google Glass in 2012. In addition, depth sensors such as the Microsoft Kinect are used to make AR systems more interactive [5]. On the other hand, many AR SDKs, based on computer vision and computer graphics techniques, have been presented to facilitate the tracking, registration, and rendering processes [5]. Finally, virtual UIs are used in AR systems instead of WIMP (Windows, Icons, Menus, and a Pointing device) interfaces [1].

Nowadays, AR is used in many applications, such as navigation, industry, the military, medicine, and entertainment [1]. Furthermore, AR has a significant impact in education, where it is used to enhance students' laboratory skills [6].

One of the most important factors in building a successful virtual or augmented reality system is the way the subject (you) perceives the CG augmented object, in the case of augmented reality, or the virtual world, in the case of virtual reality.

Visual Perception

Visual perception is the ability to recognize the surrounding environment by processing the visible information. God gave humans two eyes that cooperate in sight and play an important role in depth perception. Depth perception is constructed by your brain using the two synchronized but different images that your eyes perceive. Why different? Because of the eye distance: your right and left eyes are not located in the same position on your face, unless you are one of the Arimaspi!

As mentioned above, the distance between the left and right eyes means that they receive two different images, which leads to something called parallax: the apparent difference in an object's position when it is viewed from two different viewpoints.

Monocular vs. stereo cues

To distinguish between monocular and stereo cues, we can simply say that:

1- Monocular cues: a single eye

2- Stereo cues: both eyes

Surely, stereo cues, using both eyes, give you more depth information; but you can still estimate some depth information from a single photo (a monocular cue, even when using both eyes), although you may also be deceived! For instance, focus on the figure extracted from [7]. Trust me: the two yellow lines have the same length, and exactly the same is true for the two line segments in the right figure.

Stereo depth cues

The following figure illustrates how your two eyes cooperate to give you depth perception. Thus, your eyes play the key role in stereo depth cues.

Figure by Rainer Zenz: T is the theoretical horopter and E refers to the empirical horopter [2].

Stereoscopic technology

Stereoscopic technology aims to give you the illusion of depth by mimicking the real world. In other words, we want to make our eyes perceive two different pieces of CG (or even captured) footage, just as our eyes do when perceiving real objects in our lives.

Transformations

If you have taken a computer graphics course, one thing you typically learned is how transformations are performed in the virtual world. However, some modifications must be made when you are dealing with stereoscopy.

For each eye, the full transformation chain becomes:

T = Tvp × Tcan × Tleft/right × Teye × Trb

where Tvp is the viewport transformation matrix, Tcan is the canonical view transform, Tleft and Tright are the shift transformation matrices (left and right eyes), Teye is the viewing transformation matrix, and Trb is the rigid body transformation matrix [2].

Here, t refers to the distance between the left and right eyes (the interaxial distance, also known as the interocular distance), which varies among people (average = 0.064 meters [7]).

Thereby, the additional shift transformations are applied to mimic the real perception of our eyes; the brain then estimates the depth of the received visual information. So stereoscopic technology is mainly based on imitating the simple fact that our two eyes receive two different images. After that, the brain completes the task automatically.
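As a minimal sketch of these shift transformations (assuming NumPy, homogeneous column vectors, and an x axis pointing to the viewer's right; the matrix names simply mirror the notation above), the per-eye matrices are just translations by ±t/2:

```python
import numpy as np

def translation_x(dx):
    """4x4 homogeneous translation along the x axis."""
    T = np.eye(4)
    T[0, 3] = dx
    return T

t = 0.064  # average interaxial (interocular) distance in meters [7]

# Per-eye shift matrices. The signs assume that shifting the left eye
# to the left is modeled as shifting the world to the right; flip them
# if your engine uses the opposite convention.
T_left = translation_x(t / 2)
T_right = translation_x(-t / 2)

# With the other matrices already built (identity placeholders here),
# the full per-eye chain follows the order given above:
T_vp = T_can = T_eye = T_rb = np.eye(4)
T_total_left = T_vp @ T_can @ T_left @ T_eye @ T_rb
T_total_right = T_vp @ T_can @ T_right @ T_eye @ T_rb
```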

Parallax

Again, parallax is the apparent difference in an object's position when it is viewed from two different viewpoints; the eye distance provides exactly those two viewpoints, and that yields our depth perception. There are three types of parallax, based on where the two eyes' lines of sight intersect relative to the screen: positive parallax (the intersection, and thus the object, lies behind the screen), zero parallax (on the screen), and negative parallax (in front of the screen).
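To make the three cases concrete, here is a small illustrative helper (my own derivation for parallel cameras converged at the screen distance; the function name and units are assumptions, not from the article):

```python
def screen_parallax(interaxial, conv_dist, object_dist):
    """On-screen parallax of a point at object_dist for two parallel
    cameras separated by interaxial and converged at conv_dist
    (all in the same length units). Positive parallax: the object
    appears behind the screen; zero: on the screen; negative: in front."""
    return interaxial * (1 - conv_dist / object_dist)

print(screen_parallax(0.064, 10.0, 20.0))  # > 0: behind the screen
print(screen_parallax(0.064, 10.0, 10.0))  # = 0: on the screen
print(screen_parallax(0.064, 10.0, 5.0))   # < 0: in front of the screen
```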

Toe-in (converging camera targets)

If you manipulate the cameras' frustums by rotating the two cameras inward so that their view directions converge on the target, you get something called toe-in. This is an incorrect way to prepare the two virtual (or real) cameras for a stereoscopic scene, because the inward rotation introduces vertical parallax and keystone distortion.

Toe-in

Off-axis projection, also called the parallel camera setup, is the correct way to create a convergence area between the two cameras.

Off-axis

In this setup, asymmetric camera frustums are used to make both frustums converge at the convergence distance.
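As a rough sketch of how such asymmetric frustum bounds can be computed (assuming an OpenGL-style (left, right, bottom, top, near, far) convention; the function and parameter names are illustrative, not from the article):

```python
import math

def off_axis_frustum(eye, interaxial, conv_dist, near, far, fov_h, aspect):
    """Asymmetric frustum bounds (left, right, bottom, top, near, far)
    for one eye: eye = -1 for the left camera, +1 for the right.
    fov_h is the horizontal field of view in radians; aspect = width/height."""
    half_w = near * math.tan(fov_h / 2)  # half-width of the near window
    half_h = half_w / aspect

    # Shift the near-plane window opposite to the camera offset so both
    # frustums converge exactly at conv_dist; the shift scales by near/conv_dist.
    shift = -eye * (interaxial / 2) * (near / conv_dist)

    return (-half_w + shift, half_w + shift, -half_h, half_h, near, far)

# Example: left-eye frustum for a 60-degree camera converged at 10 units.
print(off_axis_frustum(-1, 0.32, 10.0, 0.1, 100.0, math.radians(60), 16 / 9))
```

Note that each camera is also translated by ±interaxial/2 along the x axis; only the projection window shifts while the view directions stay parallel, which is exactly what distinguishes off-axis from toe-in.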

Each pair of images (a left-eye and a right-eye image) contains overlapping parts (almost the whole image, with a little bit of shifting), and together they create the 3D image or video. Tip: the compositing process is done in several ways, depending on the type of the screen and the 3D glasses.

Some hardware devices require stereo-pair (side by side) images instead of overlapped images.
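For instance (a trivial sketch, assuming the two rendered views are equal-sized NumPy RGB arrays):

```python
import numpy as np

# Placeholder per-eye views; in practice these are the rendered images.
left_rgb = np.zeros((480, 640, 3), dtype=np.uint8)
right_rgb = np.zeros((480, 640, 3), dtype=np.uint8)

# A side-by-side stereo pair is just a horizontal concatenation.
side_by_side = np.hstack((left_rgb, right_rgb))  # shape: (480, 1280, 3)
```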

Comfortable stereoscopy

There are some important points that should be taken into account to create a comfortable stereoscopic scene. Screen size is an important factor, specifically the horizontal image size. The second factor is the viewer's distance from the screen; in the case of wide areas (theatres or cinemas), it is recommended to take the middle location as a reference for calculating the viewer distance. Finally, the last factor is the interocular distance (eye distance), which varies among people, so we use the average value of 0.064 meters.

The idea is to maintain some mathematical ratios between the real world and the virtual world in order to mimic the real world. How do you set up the virtual cameras to get comfortable stereoscopy? To answer this question, the following ratios must be maintained:

horizontal image size / eye distance = CG convergence-plane horizontal size / CG interaxial distance

viewer distance / eye distance = CG convergence-plane distance / CG interaxial distance

Solving the second ratio for the CG interaxial distance gives:

CG interaxial distance = (CG convergence-plane distance × real eye distance) / viewer distance

To adjust the plane size, you should use the appropriate field of view angle (FOV) given by:

FOV = 2 atan(conv. plane horizontal size / (2 × conv. plane distance))
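Putting these formulas together as a small sketch (plain Python; the function and parameter names are mine, not from the article; distances on each side in consistent units):

```python
import math

def cg_stereo_setup(screen_width, viewer_dist, eye_dist=0.064,
                    conv_plane_dist=10.0):
    """Derive the CG interaxial, convergence-plane width, and horizontal
    FOV (degrees) that preserve the real-world viewing ratios."""
    # viewer distance / eye distance = conv. plane distance / CG interaxial
    interaxial = conv_plane_dist * eye_dist / viewer_dist

    # horizontal image size / eye distance = conv. plane width / CG interaxial
    conv_plane_width = screen_width * interaxial / eye_dist

    # FOV = 2 atan(conv. plane width / (2 x conv. plane distance))
    fov = 2 * math.atan(conv_plane_width / (2 * conv_plane_dist))
    return interaxial, conv_plane_width, math.degrees(fov)

# Example: a 1 m wide screen viewed from 2 m, convergence plane 10 units deep.
print(cg_stereo_setup(screen_width=1.0, viewer_dist=2.0))
# -> (0.32, 5.0, ~28.1 degrees)
```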

Comfort zone

Any object outside the comfort zone is rendered with an undesirable look; but what do we mean by the comfort zone? The front comfort zone covers half of the convergence-plane distance in front of the convergence plane, and the back comfort zone covers an equal distance in the opposite direction, behind the convergence plane.

To enlarge the comfort zone, you can manipulate the interaxial distance. But remember: reducing the interaxial distance also reduces the stereo effect.
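In code, the half-distance rule above amounts to the following (a tiny illustrative helper, not from the article):

```python
def comfort_zone(conv_plane_dist):
    """Depth range of the comfort zone: from half the convergence-plane
    distance in front of the plane to the same distance behind it."""
    half = conv_plane_dist / 2
    return conv_plane_dist - half, conv_plane_dist + half

print(comfort_zone(10.0))  # -> (5.0, 15.0): keep objects within this range
```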

Stereoscopy hardware equipment

To see the power of stereoscopic imaging, you need specific hardware equipment that separates the two received images, one for the left eye and one for the right eye. Remember: your brain must receive a different image from each eye to give you the perception of 3D depth. The hardware side consists of two main items: the screen or projector, and the glasses. There are many types of glasses for this purpose, depending on the type of the emitter (i.e., screen or projector).

Anaglyph 3D glasses

The idea is that the glasses separate the received 3D image, which combines two color-filtered images (e.g., a red and a cyan image). The red image is used for the left eye and the cyan image for the right eye. This type is considered a cheap one, but it gives low 3D quality.

Anaglyph 3D glasses
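As a minimal sketch of the simplest (pure color) anaglyph composition, assuming NumPy HxWx3 RGB arrays; real encoders mix the channels more carefully to reduce ghosting, so treat this only as an illustration:

```python
import numpy as np

def make_anaglyph(left_rgb, right_rgb):
    """Compose a simple red/cyan anaglyph: the red channel comes from
    the left-eye image, the green and blue (cyan) channels from the
    right-eye image. Both inputs are HxWx3 arrays of the same shape."""
    anaglyph = right_rgb.copy()
    anaglyph[..., 0] = left_rgb[..., 0]
    return anaglyph

# Toy usage with placeholder views.
left = np.full((480, 640, 3), 200, dtype=np.uint8)
right = np.full((480, 640, 3), 50, dtype=np.uint8)
print(make_anaglyph(left, right)[0, 0])  # -> [200  50  50]
```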

Active 3D glasses

The glasses are synchronized with a system (e.g., a 3D projector) that emits high-frame-rate video (double the regular frame rate): for instance, the first frame is for the left eye, the second for the right eye, and so on. The glasses first block the right eye (odd frames), then block the left eye (even frames). Although this type is expensive and gives good 3D quality, it can cause flickering.

Active 3D glasses
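The alternation can be sketched as a toy frame demultiplexer (plain Python, with frames as any indexable sequence; note that the "first, third, ..." frames have even indices in 0-indexed code):

```python
def split_frame_sequential(frames):
    """Split a frame-sequential (page-flipped) stream into per-eye
    streams: even-indexed frames for the left eye, odd-indexed for
    the right, matching the alternation described above."""
    return frames[0::2], frames[1::2]

left_frames, right_frames = split_frame_sequential(list(range(8)))
print(left_frames, right_frames)  # -> [0, 2, 4, 6] [1, 3, 5, 7]
```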

Polarized 3D glasses

This is the most common type used in cinemas. It passes the intended frame to the corresponding eye based on the light's polarization (direction). Thus, a specific 3D projector or display is required to emit the polarized light. Polarized 3D glasses are cheap and give a good 3D effect, but they require a specific projector.

Polarized 3D glasses

References

[1] DWF Van Krevelen and R Poelman. A survey of augmented reality technologies, applications and limitations. International Journal of Virtual Reality, 9(2):1, 2010.

[2] Oliver Bimber and Ramesh Raskar. Spatial augmented reality: merging real and virtual worlds. CRC Press, 2005.

[3] Max Rheiner. Birdly an attempt to fly. In ACM SIGGRAPH 2014 Emerging Technologies, page 3. ACM, 2014.

[4] Horst Eidenberger and Annette Mossel. Indoor skydiving in immersive virtual reality with embedded storytelling. In Proceedings of the 21st ACM Symposium on Virtual Reality Software and Technology, pages 9–12. ACM, 2015.

[5] Clemens Arth, Raphael Grasset, Lukas Gruber, Tobias Langlotz, Alessandro Mulloni, and Daniel Wagner. The history of mobile augmented reality. Technical Report, Institute for Computer Graphics and Vision, 2015.

[6] Murat Akçayır, Gökçe Akçayır, Hüseyin Miraç Pektaş, and Mehmet Akif Ocak. Augmented reality in science laboratories: The effects of augmented reality on university students' laboratory skills and attitudes toward science laboratories. Computers in Human Behavior, 57:334–342, 2016.

[7] Steven M. LaValle. Virtual Reality. University of Illinois, Champaign, IL, 2016.
