Rendering Objects in Augmented Reality with Metal in iOS

Muhammad Abed · Tech at Trax · May 3, 2020

NOTE: before you dive into this, you need to be familiar with: Swift, iOS development, ARKit, basic linear algebra, and basic 2D Metal rendering.

If you are, then go grab yourself a cup of coffee and come back, I’ll wait for you!

My story begins with the Trax Retail app. In short, we use our crowdsourcing app to gather data about retail stores by taking pictures of the store shelves. From these pictures we extract (using complicated machine learning algorithms) valuable data that retailers can use to know and monitor what’s happening in their stores.

We had one major problem: users were uploading pictures taken at bad angles. To improve the user experience, we decided to integrate AR into our app, and we saw that users who used AR took better-quality pictures!

The Trax App

You open the AR camera and let it detect the shelf.

After the AR session detects the shelf, the user can start taking pictures:

With each picture the user takes, we draw a rectangle (a plane) on the shelf to mark the area the user has covered:

In our app we use SceneKit to render these turquoise walls, but because I love challenges, I found myself looking into how to render objects in AR using Apple’s Metal. I faced a lot of challenges doing it, due to the lack of proper and detailed documentation from Apple (as always) and the surprisingly sparse and outdated material on the internet, yet I loved every single moment of it.

Joining Trax and the joy of working on this part of our app made me very passionate about the subject.

So if you are a challenge lover, and you like to go deep and understand how these libraries work, then you’ve come to the right place!

Math Background

Handedness

Handedness refers to the orientation of the z-axis in a given 3D space. If the z-axis conforms to the so-called right-hand rule, the space is said to be right-handed; otherwise, it is left-handed.

We can use either convention, as long as we understand this and know how to move between one and the other, as in the sketch below.

Linear Transformations

A geometric transformation is a function that maps one point to another point.

A linear transformation (also called a linear mapping or linear function) is “a mapping V → W between two modules (for example, two vector spaces) that preserves the operations of addition and scalar multiplication” (Wikipedia):

f(u + v) = f(u) + f(v) // addition

f(c·u) = c·f(u), where c is a scalar // scalar multiplication

Each linear transformation can be written as a matrix.

The most common transformations in computer graphics are Scale, Rotation and Translation.

Scale and Rotation are linear transformations (we won’t be covering the proofs here, or this would turn into a linear algebra course 😅). After playing with Translation a little, you’ll find out that it’s not a linear transformation, so we need a trick to make it work.

Scale

It can be represented by a 3×3 matrix with the scale factors on the diagonal:

Rotation

If u is the rotation axis (a unit direction vector) and θ is the angle (in radians), then the rotation matrix should look like this…

We won’t be covering the derivation of this matrix here, but you are free to check out this Wikipedia post, which explains it very well.

Translation

In order to translate the point p by the translation vector t, we need this simple equation: p′ = p + t.

We can’t construct a 3×3 matrix that, when multiplied with the point, gives this result.

To do this, we’ll use a handy and widely known trick (homogeneous coordinates): we’ll add a new dimension w and make it equal to 1, giving us a 4×4 matrix.

So when you multiply it with a point (with the added w dimension), you get a 4D point, and by dropping the fourth dimension we get our desired translated point.

The above linear transformation (in 4D) is called a shear transformation.

Projection

One of our goals in 3D rendering is to “squash” 3D objects into 2D, and create the perspective illusion. We achieve this by creating a perspective projection matrix.

To understand how a perspective projection matrix works, you need this:

This is called the view frustum. By choosing an aspect ratio (ratio between width and height of viewport) and a field of view, we implicitly determine the clipping planes that make up the sides of the view frustum.

The pyramidal shape of the frustum is a natural outcome of our demand for perspective. The perspective projection scales points relative to their distance from the virtual camera, which in turn causes the sloped sides of the view frustum to become parallel as they undergo this transformation. Points far away are squeezed closer to the axis of view, which causes the phenomenon of foreshortening, an important depth cue.

Clip Space

The device has its own coordinate system, and it expects the points we give it to render to respect that coordinate system: x ∈ [-1, 1], y ∈ [-1, 1], z ∈ [0, 1]. This is called the clip space of the system. It’s where the system determines which triangles should be shown, which are clipped, and which are hidden.

Now that we know what the clip space is, we can construct the projection matrix that maps points from the view frustum into this clip space:

All the symbols refer to the view frustum illustration.

The construction of this matrix is beyond the scope of this post, but you’re free to check this and this Wikipedia post about it.

Code

To put this into practice, we’ll make a little app that puts a 3D cube at the location you tap, every time you tap on the screen.

We won’t cover the Metal setup and the Assets setup here, since I am assuming you know all these things.

We’ll start with the Vertex Data functions (the functions where we set up the necessary data for drawing) before we pass the data to the shaders.

The first interesting function is the one where we set the necessary matrices; this is how we define our frustum:

You can see that at the start of the function, we set the view matrix and the projection matrix, using the viewport size and the near and far properties we talked about.

The other lines are for setting up the lighting, in order to give our shapes a 3D look.

We won’t be covering this here, but you’re welcome to read about it.

Next, we’ll update the anchors:

We’re going through the anchors here, and from each anchor we create a model matrix: the matrix that places the object we want to draw at the spot where you tapped the screen.

Those are the two main functions we use to set the data for drawing (the Vertex Data stage).

It wasn’t really as tough and scary as you thought it would be, right!? Now that you understand the whole purpose of this “complicated” mathematics, you can connect the dots and figure out why we need all these matrices.

Next we’re moving on to the shader functions; this is where we apply all the mathematics we’ve learned.

The main vertex function:

Here we simply multiply each vertex position by the model, view, and projection matrices (which carry the Translation, Rotation, and Scale transformations). We also set the Eye Position (the apex of the frustum).

Don’t worry about the lighting properties; they’re just there to make things look better, and we won’t be covering lighting here.

The surprise is that you can find the whole source code if you open a new project in Xcode and select the Augmented Reality App template, then choose Metal for Content Technology on the next screen.

But now, instead of starting from zero, you can focus on the important parts (yes, the parts we covered here) and leave the “complicated” technical setup for later.

From this point you can start playing with AR using Metal: try creating a more interactive app, handling taps and swipes on the cubes you create, and rotating or scaling them.

In short… HAVE FUN!
