Ray Tracing from Scratch — Advanced 3D Image Data Augmentation in Python
Ray tracing is a 3D rendering technique commonly known from computer games. However, even a simple 3D rendering algorithm can also help improve the generalization power of Deep Learning models in Computer Vision; see, for example, DeTone et al. (2016): Deep Image Homography Estimation.
This article explains the basic idea and the mathematical background of a simple ray tracing algorithm. To illustrate the idea, sample code written in Python and primarily based on NumPy is made available on GitHub. Rendering surfaces is simplified using OpenCV’s perspective transformation and warping features.
The randomly generated images below show what is actually possible in Python without using any heavyweight dependencies like OpenGL. The images were generated by the ray tracing algorithm discussed in this article. However, please note that shading and motion effects are part of a more advanced image data augmentation engine which is not made available on GitHub.
The following picture illustrates the basic ray tracing technique in a simple way. Some knowledge of geometry is helpful for following the derivation of the equations required for ray tracing.
As shown above, the screen is represented as a plane. The vertex V is any vertex of an object in three-dimensional space. V is separated from the viewpoint Z by the screen plane. The distance between the viewpoint and the screen plane is called the focal length. It directly depends on the field of view FOV. In general, the viewpoint approaches the screen plane when the angle of view α is increased, and vice versa.
To display the object vertex V on a two dimensional screen, we have to project V on the screen plane. However, initially we don’t know the coordinates of V on the screen plane. Therefore, we create an imaginary ray line from the viewpoint Z to each vertex V in the space and determine the intersection point S on the screen plane.
First, we have to determine the focal length. We can derive the focal length in pixels from the diagonal angle of view α and the screen width w and screen height h in pixels.
In the formula above, the variable c is the length of the screen diagonal in pixels. The screen diagonal and the viewpoint Z form an isosceles triangle. Please note that the focal length in pixels corresponds to the height of this isosceles triangle.
In the first step we have to determine the length of the side a. Please remember that the two legs of an isosceles triangle always have the same length. Once we know a and c, we can calculate the focal length in pixels.
However, it is often easier to specify the focal length in physical units. For example, it is quite common for smartphone cameras to have a focal length of 26 mm. Therefore, we calculate the pixels per physical unit constant ppu to convert pixel values to physical units and vice versa.
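The two steps above can be sketched in a few lines of NumPy. Note that the function name `focal_length_px` and the Full-HD screen size are illustrative choices of mine, not taken from the GitHub code:

```python
import numpy as np

def focal_length_px(alpha_deg, w, h):
    """Focal length in pixels, given the diagonal angle of view and screen size."""
    c = np.hypot(w, h)  # length of the screen diagonal in pixels
    # the focal length is the height of the isosceles triangle over the diagonal
    return (c / 2) / np.tan(np.deg2rad(alpha_deg) / 2)

# example: Full-HD screen with a 77-degree diagonal angle of view
w, h = 1920, 1080
f_px = focal_length_px(77, w, h)

# pixels per physical unit, assuming a physical focal length of 26 mm
f_mm = 26.0
ppu = f_px / f_mm  # pixels per millimetre
```

With `ppu` in hand, any length given in millimetres can be converted to pixels by a single multiplication, and back by division.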
Equation of the screen plane
Next, we have to derive the equation of the screen plane. In this article the center of the screen is the coordinate origin. To derive the equation of the plane, we have to know the coordinates of exactly three points A, B and C on the plane.
The first vector is called the support vector r. The support vector equals the vector between the coordinate origin and the point A (red).
To determine the orientation of the plane in the three dimensional space, we need two additional vectors. These vectors are called direction vectors and are illustrated in the picture below by the green arrows AB and AC.
This is all we need to write down the equation of the screen plane in the so-called parametric form. Each point on the plane is described by a combination of the parameters λ and μ.
However, most of the time it is easier to work with the coordinate form of a plane. The coordinate form of a plane can be derived by calculating the normal vector. The normal vector of a plane equals the cross product of the two direction vectors.
Writing down the coordinates of the normal vector as a linear equation gives us the coordinate form of a plane.
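As a quick sketch, the normal vector and the coordinate form can be computed with NumPy. The concrete points A, B and C below are my own example (the screen plane z = 0 with the screen centre as the origin, as in the article):

```python
import numpy as np

# three points on the screen plane (screen centre is the origin, plane z = 0)
A = np.array([0.0, 0.0, 0.0])  # support point
B = np.array([1.0, 0.0, 0.0])
C = np.array([0.0, 1.0, 0.0])

r = A                         # support vector
n = np.cross(B - A, C - A)    # normal vector = cross product of the direction vectors

# coordinate form of the plane: n . x = n . r
d = n.dot(r)
print(n, d)  # -> [0. 0. 1.] 0.0
```

For this choice of points the plane is simply z = 0, which is exactly the screen plane used throughout the article.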
Equation of the ray line
As stated above, we have to create an imaginary ray from the viewpoint Z to each vertex V in the space to determine the intersection point S on the screen plane. Therefore, we have to derive the equation of the ray line between the viewpoint Z and the vertex V. In general, a line is expressed by the support vector p and a single direction vector u.
The projection of vertex V on the screen plane E is the intersection point S of the ray line g and the screen plane E.
To determine the intersection point S, we have to replace x in E by g.
Next, we rearrange the equation to get λ.
Furthermore, we replace x in E by the support vector r to resolve γ.
Please note that we can rewrite the terms in the equation using the dot product.
Now that we know how to get the value for λ, it is easy to determine the coordinates of the intersection point S on the screen plane by simply replacing the value of λ in the ray line equation g.
The first two coordinates of the intersection point S correspond to the projected pixel position of vertex V on our screen. Since we do not need the value of the third dimension, we can simply drop it.
Implementation in Python
The code to illustrate the idea of this article is shown below and has some extra features. For example, you can apply some linear transformations to the vertex coordinates in the three dimensional space (e.g. rotation and translation) and draw a surface. It is also easy to specify the image size in physical units and scale it accordingly.
For illustration purposes, the grid of four vertices is shown in the picture below. We choose a camera with 26 mm focal length and 77 degrees for the diagonal angle of view. The coordinates of the vertices are specified in physical units (400 mm in front of the screen plane). Each vertex is rotated by -35 degrees in x-direction and 20 degrees in y-direction.
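The example can be reproduced roughly as follows. The grid coordinates, the helper names `rot_x`/`rot_y` and the exact transformation order are my own assumptions, not taken from the GitHub code:

```python
import numpy as np

def rot_x(deg):
    a = np.deg2rad(deg); c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(deg):
    a = np.deg2rad(deg); c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# grid of four vertices in mm, centred around the origin
grid = np.array([[-50, -50, 0], [50, -50, 0],
                 [-50,  50, 0], [50,  50, 0]], dtype=float)

# rotate by -35 degrees about x and 20 degrees about y,
# then move the grid 400 mm behind the screen plane
verts = grid @ rot_x(-35).T @ rot_y(20).T
verts[:, 2] += 400

# project: viewpoint at z = -26 (focal length in mm), screen plane z = 0
f = 26.0
Z = np.array([0.0, 0.0, -f])
u = verts - Z
lam = f / u[:, 2]                     # with n = (0, 0, 1) and r = 0
S = (Z + lam[:, None] * u)[:, :2]     # projected screen coordinates in mm
```

The resulting coordinates in `S` are still in millimetres; multiplying by the `ppu` constant from above converts them to pixel positions.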
Drawing a surface is simplified using OpenCV’s perspective transformation and warping features. We simply have to change the variable surface to True in the code above.
surface = True
The code is optimized for speed. I was able to render about 170 frames per second with an Intel Core i5 CPU (3.20 GHz).
Thank you for reading this article. If you liked it, I would appreciate it if you shared it :)
DeTone, Daniel; Malisiewicz, Tomasz; Rabinovich, Andrew (2016): Deep image homography estimation. arXiv preprint arXiv:1606.03798.