In this first post, I want to talk about how one might go about augmenting an existing dataset for the purpose of evaluating motion tracking. We’ll go into detail on how to pick parameters in order to control the generated instances.
One of Uru’s pipelines involves augmenting regular videos with graphics, in a seamless manner. To do so, we have to understand the geometry of the scene, pick a host surface, and track that surface throughout the sequence of frames (and deal with occlusion in the process, too).
One of the critical steps of Uru’s Immersions technology is the computation of correspondences between adjacent frames. The set of points in correspondence between frame N and frame N+1 allows us to calculate the homography that describes the transformation of our graphics insertion.
The problem can be thought of as follows: imagine yourself taking a picture (call it image A) of the crosswalk above. Move over in any direction, then snap another picture (image B). If we were to augment image A with some graphics, how would the graphics be affected by the change in perspective in image B?
Motion tracking
To answer this, we first need to track the selected host surface. How does tracking work?
We first detect a sparse set of features in the image. These are locations in the image that are interesting, in the sense that they can serve as anchors: corners of objects, textured surfaces…
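As a minimal sketch of this step — assuming OpenCV’s Shi-Tomasi corner detector, which is one of several possible choices (the file name is a placeholder):

```python
import cv2

# Load a frame and convert to grayscale: corner detection operates on intensity.
frame = cv2.imread("frame_000.png")  # placeholder path
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi ("good features to track") returns up to maxCorners strong,
# well-localized corner locations we can use as tracking anchors.
corners = cv2.goodFeaturesToTrack(gray, maxCorners=500, qualityLevel=0.01, minDistance=7)
# corners has shape (N, 1, 2): N detected (x, y) pixel locations
```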
Suppose we’ve found a strong feature in frame N — say the corner of a window. We now need to find that same window corner in frame N+1.
Because the camera is moving with respect to the scene, that window corner has different image coordinates than it did in frame N. One of the most famous tracking algorithms is Lucas-Kanade, which produces a sparse vector field: each feature in frame N gets a motion vector pointing to its new location in frame N+1 (a sketch follows the list below). Of course, the real world presents many challenges:
- Occlusion: the window corner in our example might be covered in frame N+1 by a foreground object
- Repeating patterns: if the window is surrounded by identical windows, there is a chance that the computed motion vector might point at a similar, but different window.
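As a sketch of the tracking step itself — assuming OpenCV’s pyramidal Lucas-Kanade implementation, with placeholder file names and illustrative parameters:

```python
import cv2

prev_gray = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame_001.png"), cv2.COLOR_BGR2GRAY)

# Features detected in frame N, as in the previous sketch.
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade: for each feature in frame N, estimate its location in frame N+1.
next_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, next_gray, prev_pts, None, winSize=(21, 21), maxLevel=3
)

# Keep only features that were successfully tracked (status == 1);
# next_pts - prev_pts is the per-feature motion vector.
good_prev = prev_pts[status.ravel() == 1]
good_next = next_pts[status.ravel() == 1]
motion_vectors = good_next - good_prev
```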
Different algorithms can be used, usually presenting a trade-off between accuracy and speed. This is where having a dataset is important: we want to know how changing one parameter will affect tracking for a bunch of videos. And, if we have a large enough dataset, perhaps we’ll be able to derive some useful statistics, for instance on when to terminate processing or whether drift has become significant.
Augmenting our dataset of host surfaces
For context: given two views of a physical object, our objective is to find how a set of points in the first view maps to the second view.
As a reminder, a homography is a 3x3 matrix that can describe two situations:
- The transformation of planar points between 2 views under arbitrary camera motion
- The transformation of any point if the camera motion is a pure rotation
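Concretely, a homography acts on points in homogeneous coordinates: a point $(x, y)$ in the first view is mapped to $(x', y')$ in the second view via

$$
\begin{bmatrix} u \\ v \\ w \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix},
\qquad (x', y') = \left(\frac{u}{w}, \frac{v}{w}\right).
$$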
Given 4 points in correspondence between 2 images, we can calculate the 9 elements of the homography matrix (which are only defined up to scale, so the 8 constraints from 4 correspondences are enough).
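As a sketch of this computation with OpenCV (the coordinates below are made up purely for illustration):

```python
import cv2
import numpy as np

# Four points in image A and their corresponding locations in image B.
pts_a = np.float32([[10, 10], [200, 15], [210, 180], [5, 190]])
pts_b = np.float32([[12, 20], [205, 22], [215, 195], [8, 200]])

# Exactly four correspondences pin down the homography (8 degrees of freedom).
H = cv2.getPerspectiveTransform(pts_a, pts_b)

# With more (noisy) correspondences, a robust estimator is typically used instead:
# H, mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
```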
To augment our dataset, we’ll make the assumption of a small displacement which can be approximated by a pure rotation. We’ll generate random homographies and warp each frame of our sequences independently with those matrices.
Let’s pick an image:
We’ll warp this image with a randomly generated homography.
First, we’ll decompose a homography as the product of Euclidean, affine, and projective transformation matrices. We’ll see how each can be controlled to apply a small perturbation to the transformation of our graphics.
Euclidean transformation
H_e describes a Euclidean transformation: rotation and translation. Angles and straight lines are preserved under Euclidean transformations. θ controls the angle of the rotation; t_x and t_y control the amount of translation.
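For reference, a Euclidean transformation in homogeneous coordinates has the standard form

$$
H_e = \begin{bmatrix}
\cos\theta & -\sin\theta & t_x \\
\sin\theta & \cos\theta & t_y \\
0 & 0 & 1
\end{bmatrix}.
$$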
Affine transformation
Under an affine transformation, parallel lines are preserved, but angles are distorted. You can shear vertically and/or horizontally by adjusting s_x and s_y.
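One possible parameterization of this shear-only affine factor (the post does not show the exact matrix, so take this as an assumption) is

$$
H_a = \begin{bmatrix}
1 & s_x & 0 \\
s_y & 1 & 0 \\
0 & 0 & 1
\end{bmatrix},
$$

where $s_x$ shears horizontally and $s_y$ shears vertically.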
Projective transformation
Projective transformations map lines to lines, but parallelism is not preserved.
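The projective factor concentrates this perspective distortion in the bottom row of the matrix; a common parameterization (the parameter names $v_1$ and $v_2$ are illustrative, not from the original post) is

$$
H_p = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
v_1 & v_2 & 1
\end{bmatrix}.
$$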
Multiplying the three transformations together gives us the final homography:
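$$
H = H_e \, H_a \, H_p
$$

Putting the pieces together, here is a minimal sketch of sampling a small random homography and warping an image with it; the parameter ranges (and the composition order, which matters since matrix multiplication does not commute) are illustrative choices, not necessarily the ones used in Uru’s pipeline:

```python
import cv2
import numpy as np

def random_homography(max_angle=0.02, max_trans=5.0, max_shear=0.01, max_proj=1e-4):
    """Compose a small random homography H = H_e @ H_a @ H_p."""
    theta = np.random.uniform(-max_angle, max_angle)            # rotation (radians)
    tx, ty = np.random.uniform(-max_trans, max_trans, size=2)   # translation (pixels)
    sx, sy = np.random.uniform(-max_shear, max_shear, size=2)   # shear
    v1, v2 = np.random.uniform(-max_proj, max_proj, size=2)     # projective distortion

    H_e = np.array([[np.cos(theta), -np.sin(theta), tx],
                    [np.sin(theta),  np.cos(theta), ty],
                    [0.0, 0.0, 1.0]])
    H_a = np.array([[1.0, sx, 0.0],
                    [sy, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
    H_p = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [v1, v2, 1.0]])
    return H_e @ H_a @ H_p

image = cv2.imread("crosswalk.jpg")  # placeholder path
h, w = image.shape[:2]
warped = cv2.warpPerspective(image, random_homography(), (w, h))
```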
Drift
Using the above, we can also generate completely synthetic videos from a single image, with warps corresponding to physically plausible rotations, translations, scalings, shears, and projective distortions.
We can store the homographies used to generate each frame in order to check whether the graphics augmentation is well anchored to its host surface. Tracking is an iterative process, so it is subject to drift as time passes. We can compare the computed homography with the ground-truth homography (randomly generated and stored) using statistics such as the reprojection error.
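As a sketch of what such a check could look like — the function below measures, for a set of points on the host surface, the average distance between where the ground-truth homography puts them and where the tracked homography puts them (names and the choice of points are illustrative):

```python
import cv2
import numpy as np

def reprojection_error(H_true, H_est, points):
    """Mean pixel distance between points mapped by the ground-truth
    homography and the same points mapped by the estimated one."""
    pts = points.reshape(-1, 1, 2).astype(np.float64)
    proj_true = cv2.perspectiveTransform(pts, H_true)
    proj_est = cv2.perspectiveTransform(pts, H_est)
    return float(np.mean(np.linalg.norm(proj_true - proj_est, axis=2)))

# For example, evaluated on the four corners of the inserted graphic:
# corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
# error_px = reprojection_error(H_ground_truth, H_tracked, corners)
```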