Motion Modeling in Videos: Part 1 — Optical Flow

Logos
14 min read · Apr 5, 2024


Image with flow vectors (left); Optical flow (middle); Motion segmentation (right)

Motion modeling seems to be an overlooked problem in computer vision, especially given all the progress with NeRF and implicit representation.

If we can combine motion modeling from 3D geometry with the current state-of-the-art NeRF, we may solve 3D reconstruction in dynamic environments for self-driving, AR/VR, etc.

Can we actually incorporate moving objects inside the 3D reconstruction framework and make it possible to directly reconstruct both static and dynamic environments within the same framework? Let’s find out!

Note: this post is part of a series:

  1. Motion Modeling in Videos: Part 1 — Optical Flow
  2. Motion Modeling in Videos: Part 2 — Optical Expansion (link)
  3. Motion Modeling in Videos: Part 3 — Scene Flow (link)

Background subtraction for motion segmentation

The simplest way to model motion is through background subtraction. We can compute the difference between the current and the next image (to reduce noise, apply Gaussian smoothing to the images before subtraction). If the difference is greater than a certain threshold, the pixel corresponds to a moving object; otherwise, it belongs to the static background.
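A minimal sketch of this idea (assuming two consecutive grayscale frames loaded as NumPy arrays; the threshold value here is hypothetical and depends on the noise level):

import cv2


def background_subtraction_mask(frame1, frame2, threshold=25):
    """Naive motion mask from two consecutive grayscale frames."""
    # Smooth both frames to suppress sensor noise before subtraction
    blur1 = cv2.GaussianBlur(frame1, (5, 5), 0)
    blur2 = cv2.GaussianBlur(frame2, (5, 5), 0)
    # Per-pixel absolute difference between the frames
    diff = cv2.absdiff(blur1, blur2)
    # Pixels with a large difference are treated as moving
    return diff > threshold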

We consider two scenarios:

  1. With a static camera
  2. With a moving camera

1. Static camera scenario: green — moving pixels

In the first scenario, the results seem reasonable since we can achieve motion segmentation on the edges of the objects. However, we can’t fully segment them because homogeneous areas without textures will yield zero difference. Additionally, you may observe problems with shadows cast by cars, as well as reflections and illumination changes, which can lead to incorrect results.

2. Dynamic camera scenario: green — moving pixels

In the second scenario, the approach fails. Things become more complex in the case of general motion. We need a better way to model motion. This leads us to the concept of optical flow.

What is optical flow?

The optical flow represents the displacement of pixels (delta_x, delta_y) between the current and next images.

Optical flow example: forward flow (on the right) from image 1 to image 2

We can establish a direct mapping between the pixels of image 1 and image 2 by employing the optical flow (delta_x, delta_y):

image2(x1 + delta_x, y1 + delta_y) = image1(x1, y1)

Optical flow can be visualized with arrows or colors corresponding to directions on a color wheel (see the image below).

Optical flow visualized with colors corresponding to different directions (image courtesy of Prof. T. Brox)

Optical flow can represent various types of motions, and sometimes, by observing it, we can understand how the camera moves (see image below).

Different types of motion represented as optical flow: rotation around y-axis (top); rotation over z-axis (middle); forward translation (bottom) [1]

In the case of pure translation, there is a point from which all flow vectors diverge (or toward which they converge). This point is called the focus of expansion (FoE) or focus of contraction, depending on the motion. At this point, the optical flow equals zero.

The FoE example for different motions: a — forward motion; b — backward motion

The FoE is the projection of the next frame's camera center onto the previous frame's image plane. It is also called the epipole. In the case of pure translation, its image coordinates are easy to find from the translation vector T:
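For a calibrated camera (with focal lengths fx, fy and principal point cx, cy, introduced later in this post), this is simply the projection of the translation direction, assuming Tz is non-zero:

foe_x = fx * Tx / Tz + cx
foe_y = fy * Ty / Tz + cy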

The FoE shows the direction of the motion of the camera.

If you want to know more about camera geometry, check the tutorial video here.

When dealing with general motion, accurately interpreting the optical flow arrows to assess the motion solely through observation can be challenging.

The example of the general motion optical flow [1]

The complexity of the flow vectors arises from the combined effects of translation and rotation, involving both static and dynamic objects.

We can decompose optical flow vectors into translational and rotational parts; the resulting vector is just the sum of these components [1]

The overall optical flow vector comprises the following components:

  1. For static objects: camera translation + camera rotation.
  2. For dynamic objects: camera translation + camera rotation + dynamic object translation + dynamic object rotation.

Optical flow modeling encompasses all potential motion patterns and can therefore serve as a viable basis for motion segmentation.

How to compute optical flow?

I will focus on state-of-the-art deep learning-based approaches, as they are both simple to understand and provide the best performance.

If you are more interested in the classical optical flow formulation, you may check the video tutorial here.

Optical flow can be formulated as a supervised learning problem. A convolutional neural network (CNN) takes two consecutive images as input and outputs the respective optical flow. The ground truth optical flow can be generated from simulators (e.g., Carla [2]). The L2 distance loss function is typically used to measure the distance between the predicted flow and the ground truth.
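A rough sketch of such a loss, using the per-pixel endpoint error commonly reported for flow (not the exact training recipe of any particular paper):

import numpy as np


def endpoint_error(pred_flow, gt_flow):
    """Mean L2 (endpoint) error between predicted and ground-truth flow.

    Both arrays have the shape [height, width, 2 (dx, dy)].
    """
    return np.linalg.norm(pred_flow - gt_flow, axis=-1).mean()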

The first supervised deep learning networks for optical flow were FlowNet [3] and FlowNet2 [4]. These works showed that it is possible to achieve state-of-the-art optical flow performance and generalize to real data while being trained only on simulated images.

PWCNet — efficient optical flow implementation

The first efficiency-optimized optical flow network was PWCNet [5]. It is 17 times smaller in size, 2 times faster, and easier to train than FlowNet2.

These improvements were achieved through the careful incorporation of best practices from classical approaches into the CNN architecture (see the image below).

The best practices from classical optical flow (on the left) transferred to the PWCNet architecture of the decoder (on the right)

In classical optical flow computation, the images are down-sampled to several different scales. The optical flow calculation begins at the coarsest scale. The resulting optical flow is then upscaled to the next level, and Image 2 is warped using this optical flow. Afterward, the flow is computed again.

The warping procedure enables the movement of pixels from Image 2 closer to Image 1, thereby reducing the size of the search window needed to find corresponding pixels. PWCNet applies the same workflow on features extracted from the encoder.
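A minimal sketch of the warping step with OpenCV (assuming forward flow from Image 1 to Image 2, so we sample Image 2 at the flow-displaced coordinates to bring it back toward Image 1):

import numpy as np
import cv2


def warp_image(image2, flow):
    """Warp image2 toward image1 using forward flow from image1 to image2.

    Args:
        image2: the second image [height, width, channels]
        flow: optical flow of the shape [height, width, 2 (dx, dy)]
    """
    height, width = flow.shape[:2]
    xx, yy = np.meshgrid(range(width), range(height), indexing='xy')
    # For every pixel of image1, look up where it landed in image2
    map_x = (xx + flow[..., 0]).astype(np.float32)
    map_y = (yy + flow[..., 1]).astype(np.float32)
    return cv2.remap(image2, map_x, map_y, cv2.INTER_LINEAR)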

RAFT — optical flow with global search

RAFT: Recurrent All-Pairs Field Transforms [6] is the first work to utilize global matching search for correspondences with neural networks.

Computing correlations between all pixels of image 1 and image 2 leads to a 4D correlation volume of shape HxWxHxW; down-sampled versions are obtained by applying pooling

Global matching entails searching over all pixels in Image 2 for every pixel in Image 1. This results in quadratic complexity costs relative to the height and width of the images.
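A sketch of how such an all-pairs correlation volume can be built from two feature maps (this illustrates the general idea rather than RAFT's exact implementation):

import numpy as np


def correlation_volume(feat1, feat2):
    """All-pairs correlation between two feature maps.

    Args:
        feat1, feat2: feature maps of the shape [height, width, channels]

    Returns:
        4D volume of shape [height, width, height, width], where entry
        (i, j, k, l) is the dot product of feat1[i, j] and feat2[k, l].
    """
    return np.einsum('ijc,klc->ijkl', feat1, feat2)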

The RAFT architecture: images are processed by a feature encoder, and a 4D correlation volume is constructed via dot products. A recurrent unit uses this correlation volume (and its down-sampled versions), together with context features of image 1, to iteratively compute the optical flow

RAFT utilizes a recurrent GRU unit to iteratively compute the optical flow (see the image above).

Although the approach achieves state-of-the-art results (as of 2020), it exhibits quadratically growing memory demands relative to resolution.

As a result, its usage is limited, even as an offline solution. Furthermore, PWCNet performs competitively with RAFT given careful tuning and training [7].

Unimatch: Unifying Flow, Stereo and Depth Estimation

Finally, the current state of the art (as of early 2024) is the Unimatch [8] neural network, which incorporates transformers and unifies tasks such as optical flow, stereo disparity, and multi-view depth estimation.

Unimatch network architecture: it takes two images as input; features are extracted with CNN encoders; then several transformer blocks with self-attention followed by cross-attention compute discriminative features for matching. After that, non-learnable matching is performed, followed by a self-attention refinement step.

This architecture boasts several advantages: it is simple, fast for inference, and integrates a shifted local window attention strategy from the Swin Transformer [9] for efficiency. It significantly outperforms RAFT and can be applied to both online and offline applications.

Self-supervised optical flow

For completeness, it’s worth mentioning that optical flow can be computed from raw videos without the need for any labels.

Given two images, optical flow can be computed and used to warp Image 2 to align with Image 1. The difference between the original and warped images is then measured using photometric loss. There is a comprehensive survey of these techniques [10], along with a state-of-the-art method as of 2021 [11].
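A minimal sketch of the photometric loss (assuming the warp_image helper sketched earlier; real methods add occlusion handling, census/SSIM terms, and smoothness regularization):

import numpy as np


def photometric_loss(image1, image2, flow):
    """L1 photometric difference between image1 and image2 warped by flow."""
    # warp_image is the helper sketched in the PWCNet section above
    warped2 = warp_image(image2, flow)
    return np.abs(image1.astype(np.float32) - warped2.astype(np.float32)).mean()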

Supervised optical flow trained with simulated data has shown excellent performance on real images, while self-supervised techniques can be valuable for fine-tuning models for further improvements.

Motion modeling with optical flow

Let’s consider the first simple use case where the camera is stationary and only surrounding objects are moving. In this scenario, the background optical flow should be zero, while optical flow for all moving objects should be non-zero. This straightforward observation can lead to reliable motion segmentation when high-quality optical flow is available.

In practice, it may be necessary to use a threshold value and estimate the uncertainty of the optical flow (see [12]) to filter out noisy flow vectors. Although there are works focused on single-object segmentation with a slightly moving camera, our primary interest lies in general motion segmentation.
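For the stationary-camera case, a sketch can be as simple as thresholding the flow magnitude (the threshold value is hypothetical and would depend on the flow noise):

import numpy as np


def motion_mask_static_camera(flow, threshold=1.0):
    """Motion mask for a stationary camera: large flow magnitude = moving.

    Args:
        flow: optical flow of the shape [height, width, 2 (dx, dy)]
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude > threshold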

Motion segmentation from optical flow with a stationary camera: all non-zero flow is marked as moving objects

For general motion segmentation, this simple approach would not suffice since optical flow exists both in static and dynamic objects. However, we can assume that static objects should exhibit consistent motion. If we could leverage this observation, we might be able to distinguish the static background from dynamic objects.

First, let's examine the flow vectors in the image. Ideally, the static environment should have the same flow vectors, while dynamic objects should have varying ones. However, this assumption does not hold true.

Upon observation, we may notice that flow vectors for the same static environment differ, both in magnitude and direction. Therefore, a simple clustering approach is not suitable.

The same static object can have different optical flow vectors depending on pixel location

Why does this happen?

The reason for this discrepancy lies in the fact that we are not directly observing the real 3D motion of the objects or the static environment, but rather their 2D perspective projection onto the image plane.

This projection effect leads to different flow vectors depending on the pixel positions to which they are projected. Consequently, we must incorporate 3D geometry to effectively distinguish between static and dynamic objects.

Epipolar constraints for motion modeling in optical flow

To utilize 3D geometry, it’s essential to have knowledge of camera calibration parameters. The intrinsic camera calibration parameters include:

  • Focal length for the width and height dimensions (fx, fy)
  • Principal point (the projection of the camera center onto the image plane) pixel coordinates (cx, cy). (For a perfect camera, cx = width/2.0, cy = height/2.0.)

These parameters allow us to construct the calibration matrix K.
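For reference, K has the standard form:

        | fx  0  cx |
    K = |  0  fy  cy |
        |  0   0   1 |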

This matrix can be obtained through a calibration process (if you are interested in how to do calibration, check this tutorial).

Given two cameras, O1 and O2, a static 3D point P corresponds to pixel positions p in camera O1 and p' in camera O2. Connecting the centers of cameras O1 and O2 with point P forms a plane (gray triangle). The intersection of this plane with each camera's image creates epipolar lines (blue lines). The projections of the camera centers onto each other's image planes, denoted e and e', are called epipoles or the focus of expansion (FoE). All epipolar lines intersect at these points. [13]

Given the camera center position O1 and a static 3D point P, the observed pixel position of this 3D point in camera O1 image corresponds to pixel p. When the camera is moved to position O2, the new corresponding pixel position of point P is p’ = p + optical_flow.

Ideally, if we know the camera motion rotation matrix R, translation vector t, and depth Z for each pixel, we can compute the static optical flow induced by the camera's 3D motion: back-project each pixel to 3D using its depth, transform the 3D point by (R, t), and re-project it into the second camera.
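A minimal sketch of this computation (assuming per-pixel depth in the first camera's frame, and that R and t map points from camera 1 to camera 2):

import numpy as np


def static_flow(depth, R, t, K):
    """Optical flow induced purely by camera motion for a static scene.

    Args:
        depth: per-pixel depth Z of the shape [height, width]
        R: 3x3 rotation from camera 1 to camera 2
        t: translation vector of the shape [3]
        K: 3x3 intrinsics matrix

    Returns:
        flow: static optical flow of the shape [height, width, 2]
    """
    height, width = depth.shape
    xx, yy = np.meshgrid(range(width), range(height), indexing='xy')
    ones = np.ones_like(xx)
    pix = np.stack((xx, yy, ones), axis=-1).reshape(-1, 3).T.astype(np.float32)

    # Back-project pixels to 3D points in the first camera frame: P = Z * K^-1 * p
    pts3d = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the second camera frame and re-project them
    pts3d_cam2 = R @ pts3d + t.reshape(3, 1)
    proj = K @ pts3d_cam2
    proj = proj[:2] / proj[2:3]
    return (proj - pix[:2]).T.reshape(height, width, 2)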

Given the static flow, we can subtract it from the full optical flow; theoretically, the residual corresponds to moving objects.

This approach requires depth Z estimation, which is not a trivial task. Can we do something simpler?

Fundamental matrix estimation and Sampson error

The answer is yes!

We can utilize the constraint that, for any static point, the two camera centers and the 3D point lie in a common plane (see the image above). Connecting the camera centers O1 and O2 with the 3D point P defines this plane (gray triangle).

The projection of this plane onto each image is a line. Leveraging this fact, we can compute the relative motion between the cameras.

This observation is known as the coplanarity constraint.

The relation between corresponding points p and p' is captured by a special 3x3 matrix F, called the fundamental matrix, through the epipolar constraint p'^T F p = 0.

We can estimate the fundamental matrix from the optical flow as follows (see this tutorial for more details):

import numpy as np
import cv2


def compute_fundamental_matrix(flow):
    """Compute fundamental matrix.

    Args:
        flow: optical flow of the shape [height, width, 2 (dx, dy)]

    Returns:
        F: 3x3 fundamental matrix
    """
    # Get pixel coordinates of image 1
    height, width, _ = flow.shape
    xx, yy = np.meshgrid(range(width), range(height), indexing='xy')
    pts1 = np.asarray(np.stack((xx, yy), axis=-1), dtype=np.float32)

    # Compute corresponding coordinates in image 2
    pts2 = pts1 + flow

    # Estimate the fundamental matrix with the 8-point algorithm and RANSAC
    pts1 = np.reshape(pts1, (-1, 2))
    pts2 = np.reshape(pts2, (-1, 2))
    inlier_tr = 3.0      # RANSAC inlier threshold in pixels
    ransac_prob = 0.999  # RANSAC confidence
    F, valid_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, inlier_tr, ransac_prob)
    return F

The fundamental matrix allows us to evaluate the coplanarity constraint for every optical flow point correspondence.

We assume that points which do not satisfy the coplanarity constraint (i.e., are not located in the epipolar plane) belong to moving objects.

The distance from the epipolar plane (in the image, from the epipolar line) to a given point correspondence can be formulated as the Sampson error [14]:

error = (p'^T F p)^2 / ((F p)_1^2 + (F p)_2^2 + (F^T p')_1^2 + (F^T p')_2^2)

A Python example of the error computation:

import numpy as np


def sampson_error(F, pts1, pts2):
    """Estimate the Sampson error from the coplanarity constraint.

    Args:
        F: fundamental matrix
        pts1: points in the first image of the shape [num_points, 2]
        pts2: corresponding points in the second image of the shape [num_points, 2]

    Returns:
        error: the coplanarity constraint error for each correspondence.
        Note: for a coplanar correspondence it should be zero (or nearly so).
    """
    num_points, _ = pts1.shape
    ones = np.ones((num_points, 1))

    # Homogeneous coordinates, shape [3, num_points]
    hom_pts1 = np.concatenate((pts1, ones), axis=-1).T
    hom_pts2 = np.concatenate((pts2, ones), axis=-1).T

    Fp1 = F @ hom_pts1                    # F * p
    Fp2 = F.T @ hom_pts2                  # F^T * p'
    p2Fp1 = (hom_pts2 * Fp1).sum(axis=0)  # p'^T * F * p

    error = (p2Fp1 ** 2) / (Fp1[0] ** 2 + Fp1[1] ** 2 + Fp2[0] ** 2 + Fp2[1] ** 2 + 1e-8)
    return error
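Putting the two functions above together into a naive motion mask (a sketch; the Sampson-error threshold is hypothetical and needs tuning):

# flow: optical flow of the shape [height, width, 2], e.g. from a pretrained network
height, width, _ = flow.shape
F = compute_fundamental_matrix(flow)

xx, yy = np.meshgrid(range(width), range(height), indexing='xy')
pts1 = np.stack((xx, yy), axis=-1).reshape(-1, 2).astype(np.float32)
pts2 = pts1 + flow.reshape(-1, 2)

error = sampson_error(F, pts1, pts2)
# Correspondences that violate the coplanarity constraint are flagged as moving
motion_mask = (error > 0.5).reshape(height, width)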

The results can be seen in the animation below.

Motion estimation with the Sampson error: most of the pixels, even on moving objects, move along the epipolar lines (see upper left) and therefore cannot be detected in the motion mask (bottom right)

The reason why motion masking fails to separate moving from static objects is that our assumption is incorrect.

In reality, many motion patterns occur within the epipolar plane: for instance, collinear motion such as the red truck in the image or cars in the opposite lane, as well as motion toward the FoE, like the white car.

An example of a coplanar moving object: the point moves within the epipolar plane (along the epipolar line) and therefore can't be detected by the Sampson error

We need something else to improve our detection.

The essential matrix estimation and camera motion modeling

What else can we do?

We need more information to handle motion segmentation. What if we calibrate the camera? Can the intrinsic calibration (matrix K) help us improve the detection?

Given the camera calibration, we can actually estimate the relative camera motion.

We can rewrite coplanarity constraints as follows:

The relation between corresponding points p and p' (expressed in normalized camera coordinates, i.e., after multiplying by K^-1) is captured by a special 3x3 matrix E, called the essential matrix: p'^T E p = 0.

The essential matrix encodes the rotation and translation of the camera, and we can decompose it to recover the camera motion (up to scale in translation).

The code for estimating the camera motion:

import numpy as np
import cv2


def compute_camera_motion(flow, intrinsics):
    """Compute camera motion from the essential matrix.

    Note: the intrinsics matrix has the form
            | fx  0 cx |
        K = |  0 fy cy |
            |  0  0  1 |

    Args:
        flow: optical flow of the shape [height, width, 2 (dx, dy)]
        intrinsics: the intrinsics matrix K of the shape 3x3

    Returns:
        R: 3x3 rotation matrix
        t: translation vector (up to scale)
    """
    height, width, _ = flow.shape
    xx, yy = np.meshgrid(range(width), range(height), indexing='xy')
    pts1 = np.asarray(np.stack((xx, yy), axis=-1), dtype=np.float32)
    pts2 = pts1 + flow

    pts1 = np.reshape(pts1, (-1, 2))
    pts2 = np.reshape(pts2, (-1, 2))

    # Convert pixel coordinates to normalized camera coordinates with K^-1
    ones = np.ones((height * width, 1))
    hom_pts1 = np.concatenate((pts1, ones), axis=-1)
    hom_pts2 = np.concatenate((pts2, ones), axis=-1)
    norm_points1 = (np.linalg.inv(intrinsics) @ hom_pts1.T).T[:, :2]
    norm_points2 = (np.linalg.inv(intrinsics) @ hom_pts2.T).T[:, :2]

    # Compute the essential matrix with the 5-point algorithm; valid_mask marks inliers.
    # Note: since the points are normalized, the RANSAC threshold is in normalized units.
    E, valid_mask = cv2.findEssentialMat(norm_points1, norm_points2, focal=1.0, pp=(0., 0.),
                                         method=cv2.RANSAC, prob=0.999, threshold=3.0)

    # Recover pose by decomposing E.
    # retval: number of inliers passing the cheirality check
    # R, t: relative rotation and translation between the two camera views
    # mask: inlier mask
    retval, R, t, mask = cv2.recoverPose(E, norm_points1, norm_points2, focal=1.0, pp=(0.0, 0.0))
    return R, t

Now, given the rotation, we can use it to compute a rotational homography. The idea is that all static environment points are affected by the same camera rotation, while dynamic points have additional motion of their own. Therefore, if we remove the rotation by applying the homography transformation, we will observe errors on moving objects due to their extra motion. The rotational homography can be formulated as H = K R K^-1 (as in the code below).

The homography error can then be formulated as the symmetric transfer error

error = d(p, H p')^2 + d(p', H^-1 p)^2

where d() is the Euclidean distance. In Python it can be implemented as follows:


import numpy as np


def homography_error(pts1, pts2, R, intrinsics):
    """Rotational homography error.

    Args:
        pts1: undistorted pixel coords [num_points, 2]
        pts2: undistorted pixel coords [num_points, 2]
        R: rotation between cameras
        intrinsics: 3x3 intrinsics

    Returns:
        error: homography error for each correspondence
    """
    num_points, _ = pts1.shape
    ones = np.ones((num_points, 1))

    # Homogeneous coordinates, shape [3, num_points]
    hom_pts1 = np.concatenate((pts1, ones), axis=-1).T
    hom_pts2 = np.concatenate((pts2, ones), axis=-1).T

    # Rotational homography H = K * R * K^-1
    H = intrinsics @ R @ np.linalg.inv(intrinsics)

    # Map the points with H and normalize back to inhomogeneous image coordinates
    Hp2 = H @ hom_pts2
    Hp2 = Hp2 / Hp2[2:3]
    Hp1 = np.linalg.inv(H) @ hom_pts1
    Hp1 = Hp1 / Hp1[2:3]

    # Symmetric transfer error: d(p1, H*p2)^2 + d(p2, H^-1*p1)^2
    error = (np.linalg.norm((hom_pts1 - Hp2)[:2], axis=0) ** 2
             + np.linalg.norm((hom_pts2 - Hp1)[:2], axis=0) ** 2)
    return error

The results are not good.

The homography error in the case of general motion does not seem to be the best way to go

Another possible trick is to estimate the focus of expansion (FoE) from the known motion direction. We can then remove the rotation from the flow with the homography and check the directions of the remaining flow vectors with respect to the FoE.

In the case of backward motion, static vectors point toward the FoE; otherwise, they point in the opposite direction.
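A sketch of that direction check (assuming the rotation has already been removed from the flow and the FoE pixel coordinates are known, e.g. computed from K and t as above; the consistency threshold is hypothetical):

import numpy as np


def foe_direction_consistency(flow, foe, eps=1e-8):
    """Cosine between flow vectors and the direction away from the FoE.

    For forward motion, static points should move away from the FoE
    (cosine close to +1); for backward motion, toward it (close to -1).

    Args:
        flow: rotation-compensated optical flow [height, width, 2]
        foe: (foe_x, foe_y) pixel coordinates of the focus of expansion
    """
    height, width, _ = flow.shape
    xx, yy = np.meshgrid(range(width), range(height), indexing='xy')
    # Unit direction from the FoE to each pixel
    dirs = np.stack((xx - foe[0], yy - foe[1]), axis=-1).astype(np.float32)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True) + eps
    # Unit flow directions
    flow_dirs = flow / (np.linalg.norm(flow, axis=-1, keepdims=True) + eps)
    return (dirs * flow_dirs).sum(axis=-1)


# Usage: for forward motion, pixels whose cosine is well below +1
# (e.g. < 0.9) are candidate moving objects.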

However, even with these tricks, it is not possible to model all types of collinear motion.

Analysis and Conclusions

We tried several different techniques to derive motion segmentation from optical flow but were not able to succeed. It is possible to use optical flow for motion segmentation with a stationary camera (or in some particular cases like single-object segmentation with minimal camera motion). However, it is impossible to derive general motion segmentation solely from pure optical flow.

Why is that?

Optical flow f = (fx, fy) is the projection of the 3D motion vector (scene flow) onto the 2D image plane. During this projection, information is lost, and it is not possible to recover the exact 3D motion vector for a given 2D flow vector (image source: [15])

The optical flow is the projection of 3D motion onto a 2D plane, and many different motion patterns in 3D can correspond to the same flow. It turns out that optical flow alone is not sufficient to recover 3D motion. The problem is under-constrained, with a large degree of ambiguity.

Ideally, we need to model the full 3D scene flow to resolve general motion.

Literature

[1] Coursera: Robotics Perception course from the University of Pennsylvania

[2] Carla simulator: http://carla.org//

[3] FlowNet: Learning Optical Flow with Convolutional Networks: https://lmb.informatik.uni-freiburg.de/Publications/2015/DFIB15/flownet.pdf

[4] FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks: https://openaccess.thecvf.com/content_cvpr_2017/papers/Ilg_FlowNet_2.0_Evolution_CVPR_2017_paper.pdf

[5] PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume, https://arxiv.org/abs/1709.02371

[6] RAFT: Recurrent All-Pairs Field Transforms for Optical Flow: https://arxiv.org/pdf/2003.12039.pdf

[7] What Makes RAFT Better Than PWC-Net? https://www.researchgate.net/publication/359390010_What_Makes_RAFT_Better_Than_PWC-Net

[8] Unifying Flow, Stereo and Depth Estimation https://arxiv.org/abs/2211.05783

[9] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows: https://arxiv.org/abs/2103.14030

[10] What Matters in Unsupervised Optical Flow https://arxiv.org/abs/2006.04902

[11] SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping https://arxiv.org/abs/2105.07014

[12] Uncertainty Estimates and Multi-Hypotheses Networks for Optical Flow: https://openaccess.thecvf.com/content_ECCV_2018/papers/Eddy_Ilg_Uncertainty_Estimates_and_ECCV_2018_paper.pdf

[13] Epipolar geometry tutorial OpenCV: https://www.geeksforgeeks.org/python-opencv-epipolar-geometry/

[14] Learning to Segment Rigid Motions from Two Frames: https://arxiv.org/pdf/2101.03694.pdf

[15] Self-Supervised Monocular Scene Flow Estimation https://openaccess.thecvf.com/content_CVPR_2020/papers/Hur_Self-Supervised_Monocular_Scene_Flow_Estimation_CVPR_2020_paper.pdf
