Motion Modeling in Videos: Part 3 — Scene Flow

Logos
4 min read · Jul 21, 2024


Left — motion instance segmentation; Right — optical flow (scene flow projection to image)

In the previous parts we explored motion modeling in videos with optical flow and optical expansion, and concluded that the only way to correctly model a dynamic environment is scene flow.

Note: this post is part of a series:

  1. Motion Modeling in Videos: Part 1 — Optical Flow (link)
  2. Motion Modeling in Videos: Part 2 — Optical Expansion (link)
  3. Motion Modeling in Videos: Part 3 — Scene Flow

What is scene flow?

Scene flow is the 3D motion S(Sx, Sy, Sz) of a 3D point P(Px, Py, Pz).

Scene flow example: the scene flow S(Sx, Sy, Sz) is the 3D motion of the point P to the position P'. The projection f(fx, fy) of the scene flow onto the 2D image is the optical flow. Image source: [1]

A rigid motion of the point can be represented as:

P’ = R*P + T

and then the scene flow is

S = P' - P = R*P - P + T = (R - I)*P + T

Here R is a 3x3 rotation matrix, I is the 3x3 identity matrix, and T is the translation vector.
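The rigid-motion relation above can be sketched in a few lines of Python (a minimal example; the rotation about the z-axis is just an illustrative choice):

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

def scene_flow_rigid(P, R, T):
    """Scene flow of point P under rigid motion: S = (R - I)*P + T."""
    RP = [sum(R[i][j] * P[j] for j in range(3)) for i in range(3)]
    return [RP[i] - P[i] + T[i] for i in range(3)]

# Example: a point rotated 90 degrees about z and shifted along x.
P = [1.0, 0.0, 5.0]
R = rotation_z(math.pi / 2)
T = [0.5, 0.0, 0.0]
S = scene_flow_rigid(P, R, T)
# P' = P + S equals R*P + T = [0.5, 1.0, 5.0]
```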

The rotation matrix R is represented as follows:

Image source: [2]
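For reference, the image from [2] presumably showed the standard rotation matrix built from the elemental rotations about the coordinate axes, which compose as:

```latex
R_x(\alpha) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix},\quad
R_y(\beta) = \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix},\quad
R_z(\gamma) = \begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}

R = R_z(\gamma)\, R_y(\beta)\, R_x(\alpha)
```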

Let's substitute this matrix into our scene flow equation to get the final scene flow relation for motion in 3D:

The relation between scene flow and pinhole camera model

The pinhole camera model is a simplified model of the projection of a 3D point onto a 2D plane, with the assumption that there are no lenses (for more information, see the tutorial here):

Schematic representation of the pinhole projection: the projection of a 3D point X onto the image plane ends up at pixel coordinate x (left); the relation between the Y coordinate of the 3D point X and the focal length f (right); the equations projecting the 3D point X(X, Y, Z) to pixel coordinates (x, y) in the image (bottom)

The projection of the 3D point X with coordinates (X, Y, Z) onto the image plane at pixel coordinates (x, y) can be represented by the equation in the picture above (where X, Y, Z are the 3D coordinates, x, y are the 2D coordinates in the image plane, and f is the focal length). It is important to note that after projecting with this equation, the origin (0, 0) of the pixel coordinates is at the principal point p(cx, cy) (around the center of the image). If we want to transfer them to image coordinates with the origin at the top-left corner, we should shift all pixel coordinates by p(cx, cy).
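The pinhole projection, including the principal-point shift, can be sketched as (the numbers in the example are illustrative):

```python
def project_pinhole(X, Y, Z, f, cx=0.0, cy=0.0):
    """Project a 3D point (X, Y, Z) to pixel coordinates (x, y).

    With cx = cy = 0 the origin is at the principal point; pass the
    principal point (cx, cy) to move the origin to the top-left corner.
    """
    x = f * X / Z + cx
    y = f * Y / Z + cy
    return x, y

# A point 2 m in front of the camera, 0.5 m to the right, f = 500 px,
# principal point at (320, 240):
x, y = project_pinhole(0.5, 0.0, 2.0, f=500.0, cx=320.0, cy=240.0)
# x = 500 * 0.5 / 2 + 320 = 445.0, y = 240.0
```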

We can derive the dependencies for the scene flow as follows:

Note: here flow = (u, v) is the optical flow in the x and y directions.

We can see that the scene flow can be estimated from the optical flow, the motion in depth, and the depth of the current image.
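Putting the two pieces together, a minimal sketch of recovering scene flow from optical flow and depth: back-project the pixel at its current depth, back-project the flow-displaced pixel at its depth in the next frame, and subtract (function names are illustrative):

```python
def backproject(x, y, Z, f):
    """Invert the pinhole model: pixel (x, y) at depth Z -> 3D point."""
    return (x * Z / f, y * Z / f, Z)

def scene_flow_from_flow(x, y, u, v, Z, Z_next, f):
    """Scene flow from optical flow (u, v), the current depth Z, and the
    depth Z_next of the corresponding point in the next frame."""
    X0 = backproject(x, y, Z, f)
    X1 = backproject(x + u, y + v, Z_next, f)
    return tuple(b - a for a, b in zip(X0, X1))

# A pixel at (100, 0) with depth 10 m moves 10 px to the right at
# constant depth; with f = 500 px the 3D point moved 0.2 m along x.
S = scene_flow_from_flow(100.0, 0.0, 10.0, 0.0, 10.0, 10.0, 500.0)
```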

Deep-learning-based motion estimation from scene flow

In the paper [3], the authors first employ three separate networks to compute depth from monocular images, optical flow, and motion in depth. Additionally, they compute four cost tensors: homography error, Sampson error, plane plus parallax, and depth contrast error (see the appendix of paper [3] for details).

The authors also estimate camera motion by calculating the essential matrix and remove camera rotation from the scene flow, which is reconstructed from depth and optical flow, as this simplifies the task of rigid body masking.
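The rotation-removal step can be sketched as follows (a minimal sketch assuming the camera rotation R_cam has already been estimated, e.g. from the essential matrix; names are illustrative):

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def remove_camera_rotation(P, S, R_cam):
    """Subtract the part of the scene flow induced by camera rotation:
    S_res = S - (R_cam - I)*P."""
    RP = matvec(R_cam, P)
    return [S[i] - (RP[i] - P[i]) for i in range(3)]

# Sanity check: for a purely camera-induced flow S = (R - I)*P + T,
# removing the rotation leaves only the translation T.
P = [1.0, 2.0, 8.0]
R_cam = rotation_z(0.1)
T = [0.3, -0.2, 0.05]
RP = matvec(R_cam, P)
S = [RP[i] - P[i] + T[i] for i in range(3)]
residual = remove_camera_rotation(P, S, R_cam)  # approximately equals T
```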

They then use a network to predict a binary mask distinguishing the static and dynamic parts of the environment, as well as to identify instances of moving objects.

Finally, they use optimization to refine the resulting rigid-body scene flow.

As a result, the approach generalizes to out-of-distribution domains and reliably detects instances of moving objects.

Left — moving object instances; Right — optical flow

Literature

[1] Self-Supervised Monocular Scene Flow Estimation https://openaccess.thecvf.com/content_CVPR_2020/papers/Hur_Self-Supervised_Monocular_Scene_Flow_Estimation_CVPR_2020_paper.pdf

[2] https://en.wikipedia.org/wiki/Rotation_matrix

[3] Learning to Segment Rigid Motions from Two Frames. https://arxiv.org/abs/2101.03694
