Camera motion estimation using optical flow

Discussing the concept of differentiating basic camera moves with OpenCV while walking through the code

Ivan Kunyankin
6 min read · Jul 9, 2020
Photo by Raul Hender on Unsplash

Recently, I participated in the development of a tool that was supposed to extract different kinds of visual information from videos. The target information included people’s silhouettes, prevailing colors, and camera moves (e.g. zoom or dolly). And while the first two tasks were pretty straightforward, the last one turned out to be more challenging.

The idea was to find a way to determine the direction, velocity, and type of camera movement in a video. One obvious solution here was to use optical flow. And although it worked well for direction and velocity, I had no idea how to tell different types of moves apart. It seemed that I had to look at the task from a different perspective.

I hope that this post will be helpful for those struggling with the same or a similar task.

Let’s start with listing what this post is about:

  1. Basic camera moves
  2. The core idea of differentiating moves
  3. Walking through the code

Basic camera moves

From my non-professional point of view, although there are many known camera movements, three of them are worth outlining:

  • One where the camera is mounted on a track and physically moved — truck, dolly, pedestal
  • One where the camera stays in the same place and rotates — pan, tilt
  • One where we’re changing the focal length of a zoom lens — zoom

You can follow this link to see how the listed moves differ visually from each other.

We can go even further and keep just two of them:

  • The camera moves — truck, dolly, pedestal
Example of a truck move. The camera moves parallel to the wall (source video by author)
  • The camera stays in the same place — zoom, pan, tilt
Example of panning. The camera is rotating (source video by author)

In this post, we will cover the concept of distinguishing moves using a pan vs. truck example. For other pairs (like zoom vs. dolly), one will probably need a different approach.

The core idea of differentiating moves

When I was speaking of a different perspective earlier, I meant that before writing the code, we need to figure out how we can explain the difference between panning and trucking. With these two, it is not that obvious.

When panning, all objects in the scene move across the image at roughly the same speed, because the flow caused by pure rotation does not depend on depth. When trucking, objects that are closer to the camera move faster than those that are farther away — this is the parallax effect: image motion caused by translation is inversely proportional to an object’s distance.

So the idea is to compare how much the velocity differs between objects in the video. Here’s my approach, step by step:

  • Calculate the optical flow and a translation value for each point on a pair of frames
  • Sort the translation values and cut some off from both ends — to get rid of outliers
  • Cluster the remaining values (I use 3 clusters to separate the values further from each other)
  • Calculate the difference between the minimum and maximum cluster centers
  • If the difference is higher than some threshold — the spread in velocities points to parallax, so we have trucking; otherwise — panning

You can find the implementation of this approach here.
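To make the steps concrete, here is a minimal sketch of the decision logic. It assumes scikit-learn is available for the clustering step, and the trim fraction, number of clusters, and threshold are illustrative values, not the exact ones from the original implementation:

import numpy as np
import cv2 as cv
from sklearn.cluster import KMeans

def classify_move(img1, img2, trim=0.1, n_clusters=3, threshold=2.0):
    """Rough pan vs. truck classifier for a pair of grayscale frames."""
    flow = cv.calcOpticalFlowFarneback(img1, img2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    # Per-point translation magnitudes
    magnitudes = np.linalg.norm(flow.reshape(-1, 2), axis=1)

    # Sort and trim a fraction from both ends to drop outliers
    magnitudes = np.sort(magnitudes)
    cut = int(len(magnitudes) * trim)
    magnitudes = magnitudes[cut:len(magnitudes) - cut]

    # Cluster the remaining values to push the slow and fast groups apart
    centers = KMeans(n_clusters=n_clusters, n_init=10).fit(magnitudes.reshape(-1, 1)).cluster_centers_

    # A large spread between the slowest and fastest clusters means parallax
    return "truck" if centers.max() - centers.min() > threshold else "pan"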

It works surprisingly well but unfortunately has its limitations.

Some objects in a video can be very tricky for optical flow analysis — for example, the sky or water. These objects don’t give the algorithm distinct, trackable points to monitor.

Example of an image, a big part of which is hard to track. Photo by Gianni Loginov on Unsplash

The approach also suffers from moving objects in the video — for example, if a video contains a crowd running around, or just one person in a close shot. Basically, it suffers whenever there are many points moving at different speeds in different directions.

A lot of chaotic movement can confuse the model. Photo by William Krause on Unsplash

One way of fighting this is to remove moving objects from the frames. You can do that by masking the undesired regions with an image segmentation model, meaning we simply don’t take the optical flow of the regions containing these objects into account.
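For instance, assuming you already have a binary mask of the moving objects from some segmentation model (the mask itself is taken as given here), discarding their flow vectors is a one-liner:

import numpy as np

def flow_outside_mask(flow, mask):
    """Keep only the flow vectors of pixels outside the masked regions.

    flow: H x W x 2 array from cv.calcOpticalFlowFarneback
    mask: H x W array, non-zero where a moving object was segmented
    """
    return flow[mask == 0]  # shape: (number of kept pixels, 2)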

Last but not least, the greater the distance between the foreground and the background, the better the algorithm works. When shooting close to a wall, it’d be hard to identify the movement even for a human. You can see this clearly in the footage examples above.

Walking through the code

Brief beginner’s guide to video reading and writing with OpenCV

This section will be helpful both for those interested in the implementation of the idea and for those who are just starting to learn OpenCV. Again, the actual code for this post can be found here.

The code calculates the mode angle and a translation value for every point in a pair of frames. These can be used to determine the direction and velocity of the camera movement. By the way, while calculating the angles, we treat the origin as being in the bottom-left corner (instead of the top-left corner, as in OpenCV) — it seems more convenient.
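As an illustration of that convention, here is a small helper (a sketch, not the exact code from the repository) that converts a dense flow field into angles and magnitudes with the y-axis flipped:

import numpy as np

def flow_to_polar(flow):
    """Convert a dense flow field (H x W x 2) into angles and magnitudes.

    Negating dy turns OpenCV's top-left origin into a bottom-left one,
    so angles grow counter-clockwise from the positive x-axis.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    angles = np.degrees(np.arctan2(-dy, dx)) % 360  # in [0, 360)
    magnitudes = np.hypot(dx, dy)
    return angles, magnitudes

The mode of the (rounded) angles then gives the dominant direction of the movement, and the magnitudes give its velocity.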

Let’s assume we want to process a video and determine how the camera is moving. We also want to write a new video, so we can draw the results of our algorithm’s work (or the optical flow itself) right onto the frames.

First of all, we need to capture a video from which we will read frames.

import os
import cv2 as cv

cap = cv.VideoCapture(os.path.join(path, filename))

This class takes either the path to the video file or the ID of the camera as an input. To read frames from the object, we use this:

ret, frame = cap.read()

This method returns two values. The first is a bool that is True if reading was successful (otherwise False), and the second is the frame itself (None if, for example, the camera is disconnected or there are no more frames in the video file).
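In practice, these two values are usually combined into a standard reading loop that stops once the video runs out of frames:

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:  # end of the file or a read error
        break
    # ... process the frame here ...

cap.release()  # free the capture object when done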

We can also call the ‘get’ method on the same object. This method can give us a whole lot of information. For example, we can extract the FPS value of the original video.

cap.get(cv.CAP_PROP_FPS)

Here is the full list of the information we can extract with this method.
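For example, besides FPS, we can ask for the frame count and the frame dimensions — enough to estimate the video duration (the property constants below are part of OpenCV’s public API):

fps = cap.get(cv.CAP_PROP_FPS)
frame_count = cap.get(cv.CAP_PROP_FRAME_COUNT)
width = int(cap.get(cv.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv.CAP_PROP_FRAME_HEIGHT))
duration_seconds = frame_count / fps  # approximate for variable frame rate videos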

We also need to initialise our VideoWriter object beforehand. It needs the FPS value and the frame size, both of which we can extract from the original video. Choosing the correct codec can be a little bit tricky: if your written video turns out empty, experiment by tweaking either the codec or the output video format. Here’s the OpenCV tutorial on saving a video. The following values worked for me:

fps = int(cap.get(cv.CAP_PROP_FPS))
codec = cv.VideoWriter_fourcc(*'XVID')
frame_size = (int(cap.get(cv.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv.CAP_PROP_FRAME_HEIGHT)))
outputStream = cv.VideoWriter("video.avi", codec, fps, frame_size)

Finally, we’ve come to something interesting. There are several algorithms available for calculating optical flow. We’ll use Gunnar Farneback’s algorithm to calculate dense optical flow. If you want to understand the details of how the algorithm works, you can read this article.

flow = cv.calcOpticalFlowFarneback(img1, img2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

This algorithm requires several parameters to be specified. You can either experiment to find values that fit your situation, or use the provided set to iterate through the whole process faster. Here is the description of each parameter.
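Putting the pieces together, here is a minimal end-to-end skeleton of the reading/processing/writing loop described above (the input file name is just a placeholder, and the annotation step is left as a comment):

import cv2 as cv

cap = cv.VideoCapture("input.mp4")  # placeholder file name
fps = int(cap.get(cv.CAP_PROP_FPS))
frame_size = (int(cap.get(cv.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv.CAP_PROP_FRAME_HEIGHT)))
outputStream = cv.VideoWriter("video.avi", cv.VideoWriter_fourcc(*'XVID'), fps, frame_size)

ret, prev_frame = cap.read()  # read the first frame to bootstrap the flow
prev_gray = cv.cvtColor(prev_frame, cv.COLOR_BGR2GRAY)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)

    # Dense optical flow between the previous and the current frame
    flow = cv.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    # ... draw the flow or the detected move type onto the frame here ...
    outputStream.write(frame)
    prev_gray = gray

cap.release()
outputStream.release()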

You can find the rest of the code here. Meanwhile, here are the results for two different videos. The conditions were almost ideal, so it is no surprise the algorithm worked correctly.

Example of processed trucking footage (source video by author)
Example of processed panning footage (source video by author)

I have presented the principal idea behind differentiating camera moves. The code should be treated as a starting point rather than a final solution.

I want to thank Maximilian Niemann for sharing with me some insights on working with videos and images. It was a great experience!

I hope this little guide will be useful for someone. Please let me know if you have any questions. You can also reach out to me via LinkedIn.
