How to tame time in Computer Vision

Ildar Idrisov, PhD
Deelvin Machine Learning
6 min read · Feb 16, 2021

Today I will talk about an interesting and quite simple experiment with time in the context of computer vision. We will attempt to understand how to make a car move from the future to the past, and what the difference is between a dynamic action movie and a narrative drama.

When we work with a still picture, everything is quite simple: it has height and width, and we see everything in it at once. It is worth noting that we actually “read” a picture in small areas, starting from the regions of interest and then moving to the periphery. A convolutional network works in a similar way: it reads the picture in blocks the size of its convolution kernel, starting from a corner of the picture and gradually shifting by a fixed stride.
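
To make the analogy concrete, here is a minimal sketch (my own illustration, not from the article) of a single kernel sliding over an image in kernel-sized blocks:

import numpy as np

image = np.random.randint(0, 256, (480, 640)).astype(np.float32)  # a toy grayscale picture
kernel = np.ones((3, 3), np.float32) / 9.0                         # simple averaging kernel
stride = 1

out_h = (image.shape[0] - kernel.shape[0]) // stride + 1
out_w = (image.shape[1] - kernel.shape[1]) // stride + 1
output = np.zeros((out_h, out_w), np.float32)

# start from the corner and shift the kernel-sized window by the stride
for y in range(out_h):
    for x in range(out_w):
        block = image[y * stride:y * stride + 3, x * stride:x * stride + 3]
        output[y, x] = np.sum(block * kernel)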

The situation is a little more complicated with video. Looking at one frame, we do not see and do not know what is happening in the next one, and this is true for every frame. The reason is that video has a third dimension: time.

Since video is three-dimensional data, we could move from our favorite 2D convolutions to 3D ones. This can be done, but compared to width and height, the time axis brings two problems. First, the size of a sample along the time axis is not constant. For example, movies with the same 640x480 resolution can have very different durations, and while a change in resolution can be handled by resizing without losing the idea of the plot, “resizing” in time is a much harder task.

Second, the time component can be very long. For example, a 1.5-hour movie contains 135,000 frames (90 minutes * 60 seconds * 25 frames per second). Processing such a volume of 3D data at once is problematic, so we would have to split the video into parts and process each one separately, and then no part would carry information about the others.
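
To make this more tangible, here is a small sketch (my own illustration, not from the article; the 3D convolution part assumes PyTorch is installed) of how quickly the raw pixel volume adds up and of what a 3D convolution over a short clip looks like:

import torch
import torch.nn as nn

# back-of-the-envelope size of a 90-minute 640x480 movie as raw uint8 RGB
frames = 90 * 60 * 25                        # 135,000 frames
bytes_per_frame = 640 * 480 * 3              # width * height * RGB, one byte each
print(frames * bytes_per_frame / 1024 ** 3)  # ~115.9 GiB of raw pixels

# a 3D convolution simply treats time as one more axis next to height and width,
# so in practice it only ever sees the short clip it is given, not the whole movie
clip = torch.randn(1, 3, 16, 64, 64)         # (batch, channels, time, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv3d(clip).shape)                    # torch.Size([1, 8, 16, 64, 64])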

One possibility would be to use recurrent networks, for example an LSTM in conjunction with a feature extractor. However, in the current experiment we are interested in a more visual representation of time, to better understand the principles behind video sequences and 3D convolutions.

In our experiment, we will use neither LSTMs nor 3D convolutions, and in general, we will forget about deep learning for a while. Today we ourselves will become a convolutional network.

As mentioned above, once time is added, how objects change along this axis is no longer obvious from a single frame. How can one look at all of time at once? Let’s simply swap the time axis with the width axis, leaving the height axis in place. Then, in one frame, we will see everything that happened throughout the entire video, and an object’s movement across the scene along the width will be displayed as its change over time in the video sequence. Perhaps this sounds complicated, so let’s look at it in practice.

Video sequence before conversion

To do this, we collect all the frames sequentially and get a rectangular parallelepiped with dimensions Width x Height x Time. After transposition we get a parallelepiped with dimensions Time x Height x Width: Time is now laid out along the width of the picture, and Width plays the role of Time.

Video sequence after conversion

The pictures above are frames of the same video. The result looks a bit strange and I will explain this later. Now the main thing is to understand the principle of the transformation.

Let’s see how to implement this in Python. Note that it is not a good idea to start with a long or high-resolution (FullHD, 4K) video: you may run out of RAM. For a simple implementation, I keep all frames of the video sequence in memory, since swapping the axes requires all of them. If you need to process a large amount of data that does not fit into RAM, you will have to save the frames to disk and read them back piece by piece; a sketch of one such approach is given at the end of the code walkthrough.

import cv2
import numpy as np
filename = 'input_video'
cap = cv2.VideoCapture(filename + '.mp4')
frameCount = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
frameWidth = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frameHeight = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
print(frameCount, frameHeight, frameWidth)

In the code below, we create a buffer and add frames into it. The variable N ensures that not every frame, but only every N-th one, gets into the buffer; this is useful if you want to thin out a video sequence. With N equal to 1, we read all frames.

N = 1
buf = np.empty((int(frameCount / N), frameHeight, frameWidth, 3), np.dtype('uint8'))

fc = 0  # index of the current frame in the source video
bc = 0  # index of the next free slot in the buffer
ret = True

while fc < frameCount and ret:
    if (fc % N) == N - 1:
        # keep this frame: read it straight into the buffer
        ret, buf[bc] = cap.read()
        bc += 1
    else:
        # skip this frame
        ret, _ = cap.read()
    fc += 1

cap.release()

For convenience when working with Python and cv2, the buffer actually stores the axes in the reverse order, Time x Height x Width (plus a channel axis). We perform the transformation around the height axis, swapping Time and Width, as in the code below:

buf = np.transpose(buf, (2, 1, 0, 3))

However, if you want to swap Time and Height, then the following transformation is used:

buf = np.transpose(buf, (1, 0, 2, 3))

Just a little hint: for the output video to be visually understandable, it is better to swap the time axis with the axis along which the objects move. For example, if cars go from left to right, swap Time with Width; if they go from top to bottom, swap Time with Height.
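
For convenience, this hint can be wrapped in a small helper (my own addition, not from the article; it relies on the numpy import from above, and the function name is arbitrary):

def swap_time_axis(buf, motion='horizontal'):
    # swap Time with Width for left-to-right motion, Time with Height for top-to-bottom motion
    if motion == 'horizontal':
        return np.transpose(buf, (2, 1, 0, 3))
    return np.transpose(buf, (1, 0, 2, 3))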

Then you can either view the result in a playback window:

i = 0
while True:
    # stretch the time axis back to a convenient width for viewing
    frame = cv2.resize(buf[i], dsize=(frameWidth * 3, frameHeight), interpolation=cv2.INTER_CUBIC)
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    i += 1
    if i >= buf.shape[0]:
        i = 0

cv2.destroyAllWindows()

Or save it as a file:

writer = cv2.VideoWriter(filename + '_result.mp4', cv2.VideoWriter_fourcc(*'MP4V'), 25, (frameWidth * 3, frameHeight))
for i in range(buf.shape[0]):
    frame = cv2.resize(buf[i], dsize=(frameWidth * 3, frameHeight), interpolation=cv2.INTER_CUBIC)
    writer.write(frame)
writer.release()

Please note that the width of the output video is equal to the number of frames in the original video. Therefore, for ease of viewing, I resize the output frames, here to three times the input width.
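
As promised earlier, here is one possible way (my own sketch, not part of the original article; file names here are placeholders) to handle a video that does not fit into RAM: keep the frame cube in a disk-backed array via np.memmap and build the output frame by frame.

import cv2
import numpy as np

cap = cv2.VideoCapture('input_video.mp4')
T = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
H = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
W = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))

# frames go straight to disk as they are read, laid out as Time x Height x Width x 3
disk_buf = np.memmap('frames.dat', dtype=np.uint8, mode='w+', shape=(T, H, W, 3))
for t in range(T):
    ret, frame = cap.read()
    if not ret:
        break
    disk_buf[t] = frame
cap.release()
disk_buf.flush()

# output frame number x is the slice "column x over all time", shaped Height x Time x 3
writer = cv2.VideoWriter('result.mp4', cv2.VideoWriter_fourcc(*'MP4V'), 25, (W * 3, H))
for x in range(W):
    out_frame = np.ascontiguousarray(disk_buf[:, :, x, :].transpose(1, 0, 2))
    writer.write(cv2.resize(out_frame, (W * 3, H), interpolation=cv2.INTER_CUBIC))
writer.release()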

Let’s take a look at the results.

The video shows two cars quickly passing walking people. Anything that stands still in the video is effectively invisible in the output: this projection of time shows everything, but we perceive only the moving objects. Moreover, the faster an object moves, the narrower it appears in the output video, and the slower it moves, the wider it appears; if an object stops, it is smeared across the entire width of the picture. This is because each output frame corresponds to one column of the original scene, and an object spans as many output pixels along the width as the number of frames it spent at that column.

This can be seen very clearly in the next couple of videos. Note the two trains passing at the bottom of the frame, compared with the cars.

In this video many more cars pass by, and we see all of them at once already in the first frame.

Interestingly, we scan from left to right. Therefore, everyone who moves in that same direction will appear in the output video moving from back to front. This is clearly seen here:

Also, in this case the past is on the left and the future on the right, again because of the left-to-right scanning. Some objects may therefore appear to be moving from the future into the past, but this is simply where they were located at those moments in time.

And here you can see how the view from a porthole window unfolds into a panorama of the city with a parallax effect.

And what if we unfold not just part of a city, but the whole Earth?

Here is another interesting example:

What do you think we will get as an output?

It’s also interesting to see how longer footage looks in this case. For the experiment I took two movie trailers: at about 2–3 minutes they are longer than the videos above and, moreover, very different in content.

The first one is The Shawshank Redemption (1994). The video clearly shows the transitions between scenes, which is a very interesting effect: you can split the trailer into shots just by looking at the first frame. I increased the width of the output video three times so that this is easier to see.

The second one is Iron Man (2008). You can immediately see that the action movie contains far more shots; it feels as if they change every couple of seconds. This is modern action cinema.

I think that’s enough for today. We were able to note very interesting things. I hope you enjoyed the experiment and will be able to apply this trick in your own work with videos.

We have many more interesting articles in our Deelvin Machine Learning blog, and we hope you will enjoy reading them. Also, don’t forget to visit our website, deelvin.com.

Videos were taken from www.pexels.com
