Phase Based Motion Magnification: Part 2

How to detect imperceptible motions in video

Isaac Berrios
11 min read · Jan 30, 2024

Everyday objects produce subtle motions that are undetectable to the human eye. In a video, these motions are less than a pixel in magnitude, but we can reveal them with Phase Based Video Magnification, and in this post we will code it from scratch in Python! If you’re interested in the theory, we covered it in the previous post. A sample of the results is shown in figure 6.

Photo by Guillermo Ferla on Unsplash

Algorithm

We covered a high level algorithm in part 1, but it really didn’t provide enough detail for implementation. The algorithm we will use is shown below.

Motion Magnification Algorithm. Source: Author.

The majority of the processing occurs in step 3; the part that performs the actual motion magnification is shown in blue text. The Phase Variations correspond to local motions in the video frame sequence. In the next three sections we will break down the algorithm and implement it from scratch.

Preparing for the Processing

In this section, we will cover steps 0–2.

The first task is to obtain the video frames; for the sake of simplicity we will just load them all into memory. We will also need the video sample rate, which is straightforward to get with OpenCV. In this tutorial we convert each image to the YIQ color space and process only the Luma channel, please see the notebook for the full code.

import cv2
import numpy as np

scale_factor = 1.0 # OPTIONAL: scale images so they fit in memory :)
bgr_frames = []
frames = [] # frames for processing
cap = cv2.VideoCapture(video_path)

# video sampling rate
fs = cap.get(cv2.CAP_PROP_FPS)

idx = 0

while cap.isOpened():
    ret, frame = cap.read()
    # if frame is read correctly ret is True
    if not ret:
        break

    if idx == 0:
        og_h, og_w, _ = frame.shape
        w = int(og_w*scale_factor)
        h = int(og_h*scale_factor)

    # store original frames
    bgr_frames.append(frame)

    # get normalized YIQ frame
    rgb = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)/255)
    yiq = rgb2yiq(rgb)

    # append resized Luma channel
    frames.append(cv2.resize(yiq[:, :, 0], (w, h)))

    idx += 1

cap.release()
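
The rgb2yiq helper comes from the accompanying notebook. A minimal sketch of an equivalent conversion is shown below, assuming the standard NTSC RGB-to-YIQ matrix (the notebook’s coefficients may differ slightly); the inverse is included since we need it later to get back to RGB for display.

# Standard NTSC RGB <-> YIQ conversion (assumed; the notebook defines its own version)
YIQ_MATRIX = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]], dtype=np.float32)

def rgb2yiq(rgb):
    """Convert a normalized HxWx3 RGB image to YIQ."""
    return rgb @ YIQ_MATRIX.T

def yiq2rgb(yiq):
    """Convert a YIQ image back to normalized RGB."""
    return yiq @ np.linalg.inv(YIQ_MATRIX).T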

Here is a sample of the Luma channel frames that we will be working with:

Figure 1. Left: Luma channel of frame 0. Right: Difference between the Luma channels of frames 1 and 0. Source: Author.

We can see a faint outline of the crane in the frame difference on the right side of figure 1. This indicates that something interesting could be happening with these pixels, but it’s likely attributable to noise, since we get a similar outline of the building, which isn’t moving. The algorithm is able to pull out subtle motions of less than a pixel by using many video frames. The next task is to set the hyperparameters:

# Temporal bandpass filter frequencies
f_lo = 0.2
f_hi = 0.25

# weighted amplitude blur strength
sigma = 5.0

# attenuate other frequencies
attenuate = True

# phase magnification factor
phase_mag = 25.0

Gaussian Kernel

Now let’s continue the preprocessing and obtain the Gaussian kernel for the amplitude weighted blur. As noted in part 1, this step is optional and we can still obtain good results without it.

import torch
import torch.nn.functional as F

# use the GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ensure ksize is odd or the filtering will take too long
# see warning in: https://pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html
ksize = np.max((3, np.ceil(4*sigma) - 1)).astype(int)  # e.g. sigma = 5.0 -> ksize = 19
if (ksize % 2) != 1:
    ksize += 1

# get Gaussian Blur Kernel
# the outer product gives a 2D kernel, reshaped to (1, 1, ksize, ksize) for conv2d
gk = cv2.getGaussianKernel(ksize=ksize, sigma=sigma)
gauss_kernel = torch.tensor(gk @ gk.T).type(torch.float32) \
                    .to(device) \
                    .unsqueeze(0) \
                    .unsqueeze(0)

Figure 2. Gaussian Kernel used for Amplitude Weighted Blurring. Source: Author.

Bandpass Filter

For the bandpass filter, we will use scipy.signal to construct an FIR filter. The first thing we need to do is normalize our frequencies by the Nyquist frequency (half the sample rate), so that valid frequencies map to the range [0, 1]; any normalized frequency greater than 1 will be aliased. For example, with a 30 FPS video, f_lo = 0.2 Hz normalizes to 0.2/(30/2) ≈ 0.013.

from scipy import signal

# normalize frequencies to the Nyquist frequency
norm_f_lo = f_lo / fs * 2
norm_f_hi = f_hi / fs * 2

# Get Bandpass Impulse Response
bandpass = signal.firwin(numtaps=len(frames),
                         cutoff=[norm_f_lo, norm_f_hi],
                         pass_zero=False)

# we could pass unnormalized freqs as long as we include the sample rate fs
# bandpass = signal.firwin(numtaps=len(frames),
#                          cutoff=[f_lo, f_hi],
#                          pass_zero=False,
#                          fs=fs)

# Get Frequency Domain Transfer Function
transfer_function = torch.fft.fft(
    torch.fft.ifftshift(torch.tensor(bandpass))).to(device).type(torch.complex64)
transfer_function = torch.tile(transfer_function,
                               [1, 1, 1, 1]).permute(0, 3, 1, 2)
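
The tile and permute calls reshape the one-dimensional transfer function to shape (1, len(frames), 1, 1), so that later on it can broadcast along the time dimension (dim=1) of the phase delta tensor. A quick, purely illustrative check:

# the filter has one tap per video frame, so after reshaping it broadcasts
# against a (batch, frames, height, width) tensor along the frame dimension
print(transfer_function.shape)   # torch.Size([1, len(frames), 1, 1])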

Before we move on, let’s take a look at the frequency domain response of the filter.

import matplotlib.pyplot as plt

norm_freqs, response = signal.freqz(bandpass)
freqs = norm_freqs / np.pi * fs/2

_, ax = plt.subplots(2, 1, figsize=(15, 7))
ax[0].plot(freqs, 20*np.log10(np.abs(response)));
ax[0].set_title("Frequency Response");

ax[1].plot(freqs, np.angle(response));
ax[1].set_title("Phase Response");

Figure 3. Frequency Domain Response of our Bandpass FIR Filter. Source: Author.

The top plot shows the frequency response, and we can see that the passband roughly corresponds to the 0.2–0.25 Hz range that we initially set.

Complex Steerable Pyramid

Now we can instantiate our Complex Steerable Pyramid class. As noted in part 1, Sub-Octave Filters allow for better motion magnification, so we opt for half-octave filters.

ref_frame = frames[0]  # reference Luma frame (assumed here; the notebook makes its own choice)

# e.g. a 360 pixel minimum dimension gives floor(log2(360)) - 2 = 6 levels
max_depth = int(np.floor(np.log2(np.min(np.array(ref_frame.shape)))) - 2)

csp = SteerablePyramid(depth=max_depth,
                       orientations=8,
                       filters_per_octave=2,
                       twidth=0.75,
                       complex_pyr=True)

Here are the partitions of all the oriented sub-bands in frequency space; the higher values in red are due to the overlap of the filters. Steerable Pyramid decompositions are overcomplete, meaning that they contain more than enough information to reconstruct the image. The Complex Steerable Pyramid only captures positive frequencies, since the negative frequencies can be obtained via Complex Conjugation.

Figure 4. Oriented Sub-Band Partitions of all Complex Steerable Filters in Frequency Space. Left: Is the sum of all Sub-Band Filters to show the intensity of the overlap. Right: Color coded Sub-Band Filters. Source: Author.
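
A plot like the left panel of figure 4 can be reproduced by summing the filters. A minimal sketch, assuming filters holds the 2D frequency-domain filters produced by the pyramid class (the notebook builds this list itself):

# sum of all sub-band filters; brighter regions show where neighboring
# filters overlap in frequency space
coverage = np.sum(np.array(filters), axis=0)

plt.imshow(coverage, cmap='jet')
plt.title('Sum of all Complex Steerable Filters (frequency domain)')
plt.colorbar()
plt.show()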

Final Preprocessing Steps

Next we convert the filters and video frames to PyTorch Tensors:

filters_tensor = torch.tensor(np.array(filters)).type(torch.float32).to(device)
frames_tensor = torch.tensor(np.array(frames)).type(torch.float32).to(device)

Then we take the Discrete Fourier Transform (DFT) of the video frames:

video_dft = torch.fft.fftshift(
    torch.fft.fft2(frames_tensor, dim=(1,2)),
    dim=(1,2)).type(torch.complex64).to(device)

And finally we create a tensor to store the DFT of the motion magnified frames.

recon_dft = torch.zeros((len(frames), h, w), dtype=torch.complex64).to(device)

Performing the Processing

This section covers step 3.

Now we can finally start the processing, where our outer loop will iterate over batches of the Pyramid filters. Since we are using PyTorch, we can conveniently parallelize this on a GPU. The first thing we do is create a tensor to store the Phase Deltas, which are the differences between the phase of the current frame and the phase of the reference frame.

phase_deltas = torch.zeros((batch_size, len(frames), h, w),
                           dtype=torch.complex64).to(device)
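
This tensor (and the loop below) relies on a few bookkeeping variables that are set in the notebook; illustrative values, which are assumptions rather than values from the article, might look like:

num_frames = len(frames)   # number of video frames
ref_idx = 0                # index of the reference frame used for the phase deltas
batch_size = 2             # how many pyramid filters to process at once on the GPU
eps = 1e-6                 # avoids division by zero in the amplitude weighted blur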

The loop over the filters excludes the High and Low Pass components; we will add those back in after the main processing. Here’s the entire block of code for step 3:

for level in range(1, len(filters) - 1, batch_size):

    # get batch indices
    idx1 = level
    idx2 = level + batch_size

    # get current filter batch
    filter_batch = filters_tensor[idx1:idx2]

    ## get reference frame pyramid and phase (DC)
    ref_pyr = build_level_batch(video_dft[ref_idx, :, :].unsqueeze(0), filter_batch)
    ref_phase = torch.angle(ref_pyr)

    ## Get Phase Deltas for each frame
    for vid_idx in range(num_frames):
        curr_pyr = build_level_batch(
            video_dft[vid_idx, :, :].unsqueeze(0), filter_batch)

        # unwrapped phase delta
        _delta = torch.angle(curr_pyr) - ref_phase

        # get phase delta wrapped to [-pi, pi]
        phase_deltas[:, vid_idx, :, :] = ((torch.pi + _delta) \
                                          % (2*torch.pi)) - torch.pi

    ## Temporally Filter the phase deltas
    # Filter in Frequency Domain and convert back to phase space
    phase_deltas = torch.fft.ifft(transfer_function \
                                  * torch.fft.fft(phase_deltas, dim=1),
                                  dim=1).real

    ## Apply Motion Magnifications
    for vid_idx in range(num_frames):

        curr_pyr = build_level_batch(video_dft[vid_idx, :, :].unsqueeze(0), filter_batch)
        delta = phase_deltas[:, vid_idx, :, :]

        ## Perform Amplitude Weighted Blurring
        if sigma != 0:
            amplitude_weight = torch.abs(curr_pyr) + eps

            # Torch Functional Approach (faster)
            weight = F.conv2d(input=amplitude_weight.unsqueeze(1),
                              weight=gauss_kernel,
                              padding='same').squeeze(1)

            delta = F.conv2d(input=(amplitude_weight * delta).unsqueeze(1),
                             weight=gauss_kernel,
                             padding='same').squeeze(1)

            # get weighted Phase Deltas
            delta /= weight

        ## Modify phase variation
        modified_phase = delta * phase_mag

        ## Attenuate other frequencies by scaling magnitude by reference phase
        if attenuate:
            curr_pyr = torch.abs(curr_pyr) * (ref_pyr/torch.abs(ref_pyr))

        ## apply modified phase to current level pyramid decomposition
        # if modified_phase = 0, then no change!
        curr_pyr = curr_pyr * torch.exp(1.0j*modified_phase) # ensures correct type casting

        ## accumulate reconstructed levels
        recon_dft[vid_idx, :, :] += recon_level_batch(curr_pyr, filter_batch).sum(dim=0)

Let’s cover this step by step. The first thing we do is get the current batch of filters, so we only build a portion of the Pyramid at a time instead of the whole thing at once.

# get batch indices
idx1 = level
idx2 = level + batch_size

# get current filter batch
filter_batch = filters_tensor[idx1:idx2]

In the next step we are building the reference pyramid and getting the reference phase for the filter batch.

## get reference frame pyramid and phase (DC)
ref_pyr = build_level_batch(video_dft[ref_idx, :, :].unsqueeze(0),
                            filter_batch)
ref_phase = torch.angle(ref_pyr)
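
The build_level_batch and recon_level_batch helpers come from the notebook. A minimal sketch that is consistent with how they are used in this post (multiply the shifted frame DFT by each filter and inverse-transform to get the complex sub-bands, then do the reverse to reconstruct) could look like the following; the notebook’s versions may differ in detail, e.g. scaling factors for the complex sub-bands.

def build_level_batch(dft, filter_batch):
    # dft: (1, H, W) shifted frame DFT; filter_batch: (B, H, W) frequency-domain filters
    # returns the complex spatial-domain sub-bands for the batch, shape (B, H, W)
    return torch.fft.ifft2(torch.fft.ifftshift(dft * filter_batch, dim=(-2, -1)))

def recon_level_batch(pyr, filter_batch):
    # returns each sub-band to the frequency domain, weighted again by its filter
    return torch.fft.fftshift(torch.fft.fft2(pyr), dim=(-2, -1)) * filter_batch

# build_level (used later for the hi/lo pass components) is the single-filter analogue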

In the next step, we go into our first inner loop where we iterate over all the video frames. The main purpose of this step is to obtain the phase deltas for all the frames for the current filter batch.

## Get Phase Deltas for each frame
for vid_idx in range(num_frames):
    # compute pyramid decomposition -> OPTIONALLY store in data structure
    curr_pyr = build_level_batch(
        video_dft[vid_idx, :, :].unsqueeze(0), filter_batch)

    # unwrapped phase delta
    _delta = torch.angle(curr_pyr) - ref_phase

    # get phase delta wrapped to [-pi, pi]
    phase_deltas[:, vid_idx, :, :] = ((torch.pi + _delta) \
                                      % (2*torch.pi)) - torch.pi

The last line of this loop wraps the phase deltas to [-π, π]. The next step is to perform the Temporal Bandpass filtering in the Frequency Domain; we do this with the one-liner shown after the short check below, where we take the inverse DFT to go back to the time domain.
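
As a quick sanity check of the wrapping formula, here is a hypothetical phase value pushed through it (this snippet is illustrative and not part of the notebook):

# a raw delta of 1.9*pi lies outside [-pi, pi] and should wrap to -0.1*pi
_delta = torch.tensor(1.9 * torch.pi)
wrapped = ((torch.pi + _delta) % (2*torch.pi)) - torch.pi
print(wrapped / torch.pi)   # approximately -0.1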

## Temporally Filter the phase deltas
# Filter in Frequency Domain and convert back to phase space
phase_deltas = torch.fft.ifft(transfer_function \
                              * torch.fft.fft(phase_deltas, dim=1),
                              dim=1).real

Now we are ready to start modifying the motion. In the second inner loop we iterate through the video frames again; the first thing we do is acquire batches of Pyramid levels and their corresponding filtered phase deltas. Then we can optionally perform Amplitude Weighted Blurring, which is explained in part 1.

## Apply Motion Magnifications
for vid_idx in range(num_frames):
    # rebuild level or grab from stored tensor
    curr_pyr = build_level_batch(video_dft[vid_idx, :, :].unsqueeze(0),
                                 filter_batch)
    delta = phase_deltas[:, vid_idx, :, :]

    ## Perform Amplitude Weighted Blurring
    if sigma != 0:
        amplitude_weight = torch.abs(curr_pyr) + eps

        # Torch Functional Approach (faster)
        weight = F.conv2d(input=amplitude_weight.unsqueeze(1),
                          weight=gauss_kernel,
                          padding='same').squeeze(1)

        delta = F.conv2d(input=(amplitude_weight * delta).unsqueeze(1),
                         weight=gauss_kernel,
                         padding='same').squeeze(1)

        # get weighted Phase Deltas
        delta /= weight

Next we can actually modify the motion for each frame in the current filter batch. The second part of this snippet has something we haven’t covered yet: frequency attenuation.

## Modify phase variation
modified_phase = delta * phase_mag

## Attenuate other frequencies by scaling magnitude by reference phase
if attenuate:
    curr_pyr = torch.abs(curr_pyr) * (ref_pyr/torch.abs(ref_pyr))

The frequency attenuation removes all frequencies except the ones we want to modify, but how? It does this by first taking the magnitude of the current pyramid decomposition, which removes all phase information; but we know that we can’t reconstruct an image without phase information. For this reason we need to add back some sort of phase, and we choose the DC phase (the reference frame phase), which is done by multiplying by the reference frame coefficients normalized to unit magnitude.
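
To see why ref_pyr/torch.abs(ref_pyr) carries only the phase, here is a tiny illustrative check with a made-up complex value (not part of the notebook):

z = torch.tensor(3.0 + 4.0j)              # hypothetical pyramid coefficient
unit = z / torch.abs(z)                   # unit magnitude, same angle
print(torch.abs(unit))                    # 1.0
print(torch.angle(unit), torch.angle(z))  # identical phases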

We are then able to add the modified phase to the current pyramid batch. We do this by placing the phase in a Complex Exponential and multiplying it with the current pyramid.

## apply modified phase to current level pyramid decomposition
# if modified_phase = 0, then no change!
curr_pyr = curr_pyr * torch.exp(1.0j*modified_phase) # ensures correct type casting

We then reconstruct the pyramid to get the DFT of the modified Pyramid Batch and accumulate it in our results tensor.

## accumulate reconstructed levels
recon_dft[vid_idx, :, :] += recon_level_batch(curr_pyr, filter_batch).sum(dim=0)

Collapsing the Pyramid

This section covers steps 4 and 5.

Now we can discuss the reconstruction. The first thing we need to do is reincorporate the High and Low Pass components. I found that the High Pass component is not important for an accurate reconstruction, so you can optionally leave it out.

# adding the hipass component seems to cause bad artifacts and leaving
# it out doesn't seem to impact the overall quality
hipass = filters_tensor[0]
lopass = filters_tensor[-1]

## add back lo and hi pass components
for vid_idx in range(num_frames):
    # accumulate Lo Pass Components
    curr_pyr_lo = build_level(video_dft[vid_idx, :, :], lopass)
    dft_lo = torch.fft.fftshift(torch.fft.fft2(curr_pyr_lo))
    recon_dft[vid_idx, :, :] += dft_lo*lopass

    # OPTIONAL: accumulate Hi Pass Components
    curr_pyr_hi = build_level(video_dft[vid_idx, :, :], hipass)
    dft_hi = torch.fft.fftshift(torch.fft.fft2(curr_pyr_hi))
    recon_dft[vid_idx, :, :] += dft_hi*hipass

Now we can perform the inverse DFT across all frames; this is the final step of the reconstruction.

result_video = torch.fft.ifft2(torch.fft.ifftshift(recon_dft, dim=(1,2)), dim=(1,2)).real
result_video = result_video.cpu() # Remove from CUDA

Visualizing the Results

We can create a side-by-side comparison image for each frame.

Figure 5. Side-by-side comparison of each modified video frame. Source: Author.

We can even create a GIF. Recall that we only processed the Luma channel, which corresponds to image brightness; this is typically enough to give good results, as opposed to processing all of the channels.

Figure 6. GIF of Phase Based Motion Processing results. Source: Author.
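
To build the color frames for the GIF, the processed Luma has to be recombined with the original chroma channels and converted back to RGB. The notebook handles this in its own way; a minimal sketch, assuming the yiq2rgb helper from earlier (the inverse of rgb2yiq), might look like:

out_frames = []
for i, bgr in enumerate(bgr_frames):
    # original frame in normalized YIQ
    rgb = np.float32(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB) / 255)
    yiq = rgb2yiq(rgb)

    # swap in the motion magnified Luma (resized back to the original size)
    yiq[:, :, 0] = cv2.resize(result_video[i].numpy(), (og_w, og_h))

    # back to 8-bit BGR for OpenCV/GIF writers
    rgb_out = np.clip(yiq2rgb(yiq), 0, 1)
    out_frames.append(cv2.cvtColor(np.uint8(rgb_out * 255), cv2.COLOR_RGB2BGR))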

Visualizing the Details

Now that we’ve seen the results in action, let’s take a look at what’s really going on. The image below does just that; the code to generate these images is in the final part of the notebook.

Figure 7. Detailed Motion Amplification across all frames. Source: Author.

The green lines show the single-pixel lines that we are inspecting; we can visualize how these lines of pixels change throughout both the original and processed videos and get a side-by-side comparison. On the right side of figure 7, we can see that the structure of the crane in the original video has no discernible motion throughout the entire sequence, but the amplified side definitely shows motion! Better yet, the amplified motion is different for the two regions, indicating that the technique can adapt to localized motions. In other words, we’re not just applying a blanket amplification factor, we are amplifying motions based on how the objects are actually moving. This shows that the technique is not just a fancy visual effect, but is indeed a visual microscope for motion.
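
A space-time slice like the ones in figure 7 can be produced by taking the same line of pixels from every frame and stacking the lines over time. A small sketch (the row index is hypothetical, chosen to cross the crane):

row = 200  # hypothetical row marked by a green line in figure 7

# stack that row from every frame: shape (num_frames, width)
orig_slice = np.stack([f[row, :] for f in frames], axis=0)
magn_slice = result_video[:, row, :].numpy()

_, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].imshow(orig_slice, cmap='gray'); ax[0].set_title('Original')
ax[1].imshow(magn_slice, cmap='gray'); ax[1].set_title('Motion Magnified')
plt.show()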

Conclusion

Phase Based Motion Magnification is a video processing technique that is able to amplify imperceptible motions in seemingly static videos. The technique is computationally expensive due to the large Pyramid decomposition, but a GPU implementation greatly speeds up the process: from more than 5 minutes with a pure NumPy implementation to less than 30 seconds on a 6 GB GPU.

The results aren’t perfect, but the processing is able to clearly amplify the otherwise invisible motion. There are also newer Deep Learning based methods that are able to provide clearer results.

Thanks for reading to the end! If you found this post useful, please consider clapping 👏; it really helps others who might also find it useful.


References

[1] Portilla, J., & Simoncelli, E. (2000). A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients. International Journal of Computer Vision, 40. doi:10.1023/A:1026553619983

[2] Wadhwa, N., Rubinstein, M., Durand, F., & Freeman, W. T. (2013). Phase-Based Video Motion Processing. https://people.csail.mit.edu/nwadhwa/phase-video/

[3] https://rxian.github.io/phase-video/
