Video Summarization using Keyframe Extraction and Video Skimming

Shruti Jadon
2 min read · Feb 9, 2018


Many algorithms claim to summarize videos. This blog walks through some of the standard methods and discusses the outcomes obtained.

The code for this project is available in our GitHub repository, and the paper can be found on arXiv.

For this project, I used both keyframe extraction and video skimming. For static keyframe extraction, we extract low-level features using uniform sampling, image histograms, SIFT, and image features from a Convolutional Neural Network (CNN) trained on ImageNet. We also use different clustering methods, including K-means and Gaussian clustering. We then build video skims around the selected keyframes to make the summary more fluid and understandable for humans. We take inspiration from VSUMM, a prominent method in video summarization.
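As a rough illustration of the skimming step, here is a minimal sketch that takes a list of selected keyframe indices and writes out a short window of frames around each one using OpenCV. The function name, window size, and mp4 output are illustrative choices, not the exact code from our repository:

```python
# Minimal sketch: build a video skim from a short window of frames
# around each selected keyframe index.
import cv2

def build_skim(video_path, keyframe_indices, out_path, window=15):
    """Write a summary video containing `window` frames on either side of
    each keyframe. `keyframe_indices` is assumed to come from a clustering
    step such as K-means over per-frame features."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))

    # Collect the frame indices that belong to the skim.
    keep = set()
    for k in keyframe_indices:
        keep.update(range(max(0, k - window), k + window + 1))

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            writer.write(frame)
        idx += 1

    cap.release()
    writer.release()
```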

Methods Used:

  1. Uniform Sampling
  2. Image Histogram
  3. Scale-Invariant Feature Transform (SIFT)
  4. VSUMM: This technique has been one of the fundamental techniques in video summarization in the unsupervised setup. The algorithm uses the standard K-means algorithm to cluster features extracted from each frame. The original paper proposes color histograms as features. A color histogram is a 3-D tensor, where each pixel’s values in the RGB channels determine the bin it falls into. Since each channel value ranges from 0 to 255, 16 bins are usually taken per channel, resulting in a 16×16×16 tensor. For computational reasons, we compute a simplified version of this histogram in which each channel is treated separately, giving a 48-dimensional feature vector per frame. The clustering step suggested in the original paper is slightly different, but the simplified color histograms give performance comparable to the true color histograms (a minimal sketch of this histogram-plus-K-means pipeline appears after this list). We also tried features extracted from the 2nd fully connected layer of VGG16, clustered with K-means.
  5. ResNet16 on ImageNet: The same approach as with VGG16, but using ResNet16, where the output of the last layer is used as the feature vector. We can obtain this by chopping off the layer before the loss function (a sketch of the CNN feature-extraction step also appears after this list).
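For reference, here is a minimal sketch of the simplified 48-dimensional color-histogram features and K-means keyframe selection described in the VSUMM item above. The function names, the sampling rate, and the choice of picking the frame closest to each cluster centre are illustrative assumptions, not the exact implementation in our repository:

```python
# Sketch: simplified color histograms (16 bins per RGB channel, concatenated
# to 48 dims) plus K-means clustering to pick one keyframe per cluster.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def color_histogram_features(video_path, sample_rate=5):
    """Return a (num_sampled_frames, 48) feature matrix. Frames are uniformly
    sampled every `sample_rate` frames (an illustrative choice)."""
    cap = cv2.VideoCapture(video_path)
    feats, indices, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_rate == 0:
            hist = [cv2.calcHist([frame], [c], None, [16], [0, 256]).flatten()
                    for c in range(3)]           # one 16-bin histogram per channel
            feats.append(np.concatenate(hist))   # -> 48-dim vector per frame
            indices.append(idx)
        idx += 1
    cap.release()
    return np.array(feats), np.array(indices)

def select_keyframes(features, frame_indices, num_keyframes=5):
    """Cluster the features with K-means and pick, for each cluster, the
    sampled frame closest to the cluster centre."""
    km = KMeans(n_clusters=num_keyframes, n_init=10).fit(features)
    keyframes = []
    for c in range(num_keyframes):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(frame_indices[members[np.argmin(dists)]])
    return sorted(keyframes)
```

The resulting keyframe indices can then be passed to a skimming step like the one sketched earlier.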
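Similarly, here is a hedged sketch of extracting deep features from the 2nd fully connected layer ("fc2") of a Keras VGG16 pretrained on ImageNet; the same idea applies to the ResNet variant by taking the output of its final layer instead. The helper below is illustrative, not our repository code:

```python
# Sketch: per-frame deep features from VGG16's fc2 layer, to be clustered
# with the same K-means routine as above.
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")
fc2_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def vgg16_features(frames):
    """`frames` is a list of BGR frames (as read by OpenCV); returns an
    (n, 4096) feature matrix from VGG16's fc2 layer."""
    batch = []
    for frame in frames:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        batch.append(cv2.resize(rgb, (224, 224)).astype("float32"))
    batch = preprocess_input(np.stack(batch))
    return fc2_model.predict(batch, verbose=0)
```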
Sample results (frames) obtained for the videos.

Please cite us if you find our code, paper, or explanation helpful:

Jadon, S., & Jasim, M. (2019). Video Summarization using Keyframe Extraction and Video Skimming. arXiv preprint arXiv:1910.04792.

Are you preparing for an upcoming Machine Learning/Data Science interview? If yes, make sure to check out https://www.datasciencepreparation.com/
