Deepiracy: A video piracy detection system using Longest Common Subsequence and Deep Learning

Carlos Toxtli
HCI@WVU
May 25, 2018 · 10 min read

By Claudia Saviaga and Carlos Toxtli

Video clip localization is important for real-life applications such as detecting copyright issues in videos, which has become crucial due to the increasing amount of video uploaded to social media and video platforms. In this article we present a novel approach to video clip localization. Our method is capable of detecting videos that have suffered distortions (such as changes in illumination or rotation), and even screened content, i.e., content that was recorded with a smartphone in a movie theater. We combine the Longest Common Subsequence, as a way to measure similarity between videos, with neural networks for object detection. Our results demonstrate the efficiency of combining these two methods. We also report the performance that our method achieves in real-time tests.

Introduction

Video clip localization consists of identifying video segments within a video stream. This is important for real-life applications such as detecting copyright issues on video platforms. Managing the copyright of the huge number of videos uploaded every day is a critical challenge for these platforms. There are several techniques for video localization; for instance, some approaches use bipartite graph matching to measure the similarity of a video clip with a target video stream. However, these approaches do not cope with the important problem known as near-duplicate video clip/copy detection. Near-duplicate video copies are those derived from the same original copy by some global transformation, such as video re-formatting or color shifting, or by some local changes, such as frame editing. Other methods deal with video copies by using principles of temporal consistency, which are difficult to thwart without significantly degrading the viewing experience, contrary to the pirate's goals. These approaches take advantage of a temporal feature to index a reference library in a manner that is robust to popular spatial and temporal transformations in pirated videos. However, they have limitations; for example, a pirate could temporally smooth a video's gradient features to hide the sudden illumination changes used by these methods.

Deepiracy

In this article we introduce Deepiracy, an open-source anti-piracy tool that is able to detect distorted video clips in target streams in real time by combining deep learning with the Longest Common Subsequence (LCS), a popular string-matching algorithm. Figure 1 shows the system components:

Figure 1. Flowchart of the proposed approach.

The process works as follows:

  • We find the first frame (anchor frame) from the source video that matches a frame in the target video.
  • Then we run a neural-network-based object classifier to detect the objects in the frames of the source video. For our tests we used TFSlim + SSDLite + MobileNet_V2 + COCO, as it is conveniently lightweight and commonly used for real-time applications.
  • Once the objects are detected, we convert them into symbols of an alphabet.
  • Then, we track the objects found in the anchor frame of the source video through the target video, advancing frame by frame until no object is present (i.e., enough frames have passed that the objects are no longer within the scene of the target video).
  • We skip the same number of frames in the source video to reach the frame in which the objects were lost in the target video; this frame becomes the new anchor frame. By skipping frames and using anchor frames, we avoid comparing and processing frame by frame, which would be too costly in terms of performance.
  • We repeat the process from step 2 until the last frame in the target video is processed. A simplified sketch of this loop is shown below.
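
The loop can be sketched in a few lines of Python. This is a simplified illustration rather than the exact code in the repository; `detect_objects`, `to_symbols`, and `track_until_lost` are hypothetical helpers standing in for the object detector, the alphabet encoding, and the tracker described below, and `lcs` is the Longest Common Subsequence routine introduced later in the article.

```python
def locate_clip(source_frames, target_frames, threshold=0.8):
    """Simplified sketch of the anchor-frame matching loop (hypothetical helpers)."""
    src_idx, tgt_idx = 0, 0
    while tgt_idx < len(target_frames) and src_idx < len(source_frames):
        # Steps 1-3: detect objects in the current source anchor frame and
        # convert them into alphabet symbols.
        objects = detect_objects(source_frames[src_idx])        # hypothetical helper
        src_symbols = to_symbols(objects)                       # hypothetical helper

        # Step 4: track those objects in the target video until they leave the scene.
        frames_advanced, tgt_symbols = track_until_lost(
            objects, target_frames, start=tgt_idx)              # hypothetical helper

        # Compare the two symbol sequences with the LCS similarity measure.
        if len(lcs(src_symbols, tgt_symbols)) < threshold * len(src_symbols):
            return False  # not enough shared objects at this anchor frame

        # Step 5: skip the same number of frames in the source video; the frame
        # we land on becomes the new anchor frame.
        src_idx += frames_advanced
        tgt_idx += frames_advanced

    return src_idx >= len(source_frames)  # True if the whole clip was matched
```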

The methods that we used in this process are the following:

  • Feature Extraction (SURF): Feature extraction allows us to represent the content of images as vectors. It also allows us to reduce the number of key-points from the source image that have to be compared against the target image. We used SURF, a popular feature detector that is robust to several image distortions.
  • Feature Matching (FLANN and KD-Trees): Feature matching allows us to measure the distance (e.g., Hamming or Euclidean) between the feature vectors extracted with SURF. FLANN builds KD-trees to speed up this computation.
  • Homography (RANSAC): RANSAC is a popular algorithm that we use to identify distortions. It chooses a subset of points from one image, matches them to the other image, and computes the transformation that minimizes the re-projection error.
  • Tracking (KCF): KCF is a correlation-filter-based tracker. It is a predictive type of tracker, as it tries to predict the next position of the object.
  • Object Detection Algorithm: We use the TensorFlow Object Detection API, an open-source framework built on top of TensorFlow to train and deploy object detection models based on neural networks. The model that we use is SSDLite + MobileNetV2 trained on the COCO dataset, which covers 80 object categories.
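
To make these classical vision steps concrete, here is a minimal OpenCV sketch of feature extraction, matching, and homography estimation between a source frame and a target frame. It is an illustration under assumptions rather than the repository's exact code: SURF lives in the opencv-contrib (non-free) build, and the parameters below (Hessian threshold, Lowe's ratio, RANSAC reprojection error) are typical defaults, not values taken from our evaluation.

```python
import cv2
import numpy as np

def estimate_homography(source_frame, target_frame, min_matches=10):
    """Match SURF features between two frames and estimate a homography (or None)."""
    gray_src = cv2.cvtColor(source_frame, cv2.COLOR_BGR2GRAY)
    gray_tgt = cv2.cvtColor(target_frame, cv2.COLOR_BGR2GRAY)

    # Feature extraction (SURF, from opencv-contrib's non-free module)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp_src, des_src = surf.detectAndCompute(gray_src, None)
    kp_tgt, des_tgt = surf.detectAndCompute(gray_tgt, None)
    if des_src is None or des_tgt is None:
        return None

    # Feature matching (FLANN with KD-trees) plus Lowe's ratio test
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des_src, des_tgt, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]
    if len(good) < min_matches:
        return None

    # Homography estimation (RANSAC) from the matched key-points
    src_pts = np.float32([kp_src[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    tgt_pts = np.float32([kp_tgt[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src_pts, tgt_pts, cv2.RANSAC, 5.0)
    return H

# Tracking (KCF): initialize once on a detected bounding box, then update per frame.
# tracker = cv2.TrackerKCF_create()
# tracker.init(first_frame, (x, y, w, h))
# ok, box = tracker.update(next_frame)
```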

Transforming images to symbols of an alphabet

In this article we formulate the problem of video sequence-to-sequence matching as a pattern matching problem. That is, we capture the information in a video sequence as a string representation. More specifically, we use the Longest Common Subsequence (LCS) as a measure of the similarity between the sequences (a small code sketch of the LCS computation is given after the first example below).
The LCS representation provides an intuitive way to model the objects observed in the video sequences. We tried two approaches:

Modeling categories of objects: Figure 2 shows an example of the category representation. In this case the algorithm assigns the same alphabet symbol to every object of a given category; here P represents a person and B represents a bottle. In frame a) the algorithm assigns a P to each person and a B to the bottle, and the same happens in frame b) even though the persons and the bottle are different.

Figure 2. Category Representation: algorithm assigns a symbol within alphabet to the same category object, in this case P for Person and B for bottle.

The Longest Common Subsequence (LCS) table can be found in Figure 3. In this case S1 = {PPBB}, S2 = {PPB}, and the LCS is PPB. For this example the alphabet has the symbols {P, B}.

Figure 3. LCS table (Category Representation).
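
For reference, here is a minimal dynamic-programming LCS in Python, applied to the category example above. This is a generic textbook sketch, not necessarily the implementation used in the Deepiracy repository.

```python
def lcs(s1, s2):
    """Longest Common Subsequence via dynamic programming.

    Works on any sequences of hashable symbols (strings, lists of object IDs, ...)
    and returns one LCS as a list of symbols.
    """
    m, n = len(s1), len(s2)
    # dp[i][j] = length of the LCS of s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack through the table to recover one longest common subsequence
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

print("".join(lcs("PPBB", "PPB")))  # -> "PPB", matching the table in Figure 3
```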

Modeling individual objects: Figure 4 shows an example of the individual object representation. In a) the algorithm assigns a distinct symbol from the alphabet to each object detected in the frame. In this case it assigns symbols 1 and 2 to the persons and 3 to the bottle.

Figure 4. Individual Object Representation: algorithm assigns a symbol within alphabet to each object, 1 and 2 to the persons and 3 to the bottle.

Then it looks for an exact match in frame b) of the other video. In Figure 5 we can see that S1 = {1,2,3,4}, S2 = {2,3,4}, and LCS = {2,3,4}; the alphabet in this case has the symbols {1,2,3,4}.

Figure 5. Individual Representation.
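
Because the `lcs` sketch above works on any sequence of hashable symbols, the individual representation can reuse it directly with integer object IDs:

```python
s1 = [1, 2, 3, 4]   # objects detected in the source frame (Figure 4a)
s2 = [2, 3, 4]      # objects found in the target frame (Figure 4b)
print(lcs(s1, s2))  # -> [2, 3, 4], matching Figure 5
```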

A working example of the proposed approach can be seen in Figure 6, in which we compare two videos that we call source and target and use the alphabet {a,b}. The first step is to detect the objects in the source video (frame 1) and assign a symbol from the alphabet to each one; we then track those objects in the target video (also frame 1). For simplicity we only track the two persons that appear in the video, so our alphabet has only two symbols. We can then skip frames in the target video (frames 50 and 100) and keep tracking until the similarity measure drops below a predefined threshold, that is, until there are not enough objects from the source video present in the target video (in this case we use 0.8 as our threshold). For object detection, at least one of the detected objects in the source video must be present in the target video.

Figure 6. Proposed approach
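
In code, the stopping condition from this example is just a ratio over the LCS length; the variable names below are illustrative, and 0.8 is the threshold mentioned above.

```python
src_symbols = ["a", "b"]   # the two tracked persons in the source anchor frame
tgt_symbols = ["a", "b"]   # symbols still detected in the current target frame

similarity = len(lcs(src_symbols, tgt_symbols)) / len(src_symbols)
keep_tracking = similarity >= 0.8   # stop skipping frames once this check fails
```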

Tests and Results

To test our proposed method we followed two approaches:

  • Exact Matching: We compared the same image in the source and the target as seen in Figure 7.
Figure 7. Exact Matching Approach
  • Inexact Matching: We compared an image against screened content (the same content played back on another screen and re-recorded), as seen in Figure 8.
Figure 8. Inexact Matching Approach

Exact Matching Tests
For exact matching we used two videos with the following characteristics:

  • Video 1: Contains people and objects, with a duration of 2:09 and 3 different scenes.
  • Video 2: Contains only people, with a duration of 1:33 and 2 different scenes.

For each video we tested the feature descriptor, the category representation, and the individual representation.
The videos for the exact matching tests can be accessed here:

Only Feature Descriptor Video:
Video 1: https://youtu.be/heqs1AOBPoo
Video 2: https://youtu.be/mhrnrH20PkI

Object Detection Category Video:
Video 1: https://youtu.be/W6mIuL2-u0c
Video 2: https://youtu.be/jlk7a2K5r_o

Object Detection Individual Video:
Video 1: https://youtu.be/7pM-b7hQjPU
Video 2: https://youtu.be/ldTftRGMx2Y

Exact Matching Results

In Table 1 and Figure 9 we can see that representing the objects with individual symbols of the alphabet is the best approach, as we are able to reduce the time by 36.77% for video 1 and by 46.21% for video 2.

Table 1. Exact matching time reduction
Figure 9. Time Reduction Exact Matching

The time reduction is bigger for video 2 because it contains only two different scenes, which means that the recalculation has to be done fewer times than for video 1, which contains one more scene. Figure 9, Figure 10, and Figure 11 show the detailed graphs related to time, skips, and maximum number of skips.

Figure 10. Metrics Exact Matching Video 1
Figure 11. Metrics Exact Matching Video 2

Inexact Matching Tests
For inexact matching we used two videos with the following characteristics:

  • Video 1: The source video contains scenes with two people in it, while the target video is a recording of the same scene played back on a cell phone.
  • Video 2: The source video is a scene with two people and one object, while the target video is a recording of the source video played back on a cell phone.

For each video we tested the feature descriptor, the category representation, and the individual representation.

The videos for the inexact matching tests can be found here:

Only Feature Descriptor Video:
Video 1: https://youtu.be/ARhP4TwU314
Video 2: https://youtu.be/XtAHfkZFPKU

Object Detection Category Video:
Video 1: https://youtu.be/DSsqcN8dQn8
Video 2: https://youtu.be/rC2feFcIzx4

Object Detection Individual Video:
Video 1: https://youtu.be/_4dmleAOdq0
Video 2: https://youtu.be/-AiivR6ps7E

Inexact Matching Results

In Table 2 and Figure 12 we can see that representing the objects with individual symbols of the alphabet was also the best approach, as we were able to reduce the time by 39.52% for the first video and by 62.52% for the second.

Table 2. Inexact matching time reduction
Figure 12. Time Reduction Inexact Matching

In this case the time reduction is influenced not only by the number of scenes in each video but also by the shaking of the smartphone held by the person doing the test. Figure 12, Figure 13, and Figure 14 show the detailed graphs related to time, skips, and maximum number of skips. The final comparison of both methods can be seen in Figure 12.

Figure 13. Metrics Inexact Matching Video 1
Figure 14. Metrics Inexact Matching Video 2

Exact vs. Inexact

Inexact matching tended to be faster than exact matching because the target device's screen was usually smaller than the source, so processing the target frames was faster, as shown in Figure 15.

Figure 15. Time Reduction Exact vs Inexact Matching

Real-time tests

We tested the three inexact-matching approaches in real time, with a mobile device playing a video from start to end while being recorded by a laptop webcam. In this setting the important measurement is the number of frames skipped, since the processing time is the same as the video duration.

The videos for the real-time tests can be accessed here:

Only Feature Descriptor video:
Video 1: https://youtu.be/YO68ybWERDc
Video 2: https://youtu.be/1Krhz-3Wiao

Object Detection Category Video:
Video 1: https://youtu.be/zKz9i0BjTe8
Video 2: https://youtu.be/VdMIOV7Iz28

Object Detection Individual Video:
Video 1: https://youtu.be/jh44X8gvKbQ
Video 2: https://youtu.be/7hgg85RC9Kk

Real-time Results

The results in Figure 16 and Figure 17 show that both videos skipped more frames in the feature description condition, which means that more anchor frames were used and more processing was needed. In the object detection settings, the individual approach performed much better in terms of the number of frames skipped.

Figure 16. Real Time Matching Video 1
Figure 17. Real Time Matching Video 2

In terms of video fluency, the individual object approach felt more fluid than the rest of the methods; it averaged 12 frames per second, compared with 10 for category-based object detection and only 5 for the feature description approach.

The paper with the full evaluation can be found here.

Conclusions

In this article we proposed a method for video clip localization and showed that it is capable of detecting piracy in screened videos. Our approach uses neural networks for object detection and the Longest Common Subsequence (LCS) to translate the objects in videos into patterns. Neural network models keep getting faster, which makes this an increasingly practical approach for alphabet-based sequence matching. Our results showed that matching individually detected and tracked objects performed better than matching category-based objects, and much better than feature description matching.

Code

The code of Deepiracy can be downloaded from here: https://github.com/toxtli/deepiracy
