Amit Kumar
4 min read · Jun 15, 2019

Week 2 | GSoC’19 | CCExtractor Development

Image source: https://ccextractor.org | Image credit: DigitalOcean

Continuing from the last blog: this week most of my time went into cleaning up the code for the RESTful APIs. I restructured the project so that if a new core library needs to be integrated in the future, it can be added easily. In PR #5, I first created a new Django app to hold the project’s core API endpoints. This time I was strict about the DRY principle and refactored the project’s modules wherever snippets were repeated.
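As a rough sketch of the idea (the app and module names here are hypothetical, not the actual ones from PR #5), keeping the core endpoints in a dedicated Django app means a future core library only has to plug in its own routes:

```python
# project/urls.py -- hypothetical layout illustrating the restructuring idea
from django.urls import include, path

urlpatterns = [
    # All core-library endpoints live in their own app, so integrating a
    # new core library later only means adding one more include() here.
    path("api/core/", include("core_api.urls")),
]
```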

Concurrently, I was also working on the video-processing part. Here is a small glimpse of its algorithm. Whenever a video is given as input, first and foremost the video’s metadata is parsed; the main fields are frames per second (fps for short), the total number of frames, and the duration of the input video. To get the timestamp of each frame, the frame number is divided by the fps. I also added an option to control the precision of the output. Here, precision means the number of frames to process per second (for example, if the video is 30 fps, the user may want to process only 24 frames per second). This option is not yet under the user’s direct control, but I’ll make it available later. Each frame is processed, and a dictionary is maintained that keeps the face_id as the key and an array of timestamps as its value. After all frames are processed, this dictionary is sent for further processing, where the coalescence of timestamps takes place: if two timestamps are very close to each other, they are chained together. The coalescence threshold is determined automatically based on the number of frames selected for processing.
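Here is a minimal sketch of that idea (function names and the example threshold are illustrative, not the project’s actual code):

```python
from collections import defaultdict

def timestamps_by_face(detections, fps):
    """detections: iterable of (frame_no, face_id) pairs."""
    result = defaultdict(list)
    for frame_no, face_id in detections:
        # Timestamp of a frame = frame number / frames per second.
        result[face_id].append(frame_no / fps)
    return result

def coalesce(timestamps, threshold):
    """Chain timestamps closer than `threshold` into (start, end) spans."""
    spans = []
    start = prev = timestamps[0]
    for t in timestamps[1:]:
        if t - prev > threshold:  # gap too large: close the current span
            spans.append((start, prev))
            start = t
        prev = t
    spans.append((start, prev))
    return spans

# At 30 fps with every frame processed, a threshold of a few frame
# intervals (e.g. 3 / 30 = 0.1 s) would chain consecutive detections.
faces = timestamps_by_face([(0, "face_0"), (1, "face_0"), (90, "face_0")], fps=30)
print(coalesce(faces["face_0"], threshold=0.1))  # [(0.0, 0.033...), (3.0, 3.0)]
```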

But I was not satisfied with this approach, as it was very time-consuming. I recently came across scikit-video, which uses FFmpeg behind the scenes.

Here is a benchmark of scikit-video versus OpenCV’s Python bindings (cv2).

Time taken to load a single frame:
scikit-video: 0.005018 s
cv2: 0.022890 s

Time taken to load 5,637 frames sequentially:
scikit-video: 8.149046 s
cv2: 55.439571 s
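Here is a rough sketch of how such a comparison could be reproduced (“video.mp4” is a placeholder; absolute timings will of course vary by machine and codec):

```python
import time
import cv2
import skvideo.io

start = time.perf_counter()
for frame in skvideo.io.vreader("video.mp4"):
    pass  # frames are RGB NumPy arrays, decoded by FFmpeg
print("scikit-video:", time.perf_counter() - start)

start = time.perf_counter()
cap = cv2.VideoCapture("video.mp4")
while True:
    ok, frame = cap.read()  # frames are BGR NumPy arrays
    if not ok:
        break
cap.release()
print("cv2:", time.perf_counter() - start)
```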

Clearly, this was an indicator that it was time to change the hammer right away. I also found that the new method (scikit-video) proved to be more robust.

Currently I’m using the vreader() method from scikit-video, which returns a generator, while vread() can be used to load the whole video in one go and outputs a NumPy array; for this use case I find the generator more useful. There is still room for optimization here, since I’m not actually reducing the frame size of the input. Beyond this, I have an even more robust way in mind to decrease the time taken and also increase the accuracy. Want to know the method? Stay tuned for the coming blogs. For the latest updates, I also actively maintain the GitHub kanban board.
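A minimal sketch of the two APIs (“video.mp4” is again a placeholder, and the scaling hint via outputdict is an assumption about one possible optimization, not something the project does yet):

```python
import skvideo.io

# vread() decodes the whole video into a single NumPy array of shape
# (num_frames, height, width, 3) -- simple, but memory-hungry.
video = skvideo.io.vread("video.mp4")

# vreader() yields one frame at a time, so memory use stays constant
# regardless of the video length.
for frame in skvideo.io.vreader("video.mp4"):
    pass  # placeholder for the per-frame face-processing step

# One possible optimization: ask FFmpeg to scale frames down while decoding.
for frame in skvideo.io.vreader("video.mp4", outputdict={"-s": "320x240"}):
    pass
```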

Let’s explore face embeddings a little. So, what exactly is a face embedding? Each face is compactly represented by a 128-dimensional vector (the FaceNet paper notes that a face can be represented using as little as 128 bytes). Now, why 128 and not 256 or 64? 128 was selected because, according to the FaceNet paper, the authors experimented with these dimensionalities, and below is the result.

Source: FaceNet (https://arxiv.org/abs/1503.03832)

The differences in the performance reported above are statistically insignificant, so they selected 128. However, it is possible that higher dimensionalities require more training to achieve the same accuracy. What kind of values are present in the 128-D embedding array? It contains floating-point values like [0.03861399, -0.04976186, …, 0.09530648, -0.05199577].
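As a hedged sketch, here is one common way to obtain such a 128-D embedding with the face_recognition library (dlib-based and similar in spirit to FaceNet; the image path is a placeholder, and this is not necessarily the project’s exact pipeline):

```python
import face_recognition

image = face_recognition.load_image_file("face.jpg")  # placeholder path
encodings = face_recognition.face_encodings(image)    # one 128-D vector per detected face

for embedding in encodings:
    print(embedding.shape)  # (128,)
    print(embedding[:4])    # small floating-point values
```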

Let’s check what the numbers actually look like.

[Two face embeddings rendered as 8x16-pixel images with matplotlib]
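A minimal sketch of how such a visualization could be produced (the embeddings here are random placeholder data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data standing in for two real 128-D face embeddings.
embeddings = [np.random.randn(128), np.random.randn(128)]

fig, axes = plt.subplots(1, 2)
for ax, emb, face_id in zip(axes, embeddings, ("FaceID 0", "FaceID 1")):
    # Reshaping 128 values into 8x16 turns the embedding into a tiny image.
    ax.imshow(emb.reshape(8, 16), cmap="gray")
    ax.set_title(face_id)
    ax.axis("off")
plt.show()
```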

Each image above is an 8x16-pixel image and therefore has a total of 128 dimensions, each pixel holding the value of one component of the embedding. You can somewhat tell the difference between FaceID 0 and FaceID 1. Below is a representation of the 128-D (high-dimensional) embeddings in 2D, using the result of PCA (Principal Component Analysis), which is used to reduce the dimensionality of a dataset. PCA uses the eigenvalues and eigenvectors of the data matrix; the eigenvectors of the covariance matrix have the property that they point along the major directions of variation in the data.

[2D scatter plot of the embeddings using PCA]

You can observe how embeddings of similar faces stay close to each other while remaining apart from embeddings of different faces. The separation is large here because the plot shows the embeddings of a woman’s face and a man’s face.
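A minimal sketch of this projection with scikit-learn (placeholder data; in the post, `embeddings` would be the real (n_faces, 128) matrix and `labels` the face IDs):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 10 embeddings each for two faces.
embeddings = np.random.randn(20, 128)
labels = np.array([0] * 10 + [1] * 10)

# PCA projects the 128-D vectors onto the 2 directions of largest variance.
points_2d = PCA(n_components=2).fit_transform(embeddings)

for face_id in np.unique(labels):
    pts = points_2d[labels == face_id]
    plt.scatter(pts[:, 0], pts[:, 1], label=f"FaceID {face_id}")
plt.legend()
plt.show()
```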

Let’s also explore a 3D representation of the same embeddings using t-SNE (t-Distributed Stochastic Neighbor Embedding). Unlike PCA’s linear projection, t-SNE uses a probabilistic approach.

[3D scatter plot of the embeddings using t-SNE]
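And a corresponding sketch with scikit-learn’s TSNE (same placeholder data as above; perplexity must be smaller than the number of samples):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(20, 128)  # placeholder (n_faces, 128) matrix

# n_components=3 gives a 3D layout; perplexity must be < n_samples.
points_3d = TSNE(n_components=3, perplexity=5).fit_transform(embeddings)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(points_3d[:, 0], points_3d[:, 1], points_3d[:, 2])
plt.show()
```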

All the above outputs were produced with scikit-learn, seaborn, matplotlib, and NumPy. I’ll write more insightful posts in the coming weeks, so stay tuned!

Profile links: LinkedIn | GitHub | Twitter

References:

http://www.scikit-video.org/