So, we are almost halfway through the coding period of GSoC 2019, and this blog post gives an update on the final modifications made to the pipeline.
Modifications in the pipeline: The pipeline finalized earlier needed a few modifications to work properly for the task at hand. In the previous pipeline (see this post), three networks (proposal, classification, and localization) combine to provide accurate temporal segmentation of hand gestures in a video. However, our goals also include properly classifying the hand gesture itself, which that architecture does not address. The pipeline was therefore modified to incorporate the approach described in this paper, which not only segments the hand gestures temporally but also assigns each gesture a label from the set of available labels. Figure 1 shows the final pipeline.
The classification module consists of two 3D ConvNets: a detector network and a classifier network. When a video containing multiple gestures is fed to the detector network, it performs a frame-by-frame binary classification between the gesture and no-gesture classes. Consecutive frames detected as gesture are then grouped together, and each group is passed separately to the classifier network, which assigns that hand gesture clip a label from the list of available gesture labels. All in all, we get a temporally segmented video in which each frame carries either the no-gesture label or a specific gesture label, as required.
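The detector-then-classifier flow above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual implementation: the per-frame detector output is given directly as a hypothetical 0/1 mask, and the grouping step shows how consecutive gesture frames would be turned into segments before being handed to the classifier network.

```python
# Sketch of the segmentation step between the detector and classifier networks.
# The detector's per-frame output is assumed here as a simple 0/1 mask
# (0 = no-gesture, 1 = gesture); in the real pipeline it comes from a 3D ConvNet.

from itertools import groupby

def segment_gestures(frame_mask):
    """Group consecutive gesture frames (mask value 1) into (start, end) index pairs."""
    segments = []
    idx = 0
    for is_gesture, run in groupby(frame_mask):
        length = len(list(run))
        if is_gesture:
            segments.append((idx, idx + length - 1))  # inclusive frame indices
        idx += length
    return segments

# Example: detector output for a 10-frame clip
mask = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
print(segment_gestures(mask))  # [(1, 3), (6, 7)]
```

Each `(start, end)` segment would then be cropped from the video and passed on its own to the classifier network, which assigns it one of the available gesture labels.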
So, the next challenge in line is to implement the detector and classifier networks and train them on hand gesture datasets, giving a complete architecture which can annotate the NewsScape Dataset.
That’s all for this post. Cheers!