Putting the skeleton back in the closet

Does action recognition need pose estimation?

Twenty Billion Neurons (twentybn) · 8 min read · Dec 14, 2020

Pose estimation models have become a commodity, and they are notably appealing for the visual overlay they provide: you simply take an image of a person, pass it through a neural network, and out pop the coordinates of all the body joints. But what if you wanted to go further and actually recognize actions? Can’t you just design hard-coded rules or train another neural network on top of the pose outputs? It sounds easy enough, but there are several trade-offs. Hard-coding rules is obviously not scalable and only an option for the simplest of cases. For any sophisticated action recognition, you would need to train a second network.
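To make that trade-off concrete, here is a minimal sketch (our own illustration, not part of any pipeline described below) of what a hard-coded rule on pose outputs might look like: a squat-rep counter built from a knee angle, assuming COCO-style keypoint indices and hand-tuned angle thresholds.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c, each an (x, y) array."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def count_squats(keypoints, down_thresh=100.0, up_thresh=160.0):
    """Count reps from per-frame keypoints of shape (T, num_joints, 2).

    Assumes hip/knee/ankle at indices 11/13/15 (COCO-style layout) and
    hand-tuned angle thresholds -- exactly the kind of brittle,
    per-exercise rule that does not generalize.
    """
    reps, is_down = 0, False
    for frame in keypoints:
        angle = joint_angle(frame[11], frame[13], frame[15])
        if angle < down_thresh:
            is_down = True
        elif is_down and angle > up_thresh:
            reps, is_down = reps + 1, False
    return reps
```

Every new exercise, camera angle, or body proportion needs its own rule and its own thresholds, which is exactly why this approach breaks down beyond the simplest cases.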

Training on top of a pose estimation model restricts the information available to your classifier to a static snapshot of coordinates, losing out on all the intermediate features and temporal dynamics that help improve accuracy. It also typically requires a larger dataset, because the second network has no pre-trained features encoding general human movement. It would be much easier if you could simply show the model a few examples of some action and get production-grade results, i.e. if the model had excellent few-shot learning capabilities out of the box.

In this post, we present an alternative: an end-to-end approach that involves fine-tuning the final layers of a 3D convnet pre-trained on a large video corpus of general human actions. We show that this approach outperforms pose-estimation-based pipelines on two new video datasets, MiniFitness and fLatLunges (fine-grained lateral lunges), especially in terms of few-shot learning.

Two new video datasets for fitness

MiniFitness and fLatLunges are two new datasets designed to benchmark the accuracy of action classifiers on fitness use cases, with 100 videos available per class in each. While this is not enough to train an accurate model from scratch, it should be sufficient to evaluate the transfer learning capabilities of pre-trained models. In addition to the full version, subsets of each dataset with 50, 20, 10 and 5 samples per class were used to evaluate the few-shot learning capabilities of the tested models. The two datasets mostly differ in the level of label granularity, with fLatLunges being significantly more fine-grained than MiniFitness.
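For reference, constructing these subsets amounts to sampling a fixed number of videos per class from the full training split. The sketch below assumes a simple JSON annotation format with a hypothetical file name and layout, since the datasets are not released yet:

```python
import json
import random
from collections import defaultdict

def make_subset(annotation_file, samples_per_class, seed=0):
    """Sample a fixed number of training videos per class.

    Assumes a JSON list of {"video": path, "label": class_name} entries;
    the actual dataset format may differ once the datasets are released.
    """
    with open(annotation_file) as f:
        annotations = json.load(f)

    by_class = defaultdict(list)
    for entry in annotations:
        by_class[entry["label"]].append(entry)

    rng = random.Random(seed)
    subset = []
    for label, entries in by_class.items():
        rng.shuffle(entries)
        subset.extend(entries[:samples_per_class])
    return subset

# e.g. a 5-samples-per-class training split for the few-shot experiments
few_shot_train = make_subset("minifitness_train.json", samples_per_class=5)
```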

MiniFitness overview

The MiniFitness dataset shows a person performing one of 12 fitness exercises:

  • Alternating Lateral Lunges
  • Alternating V ups
  • Dead Bugs
  • Dead Bugs (legs only)
  • Glute Hamstring Walkout
  • Inchworm
  • Spiderman Pushup
  • Standing Fire Hydrant (left)
  • Standing Fire Hydrant (right)
  • Standing T
  • Standing YTW
  • Yoga Pushup

In addition to the 12 exercises, the dataset contains 3 background classes, bringing the total number of classes to 15:

  • Doing nothing
  • Doing other things
  • No person visible

More than 300 people contributed to the dataset collection, with each recording videos in their own environment, resulting in a varied collection of backgrounds and lighting conditions.

Figure 1: Sample frames from four MiniFitness videos along with their corresponding labels: (a) Alternating V-Ups, (b) Standing T, (c) Glute Hamstring Walkout, (d) Spiderman Pushups

fLatLunges overview

This dataset enables evaluation of a model’s ability to detect subtle differences within the same exercise. In total, there are 13 classes, covering 10 lateral lunge variations and 3 background classes, with an average video length of 8 seconds. The lunge variations are:

  • Good form
  • No stepping
  • Not alternating
  • Stepping foot pointing away
  • Too fast
  • Too narrow
  • Too shallow
  • Torso bent forward
  • Torso bent sideways
  • Wrong knee bent

The three background classes are:

  • Doing nothing
  • No person visible
  • Other fitness moves

Differentiating between fine-grained exercise variations can make all the difference between a fitness app that simply counts reps and one that gives accurate, real-time feedback on form while still counting reps.

Figure 2: Sample frames from three fLatLunges videos along with their corresponding labels: (a) No stepping, (b) Torso bent forward, (c) Torso bent sideways

Both datasets will be publicly released soon.

The models we’ve considered

We compared the performance of fitness activity classifiers built using two distinct approaches: a pose-estimation-based pipeline and an end-to-end 3D CNN.

Pose estimation-based approach

With a pose-estimation-based approach, the pipeline consists of two components: a pose estimation model that translates input frames into a skeleton of key points, and a second neural network that learns to classify a sequence of input skeletons into the corresponding activity. A recent approach with excellent results is the use of a spatial-temporal graph convolutional network (ST-GCN) on top of a pose estimation model.

Figure 3: High-level overview of two-step pose-estimation-based approaches as described in Sijie Yan et al. (2018)

The pose key points on which both graph models were trained were extracted using the mmskeleton toolset available on GitHub. The pose model was run at 16 fps to match the runtime of the chosen end-to-end approach, providing annotated key points to the classifier networks at the same frame rate.
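Conceptually, the two-stage pipeline looks something like the sketch below. Here `pose_model` and `skeleton_classifier` are placeholders rather than the actual mmskeleton or ST-GCN/MS-G3D interfaces, which expect their own specific input formats:

```python
import torch

def classify_with_pose_pipeline(frames, pose_model, skeleton_classifier):
    """Minimal sketch of the two-stage pipeline: frames -> key points -> action logits.

    `pose_model` and `skeleton_classifier` stand in for the pose estimator
    (e.g. one from mmskeleton) and a graph network such as ST-GCN or MS-G3D;
    their exact input/output formats differ in practice.
    """
    keypoints = []
    with torch.no_grad():
        for frame in frames:                     # frames sampled at 16 fps
            kp = pose_model(frame)               # e.g. (num_joints, 2) per person
            keypoints.append(kp)

    # Only these stacked coordinates reach the classifier: every intermediate
    # visual feature computed by the pose network is discarded at this interface.
    skeleton_sequence = torch.stack(keypoints)   # (T, num_joints, coords)
    return skeleton_classifier(skeleton_sequence.unsqueeze(0))
```

Note how the hand-off between the two models happens through coordinates alone; this is the information bottleneck we come back to in the conclusion.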

For the second network in the pose-estimation-based pipeline, we picked from models used on the Kinetics-Skeleton benchmark, as listed on Papers with Code. As a baseline, we selected the ST-GCN implementation by Sijie Yan et al. (2018), one of the first to apply graph convolutions to skeleton-based action recognition, together with the model with leading performance and an available implementation, MS-G3D from Liu et al. (2020).

The models were implemented by reusing the code from the publicly available GitHub repositories for ST-GCN and MS-G3D.

We will share the full implementation to reproduce our results soon.

End-to-end approach

For the end-to-end approach, we selected the Strided-Inflated EfficientNet (SI-EN) model available in the Sense repository. A key feature of this 3D CNN is that it is trained end-to-end to go from pixels to activity labels, without relying on pose estimation, bounding boxes, or any other form of frame-by-frame analysis as an intermediate representation. The model was pre-trained on millions of short, labelled video clips showing a wide range of dynamic human activities happening in front of the camera, pushing the network to understand the dynamics of the human body. It is important to note that, of the classes in the datasets presented above, only the background classes (doing nothing, no person visible, etc.) were included in the pre-training.
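By contrast with the two-stage sketch above, the end-to-end counterpart is a single forward pass over the raw clip. In this sketch, `si_en_model` is again a placeholder, and the real preprocessing in the Sense repository is not shown:

```python
import torch

def classify_end_to_end(clip, si_en_model):
    """Single-stage sketch: a video clip goes straight from pixels to logits.

    No intermediate skeleton is produced; all intermediate features stay
    inside the network and remain available to the final layers.
    """
    with torch.no_grad():
        # clip: (1, channels, time, height, width) tensor of raw frames
        return si_en_model(clip)
```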

SI-EN uses EfficientNet-Lite4, a 2D CNN, as a backbone, with a few modifications to some of the 2D convolutional layers. Taking inspiration from Carreira and Zisserman (2018), we inflated 8 of the 2D convolutions temporally, effectively turning them into 3D convolutions. Two of the inflated convolutions use a temporal stride of two, enabling a lower-footprint output of 4 fps from a 16 fps input stream. More information can be found in our implementation here; a more detailed blog post describing the model will follow.
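As a rough illustration of what temporal inflation means (an illustrative PyTorch sketch, not the actual SI-EN code), a pre-trained 2D convolution can be turned into a 3D one by copying its kernel along a new time axis and rescaling, optionally with a temporal stride:

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3, time_stride: int = 1) -> nn.Conv3d:
    """Turn a pre-trained 2D convolution into a 3D one (I3D-style inflation).

    The 2D kernel is copied along the new temporal axis and rescaled so the
    inflated layer initially produces the same activations on a video whose
    frames are identical. Kernel size and stride values here are illustrative.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(time_stride, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        groups=conv2d.groups,
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D kernel along time and divide by the temporal extent.
        weight_3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
        conv3d.weight.copy_(weight_3d / time_kernel)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

With two such layers using a temporal stride of two, a 16 fps input stream is reduced to 4 fps at the output.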

Setting up our experiments

Our experiments were designed to evaluate whether training an ST-GCN or MS-G3D network on top of a pose estimation model leads to better performance on the MiniFitness and fLatLunges datasets than fine-tuning the final layers of SI-EN, with respect to both overall and few-shot learning performance.

Both the ST-GCN and MS-G3D networks were trained from scratch, with randomly initialized weights, across all experiments. For SI-EN, two versions of the same model were trained: one in which only the output layer was replaced with a logistic regression and all other layers were frozen (SI-EN), and another in which the last layer was replaced and the preceding 9 layers were fine-tuned (SI-EN9). For the latter, 9 layers were chosen so that at least two 3D convolution operations were included in the fine-tuning. Final results were averaged across five trials.
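In plain PyTorch, the two SI-EN configurations roughly correspond to the sketch below. How layers are grouped and counted here is a simplification of the real network, and `feature_dim` is assumed to match the backbone's output features:

```python
import torch.nn as nn

def prepare_for_finetuning(backbone: nn.Module, feature_dim: int,
                           num_classes: int, num_trainable_layers: int = 0):
    """Freeze a pre-trained backbone and attach a fresh classification head.

    num_trainable_layers=0 mirrors the SI-EN setting (logistic regression on
    frozen features); num_trainable_layers=9 approximates SI-EN9. Grouping
    "layers" as top-level children is a simplification of the real network.
    """
    for param in backbone.parameters():
        param.requires_grad = False

    if num_trainable_layers > 0:
        for layer in list(backbone.children())[-num_trainable_layers:]:
            for param in layer.parameters():
                param.requires_grad = True

    # A logistic regression head is a single linear layer trained with
    # softmax / cross-entropy on top of the (possibly frozen) features.
    head = nn.Linear(feature_dim, num_classes)
    return backbone, head
```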

What the cards revealed

Figure 4: Model performance across different subset sizes on (a) MiniFitness and (b) fLatLunges.

As shown in Figure 4(a), both end-to-end models fine-tuned on MiniFitness performed better than the graph networks trained on top of pose outputs. This is especially true in the low-sample, few-shot regime, supporting the idea that training on top of a model pre-trained on general human actions lends useful features to the new task. The models trained on top of the pose estimation baselines did not perform nearly as well in the low-data regime and were still outmatched when trained on the full dataset.

When it comes to the more fine-grained labels in the fLatLunges dataset, SI-EN9 performed better than all other models across all sample sizes, as shown in Figure 4(b). SI-EN performed better than ST-GCN and MS-G3D in the few-shot setting with 5 samples, and on par with ST-GCN between 10 and 50 samples. MS-G3D also performed worse than SI-EN with 10 samples, performed nearly the same with 20, and was better with 50 and 100.

The results can be seen in tabular form in tables 1 and 2 in the appendix.

So, pose estimation or end-to-end?

Pose estimation models have become the go-to approach for many who are considering building action recognition classifiers, largely because of the appeal of the skeleton visuals they provide. However, as our experiments show, if few-shot capabilities are what you need, you are better off training on top of a pre-trained model with established, relevant features.

When evaluated on both of the datasets we’ve presented, fine-tuning the end-to-end solution outperforms two state-of-the-art graph convolutional models trained from scratch on top of a sequence of pose estimates. In particular, our approach yields impressive results in low-data regimes, when only a handful of samples are available per class. Stacking two models, with the first outputting static pose coordinates, creates an information bottleneck that limits the features accessible to the second network. An end-to-end approach gives direct access to more useful intermediate features that make few-shot learning possible, reducing your data overhead.

Don’t just take our word for it, go ahead and train on top of our models yourself. We can’t wait to see what you come up with!

TL;DR: No.

Authors: Antoine Mercier, Guillaume Berger, Sunny Panchal, Florian Letsch, Cornelius Boehm, Ingo Bax, Roland Memisevic

Appendix

Table 1: Mean and 95% confidence interval of the test set accuracy across five trials on varying numbers of samples per class in the training set of the MiniFitness dataset.
Table 2: Mean and 95% confidence interval of the test set accuracy across five trials on varying numbers of samples per class in the training set of the fLatLunges dataset.
