A SNEAK PEEK INTO THE FUTURE

Nguyen D. Le
Published in Sifu.art
4 min read · Nov 28, 2022

This week, implementing transfer learning has been exhausting. Take a look at this graph to see how complex a path the data has to pass through to produce the final judgement of right or wrong postures and movements.

Deep Learning Model Graph

Should we take a break? “Reaching for the stars, at least we won’t end up with a handful of mud.” The range of applications for what we are doing is vast. One of them is upping the game with sticks, ranging from a 15cm stick to a 3000cm long pole. Below is the longest one, with some fun to go along with it.

(turn up your volume) Pole with Hua Yuan Jia background music

In retrospect, many people have shed sweat and tears for the technology we use today.

Starting out with Support Vector Machines, I built a machine-learning model [1] for human posture and movement evaluation. That result was the first step toward what I have accomplished in computer vision today. After observing the SVM model’s limitations in depth feature extraction, I decided to try a mixed hardware and software solution with the Microsoft Kinect SDK [2] and Random Forests [3], and became a five-star rated freelancer in Kinect programming in the process [4].

The results of the Kinect-based model were very promising. However, those solutions had two significant obstacles: (1) the data and algorithms were closed source, so any extension work was hard, and (2) the hardware was expensive and bulky. I hoped Microsoft would fix those shortcomings, but it instead terminated the Kinect product line, leaving all my work to gather dust.

My dream of creating a virtual Kung Fu algorithm was revived in 2018, when a series of breakthroughs in 2D and 3D human pose estimation was made possible by Stacked Hourglass Network [7] techniques. Such progress was realized in VNect [5], OpenPose [6], and other similar projects. I applied transfer learning from the HG3D model [8] and produced workable results [9]. However, due to the nature of 2D-to-3D lifting from monocular camera input, problems arose, such as body part occlusion, no image registration, and the lack of camera intrinsic and extrinsic parameters at inference time.
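To make the monocular problem concrete, here is a minimal sketch of pinhole projection in plain Python. It is only an illustration, not the HG3D pipeline: the intrinsics (`FX`, `FY`, `CX`, `CY`) and the joint coordinates are hypothetical values chosen for the example. It shows that two 3D points on the same viewing ray land on the same pixel, so a single image cannot tell them apart without extra information.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics: focal lengths and principal point, in pixels.
FX, FY, CX, CY = 1000.0, 1000.0, 640.0, 360.0

# A hypothetical wrist joint 2 m from the camera, and a second point twice as
# far away along the same viewing ray (every coordinate scaled by 2).
wrist_near = (0.2, -0.1, 2.0)
wrist_far = tuple(2.0 * c for c in wrist_near)

pixel_near = project(wrist_near, FX, FY, CX, CY)
pixel_far = project(wrist_far, FX, FY, CX, CY)

# Both points project to (approximately) the same pixel.
print(pixel_near, pixel_far)
```

A lifting model has to resolve this ambiguity from learned priors about body proportions, which is one reason occlusion and unknown camera parameters hurt so much at inference time.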

With recent advancements in the field, whether utilizing fully supervised multi-view training data with ground-truth keypoints and camera parameters thanks to the Panoptic dataset [13], as in VoxelTrack [10], Epipolar Pose [11], and MvP [12], or unsupervised, as in MetaPose [14], human pose estimation research is more exciting than ever.
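As a rough illustration of why multiple views with known camera parameters help, the sketch below recovers depth in the simplest possible case: two identical, rectified cameras separated by a known horizontal baseline. This is textbook stereo geometry, not the method of any of the papers cited above, and every number in it (focal length `F`, principal point, `BASELINE`, the joint position) is a hypothetical value.

```python
def project(point, f, cx, cy, cam_x=0.0):
    """Project a 3D point into a camera located at (cam_x, 0, 0) looking down +z."""
    x, y, z = point
    return (f * (x - cam_x) / z + cx, f * y / z + cy)


def triangulate_rectified(pix_left, pix_right, f, cx, cy, baseline):
    """Recover (x, y, z) in the left camera frame from a rectified stereo pair."""
    disparity = pix_left[0] - pix_right[0]
    if disparity <= 0:
        raise ValueError("expected positive disparity for a point in front of both cameras")
    z = f * baseline / disparity      # depth from disparity
    x = (pix_left[0] - cx) * z / f    # back-project the left pixel at that depth
    y = (pix_left[1] - cy) * z / f
    return (x, y, z)


# Hypothetical shared intrinsics (pixels) and baseline (meters).
F, CX, CY = 1000.0, 640.0, 360.0
BASELINE = 0.5

joint = (0.4, -0.2, 4.0)  # hypothetical ground-truth joint position
left = project(joint, F, CX, CY, cam_x=0.0)
right = project(joint, F, CX, CY, cam_x=BASELINE)

recovered = triangulate_rectified(left, right, F, CX, CY, BASELINE)
print(recovered)
```

With the extrinsics (here just the baseline) and intrinsics known, the depth that a single view cannot determine follows directly from the disparity between the two views. Real multi-view systems deal with arbitrary camera poses, noisy 2D detections, and multiple people, which this sketch ignores.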

References

[1] Do, TN., Pham, TP., Pham, NK., Nguyen, HH., Tabia, K., Benferhat, S. (2020). Stacking of SVMs for Classifying Intangible Cultural Heritage Images. In: Le Thi, H., Le, H., Pham Dinh, T., Nguyen, N. (eds) Advanced Computational Methods for Knowledge Engineering. ICCSAMA 2019. Advances in Intelligent Systems and Computing, vol 1121. Springer, Cham. https://doi.org/10.1007/978-3-030-38364-0_17

[2] Kohli, P., & Shotton, J. (2013). Key Developments in Human Pose Estimation for Kinect. In Consumer Depth Cameras for Computer Vision. Retrieved from https://www.microsoft.com/en-us/research/publication/key-developments-in-human-pose-estimation-for-kinect/

[4] N. Le. Profile. (n.d.). Freelancer. https://www.freelancer.com/u/QHoach

[5] Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H. P., … & Theobalt, C. (2017). VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG), 36(4), 1–14.

[6] Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., & Sheikh, Y. (2018). OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CoRR, abs/1812.08008. Retrieved from http://arxiv.org/abs/1812.08008

[7] Newell, A., Yang, K., Deng, J. (2016). Stacked Hourglass Networks for Human Pose Estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9912. Springer, Cham. https://doi.org/10.1007/978-3-319-46484-8_29

[8] Zhou, X., Huang, Q., Sun, X., Xue, X., & Wei, Y. (2017). Towards 3d human pose estimation in the wild: a weakly-supervised approach. In Proceedings of the IEEE International Conference on Computer Vision (pp. 398–407).

[9] N. Le (2022, November 15). Moving from 2D Inference to 3D Estimation on Human Posture. Medium. https://medium.com/sifu-art/moving-from-2d-inference-to-to-3d-estimation-on-human-posture-3388b2bb7415

[10] Zhang, Y., Wang, C., Wang, X., Liu, W., & Zeng, W. (2022). Voxeltrack: Multi-person 3d human pose estimation and tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Kocabas, M., Karagoz, S., & Akbas, E. (2019). Self-supervised learning of 3d human pose using multi-view geometry. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1077–1086).

[12] Zhang, J., Cai, Y., Yan, S., & Feng, J. (2021). Direct multi-view multi-person 3d pose estimation. Advances in Neural Information Processing Systems, 34, 13153–13164.

[13] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., … & Sheikh, Y. (2015). Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3334–3342).

[14] Usman, B., Tagliasacchi, A., Saenko, K., & Sud, A. (2022). MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6759–6770).

[15] Pham, N. K., Morin, A., Gros, P., & Le, Q. T. (2010). Intensive use of correspondence analysis for large scale content-based image retrieval. In Advances in Knowledge Discovery and Management (pp. 57–76). Springer, Berlin, Heidelberg.
