Intro to Posture Detection, & Thinking More on OpenPose I

Henry Heng LUO
14 min read · Jan 21, 2024


Background Review

Artificial Intelligence (AI) is a technological science that studies and develops the theories, methods, technologies and application systems used to simulate, extend and expand human intelligence. As a branch of computer science, it attempts to understand the essence of intelligence and to produce intelligent machines that can react in ways similar to human intelligence. Research in this field includes robotics, language recognition and translation, image recognition and processing, automatic speech recognition, natural language processing and text-to-speech, expert systems, and so on. Since the birth of artificial intelligence, its theory and technology have become increasingly mature, and its fields of application have continued to expand. It is conceivable that the technological products brought by artificial intelligence in the future will be “containers” of human wisdom. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is not human intelligence, but it can think like humans and may even exceed human intelligence.

Artificial intelligence is a very challenging science and technology. Scholars engaged in this field must understand computer science, information technology, psychology, philosophy and other relevant subjects. Artificial intelligence covers a very wide range of sciences and is composed of different fields, such as machine learning and computer vision. Generally speaking, one of the main goals of artificial intelligence research is to make machines capable of tasks that usually require human intelligence.

There are also more fundamental and theoretical definitions of artificial intelligence. Artificial intelligence is the intelligence demonstrated by machines, designed and improved by human engineers to solve problems by mimicking functions or processes of the human brain such as perception, cognition and decision-making. In the majority of cases, AI works by running neural networks [1–3]. The workflow can be divided into two phases: training and validation, then inference. A model is trained and validated on prepared labelled or unlabelled datasets; once the model structure and parameters are determined, it is deployed to complete real tasks. The model structure and hyper-parameters can be improved to reach higher accuracy, the training dataset can be improved to cover more general, representative samples, and the computation platform can be improved to converge faster and reach higher accuracy [3, 4].
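As a minimal sketch of this two-phase workflow (the model, data and hyper-parameters below are toy placeholders, not a real system):

```python
import torch
import torch.nn as nn

# Toy labelled dataset (random, only to make the sketch runnable).
X = torch.randn(256, 16)
y = (X.sum(dim=1) > 0).long()

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Phase 1 - training/validation: fit the parameters on labelled data.
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Phase 2 - inference: parameters are frozen and the model is deployed.
model.eval()
with torch.no_grad():
    pred = model(torch.randn(1, 16)).argmax(dim=-1)
```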

On the other hand, questions are always raised about the limitations or boundaries of artificial neural networks (ANNs). Borel measurable functions can be learned [5], especially approximately polynomial functions, on which an ANN converges fast [5–8]. Other functions, such as odd/even or prime/non-prime classification, are difficult for an ANN to learn [9, 10]. An ANN has its own characteristic structures, which makes it hard to approximate problems that share few similar structural properties. For such problems, redefining the problem and converting the data are essential before an ANN can deal with them [11–14].
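To make the point about converting data concrete, here is a toy sketch of my own, following the parity example in [9, 10]: fed raw integers, a small network struggles with odd/even classification, but after re-encoding each integer as its binary digits the label depends only on the lowest bit and becomes trivial to learn.

```python
import numpy as np

def to_bits(n, width=16):
    # Re-encode an integer as its binary digits, lowest bit first.
    return np.array([(n >> i) & 1 for i in range(width)], dtype=np.float32)

nums = np.arange(1000)
X_raw = nums.reshape(-1, 1).astype(np.float32)   # hard representation for an ANN
X_bits = np.stack([to_bits(n) for n in nums])    # easy representation
y = nums % 2                                     # odd/even labels
# A small MLP trained on X_bits converges quickly; on X_raw it does not.
```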

Nowadays, the training dataset is more and more important for an ANN. We can collect data from different applications, but that alone does not ensure the data are all “good”: high quality data lead to high quality algorithms, so data cleansing is very significant. Moreover, data are scattered across different platforms and applications. For example, in the seminar “Use of Big Data in Weather Services at the Hong Kong Observatory”, training data were collected from global weather data, remote sensing data (satellite, radar, lidar, etc.), weather photos and local observations (since 1883), with a total daily data volume of more than 20 TB. And this is only meteorological data. If we also take account of non-meteorological data, such as traffic data, incidents or news, flight data and TV weather programs, the total data volume exceeds 90 TB. Therefore, data fusion is vital for ANN learning [15–17].

Also, in the seminar “To diagnose a community using artificial intelligence: Assessing the primary, secondary and tertiary prevention needs of the residents and how their built environment puts them at risk of COVID-19”, the lecturer used sequential datasets collected from various aspects and fed them into Long Short-Term Memory [18] and convolutional units [19] to output ratings of buildings’ hygiene risk levels.
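A hypothetical sketch of such a pipeline (module and parameter names are my own, not from the seminar): convolutional units extract local patterns from each time step, an LSTM models the temporal dependency, and a linear head outputs the hygiene risk level.

```python
import torch
import torch.nn as nn

class HygieneRiskNet(nn.Module):
    def __init__(self, n_features=8, n_levels=3):
        super().__init__()
        self.conv = nn.Conv1d(n_features, 16, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(16, 32, batch_first=True)
        self.head = nn.Linear(32, n_levels)        # risk-level logits

    def forward(self, x):                          # x: (batch, time, features)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        _, (h_n, _) = self.lstm(h)                 # last hidden state
        return self.head(h_n[-1])

logits = HygieneRiskNet()(torch.randn(4, 30, 8))   # 4 buildings, 30 time steps
```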

Furthermore, we can use neural networks to deal with various image data. Nowadays CCTV is more and more popular: we can record real-time video with cameras installed at various places and use deep neural networks to detect human poses such as walking, running, falling down, asking for help, and so on. The open-source OpenPose model can complete these tasks [20–22].

Intro to OpenPose

The OpenPose model was proposed by researchers at Carnegie Mellon University [23–25]. The model uses a VGG-19 backbone to extract features [26–28], to which two extra convolutional layers, conv4_3 and conv4_4, are added. After that come an initial stage and five refinement stages. OpenPose can detect 15, 18 or 25 keypoints of the whole body, such as left elbow and left wrist, right hip and right knee, left eye and left ear, etc. The final Average Precision (AP) is about 48.6% on the COCO validation subset.

Figure 1. OpenPose workflow
Figure 2. 18- & 25-keypoint body skeleton detection
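For readers who want to try the detection step themselves, one common route (a sketch, not the official OpenPose pipeline) is to load the publicly released COCO Caffe weights with OpenCV’s dnn module; the file names below are the ones distributed by the CMU project, so adjust the paths to your setup.

```python
import cv2

net = cv2.dnn.readNetFromCaffe("pose_deploy_linevec.prototxt",
                               "pose_iter_440000.caffemodel")

img = cv2.imread("person.jpg")
blob = cv2.dnn.blobFromImage(img, 1.0 / 255, (368, 368),
                             (0, 0, 0), swapRB=False, crop=False)
net.setInput(blob)
out = net.forward()        # (1, 57, 46, 46): 19 heatmaps + 38 PAF channels

heatmaps = out[0, :19]     # 18 keypoint heatmaps + 1 background channel
pafs = out[0, 19:]         # part affinity fields (x/y component per limb)

# Crude single-person decoding: take the argmax of each keypoint heatmap.
h, w = img.shape[:2]
points = []
for hm in heatmaps[:18]:
    _, conf, _, (x, y) = cv2.minMaxLoc(hm)
    points.append((int(x * w / hm.shape[1]), int(y * h / hm.shape[0]))
                  if conf > 0.1 else None)
```

Proper multi-person decoding additionally needs the PAF grouping step described below.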

Below are the basic ideas behind OpenPose. The Convolutional Pose Machine model realizes single-person pose estimation. Its workflow is as follows: (1) first, perform regression over the whole image and find the keypoints of each person’s joints; (2) then remove the responses belonging to other people according to the center map; (3) finally, obtain the result by repeatedly refining the predicted heatmaps. During refinement, it is necessary to introduce losses at the intermediate layers, which ensures the deeper network can still be trained and the gradient will neither vanish nor explode. This coarse-to-fine design gradually improves the accuracy of the regression.

Figure 3. Single-person pose estimation
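The intermediate-layer losses can be sketched as follows (a minimal PyTorch rendering of the idea, not the authors’ code): every stage’s predicted heatmaps are compared against the ground truth and the per-stage losses are summed, so early stages receive gradient directly.

```python
import torch
import torch.nn.functional as F

def total_loss(stage_outputs, target_heatmaps):
    # stage_outputs: list of (batch, K, H, W) predictions, one per stage
    return sum(F.mse_loss(out, target_heatmaps) for out in stage_outputs)

stages = [torch.randn(2, 19, 46, 46, requires_grad=True) for _ in range(6)]
target = torch.rand(2, 19, 46, 46)
total_loss(stages, target).backward()   # each stage gets its own gradient
```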

OpenPose can also realize multi-person pose estimation. With a structure similar to the Convolutional Pose Machine, the network finds all the joint keypoints in a picture and uses Part Affinity Fields (PAFs) to group them. Why use PAFs? Because the method in the article is bottom-up: it first regresses everyone’s joint keypoints and then divides these keypoints so that each joint is assigned to the right person. The division is done using the PAF branch.

Figure 4. Multi-person pose estimation
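The grouping can be sketched as follows (a simplified illustration of the scoring step; the function name is mine): for each pair of candidate joints, integrate the PAF along the segment between them and keep the component parallel to the segment; pairs with high scores are then matched greedily per limb type, as in the paper.

```python
import numpy as np

def paf_score(paf_x, paf_y, joint_a, joint_b, n_samples=10):
    a, b = np.asarray(joint_a, float), np.asarray(joint_b, float)
    v = (b - a) / (np.linalg.norm(b - a) + 1e-8)  # unit limb direction
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (a + t * (b - a)).astype(int)      # sample point on the segment
        score += paf_x[y, x] * v[0] + paf_y[y, x] * v[1]
    return score / n_samples    # high when the field follows the limb

paf_x, paf_y = np.random.randn(46, 46), np.random.randn(46, 46)
print(paf_score(paf_x, paf_y, (5, 10), (30, 20)))
```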

However, when we run experiments with OpenPose, we find one disadvantage: it is very difficult to detect people against a complex background. For example, some people may be partially occluded in the video, and OpenPose cannot track them accurately. Moreover, OpenPose has another disadvantage: it needs powerful computation resources such as a GPU. Hence, when running OpenPose, we should place the camera so that people are not occluded, and we should use a powerful GPU to support real-time response. Here we find another model, Lightweight OpenPose [29], which can run on a CPU with little accuracy loss.

Lightweight OpenPose makes several improvements. Firstly, it uses a lightweight backbone: MobileNet v1 replaces the VGG feature extractor, and dilated convolution is introduced. Secondly, it uses a lightweight refinement stage: only one refinement stage is kept, and a single prediction branch handles both the keypoint heatmaps and the PAFs in the initial and refinement stages. Each 7×7 convolutional kernel is replaced by a series of 1×1, 3×3 and 3×3 kernels. As a result, the proposed model reaches 41.4% Average Precision, about a 1% loss compared with the model that keeps all refinement stages, which is quite acceptable. In the experiments, Lightweight OpenPose produces 28 frames per second (fps) on an Intel® NUC 6i7KYB mini PC and 26 fps on a Core i7-6850K CPU.

Figure 5. AP comparison with different refinement stages
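A rough PyTorch rendering of this substitution (my own sketch, not the authors’ code): the 1×1, 3×3 and dilated 3×3 stack reproduces the 7×7 receptive field at a fraction of the cost, with a residual connection keeping training stable.

```python
import torch
import torch.nn as nn

class LightBlock(nn.Module):
    """Replacement for one 7x7 convolution in a refinement stage."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.body(x)    # residual connection

y = LightBlock(128)(torch.randn(1, 128, 46, 46))   # same shape in and out
```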

Acceleration Approaches

What’s more, below are several commonly used methods to reduce the complexity of an ANN and accelerate inference.

Lightweight Network Design: refers to designing a new network structure from scratch, rather than just optimizing and accelerating an existing network. The main representative lightweight designs are MobileNet v1/v2 [30, 31] and ShuffleNet v1/v2 [32, 33], etc. The main idea is to use depthwise convolution, pointwise convolution, group convolution and other low-cost operations to replace standard convolution. The computation of these models is usually only on the order of some MFLOPs, an obvious advantage compared with the several GFLOPs of traditional large networks such as VGG [26–28], Inception [34] and ResNet [35]; at the same time they are relatively simple, and the accuracy loss is small on simple tasks.
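The MobileNet v1 building block, for instance, is a depthwise separable convolution; a minimal sketch:

```python
import torch.nn as nn

def separable_conv(c_in, c_out):
    # Per-pixel cost drops from 3*3*c_in*c_out (standard conv) to
    # roughly 3*3*c_in + c_in*c_out multiplications.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # depthwise 3x3
        nn.BatchNorm2d(c_in), nn.ReLU(),
        nn.Conv2d(c_in, c_out, 1),                          # pointwise 1x1
        nn.BatchNorm2d(c_out), nn.ReLU(),
    )
```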

Model Pruning: compared with lightweight network design, model pruning focuses on shrinking an existing model from large scale to small scale. The main idea is to design a selection (sparsity) mechanism that, under the premise of keeping the accuracy of the existing model, filters out the less important weights in the convolutional layers at a certain granularity, thereby reducing computational resource consumption and improving real-time performance [36–41].
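A minimal sketch of one such mechanism, magnitude-based filter pruning in the spirit of [38] (the helper name is mine): rank each output filter of a convolutional layer by its L1 norm and keep only the strongest fraction; the pruned network is then fine-tuned to recover accuracy.

```python
import torch

def prune_filters(conv_weight, keep_ratio=0.5):
    # conv_weight: (out_channels, in_channels, k, k)
    importance = conv_weight.abs().sum(dim=(1, 2, 3))    # L1 norm per filter
    n_keep = max(1, int(keep_ratio * conv_weight.shape[0]))
    keep = importance.argsort(descending=True)[:n_keep]
    return conv_weight[keep], keep   # pruned weights + surviving indices

w = torch.randn(64, 32, 3, 3)
w_small, kept = prune_filters(w, keep_ratio=0.25)        # 64 -> 16 filters
# The next layer's input channels must be re-indexed with `kept`.
```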

Model Distillation: the goal is to make a large model gradually smaller while keeping the accuracy loss small. Model distillation uses supervisory signals provided by the large model (the Teacher Network) to help train a small model with a small amount of computation (the Student Network) to reach accuracy similar to that of the large model, thereby achieving acceleration. The key lies in the design of the supervisory signal, for example using class similarities via soft targets [42], intermediate feature maps [43] or attention maps [44] to train the student network.
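The soft-target loss of [42] can be sketched in a few lines (a standard formulation; the hyper-parameter values are illustrative):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft part: match the teacher's temperature-softened distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard part: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distill_loss(torch.randn(8, 10), torch.randn(8, 10),
                    torch.randint(0, 10, (8,)))
```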

Matrix Decomposition: since the most computationally intensive part of a deep learning model is convolution, and convolution can be implemented as matrix multiplication after im2col, we can use a variety of low-rank matrix approximation methods [45, 46]. The multiplication by one large matrix is decomposed into a series of multiplications by several small matrices, which reduces the overall computation and accelerates the execution speed of the model.
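A minimal sketch with truncated SVD (the shapes are illustrative): after im2col, a convolutional layer is a single matrix multiplication y = W x, and factorising W into two thin matrices cuts the cost when the kept rank r is small.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1152))    # e.g. 256 filters over 3*3*128 patches
U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 32                                  # kept rank: accuracy/speed trade-off
U_r = U[:, :r] * s[:r]                  # (256, r)
V_r = Vt[:r]                            # (r, 1152)

x = rng.standard_normal((1152, 1000))   # im2col'd input patches
y_fast = U_r @ (V_r @ x)                # two small matmuls instead of one big
print(np.linalg.norm(W @ x - y_fast) / np.linalg.norm(W @ x))  # relative error
```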

Quantization and Low-Precision Operations: deep learning models perform a large number of floating-point multiply-add operations at runtime. The default data width is generally 32 bit, but in fact we can use a lower bit width, such as 16, 8, 4, 2 or even 1 bit, to quantize the weights and feature maps of the model and complete approximate calculations [47–49]. In this way, on the one hand, memory traffic at runtime can be halved or better; on the other hand, with the support of high-speed hardware, the calculation speed of the model can be multiplied. The core problem is how to control the accuracy loss caused by low-precision arithmetic.
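A minimal symmetric 8-bit sketch (real frameworks add per-channel scales and calibration, omitted here):

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                       # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale      # dequantized approximation
print(np.abs(w - w_hat).max())            # worst-case quantization error
```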

Computational Graph Optimization: a deep learning model usually has tens to hundreds of layers, and many parts between layers share similar fixed structures, such as Conv-BatchNorm-ReLU. We can optimize these fixed patterns and fuse multiple layers into one, horizontally and vertically, in order to reduce unnecessary memory copies between layers and the overhead caused by multiple kernel launches.
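Conv-BatchNorm fusion is the classic example; a sketch of the folding arithmetic (at inference time BatchNorm is a per-channel affine map, so it can be absorbed into the preceding convolution):

```python
import torch

def fold_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / torch.sqrt(var + eps)        # per output channel
    w_fused = w * scale.reshape(-1, 1, 1, 1)     # scale the conv weights
    b_fused = (b - mean) * scale + beta          # shift the conv bias
    return w_fused, b_fused

w, b = torch.randn(16, 3, 3, 3), torch.zeros(16)
gamma, beta = torch.ones(16), torch.zeros(16)    # learned BN parameters
mean, var = torch.zeros(16), torch.ones(16)      # BN running statistics
w_f, b_f = fold_conv_bn(w, b, gamma, beta, mean, var)
```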

Convolution Algorithm Optimization: convolution itself has a variety of implementations, such as sliding window, im2col + GEMM, FFT and Winograd convolution. None of these algorithms has an absolute advantage in speed, because the efficiency of each largely depends on the size of the convolution. Therefore, when optimizing a model, we should select the most efficient convolution algorithm for each convolutional layer, so as to make full use of the strengths of the different algorithms.
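The im2col + GEMM route, for example, can be sketched in a few lines (stride 1, no padding, purely illustrative):

```python
import numpy as np

def im2col_conv(x, w):
    # x: (C, H, W) input, w: (F, C, k, k) filters
    F_, C, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    # Unfold every kxk patch into one column...
    cols = np.stack([x[:, i:i + k, j:j + k].reshape(-1)
                     for i in range(H) for j in range(W)], axis=1)
    # ...then the whole convolution is a single matrix multiplication.
    return (w.reshape(F_, -1) @ cols).reshape(F_, H, W)

y = im2col_conv(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
print(y.shape)   # (4, 6, 6)
```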

Hardware Acceleration: any model relies on some computing platform to run, so we can accelerate directly from the hardware design of the platform [50]. At present, the mainstream computing platform for deep learning is the GPU. Starting from the Volta architecture, GPUs are equipped with Tensor Cores, hardware units dedicated to fast matrix multiplication, which can significantly improve the throughput of deep learning models. At the same time, FPGA and ASIC accelerator chips, featuring low power consumption and low latency, are attracting interest in edge computing.
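From PyTorch, for example, Tensor Cores can be exploited by running the forward pass under autocast (this sketch requires an NVIDIA GPU of the Volta generation or newer):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Eligible matmuls/convs are dispatched to FP16 Tensor Core kernels.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
```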

References

1. Anthony, M.; Bartlett, P. L., Neural network learning: Theoretical foundations. cambridge university press: 2009.

2. Wang, S.-C., Artificial neural network. In Interdisciplinary computing in java programming, Springer: 2003; pp 81–100.

3. Schmidhuber, J., Deep learning in neural networks: An overview. Neural networks 2015, 61, 85–117.

4. Zhang, A.; Lipton, Z. C.; Li, M.; Smola, A. J., Dive into Deep Learning, Release 0.7.1. 2020.

5. Hornik, K.; Stinchcombe, M.; White, H., Multilayer feedforward networks are universal approximators. Neural networks 1989, 2 (5), 359–366.

6. Hornik, K., Approximation capabilities of multilayer feedforward networks. Neural networks 1991, 4 (2), 251–257.

7. Hornik, K., Some new results on neural network approximation. Neural networks 1993, 6 (8), 1069–1072.

8. Hornik, K.; Stinchcombe, M.; White, H., Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks 1990, 3 (5), 551–560.

9. Abbe, E.; Sandon, C., Provable limitations of deep learning. arXiv preprint arXiv:1812.06369 2018.

10. Nye, M.; Saxe, A., Are efficient deep representations learnable? arXiv preprint arXiv:1807.06399 2018.

11. Eftekhar, B.; Mohammad, K.; Ardebili, H. E.; Ghodsi, M.; Ketabchi, E., Comparison of artificial neural network and logistic regression models for prediction of mortality in head trauma based on initial clinical data. BMC medical informatics and decision making 2005, 5 (1), 1–8.

12. Hu, C.; Zhao, F. In Improved methods of BP neural network algorithm and its limitation, 2010 International Forum on Information Technology and Applications, IEEE: 2010; pp 11–14.

13. Khashei, M.; Bijari, M., An artificial neural network (p, d, q) model for timeseries forecasting. Expert Systems with applications 2010, 37 (1), 479–489.

14. Sum, J.; Leung, C.-S.; Ho, K., A limitation of gradient descent learning. IEEE transactions on neural networks and learning systems 2019, 31 (6), 2227–2232.

15. Bleiholder, J.; Naumann, F., Data fusion. ACM computing surveys (CSUR) 2009, 41 (1), 1–41.

16. Goodman, I. R.; Mahler, R. P.; Nguyen, H. T., Mathematics of data fusion. Springer Science & Business Media: 2013; Vol. 37.

17. Hall, D. L.; Llinas, J., An introduction to multisensor data fusion. Proceedings of the IEEE 1997, 85 (1), 6–23.

18. Yu, Y.; Si, X.; Hu, C.; Zhang, J., A review of recurrent neural networks: LSTM cells and network architectures. Neural computation 2019, 31 (7), 1235–1270.

19. Wu, J., Introduction to convolutional neural networks. National Key Lab for Novel Software Technology. Nanjing University. China 2017, 5, 23.

20. Chen, W.; Jiang, Z.; Guo, H.; Ni, X., Fall detection based on key points of human-skeleton using openpose. Symmetry 2020, 12 (5), 744.

21. Noori, F. M.; Wallace, B.; Uddin, M. Z.; Torresen, J. In A robust human activity recognition approach using openpose, motion features, and deep recurrent neural network, Scandinavian conference on image analysis, Springer: 2019; pp 299–310.

22. Viswakumar, A.; Rajagopalan, V.; Ray, T.; Parimi, C. In Human Gait Analysis Using OpenPose, 2019 Fifth International Conference on Image Information Processing (ICIIP), IEEE: 2019; pp 310–314.

23. Wei, S.-E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. In Convolutional pose machines, Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016; pp 4724–4732.

24. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y., OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE transactions on pattern analysis and machine intelligence 2019, 43 (1), 172–186.

25. Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. In Hand keypoint detection in single images using multiview bootstrapping, Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017; pp 1145–1153.

26. Mateen, M.; Wen, J.; Song, S.; Huang, Z., Fundus image classification using VGG-19 architecture with PCA and SVD. Symmetry 2019, 11 (1), 1.

27. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K., Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in neuroscience 2019, 13, 95.

28. Yu, W.; Yang, K.; Bai, Y.; Xiao, T.; Yao, H.; Rui, Y. In Visualizing and comparing AlexNet and VGG using deconvolutional layers, Proceedings of the 33rd International Conference on Machine Learning, 2016.

29. Osokin, D., Real-time 2d multi-person pose estimation on CPU: Lightweight OpenPose. arXiv preprint arXiv:1811.12004 2018.

30. Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H., Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 2017.

31. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. In Mobilenetv2: Inverted residuals and linear bottlenecks, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018; pp 4510–4520.

32. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. In Shufflenet v2: Practical guidelines for efficient cnn architecture design, Proceedings of the European conference on computer vision (ECCV), 2018; pp 116–131.

33. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. In Shufflenet: An extremely efficient convolutional neural network for mobile devices, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018; pp 6848–6856.

34. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. In Rethinking the inception architecture for computer vision, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016; pp 2818–2826.

35. He, K.; Zhang, X.; Ren, S.; Sun, J. In Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016; pp 770–778.

36. He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.-J.; Han, S. In Amc: Automl for model compression and acceleration on mobile devices, Proceedings of the European Conference on Computer Vision (ECCV), 2018; pp 784–800.

37. He, Y.; Zhang, X.; Sun, J. In Channel pruning for accelerating very deep neural networks, Proceedings of the IEEE International Conference on Computer Vision, 2017; pp 1389–1397.

38. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H. P., Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710 2016.

39. Han, S.; Pool, J.; Tran, J.; Dally, W. J., Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626 2015.

40. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. In Learning efficient convolutional networks through network slimming, Proceedings of the IEEE International Conference on Computer Vision, 2017; pp 2736–2744.

41. Ye, J.; Lu, X.; Lin, Z.; Wang, J. Z., Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124 2018.

42. Hinton, G.; Vinyals, O.; Dean, J., Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2015.

43. Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; Bengio, Y., FitNets: Hints for thin deep nets. Proc. ICLR 2015.

44. Zagoruyko, S.; Komodakis, N., Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 2016.

45. Jaderberg, M.; Vedaldi, A.; Zisserman, A., Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 2014.

46. Zhang, X.; Zou, J.; He, K.; Sun, J., Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 2015, 38 (10), 1943–1955.

47. Courbariaux, M.; Bengio, Y.; David, J.-P., Binaryconnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363 2015.

48. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. In Xnor-net: Imagenet classification using binary convolutional neural networks, European conference on computer vision, Springer: 2016; pp 525–542.

49. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y., Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 2017, 18 (1), 6869–6898.

50. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M. A.; Dally, W. J., EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 2016, 44 (3), 243–254.
