A Deep Learning Experiment for Quarantine Home Gyms

Co-authored with Ojasvi Yadav

Hetarth Chopra
Geek Culture
Jun 17, 2021


In 2020, a large share of the population was confined within the four walls of their homes. The quarantine surfaced problems that few had thought about before, and the inability to go to the gym was one of them. 46% of Indians were affected by the nationwide lockdown because of limited access to gyms, parks, and fitness centers. As a result, many fitness circles moved to online video conferencing platforms, and working out from home became prominent.

With the scale of these classes increasing every day, it is challenging for trainers to pay individual attention to every enrolled student. Especially when an exercise involves numerous repetitions, it is difficult for trainers to keep count and assess each student properly. This called for tools that provide analytics and real-time performance tracking for such classes and similar scenarios. In this blog, we discuss one such use case: counting pushup repetitions with the help of deep learning.

Since it is well established that deep learning can help ease and automate many computer vision problems, finding the right architecture for the pipeline was the first question we had to tackle. Such problems are usually solved with advanced computer vision techniques like human pose estimation or optical flow. Estimating the pose of the joints, perhaps with a library such as TensorFlow's pose estimation library, is useful, but its high computational cost makes it impractical for mobile deployments. Since we wanted this experiment to be user-friendly, basic, and flexible, we chose a traditional classification pipeline instead. The algorithm should classify the frames of a pushup video into the different stages of a pushup, as shown in Fig. 1, which can then be appended to a list of events (…down->up->down…), hence acting as a counter.

UPDATE: This blog is not a follow-through/code-along tutorial, even though it might seem like one. Read the conclusion of this blog to see our learnings before following the rest of the discussion.

Fig 1. A simple illustration depicting the classification of frames into notable events.

1 — Selecting the Model and the Framework

For this particular case, it was primarily necessary that the detection algorithm be fast, precise, and light enough to deploy even on handheld devices. Popular algorithms in the regime of fast, deployable models include Faster R-CNN, YOLO, SSD, and MobileNet, as per a really comprehensive comparison by Jonathan Hui. Most of the compared models are also part of the TensorFlow 2 Detection Model Zoo, so we proceeded with TensorFlow 2 as the framework. After carefully comparing many models' performance metrics on the COCO dataset, along with their ease of deployment and flexibility, we chose SSD-MobileNet V2. The network's research paper can be read here.

Particular pros of this model include fast and accurate prediction even on mobile devices, and easy integration with TensorFlow for transfer learning applications. Transfer learning decreases training time by taking a pre-trained model (often trained on a big dataset such as ImageNet) and re-training only the last 'n' layers (a configurable parameter) for fine-tuning. This lets the basic feature recognizers in the initial layers of the network work exactly as they should, while adapting the weights of the final layers to our particular dataset. Another advantage of transfer learning is that it achieves the desired result with fewer images. In our case, we decided to follow the transfer learning tutorial provided by TensorFlow, as shown in this link.

2 — Getting and Setting the Data required for Transfer Learning

Before transfer learning could begin, we had to find a dataset that could cater to our need of classifying the stages of a pushup. Instead of looking to external sources, we recorded ourselves performing pushups. The videos were very simple recordings taken on our phone cameras, and the frames were extracted from them using a very basic Python-OpenCV script, as shown below.
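Since the original gist is not embedded here, the following is a minimal sketch of what such a script looks like (the video path, output folder, and naming scheme are illustrative, not our exact script):

```python
import os
import cv2

VIDEO_PATH = "pushups.mp4"   # illustrative input video
OUT_DIR = "frames"
os.makedirs(OUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(VIDEO_PATH)
count = 0
while True:
    ret, frame = cap.read()
    if not ret:              # end of the video stream
        break
    cv2.imwrite(os.path.join(OUT_DIR, f"frame_{count:05d}.jpg"), frame)
    count += 1
cap.release()
print(f"Extracted {count} frames into '{OUT_DIR}'")
```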

On examining the extracted frames, we concluded that it would be best to divide them into three events: down, else, and up, as shown in Fig. 2. The labels were decided purely by visual examination.

Fig 2. Sample images taken from the video stream and labeled into the three respective categories, purely based on visual examination.

After extracting the frames, we manually segregated them into their respective labels (feel free to explore other ways to annotate data here) and placed them into three folders, one per category, as shown below. These folders were stored inside a parent folder named "labeledframes".

Fig 3. Folder structure for all the frames extracted from the pushup video, labeled to simplify the CNN training process.

Keeping the data as-is is a highly inefficient way to build a machine learning application, especially when you are working on cloud machines such as Google Colab or Kaggle. Hence we decided to serialize the dataset using Pickle. Serializing or "pickling" converts a Python object into a byte stream, which can be de-serialized by pickle later and is useful for sharing objects. Thanks to Sentdex, we were able to pack the dataset into a serialized object using the approach below. While serializing, we also resized the images to 100 x 100 to keep them in accord with the tutorial.
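A minimal sketch of this step, loosely following Sentdex's approach (the folder layout and file names are illustrative):

```python
import os
import pickle
import cv2
import numpy as np

DATA_DIR = "labeledframes"
CATEGORIES = ["down", "else", "up"]   # mapped to labels 0, 1, 2
IMG_SIZE = 100

data = []
for label, category in enumerate(CATEGORIES):
    folder = os.path.join(DATA_DIR, category)
    for fname in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, fname))
        if img is None:                        # skip unreadable files
            continue
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        data.append((img, label))

X = np.array([img for img, _ in data])
y = np.array([label for _, label in data])

with open("dataset.pickle", "wb") as f:
    pickle.dump((X, y), f)                     # serialize for reuse
```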

Once this was done, the dataset had to be randomly reshuffled and divided into training, validation, and test sets in a pre-determined ratio, as shown in Fig. 4. This was realized with Sklearn's train_test_split function (applied twice), and the splits were serialized again for further use, as shown below.

Fig. 4. A visualization showing how the entire dataset has to be split to train a model with low bias and variance. Image source: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
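A sketch of the two-step split (the 70/15/15 ratio here is illustrative):

```python
import pickle
from sklearn.model_selection import train_test_split

with open("dataset.pickle", "rb") as f:
    X, y = pickle.load(f)

# First split off the test set, then carve a validation set
# out of the remaining data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, shuffle=True, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=42)

with open("splits.pickle", "wb") as f:
    pickle.dump((X_train, X_val, X_test, y_train, y_val, y_test), f)
```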

3 — Data Pre-Processing

After serializing the splits into separate variables, each one was read back individually and reshaped. The shape was chosen to match the input of the neural network to be trained, as shown in the tutorial.

Another pre-processing step was one-hot encoding. One-hot encoding is applied to labels that are not ordinal in nature. Here the labels are categorical (up, down, and else are independent of each other), so we needed a way to make the algorithm aware of this difference before training on the data. Below you can see how we reshaped the data and one-hot encoded the labels before starting the transfer learning.
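A sketch of both pre-processing steps (the variable names are illustrative):

```python
import pickle
import tensorflow as tf

IMG_SIZE = 100
NUM_CLASSES = 3

with open("splits.pickle", "rb") as f:
    X_train, X_val, X_test, y_train, y_val, y_test = pickle.load(f)

# Reshape to (batch, height, width, channels), the input the network expects.
X_train = X_train.reshape(-1, IMG_SIZE, IMG_SIZE, 3)
X_val = X_val.reshape(-1, IMG_SIZE, IMG_SIZE, 3)
X_test = X_test.reshape(-1, IMG_SIZE, IMG_SIZE, 3)

# One-hot encode the categorical labels: e.g. 2 ("up") becomes [0, 0, 1].
y_train = tf.keras.utils.to_categorical(y_train, NUM_CLASSES)
y_val = tf.keras.utils.to_categorical(y_val, NUM_CLASSES)
y_test = tf.keras.utils.to_categorical(y_test, NUM_CLASSES)
```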

4 — Performing Transfer Learning

First, the pre-trained MobileNet-V2 (trained on the ImageNet database) is loaded, and batches of the data are created, as shown below. Setting base_model.trainable = False specifies that the MobileNet layers should stay frozen during transfer learning.
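A sketch of this step, continuing from the pre-processing above (the batch size is illustrative):

```python
import tensorflow as tf

IMG_SHAPE = (100, 100, 3)
BATCH_SIZE = 32

# Load MobileNetV2 without its ImageNet classification head.
base_model = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SHAPE, include_top=False, weights="imagenet")
base_model.trainable = False   # freeze the pre-trained layers

# Batch the in-memory arrays from the pre-processing step.
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(1000).batch(BATCH_SIZE))
val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(BATCH_SIZE)
```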

Subsequently, the non-frozen head layers were created, and the model was compiled and validated, as sketched below.
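A sketch of the new head and the initial validation (the exact head layers are an assumption based on the TensorFlow tutorial):

```python
# Stack a small trainable head on top of the frozen base.
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(3, activation="softmax"),  # down / else / up
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["categorical_accuracy"])

# Evaluate before training to get a baseline loss and accuracy.
loss0, accuracy0 = model.evaluate(val_ds)
print(f"initial loss: {loss0:.2f}")
print(f"initial accuracy: {accuracy0:.2f}")
```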

The output should be something like this:

20/20 [==============================] - 5s 229ms/step - loss: 2.2518 - categorical_accuracy: 0.2546
initial loss: 2.25
initial accuracy: 0.25

And finally, it was trained (keep in mind that we chose to save the model weights after every epoch, since it is a good practice that lets you pick the best model in case of gradient explosion).
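A sketch of the training call with per-epoch checkpointing (the epoch count and checkpoint path are illustrative):

```python
# Save the weights after every epoch so the best model can be recovered.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/pushup-{epoch:02d}.ckpt",
    save_weights_only=True,
    verbose=1)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[checkpoint_cb])
```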

The full training log is too long to share within the scope of this experiment; however, the training and validation accuracy across epochs can be viewed in Fig. 5.

Fig. 5 Training and Validation Accuracy after Transfer Learning has been done

The overall categorical accuracy we achieved on the validation dataset came out to be ~0.89.

5 — Saving and Testing the Model

The saved model was loaded and evaluated on the test dataset using the code below.
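A sketch of that evaluation (the checkpoint path is illustrative):

```python
# Restore the chosen checkpoint and evaluate on the held-out test set.
test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)

model.load_weights("checkpoints/pushup-10.ckpt")
loss, acc = model.evaluate(test_ds, verbose=2)
print(f"Restored model, accuracy: {100 * acc:.2f}%")
```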

And the output was…

20/20 - 0s - loss: 0.2301 - categorical_accuracy: 0.9076
Restored model, accuracy: 90.76%

Good accuracy was achieved on the test dataset (comparable to our validation accuracy), which indicates a model with low bias and variance.

6 — Live Testing on Webcam

To conclude our experiment, we created a small script that takes the video feed directly from our laptop's webcam and counts the total number of pushups. First we saved the checkpoint created in the previous section, then loaded it to predict on frames coming straight from the camera, as shown below. The code takes an input frame from the camera, runs it through the CNN, predicts the event, and appends it to a list. On exiting the loop, the entire list is scanned for the sequence of events ([2 (up), 1 (else), 0 (down)]) in that particular order. Each time this pattern is found, a counter is incremented, and it is printed at the end as the total number of pushups.
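A sketch of such a script (the duplicate-collapsing and pattern-counting logic here illustrate the idea; they are not our exact post-processing code):

```python
import cv2
import numpy as np
import tensorflow as tf

IMG_SIZE = 100

# Rebuild the model exactly as in training and restore the checkpoint.
base = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False, weights=None)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.load_weights("checkpoints/pushup-10.ckpt")

events = []
cap = cv2.VideoCapture(0)                      # default laptop webcam
while True:
    ret, frame = cap.read()
    if not ret:
        break
    resized = cv2.resize(frame, (IMG_SIZE, IMG_SIZE)).astype("float32")
    pred = model.predict(resized[np.newaxis, ...], verbose=0)
    events.append(int(np.argmax(pred)))        # 0 = down, 1 = else, 2 = up
    cv2.imshow("feed", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press 'q' to stop
        break
cap.release()
cv2.destroyAllWindows()

# Post-processing: collapse consecutive duplicates, then count each
# up -> else -> down (2, 1, 0) occurrence as one pushup.
collapsed = [e for i, e in enumerate(events) if i == 0 or e != events[i - 1]]
count = sum(1 for i in range(len(collapsed) - 2)
            if collapsed[i:i + 3] == [2, 1, 0])
print(f"Total pushups: {count}")
```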

On running this script, we noticed that it could sometimes identify the frames correctly while we performed pushups, though not with great accuracy. A number of other issues we had not foreseen also surfaced, and they are discussed in the following section.

Conclusion

Despite being a very quick and easy-to-understand prototype that worked (sometimes 😆), live testing on the webcam gave us a learning opportunity to understand the shortcomings still left in the project.

1 — Classification algorithms are really good at solving problems quickly, but different clothes, different backgrounds, and a different setting for the place where the pushups are performed very quickly disrupt the model. In coming articles, we will experiment further with techniques such as semantic segmentation or optical flow to isolate the action from the overall frame.

2 — The counting algorithm has not been fully optimized to handle the various edge cases that can arise in a counting problem. For example, the speed of a pushup may be fast or slow depending on the person, which calls for a more robust pattern-matching approach, such as the Knuth-Morris-Pratt (KMP) searching algorithm, sketched below.
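For illustration, here is how KMP could count occurrences of the (2, 1, 0) pattern in the event list; this is our own sketch, not code from the project:

```python
def kmp_count(events, pattern):
    """Count non-overlapping occurrences of `pattern` in `events` with KMP."""
    # Failure table: longest proper prefix that is also a suffix.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    count = k = 0
    for e in events:
        while k and e != pattern[k]:
            k = fail[k - 1]
        if e == pattern[k]:
            k += 1
        if k == len(pattern):   # full pattern matched: one repetition
            count += 1
            k = 0               # restart for non-overlapping matches
    return count

print(kmp_count([2, 1, 0, 1, 2, 1, 0], [2, 1, 0]))  # -> 2
```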
