Your Guide to Human Movements In Machine & Deep Learning

linedanceAI · Published in CodeX · 11 min read · Sep 8, 2021

This article speaks to both the business case and the developer approach for working with human movement content in machine learning and deep learning environments. We define human movement content as any human movement, activity, or behavior captured by a video stream, video camera, depth sensor, or wearable sensors worn at joint locations of interest. Note that this guide focuses on generating information from human movement content using human joint positions as the data. This approach is aligned with privacy protection policies worldwide.

For most of us, our body movements and activities are an integral part of our lives. They are our fitness exercises for health management, competitive sport, or recovery from injury. They are lifting and carrying our child, or helping an elderly neighbor carry grocery bags from the car. The human body in motion executes our daily lives, and in doing so it carries useful information about the body at work or play, in pain or in joy. How do we capture this information to learn more, help more, or create safer environments for all?

For most of us, evaluating our body movements comes down to how 'well' we feel moving throughout the day. From a business perspective, however, professionals such as physical therapists, trainers, athletes, doctors, safety managers, and manufacturing floor operators need quantitative information about human movements. In these fields and others where human decision-makers affect human lives, accurate and precise data is fundamental.

And herein lies the value of employing advances in computing, computer vision, and AI: the ability to generate quantitative information about the human body in motion at scale. With AI we can see the unseen, quantify what has been unquantifiable, and manage human movement as data, like any other dataset. AI is required because movement is so complex: body movement generates tens of thousands of data points a minute. Remember, you move through three spatial dimensions (length, width, and depth) and across a fourth dimension, time, all day long. Only AI can harness this data and produce volumes of information.

For example, a month in the life of a physical therapist may produce over 24,000 minutes of human movement content across 50 patients. For each patient, the PT's objective is to improve outcomes within a prescribed schedule (say, twice a week for 8 weeks). With computer vision and AI algorithms, a PT can capture a baseline of movement and measure a patient's progress, or lack of progress. Professionals like the PT can then help more patients by generating objective, data-driven outcome information at the click of a button. The same is now possible for any clinician or practitioner responsible for human performance in health, sports, or fitness.

Beyond health, sports, and fitness, human activity content is everywhere. Other industries involved with the human factor include safety and security operators monitoring thousands of video cameras who need notifications only for emergent human activities or behaviors; line managers responsible for worker safety and task performance in a manufacturing environment; and retail stores seeking to understand a customer's engagement with products and staff.

The Human Movement Technology Value Chain

Now that you have context about using AI to quantify human activity content, let’s look at the current technology value chain for Human Movement and Activities. Here is where we start:

Diagram: Box 1 Human Movement; Box 2 Application; Box 3 End User; arrows connect each box, showing content flowing from Human Movement to the End User.

First, human movement content originates from video streams captured by cameras, depth cameras and sensors, wearable sensors, and recorded videos. This data (human movement content) moves to a software application designed to deliver value to the end user. End users here could be anyone making decisions about the health, safety, or movement performance of a person.

At first glance it may seem simple, like any other software build, and you may be wondering, ‘Okay, what’s the problem?’ Well…. Several challenges exist in the realm of computing human movement content:

1. Human movement content within videos or camera streams must first be identified as human using general features of a body: a head, shoulders, two arms, hips, legs, and feet, standing upright and moving. We need software to identify these joint locations on a body.

2. Videos and camera streams are crowded with other objects in the scene, which makes extracting body joint locations a complicated challenge that must be addressed within the application. The technology value chain therefore includes "Pose Estimators", also known as "skeletal trackers" or "body landmark" libraries. These are middleware software libraries that rely mainly on pre-trained ML and DL models to track the skeleton and extract joint positions from individual video frames (see the sketch below). Well-known pose estimators on the market today include Wrnch, OpenPose, and the Azure Kinect body tracker.
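For illustration only, here is a minimal sketch of that extraction step using MediaPipe Pose, a freely available pose estimator (not one of the three named above, but it plays the same role in the value chain). The video filename is an assumption for the example.

```python
# Minimal sketch: extract per-frame joint positions from a video.
# Assumes MediaPipe Pose as the pose estimator and "squat.mp4" as the input;
# any pose estimator in the value chain could fill this role.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
frames = []  # one list of joint positions per video frame

cap = cv2.VideoCapture("squat.mp4")
with mp_pose.Pose(static_image_mode=False) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV reads BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks is None:
            continue  # no person detected in this frame
        # Each landmark is a normalized <x, y, z> joint position.
        frames.append([(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark])
cap.release()

print(f"{len(frames)} frames, {len(frames[0])} joints per frame")
```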

Now we have amended our diagram to include this vital piece of software.

Diagram: Box 1 Human Movement; (new) Box 2 Pose Estimator; Box 3 Application; Box 4 End User; arrows connect each box, showing content flowing from Human Movement to the End User.

“Great! Can we build human movement applications for our End Users now?” Not yet.

Pose estimators successfully provide human joint positions as data in the camera environment, but now we have a new set of issues:

1. Pose estimators do not generate information or knowledge that can be turned into an understandable data story for our end users. They do not perform tasks such as identifying specific movements or evaluating how well a body movement was performed (for example, the best way to perform a specific exercise), both of which may play a role in the data story for an end user.

2. Pose estimators do not tell you anything about body movements that happen as a sequence over time.

3. Pose estimators do not find the start and end of a movement within a sequence, which you need, for example, to count repetitions of movements or activities.

4. Pose estimators do not normalize joint locations to fit your application requirements. Normalizing body joint data is necessary because joint locations differ from person to person based on body size and limb length. If an end user wants to know which joints or movements are incorrect compared to a baseline, both the incoming skeletal data and the baseline skeletal data must share a common frame of reference, i.e., be normalized (see the sketch after this list).
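As a rough illustration of what such normalization can look like (a generic sketch, not linedanceAI's method), one common approach is to translate the skeleton to a root joint and scale it by a reference bone length. The joint indices below are assumptions for the example.

```python
import numpy as np

def normalize_skeleton(joints: np.ndarray, root: int = 0,
                       ref_a: int = 0, ref_b: int = 1) -> np.ndarray:
    """Center a skeleton on a root joint and scale by a reference bone length.

    joints: array of shape (num_joints, 3) holding <x, y, z> per joint.
    root:   index of the joint used as the origin (e.g., pelvis) -- assumed layout.
    ref_a, ref_b: joints whose distance defines the scale (e.g., pelvis to neck).
    """
    centered = joints - joints[root]                        # drop position in the camera frame
    scale = np.linalg.norm(joints[ref_a] - joints[ref_b])   # body-size reference length
    return centered / max(scale, 1e-8)                      # avoid division by zero

# Two skeletons of different body sizes become directly comparable:
# error = np.linalg.norm(normalize_skeleton(incoming) - normalize_skeleton(baseline))
```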

Now the technology value chain requires another component: AI-based analytics. The challenges left after the pose estimator can only be solved by building and training ML and DL algorithms, and these algorithms need a data structure that addresses both the space and time dimensions of the human body in motion.

If you are a developer, perhaps you've experimented with human movement. Say you recorded yourself jumping twenty times and want to build a demo application that counts the number of jumps in the video automatically. You may have used the OpenPose library to generate joint locations from the video, and then got stuck: either you couldn't find an algorithm to train on this kind of data, or you found one but could not automatically detect the start and end of each jump. How can the application count what you have not instructed it to see, or taught it what 'count' means?
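To make the problem concrete, here is a naive sketch of what a developer might try first: counting jumps by thresholding the vertical position of a hip joint. The joint choice, the threshold, and the coordinate convention are all assumptions, and this kind of hand-tuned heuristic is exactly what breaks down with noisy pose estimates, different camera framings, and new movement types.

```python
import numpy as np

def count_jumps(hip_y: np.ndarray, up_thresh: float = 0.05) -> int:
    """Naive jump counter: count upward excursions of the hip above its resting height.

    hip_y: vertical hip position per frame, assumed to increase upward
           (flip the sign for image coordinates where y grows downward).
    up_thresh: how far above resting height counts as 'in the air' -- hand-tuned.
    """
    resting = np.median(hip_y)           # assumes the person stands still most of the time
    in_air = hip_y > resting + up_thresh
    # A jump is a transition from 'on the ground' to 'in the air'.
    return int(np.sum(~in_air[:-1] & in_air[1:]))
```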

This rather simplistic example illustrates the kind of technical problems human movement presents. Once the body starts moving in a sequence, turning human skeleton data into information and knowledge requires AI analytics that can manage the volume and velocity of human movement data.

Diagram: Box 1 Human Movement; Box 2 Pose Estimator; (new) Box 3 AI-Analytics; Box 4 Application; Box 5 End User; arrows connect each box, showing content flowing from Human Movement to the End User.

4D Analysis Vs. 3D Analysis

Let's look at the types of analytics, the differences between them, and which type of analysis is appropriate for your software goals.

Data is the first step in any kind of analysis, and understanding it will clarify our algorithm selection strategy. The question here is, "What does human movement data look like?" As the technology value chain shows, each frame within the video is converted into joint positions using a pose estimator. A single joint position is <X, Y, Z>, where X, Y, and Z are real numbers indicating the joint's position within the camera's spatial environment. Some pose estimators generate only <X, Y> for a joint position, depending mainly on the type of video used. Pose estimators don't find the location of a single joint but of multiple joints, collectively called skeleton landmark features. One good way to organize these joint positions in a data structure is a matrix, where each row represents a frame and each column represents a specific joint position; e.g., cell (1, 2) could be the right wrist position <X, Y, Z> in the video frame at t1. We call each of these matrices a Sequence.

Fig 1 Single Squat Frame (row) Visualization
Fig 2 Squat Sequence (Matrix) Visualization
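Here is a minimal sketch of that matrix layout, assuming the per-frame joint lists produced by the pose estimator sketch earlier; the joint order is whatever the pose estimator emits and, as noted next, it must never change.

```python
import numpy as np

# 'frames' is the list built in the pose estimator sketch above:
# one entry per frame, each entry a fixed-order list of (x, y, z) joints.
def to_sequence(frames: list) -> np.ndarray:
    """Stack per-frame joints into a Sequence matrix.

    Rows are frames (time), columns are joints; each cell holds <x, y, z>.
    The resulting shape is (num_frames, num_joints, 3).
    """
    return np.asarray(frames, dtype=np.float32)

# sequence = to_sequence(frames)
# sequence[0, 2]  ->  <x, y, z> of joint 2 (e.g., right wrist, assumed index) at t0
```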

It is very important to notice here that there are relationships between these joint positions (columns), i.e., parent-child relationships, or body bones. Also, each frame (row) is related to the frames before and after it. To produce repeatable and reliable analysis, your data structure needs a specific joint order that never changes.

One type of analysis is 3D analysis, in which you compare or classify one frame against another frame for a specific purpose, regardless of the time order of those frames. A common example is indicating joint errors compared to the best possible positions for a specific exercise, as described in the pictures below. In these visualizations it is clear that person 1 in Fig 4 didn't push their hips down all the way compared to the benchmark in Fig 3.

Fig 3 Squat Frame Benchmark
Fig 4 Squat Frame Person 1

You can use a simple Euclidean distance measure between corresponding joints, but the results will only make sense if you normalize both skeletons so that body size and limb length do not influence them. Alternatively, you can use classical machine learning algorithms like SVMs, logistic regression, KNNs, etc. to find the differences between the two frames. A 3D analysis approach, regardless of the algorithm mix you use, is great when the movement you are extracting information from is a single pose or frame, like holding a yoga pose. On the other hand, if the movement is represented by multiple poses, i.e., a matrix, you must consider time as a fourth dimension of the analysis. Fig 2 is a good example: a squat is not a single pose, but multiple poses connected by a time relationship.
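Here is a minimal sketch of that 3D, frame-to-frame comparison, reusing the hypothetical normalize_skeleton helper from the normalization sketch above; the error threshold is an assumption for the example.

```python
import numpy as np

def joint_errors(frame: np.ndarray, benchmark: np.ndarray, tol: float = 0.1) -> np.ndarray:
    """Per-joint Euclidean distance between a frame and a benchmark pose.

    Both inputs have shape (num_joints, 3); normalize_skeleton() from the earlier
    sketch removes body size and camera position from the comparison.
    Returns the indices of joints whose error exceeds 'tol' (hand-picked here).
    """
    frame_n = normalize_skeleton(frame)
    bench_n = normalize_skeleton(benchmark)
    per_joint = np.linalg.norm(frame_n - bench_n, axis=1)  # one distance per joint
    return np.where(per_joint > tol)[0]

# e.g., joint_errors(sequence[deepest_squat_frame], benchmark_frame)
# might flag the hip joints for person 1 in Fig 4.
```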

4D analysis is a mix of algorithms that consider the time factor together with the position factor. This analysis is far more complex than 3D analysis. Complexities include:

1. Limited choices of algorithms that can handle the time relationship. The categories of ML/DL algorithms that can process sequences and multidimensional time-series data include LSTMs, Transformers, Hidden Conditional Random Fields, and Hidden Markov Models. Each of these algorithms and its resulting models is very complex and requires extensive hyperparameter tuning (a minimal sketch of one such model follows this list).

2. Absence of pre-trained models that understand the time relationship between poses.

3. Absence of good data sets for testing your approach's accuracy and precision.

4. Absence of labeling and human movement data management tools with which you can build your own data sets and train your own classifiers and algorithms.
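As an illustration of the first point, here is a minimal Keras sketch of one such sequence model. It assumes a labeled data set of fixed-length Sequence matrices already exists, which, per points 3 and 4 above, is itself a large part of the problem.

```python
import tensorflow as tf

NUM_FRAMES, NUM_JOINTS, NUM_CLASSES = 120, 33, 5   # assumed sizes for the example

# Each input sample is one Sequence matrix flattened to (frames, joints * 3).
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0,
                            input_shape=(NUM_FRAMES, NUM_JOINTS * 3)),  # ignore zero-padded frames
    tf.keras.layers.LSTM(64),                        # learns the time relationship between poses
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_sequences, train_labels, ...) still requires the labeled
# data sets whose absence points 3 and 4 above describe.
```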

I bet you are asking, "If 4D analysis is that complex, why don't we stick with 3D analysis?" The goal of any analysis is to show and organize data in a way that tells the story with little or no ambiguity. 3D analysis is missing the crucial factor of TIME. 3D will do well in cases where your movement is a single pose without a time factor; ignoring the time factor for multi-pose movements will lead to misleading and incomplete information.

A good example is comparing the biomechanical similarity of an incoming squat movement, as data, against a squat used as a performance benchmark. The squat's start and end poses look the same in most cases, as in Fig 5. Also, each pose after the starting pose has an equivalent pose before the end pose, as with poses 1 and 3 in Fig 5. A full squat sequence requires you to start from a standing position, go down, then come back up to a standing position.

Fig 5: Sequence of images depicting a full squat movement sequence starting in a standing pose, moving to squat pose 1 mid-way down to squat pose 2 which is flexion of knee at 90–100 degrees, then back up to squat pose 3 mid-way up to ending squat pose standing.

On the other hand, let's design a new, imaginary squat, as in Fig 6, where the person starts from the lower position and only moves up to a standing position. Using 3D analysis to compare the Fig 5 squat to the Fig 6 squat will erroneously produce a very high similarity score, because most of the poses in the Fig 6 squat can be found in the Fig 5 squat even though their time order is wrong. 4D analysis, i.e., including the time factor, is the only way to generate accurate information and knowledge about the correct performance of the entire squat as a sequence of movement.

Fig 6: Sequence of images depicting a false or imaginary squat movement sequence incorrectly starting at knee flexion of 90–100 degrees, then moving mid-way up to an ending standing pose. Below is a GIF demonstrating the action.
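To see why an order-blind comparison is misled here, consider this generic sketch contrasting an order-agnostic pose-matching score with dynamic time warping (DTW), one classical order-aware measure. Neither is linedanceAI's algorithm; the poses are assumed to be the normalized frames from earlier.

```python
import numpy as np

def pose_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two normalized poses of shape (num_joints, 3)."""
    return float(np.linalg.norm(a - b))

def order_agnostic_score(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """3D-style score: match every pose in seq_b to its nearest pose in seq_a,
    ignoring time order. A squat performed in the wrong order still looks similar."""
    return float(np.mean([min(pose_dist(p, q) for q in seq_a) for p in seq_b]))

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Plain dynamic time warping: the alignment must move forward in time in
    both sequences, so an out-of-order squat accumulates a large cost."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = pose_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```

Against the Fig 5 benchmark, the imaginary Fig 6 squat looks deceptively good under order_agnostic_score but accumulates a large dtw_distance, which is exactly the gap that including time, i.e., 4D analysis, closes.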

linedanceAI's Position within the Technology Value Chain

Diagram: Box 1 Human Movement; Box 2 Pose Estimator; (new) Box 3 linedanceAI 4D Analytics; Box 4 Application; Box 5 End User; arrows connect each box, showing content flowing from Human Movement to the End User.

linedanceAI is a 4D human movement analysis platform, a complete solution engine for human activity recognition and full sequence analysis. We tailored a new mix of algorithms and tools that specifically addresses the time issue and the other 4D analysis challenges mentioned herein, creating an AI analytics tool set that generates comprehensive and accurate information from human movement content.

linedanceAI's algorithms can be retrained for your use cases. The stack is optimized for scale, speed, and flexibility through low data requirements and no hyperparameter configuration. Stack flexibility accommodates multiple use cases and allows fast deployments to your market. linedanceAI's out-of-the-box algorithms can be pipelined for specific use cases like healthcare, fitness/sport, security camera monitoring, and workplace safety management.

linedanceAI provides data structure SDKs built specifically to handle different skeleton structures and different joint counts to match dynamic requirements within a use case. These data structures are fully integrated into every algorithm within the algorithm mix and pipelines.

linedanceAI's patented algorithms and its 4D analysis approach have been built and tested by a group of advanced data scientists, machine learning engineers, and human movement analysis PhDs to turn our 4D world into digital stories.

Want to run a project, or to learn more about human movement in ML & DL and our latest news? Follow our linedanceAI page on LinkedIn, watch our demo videos there, or here. Better yet, reach us directly at info@linedanceAI.com. TIME is on OUR side. Is it on yours?

linedanceAI: Simply reimagining orthopedic Physical Therapist documentation.