Training AI to Score Olympic Events

Could we have an automated scoring system as a trusted, impartial second opinion?

Suhyun Kim
6 min read · Jul 30, 2021
(Cover photo: https://unsplash.com/photos/ZIoi-47zV88)

The 2021 Tokyo Olympics are here. I had been waiting for them for a long time, especially after they were postponed due to the pandemic. While watching some of the events, I saw judging decisions I did not agree with. Controversial calls are nothing new: the 2002 and 2014 figure skating results bothered not only me but also millions of Olympics fans. It made me wonder: can we teach AI to score Olympic events? If so, we could have an automated scoring system as a trusted, impartial second opinion.

What is action quality assessment?

There has actually been research on this topic for some time: can a machine assess how well a person performs an action? This concept is called Action Quality Assessment (AQA): the assessment of how well an action is performed, done by estimating a score after analyzing the performance. It is useful in many applications, not only in sports but also in health care and music. For example, an injured player or a person with mobility impairments could do physical therapy exercises on their own without having to pay for expensive therapy sessions. Similarly, automated skill evaluation could make learning a musical instrument more accessible to those who are socioeconomically disadvantaged.

Action Quality Assessment is not Action Recognition

AQA is different from action recognition because action recognition does not quantify the quality of an action. Another important distinction is that an action can often be classified from just one or a few frames, but to assess how well an action was performed, we need to analyze the entire action sequence.

How do we teach the machine to do it?

We can formulate AQA as a supervised learning problem, where the model learns to map input videos of actions to an action quality score. Scores given out by human judges serve as the ground truth. Since action quality scores typically take continuous values, AQA is formulated as a regression problem.
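To make this concrete, here is a minimal PyTorch sketch of that regression setup. The feature size, layer widths, and 0–100 score range are illustrative assumptions, not values from any of the papers discussed below.

```python
import torch
import torch.nn as nn

# Minimal sketch of the regression formulation. The 4096-dim video
# feature and the layer sizes are assumptions for illustration.
score_regressor = nn.Sequential(
    nn.Linear(4096, 256),
    nn.ReLU(),
    nn.Linear(256, 1),   # one continuous action quality score
)

video_features = torch.randn(8, 4096)    # a batch of 8 encoded videos
judge_scores = torch.rand(8, 1) * 100.0  # ground truth from human judges

# Standard supervised regression: minimize mean squared error
loss = nn.MSELoss()(score_regressor(video_features), judge_scores)
loss.backward()
```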

The assessment generally has two components: (1) what was performed and (2) how well it was performed. When a performance is judged at the Olympics, the judges first have to recognize which action the athlete performed. A routine with a higher difficulty rating, even with some mistakes, can earn a higher overall score than a cleanly executed routine of lower difficulty, as the toy example below shows.
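Here is a simplified, diving-style calculation (the numbers are made up, and real rules, such as dropping the highest and lowest judges' marks, are omitted):

```python
# Toy, diving-style scoring: (sum of execution marks) x degree of difficulty.
harder_dive = 3.8 * (7.0 + 7.0 + 7.5)   # some mistakes, high difficulty -> 81.7
easier_dive = 2.8 * (9.0 + 9.0 + 9.5)   # near-perfect, low difficulty  -> 77.0
print(harder_dive > easier_dive)        # True: difficulty can outweigh execution
```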

How do humans do it?

For example, let’s talk about the diving events at the Olympics. Before each diver competes, the broadcast shows the name of the dive they are going to perform and its associated degree of difficulty.

The dashboard contains a lot of information about what will be performed. As far as body position goes, a diver can choose from the straight, pike, tuck, or free positions. The numbers in front of “somersault” and “twists” show how many of each will be performed. All of these fine-grained details of what was done matter when determining how well a diver performed, and the judges have all this information before the athlete dives. For this reason, learning the right features from the videos is crucial for AQA.

What has been done so far?

An initial study of AQA used human pose information as the representation capturing action quality. Given an image or video, pose estimation is the task of identifying, locating, and tracking major joints of the human body, such as the elbows or knees. From the estimated poses, body-movement features were extracted from the video, and Support Vector Regression (SVR) was used to map the athlete’s movements to Olympic event scores.
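In the same spirit, here is a hedged scikit-learn sketch of that pipeline, with random arrays standing in for real pose-derived features and judge scores (the feature extraction step itself is omitted):

```python
import numpy as np
from sklearn.svm import SVR

# Stand-ins for real data: 50 videos, each with a 128-dim pose-derived
# movement feature vector (dimensions are assumptions for illustration).
rng = np.random.default_rng(0)
pose_features = rng.normal(size=(50, 128))
judge_scores = rng.uniform(0, 100, size=50)   # ground-truth event scores

# Regress movements onto scores with Support Vector Regression
svr = SVR(kernel="rbf", C=10.0)
svr.fit(pose_features, judge_scores)
predicted_scores = svr.predict(pose_features[:5])
```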

However, there is a problem with this approach. In diving, form is very important: in a pike, for example, you don’t want much room between your chest and legs. Likewise, the splash created when the diver enters the water is an important part of the scoring criteria; you don’t want too big of a splash. Visual cues like splash size are hard to recover from joint positions alone.

To capture such important visual cues about the quality of an action, instead of relying on pose estimation we can extract spatio-temporal features with 3D convolutional neural networks (C3D), which have shown promising results in the related task of action recognition. The research showed that a variation of C3D worked better than the previous pose-based models.
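The key idea is that 3D convolutions slide over time as well as space, so the learned features can capture motion cues like splash size, not just per-frame pose. Here is a toy PyTorch stand-in for such a feature extractor; the layer sizes are illustrative and much smaller than the real C3D:

```python
import torch
import torch.nn as nn

# Toy 3D-convolutional feature extractor in the spirit of C3D.
# Kernels span (time, height, width), so features are spatio-temporal.
c3d_stub = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),                            # -> 128-dim clip feature
)

clip = torch.randn(1, 3, 16, 112, 112)       # (batch, RGB, 16 frames, H, W)
features = c3d_stub(clip)                    # shape: (1, 128)
```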

To go a bit further in capturing details from the videos, a multi-task approach was proposed. The concept of multi-task learning (MTL) is inspired by human learning: when we learn new tasks, we often first learn simpler or related tasks that give us the skills needed to acquire more complex techniques. Applying this concept, in order for the model to learn the fine-grained details of a diving video, we teach it (1) how to classify the action, (2) how to generate commentary, and (3) how to score the action. The motivation is that detailed action recognition answers the question of “what was performed,” while the commentary, a verbal description of the good and bad points of the execution, answers “how well was it performed.”
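Schematically, the three tasks become three heads on top of one shared video representation. In this hedged sketch the dimensions are made up, and a single linear layer stands in for the commentary generator, which in reality would be a sequence decoder:

```python
import torch
import torch.nn as nn

# Three task heads sharing one encoded video representation.
# All dimensions here are illustrative assumptions, not the paper's values.
feat_dim, num_actions, vocab_size = 512, 48, 5000

shared_feature = torch.randn(4, feat_dim)        # batch of 4 encoded videos
action_head = nn.Linear(feat_dim, num_actions)   # (1) what was performed
caption_head = nn.Linear(feat_dim, vocab_size)   # (2) toy stand-in for a caption decoder
score_head = nn.Linear(feat_dim, 1)              # (3) how well it was performed

action_logits = action_head(shared_feature)
word_logits = caption_head(shared_feature)
quality_score = score_head(shared_feature)
```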

(Figure: overview of the multi-task AQA approach — https://drive.google.com/file/d/1BbdPZDZlLwj0ekHJz54o-A2LYvU5h148/view)

Model Architecture

The pipeline starts by extracting picture frames from the input video. In the example below, a total of 96 frames were extracted. To learn shared representations (general features), 3D CNNs were used, but they require a lot of memory, and all 96 frames cannot be processed at once. For that reason, the 96 frames were divided into small clips: 6 sets of 16 frames each. These clips go through the common network backbone, which learns the shared representation, and the 6 clip-level features are averaged. At this point, the input video has been encoded into a representation corresponding to the athlete’s whole performance. The encoded representation then goes through the score regressor and the action classifier. The caption generator instead takes the concatenated C3D representations as input, without averaging. At the end, the three losses are weighted and summed to get the final training loss.
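Putting the pieces together, here is a condensed PyTorch sketch of that pipeline. The backbone is a tiny stand-in for the real C3D, and the class count, loss weights, and target values are invented for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in backbone: one clip -> 32-dim feature (not the real C3D)
backbone = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
)
score_head = nn.Linear(32, 1)
action_head = nn.Linear(32, 10)                  # 10 dive classes (assumed)

video = torch.randn(3, 96, 112, 112)             # RGB video: 96 frames of 112x112
clips = video.split(16, dim=1)                   # 6 clips of 16 frames each
clip_feats = torch.stack([backbone(c.unsqueeze(0)) for c in clips])  # (6, 1, 32)
video_feat = clip_feats.mean(dim=0)              # average the 6 clip features

score = score_head(video_feat)                   # score regression
action_logits = action_head(video_feat)          # action classification
# The caption generator would instead consume the concatenated clip
# features: torch.cat(list(clip_feats), dim=1) has shape (1, 192).

# Final training loss: weighted sum of task losses (caption loss omitted here)
w_score, w_cls = 1.0, 0.5                        # illustrative weights
loss = w_score * F.mse_loss(score, torch.tensor([[87.5]])) \
     + w_cls * F.cross_entropy(action_logits, torch.tensor([3]))
```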

(Architecture from: Parmar et al., “What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment,” CVPR 2019 — https://openaccess.thecvf.com/content_CVPR_2019/papers/Parmar_What_and_How_Well_You_Performed_A_Multitask_Learning_Approach_CVPR_2019_paper.pdf)

AI-based App to Judge Olympic Events

Based on the multi-task AQA model, we created an app that can process Olympic diving videos and output a score, just like a human judge would! Try out our AI-based Olympics judge here: https://share.streamlit.io/gitskim/aqa_streamlit/main/main.py

For now, the machine learning model only handles diving videos. Nonetheless, it can be extended to other individual sports such as gymnastics, snowboarding, and skiing. The web app is built with Streamlit, and its source code is here: https://github.com/gitskim/AQA_Streamlit.
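For the curious, this is roughly what the skeleton of such a Streamlit app looks like. This is not the app’s actual source code (that lives in the GitHub repo above), and run_aqa_model is a hypothetical stand-in for the trained model:

```python
import streamlit as st

st.title("AI Olympics Judge")

video_file = st.file_uploader("Upload a diving video", type=["mp4", "avi"])
if video_file is not None:
    st.video(video_file)
    # score = run_aqa_model(video_file)  # hypothetical call to the trained model
    score = 87.5                         # placeholder value for illustration
    st.write(f"Predicted AQA score: {score}")
```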

References

  1. Parmar et al., “What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment,” CVPR 2019. https://openaccess.thecvf.com/content_CVPR_2019/papers/Parmar_What_and_How_Well_You_Performed_A_Multitask_Learning_Approach_CVPR_2019_paper.pdf (cited with the author’s permission)
  2. Diving body positions, Tutorialspoint. https://www.tutorialspoint.com/diving/diving_body_positions.htm
  3. Sebastian Ruder, “An Overview of Multi-Task Learning in Deep Neural Networks.” https://ruder.io/multi-task/

Notes

I started my freelance business in 2022. I accept freelance work focused on productionizing machine learning services, data engineering, and distributed systems.
