Swing Like a Champ with Computer Vision | Backhand Tracker

Samuel Sung · Published in Geek Culture · Oct 6, 2021

Photo by James D. Morgan/Getty Images

After watching the Tokyo 2020 Olympic Games, the US Open, and the ATP tour, I became obsessed with tennis. I had not realized how intense a tennis match could be. The moment of triumph after a triple deuce can't get much better. Even though I was watching the matches from my soft couch, my palms were sweating as if I were actually playing on the court. To experience that same feeling in real life, there was no choice but to start learning tennis.

So far, I would say it has been quite a success. After a few months of learning, my skills have improved vastly: I can now hit high and low balls with control and at moderate speed.

However, I am not yet satisfied and have been pondering ways to improve my game. Watching tennis matches and tutorials on YouTube can get somewhat repetitive; each time you watch, you learn a little less. I wanted something else, something different that could get me to the next level.

I realized I could integrate computer vision to improve my tennis game.

And here are a few reasons why I chose computer vision:

  1. Computer vision is exceptional at finding key points on human joints.
  2. Numerous open-source models are available, so I can detect human poses without training a model from scratch.
  3. I can easily analyze my swing and compare it to others.

Hopefully, this will be the first of many applications of computer vision to my tennis game. In this article, I will demonstrate a way to track backhand swing motion and analyze how my swing differs from that of the best in the business, Djokovic.

MediaPipe[1]

There are various pose estimation models to choose from in the open-source community. To mention a few, OpenPose[2], HRNet[3], DeepCut[4], and DeepPose[5] are widely used.

Check out this link for an overview of human pose estimation.

For this article, I will be using MediaPipe, which is provided by Google. It offers customizable deep learning solutions for live and streaming media, including face detection, face mesh, pose detection, 3D object detection, and much more. It also provides trained models that can be used in a non-GPU environment; in other words, poses can be detected in real time on a portable device such as a laptop or smartphone.

From personal experience, the website is well documented, making it easy for beginners to use MediaPipe and implement it in their projects.

MediaPipe Pose

MediaPipe Pose takes a top-down approach. More specifically, it utilizes a two-step detector: first, it locates the region of interest (ROI) containing a person within the image; then, it predicts the pose landmarks within that ROI, using the ROI-cropped frame as input.

I used MediaPipe Pose to locate the 33 joints of the detected player in the footage.

MediaPipe Pose 33 landmarks.
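For readers who want to try this, here is a minimal sketch using MediaPipe's Python solutions API; the image path, model settings, and printout are placeholder choices for illustration, not my exact setup:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# Load a single frame and run pose detection on it.
frame = cv2.imread("swing_frame.png")  # placeholder image path

with mp_pose.Pose(static_image_mode=True,
                  model_complexity=1,
                  min_detection_confidence=0.5) as pose:
    # MediaPipe expects RGB input; OpenCV loads images as BGR.
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # 33 landmarks, with x and y normalized to [0, 1] by frame width/height.
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        print(idx, lm.x, lm.y, lm.visibility)
```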

Backhand Tracker

You don't need all 33 joints to track a backhand. Since my right hand is my dominant hand, I simply used my left wrist joint to illustrate the motion of my swing as I approach the ball, as in the sketch below.
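As a rough sketch of what that looks like in code (the video file name and confidence thresholds below are placeholders), the left wrist can be pulled out of each frame like this:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
LEFT_WRIST = mp_pose.PoseLandmark.LEFT_WRIST

wrist_path = []  # (x, y) pixel coordinates of the left wrist, one per frame
cap = cv2.VideoCapture("backhand.mp4")  # placeholder clip

with mp_pose.Pose(min_detection_confidence=0.5,
                  min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            h, w = frame.shape[:2]
            lm = results.pose_landmarks.landmark[LEFT_WRIST]
            # Landmarks are normalized; scale them to pixel coordinates.
            wrist_path.append((lm.x * w, lm.y * h))

cap.release()
```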

The most challenging part of building this tracker was finding the starting and ending points of my swing. After all, every swing is different: swing speed, angle, and height all adjust to the incoming ball. A simple rule-based algorithm therefore won't be sufficient to track the various types of swings.

The one time I didn't fall asleep during Physics 101, I remember learning about rates of change. For example, if we know the position of a particle with respect to time, we can derive its velocity, since velocity is the rate at which position changes. By the same logic, acceleration is the rate of change of velocity.

Relationship among position, velocity, and acceleration with respect to time. Source: University of Cambridge Department of Engineering.

From the pose detector, the position of my wrist is already given for every frame. Using the velocity and acceleration relationships above, and the graphs shown below, I was able to approximate the starting and ending points of each swing.

Graph of velocity and acceleration of my swings.
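I won't reproduce my exact code here, but a finite-difference approximation captures the idea behind curves like these; the use of NumPy's np.gradient is my choice for this sketch:

```python
import numpy as np

def wrist_kinematics(positions, fps):
    """Approximate wrist speed and acceleration from per-frame (x, y) positions."""
    positions = np.asarray(positions, dtype=float)  # shape: (n_frames, 2)
    dt = 1.0 / fps                                  # seconds between frames
    velocity = np.gradient(positions, dt, axis=0)   # rate of change of position
    speed = np.linalg.norm(velocity, axis=1)        # scalar speed per frame
    acceleration = np.gradient(speed, dt)           # rate of change of speed
    return speed, acceleration
```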

The graphs were plotted from three consecutive swings. We can clearly see a pattern in the velocity and acceleration of a swing, which gives us a grasp on how to pinpoint the starting and ending points of each swing.
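The exact rule I ended up with is not shown here, but as a rough illustration of the idea, a swing could be treated as a run of frames where the wrist speed stays above a threshold; both parameters below are hypothetical:

```python
def find_swing_segments(speed, min_speed, min_frames=5):
    """Return (start, end) frame indices where speed stays above min_speed."""
    segments, start = [], None
    for i, s in enumerate(speed):
        if s >= min_speed and start is None:
            start = i                      # a candidate swing begins
        elif s < min_speed and start is not None:
            if i - start >= min_frames:    # ignore short blips of motion
                segments.append((start, i))
            start = None
    if start is not None and len(speed) - start >= min_frames:
        segments.append((start, len(speed)))
    return segments
```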

To recall the main goal of this post: I wanted to track my backhand. More specifically, I was interested in seeing how far I extend my racket as I swing forward.

TMI: I have a habit of shortening my swing, sending balls with less force and making it easier for opponents to respond.

Results

The results were retrieved using TensorFlow Lite with XNNPACK (no GPU!), as provided by MediaPipe. The results are fairly acceptable, as the model tracks the location of the wrist well. The tracked lines are what I expected, forming a hyperbola-like curve.
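For anyone curious how such an overlay can be drawn, here is one way to render the tracked wrist path with OpenCV; the colour and thickness are arbitrary choices, not necessarily what I used:

```python
import cv2
import numpy as np

def draw_swing_path(frame, wrist_points):
    """Overlay the tracked wrist positions as a blue polyline (BGR colour order)."""
    pts = np.array(wrist_points, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(frame, [pts], isClosed=False, color=(255, 0, 0), thickness=3)
    return frame
```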

A backhand is tracked with a blue line.
Comparison between Djokovic’s and Sam’s backhand profiles.

Please note that my swings were not recorded in slow motion, while Djokovic's video was. Hence, you see fewer points on the graph of my backhand trajectory.

As you can see in the graphs, my swing trajectories are quite different from Djokovic's. Djokovic has a smooth U-shaped swing, while mine is L-shaped. There are a few interesting observations I can draw from the graphs:

  1. Djokovic's swings are more consistent.
  2. Djokovic swings downward right away, while I drag a little and then swing downward. This subtle difference can mean everything, because I lose power during that ‘lag’ zone.
  3. I tend to move my upper body at the beginning of the swing. In other words, my stance is not established before every swing.
  4. My swing is much shorter. Djokovic maintains racket speed all the way to the end of the swing, while I tend to lose racket speed much earlier.

Discussion

This was a fun exercise in analyzing my swing with computer vision. Surprisingly, the analysis gave me more insights than I expected and pointed to ways to correct my stance, posture, and swing.

One of the downsides of using computer vision is its sensitivity to data quality. If the footage is noisy or blurry, the joints jump around between frames, leading to inaccurate results. I also recommend filming in slow motion to get more consistent results.
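Beyond better filming, one simple mitigation is to smooth the landmark positions over time. Here is a sketch of an exponential moving average; the smoothing factor is an arbitrary choice, and heavier smoothing will lag the true motion:

```python
def smooth_landmarks(points, alpha=0.3):
    """Exponentially weighted moving average to damp frame-to-frame jitter."""
    smoothed = [points[0]]
    for x, y in points[1:]:
        px, py = smoothed[-1]
        # Lower alpha -> smoother but laggier track.
        smoothed.append((alpha * x + (1 - alpha) * px,
                         alpha * y + (1 - alpha) * py))
    return smoothed
```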

I am looking forward to adding more computer vision tools to my tennis game. You can follow me to stay updated!

If you guys have any tips or ideas related to this tracker, please comment below. I would love to hear them.

Thanks, everyone, for taking the time to read this post. Happy swinging :D

References

[1] MediaPipe: https://google.github.io/mediapipe/

[2] OpenPose: https://github.com/CMU-Perceptual-Computing-Lab/openpose

[3] HRNet: https://arxiv.org/abs/1908.07919

[4] DeepCut: https://arxiv.org/abs/1511.06645

[5] DeepPose: https://arxiv.org/abs/1312.4659
