Improving your Tennis Game with Computer Vision

Bakken & Bæck
6 min read · Dec 2, 2020

The interpretation of digital images and video, known as computer vision, has developed rapidly over the last decade. Companies like Sportlogiq use the technology to track player movements and ball trajectories, providing coaches and players with a new type of analysis and deep insights. On a recent project, we explored computer vision further by developing a fully digital tennis coach.

The objectives of the project were straightforward on the surface. Tennis players should be able to record themselves, upload the video, and receive an analysis of their game. More precisely, they should get feedback on the things that matter most for a tennis player: the timing of the swings relative to the flight of the ball, body positioning and angles, and the curve of the swings relative to that of professional players.

In this article we will outline how we built a fully functioning prototype which could do all of this, and more.

Recognising body movement

First, we needed a method for recognising body movement and poses. OpenPose is the state of the art in human pose estimation, so it was a natural starting point for our prototype.

We worked from video, processing each frame of a given sequence. To get a detailed interpretation of the people in an image or a video, their body parts must be identified as a set of key points.

Key points, in this case, are body parts such as joints and limbs, more than thirty in total. To obtain them we used a Convolutional Neural Network (CNN), one of the most dependable ways to extract this kind of data. A CNN is quite similar to a simple artificial neural network, except that its architecture is designed for visual input.
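To make the idea concrete, here is a minimal sketch of how key points fall out of a CNN's output: a pose network typically emits one confidence map (heatmap) per body part, and the key point is the peak of each map. The function below is illustrative, not the actual OpenPose post-processing.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps, threshold=0.1):
    """Extract one (x, y, confidence) triple per body part.

    `heatmaps` has shape (H, W, num_parts): one confidence map
    per key point, as produced by a pose-estimation CNN.
    """
    keypoints = []
    for part in range(heatmaps.shape[-1]):
        heatmap = heatmaps[..., part]
        # The key point is the highest-confidence pixel in the map.
        y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        conf = heatmap[y, x]
        # Discard parts the network is not confident about.
        keypoints.append((int(x), int(y), float(conf)) if conf >= threshold else None)
    return keypoints

# Toy example: a single 8x8 map with a clear peak for one body part.
maps = np.zeros((8, 8, 1))
maps[3, 5, 0] = 0.9
print(keypoints_from_heatmaps(maps))  # [(5, 3, 0.9)]
```

Real networks refine this with sub-pixel interpolation and, in OpenPose's case, additional "part affinity fields" to help connect the points, but the peak-picking step is the core idea.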

We then inferred the pose by combining two pieces of information: a) the key points we detected on the body, and b) the possible connections between them. From there, there are a couple of ways to construct and estimate the pose. We chose a tree-based graphical model, which describes the relationships between adjacent joints using the rules of human body mechanics.

Once you have the pose, you can do all kinds of things, like estimating swing speed and joint angles. If you are curious and want to read more, we recommend this article.
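Both of those measurements are simple geometry once the key points are in hand. Below is a hedged sketch: the joint angle is the angle between two limb vectors, and the swing speed is the wrist's displacement between frames scaled by the frame rate. The `metres_per_pixel` calibration factor is an assumption here, something you would estimate from a known reference in the scene.

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c,
    e.g. shoulder-elbow-wrist for the elbow angle during a swing."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))

def swing_speed(p_prev, p_curr, fps, metres_per_pixel):
    """Approximate wrist speed (m/s) from its position in two
    consecutive frames; metres_per_pixel is a calibration factor."""
    d = math.hypot(p_curr[0] - p_prev[0], p_curr[1] - p_prev[1])
    return d * metres_per_pixel * fps

# Shoulder, elbow and wrist in pixel coordinates: a right angle.
print(joint_angle((0, 0), (0, 5), (5, 5)))  # 90.0
```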

Estimating pose with multiple people

Multi-person pose estimation is a lot trickier than the single-person case, as each input frame contains twice as many key points and joints, or more. Mapping them to the right human pose is not easy (even for human eyes, let alone computers). There are two common strategies for dealing with it.

The so-called top-down approach runs object detection first, then applies single-person pose estimation to each detected person. The problem with this approach is that there is no fallback if the detection fails, which it easily does when people are in close proximity. The runtime is also proportional to the number of people in the image, making it potentially very slow.

In a bottom-up approach, we first find all the relevant key points in the frame, and only then connect them to the correct bodies. The disadvantage here, depending on the number of key points, can likewise be a very long processing time for a single image.

Tracking the racket

Our work with human pose estimation gave us satisfactory results, but for this specific prototype we needed more. For tennis players, one of the most important techniques to master is learning how to swing the racket. It’s crucial to learn how to stand, and where to start and finish the swing for a perfect shot, so we decided to track the racket movement.

This can be done by re-training an existing model with annotated images of tennis players holding rackets. Annotating the number of images required for re-training, however, proved to be quite challenging — so we had to find a different approach.

Watching tennis games closely, we realised that the wrist and base of the racket move together. The library we’re using provides a full list of all the key points and associated joints. This information is useful but not complete for our use case. With some customisation to the ‘tf-pose-estimation’ library, we could use the wrist as our foundation for tracking the players’ swing.

As we processed one frame at a time, we needed to get information from the last processed frame to track the swing. We made some changes to the frame processing module (where key points and joints are identified) so we could store the state of the last processed frame. By combining the skeleton information, consisting of key points and joints, in the previous and current frame, we could easily track the movement of the wrist.
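The state-keeping described above can be sketched as a small tracker: it remembers the wrist position from the previously processed frame and, for each new frame, records the position together with its displacement. This is an illustration of the idea, not the actual change we made inside the tf-pose-estimation frame-processing module.

```python
class WristTracker:
    """Tracks wrist movement across frames by remembering the
    previous frame's skeleton state."""

    def __init__(self):
        self.prev_wrist = None
        self.trajectory = []  # list of (position, displacement) pairs

    def update(self, keypoints):
        """`keypoints` maps part names to (x, y) pixel coordinates
        for the current frame; parts may be missing if undetected."""
        wrist = keypoints.get("right_wrist")
        if wrist is not None:
            if self.prev_wrist is not None:
                dx = wrist[0] - self.prev_wrist[0]
                dy = wrist[1] - self.prev_wrist[1]
                self.trajectory.append((wrist, (dx, dy)))
            self.prev_wrist = wrist

tracker = WristTracker()
tracker.update({"right_wrist": (100, 200)})
tracker.update({"right_wrist": (110, 190)})
print(tracker.trajectory)  # [((110, 190), (10, -10))]
```

Skipping frames where the wrist is not detected (rather than resetting) keeps the trajectory usable through brief occlusions, such as the racket passing in front of the body.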

Swing like Serena

We took the prototype a step further. By analysing the swings of professional players, we could build a feature where users compare their own swing with the pros and learn how to improve. To make this work, we had to make sure that the user's video was recorded in a way that kept the required key points visible. We spent a lot of time researching the best camera position for this, to make it work as well as possible.

In tennis, certain shots are used most often, e.g. the forehand, the backhand and the serve. We wanted to extract and analyse these swings effectively in post-processing, which we achieved by customising the library further. We already had changes in place to keep track of the last estimated pose, and by combining information from the current and previous state we can extract the swing. For this to work, we check the position of the wrist relative to the shoulders, hips and knees. Knowing whether the player is right- or left-handed, we can accurately tell if a particular swing is a forehand, a backhand or a serve.
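A simplified version of that classification logic might look like the following. The rules and key-point names here are illustrative assumptions (our production rules also used the hips and knees, as described above); note that in image coordinates y grows downward, so "above the shoulders" means a smaller y value.

```python
def classify_swing(keypoints, right_handed=True):
    """Rough swing classification from the hitting wrist's position
    relative to the shoulders, in image coordinates (y grows down).
    Thresholds and part names are illustrative."""
    wrist = keypoints["right_wrist"] if right_handed else keypoints["left_wrist"]
    l_sh, r_sh = keypoints["left_shoulder"], keypoints["right_shoulder"]
    shoulder_y = (l_sh[1] + r_sh[1]) / 2
    body_x = (l_sh[0] + r_sh[0]) / 2

    if wrist[1] < shoulder_y:          # wrist above the shoulders
        return "serve"
    hitting_side = wrist[0] > body_x   # wrist on the dominant side?
    if not right_handed:
        hitting_side = not hitting_side
    return "forehand" if hitting_side else "backhand"

pose = {
    "right_wrist": (220, 310),
    "left_shoulder": (140, 200),
    "right_shoulder": (200, 200),
}
print(classify_swing(pose))  # forehand: wrist below and on the dominant side
```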

For the user to be able to compare their swing with a professional player's, we need to consider the type of swing, the player's height, handedness and so on. Swing type and handedness we've already covered. To deal with differences in height, we start both the user's and the professional's swing at the same coordinate, which makes the swings easily comparable and provides valuable feedback to the user.
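The alignment step amounts to translating both wrist trajectories to a common origin before comparing them point by point. The sketch below shows that idea with a simple mean-distance score; the actual feedback logic is more involved, and the function names are ours, not the library's.

```python
import math

def align(swing):
    """Translate a wrist trajectory (a list of (x, y) positions)
    so that it starts at the origin, removing differences in
    starting position caused by height and camera framing."""
    x0, y0 = swing[0]
    return [(x - x0, y - y0) for (x, y) in swing]

def swing_distance(user_swing, pro_swing):
    """Mean per-point distance between two aligned swings; a simple
    illustrative similarity score (lower is closer to the pro)."""
    user, pro = align(user_swing), align(pro_swing)
    n = min(len(user), len(pro))
    return sum(
        math.hypot(u[0] - p[0], u[1] - p[1])
        for u, p in zip(user[:n], pro[:n])
    ) / n

# Same swing shape, different starting positions: distance is zero.
user = [(100, 300), (120, 280), (150, 250)]
pro = [(400, 500), (420, 480), (450, 450)]
print(swing_distance(user, pro))  # 0.0
```

A production version would also normalise for scale (e.g. by shoulder-to-hip distance) and time-align swings of different durations before scoring them.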

Processing and feedback

After the user uploads the recorded video, the processing is done on the server. For a 30-second clip, which is more than enough for a forehand swing, we were able to process and return the output within a minute.

Ideally, we would want real-time feedback, and so-called on-device detection and estimation is a great way to get it. With the release of ever more powerful mobile devices, it is now possible to run machine learning models directly on the phone. We looked into TensorFlow Lite, which is built for mobile devices, but after some testing with tf.js for on-device pose estimation, we decided to abandon the idea. The results were simply not reliable enough to give the player any useful feedback.

Here’s an example of a forehand analysis, which gives the user feedback on both pose and swing.

Further improvements

All in all, we were quite happy with what we achieved with the prototype. A few things could still be improved, like on-device detection to improve user onboarding, and camera-placement calibration. We would also like to expand our analysis to more swing types, like the serve, drop shot and volley.

Throughout the project we learned a lot about the current state of computer vision. The power of today's devices, combined with accessible open libraries, presents opportunities we didn't have just a few years ago. We are eager to keep exploring this technology.


We’re Bakken & Bæck, a digital studio based in Oslo, Bonn, Amsterdam and London. We define, design and develop all things digital.