Anatomy of a Computer Vision App

My name is Tommaso Borghi and, together with my friend Tomas Camin, I developed a little technological jewel called Tennis Camera.
This is our story and the architecture of our little “child”.

Intro

Tennis Camera is an iPhone app and the first fully automatic mobile product to measure your serve speed, like in professional tennis tournaments.
It requires no calibration, works in all light conditions (indoor and outdoor), on all court types (clay, grass, hard) and in both serving directions, with no need for tripods or extra holders, and with an overall speed accuracy below 5%. In real time.
Just check out the video below or download the Lite version (for free!) or the Pro version:

The Story

Our story started a couple of years ago on a tennis court, while recovering from the shipwreck of our first app, Answhere.me.
Like Tennis Camera, Answhere.me was a side project into which we poured a lot of energy and effort, endless nights after work and even higher expectations, for a product that never took off. We were licking our wounds, wondering what to do next.

While leaving the court on one of those “recovery” nights we started talking about this crazy idea of using the iPhone to measure serve speed in an accurate and reliable way. A professional tool was only available for ~$1000, and the apps on the store were completely manual and based on questionable assumptions: they measured the time of flight of the ball, thus computing an average speed rather than the peak speed used in professional tournaments, or they were very time consuming and far from the real-time experience all tennis fans are used to.
We wanted the real thing: an accurate, real-time instrument to measure our serve speed.

We immediately labelled the idea as “impossible” and went home. But we felt we could do something. We quickly started doing some “back of the envelope” calculations and, in the end, the idea was not so far-fetched. Of course we couldn't foresee the detailed challenges we would face, but this was the perfect opportunity to get back in business and to seriously challenge our technical skills, our real added value in the crowded app market.

In the post below we'll try to open the doors of our technology and describe the architecture and the engineering challenges we overcame to make it real.

The Anatomy of the App

As we said, the two main challenges to tackle are achieving good accuracy in real time while working within some serious constraints, like using a handheld device with limited computational power.

We could spend days covering all the tiny details of our implementation, but let's try to keep it simple and explain how we managed to do all that, starting from our architecture.

The speed estimation algorithm is divided in three steps as shown here:

The first step is the “Court Detection and Trigger”. Before starting the entire processing chain we make sure that the phone is actually shooting a tennis court from the right point of view rather than a general scene (the “Court Detection”) and, more importantly, we run a real-time algorithm to detect a flying ball (what we call the “Trigger” for all the subsequent processing). We also collect high-resolution frames for more accurate processing.

Secondly, we perform fine processing on the high-resolution frames to accurately detect the ball and its bouncing point in the “Blob Fine Tune” section.

Finally, we run the “Speed Estimation” processing to estimate the camera pose, the ball trajectory and, ultimately, its initial speed.

Let’s dig into the details of each of these modules.

Court Detection and Trigger

This is where all the dirty work is done. As you can see from the block diagram, the first step consists of grabbing the frames from the camera and performing a simple series of preprocessing steps at GPU level. We use the awesome GPUImage framework by Brad Larson to downscale the original frames, enhance the ball-to-court contrast and reduce movement artifacts by applying an averaging filter.
Each preprocessed frame is then subtracted from the previous one, and the result is converted to grayscale and finally to a binary image using a “Low Pass Thresholding”.
All these operations can be executed in parallel (pixel-wise) and are thus particularly suited to GPU processing, which means REALLY fast. Just to give you a reference number, we can run this first set of operations at >30 fps on iPhone 5S and at >60 fps on iPhone 6 and 6+.
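For readers who want to play with these ideas, here is a minimal CPU sketch (Python with OpenCV) of the same pixel-wise chain. In the app these steps run on the GPU through GPUImage; the scale factor, blur size and threshold below are illustrative guesses rather than our tuned values.

```python
import cv2

def preprocess(frame, prev_small, scale=0.5, blur_ksize=5, thresh=30):
    """CPU sketch of the GPU preprocessing chain:
    downscale -> contrast boost -> averaging -> frame difference -> binary threshold."""
    # Downscale the full-resolution frame to cut the per-pixel workload
    small = cv2.resize(frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    # Enhance ball-to-court contrast (a simple global contrast stretch as a stand-in)
    small = cv2.convertScaleAbs(small, alpha=1.3, beta=0)
    # Averaging filter to reduce hand-shake and sensor-noise artifacts
    small = cv2.blur(small, (blur_ksize, blur_ksize))
    if prev_small is None:
        return small, None
    # Subtract the previous preprocessed frame: moving objects survive, the static court does not
    diff = cv2.absdiff(small, prev_small)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    # "Low pass thresholding": keep only pixels whose change exceeds the threshold
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return small, binary
```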

From here on, everything related to frame processing is handled on the CPU.

Court Detection

In this phase we are interested into two main things:

  1. ensuring that the user is pointing at the right target (a tennis court);
  2. checking that we can detect all the reference points we need to estimate the camera position and orientation in the court. The logic here is pretty simple: first recognize the lines and, if their pattern matches that of a standard tennis court, move forward.

The frames used in this module come from the “Circular Frame Buffer”, which contains the frames originally grabbed by the camera at full resolution (1280x720). To optimize performance we reduce the frame size down to a more manageable 960x540, which provides the best trade-off between speed and accuracy.
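As a rough illustration, a circular frame buffer can be as simple as a fixed-length deque; the capacity below is an arbitrary example, not the value we actually use.

```python
from collections import deque
import cv2

class CircularFrameBuffer:
    """Keeps the last N full-resolution frames so the later, more accurate
    processing stages can re-read them instead of the downscaled trigger frames."""
    def __init__(self, capacity=30):
        self.frames = deque(maxlen=capacity)  # the oldest frames drop off automatically

    def push(self, frame):
        self.frames.append(frame)

    def latest(self, n=1):
        return list(self.frames)[-n:]

def downscale_for_detection(frame, width=960, height=540):
    # Court detection runs at 960x540, a compromise between speed and accuracy
    return cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
```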

Then we proceed with the definition of a Region of Interest (ROI) based on the different color distributions of the court and its outside neighborhood. Once the processing is constrained to the right region we start searching for the lines.

Although standard computer vision libraries offer off-the-shelf solutions for line detection (like the Hough Transform from OpenCV), we preferred to develop a custom algorithm to keep full control of the parameters.

First of all we need to identify the white parts of the image. To do this we first equalize the image, then select an interval of colors (in HSV space) and finally perform blob detection on these areas.
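A minimal sketch of this white-line extraction step, assuming illustrative HSV bounds and a minimum blob area (our tuned values differ):

```python
import cv2
import numpy as np

def white_line_blobs(bgr, min_area=40):
    """Find candidate court-line blobs: equalize, select near-white pixels in HSV,
    then extract connected components as blobs."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v = cv2.equalizeHist(v)                      # equalize brightness before selecting colors
    hsv = cv2.merge([h, s, v])
    # Court lines are near white: low saturation, high value (bounds are illustrative)
    mask = cv2.inRange(hsv, np.array([0, 0, 180]), np.array([180, 60, 255]))
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    blobs = [(tuple(centroids[i]), int(stats[i, cv2.CC_STAT_AREA]))
             for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return mask, blobs
```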

We built a custom blob detection algorithm: well-known libraries like cvBlob were of course our starting point, but we soon realized that we couldn't afford unnecessary features if we wanted to squeeze every bit of the iPhone's resources and process as many fps as possible (more fps means more measurement points, that is, higher accuracy).

The line detection algorithm is based on a modified version of the Hough transform that uses blobs rather than points and detects a line when a number of blobs are consistently aligned along a given direction. Controlling the entire algorithm (from blob to line detection) also enabled us to run multiple iterations concurrently on different CPU cores.
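A simplified sketch of the idea: vote blob centroids into a (theta, rho) accumulator, exactly as the classic Hough transform does with pixels, and accept a line when enough blobs fall in the same cell. The bin sizes and the vote threshold here are illustrative.

```python
import numpy as np

def lines_from_blobs(centroids, theta_bins=180, rho_bin=4.0, min_votes=6):
    """Hough-style voting on blob centroids instead of individual pixels:
    a line is declared when enough blobs land in the same (theta, rho) cell."""
    thetas = np.linspace(0.0, np.pi, theta_bins, endpoint=False)
    votes = {}
    for (x, y) in centroids:
        for ti, theta in enumerate(thetas):
            rho = x * np.cos(theta) + y * np.sin(theta)
            cell = (ti, int(round(rho / rho_bin)))
            votes.setdefault(cell, []).append((x, y))
    lines = []
    for (ti, ri), members in votes.items():
        if len(members) >= min_votes:            # enough aligned blobs -> accept the line
            lines.append((thetas[ti], ri * rho_bin, members))
    return lines
```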

At the end of the “Court Detection” phase we thus obtain a set of lines like in the picture below.

Court detection usually runs in less than a second.

Ball Trigger

Now that a court has been detected we are in the midst of the action. Detecting the ball is not rocket science: we simply perform blob detection after subtracting two consecutive frames. But in real life a lot of artifacts can make such a simple idea a lot harder to implement.
For example, you might have slow balls rolling on the court, people in the background playing on nearby courts, or blobs created by the subtle movements of a handheld device in front of the classic wire mesh fences around the court.

So, first of all, the blobs detected by our custom algorithm (which runs in parallel on different cores) are stored in a circular buffer.

Then, to filter out all the spurious blobs that are not related to the flying ball you want to measure, we process every frame and discard all the blobs that are not compatible with a flying ball by size, tilt and shape (by looking at their symmetry). We also track the movement of the blobs, so we can filter out slow-moving blobs like the aforementioned rolling balls or the players in the background. For each iteration we then check whether the surviving blobs in the buffer are spatially aligned along a possible trajectory, at a reasonable angle and with a bouncing point within the frame. If a sufficiently long alignment is found (like in the picture below) we move on to the speed estimation.
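The following sketch shows the kind of checks involved; the blob representation, thresholds and helper names are hypothetical, not our actual code.

```python
import numpy as np

def plausible_ball(blob, min_area=6, max_area=400, max_aspect=2.5):
    """Reject blobs whose size or shape is incompatible with a flying ball.
    A blob is represented here as ((x, y), area, (w, h))."""
    (x, y), area, (w, h) = blob
    aspect = max(w, h) / max(1, min(w, h))
    return min_area <= area <= max_area and aspect <= max_aspect

def fast_enough(track, min_px_per_frame=8):
    """Discard slow movers (balls rolling on the court, background players)."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return np.hypot(x1 - x0, y1 - y0) >= min_px_per_frame

def aligned(points, min_points=5, max_residual=3.0):
    """Check whether the surviving blob centroids lie along a plausible trajectory."""
    if len(points) < min_points:
        return False
    pts = np.asarray(points, dtype=float)
    # Fit a straight line y = a*x + b and look at the average residual
    a, b = np.polyfit(pts[:, 0], pts[:, 1], 1)
    residual = np.abs(pts[:, 1] - (a * pts[:, 0] + b)).mean()
    return residual <= max_residual
```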

Blobs Fine Tune

In this phase the goal is to obtain well defined ball blobs and a sharp identification of the bouncing point.

All the “Ball Trigger” processing is performed on 640x360 frames, again a trade-off between resolution and performance: on one side we need to detect small, high-speed objects, but on the other we have limited computational resources to run the algorithm in real time.

But since we cannot compromise on resolution when it comes to accurate speed estimation, we go back to the 1280x720 frames to perform the second-stage processing.

The data obtained in the “Ball Trigger” phase is now used to guide the finely tuned detection. Color information is also used to remove from the list of possible flying balls the shadows that would otherwise create artificial blobs along the flight.
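A minimal sketch of shadow rejection by color, assuming an illustrative “optic yellow” HSV range (the real filter is more elaborate):

```python
import cv2

def looks_like_ball_not_shadow(bgr_frame, bbox):
    """Reject dark, desaturated blobs (shadows) and keep bright yellow-green ones (the ball).
    The HSV bounds here are illustrative, not the app's tuned values."""
    x, y, w, h = bbox
    patch = bgr_frame[y:y + h, x:x + w]
    if patch.size == 0:
        return False
    hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
    mean_h, mean_s, mean_v = hsv.reshape(-1, 3).mean(axis=0)
    is_bright = mean_v > 120                               # shadows are dark
    is_yellow_green = 20 <= mean_h <= 75 and mean_s > 60   # optic-yellow ball hue range
    return is_bright and is_yellow_green
```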

The animated GIF below follows the flight of a ball and shows the effect of our processing: as you can see the shadows below the ball are not marked (red rectangle) as real flying balls.

At the end of this off-line processing we obtain a list of ball coordinates in 2D (or rather, in the image coordinate system) representing the flying ball and its bouncing point: in the next steps we'll translate this “flat” information into a set of 3D points in space and compute the trajectory and the ball's initial speed.


Speed Estimation

Since a CMOS sensor is by definition a flat sensor, it's hard to understand where those blobs were actually located in 3D: the depth information is lost, and there's no way to know whether two blobs were generated by a ball flying at low speed close to the camera (white circles on the blackboard below) or at high speed far from it (red circles).
In this example, over the same time interval the white circles represent a ball that flies for 1 meter while the red ones represent one that covers 2 meters, that is, a ball twice as fast.
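The ambiguity follows directly from the pinhole model, where the pixel displacement is roughly f·ΔX/Z; a tiny worked example (with an illustrative focal length) shows that the two balls above are indistinguishable from the image alone:

```python
# Pinhole projection: pixel displacement = f * (lateral displacement) / (distance to camera)
f_px = 1500.0                      # focal length in pixels (illustrative value)

def pixel_shift(dx_meters, z_meters):
    return f_px * dx_meters / z_meters

# Ball A: 1 m of travel at 10 m from the camera
# Ball B: 2 m of travel at 20 m from the camera (twice as fast over the same time)
print(pixel_shift(1.0, 10.0))      # 150.0 pixels
print(pixel_shift(2.0, 20.0))      # 150.0 pixels -> identical displacement in the image
```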

Our original concept started by using the net as the shooting point, but that turned out to be quite inconvenient, as it wasn't possible to recover the ball-to-camera distance.
To know where the ball was actually passing, we decided to use the bouncing point of the ball on the court. Not only does the bouncing point give an estimate of the ball's distance to the camera, and thus a good approximation of the distances to the camera during the entire trajectory, but it also improved the robustness of our ball detection algorithm, as the background is definitely more homogeneous than the one you would shoot while standing at one end of the net.
Finally, it gave us the opportunity to add a line-calling feature, something no other app had ever offered on the App Store.

But the position of the ball within the court is not sufficient to fully compensate for all the perspective-related issues: what we really need to know is the camera-to-ball distance, so… we also have to estimate the camera-to-court distance!

In this case the standard dimensions of the court come in handy: by selecting reference points (like the intersections of different lines) we have robust fiducial references to understand where the user is standing on the court.

Standard camera pose estimation techniques rely on the availability of 4 non-aligned points lying on the same plane (formally speaking, four coplanar and non-collinear points).
A great reference filled with all the mathematical details for this topic (and much more) is the gold standard book on multiple view geometry by Hartley and Zisserman (http://www.robots.ox.ac.uk/~vgg/hzbook/).

Despite the consistency of the fiducial references offered by every standard tennis court, we couldn't get 4 (non-aligned!) points within the camera's viewing angle. What we can reliably detect are five points (the red-dot intersections in the figure below), but only two of them provide independent information.

So we came up with some smart ideas to simplify the problem.

Camera Pose or Homography Estimation

(WARNING! Skip this paragraph if you are not interested in the mathematical details!)

If we try to simplify in two lines what is needed to compute the relative position of a camera with respect to an object in the 3D world (the court lines, in this case), we can start by listing all the variables you need to know:
- a calibrated camera, meaning a camera with a known focal length (f) and sensor position or principal point (cx, cy), also known as the intrinsic camera parameters. We used a pinhole camera model, the simplest camera model, which assumes no radial distortion. This is a common assumption that worked well for us but could introduce issues in other types of applications.
- the relative position of the camera with respect to the target, that is, a translation vector (the three distances that describe the displacement between the camera and the target in 3D) and the rotation of the camera (again three angles, corresponding to the rotations around the X, Y and Z axes). These are the extrinsic parameters.

So we need 3 intrinsic parameters and 6 extrinsic parameters for a total of 9 unknown variables.
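For reference, this is how the 3 intrinsic and 6 extrinsic parameters enter the projection of a court point into the image; all numbers below are illustrative, not calibration values from our devices.

```python
import numpy as np
import cv2

# Intrinsics (3 parameters): focal length and principal point, in pixels (illustrative)
f, cx, cy = 1500.0, 640.0, 360.0
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics (6 parameters): a rotation (3 angles) and a translation (3 distances)
R, _ = cv2.Rodrigues(np.array([0.3, 0.0, 0.0]))   # rotation matrix built from an angle vector
t = np.array([0.0, 1.6, 12.0])                    # translation vector, in meters (illustrative)

# Project a world point (e.g. a court-line intersection on the ground plane, Z = 0)
X = np.array([1.37, 11.885, 0.0])
x_cam = R @ X + t                  # world -> camera coordinates
u, v, w = K @ x_cam                # camera -> homogeneous image coordinates
print(u / w, v / w)                # pixel coordinates of the projected point
```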

In our problem the intrinsic parameters are known: by calibrating the camera (actually a few devices for each model available on the market; we used the OpenCV calibration routines) we can easily obtain f, cx and cy. For those wondering about the reliability of these values: some variability due to zoom levels and lens fabrication tolerances is of course possible, but variations of a few percent are handled by our algorithm (more on that in a moment). On top of that, testing multiple devices of the same generation we always found very consistent results, with differences in f well below 1%.

On top of these three known values we can leverage the internal sensors to get additional information: two more variables are actually under our control, namely the pitch and roll of the camera (its rotation around the two axes depicted below). Yaw, the third angle, is not usable, as it is essentially the absolute rotation with respect to magnetic north and has no fixed relationship with the court's orientation.

So we have a system with 9 variables and 5 known values (the three intrinsics and 2 angles), which means we are still missing 4 conditions to be able to solve our system.
But each 2D-to-3D (or camera-to-world) point correspondence provides 2 conditions, so our two independent points give us exactly the 4 extra conditions we need to solve the system!
Ideally we have all we need to solve the system exactly and recover the missing information, but due to the many sources of error (the sensors providing the angles, pixel-level errors in the estimation of the intersections, variability of the initial calibration parameters) we iteratively solve the system for a range of input values (different roll, pitch and f values) and then re-project the court lines for each solution. We then search for the solution that minimizes the distance between the re-projected lines and the real lines (detected during the “Court Detection” phase). This last process is shown in the animation below.
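A simplified sketch of this search, with hypothetical helper names: for each candidate (roll, pitch, f) the translation is solved linearly from the 2D-3D correspondences, and the candidate with the lowest re-projection error wins. The real implementation also re-projects the full court lines and deals with the degenerate point configuration discussed above.

```python
import itertools
import numpy as np
import cv2

def rotation(roll, pitch):
    """Compose the roll and pitch rotations reported by the device sensors."""
    Rx, _ = cv2.Rodrigues(np.array([roll, 0.0, 0.0]))
    Ry, _ = cv2.Rodrigues(np.array([0.0, pitch, 0.0]))
    return Ry @ Rx

def solve_translation(world_pts, image_pts, R, f, cx, cy):
    """With rotation and focal length fixed, each 2D-3D correspondence gives two
    equations that are linear in the translation (tx, ty, tz)."""
    A, b = [], []
    for X, (u, v) in zip(world_pts, image_pts):
        Y = R @ np.asarray(X, float)
        A.append([f, 0.0, -(u - cx)]); b.append((u - cx) * Y[2] - f * Y[0])
        A.append([0.0, f, -(v - cy)]); b.append((v - cy) * Y[2] - f * Y[1])
    t, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return t

def reprojection_error(world_pts, image_pts, R, t, f, cx, cy):
    """Average pixel distance between the re-projected references and the detected ones."""
    K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])
    rvec, _ = cv2.Rodrigues(R)
    proj, _ = cv2.projectPoints(np.asarray(world_pts, float), rvec, t, K, None)
    return np.linalg.norm(proj.reshape(-1, 2) - np.asarray(image_pts, float), axis=1).mean()

def best_pose(world_pts, image_pts, f0, cx, cy, roll0, pitch0):
    """Scan a small neighbourhood of the sensor angles and the nominal focal length,
    keeping the candidate whose re-projected references best match the detected ones."""
    best_err, best_params = np.inf, None
    for df, dr, dp in itertools.product(np.linspace(-0.03, 0.03, 5),
                                        np.linspace(-0.02, 0.02, 7),
                                        np.linspace(-0.02, 0.02, 7)):
        f, roll, pitch = f0 * (1 + df), roll0 + dr, pitch0 + dp
        R = rotation(roll, pitch)
        t = solve_translation(world_pts, image_pts, R, f, cx, cy)
        err = reprojection_error(world_pts, image_pts, R, t, f, cx, cy)
        if err < best_err:
            best_err, best_params = err, (f, roll, pitch, t)
    return best_err, best_params
```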

2D-to-3D Projection

At this point we can assume we have a reliable homography matrix linking points on the flat 2D image sensor to points lying on the plane of the court lines.
A final step is needed to estimate the 3D position of the flying ball, so that we can compute its instantaneous speed for each pair of grabbed frames. The homography matrix only enables transformations between those two planes (the sensor on one side, the court on the other), so we assumed that the ball during its flight lies on an imaginary plane (the “Serving Plane”) containing the line that connects the bouncing point and the serving point. The serving point is assumed to be 1 meter from the center line.

In this way it is possible to estimate the height of the ball during its flight with good accuracy and, from the raw 2D coordinates, compute the corresponding speed.
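As a sketch of the 2D-to-3D step, a homography estimated from four court-line intersections can map the detected bouncing point from pixels to court coordinates. The pixel values below are made up; the court coordinates follow the standard court dimensions. The additional “Serving Plane” construction is not shown here.

```python
import numpy as np
import cv2

# Four court-line intersections detected in the image (pixels, illustrative values)...
image_pts = np.array([[412, 583], [902, 571], [318, 688], [1010, 668]], dtype=np.float32)

# ...and the corresponding points on the court plane, in meters: singles sidelines are
# 8.23 m apart, the service line is 6.40 m and the baseline 11.885 m from the net
court_pts = np.array([[0.0, 6.40], [8.23, 6.40], [0.0, 11.885], [8.23, 11.885]],
                     dtype=np.float32)

H, _ = cv2.findHomography(image_pts, court_pts)

# Map the detected bouncing point from image coordinates to court coordinates
bounce_px = np.array([[[655.0, 610.0]]], dtype=np.float32)
bounce_court = cv2.perspectiveTransform(bounce_px, H)
print(bounce_court.ravel())    # (x, y) position of the bounce on the court plane, in meters
```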

Initial Speed Estimation

In PRO tournaments the initial speed is measured, but we cannot shoot the entire scene from the service point down to the landing spot (and we really want the landing point to provide line-calling capabilities!), so we need to estimate the initial speed from the available information.

The idea is very simple and fully based on the basic physics that governs the dynamics of a ball in the air.
An interesting article on the parameters affecting the flight of a tennis ball can be found here.

We basically started by numerically solving (using a fourth-order Runge-Kutta method) the system of equations that describes the flight of a ball in the air.
Then we find the solution that minimizes the “distance” between the calculated trajectory points (their positions and speeds) and those of the points captured with the camera.
In this way we can estimate not only the initial speed of the ball but also its trajectory.
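A minimal sketch of this flight model and fit: gravity plus quadratic aerodynamic drag integrated with a fourth-order Runge-Kutta step, and a brute-force search over launch speeds. The ball mass, diameter and drag coefficient are standard published values; everything else is illustrative, and spin (the Magnus force) is ignored here.

```python
import numpy as np

# Standard tennis ball parameters and sea-level air density
m, d, rho, Cd, g = 0.057, 0.067, 1.21, 0.55, 9.81
A = np.pi * (d / 2) ** 2

def accel(v):
    """Gravity plus quadratic aerodynamic drag opposing the velocity."""
    speed = np.linalg.norm(v)
    drag = -0.5 * rho * Cd * A * speed * v / m
    return drag + np.array([0.0, 0.0, -g])

def rk4_step(p, v, dt):
    """One fourth-order Runge-Kutta step of the ball state (position, velocity)."""
    k1p, k1v = v, accel(v)
    k2p, k2v = v + 0.5 * dt * k1v, accel(v + 0.5 * dt * k1v)
    k3p, k3v = v + 0.5 * dt * k2v, accel(v + 0.5 * dt * k2v)
    k4p, k4v = v + dt * k3v, accel(v + dt * k3v)
    p_next = p + dt / 6 * (k1p + 2 * k2p + 2 * k3p + k4p)
    v_next = v + dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
    return p_next, v_next

def simulate(p0, v0, t_max, dt=1 / 240):
    """Integrate the trajectory at 240 steps per second."""
    p, v = np.array(p0, float), np.array(v0, float)
    traj = [p.copy()]
    for _ in range(int(round(t_max / dt))):
        p, v = rk4_step(p, v, dt)
        traj.append(p)
    return np.array(traj)

def fit_initial_speed(observed_pts, observed_times, p0, direction, speeds):
    """Pick the launch speed whose simulated trajectory best matches the observed 3D points
    (direction is a unit launch vector; speeds is the set of candidates to try)."""
    best_err, best_speed = np.inf, None
    for s in speeds:
        traj = simulate(p0, s * np.asarray(direction, float), observed_times[-1])
        idx = (np.array(observed_times) * 240).astype(int)   # frame indices matching dt
        err = np.linalg.norm(traj[idx] - np.asarray(observed_pts, float), axis=1).mean()
        if err < best_err:
            best_err, best_speed = err, s
    return best_speed
```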


Final Performance Validation

Just to give a final sense of how well all this stuff comes together, here are some key statistics on accuracy.
All our algorithms have been validated against a professional radar with 0.5 km/h accuracy.
In 90% of our tests the error was below 5%, and in the remaining tests below 10%, with an average error of 3.5%.

This performance is enough for every tennis player who wants to track and improve their serve, or is simply curious to know how their shot compares to those of the PROs.
A recent study performed with a high-speed camera (“Validation of a live, automatic ball velocity and spin rate finder in tennis”, available here) reported a similar accuracy and considered it an effective tool for players and coaches.

Final comment on Computer Vision Apps

Computer vision has achieved significant milestones over the last decades and is now getting closer and closer to our everyday life, thanks to huge improvements in CMOS sensors and the integration of multicore CPUs into mobile devices. It is still uncommon to see applications that need high-accuracy results in real time running on mobile phones, so we are really proud of our performance and happy to see that all our efforts finally became a real product.

And don’t forget to download the Lite version (for free!) or the Pro version ($4.99)!

Many more details and subtle technicalities could be added here, but we think we've already made this too long, so please feel free to contact us for more details at our email (info at tenniscamera dot com) or by connecting with us at the links below.

Tomas Camin + Tommaso Borghi = the Tennis Camera Team
