Machine Learning in Tennis: U.S. Open Men’s Singles 2019 Predictions. Who Will Win the Grand Slam?

Manja Bogicevic
Published in Kagera
Aug 31, 2019

The US Open has begun, and the entire world is watching. Tennis fans are excited to watch Federer, Djokovic, and Nadal compete on the court. Fans of betting and machine learning are excited to see how predictive models perform when trained on nine seasons of ATP and WTA matches and on factors such as playing surface. There is a real possibility of predicting the winner of this year’s US Open.
For Novak Djokovic, this is a chance to overtake Federer. We are all wondering if he is going to succeed, but as you all already know, I don’t believe in miracles, just in hard work and data.

Last year I played with football data and predicted the World Cup winner; now I am playing with tennis data. For those of you who don’t know about my sports career, I played professional tennis a decade ago. So in this blog, I am combining my two biggest loves: tennis and machine learning (come on :) ).

How prediction actually works, and the world’s best examples

By combining data that predicts match outcomes with post-match statistics (number of unforced errors, break points won, double faults), “Andre Coran and his team have built a data source that contains information on 46,114 matches. To build the most accurate model, Andre gathered a wealth of match information from the past two decades. Andre also tested several machine learning algorithms before he was satisfied with his results. He felt it was necessary to give a variety of machine learning schools of thought a chance and landed on Bayesian machine learning. The bottom line is that the most successful algorithm scales quickly without expending too much computer processing power, which made the Bayesian approach the obvious winner due to its ability to handle a rapidly changing data set, like the stock market. Yes, you can say tennis is like a stock market.”
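
To make the "rapidly changing data set" point concrete, here is a minimal sketch of Bayesian updating in Python. It is not Andre's model: the `BetaWinRate` class, the uniform prior, and the match results are all assumptions made up for illustration. It only shows why a Bayesian estimate is cheap to keep current as new matches come in.

```python
from dataclasses import dataclass


@dataclass
class BetaWinRate:
    """Beta(alpha, beta) belief about a player's probability of winning a match."""
    alpha: float = 1.0  # prior pseudo-count of wins (uniform prior, an assumption)
    beta: float = 1.0   # prior pseudo-count of losses

    def update(self, won: bool) -> None:
        # Conjugate update: each new result only increments one counter,
        # so new matches are absorbed without reprocessing the old ones.
        if won:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior mean estimate of the win probability.
        return self.alpha / (self.alpha + self.beta)


belief = BetaWinRate()
for result in [True, True, False, True]:  # made-up results, for illustration only
    belief.update(result)

print(round(belief.mean, 3))
```

Each update is a single addition, which is the kind of low processing cost the quote describes. A real model would condition on much more than a win/loss count.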

The most common questions we ask ourselves at our company, Kageera, are “How can we tune this hyperparameter?” and “How accurate is the model?” First, a hyperparameter is a setting chosen before training that controls how the model learns from the data, rather than a value the model learns itself. Tuning these settings involves experimenting with a few different values, analyzing the results, and repeating the process. As for the second question, the algorithm is trained on different subsets of the input data, and its predictions are compared against a held-out test set on which the model was not trained at all. Here, the trained model is scored on data it has never seen, which tells us how well it has learned to generalize patterns rather than memorize them.
When we talk about the random forest, the biggest pros are that the model is quick to train and that it gives an accurately fitted betting curve for predicting who wins. On the other hand, cross-validation gives only 69% accuracy.
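
The tune-and-validate loop can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our production workflow: the data is synthetic, the model is a toy nearest-neighbour classifier standing in for a random forest, and the names `knn_predict` and `cross_val_accuracy` are hypothetical. The hyperparameter being tuned is `k`, the number of neighbours.

```python
import random


def knn_predict(train, x, k):
    """Predict 1 (win) or 0 (loss) by majority vote of the k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def cross_val_accuracy(data, k, n_folds=5):
    """Mean accuracy over n_folds; each fold is scored on rows it was not trained on."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Synthetic data (an assumption, not real ATP results): the feature is a
# ranking gap, and the better-ranked player wins more often.
rng = random.Random(0)
data = []
for _ in range(200):
    gap = rng.uniform(-1.0, 1.0)
    won = 1 if rng.random() < 0.5 + 0.4 * gap else 0
    data.append((gap, won))

# Try a few values of the hyperparameter k and keep the best-scoring one.
results = {k: cross_val_accuracy(data, k) for k in (1, 5, 15, 31)}
best_k = max(results, key=results.get)
print(results, best_k)
```

The important detail is that every score comes from rows that were held out of training for that fold, so the number reflects generalization rather than memorization.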

When we talk about neural networks, the main pro is the ability to capture complex relationships. On the other hand, they were less accurate than the random forest model, and tuning their parameters takes significant computing power. We can say that XGBoost could be the best fit in most cases.

When data comes from a sport like tennis, with an enormous number of variables and thousands of documented matches, tuning the parameters is vital to the continuous improvement of the model. Analyzing matches involves a high degree of complexity because you have to account for factors that are hard to quantify, including surface, previous match history, and player confidence. It would be truly incredible to see the evolution of the random forest model.

With the 2019 US Open starting, let’s see if we can predict how this tournament is going to finish. The most pressing question is: will Novak Djokovic continue his domination, or will we finally see the next generation break out?
Continuing the approach I used previously on football data, and the random forest I used for the Wimbledon predictions (and following the Kageera machine learning workflow that we developed in-house), we simulated both the men’s and women’s draws for the 2019 US Open.

Behind the scenes

We started with the result of every ATP and WTA tour match from 2010 through 2018. Using this data, we built a historical dataset containing past results (both overall and surface-specific) and tournament information, then compared random forest and XGBoost to determine the best model for predicting the probability that a player would win a set.
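
As a rough illustration of what "overall and surface-specific" past results can look like as features, here is a minimal Python sketch. The match records, player names, and the `win_rate` helper are hypothetical; the real dataset and feature set are richer than this.

```python
from collections import defaultdict

# Hypothetical match records: (winner, loser, surface). Not real results.
matches = [
    ("Player A", "Player B", "hard"),
    ("Player A", "Player C", "clay"),
    ("Player B", "Player A", "hard"),
    ("Player C", "Player B", "hard"),
]

wins = defaultdict(int)    # keyed by (player, surface); "all" means every surface
played = defaultdict(int)

for winner, loser, surface in matches:
    for key in ((winner, surface), (winner, "all")):
        wins[key] += 1
        played[key] += 1
    for key in ((loser, surface), (loser, "all")):
        played[key] += 1


def win_rate(player, surface="all"):
    """Share of matches won on the given surface; None if no matches were played there."""
    n = played[(player, surface)]
    return wins[(player, surface)] / n if n else None


print(win_rate("Player A"), win_rate("Player A", "hard"), win_rate("Player C", "grass"))
```

When features like these feed a model, each match should only use results from earlier matches, otherwise the model is trained on information it would not have had at prediction time.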
Once we had built this machine learning model, we could take the draw of any tournament and simulate the results 70,000 times to find out how often each player would win with that particular draw.
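
The simulation step can be sketched roughly as follows. This is a simplified illustration, not our actual code: the four-player draw, the `ratings` table, and the `set_prob` function are placeholders standing in for the trained model, sets are assumed to be independent, and the run uses 10,000 simulations instead of 70,000 to keep it fast.

```python
import random
from collections import Counter


def match_prob_from_set_prob(p):
    """Probability of winning a best-of-five match given a per-set win
    probability p, assuming sets are independent (a simplification)."""
    q = 1 - p
    # Win 3-0, 3-1, or 3-2: p^3 * (1 + 3q + 6q^2).
    return p**3 * (1 + 3 * q + 6 * q**2)


def simulate_tournament(draw, set_prob, rng):
    """Play one single-elimination bracket and return the champion.
    `draw` lists players in bracket order; its length must be a power of two."""
    players = list(draw)
    while len(players) > 1:
        next_round = []
        for a, b in zip(players[::2], players[1::2]):
            p_match = match_prob_from_set_prob(set_prob(a, b))
            next_round.append(a if rng.random() < p_match else b)
        players = next_round
    return players[0]


# Hypothetical ratings standing in for a trained model's output.
ratings = {"Player A": 3.0, "Player B": 2.0, "Player C": 1.5, "Player D": 1.0}


def set_prob(a, b):
    # Placeholder for the model: chance that a beats b in a single set.
    return ratings[a] / (ratings[a] + ratings[b])


rng = random.Random(42)
n_sims = 10_000
titles = Counter(simulate_tournament(list(ratings), set_prob, rng) for _ in range(n_sims))
title_odds = {player: titles[player] / n_sims for player in ratings}
print(title_odds)
```

The share of simulated tournaments each player wins is the estimate of their title chance for that draw. The best-of-five conversion applies to the men's matches; the women's draw is played best-of-three and would need the corresponding formula.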
With the draw completed, we know the 128 men and 128 women who are going to compete in the 2019 tournament. Based on our simulations, Federer is the favorite among the men, with a 33% chance of winning. He is followed closely by Novak Djokovic with 31%. Nadal is in third place, and Dominic Thiem follows with 2%. We should certainly keep an eye on how Dominic does at this championship.

Interested in Sports Analytics?
Send me a line; our company Kageera and I (along with our worldwide machine learning experts) are here to help your company benefit from machine learning.

P.S.

Drop me a line at manja.bogicevic[a]kageera.com or on LinkedIn.

Yours Manja

Manja Bogicevic, Founder & CEO of Kagera.ai | Optimize production & minimize downtime with machine learning