Australian Open 2020: Predicting ATP Match Outcomes

Hong Xiang Yue
Published in Analytics Vidhya · 11 min read · Jan 19, 2020
Photo by Ben Stansall on GettyImages

If you’re any sort of tennis fan, you’ve probably been looking forward to watching the 2020 Australian Open and guessing who’s going to clinch the top spot. And if you’re a data nerd like me, you might be interested in how we can use data to inform our predictions.

This article gives an overview of the workflow and methodology I used to arrive at my predictions. The project shares the same aim as Betfair's 2019 AO Datathon, namely predicting the outcome of every possible matchup in the men's division, but applied to the 2020 Australian Open. My methodology leans heavily on the work of Betfair's data scientists, in particular Qile Tan's notebook, which you can check out here. I've repurposed Tan's code to suit the task, making heavy modifications in some sections and leaving others relatively untouched. My notebook, the full code used to make the predictions, and the predictions themselves are on my GitHub repository here.

Disclaimer: I am not sponsored by Betfair, I do not endorse gambling, and if you choose to use my findings for your own betting strategies, then you are responsible for any gains or losses you may incur.

The data

The data we'll be using is ATP match-level data obtained from the R package 'deuce', written by Stephanie Kovalchik (skoval). This package scrapes tennis data from a variety of sources including Jeff Sackmann's GitHub repo. As far as I'm aware this data is publicly available, so you're free to have a play around with it yourself.

Each row contains information about a particular match such as the name of the winner and the loser, their ranks and country of origin, and various match level statistics for each player, such as the number of games won, the number of service points won, how many times they double faulted etc.

The first step before we can jump into using a machine learning model is to clean the data. We’ll need to parse out the game scores, reconstruct returns data and fill in missing values. I’m not going to go through the details of it here, but you can check out my GitHub if you want to see the code.
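To give a taste of what that parsing involves, here's a minimal sketch of turning a raw score string into games won and lost. The score format, function name and handling of retirements are my own illustration, not the code from the repository:

```python
def parse_games(score: str):
    """Parse a score string like '6-4 7-5' or '7-6(3) 6-4' into
    (winner_games, loser_games). Retirements/walkovers are skipped."""
    winner_games, loser_games = 0, 0
    for set_score in score.split():
        if set_score.upper() in ("RET", "W/O"):
            continue
        w, l = set_score.split("-")
        winner_games += int(w.split("(")[0])  # drop tiebreak detail, e.g. '7-6(3)'
        loser_games += int(l.split("(")[0])
    return winner_games, loser_games

print(parse_games("6-4 7-5"))  # (13, 9)
```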

Understanding the task

Before diving into the machine learning, we'll need to develop an understanding of what exactly we're trying to accomplish. My objective is to predict the outcomes of all possible matchups at the Australian Open, using only player statistics available before the tournament begins. For example, if Nadal were to cross paths with Federer we want to predict who would win, but we'll also want to predict how Nadal fares against Djokovic, Medvedev, Tsitsipas and every other player in the tournament. We'll then do the same for Federer against the rest of the field, and repeat the process for every other contestant.

Two things to note. First, not all matchups that we make predictions for will be realised. Federer may get knocked out in the first round, but we'll still want predictions for his matchups against the other players in the tournament for the much more likely case that he survives. The same goes for a relatively unknown player like Michael Mmoh: there's a high probability that he will get knocked out early on, but we can't be 100% sure, so we'll still need predictions for his matchups with the rest of the field. Second, these predictions remain constant regardless of when the matchup takes place in the tournament; we won't be able to factor in performance in earlier rounds to update our predictions for later rounds. The predictions are contingent only on statistics available before the tournament starts.

Even then, we’ll still need to carefully define what our target variable is. For example, given a head-to-head comparison between Djokovic and Nadal we will likely ask ourselves, who is going to win? The problem with phrasing the question like this is that there are two possible targets we can model: the probability that Nadal wins, or the probability that Djokovic wins.

Rafael Nadal and Novak Djokovic at the 2019 Australian Open. Photos by Michael Dodge and Mark Kolbe, provided by GettyImages.

Our modelling process becomes more concrete if we call Djokovic player_1 and Nadal player_2 and define our target to be the probability that player_1 wins. If you want the probability that Nadal wins, you can take 1 - probability_player_1_wins, or swap the order of player_1 and player_2 (note that these two methods are not exactly equivalent; more on that later). I've chosen to train and predict on both orderings: one where the actual winner sits in the player_1 slot, and the reverse where the winner sits in the player_2 slot. Note that Tan randomly allocates the order so that for a given match there is a 50/50 chance that player_1 is the winner. My method differs in that for a given matchup I include both orderings rather than randomly selecting one.
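As a minimal sketch of that duplication step (the column names here are my own illustration, not the ones in the raw data):

```python
import pandas as pd

# Hypothetical match-level frame: one row per match, winner/loser columns.
matches = pd.DataFrame({
    "winner_name": ["novak djokovic"],
    "loser_name": ["rafael nadal"],
})

# Ordering 1: the winner sits in the player_1 slot, so the target is 1.
ordered_1 = matches.rename(columns={"winner_name": "player_1", "loser_name": "player_2"})
ordered_1["player_1_wins"] = 1

# Ordering 2: the winner sits in the player_2 slot, so the target is 0.
ordered_2 = matches.rename(columns={"winner_name": "player_2", "loser_name": "player_1"})
ordered_2["player_1_wins"] = 0

# Stack both orderings so every match appears twice, once per ordering.
training_rows = pd.concat([ordered_1, ordered_2], ignore_index=True)
print(training_rows)
```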

To help you visualise what’s going on, here’s a sample of my final predictions. Notice how Rafael Nadal appears as player_1 and player_2.

The features we will use for our predictions are the differences in each player's average match statistics over their previous x matches. For example, if we are considering a matchup between Alexander Zverev as player_1 and Stefanos Tsitsipas as player_2, we average Zverev's statistics (e.g. percentage of games won) across his last 10 matches; let's say this number is 0.63. We then do the same thing for Tsitsipas; let's say his average game win ratio is 0.68. Note that these averages are taken over the matches each player individually participated in, not over their previous clashes with each other.

We take the difference between the two player features, 0.63 - 0.68 = -0.05, and use it as a feature for predicting whether player_1 (Alexander Zverev) wins. Since this is a negative number, we would naturally expect the odds to be slightly stacked against Zverev (all else equal). We can do this for a variety of other player statistics such as the player rank, percentage of first and second serves won or the percentage of return points won. Here's an example of what it could look like:
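In code, the differencing might look something like this. The rolling averages and ranks below are made-up numbers for illustration; only the resulting feature names (player_game_win_ratio_diff, player_rank_diff) match those used later in the article:

```python
import pandas as pd

# Hypothetical rolling averages for one matchup (player_1 = Zverev, player_2 = Tsitsipas).
matchup = pd.DataFrame({
    "player_1_game_win_ratio": [0.63],
    "player_1_rank": [7],
    "player_2_game_win_ratio": [0.68],
    "player_2_rank": [6],
})

# Each feature is simply the player_1 value minus the player_2 value.
for stat in ["game_win_ratio", "rank"]:
    matchup[f"player_{stat}_diff"] = (
        matchup[f"player_1_{stat}"] - matchup[f"player_2_{stat}"]
    )

print(matchup[["player_game_win_ratio_diff", "player_rank_diff"]])
```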

Transforming the data

Now that we know the structure of the data we want for training and testing, we'll need to transform the raw data into that format. This part isn't particularly interesting (in fact it's quite tedious and a lot of hard work), but it's what makes the problem unique and differentiates it from your usual Kaggle competitions or datathons.

The first step will be to convert the data into long format, so that for a particular match the winner's and the loser's statistics each sit in their own row. Then we will need to convert the raw player-match statistics from absolute values to relative ratios. This is important as the absolute value of a player's statistics depends on the length of the match.
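Roughly, the winner/loser reshape might look like this (a sketch with hypothetical column names, not the full set of statistics):

```python
import pandas as pd

# Hypothetical wide format: one row per match, winner and loser statistics side by side.
matches = pd.DataFrame({
    "tourney_name": ["Australian Open"],
    "winner_name": ["roger federer"],
    "winner_games_won": [13],
    "loser_name": ["nick kyrgios"],
    "loser_games_won": [9],
})

cols = ["tourney_name", "player_name", "player_games_won"]
# Long format: one row per player per match.
winners = matches.rename(columns=lambda c: c.replace("winner_", "player_"))[cols]
losers = matches.rename(columns=lambda c: c.replace("loser_", "player_"))[cols]
long_format = pd.concat([winners, losers], ignore_index=True)
print(long_format)
```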

For instance, suppose Federer plays a best of 3 match against Kyrgios and beats him 6–4, 7–5. Federer wins a total of 13 games. If Djokovic wins an epic 5-setter against Nishikori 7–6, 3–6, 5–7, 6–2, 7–6, he’ll win a total of 28 games, more than twice as many as Federer. This isn’t a fair comparison as we’re comparing a best of 3 match with a best of 5. It makes more sense to compare their game win ratios. For Federer, his game win ratio is (6+7)/(6+4+7+5) = 0.59, for Djokovic, his game win ratio is 0.51. Comparing these two ratios is more sensible than using the totals.

Photo by Michael Dodge, provided by GettyImages

The next major step will be to aggregate the statistics for a given player over the previous x number of matches before a given tournament. While it seems rather intuitive to think about, writing code for it is somewhat tricky. I’ve made some modifications to Tan’s code such as allowing an adjustment for the rolling window length, but for the most part I’ve left the code unchanged. Note that we will only be creating aggregations for the Australian and US Open to save on computation time and because the other tournaments such as Wimbledon and Roland Garros have different dynamics (more on that later). This will need to be repeated for all players in our dataset across all Australian and US Opens from 2000–2019 (and for 2020). Here’s what some of the output will look like for Roger Federer:

Federer’s rolling averages prior to the 2005–2008 Australian Opens
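Under the hood, the rolling aggregation might be sketched like this. The column names are placeholders and Tan's actual implementation handles the window and tournament cutoff dates more carefully; this is just the core idea:

```python
import pandas as pd

def rolling_averages(long_format: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Average each player's match statistics over their previous `window` matches,
    excluding the current match (hence the shift(1))."""
    long_format = long_format.sort_values(["player_name", "tourney_date"]).copy()
    stats = ["player_game_win_ratio", "player_serve_win_ratio"]  # hypothetical columns
    rolled = (
        long_format
        .groupby("player_name")[stats]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    for stat in stats:
        long_format[f"{stat}_avg"] = rolled[stat]
    return long_format
```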

These aggregates will need to be merged with the match-level data; the merge keys are the tournament date and the player's name, matched once for player_1 and once for player_2:

Merging Federer’s match level data with his and his opponents’ rolling averages
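In pandas, that merge might look something like this (a toy sketch with hypothetical frames and column names):

```python
import pandas as pd

# Hypothetical pre-tournament rolling averages, one row per player per tournament date.
rolling_avgs = pd.DataFrame({
    "player_name": ["roger federer", "novak djokovic"],
    "tourney_date": ["2019-01-14", "2019-01-14"],
    "game_win_ratio_avg": [0.61, 0.64],
})

# Hypothetical match-level rows for the same tournament.
match_data = pd.DataFrame({
    "player_1": ["roger federer"],
    "player_2": ["novak djokovic"],
    "tourney_date": ["2019-01-14"],
})

# Merge the averages on once for player_1 and once for player_2.
merged = (
    match_data
    .merge(rolling_avgs.add_prefix("p1_"),
           left_on=["player_1", "tourney_date"],
           right_on=["p1_player_name", "p1_tourney_date"])
    .merge(rolling_avgs.add_prefix("p2_"),
           left_on=["player_2", "tourney_date"],
           right_on=["p2_player_name", "p2_tourney_date"])
)
print(merged[["player_1", "player_2", "p1_game_win_ratio_avg", "p2_game_win_ratio_avg"]])
```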

Notice how Federer's aggregate statistics prior to a given tournament remain constant across rows. The aggregates for any given player_2 are likewise constant, but Federer is paired with a different opponent in each row, which is what drives the variation from match to match. We'll also take the differences between the player_1 and player_2 aggregates to reduce the number of features and hence our computation time. Intuitively this works because whether or not Federer wins a match depends on how good his opponent is in comparison to him.

Modelling

With all the data wrangling out of the way, we can move on to the fun part, training our Machine Learning model! For our model, I’ve chosen to use an xgboost classifier with the following settings:

from xgboost import XGBClassifier

model = XGBClassifier(
    objective="binary:logistic",
    n_estimators=300,
    learning_rate=0.02,
    max_depth=6,
)

# Hold out a validation set so early stopping can curb overfitting.
eval_set = [(X_val, y_val)]
model.fit(
    X_train,
    y_train,
    eval_set=eval_set,
    eval_metric="auc",
    early_stopping_rounds=20,
)

We'll also need to split up the data into a train and validation set to prevent xgboost from overfitting. I've chosen to go with a simple train-validate split, training on the Australian and US Open tournaments from 2000-2017 and validating on the same tournaments from 2018-2019. Ideally I would like to use forward chaining, but I haven't got around to coding that up :(.
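For reference, a minimal sketch of that year-based split, assuming the merged training frame is called features and carries a datetime tourney_date column alongside the diff features and target:

```python
# Train on AO/US Open matches up to 2017, validate on 2018-2019.
train = features[features["tourney_date"].dt.year <= 2017]
val = features[features["tourney_date"].dt.year >= 2018]

feature_cols = [c for c in features.columns if c.endswith("_diff")]
X_train, y_train = train[feature_cols], train["player_1_wins"]
X_val, y_val = val[feature_cols], val["player_1_wins"]
```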

It's also very important that we do not include matches from either Wimbledon or the French Open, as playing styles differ noticeably depending on the surface. For example, on the clay courts of the French Open the ball bounces higher but travels slower, making matches more of a grind with fewer winners being hit. In addition, players' movement patterns and footwork involve a lot more sliding due to the slipperiness of the surface. For a visual illustration of this, watch Nadal eat Wawrinka alive in the 2017 Roland Garros final:

A similar argument can be made for Wimbledon as well. The grass court surface means that the ball travels faster, favouring the serve and volley strategy.

Plugging our training and validation sets into the xgboost model, we get a final validation AUC of 0.78, suggesting a decent amount of predictive capacity.

Stopping. Best iteration:
[147] validation_0-auc:0.782506

I ended up creating an excess of features and decided to cull a few that seemed to hinder model performance. To analyse each feature's contribution to predictive power we can use the inbuilt feature_importances_ attribute. This essentially measures the proportion of splits across the boosted trees that use a given feature: the more often a feature is used, the more likely it is to be a strong driver of predictive accuracy. You can read more about it here.

pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

player_log_rank_diff                    0.617868
player_game_win_ratio_diff              0.109231
player_point_win_ratio_weighted_diff    0.080940
player_serve_win_ratio_diff             0.075152
player_rank_diff                        0.060499
player_return_win_ratio_diff            0.056310

Unsurprisingly, the strongest feature as identified by xgboost is the difference in the players’ (log) ranks. This is also corroborated through permutation importance:

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model).fit(X_val, y_val)
eli5.show_weights(perm, feature_names=X_val.columns.tolist())

Permutation importance essentially involves mucking around with a particular feature by randomly shuffling the order of its observations and seeing how that affects predictive accuracy. If accuracy sharply declines, it's a good indicator that the feature in question was really important; if it barely changes, that feature probably isn't important to your model. You can find out more here.

Making predictions

With most of the infrastructure laid out for us, making predictions is now relatively straightforward. In the data transformation step, we should have created aggregates for all players prior to the start of the 2020 Australian Open. In a similar fashion, these will need to be merged with our submissions data by matching on player name (for both player_1 and player_2) and the tournament date (2020–01–15). Taking differences, plugging into our model and joining the predictions back onto the submissions file, here’s what we can expect to see:
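In code, the final step might look roughly like this, assuming the submission frame already carries the same *_diff feature columns that the model was trained on:

```python
# `submission` is assumed to hold one row per (player_1, player_2) pairing
# with the same *_diff feature columns used in training.
X_2020 = submission[feature_cols]

# predict_proba returns [P(player_1 loses), P(player_1 wins)] per row.
submission["player_1_win_probability"] = model.predict_proba(X_2020)[:, 1]
```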

Sorting the players by their average win rate we can get a good idea of who’s likely to take the top spot:

atp_pred_submission.groupby('player_1')['player_1_win_probability'].agg('mean').sort_values(ascending=False).head(10)

novak djokovic                 0.923752
roger federer                  0.901148
rafael nadal                   0.898214
dominic thiem                  0.835166
daniil medvedev                0.804219
stefanos tsitsipas             0.795601
alexander zverev               0.786144
gael monfils                   0.760937
diego sebastian schwartzman    0.748177
roberto bautista agut          0.745087

Possible concerns

While working on this project there have been a few issues which I haven’t had time to fully address. As mentioned before, swapping the order of the players and taking 1-probability of a particular player winning are not exactly equivalent. Here’s an example of what I mean:

Kill Bill sirens in the background

While close, the predicted probabilities don't add up to exactly 1: 0.398234 + 0.606266 = 1.0045. An obvious solution is to normalise them by dividing both by 1.0045; however, the code to implement this in a form usable by xgboost could be quite troublesome to write.
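If you only need the fix on the final predictions file rather than inside the xgboost pipeline itself, a post-hoc normalisation sketch could look like this (using the column names from the submission above):

```python
# Build an order-independent key for each pairing, then rescale the two
# predicted probabilities for that pairing so they sum to exactly 1.
pair_key = atp_pred_submission.apply(
    lambda row: tuple(sorted([row["player_1"], row["player_2"]])), axis=1
)
pair_totals = (
    atp_pred_submission["player_1_win_probability"].groupby(pair_key).transform("sum")
)
atp_pred_submission["player_1_win_probability_normalised"] = (
    atp_pred_submission["player_1_win_probability"] / pair_totals
)
```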

An additional concern is the validation strategy I have chosen to use. As I mentioned before, I would ideally like to use a forward chaining strategy, but haven't gotten around to coding it up. Another question is how lower-level tournaments should be factored in: while we include them in our averages, should we also build aggregations prior to each of these tournaments and include them in our training and validation sets?

Furthermore, the Deuce dataset hasn't been updated to include matches from the first two weeks of 2020, such as the ATP Cup and a variety of other 250 and 500 level tournaments across Australia and New Zealand. Our 2020 aggregations will therefore be slightly out of date, something which won't be appropriately reflected in our validation scores, as those were computed with up-to-date data.

But since the Australian Open begins tomorrow I’ve run out of time to explore these issues in an unbiased manner. Perhaps this can be explored for future tournaments.

Happy modelling!
