Improving a tennis rating system based on the Glicko-2 system

Taiga Tachibana
9 min read · Sep 15, 2018

Sports are the hobby I am most passionate about. I love watching not only domestic events but also top-level competition from around the world.

Since I am studying machine learning, I started trying to predict sports results. To begin with, I chose tennis, which is one of my favorite sports.

As far as I know, tennis is a relatively predictable sport, because matches are one-on-one (in singles), the rules are the same worldwide, and players compete regularly throughout the year.

In football or basketball, different members of the same team take part in each match, so it is hard to predict who will play. Likewise, in combat sports such as boxing, athletes compete only a few times a year, so it is hard to gauge their current skill.

How to know a player’s skill

A simple and accurate way to know a player’s skill is to assign a rating. Rating systems have been used for a long time, and the “Elo rating system” is a representative one.

The Elo rating was developed to measure the skill of chess players, but it is now applied to various sports, including tennis. The system is simple, yet it is reasonably accurate as long as matches are played regularly.

Elo rating

A summary of the Elo rating calculation is as follows.

  1. Players are given a default rating of 1400. (This value can be changed to suit the situation.)
  2. The expected winning chance is calculated from the player’s rating and the opponent’s rating. The higher the opponent’s rating, the lower the chance, and vice versa.
  3. The expected winning chance is subtracted from the actual result (1.0 for a win, 0.0 for a loss) and multiplied by a constant, the K-factor, to give the fluctuation value. For example, if a player with a winning chance of 0.8 wins, the value is “(1.0 − 0.8) × K”.
    In chess, 16 or 32 is often used for the K-factor, but it can be tuned for better accuracy.
  4. The fluctuation value is added to the original rating to give the latest rating.
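
As a rough sketch, the four steps above could be written in Go like this. The summary does not spell out how the winning chance is computed, so the code assumes the standard Elo logistic curve (base 10, 400-point scale). The function names are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// EloExpected returns the expected winning chance of a player against an
// opponent, using the standard Elo curve (an assumption; the article does
// not give the formula).
func EloExpected(rating, opponent float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, (opponent-rating)/400.0))
}

// EloUpdate returns the new rating after one match.
// score is 1.0 for a win and 0.0 for a loss; k is the K-factor.
func EloUpdate(rating, opponent, score, k float64) float64 {
	fluctuation := k * (score - EloExpected(rating, opponent))
	return rating + fluctuation
}

func main() {
	// A 1400-rated player beats a 1500-rated opponent with K = 32.
	fmt.Printf("winning chance: %.3f\n", EloExpected(1400, 1500))
	fmt.Printf("new rating:     %.1f\n", EloUpdate(1400, 1500, 1.0, 32))
}
```

Because the fluctuation is K × (result − expected), an upset win moves the rating more than a win that was already expected.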

Elo has some parameters to adjust, but it is fairly accurate even with the default values. Still, the system has a shortcoming: the ratings of new players do not stabilize for a while, and their opponents’ ratings are affected by those unstable ratings.

The Glicko rating is widely used as an improvement that addresses these problems.

Glicko rating

Since the Glicko rating is often more accurate than the Elo rating, it is also used for many sports and online game ratings. (However, it is not always more accurate.)

The Glicko rating has the following main features.

  • Each player has a “rating deviation” (RD) in addition to a rating. If the player has played few matches, the RD is high and the rating is considered unstable.
  • If the RD is high, the rating fluctuates widely, so a new player quickly settles at the proper rating. The effect on the opponent is also smaller than in Elo.
  • RD becomes smaller as the player plays matches, and it increases again with each “rating period” that passes. This takes into account that a player’s skill changes over time.

The Glicko-2 rating system is also available. It improves on Glicko by adding a rating volatility factor, “σ”.

Just like Elo, Glicko has parameters to adjust, but in many cases it is more accurate than Elo because it takes a more realistic approach.
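
To illustrate how RD reduces the effect of an unstable opponent, here is a small sketch of the expected-score part of the original Glicko system. The formulas for g(RD) and the expected score come from Glickman’s published description of Glicko, not from this article, and the Go names are only illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// q is the Glicko scaling constant ln(10)/400.
const q = math.Ln10 / 400.0

// G discounts a rating difference according to the opponent's RD.
// It equals 1 when RD is 0 and shrinks as RD grows.
func G(rd float64) float64 {
	return 1.0 / math.Sqrt(1.0+3.0*q*q*rd*rd/(math.Pi*math.Pi))
}

// Expected returns the expected score against an opponent with the given
// rating and RD.
func Expected(rating, oppRating, oppRD float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -G(oppRD)*(rating-oppRating)/400.0))
}

func main() {
	// The same 200-point gap counts for less when the opponent's rating is
	// unreliable (RD 350) than when it is well established (RD 30).
	fmt.Printf("vs. stable opponent:   %.3f\n", Expected(1700, 1500, 30))
	fmt.Printf("vs. unstable opponent: %.3f\n", Expected(1700, 1500, 350))
}
```

With a high opponent RD the expected score is pulled toward 0.5, so a result against a brand-new player changes the rating less than the same result against an established one.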

Issues with the Glicko rating in tennis

A parameter-tuned Glicko is sufficiently accurate. But Glicko was devised for chess, and there are some points to improve for tennis.

I’m not familiar with chess, but I think a tennis player’s condition goes up and down more often than a chess player’s. It is difficult to maintain good performance throughout the season, and many athletes play with injuries. (Of course, I do not mean that staying in condition for chess is easy.)

For this reason, the idea that a rating stabilizes as more matches are played is questionable. I think the rating deviation should change every time a match is played: if a result is not what the ratings predicted, the player’s rating is probably less stable, and if the result is as expected, it is more stable.

In Glicko, the default score is “1” for a win and “0” for a loss. There is room for improvement here. For example, if a player who is unlikely to win according to the ratings loses a close match, that player’s rating should be raised. Even for the same result, setting a “weight of win” that depends on how many games were taken would reflect actual skill more accurately.

In addition, a player who does not play regularly loses form quickly and, after half a year without a match, is far from top performance. Such a player may have been injured, so the rating after the return should be much less stable. In Glicko, everyone’s rating becomes less stable at the same rate each period, but the change is moderate: once a rating has stabilized, it rarely becomes unstable again as time passes. It is more realistic for an inactive player’s rating to become unstable more quickly.

These points are summarized as follows.

  1. The weight of a win or loss is determined by the number of games taken.
  2. The “rating deviation” (RD) changes after every match. It becomes more stable if the result is as expected and less stable otherwise.
  3. An inactive player’s rating quickly becomes unstable, while an active player’s rating does not become unstable over time.

Implementation

Since Glicko is already sophisticated, I changed only three points, using the Glicko-2 system as the base.

Determine the match score

Split the match into sets, determine a score for each set from how many games were taken, and add 0.5 to the total over all sets to get the final score.

Although this is not the optimal parameter, I simply define each set score as follows.

Likewise, define the tie-break score.

For a 6–0 6–0 win, the match score is “0.4 + 0.4 + 0.5 = 1.3”. For a 7–5 0–6 7–5 win, the score is “0.2 − 0.4 + 0.2 + 0.5 = 0.5”. Thus the score varies with the games won and lost.

For a 6–3 set, the score should ideally depend on whether the set was won with one break or two, but that cannot be determined from the set score alone. So the same score is applied regardless of the number of breaks.
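
The two worked examples can be reproduced with a short sketch like the one below. Only the two set weights quoted in the text (0.4 for 6–0 and 0.2 for 7–5) are filled in; the other set scores and the tie-break values are deliberately left out rather than guessed. The sketch also assumes that a lost set counts as the negative of the corresponding won set, which matches the 0–6 example. The type and function names are hypothetical.

```go
package main

import "fmt"

// Set holds the games won by the player and by the opponent in one set.
type Set struct {
	Player, Opponent int
}

// setWeights maps a winning set score to its weight. Only the values that
// appear in the text are listed; the rest would have to be added.
var setWeights = map[Set]float64{
	{6, 0}: 0.4,
	{7, 5}: 0.2,
}

// MatchScore sums the set weights (negated for lost sets, an assumption)
// and adds 0.5. It returns an error for a set score without a weight.
func MatchScore(sets []Set) (float64, error) {
	total := 0.5
	for _, s := range sets {
		if w, ok := setWeights[s]; ok {
			total += w
		} else if w, ok := setWeights[Set{s.Opponent, s.Player}]; ok {
			total -= w
		} else {
			return 0, fmt.Errorf("no weight defined for set %d-%d", s.Player, s.Opponent)
		}
	}
	return total, nil
}

func main() {
	straight, _ := MatchScore([]Set{{6, 0}, {6, 0}})
	close3, _ := MatchScore([]Set{{7, 5}, {0, 6}, {7, 5}})
	fmt.Printf("6-0 6-0:     %.1f\n", straight)
	fmt.Printf("7-5 0-6 7-5: %.1f\n", close3)
}
```

Note that the score can go above 1.0 (and, for heavy defeats, below 0.0), unlike the plain 1/0 score of Glicko.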

Change RD per match

In Glicko, the expected score is calculated from the players’ ratings and RDs, and the new rating is determined from the difference between it and the actual score. I calculate the fluctuation of RD from this same difference.

However, for players with a high RD, RD should still shrink so that the rating can stabilize, even if there is a gap between the expected and actual scores. Therefore, the larger the RD, the more it tends to decrease after a match.

I think the program is easier to follow than an explanation in sentences, so I post the code below. (Implemented in Go.)

import "math"

// RDFluctuation returns the change to apply to RD after one match.
func RDFluctuation(rd float64, score float64, expectedScore float64) float64 {
    // Reference point chosen with the RD to be reached in the future in mind:
    // an RD of 90 and a result gap of 0.4.
    baseFluctuation := math.Sqrt(90 / math.Sqrt(0.4))
    // The match score can fall outside [0, 1], so clamp it for convenience.
    score = math.Max(math.Min(1.0, score), 0.0)
    // Gap between the expected and the actual score, floored at 0.1.
    scoreDiff := math.Max(math.Abs(expectedScore-score), 0.1)
    // Grows with the current RD and shrinks as the gap widens.
    actualFluctuation := math.Sqrt(rd / math.Sqrt(math.Sqrt(scoreDiff)))
    return (baseFluctuation - actualFluctuation) * 5
}

Although the formula and parameters are not optimal, verification showed that RD generally reached the values I expected, so I adopted this.
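
To get a feel for how the function behaves, the following small program prints the fluctuation for a few combinations of RD and result. The function body is the one posted above, copied so that the program runs on its own; the chosen inputs are arbitrary examples, not values from the verification.

```go
package main

import (
	"fmt"
	"math"
)

// RDFluctuation is the function from the article, with "func" as Go requires.
func RDFluctuation(rd float64, score float64, expectedScore float64) float64 {
	baseFluctuation := math.Sqrt(90 / math.Sqrt(0.4))
	score = math.Max(math.Min(1.0, score), 0.0)
	scoreDiff := math.Max(math.Abs(expectedScore-score), 0.1)
	actualFluctuation := math.Sqrt(rd / math.Sqrt(math.Sqrt(scoreDiff)))
	return (baseFluctuation - actualFluctuation) * 5
}

func main() {
	cases := []struct {
		label              string
		rd, score, expWins float64
	}{
		{"RD 90, expected win", 90, 1.0, 0.9},
		{"RD 90, upset loss", 90, 0.0, 0.9},
		{"RD 350, expected win", 350, 1.0, 0.9},
		{"RD 350, upset loss", 350, 0.0, 0.9},
	}
	for _, c := range cases {
		fmt.Printf("%-22s %+.1f\n", c.label, RDFluctuation(c.rd, c.score, c.expWins))
	}
}
```

At an RD of 90 the fluctuation is negative when the favorite wins and positive after an upset, while at an RD of 350 it is negative in both cases, which is the behavior described above.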

Destabilization for inactive player

In Glicko-2, RD is increased every rating period by a formula that uses the rating volatility “σ”. The default increment is too small to apply to tennis, so I simply multiply the default value by 30.

The rating period is set to 40 days, a length I consider long enough to affect performance.

(*These parameters have much room for optimization.)

Under this rule, the rating of a player who does not play for about half a year becomes clearly unstable.
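
The text does not say exactly which quantity is multiplied by 30, so the sketch below is only one possible reading: it takes the standard Glicko-2 step for a player who did not compete, φ′ = √(φ² + σ²), and multiplies the resulting increase in φ by 30. The cap at an RD of 350 is also an assumption added here, and all names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

const (
	glicko2Scale  = 173.7178 // Glicko-2 conversion factor between RD and phi
	maxRD         = 350.0    // assumption: never exceed the default starting RD
	inactiveBoost = 30.0     // multiplier mentioned in the text
	periodDays    = 40.0     // rating period length mentioned in the text
)

// InflateRD applies one rating period without matches. The plain Glicko-2
// increase of phi is multiplied by inactiveBoost (one reading of the text).
func InflateRD(rd, sigma float64) float64 {
	phi := rd / glicko2Scale
	increment := math.Sqrt(phi*phi+sigma*sigma) - phi
	return math.Min((phi+inactiveBoost*increment)*glicko2Scale, maxRD)
}

// RDAfterInactivity applies InflateRD once per full rating period elapsed.
func RDAfterInactivity(rd, sigma, daysInactive float64) float64 {
	periods := int(daysInactive / periodDays)
	for i := 0; i < periods; i++ {
		rd = InflateRD(rd, sigma)
	}
	return rd
}

func main() {
	for _, days := range []float64{30, 40, 90, 180} {
		fmt.Printf("%3.0f days inactive: RD %.1f\n", days, RDAfterInactivity(60, 0.06, days))
	}
}
```

Under this reading, a stable rating with an RD of 60 and σ = 0.06 ends up with an RD well above 100 after about half a year away, which is in line with the intent described above.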

Outcome

I checked how much the customized Glicko contributes to accuracy. I simply checked the probability that the higher-rated player wins each match.

As an additional viewpoint, when assigning a player’s rating, I set a rating for the current court in addition to the rating over all matches.

Tennis courts are mainly categorized as hard, grass, clay, carpet, and indoor. Because each player is better on some courts than on others, setting a rating for each court and adding it to the overall rating with coefficients gives more accurate predictions.

This time I check how accurately the rating calculated by the formula “Overall Rating × 0.71 + Court Rating × 0.29” predicts a win or a loss.

Let’s compare three systems: the Elo rating, the Glicko-2 rating, and the improved Glicko-2 rating (Custom Glicko rating).

The Elo and Glicko-2 ratings use the following parameters. They follow the default values, except that Elo’s K-factor is larger than the one used for chess.
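
In code, the blended rating is just a weighted sum. A minimal sketch, with the two coefficients taken from the formula above and hypothetical function names:

```go
package main

import "fmt"

const (
	overallWeight = 0.71 // weight of the rating over all matches
	courtWeight   = 0.29 // weight of the rating on the current court
)

// CombinedRating blends the overall rating with the court-specific rating.
func CombinedRating(overall, court float64) float64 {
	return overall*overallWeight + court*courtWeight
}

// FirstIsFavorite reports whether player A has the higher combined rating.
func FirstIsFavorite(aOverall, aCourt, bOverall, bCourt float64) bool {
	return CombinedRating(aOverall, aCourt) > CombinedRating(bOverall, bCourt)
}

func main() {
	// Player A is stronger overall, player B is a specialist on this court.
	fmt.Printf("A: %.1f\n", CombinedRating(1800, 1600))
	fmt.Printf("B: %.1f\n", CombinedRating(1750, 1900))
	fmt.Println("A favorite:", FirstIsFavorite(1800, 1600, 1750, 1900))
}
```

Since the coefficients sum to 1, a player whose court rating equals the overall rating keeps the same combined value.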

  • Elo rating — Default rating is 1500, K-factor is 40.
  • Glicko-2 rating, Custom Glicko rating — Default rating is 1500, RD is 350, Volatility(σ) is 0.06.

This verification uses data from OnCourt, a rich tennis database. Since the influence of elapsed time is reflected more strongly there, I check all women’s matches.

Rating calculation starts from 1990, and for matches from 2014 to August 2018 I investigated the probability that the higher-rated player wins.

To leave out matches that were settled early because of an accident, matches with fewer than 5 games are excluded. And to make sure the court rating is stable, matches in which either player has fewer than 10 games on the current court are also excluded.

Here is a breakdown of the results.
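
The check itself can be sketched as below. The Match struct is a hypothetical record, not the OnCourt schema, and the sketch reads “fewer than 10 games on the current court” as the number of earlier matches each player has on that court, which is an assumption.

```go
package main

import "fmt"

// Match is a hypothetical, simplified record of one finished match.
type Match struct {
	WinnerRating, LoserRating float64 // ratings before the match
	TotalGames                int     // games played in the match
	WinnerCourtMatches        int     // winner's earlier matches on this court
	LoserCourtMatches         int     // loser's earlier matches on this court
}

// Accuracy returns the share of counted matches won by the higher-rated
// player, together with the number of matches that passed the filters.
// Matches with equal ratings count as misses.
func Accuracy(matches []Match) (float64, int) {
	correct, counted := 0, 0
	for _, m := range matches {
		if m.TotalGames < 5 {
			continue // ended too early to be meaningful
		}
		if m.WinnerCourtMatches < 10 || m.LoserCourtMatches < 10 {
			continue // court rating not yet stable
		}
		counted++
		if m.WinnerRating > m.LoserRating {
			correct++
		}
	}
	if counted == 0 {
		return 0, 0
	}
	return float64(correct) / float64(counted), counted
}

func main() {
	sample := []Match{
		{1600, 1500, 20, 30, 30},
		{1500, 1600, 25, 30, 30},
		{1700, 1500, 3, 30, 30}, // excluded: fewer than 5 games
	}
	rate, n := Accuracy(sample)
	fmt.Printf("accuracy %.2f over %d matches\n", rate, n)
}
```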

[Figure: Comparison of all rating systems]
[Figures: Elo stats, Glicko-2 stats, Custom Glicko stats]

As the graph above shows, Glicko-2 outperformed Elo in every year, and Custom Glicko outperformed Glicko-2 in every year except 2018.

Custom Glicko is at around 70% in every year, which is comparable to the result of the machine learning I did. (The machine learning used a different method.)

I focused on the winning probability this time, but the rating deviation (RD) behaves much more sensibly than in Glicko, so combining rating and RD could predict results more accurately.

More accurate approach

This time I applied three changes to Glicko-2 to match the actual situation of tennis, but none of the parameters has been tuned.

Just changing the parameters manually would probably raise the percentage, but I do not have enough time to reach the optimal values that way.

For example, I set the “weight of win” to “0.2” when a player takes a set 7–5, but it is nearly impossible to work out the optimal value by logic alone.

In this case, finding the values with reinforcement learning would be appropriate. A machine learning approach should make it possible to find better parameters than the current ones.

If I get more accurate results, I will post another article someday.

Thanks.
