Video game rating systems: is Elo hell real?

Does it make sense to apply Elo or similar rating systems to individuals playing team games when the set of individuals on each team changes every match? I wrote a program to try to answer this question numerically. See code here.

Daniel Espinosa
17 min read · Jan 16, 2017

Background

I love video games and two that I’m currently obsessed with are Overwatch and Rocket League. Both of these are team games in which the best players are those that not only have excellent control of their in-game avatars, but coordinate with their teammates exceptionally well. Both games have matchmaking systems where a player can jump into a match by putting themselves into a virtual queue and waiting for the game to group them up with other players in their region at similar skill or experience levels. Both games also have competitive gameplay modes that assign ratings to players based on their performance over many matches. Unless players choose to team up with friends, they typically get matched with all new teammates and play against all new opponents every game. I became interested in the design of these rating algorithms because many of them borrow from the Elo system, which was originally designed to rate chess players, and it is not clear whether they do a good job in the context of team games with ever-shifting team rosters. Many players complain that they are underrated because they are in a lower tier of skill ratings and thus keep getting paired with teammates who do not play or coordinate well, making the outcome of their games essentially random and preventing them from moving up to their proper place in the ratings. This situation has been dubbed “Elo hell” and this post is about my attempt to see if this is truly occurring or just a convenient excuse for bad players. Not that I know any of those…

Elo ratings

Arpad Elo was a physicist and avid chess player who designed the rating system bearing his name with the goal of creating numerical ratings for chess players that not only served to rank them, but could also be used to predict the outcome of a match. So, ideally, if you know the Elo ratings of two players then you know which one is better and you can calculate the odds of a particular outcome for their match. Another feature of Elo ratings is that the total number of points distributed among a pool of players is constant. The points gained by the winner of a match equal the points lost by the loser. See here for details on calculating Elo scores.
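In code, the two standard Elo formulas (the predicted score and the zero-sum update) look like this; the K-factor of 32 is just a common example value, not something prescribed by the system itself:

```python
def expected_score(rating_a, rating_b):
    """Predicted score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return both players' updated ratings; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta  # zero-sum: A's gain equals B's loss

# Example: a 1600-rated player beats a 1500-rated player.
print(elo_update(1600, 1500, 1.0))  # approximately (1611.5, 1488.5)
```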

Over the years, the Elo system has been upgraded or tweaked to fix some of its flaws and extend its use to other competitions like team pro sports. People have added ways to calculate confidence intervals and measures of volatility to account for the fact that differences in frequency of play or consistency aren’t measured in the Elo system. The Glicko system does both of these things and is used by many chess websites and games like CS:GO. The person behind the Glicko system, Mark Glickman, is also a collaborator behind the newer Universal Rating System for chess. His website has a lot of good reading if you are interested in sports or game ratings in general. For the purposes of my experiment, however, I’m going to use the original Elo system because it’s simpler to implement and the factors that motivated the addition of confidence and volatility measures will be absent from my simulation.

Overwatch uses a rating system that seems to inherit aspects of the Elo system, but it also uses game-specific stats in some way and computes confidence intervals for each player’s rating. It is possible for a full team of 6 players to play all of their competitive Overwatch matches together and yet end up with slightly different ratings. Blizzard, the game’s publisher, has not made the details of their rating algorithm public. Presumably, their system does at least a bit better than unmodified Elo ratings would, but only they know by how much.

My experiment

To answer the question of whether Elo hell exists, I made a program that would do the following:

  1. Simulate a pool of n players of varying ability.
  2. Group players with similar ratings into teams and form a match.
  3. Simulate the outcome of the match.
  4. Update the Elo ratings of each player based on the outcome.
  5. Repeat steps 2–4 until the average number of matches played by each player reaches some target value.

While this is going on, I also record some stats to help me figure out if this process is actually bestowing appropriate ratings on most of the players or if it is even ranking them in proper order.

Modeling the players

I started out assuming that I could model each player by two numbers. The first number is their true skill rating (srT), which is constant and represents the player’s overall skill at the game. The second number is their measured skill rating (srM), which represents the rating this player receives from the game itself. A player’s srM changes over the course of the experiment since it depends on the outcome of their matches. I let the srT’s have a Gaussian distribution with a mean of 1500 and a standard deviation of 350. I initialized every player’s srM to 1500. The choice of the mean is arbitrary since only differences between Elo scores have meaning. I picked 1500 simply because it corresponds to a common rating in most chess organizations or in Overwatch, and so it helps me keep the numbers in perspective based on my familiarity with those games. My choice of 350 for the standard deviation was also pretty arbitrary. Its main implication is that a player with a rating one standard deviation higher than another player will have a predicted score of 0.88. In the Elo system, a win counts as 1, a draw counts as 0.5, and a loss counts as 0. So a predicted score of 0.88 could mean that the better player has an 88% chance of winning, or some combination of win and draw chances whose expected score also works out to 0.88.
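A minimal sketch of this setup (the variable names are mine), including a sanity check of the 0.88 figure:

```python
import random

N_PLAYERS = 10_000          # size of the simulated player pool
SR_MEAN, SR_SD = 1500, 350  # parameters of the true-skill distribution

# srT is fixed per player; srM starts at the mean for everyone.
sr_true = [random.gauss(SR_MEAN, SR_SD) for _ in range(N_PLAYERS)]
sr_meas = [float(SR_MEAN)] * N_PLAYERS

# Predicted score of a player rated one standard deviation above their opponent.
predicted = 1.0 / (1.0 + 10 ** (-SR_SD / 400.0))
print(round(predicted, 2))  # 0.88
```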

I also track both the true rank (rankT) and measured rank (rankM) for every player. The rankT for each player is based on their srT and rankM is based on their current srM. Ranks are integers ranging from 1 to n (the number of players) and higher rank means better. That is, rankM=1 corresponds to the player with the lowest current srM.

Modeling the matchmaking process

To simulate the game matching players of similar ability, I first sort a list of all the players by their srM’s, then choose a player from the list at random and alternate assigning adjacent players from the list to two teams. Each match ultimately consists of 2 teams of 6 players each, as in Overwatch.
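A sketch of that matchmaking step as I read it (the function and variable names are mine):

```python
import random

TEAM_SIZE = 6  # two teams of six, as in Overwatch

def make_match(sr_meas):
    """Pick 12 players with adjacent measured ratings and split them into two teams.

    sr_meas is the list of current measured ratings; returns two lists of player indices.
    """
    order = sorted(range(len(sr_meas)), key=lambda i: sr_meas[i])  # sort players by srM
    start = random.randrange(len(order) - 2 * TEAM_SIZE + 1)       # random starting player
    block = order[start:start + 2 * TEAM_SIZE]                     # 12 adjacent players
    return block[0::2], block[1::2]                                # alternate team assignment
```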

Modeling the match

Since Elo ratings were created for a 1 vs 1 perfect information game, I model the outcome of the match by calculating a mean srT for each team, using the Elo model to compute the probability of one team winning, and then choosing a random number from the interval (0,1). If the random number is less than that team’s chance of winning, I award them a win and their opponents a loss, and then update the srM’s of each player in the match accordingly. This means that every player’s contribution to the outcome is independent of the others — I’ll come back to that point later. It also means that there is a chance for the lower-rated team to score an upset victory, according to the probabilities calculated from the teams’ Elo scores.

srTeam_i = mean(srT_i)

Where srT_i is a vector containing all the srT’s of the players on team i.
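A sketch of that outcome model, assuming the standard Elo formula is what turns the difference in team means into a win probability:

```python
import random
from statistics import mean

def expected_score(r_a, r_b):
    """Standard Elo predicted score of side A against side B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def simulate_match(sr_true_a, sr_true_b):
    """Decide the winner from the teams' mean true skill.

    sr_true_a and sr_true_b are the lists of srT values for the players on each team.
    Returns 1.0 if team A wins and 0.0 if team B wins (no draws in this sketch).
    """
    p_a_wins = expected_score(mean(sr_true_a), mean(sr_true_b))
    return 1.0 if random.random() < p_a_wins else 0.0
```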

Modeling Elo rating updates

I update the individual srM’s of each player as though they had had a rating equal to their team’s mean srM and played an opponent with the other team’s mean srM. See here for details on calculating Elo scores (same link as above).
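In other words, every player on a team receives the same rating change, computed from the two team means. A sketch (the K-factor isn’t specified in the write-up, so K=32 here is only an example):

```python
from statistics import mean

K = 32  # example K-factor; the write-up does not state the value actually used

def update_ratings(sr_meas, team_a, team_b, score_a):
    """Apply the Elo update to every player, treating each as having their team's mean srM.

    team_a and team_b are lists of player indices; score_a is 1.0 if team A won, else 0.0.
    """
    meas_a = mean(sr_meas[i] for i in team_a)
    meas_b = mean(sr_meas[i] for i in team_b)
    expected_a = 1.0 / (1.0 + 10 ** ((meas_b - meas_a) / 400.0))
    delta = K * (score_a - expected_a)
    for i in team_a:
        sr_meas[i] += delta
    for i in team_b:
        sr_meas[i] -= delta
```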

Duration of experiment

Competitive play in Overwatch is broken up into seasons which last about 3 months, so to make sure I ran the simulation long enough, I let it go until the mean number of matches played was 200, or about 2 matches per day. Typical matches are about 20 minutes, so this corresponds to playing the game about 4–5 hours a week.
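Since each simulated match involves 12 of the n players, hitting that target just means simulating a fixed number of matches; a quick sketch of the arithmetic:

```python
N_PLAYERS, TEAM_SIZE, TARGET_MEAN_MATCHES = 10_000, 6, 200

# Every match adds 2 * TEAM_SIZE player-appearances, so:
total_matches = TARGET_MEAN_MATCHES * N_PLAYERS // (2 * TEAM_SIZE)
print(total_matches)  # 166666 matches for a mean of ~200 matches per player
```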

Results

Elo ratings serve the dual purposes of both ranking players and of quantifying their skill difference. To see how well this system was performing both duties, I plotted the mean |srT-srM| and the mean |rankT-rankM| as a function of matches played for n=10,000.

The error bars on each plot span ±2 standard deviations of the quantity being plotted, meaning that ~95% of the players will fall within those bounds.

To see how each player’s rankT compares to their rankM, I made a scatterplot of the two quantities as seen at the end of the simulation, when the average matches per player is 200. The graphs above show change over time and the one below is a snapshot of the system at a particular moment.

Note that the graph above also lists a number for something I’m calling sortedness (s), which is a way of quantifying order within a list of ranks. As far as I know this is an original idea, but if I’m mistaken please correct me. I’m defining sortedness as follows:

In this case, we are calculating s(rankT, rankM), where rankT is a vector containing the rankT for each player and rankM is a vector containing the rankM for each player. The value of meanRank is just the average rank of all players, (n+1)/2. Sortedness values lie on the interval [-1, 1]. This definition also means that a perfectly sorted player pool in which rankM = rankT will have s=1. A player pool that is ranked completely backwards, with the best players on the bottom and the worst on top, will have s=-1. A randomly shuffled player pool where rankM has no correlation with rankT will have s~0.
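For concreteness, here is a sketch of a sortedness function with exactly those properties. The normalization by the spread of the true ranks is my assumption; it is consistent with everything stated above, but it isn’t necessarily the exact formula used in the simulation:

```python
def sortedness(rank_true, rank_meas):
    """Return a value in [-1, 1]: 1 if rank_meas equals rank_true, -1 if it is exactly
    reversed, and roughly 0 if the two rankings are unrelated.

    Both arguments are lists holding each player's rank (integers 1..n).
    """
    n = len(rank_true)
    mean_rank = (n + 1) / 2
    num = sum((rt - mean_rank) * (rm - mean_rank) for rt, rm in zip(rank_true, rank_meas))
    den = sum((rt - mean_rank) ** 2 for rt in rank_true)
    return num / den

# Quick check with n = 5:
print(sortedness([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(sortedness([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```

The graph below shows the evolution of this measure throughout the simulation.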

The graphs above seem to indicate a pretty large spread of rankM for players near the middle of the pack, but to get back to the question of Elo hell, we might ask: how many of the players in the top 25% ended up being ranked below average? It turns out to be only 68 out of 2,500, or ~2.7%. About 23% of these elite players ended up being under-ranked, but within the 50–75th percentile. The remaining 74% ended up where they belong, in the top 25% of rankM’s. This seems to suggest that in this model there isn’t much of an Elo hell. Good players mostly ended up with the other good players. But we don’t have to be elitist and focus only on the fates of the best players. We can also ask: how many players total are underrated by more than one standard deviation? This turns out to be 381/10000 (3.8%). If we count those both under- and over-rated, the total number of players rated more than one standard deviation off is 758/10000 (7.6%). For comparison, one standard deviation in US adult male height is about 3 inches, so this error is comparable to misjudging someone’s height by a few inches.
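For reference, the counting behind those numbers is straightforward; a sketch, using the per-player arrays from the simulation (these names are mine) and the thresholds quoted above:

```python
def misrating_counts(sr_true, sr_meas, rank_true, rank_meas, sd=350):
    """Count the kinds of mis-rating discussed above (higher rank = better player)."""
    n = len(sr_true)
    top_quartile = [i for i in range(n) if rank_true[i] > 0.75 * n]        # best 25% by srT
    elite_below_avg = sum(1 for i in top_quartile if rank_meas[i] <= 0.5 * n)
    underrated = sum(1 for i in range(n) if sr_true[i] - sr_meas[i] > sd)  # off by > 1 SD
    overrated = sum(1 for i in range(n) if sr_meas[i] - sr_true[i] > sd)
    return elite_below_avg, underrated, overrated
```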

One big flaw in this model, though, is that there is no simulated interaction between the players on a team. Because the match outcomes are determined from the mean srT’s of each team, the skill of one player doesn’t affect the contribution of any other player to the outcome of that match. We all know this is not the case in team games, however. The basis of the whole Elo hell argument was that games requiring a high level of cooperation between teammates also have the possibility for uncooperative players to drag down their teams. In other words, this experiment needs some way of modeling player interactions.

Modeling cooperation

To allow the players on a team to affect each other’s contribution to the match outcome, I decided to add one more parameter to the description of each player. I called this a cooperation factor (coop) and used it to redefine the way in which team scores are calculated as follows:

srTeam_i = mean(srT_i)*∏(coop_i)

Where coop_i is a vector containing the respective cooperation factors for each player in team i and the symbol ∏ means to take the product of the sequence of elements within the vector. I let the coop’s have a Gaussian distribution with a mean of 1 and a standard deviation of coopSd. This means that the ability of each player is now modeled by two numbers, srT and coop, which can be interpreted as the quality of a player’s individual contribution and the quality of their team interaction, respectively. So things like how well a player controls their in-game avatar, including how well they move and aim, are all rolled up into srT. How intelligently they play with their team is rolled up into coop. Because the model now allows for a low srT player to provide a huge benefit to their team by having high coop, there is no longer a single number that can represent the contribution of a player to any team. To know what any one player contributed to the outcome of a match, you now need to know that player’s srT, their coop, and the srT’s of all their teammates. Cooperative teammates can now act as foils for high srT players. For the purpose of computing rankT’s, I will redefine them as the rank of a player’s composite score (srComp), which I’ll define as equal to the rating a team would have if it were made up of 6 copies of the same player:

srComp = srT*coop⁶
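A sketch of the modified team score and the composite rating (the sampling of coop and the names are mine; coopSd = 0.05 is the value explored below):

```python
import random
from statistics import mean

TEAM_SIZE = 6
COOP_SD = 0.05  # standard deviation of the cooperation factor (the "conservative" choice)

def draw_coop():
    """Sample a cooperation factor: Gaussian with mean 1 and standard deviation COOP_SD."""
    return random.gauss(1.0, COOP_SD)

def team_score(sr_true_team, coop_team):
    """srTeam_i = mean(srT_i) * prod(coop_i) for the players on one team."""
    product = 1.0
    for c in coop_team:
        product *= c
    return mean(sr_true_team) * product

def sr_comp(sr_true, coop):
    """Composite rating: the score of a team made of 6 copies of the same player."""
    return sr_true * coop ** TEAM_SIZE
```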

With this definition, the distribution of srComp’s is no longer symmetric. For a simulation with n=10,000, it looks like this:

Below are the same plots as for the previous simulation, but with the addition of cooperation factors and coopSd = 0.05. This was my attempt at a conservative choice for coopSd: it implies that a player two standard deviations above average in cooperativeness (roughly the top 2.5% of players) boosts their team’s effective strength by about 10% (2*coopSd = 0.1).

coopSd = 0.05

With the cooperation factors, now 30/2500 (1.2%) of the top-ranked (rankT) 25% of players end up receiving below average rankM’s. Another 527/2500 (21%) of them are under-ranked, but within the 50–75th percentile of rankM’s. The remaining 1943/2500 (78%) end up where they belong, in the top 25% of rankM’s.

The number of players underrated by more than one standard deviation in this scenario is 1033/10000 (10%). The number of players rated more than one standard deviation off in either direction is 1151/10000 (12%). Since Overwatch has >20 million active players, this means there could be >2 million players that are significantly mis-rated under the coopSd=0.05 assumption. In the case of coopSd=0.1, a simulation yields 1349/10000 (13%) of the players underrated by more than one standard deviation and 1349/10000 (13%) mis-rated by more than one standard deviation in either direction. That is, none of the players in the coopSd=0.1 scenario are overrated by more than one standard deviation.

Now we can explore where these underrated players lie. The graph below shows the underrated players highlighted in the case of coopSd=0.05. This indicates that being underrated by a large margin is really only a problem for the best (80th-percentile and up) players. In fact, 1033/2000 (52%) of the top 1/5 of players are underrated by more than one standard deviation in this model. Being overrated by a large margin doesn’t seem to be very common, though.

Red dots = players underrated by more than 1 standard deviation

Now what about over/under-ranked players? When it comes to rank, it turns out that 4053/10000 (41%) of the players are more than 10 percentile points (1,000 ranks for n=10,000) away from their proper rank, with a roughly 50/50 split between over- and under-ranked. There are even 1326/10000 (13%) of the players mis-ranked by more than 20 percentile points, as shown below.

Furthermore, this mis-ranking problem seems to affect players of all abilities, as seen from a histogram of their srComp’s:

One last observation of the final state of the simulation is that there is more order in the extreme ends of the player pool than in the middle. Also, the sortedness within segments is pretty low. So, someone within your Overwatch rank having a slightly higher or lower rating than you isn’t likely to mean that much, especially if both of you have ratings near the mean. A lot of the overall sortedness of the player pool seems to come from long-range order, which makes sense given the fraction of players mis-ranked by 10 or even 20+ percentile points.

Conclusion

I believe that the strength of a team in a game such as Overwatch cannot be accurately represented as a linear combination of its players’ ratings, as that would imply that team interactions don’t matter. It seems reasonable, however, that team strength may be described by some non-linear function of player ratings. In my model, I propose such a function and I assume that a player’s strength can be defined by two ratings: an individual skill rating (srT) and a cooperation factor (coop).

If we decide that my implementation of a cooperation factor is a reasonable abstraction of teamwork, then a modest choice of standard deviation for this cooperation factor (coopSd=0.05) leads to Elo ratings that sort the player pool fairly accurately overall, but the best players tend to be underrated and more than 10% of players are mis-ranked by over 20 percentile points. Whether this constitutes an Elo hell depends on how you choose to define it, but given the size of Overwatch’s player base, I think it is safe to assume there are a large number of players that would be justified in feeling their ranking is incorrect. Based on that, I’ll conclude that Elo hell is real. Elo heaven too.

What do you think?

  • Does this seem like a reasonable way to answer the question of whether Elo hell exists?
  • Do you know of a way to answer that question analytically as opposed to the numerical approach attempted here?
  • Can you think of better ways to model this system?
  • Can you think of a better way to abstract teamwork?
  • Do you have suggestions for how else to represent this data?

Feel free to comment below because I’m curious to know what you think and, hopefully, improve this simulation or this write-up. Thanks to those who have given me feedback.

Afterword

After going through the process of creating this model, I am not sure if it makes sense to apply Elo-style ratings to individuals in games where the team rosters change with every match. It still makes sense to apply Elo ratings to individuals in games where they compete individually or to entire teams when the team roster is more or less stable, but to rate individuals in games with MOBA-style matchmaking I believe we can do better.

I propose that MOBA games, and Overwatch in particular, experiment with rating systems based on training deep neural networks on match data using the enormous quantity of statistics already being tracked for each player. Overwatch already tracks stats for a wide variety of things, including hero-specific stats, and I think that instead of attempting to guess at the relative importance of these metrics, we should simply train a network by feeding it samples that contain all of the stats gathered for a particular player during a particular match and the outcome of that match. With enough players and matches, this will create a neural network that can be used to rate players without taking into account whether they won or lost. Another bizarre thing about this hypothetical rating system would be that no one would know exactly what its internal representations of the data meant.

For example, say we stick with the Elo convention that a win=1, a draw=0.5, and a loss=0. Then, a set of training data for this neural network would look like a giant table of numbers where each row would represent the stats for a particular player during one match and the columns would represent the various stats being tracked. The numbers within a row would represent the measured values of those stats for a particular player during one match. One column in the table would contain the outcome of the match (0, 0.5, or 1) and the training process would make the network learn to associate different sets of input stats with the various outcomes. Once this is done, players would receive a score in [0,1] for every match, based on feeding their match stats into this trained network. The player ratings could then be some kind of time-weighted average of these scores. This would yield player ratings between 0 and 1 that did not depend on whether the player won or lost, but on whether their measured statistics correlated with winning or losing behavior as determined from analyzing the entire set of matches played by all players. Furthermore, because deep neural networks are a kind of black box where it is currently difficult to understand the internal representations of the data, it would not be obvious to players how to cheat this system. If something like this were made, we could compare its predictive abilities to those of the current rating system to decide if it made sense to switch. Blizzard, if you see this, I’d be happy to make this model. Just give me the stats!
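To make the shape of that pipeline concrete, here is a rough sketch using scikit-learn. Everything in it is hypothetical: the stat columns are random stand-ins, the network size and time-weighting are arbitrary, and none of this reflects how Blizzard actually computes ratings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier  # assumes scikit-learn is installed

# Hypothetical training table: one row per (player, match); the columns stand in for
# per-match stats (eliminations, deaths, healing, objective time, ...).
rng = np.random.default_rng(0)
n_rows, n_stats = 20_000, 20
X = rng.normal(size=(n_rows, n_stats))  # placeholder stats, not real Overwatch data
y = rng.integers(0, 2, size=n_rows)     # match outcome column: 1 = win, 0 = loss

# Train a network to predict the match outcome from a single player's stats alone.
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=50)
model.fit(X, y)

def rate_player(match_stats, weights=None):
    """Rating in [0, 1]: a (possibly time-weighted) average of per-match network scores.

    match_stats has shape (n_matches, n_stats), oldest match first; weights, if given,
    should grow toward the most recent matches.
    """
    scores = model.predict_proba(match_stats)[:, 1]  # how "win-like" each match's stats are
    return float(np.average(scores, weights=weights))

# Example: rate a player from their last 10 matches, weighting recent matches more.
recent = rng.normal(size=(10, n_stats))
print(rate_player(recent, weights=np.linspace(0.5, 1.0, 10)))
```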

My code

Right here.

Appendix

Other things I noticed in the course of making this.

  • The mean |srT-srM| actually increases again if enough games are simulated without the addition of coop’s. This happens because at some point the good players go from being underrated to overrated and the reverse happens for the bad ones. Seeing this motivated me to start tracking players’ ranks. The mean |rankT-rankM| always tends to decrease.
|srT-srM| starts to increase again if enough matches are played.
  • This is what the time evolution of the simulation looks like when plotting srM vs srT and color-coding players according to coop when coopSd=0.05. Green = better coop rating.
  • If you allow the simulation to go for a very long time (mean matches per player = 1,000), this is what happens:
  • If you test the stability of this system by starting every player with srM=srComp (s=1), it looks like this:

The results of the long simulations (mean matches per player = 1,000) suggest that this rating system, with the assumption of my cooperation model and coopSd=0.05, seems to reach a steady state of order at around s~0.95.
