Photo by Felix Mittermeier on Unsplash

What is an ELO Rating?

The mathematics behind it, and its association with chess, video games, FaceMash and Tinder

Raghav Mittal
Published in Purple Theory · 10 min read · Sep 11, 2020


The Elo rating system is a method for calculating the relative skill levels of players in zero-sum two-player games.

ELO is often written in all caps but it doesn’t have a full form — it’s simply named after its creator Arpad Elo, a Hungarian-American physics professor born in 1903.

Most people associate Elo with chess: it is used extensively by national chess federations, online chess websites, and even by FIDE (the governing body of international chess competitions) to determine the world rankings of chess players. In fact, Arpad Elo was a chess master himself.

But the Elo rating system is also used in A LOT of other games, including basketball, American football, rest-of-the-world football, baseball, board games such as Scrabble, and even video games such as Overwatch and PUBG.

History

Before the Elo rating system was devised, the US Chess Federation (USCF) and other organizations used the Harkness System, first published in 1956 by the chess organizer Kenneth Harkness. For a competition, the average rating of all the tournament’s players was calculated first. If a player scored 50% (won half, lost half), they received the average competition rating as their performance rating. If they scored more than 50%, their new rating was the competition average plus 10 points for each percentage point above 50. If they scored less than 50%, their new rating was the competition average minus 10 points for each percentage point below 50.

Let’s consider an example:

  • The average rating of a competition is 1850.
  • A player with a rating of 1600 takes part.
  • The player wins 3 out of 11 games (27.3%)

Since their score is 22.7 percentage points below 50%, their new rating is now 1850 − (10 × 22.7) = 1623.
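The Harkness rule above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described in this article; the function name `harkness_rating` and the idea of counting a draw as half a win are my own assumptions.

```python
def harkness_rating(competition_average: float, wins: float, games: int) -> float:
    """Performance rating under the Harkness rule as described above.

    `wins` may include half-points for draws (an assumption; the
    description above only talks about wins and losses).
    """
    if games <= 0:
        raise ValueError("games must be positive")
    score_pct = 100.0 * wins / games
    # 10 rating points for every percentage point above or below 50%.
    return competition_average + 10.0 * (score_pct - 50.0)


# The example above: 3 wins out of 11 in a competition averaging 1850.
print(round(harkness_rating(1850, 3, 11)))  # 1623
```

Notice that the player's old rating of 1600 never enters the calculation, which hints at why the system was considered statistically shaky.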

Quite simple and effective, the Harkness system tracked individual player ratings in terms of wins, draws and losses in tournaments. However, many observers often considered these scores to be inaccurate and in 1959, the USCF gave Arpad Elo the task of devising a new system that had a more sound statistical basis. The result — the Elo rating system.

ELO Explained (no math)

The performance in the ELO system, as with the Harkness system, is not measured in absolute terms. It is inferred from wins, losses, and draws against other players. Players’ ratings depend on the ratings of their opponents and the results scored against them.

After every game, the winning player takes points from the losing one, and the number of points is determined by the difference between the two players’ ratings.

  • If the higher-rated player wins, a few points are taken from the lower-rated player.
  • If the lower-rated player wins, a lot of points are taken from the higher-rated player.
  • If it’s a draw, the lower-rated player gains a few points from the higher-rated player.

ELO Explained (yes math)

Elo’s main assumption was that a player’s chess performance in each game is a random variable that follows a normally distributed, bell-shaped curve over time. Thus, while a player might perform significantly better or worse from one game to the next, the mean value of their performances (a reflection of their true skill) stays roughly the same, changing only slowly over time.

The difference in the ratings between two players serves as a predictor of the outcome of a match. If players A and B have ratings Rᴬ and Rᴮ, then the expected scores are given by:

  Eᴬ = 1 / (1 + 10^((Rᴮ − Rᴬ) / 400))
  Eᴮ = 1 / (1 + 10^((Rᴬ − Rᴮ) / 400))

The formulae for calculating expected scores given Elo ratings

A player’s expected score = their probability of winning + half their probability of drawing. If two players have equal ratings (Rᴬ = Rᴮ), then the expected scores of A and B evaluate to 1/2 each. That makes sense — if both players are equally good, then both are expected to score an equal number of wins.
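To make this concrete, here is a minimal Python sketch of the expected-score calculation, using the logistic formula with base 10 and a scale of 400 (the same one used in the worked example further down). The function name `expected_score` is just my choice for illustration.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Equal ratings: each player is expected to score half a point per game.
print(expected_score(1500, 1500))  # 0.5

# The two expected scores always add up to 1.
e_a = expected_score(1700, 1500)
e_b = expected_score(1500, 1700)
print(round(e_a + e_b, 6))  # 1.0
```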

When a player’s actual tournament score differs from their expected score, their rating needs to be adjusted upwards or downwards. Elo’s original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player over-performed or under-performed. The maximum possible adjustment per game, called the K-factor, was set at K = 16 for masters and K = 32 for weaker players. If Player A was expected to score Eᴬ points but actually scored Sᴬ points, the player’s rating is updated using the formula:

  R′ᴬ = Rᴬ + K (Sᴬ − Eᴬ)

Formula for updating a player’s Elo rating, where R′ᴬ is the new rating

Let’s consider an example:

  • Anand has a rating of 2600
  • Boris has a rating of 2300

Their expected scores are therefore:

  • Anand: 1 / (1 + 10^((2300 − 2600) / 400)) ≈ 0.849
  • Boris: 1 / (1 + 10^((2600 − 2300) / 400)) ≈ 0.151

If the organizers determined that K = 16 and Anand wins, then the new ratings would be:

  • Anand = 2600 + 16 × (1 − 0.849) ≈ 2602
  • Boris = 2300 + 16 × (0 − 0.151) ≈ 2298

If the organizers determined that K = 16 and Boris wins, then the new ratings would be:

  • Anand = 2600 + 16 × (0 − 0.849) ≈ 2586
  • Boris = 2300 + 16 × (1 − 0.151) ≈ 2314
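The Anand and Boris example can be reproduced with a short Python sketch. It assumes both players share the same K-factor, so whatever one player gains the other loses; the function names are mine, not part of any official implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 16.0):
    """Return (new_a, new_b) after one game.

    score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    Assumes the same K for both players, so the exchange is zero-sum.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


for label, score in [("Anand wins", 1.0), ("Boris wins", 0.0), ("Draw", 0.5)]:
    anand, boris = update_ratings(2600, 2300, score)
    print(f"{label}: Anand {round(anand)}, Boris {round(boris)}")
# Anand wins: Anand 2602, Boris 2298
# Boris wins: Anand 2586, Boris 2314
# Draw: Anand 2594, Boris 2306
```

The draw case matches the no-math explanation: the lower-rated player picks up a few points from the higher-rated one.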

A couple of cool points

  • Both the average and the spread of ratings can be arbitrarily chosen — Elo suggested scaling ratings so that a difference of 200 rating points in chess would mean that the stronger player has an expected score of ~0.76, and the USCF initially aimed for an average player to have a rating of 1500.
  • The USCF found that a logistic distribution model provided a better fit than the originally proposed normal distribution. The logistic distribution has slightly longer tails compared to the normal distribution.
  • An Elo rating is only valid within the rating pool where it was established. For example, consider a person with an ELO of 2150 in the All India Chess Federation and another person with an ELO of 2080 in the US Chess Federation — given only these 2 ratings and no other information, one cannot determine who is better.
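For the curious, the normal and logistic models can be compared with a quick Python sketch. The logistic curve uses the usual scale of 400. For the normal model I am assuming each player's performance has a standard deviation of 200 points, so the difference between two players has a standard deviation of 200 × √2; treat that parameter as an assumption on my part.

```python
import math


def logistic_expected(diff: float, scale: float = 400.0) -> float:
    """Expected score of the stronger player under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def normal_expected(diff: float, sigma: float = 200.0) -> float:
    """Expected score under a normal model.

    Assumes each player's performance has standard deviation `sigma`,
    so the difference of two performances has sigma * sqrt(2).
    """
    z = diff / (sigma * math.sqrt(2.0))
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for diff in (200, 400, 800):
    print(diff, round(logistic_expected(diff), 3), round(normal_expected(diff), 3))
```

At a 200-point gap both models land at about 0.76, but at 800 points the logistic model leaves the underdog noticeably more of a chance, which is the longer tail in action.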

A couple of really cool points

  • In a pure Elo system, each game ends in an equal transaction of rating points. If the winner gains 15 rating points, the loser will drop by 15 rating points. However, because players tend to enter the system as novices with a low rating and retire from the system as experienced players with a high rating, the system faces rating deflation. (One way to combat this is to use a K-factor that decreases with experience.)
  • In the long run, the Elo rating system is self-correcting. If a player’s rating is too high, they will perform worse than what the rating system predicts, and if a player’s rating is too low, they will perform better than what the rating system predicts. Thus, their rating eventually settles at the correct value.
  • Sometimes, the rating system discourages game activity for players who wish to protect their rating. Additionally, when players can choose their own opponents, they can choose opponents with minimal risk of losing, and maximum reward for winning. These are the main reasons for many organizations opting out of using an Elo-based rating system.
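The self-correcting behaviour can be illustrated with a small Python sketch. To keep it deterministic, I replace random game results with their average: in every game the player scores exactly what their true strength predicts. That simplification, the fixed 1500-rated opponent, and the names `settle` and `true_rating` are all assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def settle(initial_rating: float, true_rating: float,
           opponent_rating: float = 1500.0, k: float = 32.0,
           games: int = 500) -> float:
    """Follow a mis-rated player's rating over many games.

    Each game's result is replaced by its long-run average, i.e. the
    score the player's true strength predicts against the opponent.
    """
    rating = initial_rating
    average_result = expected_score(true_rating, opponent_rating)
    for _ in range(games):
        predicted = expected_score(rating, opponent_rating)
        # Over-performing the prediction raises the rating, and vice versa.
        rating += k * (average_result - predicted)
    return rating


# An underrated and an overrated player, both with a true strength of 1800.
print(round(settle(1500, 1800)), round(settle(2100, 1800)))
```

Both players drift towards 1800, one from below and one from above. With real, noisy results the rating would wobble around that value instead of settling on it exactly.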

Where else is ELO used?

Chess Boxing

I’m not even kidding: chessboxing is a real thing. As the name implies, you play chess, then you box, then you play chess, then you box, and this keeps going until you either checkmate the other person or knock them out in the boxing ring. The current minimum requirements to fight in a Chess Boxing Global event include an Elo rating of 1600 and a record of at least 50 amateur bouts fought in boxing or another similar martial art.

Athletic sports

In tennis, the Elo-based Universal Tennis Rating (UTR) is the official rating system of major organizations such as the Intercollegiate Tennis Association and World TeamTennis, and as I’m writing this article (September 2020), the number 1 spot is held by Novak Djokovic with a UTR of 15.96. FIFA, the governing body of association football, uses an Elo-based ranking system for its FIFA Men’s and Women’s World Rankings for national teams. In addition, Elo ratings have been adapted for Major League Baseball, the NBA, the NFL, the English Korfball Association, the National Hockey League, and previously for American college football as part of its Bowl Championship Series rating systems until 2013.

Board Games

National Scrabble organizations compute normally distributed Elo ratings, except in the United Kingdom, where a different system is used. The popular First Internet Backgammon Server (FIBS) calculates ratings based on a modified Elo system, and the UK Backgammon Federation uses this FIBS formula for its UK national ratings. The European Go Federation also adopted an Elo-based rating system initially pioneered by the Czech Go Federation.

Card Games

The DCI (formerly Duelists’ Convocation International; now the Wizards Play Network) used Elo ratings for tournaments of Magic: The Gathering and other Wizards of the Coast games up until 2012, and Pokémon USA used the Elo system to rank competitors in its trading card game (TCG) organized play. They both opted out because of a similar concern: that highly rated players often avoid playing to ‘protect their rating’.

Video Games

PUBG is one of the few games that uses the original Elo system. Winning increases the rating and losing decreases it. The change in the ratings isn’t abrupt, so losing one game isn’t a determining factor. PUBG has separate ranking systems for each game mode.

The Esports game Overwatch uses a derivative of the Elo system to rank competitive players with various adjustments made between competitive seasons. In the MMORPG Guild Wars, Elo ratings are used to record guild rating gained and lost through guild versus guild battles, which are two-team fights. Leagues and match-makers for the skill-based game Counter-Strike: Global Offensive often use Elo ratings (the game’s own matchmaking system uses Glicko-2 though). In addition, the games Puzzle Pirates, Roblox, Quidditch Manager, AirMech, and MechWarrior Online all include modified versions of the Elo system.

World of Warcraft formerly used the Elo rating system when teaming up and comparing Arena players, but now uses a system similar to Microsoft’s TrueSkill. The MOBA game League of Legends used a classic Elo system in its early competitive seasons, until the game deployed its own system in season three.

FaceMash

FaceMash was Facebook’s predecessor, and was developed by Mark Zuckerberg in his second year at Harvard. If you’ve seen The Social Network, you might remember the scene where Zuckerberg asks Eduardo Saverin for the equation used to rank chess players, except this time it was for rating the ‘attractiveness’ of female Harvard students.

Saverin writing the Elo equations on the dorm window as Zuckerberg looks on (Credit: ‘The Social Network’, Columbia Pictures)

The movie claims that the Elo equations (although written incorrectly in the above scene) were used in the algorithm in the original FaceMash website. Two students were shown side-by-side and users could vote on who was more attractive. In this scene, Rᴬ can be interpreted as student A’s Elo rating, Rᴮ as student B’s Elo rating, Eᴬ as the probability that student A is more attractive than student B and Eᴮ as the probability that student B is more attractive than student A.

In the movie, it is shown that the site became extremely popular in a short amount of time and even crashed the Harvard servers. Zuckerberg was even charged by the administration with breach of security, violating copyrights, and violating individual privacy, but these charges were ultimately dropped.

Tinder

Tinder, a social networking and online dating app, allows users to swipe left or right based on whether they like or dislike a profile. If two users right swipe on each other, they are ‘matched’ and can then exchange messages.

As with most other dating websites, Tinder’s matching algorithm is a closely guarded secret. However, in a blog post dated 15 March 2019, the company acknowledged that it once used an Elo score as part of its algorithm, one that considered how others engaged with one’s profile, but said that it is no longer being used.

The blog added:

A few years ago, the idea of an “Elo score” was a hot topic among users and media alike. And sometimes, it still is. Here’s the scoop: Elo is old news at Tinder. It’s an outdated measure and our cutting-edge technology no longer relies on it.

So, this part of our algorithm compared Likes and Nopes, and was utilized to show you potential matches who may be a fit for you, based on similarities in the way others would engage with profiles. Based on those profile ratings you received, there was a “score” — in the sense that it was represented with a numeric value in our systems so that it could factor into the other facets in our algorithm.

Today, we don’t rely on Elo — though it is still important for us to consider both parties who Like profiles to form a match. Our current system adjusts the potential matches you see each and every time your profile is Liked or Noped, and any changes to the order of your potential matches are reflected within 24 hours or so. There you have it.

Tinder didn’t clarify what the exact formula for calculating the Elo scores was, but it’s clear that they don’t use it anymore. If, by any chance, you do know how the new rating system works, let me know — asking for a friend.


Raghav Mittal
Purple Theory
