A Stronger Look at Tournament Formats

I’ve written quite a bit about tournament formats in the past — mostly criticizing events with poor or insane formats. To me this seems somewhat natural: tournament formats dictate so much of the outcome of what should be a competitive event and season; so to implement inefficient, unreasonable or unfair tournament formats is a direct attack on the competitive integrity of Dota 2.

One of the key parts of the issue at hand is that there are multiple types of people impacted by tournaments, each with their own motives:

  • players / coaches / managers: want formats that allow them to prove their worth, but not so many games that they spend ages playing matches with only a tiny impact on their progression.
  • tournament organizers: want to keep costs down, but viewership (both live and remote) high. Mostly want to limit the number of days in the expensive arenas/venues to a reasonable amount.
  • fans: want as many teams as possible playing on-stage, but don’t want to spend a huge amount of time watching the event (no super long playoff days, and not so long overall that following the event becomes tedious).

The biggest concern for me (and essentially the final straw in writing this article) is that teams, managers and players (i.e. the people with the strongest incentive to want fair formats) are personally attacked by irrational fans for wanting to improve this fundamental aspect of the tournament circuit, even when it’s a tournament they’re personally not involved in.

Great public response #1
Great public response #2

So let’s look at the information we have about the DPC so far. After some cancellations and Valve revoking Major status from an event — we have 22 DPC events on the calendar, 13 Minors and 9 Majors. With DAC just finished and Starladder around the corner, the public has format information on 17 of these events.

Similar to my previous tournament format analysis, I’ll use a scoring function that measures how well a format ranks the teams ordinally. It slightly punishes formats which don’t resolve all positions (i.e. the 7th and 8th teams ending as ‘7–8th’), which is something I didn’t do last time. It calculates an error value based on “on average, how frequently did a team ranked X by skill place higher than a team ranked Y (by skill), for all Y > X”. I’ve also changed the error function slightly, so the newer values are just base error values (i.e. a lower value means a ‘better’ format). Additionally, there’s now a second tournament format metric, “Top 4 Right %”: the percentage of simulations in which the (true) top 4 teams going into a tournament end up as the top 4 teams at the end.
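To make the two metrics concrete, here is a minimal sketch of how a single simulated result could be scored. It is not the article's actual error function: the half-penalty for unresolved pairs and the names `pairwise_error` and `top4_right` are assumptions made for illustration.

```python
from itertools import combinations

def pairwise_error(placements):
    """placements[i] is the finishing place (1 = winner) of the team whose
    true skill rank is i (index 0 = strongest). Tied places share a number."""
    n = len(placements)
    penalty = 0.0
    for x, y in combinations(range(n), 2):  # x is truly stronger than y
        if placements[x] > placements[y]:
            penalty += 1.0  # the weaker team finished ahead: full inversion
        elif placements[x] == placements[y]:
            penalty += 0.5  # unresolved pair such as '7-8th' (assumed half penalty)
    return penalty / (n * (n - 1) / 2)

def top4_right(placements):
    """True if each of the four strongest teams placed ahead of every other team."""
    return max(placements[:4]) < min(placements[4:])

# A 'perfect' 8-team single elimination still leaves 3-4th and 5-8th unresolved.
perfect = [1, 2, 3, 3, 5, 5, 5, 5]
print(pairwise_error(perfect), top4_right(perfect))
```

Averaging `pairwise_error` and `top4_right` over many simulated runs would give per-format numbers in the spirit of the ones discussed here, though not on the same scale.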

I took each format of the DPC, and ran a set of teams with linearly distributed skill through it 10⁵ times. I also threw in some base test case formats such as various Single Elimination cases. Each head-to-head was simple; that is to say, it did not consider non-transitive rivalries (A > B, B > C, C > A).
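A simulation along these lines can be sketched in a few lines. The article does not state its head-to-head model, so the Elo-style curve in `win_prob`, the rating values, and the run count (20,000 rather than 10⁵, to keep it quick) are all assumptions.

```python
import random

def win_prob(r_a, r_b):
    # Assumed Elo-style logistic curve; the article does not state its model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def play_series(r_a, r_b, best_of, rng):
    """Return True if team A wins a best-of-`best_of` series."""
    need = best_of // 2 + 1
    wins_a = wins_b = 0
    while wins_a < need and wins_b < need:
        if rng.random() < win_prob(r_a, r_b):
            wins_a += 1
        else:
            wins_b += 1
    return wins_a == need

def single_elim(ratings, bracket, best_of, rng):
    """bracket lists team indices in bracket order; returns the champion's index."""
    alive = list(bracket)
    while len(alive) > 1:
        nxt = []
        for a, b in zip(alive[0::2], alive[1::2]):
            nxt.append(a if play_series(ratings[a], ratings[b], best_of, rng) else b)
        alive = nxt
    return alive[0]

ratings = [1700 - 50 * i for i in range(8)]  # linear skill spread (values assumed)
bracket = [0, 7, 3, 4, 1, 6, 2, 5]           # 1v8, 4v5, 2v7, 3v6 seeding
rng = random.Random(1)
runs = 20_000
bo1 = sum(single_elim(ratings, bracket, 1, rng) == 0 for _ in range(runs)) / runs
bo3 = sum(single_elim(ratings, bracket, 3, rng) == 0 for _ in range(runs)) / runs
print(f"best team wins: bo1 {bo1:.3f}, bo3 {bo3:.3f}")
```

Because every match depends only on the two ratings, the model is transitive by construction, which matches the 'simple head-to-head' assumption.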

Finally, I wanted to model another aspect of a format’s quality — how resilient it is to initially bad seeding. To do this, I modeled a perception value within each simulation for each team which was based on the inter-team rating (team[i] - team[i-1]) multiplied by the normal(0, 1) distribution. This means that a strong team which event organizers might undervalue when it comes to seeding (or weaker teams which event organizers might overvalue) will be handled within those model simulations. In a format with randomized Swiss, this should have no impact; but in a format such as single elimination — it’ll have a moderately large impact.
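One possible reading of this perception model, as a sketch only: each team's perceived rating is its true rating plus the gap to its neighbour times a normal(0, 1) draw, and seeds are assigned by perceived rating. The `scale` parameter and the handling of the strongest team (which has no team above it) are assumptions, not details from the article.

```python
import random

def perceived_seeding(ratings, scale, rng):
    """Return team indices ordered by noisy 'perceived' strength.

    ratings is sorted strongest-first. `scale` is a hypothetical knob for how
    far organizers' perception may drift, in units of the rating gap.
    """
    perceived = []
    for i, r in enumerate(ratings):
        # Gap to the neighbouring team; the strongest team reuses the next gap.
        gap = ratings[i - 1] - r if i > 0 else ratings[0] - ratings[1]
        perceived.append(r + scale * gap * rng.gauss(0.0, 1.0))
    return sorted(range(len(ratings)), key=lambda i: -perceived[i])

ratings = [1700 - 50 * i for i in range(8)]  # assumed linear skill spread
print(perceived_seeding(ratings, 0.0, random.Random(3)))  # no noise: true order
print(perceived_seeding(ratings, 2.0, random.Random(3)))  # noisy seeding
```

Feeding the noisy order into a seeded bracket, rather than the true order, is what lets a simulation measure how much a format depends on good initial seeds.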

Error rates and “Top 4 Right %” for various tournament formats

As expected, the recent Dota Asia Championships has had the best format of the season thus far in terms of error values (both with 0 and 50 perception error intervals). It gets the top 4 correct just over 1/9th of the time, the 2nd highest behind bo3 Single Elimination. bo3 SE, however, drops from 21.2% to 12.4% if we associate 50 perception error intervals, showing us that the format relies heavily on having good initial seeds. Since there are so few games (only 15x bo3), there is also high variance associated with each match.
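The effect of series length on variance can be illustrated with a standard calculation: if a team wins each game independently with probability p, its chance of taking a best-of-n series is a sum over how many games it drops before its final win. The helper name `series_win_prob` is hypothetical, and independence between games is an assumption.

```python
from math import comb

def series_win_prob(p, best_of):
    """Probability of winning a best-of-`best_of` series with per-game win chance p."""
    need = best_of // 2 + 1
    q = 1.0 - p
    # The team wins `need` games while losing k games before the final win.
    return sum(comb(need - 1 + k, k) * p ** need * q ** k for k in range(need))

for best_of in (1, 3, 5):
    print(best_of, f"{series_win_prob(0.6, best_of):.3f}")
```

For p = 0.6 the bo3 value works out to p²(3 − 2p) = 0.648, so a longer series helps the favourite only modestly, and upsets stay common when few series are played in total.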

The Summit 8 was a surprise: the best performing 9-team format of the season (and it also outperforms all 8-team formats)! As explained below, part of the format gains are because there are 9, not 8, teams; but also because each group should have 1 very strong team, a middle team, and a weak team (the groups are a bit ‘loaded’ in the model). This means that for a group-stage upset to occur, it generally requires the middle team to upset the top team (often with a game upset from the weaker team also), and since each team plays 6 bo1s, this is quite rare. The more difficult part of getting top 4 correct is just ensuring that the 4th best team makes their way through the wildcards and into top 4.

Captain’s Draft 4.0 is statistically the worst format so far in the season: it’s only slightly better on average than randomized 8-team bo1 single elimination (error rates for that are 12.06/12.08 with perception 0/50 respectively). My only explanations for the error value decreasing with higher perception are that it’s just variance (10⁵ isn’t that big), or (more likely) that my expected seeding-to-group mappings were poor and that a better mapping exists (so a slightly more tweaked/randomized mapping performs better).

In terms of ‘getting the top 4 right’, The Bucharest Major was a failure. Part of this boiled down to the high variance bo1 Swiss group stage. Only ~4.57% of simulations saw the overall top 4 being right, and this dropped marginally to 4.53% with high perception (which does speak to the resilience of the Swiss stage). A single elimination bo1 (but perfectly seeded) format would’ve gotten the top 4 correct 5.97% of the time.

DreamLeague Season 8 / 9 is in itself a benchmark, since it’s just an 8-team double elimination format. Note that 16-team events have scored higher partially because it’s slightly easier for larger formats (which differentiate a large proportion of places) to do so. In the example below showing the perfect run matrices of 8-team single elimination vs 4-team single elimination, 8-team single elim contains 25% equivalencies, whereas the 4-team version contains only ~17%. This means that it’s more difficult for formats that have more clustering (which happens commonly for the last few places of most events) to score as well, so more teams means more unique clusters.

Perfect run (‘target’) matrices for 8-team and 4-team single elimination.
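The equivalence percentages quoted for the two perfect-run matrices can be checked by counting, among all pairs of teams, the pairs that share a finishing place. This is a small sketch; `tied_pair_fraction` is a hypothetical helper name.

```python
from collections import Counter
from math import comb

def tied_pair_fraction(placements):
    """Fraction of team pairs that a final standing leaves unresolved (tied)."""
    counts = Counter(placements)
    tied = sum(comb(c, 2) for c in counts.values())
    return tied / comb(len(placements), 2)

# 8-team single elimination: 1st, 2nd, 3-4th, 5-8th -> 7 of 28 pairs tied
print(tied_pair_fraction([1, 2, 3, 3, 5, 5, 5, 5]))
# 4-team single elimination: 1st, 2nd, 3-4th -> 1 of 6 pairs tied
print(tied_pair_fraction([1, 2, 3, 3]))
```

That gives 7/28 = 25% for the 8-team bracket and 1/6 ≈ 17% for the 4-team bracket, matching the figures in the text.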

Where to from here?

Something I was hoping to do for this article (but realised it’s going to need its own) was normalizing the error value for formats with different numbers of teams. The rough outline of this idea is that 8-team single elimination is as ‘fair’ as 16-team single elimination (with consistent seeding), as is 8-team double elimination vs 16-team double elimination. Using varying n-team m-elimination simulations, we can fairly safely show a dimensionality reduction down to some base dimension (let’s say 8 teams), so that all normalized tournament quality ratings are comparable (in that base dimension).

Other than that, this serves as a good way to benchmark future events (the proposed formats I’ve been told for some of the upcoming events are quite ridiculous), as well as to evaluate suggested changes to formats:

  • what if ESL actually listened and had a single 16-team double elimination bracket which cut to top 2 winner’s bracket + 4 loser’s bracket?
  • what if DAC scrapped the ‘breakout’ matches and instead just moved the 9–12th placed teams into the loser’s bracket?
  • what if no event ever used the ‘3 points for winning a bo2 2–0’ metric?

Events that perform well by these metrics should not rest on their laurels — but rather look to improve and push for better and better formats. Not every format needs to be the same, but events should experiment with formats which are at least reasonable and sane for the teams participating.

Since this was a lot of code for an article (~1k LOC) and it’s possible I made some mistakes — there’s a github of all the code here to check out.


~ Noxville (Ben Steenhuisen)