The last fortnight has been an interesting one. What started with a simple request on our GitHub issue tracker to include series data began a journey down a rabbit hole that has ultimately brought about some very positive change on the site in terms of data quality…
The issue with series data from the Valve API is that the data is awful, since it’s manually created by tournament admins. Furthermore, there is no differentiation between BO2 and BO3 series (or rather there is no BO2 option, so some admins inconsistently use BO3 for both). The feature was also added a little while into Dota 2’s lifetime, so the early games are all unmapped. My solution was to write some code which created series automagically and updated them, perhaps once a day.
The spanner in the works here is that to automatically detect series, one needs highly accurate information about the teams playing in each match. Due to a variety of issues, this is not easy:
- Valve has, at some point, semi-manually merged and/or deleted some teams. Examples of this include Virtus Pro Polar vs Virtus Pro, where both teams were the same after a merge (at times in the same match), or countless Ad Finem / Hellraisers / mouz team deletions over time (so they don’t even appear in GetMatchDetails).
- Matches played before teams could even be selected in the lobby.
- Matches where a team didn’t select their team because they’re incompetent and the admin didn’t correct them.
- A variety of bugs in the client which didn’t set the team correctly in the Web API, despite it being correct in the lobby.
- Some teams actually have two versions of the same team with the same logo but different team IDs, and they alternate these in a series.
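To see why these team problems matter, here is a minimal sketch of the kind of grouping that automatic series detection relies on. It is not the actual datdota code: the `Match` fields, the `group_into_series` name and the three-hour gap threshold are all assumptions made for illustration. Matches in the same league between the same pair of team IDs are grouped together as long as consecutive games start close enough in time.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class Match:
    # Hypothetical record shape, not the Valve API schema.
    match_id: int
    league_id: int
    start_time: int  # Unix seconds
    radiant_team_id: int
    dire_team_id: int


def group_into_series(
    matches: List[Match], max_gap_seconds: int = 3 * 3600
) -> List[List[Match]]:
    """Group matches by league and team pairing, splitting on long gaps.

    The gap threshold is an assumed heuristic, not a known datdota setting.
    """
    series: List[List[Match]] = []
    # Most recent series seen for each (league, unordered team pair).
    latest: Dict[Tuple[int, FrozenSet[int]], List[Match]] = {}
    for m in sorted(matches, key=lambda x: x.start_time):
        # frozenset ignores which team was Radiant and which was Dire.
        key = (m.league_id, frozenset((m.radiant_team_id, m.dire_team_id)))
        current = latest.get(key)
        if current and m.start_time - current[-1].start_time <= max_gap_seconds:
            current.append(m)
        else:
            current = [m]
            latest[key] = current
            series.append(current)
    return series
```

A single wrong or missing team ID changes the grouping key, so one bad match is enough to split a BO3 into two bogus series, which is why the team data has to be fixed first.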
About a year ago, I was able to use the parse-logs from Source 2 matches to fix up a lot of the errors associated with API bugs and team deletions. That worked most likely because the more recent deletions were done poorly, which also makes them more recoverable.
This was the state when beginning the Series Fixer operation. ‘avg’ is the average error rate, ‘tier’ is the tier of the league in which the match was played (1 is premium, 2 is professional, 3 is semi-pro, 4 is amateur) and ‘source’ is the Source engine version used. The most important slices are tiers 1 and 2.
An almost 12% error rate for professional matches on Source 1, on a large sample of 19.5k matches, is a bit too high for my liking. My first approach was to detect 5-man stacks of teams over unbroken stints of time and allocate them to the correct teams. I applied a similar approach to a few key players who also had longer periods with single teams (Arteezy, Dendi, Lakelz, Banana were a few key ones) and looked for unmarked games of theirs. Cross-referencing and comparing with their roster of the time allowed for a good game of whack-a-mole.
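The 5-man stack idea can be sketched roughly as follows. This is an illustration under assumptions rather than the code that was actually run: the `Side` tuple layout and the `infer_teams_from_stacks` name are hypothetical, and the rule used here (only fill an unmarked game when the nearest marked games before and after it, by the same five players, agree on the team) is one conservative reading of “unbroken stint”.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional, Tuple

# (start_time, the five account ids on one side, team_id or None if unmarked)
Side = Tuple[int, FrozenSet[int], Optional[int]]


def infer_teams_from_stacks(sides: List[Side]) -> Dict[int, int]:
    """Return {index into `sides`: inferred team_id} for unmarked sides."""
    by_stack = defaultdict(list)
    for idx, (start, players, team) in enumerate(sides):
        by_stack[players].append((start, idx, team))

    fixes: Dict[int, int] = {}
    for games in by_stack.values():
        # Sort by start time; idx is unique, so team is never compared.
        games.sort(key=lambda g: (g[0], g[1]))
        for pos, (_, idx, team) in enumerate(games):
            if team is not None:
                continue
            # Nearest marked game before and after this one, same five players.
            before = [g[2] for g in games[:pos] if g[2] is not None]
            after = [g[2] for g in games[pos + 1:] if g[2] is not None]
            if before and after and before[-1] == after[0]:
                fixes[idx] = before[-1]
    return fixes
```

Games at the very start or end of a stint, or sitting between two different team IDs, are deliberately left alone here; those are the ones that still need a human to check the roster of the time.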
This provided substantial improvements, but mostly for Source 2 games — Source 1 remained around the 10% error rate. As a result I decided to bite the bullet and write a Source 1 parser to get data out. Source 1 is a lot more primitive than Source 2, and a lot of the nice entity management utilities I’d written for re-use were not available. Instead of just getting the broken Source 1 replays, I got all Source 1 matches, meaning that coding actual data extraction in the future will be a lot easier now that this is set up.
After a few days trying to download and parse 38.5k replays (of which around 2.5k are no longer available because Valve deleted them or removed the cluster from the CDN, and ~450 have irrecoverable replay bugs), I was able to diff on the unknown teams and approve each of the fixes to the broken matches.
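As a rough illustration of what such a diff might look like, the sketch below compares the team IDs stored from the API against those recovered from parsed replays and lists candidate fixes for review. The dictionary layout, the `ProposedFix` type and the treatment of `0` as “unknown” are assumptions for the example, not the real datdota schema.

```python
from typing import Dict, List, NamedTuple, Optional

TeamsBySide = Dict[str, Optional[int]]  # {"radiant": id, "dire": id}


class ProposedFix(NamedTuple):
    match_id: int
    side: str
    api_team_id: Optional[int]
    parsed_team_id: int


def diff_unknown_teams(
    api: Dict[int, TeamsBySide], parsed: Dict[int, TeamsBySide]
) -> List[ProposedFix]:
    """List sides where the API has no team but the parsed replay does."""
    fixes: List[ProposedFix] = []
    for match_id in sorted(api):
        replay = parsed.get(match_id)
        if replay is None:
            continue  # replay missing or unparseable, nothing to propose
        for side in ("radiant", "dire"):
            api_id = api[match_id].get(side)
            parsed_id = replay.get(side)
            # Treat None and 0 as "unknown" (an assumption for this sketch).
            if not api_id and parsed_id:
                fixes.append(ProposedFix(match_id, side, api_id, parsed_id))
    return fixes
```

The output is only a list of proposals; nothing is written back automatically, which leaves room for approving each fix by hand.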
It’s possible that some of the missing replays got migrated to another cluster, but I was operating on the old datdota (salt & cluster) dump for all of the replay URLs (since these come from a rate-limited Game Coordinator call and I didn’t want this to take ~80 days to update). I might go back and see if any are fixed in the future.
All in all, massive improvements for Source 1, especially in premium events (more competent admins yay!). Over the crucial tier 1 / 2 slice, including Source 1 and Source 2, there are currently 84100 team-matches (i.e. 42050 pro matches) of which 1024 are broken. This translates into a 1.22% team error rate (1.2% rate pre-TI3, 1.6% post-TI3). For these, Grotzi and I have been manually researching (shoutout to Liquipedia here!) and correcting the games by hand, but this process will likely take a few months (we’ve currently done around 10%). Some of the teams (like World Elite, 24 of the broken matches) need synthetic team_ids since they played before the team system was introduced.
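For anyone wanting to check the headline figure, the arithmetic is simple: each pro match contributes two team-matches (one per side), and the error rate is broken team-matches over total team-matches. The variable names below are just for illustration.

```python
team_matches = 84100   # tier 1 / 2 team-matches, Source 1 and Source 2
broken = 1024          # team-matches with a wrong or missing team

pro_matches = team_matches // 2            # two sides per match
error_rate_pct = 100 * broken / team_matches

print(pro_matches, round(error_rate_pct, 2))  # prints: 42050 1.22
```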
So how does this impact the quality of the data on the site?
- Better player and team historical data (less than 1% error rate by TI9).
- More accurate ratings over specific periods of time (also means the dataset is more accurate to measure rating systems over time).
- Ability to finally add series data to each match in a much more accurate way.
- Ability to add in new data tracking that goes back to Source 1 times, and reparse data very quickly (we’ve also recently moved up to a beefier server which makes this even better).
- A sense of pride and accomplishment.
Coming up before TI9 I’ll hopefully have finished integrating the TI1 data that was crowdsourced, and finished up the series data (which is why I went down this goddam rabbit hole to begin with).
(be sure to chat about Dota stats on our Discord Server)