Keep Role-ing

As promised in the previous blogpost, there were a bunch of features still to do before mid-February. This last week, mostly due to lack of access to a decent PC whilst moving apartments, I’ve crunched through a bunch of these (and more!).

API Functionality

I’ve finished migrating all the existing Queries (ones that run through the Query Cache), except for MatchFinder Classic, Crits, and Frames. MatchFinder will still be done, but Crits are not used frequently so I’m less inclined to rush that. Frames are getting a general rewrite into the new API system at some point — so they’ll not only be faster, but also come into the API with more options. At some point (quite a bit lower priority) I’ll look to move the less complex queries into the same system, like /leagues/$leagueId, /players/$steamId, etc.

Data Backfilling

Valve have done some naughty things in the past, including deleting historic match data from their WebAPI for a few teams. Our general ingest process takes a match stub from that WebAPI and then enriches it (and its associated entities) with data from a variety of sources. We always had the match stubs correct; the problem is that some of them were missing which team each side was representing. Sometimes it’s simply a case of teams forgetting to set their team correctly in the lobby, or a lobby bug which removes the set team information, and sometimes it’s due to Valve …

Based on the replay parsing, I’ve managed to fix a large number of these broken ‘team appearances’ to their correct teams, and will likely try to manually fix some more over time (I might write some code to spot situations like seeing Loda, s4, AdmiralBulldog, EGM and Akke in a team in mid 2013, which you can probably safely mark as Alliance). As shown in the diagram below, the average error rate has decreased from 11.26% to 4.83% (a drop of 6.43 percentage points) within the parsed professional match database, and hopefully we’ll see similar gains when our Source 1 parser is done.
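A minimal sketch of that kind of roster check might look like the following. It is only an illustration: the `KNOWN_ROSTERS` table, the `infer_team` helper and the exact date window are assumptions, and a real version would presumably key on Steam account IDs rather than player names.

```python
from datetime import date

# Hypothetical reference rosters: (team, valid from, valid to, players).
# The Alliance line-up comes from the post; the date window is an assumption.
KNOWN_ROSTERS = [
    ("Alliance", date(2013, 1, 1), date(2013, 12, 31),
     frozenset({"Loda", "s4", "AdmiralBulldog", "EGM", "Akke"})),
]


def infer_team(players, played_on, min_overlap=5):
    """Return the team whose known roster matches these players, or None."""
    lineup = set(players)
    for team, start, end, roster in KNOWN_ROSTERS:
        in_window = start <= played_on <= end
        if in_window and len(lineup & roster) >= min_overlap:
            return team
    return None


lineup = ["Loda", "s4", "AdmiralBulldog", "EGM", "Akke"]
print(infer_team(lineup, date(2013, 6, 15)))  # Alliance
```

Lowering `min_overlap` to 4 would also catch games played with a stand-in, at the cost of more false positives, so anything below a full match is probably best flagged for manual review rather than fixed automatically.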

Bonus: some astute readers might wonder why the total valid and invalid team appearances do not sum to an even number. This is because Valve, in their naughtiness, retroactively changed VP Polar’s games to have VP’s team ID after VP Polar disbanded. In games where VP Polar played the real VP it now seems like VP vs VP, so there’s only one team appearance (with 10 players associated with it!). I’ve manually fixed a bunch of these games — so there’s a low, odd number of them still broken.

Teamfight Data

About 3 years ago, Decoud integrated teamfight capturing into datdota. It was a series of rules which did an excellent job at classifying and filtering out teamfights — and datdota was the first site to have full teamfight parsing.

Step forward to today, and we think that this initial approach missed some edge cases and could be improved upon in general: more data, more context. The processing side of this is the first component, and thanks to some excellent work from invokr, this is now done. Once it passes review, it’ll find its way into the site. We’re still looking for the best way of representing this data.

Hero Head To Head Elo

A very interesting feature I had on my old, now defunct stats site was the ability to look at Elo shifts per hero, and filter by enemy heroes. This is something I wanted to revisit and revamp for the new datdota, and it’s now live.

It’s a simple way of evaluating specific matchups between heroes, using a normalized statistic to account for the skill difference between the teams. A hero that is 10–13 seems bad, but if 20 of those 23 games saw the hero played by an awful team against an excellent team, then the hero is probably very good!

This table uses the data on the datdota ratings page for normalization.
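To make the 10–13 example concrete, here is a rough sketch of one way to normalize a record by team strength. It uses the standard Elo expected-score formula with a 400-point scale and made-up ratings; the exact statistic datdota computes isn't spelled out here, so treat `wins_above_expectation` as an assumed stand-in rather than the site's formula.

```python
def expected_score(rating, opponent_rating, scale=400.0):
    """Standard Elo win probability for `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / scale))


def wins_above_expectation(games):
    """games: iterable of (team_rating, enemy_rating, won) for one hero."""
    games = list(games)
    actual = sum(1 for _, _, won in games if won)
    expected = sum(expected_score(r, e) for r, e, _ in games)
    return actual - expected


# Hypothetical 10-13 record: 20 games as a 1400-rated team against a
# 1700-rated team (7 wins), plus 3 games between equal teams (3 wins).
games = (
    [(1400, 1700, True)] * 7
    + [(1400, 1700, False)] * 13
    + [(1500, 1500, True)] * 3
)
print(round(wins_above_expectation(games), 2))  # 5.48
```

The weaker team is only expected to win about 15% of those 20 games, so a losing record on paper still comes out well ahead of expectation.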

It automatically sorts the axes by the most popular heroes within the result set. You can filter to only show matchups where more than some threshold number of games have happened with that matchup (else it’s just a blank cell). Hovering over a cell lets you look at the exact win-loss record.

Note that the table is symmetrical (the win-loss record and Elo shift of A vs B must mirror those of B vs A) until you filter by a team or player, at which point it becomes asymmetric.
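The threshold filter and the mirroring can be sketched in a few lines. The `head_to_head` function and the three matches below are invented purely for illustration (the hero names are real, the results are not), and this is not how the site's query is actually implemented.

```python
from collections import defaultdict


def head_to_head(matches, min_games=1):
    """matches: iterable of (winning_heroes, losing_heroes) pairs.

    Returns {(hero, enemy): (wins, losses)}, keeping only matchups
    that meet the `min_games` threshold.
    """
    record = defaultdict(lambda: [0, 0])
    for winners, losers in matches:
        for winner in winners:
            for loser in losers:
                record[(winner, loser)][0] += 1  # winner beat loser
                record[(loser, winner)][1] += 1  # loser lost to winner
    return {pair: tuple(wl) for pair, wl in record.items()
            if sum(wl) >= min_games}


# Made-up results, two heroes per side to keep the example short.
matches = [
    (["Io", "Gyrocopter"], ["Batrider", "Lifestealer"]),
    (["Batrider", "Gyrocopter"], ["Io", "Lifestealer"]),
    (["Io", "Lifestealer"], ["Batrider", "Gyrocopter"]),
]
table = head_to_head(matches, min_games=3)
print(table[("Io", "Batrider")])   # (2, 1)
print(table[("Batrider", "Io")])   # (1, 2)
```

Every game adds a win to one cell and a loss to its mirror, which is why the unfiltered table is symmetric. Restricting one side to a given team or player drops only some of those mirrored entries, and the symmetry goes with them.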

Role Data

Part of what I wanted to get working in the first half of this year was an ML processing pipeline for various uses around the site. I went with scikit-learn, a popular Python machine learning library that I’m quite familiar with. Since the backend for datdota is written in Java/Groovy (it’s a Grails project), Spark/MapReduce were reasonable alternatives, but my lack of familiarity with them and not wanting to run any big infrastructure were deterrents. From my experience, scikit-learn is also better known among Dota 2 data enthusiasts, so it forms a relatively easy way for them to contribute to the site if they wish.

The simplest feature I could think to add as a proof of concept for the pipeline was a ‘role’ classifier: a way to try to predict which players in a game are playing as a ‘core’ or a ‘support’. This now happens automatically for games coming into the system, and has already been run on all existing Source 2 (post-Reborn) matches. Some of the features required for this model involve mid-game data, so it’ll only be available on parsed games (i.e. Source 2 games where the replay makes it safely to the CDN, roughly from patch 6.84). I’m relatively happy with the data output, despite some quirks, which normally occur when a core is completely sacrificed and the game ends very quickly, or when a support just snowballs out of control. I’m okay with these relatively uncommon inaccuracies.

This pipeline is now productionized, and the predicted role data is available for filtering on multiple pages.
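For anyone curious what a core/support classifier can look like in scikit-learn, here is a toy sketch. The three features, the training rows and the choice of logistic regression are all assumptions made for illustration; the features and model actually used on the site aren't described in this post.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical per-player features at the 10 minute mark:
# [share of team net worth, last hits, observer wards placed]
X_train = [
    [0.30, 62, 0], [0.27, 55, 0], [0.24, 48, 1],  # cores
    [0.10, 6, 4], [0.09, 4, 5],                   # supports
    [0.31, 70, 0], [0.26, 51, 0], [0.23, 45, 0],  # cores
    [0.11, 9, 3], [0.09, 3, 6],                   # supports
]
y_train = ["core"] * 3 + ["support"] * 2 + ["core"] * 3 + ["support"] * 2

# Scale the features first so last hits don't dominate the net worth share.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

new_players = [[0.29, 58, 0], [0.08, 5, 5]]
predictions = model.predict(new_players).tolist()
print(predictions)
```

The quirks mentioned above would show up here as rows that look like the wrong class: a sacrificed core with almost no farm, or a snowballing support with a core-sized share of net worth.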

What’s Next?

  • laning data frontend page & filter work
  • factional advantage frontend page
  • ability / talent pages

As always, if you have data/feature requests, feel free to ask away on our Discord server or via Twitter.


~ Noxville