Predicting Win Rates in Age of Empires 2 HD

Mike Xie
6 min read · Jan 10, 2020


Age of Empires 2 is a famous medieval war game that’s about as historically accurate as Mel Gibson’s Braveheart. It was originally released in 1997 and has a cult following to this day. While the RTS genre has been declining for a long time, this game has warranted two remasters: the first in 2013, bringing it to 1080p, and the second in winter 2019, bringing it to 4K. On release, the 4K version broke into the top 10 most-viewed games on Twitch, and it continues to post numbers similar to its longtime e-sport rival StarCraft 2, which has an entire professional ecosystem in Korea.

Unlike StarCraft 2, however, the game’s playable factions aren’t really balanced, Microsoft until recently didn’t support tournaments at all, and every version still runs on the same game engine from 1997. The UI, unit AI, and so on are all terrible by today’s standards, and the game still lacks modern features like reconnecting to dropped games or more than a 256-color palette.

Its one saving grace, which other game series still haven’t replicated, is randomly generated maps that are usually reasonably playable and fair. That is perhaps what keeps the replay value up and people playing to this day; I can’t think of any other reason why.

Having played this game for most of my life in its various forms, I wondered if we could predict win rates in 1v1 games.

Data Collection

Conveniently enough, someone already wrote a scraper for Voobly, the unofficial client that the competitive community uses.

Data Quality

The dataset started with 411,123 1v1 match participant rows, which after cleaning out errors left us with 205,317 matches.

Unfortunately, the quality of the data suffers from inaccurate Elo scores. In most modern games, players can’t make new accounts without buying more game licenses. This isn’t true for AoE2, which is plagued by ‘smurfs’: veterans who make new accounts and prey on newbies. Some free games like League of Legends have algorithms to detect false newbies; this one does not, and I don’t know how to implement one and assume it’s quite difficult to do. Normally, Elo works so that every 120-point gap corresponds to roughly a 2:1 difference in expected win chance, and the bins below are 120 points apart.
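For reference, the expected win rates quoted below come straight out of the standard Elo expectation formula; a minimal sketch:

```python
# Standard Elo expectation: a 400-point gap corresponds to 10:1 odds,
# so a 120-point gap works out to roughly 2:1 (10 ** (120 / 400) ≈ 2).
def expected_win_rate(elo_diff):
    return 1 / (1 + 10 ** (-elo_diff / 400))

# Expected win rate for each 120-point bin
for diff in range(-360, 361, 120):
    print(f"{diff:+4d}: {expected_win_rate(diff):.0%}")
# -360: 11%, -240: 20%, -120: 33%, +0: 50%, +120: 67%, +240: 80%, +360: 89%
```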

We’d expect the win rates to go:

roughly 11, 20, 33, 50, 67, 80, and 89% if Elo is working correctly.

But the observed rates are more like 30, 36, 44, 48, 52, 56, and 70%, as shown below:

Note: the y-axis win rate numbers are half of what they should be; sorry, I couldn’t figure out this bug in time:

This is also reflected in the PDP plot for Elo difference, which should have a much more aggressive slope. There are also probably players sandbagging (losing games on purpose) so they can play with their beginner friends, and players doing the opposite.
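For anyone wanting to reproduce a plot like that, scikit-learn can generate partial dependence plots directly. Here is a hedged sketch on synthetic stand-in data (the real notebook’s model and column names may differ):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

# Illustrative stand-in data: an Elo difference column and a noisy win/loss outcome.
rng = np.random.default_rng(0)
X = pd.DataFrame({"elo_diff": rng.normal(0, 97, 5000)})
y = (rng.random(5000) < 1 / (1 + 10 ** (-X["elo_diff"] / 400))).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Partial dependence of the predicted win probability on Elo difference.
PartialDependenceDisplay.from_estimator(model, X, features=["elo_diff"])
plt.show()
```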

The game also uses an ancient, hacky mod system without official support, and I don’t have the time or domain expertise to figure out what the mods do. There are more than 35 of them, and some have drastic impacts on win rate, like the most popular one shown below:

However, the faction win rates are fairly consistent with the community tier list (which is based on data from all games in Voobly’s history), which is good. The gap between the strongest and weakest factions is large: historically, Franks hover around 60% against the field and Khmer around 40%.

Strong: Aztec, Mayan, Frank, Chinese, Viking, Briton, Spanish, Slav, Hun, Japanese, Mongol

Average: Inca, Celt, Byzantine, Portuguese, Indian, Persian, Magyar, Berber, Mali, Ethiopian

Weak: Khmer, Korean, Goth, Saracen, Vietnamese, Malay, Burma, Turk, Teuton, Portuguese

Unfortunately, we didn’t have a whole lot else to work with. Most of the metadata, like user screen names and links to where the game recording can be found, shouldn’t have any effect on the game outcome.

Data Wrangling & Feature Engineering

The data originally had one player per row.

After clearing out all of the columns with errors documented in the readme, I had to shuffle the rows randomly, since the odd-numbered player rows were always the match winner, which caused leakage in the models at first.

I then merged the odd and even rows so that each row contained both players in a match, and made new features like their difference in Elo points and civilization strength.

In the end, we started with each player’s faction and Elo score and engineered features for the difference in faction strength and Elo score, as sketched below.
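Here is a rough pandas sketch of that wrangling step; the file name, column names (match_id, elo, civ, winner), and tier values are illustrative assumptions, not the scraper’s actual schema:

```python
import pandas as pd

# Hypothetical schema for illustration: one row per match participant.
df = pd.read_csv("voobly_1v1_players.csv")  # columns: match_id, elo, civ, winner

# Example civ-strength tiers from the community tier list (3 = strong, 1 = weak);
# only a few civs are shown here, the full mapping covers every faction.
civ_tier = {"Franks": 3, "Aztecs": 3, "Byzantines": 2, "Khmer": 1}
df["civ_tier"] = df["civ"].map(civ_tier)

# Shuffle so the winner isn't always the odd-numbered row (label leakage),
# then split each match's two participants into "player 1" and "player 2".
df = df.sample(frac=1, random_state=42)
df["slot"] = df.groupby("match_id").cumcount()
p1 = df[df["slot"] == 0].set_index("match_id").add_prefix("p1_")
p2 = df[df["slot"] == 1].set_index("match_id").add_prefix("p2_")
matches = p1.join(p2)

# Engineered features: Elo difference and civ-strength ("civ advantage") difference.
matches["elo_diff"] = matches["p1_elo"] - matches["p2_elo"]
matches["civ_adv"] = matches["p1_civ_tier"] - matches["p2_civ_tier"]
matches["p1_won"] = matches["p1_winner"].astype(int)
```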

Fitting Models

We start with a baseline of 50%, since in a 1v1 there has to be a winner and a loser. I couldn’t get linear regression more than about 2% above that baseline no matter what permutation of features or cross-validation scheme I tried. Presumably, this had something to do with most players playing opponents of similar skill: the mean Elo difference was only 1 point, with a standard deviation of 97, and the 25% and 75% quartiles were -51 and +51 points.

The non-linear models did much better compared to the baseline: XGB regression at 77%, random forest at 77%, and decision tree at 73% consistently scored similarly, with ROC AUC hovering around 62% for each model. Below is the curve for the decision forest.
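A sketch of that model comparison, continuing from the wrangling sketch above (the hyperparameters are guesses, and XGBClassifier is used here as a stand-in for the XGB regression mentioned in the post):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Features and target from the wrangling sketch above.
X = matches[["elo_diff", "civ_adv", "p1_civ_tier", "p2_civ_tier"]]
y = matches["p1_won"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy={acc:.2f}, ROC AUC={auc:.2f}")
```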

Feature importance showed that civ advantage (player 1’s civ tier minus player 2’s civ tier) wasn’t a very important feature, and neither individual civ tier feature shows up at all. However, removing these three from the models lowers accuracy to between 50% and 59%, so I’m not really sure why this is.

The false-negative rate for all models was substantially higher than the false-positive rate. I’m not really sure why, and this might require domain expertise. Here are the confusion matrices for the decision tree, random forest, and XGB regression:

Decision Tree
Random Forest
XGB Regression
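For completeness, confusion matrices like these can be produced from the same fitted models (continuing from the model sketch above):

```python
from sklearn.metrics import confusion_matrix

# Rows are actual outcomes, columns are predictions; the off-diagonal
# cells give the false-positive and false-negative counts discussed above.
for name, model in models.items():
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))
```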

Conclusions

Elo and which civs people play compared to their opponents are, on their own, surprisingly accurate predictors of who will win, and in the way you’d expect: players with comparatively higher Elo and stronger factions are more likely to win.

But at the end of the day, most players, even on this competitive server, play what they want. Below is the popularity chart; if you scroll up, you’ll see it has no obvious correlation with the win rate chart. However, the newer factions (starting with the Slavs) are all much less popular than the old ones from 1997/1999.

Future Topics of Exploration

This game, being ancient, has players indicate team membership in a variety of ways, like putting the team name in brackets or following it with a period. Someone who is clever, diligent, and can use regex might be able to predict whether or not a person belongs to a team based on their user name; players in teams are likely better than players who don’t belong to one.
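As a rough starting point, here is a hedged regex sketch for the bracket and trailing-period conventions mentioned above; the patterns and example names are guesses, not a validated list of clan-tag formats:

```python
import re

# Guessed clan-tag conventions: a [Tag] or (Tag) prefix, or a short
# prefix separated from the rest of the name by a period, e.g. "Clan.Player".
CLAN_TAG = re.compile(r"^\s*[\[\(][^\]\)]{1,10}[\]\)]|^[A-Za-z0-9]{2,8}\.")

def looks_like_team_member(username):
    return bool(CLAN_TAG.search(username))

print(looks_like_team_member("[AoE]Knight"))  # True
print(looks_like_team_member("Clan.Player"))  # True
print(looks_like_team_member("newbie123"))    # False
```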

Any further projects on this game should probably be done on the 2019 release, which includes four new factions, as soon as that version has game tracker support.

The Python notebook can be found here if you’d like to make a copy and play with it.
