Predictions and Suggestions using Hero Embeddings in Dota 2 — Part 3 — Predicting Match Outcomes

(NOTE: The code shared in this post was a product of rapid prototyping. For better and more readable code, check the whole script on my GitHub. I hope to update these snippets soon.)

In Part 1 and Part 2, we trained our embeddings and saved them to disk. In this part, we’re going to build a model that predicts match outcomes using the pre-trained embeddings.


At this stage, I started with a standard feed-forward network and iterated through different numbers of layers. Beyond the standard feed-forward procedure, one particularly interesting experiment was adding residual connections. At the time I coded this project, I was reading the ResNet paper, and I decided to try the idea on my model. Unfortunately, residual connections did not improve much on the previous performance: on average there seemed to be only a slight improvement over a non-residual network with a comparable number of parameters. That is disappointing, but not surprising.

Back to the model: assuming that playing on the Radiant or the Dire side has no impact on a team's chances, we can swap the teams and invert the outcome of each game to get the most out of the data we have. Although this assumption can be challenged, it seems reasonable enough that I will not try to prove it here. If you wanted to, you could cluster similar team compositions and examine how consistently a particular side wins to test this assumption empirically.
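To make the swap concrete, here is a minimal sketch of the augmentation. The match layout (a tuple of Radiant hero ids, Dire hero ids and a 0/1 `radiant_win` label) and the function names are assumptions for illustration, not the format used in the full script.

```python
def swap_sides(match):
    """Return the same match seen from the other side, with the label inverted."""
    radiant, dire, radiant_win = match
    return (dire, radiant, 1 - radiant_win)


def augment(matches):
    """Keep every original match and add its side-swapped mirror."""
    augmented = []
    for match in matches:
        augmented.append(match)
        augmented.append(swap_sides(match))
    return augmented


# Hypothetical hero ids; 1 means the first-listed team won.
matches = [((1, 2, 3, 4, 5), (6, 7, 8, 9, 10), 1)]
augmented = augment(matches)
```

This doubles the number of training examples, and it also forces the model to treat both team slots symmetrically.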

We start by loading the data from disk, then organize the match data and write a function that feeds batches into our model.
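A batch feeder along these lines could be as small as the generator below. This is only a sketch: the name `iterate_batches`, its parameters, and the assumption that features and labels are parallel Python lists are mine, not taken from the full script.

```python
import random


def iterate_batches(features, labels, batch_size, shuffle=True, seed=None):
    """Yield (feature_batch, label_batch) pairs covering the data once."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    if shuffle:
        # Reshuffle on every pass so batches differ between epochs.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [features[i] for i in chunk], [labels[i] for i in chunk]


# Toy data: ten "matches" with alternating labels.
features = list(range(10))
labels = [x % 2 for x in features]
batches = list(iterate_batches(features, labels, batch_size=4, shuffle=False))
```

The last batch may be smaller than `batch_size`; whether to drop it or keep it is a design choice left open here.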

The next step is a very simple feed-forward network. You’ll notice I explicitly wrote down the matrix multiplication and activation for each layer. If you’re bothered by the mess, you could wrap all the layers in a loop, but I find it easier to tweak things with explicit declarations, which is why I left it this way.

In the code above, I’m using some hyperparameters we haven’t declared yet, like useDropout, useL2, alpha, Pkeep and even hidden_neurons_*. Since at some point I want to exhaustively search for the best parameters within a range, I’m taking them directly from the program arguments. Now, you might say the number of layers could be a hyperparameter as well, and that’s a valid point, but as I didn’t have the time or financial resources to test this forever, I settled on 7 hidden layers. If you want to play around or add the number of layers as a hyperparameter, it should be relatively straightforward to do so. Here’s the (first) part of the actual script that reads parameters from the command line:
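For reference, a minimal sketch of reading such parameters with `argparse` might look like the following. The flag names mirror the hyperparameters mentioned above, but the defaults, the help texts and the single list-valued `--hidden_neurons` flag (instead of one `hidden_neurons_*` argument per layer) are assumptions, not the actual script's interface.

```python
import argparse


def build_parser():
    """Build a parser for the hyperparameters named in the text."""
    parser = argparse.ArgumentParser(description="Train the match outcome model.")
    parser.add_argument("--useDropout", action="store_true",
                        help="enable dropout")
    parser.add_argument("--useL2", action="store_true",
                        help="enable L2 regularization")
    # Default values below are placeholders, not tuned settings.
    parser.add_argument("--alpha", type=float, default=0.001,
                        help="alpha hyperparameter")
    parser.add_argument("--Pkeep", type=float, default=0.95,
                        help="dropout keep probability")
    parser.add_argument("--hidden_neurons", type=int, nargs="+",
                        default=[128, 84, 48, 32, 16],
                        help="size of each hidden layer, in order")
    return parser


# Parsing an explicit list instead of sys.argv keeps the example reproducible.
args = build_parser().parse_args(
    ["--useL2", "--Pkeep", "0.9", "--hidden_neurons", "64", "32"]
)
```

Taking the layer sizes as a list also makes the number of layers a hyperparameter for free.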

Also, since it takes a considerable amount of time to train the network, we want to use TensorFlow’s Saver object to periodically save our model.

Again, if you want, you can wrap the hidden layers in a nice loop using exec() in both of the snippets above. But at the time, I was changing the code too frequently for this to be worth doing.
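A plain loop over a list of layer sizes also works without exec(). The sketch below uses pure Python rather than TensorFlow so it stays self-contained; the layer sizes, the initialization scheme, the ReLU activations and the sigmoid output are illustrative assumptions, and dropout and L2 regularization are left out.

```python
import math
import random


def init_layers(sizes, seed=0):
    """Create one (weights, biases) pair per consecutive pair of layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_out)]
                   for _ in range(n_in)]
        layers.append((weights, [0.0] * n_out))
    return layers


def dense(x, weights, biases):
    """Compute x @ weights + biases for a single input vector."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]


def forward(x, layers):
    """Apply ReLU hidden layers in a loop, then a sigmoid output unit."""
    for weights, biases in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, weights, biases)]
    weights, biases = layers[-1]
    logit = dense(x, weights, biases)[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical sizes: a 20-dimensional input, two hidden layers, one output.
layers = init_layers([20, 16, 8, 1])
probability = forward([0.1] * 20, layers)
```

Changing the depth then only means editing the list passed to `init_layers`.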

Finally, it’s time to train the model. You’ll notice I’m keeping track of loss and accuracy, as we’ll plot an accuracy-over-time graph after training is done to help visualize the training process. One thing to keep in mind is that, even though accuracy hits a plateau fairly quickly, the loss keeps going down. That’s because our accuracy metric doesn’t take the confidence of a prediction into account, while cross-entropy loss does.
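A tiny worked example shows how this can happen. The probabilities below are made up for illustration: between the "early" and "late" predictions no example crosses the 0.5 threshold, so accuracy is unchanged, but the correct predictions become more confident, so the cross-entropy drops.

```python
import math


def accuracy(probs, labels):
    """Fraction of examples where thresholding at 0.5 matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def cross_entropy(probs, labels):
    """Mean binary cross-entropy of predicted probabilities."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels))
    return -total / len(labels)


labels = [1, 0, 1, 1]
early = [0.55, 0.45, 0.60, 0.40]  # barely on the right side of 0.5
late = [0.80, 0.20, 0.90, 0.40]   # same decisions, more confident

early_acc, late_acc = accuracy(early, labels), accuracy(late, labels)
early_loss, late_loss = cross_entropy(early, labels), cross_entropy(late, labels)
```

Both sets of predictions get three of the four examples right, yet the loss for `late` is clearly lower than for `early`.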

Also, in the snippet below I'm not using cross validation. That's only because the problem is too complicated for my tweaks to overfit to the validation set. I did implement cross validation initially, but the difference between validation and test accuracies was well within the variation expected from the data distribution each time I ran the experiment. If you want to implement it, it's as simple as copying a couple of lines of code for the validation set.

To evaluate the test cases, I wrote a Python script that runs 4 cases in parallel on an 8-core CPU machine (even when running one at a time, the script won’t use more than 2 cores). It still took a long time to compute all the results; I believe running multiple jobs at once hurts performance, as my CPU cache is very limited. So be careful before you go crazy on the number of scenarios to try.
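A runner of this kind can be sketched with the standard library alone. Everything here is an assumption for illustration: the training script name `train.py`, the way a configuration dictionary is turned into command-line flags, and the use of threads that each wait on one subprocess. It is not the script I actually used.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(config, script="train.py"):
    """Turn a dict of hyperparameters into a command line for the script."""
    command = [sys.executable, script]
    for name, value in config.items():
        command += ["--" + name, str(value)]
    return command


def run_config(config):
    """Run one training job as a subprocess and return its exit code."""
    result = subprocess.run(build_command(config), capture_output=True, text=True)
    return result.returncode


def run_all(configs, worker=run_config, max_parallel=4):
    """Run the jobs with at most max_parallel active at once, keeping order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(worker, configs))


# Two hypothetical scenarios; pass them to run_all(configs) to launch the jobs.
configs = [{"Pkeep": 0.9}, {"Pkeep": 0.95}]
```

Lowering `max_parallel` is the knob to turn if jobs start slowing each other down.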

Among the 50 or so sets of hyperparameters I set by hand, this is the graph corresponding to the best result I found:

This particular model corresponds to the parameters:

beta_L2 = 0.01, Pkeep = 0.95, hidden layer sizes = [128, 84, 48, 32, 16]

All that being said, I’m certain that one can train this kind of model with somewhat similar hyperparameters and get noticeably better results. Hopefully this is enough to conclude this part of the series, as it demonstrates that representing heroes in a feature space outperforms one-hot representations for the job by a wide margin.

I’m hoping to do one more part for this series regarding a suggestion engine using the pre-trained embeddings. Until then, take care.
