Predictions and Suggestions using Hero Embeddings in Dota 2 — Part 2 — Training Hero Embeddings

(NOTE: The code shared in this post was a product of rapid prototyping. For better and more readable code, check the whole script on my GitHub. I hope to update these snippets soon.)

In Part 1, we reasoned about the key ideas behind our model and gathered the data we need. In this part, we train the hero embeddings.

Training Hero Embeddings:

Let’s begin by importing the libraries we will (or might) use.

To be able to print in a human-readable format, I’m loading hero names into an array. But if you take a look at the hero_names.json, you’ll notice that, for some reason, a hero with the id 24 doesn’t exist. So we’re fixing that as well.

hero_names.json has the following format and can be found here or you can use this Dota 2 Web API call to pull it.
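As a rough sketch of the loading step: I’m assuming here that the file mirrors the response of the Dota 2 Web API hero list, i.e. a "result" object holding a "heroes" list whose entries carry "id" and "localized_name". The inline sample, the `load_hero_names` helper and the `<unused>` placeholder are illustrative only. Indexing the array by hero id and filling the gaps with a placeholder is one simple way to deal with the missing id 24.

```python
import json

# Assumed file shape (a trimmed, illustrative sample, not the real file).
SAMPLE_JSON = """
{"result": {"heroes": [
    {"id": 1,  "localized_name": "Anti-Mage"},
    {"id": 23, "localized_name": "Kunkka"},
    {"id": 25, "localized_name": "Lina"}
]}}
"""

def load_hero_names(raw_json, placeholder="<unused>"):
    """Return a list where names[hero_id] is the hero's display name.

    Ids that do not appear in the file (such as 24) keep the placeholder,
    so the list can be indexed directly by hero id.
    """
    heroes = json.loads(raw_json)["result"]["heroes"]
    max_id = max(h["id"] for h in heroes)
    names = [placeholder] * (max_id + 1)
    for h in heroes:
        names[h["id"]] = h["localized_name"]
    return names

hero_names = load_hero_names(SAMPLE_JSON)
```

With the real file you would pass the file contents instead of `SAMPLE_JSON`.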

Now, let’s load the data we saved to disk and write the method that creates batches from it. For now, to keep things simple, we’re not distinguishing between games won and games lost. Later on, we’ll introduce weighted inputs for training.
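Here is a minimal sketch of what batch creation could look like. It assumes each match is stored as a pair of hero id lists (radiant, dire), which may not match the format saved in Part 1; `make_samples` and `generate_batch` are hypothetical names. Every hero in a match becomes a prediction target once, with the 4 teammates as ally context and the 5 opponents as enemy context.

```python
import random

def make_samples(matches):
    """Turn each match into 10 (ally_context, enemy_context, target) samples."""
    samples = []
    for radiant, dire in matches:
        for team, enemies in ((radiant, dire), (dire, radiant)):
            for i, target in enumerate(team):
                allies = team[:i] + team[i + 1:]  # the 4 teammates of the target
                samples.append((allies, list(enemies), target))
    return samples

def generate_batch(samples, batch_size, rng=random):
    """Draw a random batch and split it into three parallel lists."""
    batch = rng.sample(samples, batch_size)
    allies = [s[0] for s in batch]
    enemies = [s[1] for s in batch]
    targets = [s[2] for s in batch]
    return allies, enemies, targets

# Toy data: one match with made-up hero ids.
matches = [([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])]
samples = make_samples(matches)
allies, enemies, targets = generate_batch(samples, batch_size=4, rng=random.Random(0))
```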

Now, before we go any further, let’s reason about hero embeddings and how to train them. In NLP, word2vec’s premise is that similar words occur in similar contexts. Correspondingly, we are going to opt for the idea that similar heroes are picked with similar team compositions and against similar enemies. Below is a diagram we’re going to base our model on. You’ll notice it’s similar to the Continuous Bag of Words (CBOW) model for word2vec, except that we’re using 2 different contexts (ally and enemy) and concatenating them into a bigger “Game Context”.

Before we build the model, let’s start by creating a TensorFlow graph and some handy variables. Every Nth step, the nearest heroes are calculated for a sample set of heroes that I hand-picked to give as much insight as possible.
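To give an idea of the nearest-hero check, here is a small sketch in plain Python. I’m assuming cosine similarity as the distance measure (the usual choice for word2vec-style embeddings); `nearest_heroes` and the toy 2-dimensional embeddings are made up for illustration.

```python
import math

def nearest_heroes(embeddings, hero_id, top_k=3):
    """Return the ids of the top_k heroes most similar to hero_id (cosine similarity)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    query = embeddings[hero_id]
    scored = [
        (cosine(query, vec), other)
        for other, vec in enumerate(embeddings)
        if other != hero_id  # a hero is trivially nearest to itself
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]

# Toy table: hero 1 points almost the same way as hero 0, hero 3 the opposite way.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
closest = nearest_heroes(toy_embeddings, 0, top_k=2)
```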

Building the model described above:
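The forward pass of such a model can be sketched in plain Python without any framework. A few assumptions on my side: allies and enemies share one embedding table, each context is the average of its hero vectors (as in CBOW), and the final layer is a single linear layer followed by softmax. The sizes and the names `predict`, `theta` and `bias` are illustrative.

```python
import math
import random

def mean_vector(vectors):
    """Element-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(E, theta, bias, ally_ids, enemy_ids):
    """Return a probability for every hero given ally and enemy contexts."""
    ally_ctx = mean_vector([E[i] for i in ally_ids])
    enemy_ctx = mean_vector([E[i] for i in enemy_ids])
    game_ctx = ally_ctx + enemy_ctx  # list concatenation: length 2 * dim
    logits = [
        sum(w * x for w, x in zip(row, game_ctx)) + b
        for row, b in zip(theta, bias)
    ]
    return softmax(logits)

rng = random.Random(0)
N, DIM = 12, 4  # toy sizes; the real model has one row per hero id
E = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N)]
theta = [[rng.gauss(0, 0.1) for _ in range(2 * DIM)] for _ in range(N)]
bias = [0.0] * N

probs = predict(E, theta, bias, ally_ids=[1, 2, 3, 4], enemy_ids=[6, 7, 8, 9, 10])
loss = -math.log(probs[5])  # cross entropy when hero 5 was the actual pick
```

Training then amounts to nudging `E`, `theta` and `bias` so that this loss goes down, which is the part a framework automates.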

Finally, we can train our model and save our trained embeddings to disk:

(For all results on this page, I’ve trained models until the loss completely plateaued)

Results of the first attempt:

Initial loss: ~7.8
Final loss: 2.54

What’s wrong with this model?

As you probably noticed, the inputs we’re training the embeddings on include every hero in a match, so heroes on the losing team contribute to the embeddings as well. This makes our loss a bit ambiguous. Ideally, later on, we’d like to use this model to output a hero that maximizes his team’s chances of winning. Weighting losing heroes as much as winning ones is unlikely to help.

We can assume that in pro games all heroes on the winning side are optimal picks. So one way to deal with this problem emerges instantly: train only on heroes who are on the winning team. As a bonus, the loss function takes care of itself, since it’s now defined as the prediction error for winning heroes.
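As a sketch, restricting the training samples to the winning side could look like this. I’m assuming each match is a (radiant, dire, radiant_win) triple, which may differ from the format saved in Part 1; `winning_samples` is a hypothetical helper. Losing heroes still show up as enemy context, just never as prediction targets.

```python
def winning_samples(matches):
    """Build (ally_context, enemy_context, target) samples for winners only."""
    samples = []
    for radiant, dire, radiant_win in matches:
        winners, losers = (radiant, dire) if radiant_win else (dire, radiant)
        for i, target in enumerate(winners):
            allies = winners[:i] + winners[i + 1:]  # the 4 winning teammates
            samples.append((allies, list(losers), target))
    return samples

# Toy data: one match with made-up hero ids, won by dire.
toy_matches = [([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], False)]
winner_only = winning_samples(toy_matches)
```

Each match now yields 5 samples instead of 10, which is why the dataset is halved.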

Results of training only on the winning team:

Initial loss: ~7.9
Final loss: 2.38

This is an improvement over the previous method, but maybe not as much as it may seem: the loss is now defined on a dataset half the size of the one we started with, so it’s easier to overfit. Anyway, we’re going to use this model to train initial embeddings for the steps to follow.

Training on smaller datasets

If you’re trying to train this model on a smaller dataset, instead of just ignoring heroes who lost the match, you can use them to tell the model which hero not to pick for a given line-up. I haven’t found a drastic improvement using this method (in terms of prediction results, as we’ll discuss in the upcoming part), but I believe that’s because I have a big enough dataset and we’re re-training the embeddings in the next step anyway. You might need it depending on the dataset you’re training the model on, so here’s a quick intro. (Sorry in advance, Medium doesn’t support LaTeX for some reason!)

Now, remember that we’re using cross entropy loss for this multiclass classification problem (which is basically what the last stage is). Let’s look at the equation for cross entropy. (Note: Theta is the weights of the final layer, x is the context fed into the final layer, E is the embedding table and N is the number of heroes)
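For reference, a standard way to write softmax cross entropy in this notation is sketched below. I’m assuming a plain linear final layer without a bias term, with Theta_i denoting the row of weights for hero i, y the one-hot label and y’ the predicted distribution; x itself is built from rows of E, which is how the loss depends on the embeddings.

```latex
y'_i = P(y = i \mid x)
     = \frac{\exp(\Theta_i^{\top} x)}{\sum_{j=1}^{N} \exp(\Theta_j^{\top} x)},
\qquad
L(y, y') = -\sum_{i=1}^{N} y_i \log y'_i
```

Since y is one-hot, the sum collapses to minus the log probability of the hero that was actually picked.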

We strive to minimize loss, so ideally we want such parameters:
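In symbols, and assuming the loss is averaged over all training samples, this is simply the pair of parameters that minimizes it (the starred names are my own notation):

```latex
(\Theta^{*}, E^{*}) = \operatorname*{arg\,min}_{\Theta,\, E} \; L(y, y')
```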

As L(y, y’) is a function of both the embedding lookup table and the final layer weights, in each iteration TensorFlow calculates gradients with respect to these parameters and updates them slightly in order to get closer and closer to the ideal predictions.

But for heroes who lost their games, we want to maximize the probability that the predicted hero is not the given hero. To do this, we can write P(y != h|x) as a function of the given variables and derive gradients which maximize that probability.

Remember, minimizing cross entropy loss can be thought of as maximizing log likelihood of observed data according to your model.
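One way to make this concrete: since the softmax outputs sum to 1, P(y != h|x) = 1 - P(y = h|x), so a natural loss for a losing hero h is -log(1 - P(y = h|x)). The sketch below compares it with the usual cross entropy on made-up logits; `winner_loss` and `loser_loss` are hypothetical names, and this is my reading of the idea rather than the exact formula from the original figures.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def winner_loss(logits, hero):
    """Usual cross entropy: -log P(y = hero | x)."""
    return -math.log(softmax(logits)[hero])

def loser_loss(logits, hero):
    """Negative log likelihood of NOT picking hero: -log(1 - P(y = hero | x))."""
    return -math.log(1.0 - softmax(logits)[hero])

toy_logits = [2.0, 0.5, -1.0]  # the model currently favours hero 0

penalty_favoured = loser_loss(toy_logits, 0)  # large: model likes a losing pick
penalty_unlikely = loser_loss(toy_logits, 2)  # small: model already avoids it
```

The penalty is large only when the model is confident about a hero who lost, and it approaches zero as that hero’s probability goes to zero.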

From here you could, if you’re so inclined, create a TensorFlow operation and compute the gradients of this loss function with respect to Theta and E explicitly.

All that being said, another way, which is much simpler to implement, is to weight the gradients like this:
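A possible reading of this idea, sketched on the logits only: the cross entropy gradient with respect to the logits is (probabilities - one-hot target), and we scale it by +1 for winning heroes and by a small negative factor for losing ones. The `loser_weight` value of 0.1 and the function name are assumptions for illustration, not values from my experiments.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_logit_gradient(logits, hero, won, loser_weight=0.1):
    """Cross entropy gradient w.r.t. the logits, scaled by a per-sample weight.

    Winners keep the normal gradient (weight 1). Losers get a small negative
    weight, which flips the direction and pushes the hero's logit down.
    """
    probs = softmax(logits)
    weight = 1.0 if won else -loser_weight
    return [
        weight * (p - (1.0 if i == hero else 0.0))
        for i, p in enumerate(probs)
    ]

toy_logits = [2.0, 0.5, -1.0]
grad_win = weighted_logit_gradient(toy_logits, hero=0, won=True)
grad_lose = weighted_logit_gradient(toy_logits, hero=0, won=False)
```

With gradient descent, a negative entry raises that logit and a positive one lowers it, so `grad_win` promotes hero 0 while `grad_lose` gently demotes it. Keeping the loser weight small matters, because a negatively weighted cross entropy is unbounded and can otherwise dominate training.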

At this point, whether or not you choose to implement the penalty for incorrect predictions, you should have a good set of hero embeddings, and we’re ready to use them in the following part.

Like what you read? Give Talha Saruhan a round of applause.
