Attention in Recommendation Systems

Pre-Training, Part 3: Self-Attention

dan lee
AI³ | Theory, Practice, Business
6 min read · May 12, 2020


Image source: http://jalammar.github.io/illustrated-transformer/

In my last article on attention in pre-training, I mentioned that I used the attention mechanism in an ongoing project at game services company Yodo1.

However, the form of attention I explained in Pre-Training Part 2 is more basic than the one I ended up using for this project.

The form I used is called self-attention.

Today, I will explain why I used this form of attention and give examples from my project to show you how self-attention helps recommendation systems express the relationship between game features — such as items for purchase and attainable levels — in a more reasonable and effective way.

First: What are recommendation systems, and how do we use them in games?

What Are Recommendation Systems?

Recommendation systems use machine learning algorithms to predict whether users will like certain items, and unless you’ve been living in the Stone Age, you’ve definitely made use of them.

Perhaps you’ve experienced the following: You open Google Chrome, but you’re not sure what you’re looking for. In the welcome tab, you see “articles for you” and a topic of interest catches your eye. You open the link and read the story.

How did Google know you would want to receive this bit of news? Thanks to its recommendation system.

Recommendation Systems in Mobile Games

Turning to the gaming industry, recommendation systems come in handy when players want to, for example, buy new equipment but don’t know exactly what they want yet.

And recommendations have a definite impact: according to our data, more than 40% of players directly buy the equipment that has been recommended to them.

How do recommendation systems make games better?

By giving new players suggestions on where to start, while saving established players time by making the items they’re interested in easier to find, recommendation systems make the gaming experience easier to enjoy.

At Yodo1, our goal is to design the recommendation system in such a way that as many players as possible can benefit from it and thereby enjoy an enhanced gaming experience.

For the game I’ve been working on, we tried to create a mixed recommendation system: one that uses multiple methods and models, from multiple perspectives, to make its recommendations.

Since I’m not at liberty to disclose this game’s title, let’s refer to it as “Robot World”.

In Robot World, players build a team of robots and equip them with tools and special skills to attack other players’ headquarters. Players need to know the robots pretty well to build a unified team.

We on the developer team can see the types of robots and skills players generally prefer to use, and we understand how many choices they face when trying to optimize their team.

With so many decisions to make, players need some form of guidance or complementary opinions to improve their team’s performance and increase their chances of winning.

Problem Solving: Making a Computer Understand Complex Relationships

To effectively guide our players in their decision making, we need to understand and quantify the delicate relationship between different characters’ skills and the equipment available to them.

To win the game, a player needs to find new characters and skills that best fit the team they’ve built.

Thus, the problem we developers must solve is how to express the relationship between various game resources in a way a computer can understand — so that we can use it in our recommendation system.

Of course, we can use records of characters and bundles that players have obtained in the past to speculate on this relationship. But which machine-learning model is most suited to the job?

Why We Need Self-Attention

To an extent, game character acquisition and bundle purchases are sequential, and an LSTM can model the order in which characters and bundles are acquired over time.

We could use the LSTM’s sequence modeling to capture the relationship between various game resources, but it only gives us a temporal perspective and is not direct enough.

Then there’s the CNN, which extracts local features and can capture relationships between neighboring features, but its local receptive field makes it not flexible enough to relate features that sit far apart.

Combined features are an element of the recommendation system we cannot ignore. This refers to the mix of game characters, skills, and/or equipment and reflects the relationship between various game resources.

If we can rightly reflect the combination of features and relationships between them in our recommendation system, there will be an obvious improvement in the system’s performance.

What we need to achieve this is a solution that is direct and flexible.

And that’s exactly where self-attention joins the party, because it meets my requirements very well. Plus, it doesn’t suffer from vanishing gradients, isn’t restricted to adjacent elements, and its computation can be parallelized, which speeds up training.

So that settles it: we’re going to solve the problem with self-attention.

A Step-by-Step Guide to Using Self-Attention

Here is how the main process of self-attention works:

(Bear in mind that in our example, features refer to game characters, skills, and equipment bundles that our players can buy.)

Step 1: Get feature embedding. Obtain the embedding vector of each feature in the embedding layer.
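Step 1 amounts to a lookup in a trainable table. A minimal sketch, assuming each game feature (character, skill, or bundle) is assigned an integer ID; the table size and embedding dimension here are invented for illustration.

```python
import numpy as np

n_features = 10   # e.g. number of distinct characters, skills, and bundles
embed_dim = 4     # size of each embedding vector (made-up value)

# In a real model this table is a trainable layer; here it is random.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(n_features, embed_dim))

robot1_id = 0
robot1_embedding = embedding_table[robot1_id]   # shape: (4,)
print(robot1_embedding.shape)
```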

Step 2: Initialize the Q, K, V matrices. The Q (query), K (key), and V (value) matrices are trainable weights in the neural network. We use them to transform each feature embedding into three kinds of vectors for calculating attention weights. We can initialize the three matrices randomly; training iterations will then optimize them.

Step 3: Get q, k, v vectors. Let’s take one feature, say the character “Robot1”, as an example. We got its feature embedding, the “Robot1 embedding vector”, in step 1. Now, by multiplying the “Robot1 embedding vector” with the Q, K, V matrices from step 2, we transform it into three kinds of vectors: the “Robot1 query vector” (q1), the “Robot1 key vector” (k1), and the “Robot1 value vector” (v1). Apply this step to all the other features so you get q2, k2, v2, q3, k3, v3, … qn, kn, vn.
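Step 3 can be sketched for a single feature as below. This is a minimal NumPy illustration with made-up dimensions; the random matrices stand in for the trained Q, K, V weights from step 2.

```python
import numpy as np

# Made-up illustrative sizes: embedding dimension and q/k/v dimension.
d, d_qkv = 8, 8
rng = np.random.default_rng(1)

robot1_embedding = rng.normal(size=(d,))   # from step 1
Q = rng.normal(size=(d, d_qkv))            # trainable; random at initialization (step 2)
K = rng.normal(size=(d, d_qkv))
V = rng.normal(size=(d, d_qkv))

q1 = robot1_embedding @ Q   # "Robot1 query vector"
k1 = robot1_embedding @ K   # "Robot1 key vector"
v1 = robot1_embedding @ V   # "Robot1 value vector"
print(q1.shape, k1.shape, v1.shape)
```

Repeating the three multiplications for every other feature embedding yields q2, k2, v2 through qn, kn, vn.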

Step 4: Calculate attention weights. For “Robot1”, its attention weights are obtained by calculating the dot product (w1) of q1 and k1, the dot product (w2) of q1 and k2, the dot product (w3) of q1 and k3, and so on until q1 and kn. Now we have the weights of “Robot1” (w1, w2, w3, … wn). These weights reflect the relationship between “Robot1” and every other feature in the game.

Step 5: Normalize attention weights. The dot products w1, w2, w3, … wn are not bounded between 0 and 1, so we apply softmax to turn them into values between 0 and 1 that sum to 1.
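The normalization in step 5 can be shown in a few lines. The raw weight values below are invented for illustration; the point is that softmax maps arbitrary dot products into the 0–1 range.

```python
import numpy as np

def softmax(w):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(w - np.max(w))
    return e / e.sum()

# Raw dot products can be large, small, or negative...
raw_weights = np.array([4.2, -1.3, 0.7, 9.8])
normalized = softmax(raw_weights)
print(normalized)        # ...each normalized value lies in (0, 1)
print(normalized.sum())  # ...and together they sum to 1 (up to floating point)
```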

Step 6: Multiply attention weights with v. Do you still remember v1, v2, v3, … vn in step 3? We can multiply w1 with v1, w2 with v2, w3 with v3, … wn with vn. Now we get a list of new vectors weighted_v1, weighted_v2, weighted_v3, … weighted_vn.

Step 7: Sum up the weighted vectors. Sum up weighted_v1, weighted_v2, weighted_v3, … weighted_vn and we get one vector, z1, which represents “Robot1” with attention information added. By repeating steps 3 to 7, we get z2, z3, … zn. Now we have successfully transformed the original feature embeddings from step 1 into vectors with attention information built in.
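The whole process above (steps 2 through 7) can be sketched as one small NumPy function. The sizes and random inputs are made up for illustration, and for simplicity this sketch omits the scaling by the square root of the key dimension that the original Transformer applies before softmax.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Turn feature embeddings X (n, d) into attention-weighted vectors (n, d_v)."""
    Q = X @ Wq                # step 3: one query vector per feature
    K = X @ Wk                #         one key vector per feature
    V = X @ Wv                #         one value vector per feature
    W = Q @ K.T               # step 4: dot-product attention weights
    # Step 5: softmax over each row so weights lie in (0, 1) and sum to 1.
    W = np.exp(W - W.max(axis=1, keepdims=True))
    W = W / W.sum(axis=1, keepdims=True)
    return W @ V              # steps 6-7: weight the value vectors and sum them

rng = np.random.default_rng(42)
n, d = 5, 8                              # 5 features, embedding dim 8 (made-up)
X = rng.normal(size=(n, d))              # step 1: feature embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # step 2: random init
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)   # (5, 8): one attention-enriched vector per feature
```

Each row of `Z` corresponds to one of z1 … zn: the original feature embedding re-expressed with information about its relationship to every other feature.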

And voila: self-attention in a nutshell.

In my project, the vectors discussed above can be seen as a new representation of each game feature — one that includes information on how related it is to all the other features.

Therefore, we have solved the problem of how to express the relationship between game resources in a way the computer can understand.

In Conclusion

Self-attention is the basis for understanding pre-training technologies such as BERT and XLNet. With a good understanding, you can use it in more flexible ways, too, such as the recommendation system we created today. For any practitioner of AI, it is a key concept we can’t afford to ignore.

Next up, we’ll dive deeper into our discussion of pre-training technologies with another example from my work at Yodo1. Don’t miss it!


NLP Engineer, Google Developer Expert, AI Specialist at Yodo1