Knowledge Graph Attention Network for Recommendation

Shivansh Sethi
11 min read · Oct 28, 2022


By Arusarka Bose, Kushal Natani, and Shivansh Sethi

Introduction

With the rapid development of the internet and the exponential growth of data, it is difficult for users to pick what interests them. In the current world of Artificial Intelligence, recommendation systems find applications in almost every domain ranging from news portals, social media platforms, e-commerce, music, and search engines. Thus, identifying user preferences and suggesting the best products for them is currently an important area of research.

However, most current recommendation systems don’t go beyond modeling user-item interactions. Factorization and collaborative filtering methods assume that each interaction is an independent instance with side information encoded, and therefore overlook the relations among instances or items. Thus, to provide more diverse, accurate, and explainable recommendations, we need to model the side information together with the interactions and exploit the collaborative signal hidden in the collective behavior of many users.

An example of the collaborative knowledge graph. u1 is the target user to provide recommendations for. The yellow circle and grey circle denote the important users and items discovered by high-order relations but are overlooked by traditional methods. Here r1, r2, r3, and r4 are different relations.

Let’s understand this with an example. A singer in the music industry could also be the lyricist or the producer of another song. Treating each instance as an independent interaction means losing information that could provide meaningful recommendations. Knowledge graphs offer a solution: they link items with their attributes and capture high-order relationships between different items, enabling better recommendations.

In this blog, we present the Knowledge Graph Attention Network (KGAT) for recommendation, which models these high-order connectivities. It traverses a node’s neighbors’ embeddings and uses them to recompute the node’s own embedding, with an attention module deciding how important each neighbor is.

What problems with other recommendation systems led to the Knowledge Graph Attention Network?

As mentioned above, existing recommendation systems convert the available information into a generic feature vector, which is then fed into a supervised learning model such as Wide&Deep or Neural Factorization Machines (NFMs) to predict the recommendation score. The problem is that these models treat each interaction as an independent data instance, which it is not. What we really need is to extract attribute-based collaborative signals from the collective behavior of users.

Example from the previous figure highlighting the higher-order relations ignored by FMs or NFMs

To fully exploit these higher-order relations, a hybrid structure combining the knowledge graph and the user-item graph is defined, called the Collaborative Knowledge Graph (CKG). Exploiting it, however, raises some issues:

  1. The number of nodes having high-order relationships to the target node grows as the order size grows, creating a computational burden.
  2. Because high-order relations contribute unequally to a prediction, the model must carefully weigh (or choose) them.

Current methods that operate on the CKG are either path-based or regularization-based. Path-based methods extract paths carrying the high-order information and feed them into a prediction model. Regularization-based approaches instead add loss terms that capture the KG structure, regularizing the learning of the recommender model.

Both approaches have their own disadvantages. The performance of path-based methods depends heavily on the first stage of path selection; moreover, defining effective meta-paths requires domain expertise, which can be time-consuming. Regularization-based methods, lacking explicit modeling, guarantee neither that long-range connectivities are captured nor that the results of high-order modeling are interpretable.

What are the changes done to handle these issues?

Given the limits of existing methods, we feel it is vital to provide a model capable of using high-order information in KG in an efficient, explicit, and end-to-end manner. KGAT handles the issues discussed above via the following changes:

  1. Recursive Embedding Propagation: This updates a node’s embedding based on the embeddings of its neighbors and performs such propagation recursively to capture high-order connectivities in linear time. This removes the need to manually define meta-paths.
  2. Attention-based Aggregation: A neural attention mechanism learns the weight of each neighbor during a propagation, so that the attention weights of cascaded propagations reveal the importance of a high-order connectivity. Because high-order relations are incorporated into the predictive model explicitly and end-to-end, all related parameters are tailored to optimize the recommendation objective.

In this blog, we do the following:

  1. We walk through the existing recommendation methods and emphasize the need to explicitly model high-order relations in the collaborative knowledge graph in order to give better recommendations using item side information.
  2. We explain the Knowledge Graph Attention Network, which enables high-order relation modeling explicitly and end-to-end using the graph neural network architecture.
  3. We examine, via experiments on public benchmarks, the efficacy and interpretability of KGAT in recognizing the importance of high-order relations.

Collaborative Knowledge Graph

The aim is to learn a prediction function that estimates the probability that user u would adopt item i. To define it, we start with the following structures:

  1. User-Item Bipartite Graph: Here we represent interaction data as a user-item bipartite graph G.
  2. Knowledge Graph: In addition to the interactions, we have side information for items (e.g. item attributes and external knowledge).

Interlinks of CKG, especially high-order relations, bring benefits to recommendations and explanations.

This defines the collaborative knowledge graph. It encodes user behaviors and item knowledge as a unified relational graph. Based on this graph, we aim to predict the interaction probability. Let’s move to the next part to understand how we do so.

Methodology

There are three main parts to this system:

  1. Embedding layer: preserves the CKG’s structure and parameterizes each node as a vector.
  2. Attentive embedding propagation layer: updates a node’s representation by recursively propagating embeddings from its neighbors, using a knowledge-aware attention mechanism to learn the weight of each neighbor during a propagation.
  3. Prediction layer: combines the representations of a user and an item from all propagation layers and outputs the predicted score.

Illustration of the proposed KGAT model. The left subfigure shows the model framework of KGAT, and the right subfigure presents the attentive embedding propagation layer of KGAT.

1. Embedding Layer

Knowledge graph embedding is an effective way to parameterize entities and relations as vector representations while preserving the graph’s structure. Here, we apply TransR, a widely used technique, to the CKG. Specifically, it learns an embedding for each entity and relation by optimizing the translation principle eʳₕ + eᵣ ≈ eʳₜ whenever a triplet (h,r,t) exists in the graph. Herein, eₕ, eₜ ∈ Rᵈ and eᵣ ∈ Rᵏ are the embeddings of h, t, and r, respectively, and eʳₕ, eʳₜ are the projected representations of eₕ and eₜ in relation r’s space. Hence, for a given triplet (h,r,t), its plausibility score (or energy score) is formulated as follows:
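With Wᵣ denoting the matrix that projects entities into relation r’s space (following the KGAT paper):

```latex
g(h, r, t) = \left\lVert \mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r - \mathbf{W}_r \mathbf{e}_t \right\rVert_2^2
```

A lower score indicates that the triplet is more likely to be true.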

The training of TransR considers the relative order between valid triplets and broken ones and encourages their discrimination through a pairwise ranking loss:
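With σ(·) the sigmoid function and (h, r, t′) a broken triplet built by replacing the tail with a randomly sampled entity, the loss in the paper reads:

```latex
\mathcal{L}_{\mathrm{KG}} = \sum_{(h,r,t,t') \in \mathcal{T}} -\ln \sigma\big( g(h,r,t') - g(h,r,t) \big),
\qquad
\mathcal{T} = \{ (h,r,t,t') \mid (h,r,t) \in \mathcal{G},\ (h,r,t') \notin \mathcal{G} \}
```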

2. Attentive Embedding Propagation Layers

Here, we first describe a single layer, which consists of three components: information propagation, knowledge-aware attention, and information aggregation. We then discuss how to generalize it to multiple layers.

Information Propagation: One entity can be involved in multiple triplets, serving as the bridge connecting two triplets and propagating information. Considering an entity h, we use Nₕ = {(h,r,t)|(h,r,t) ∈ G} to denote the set of triplets where h is the head entity, termed ego-network. To characterize the first-order connectivity structure of entity h, we compute the linear combination of h’s ego-network:
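In the paper’s notation, this ego-network representation is:

```latex
\mathbf{e}_{\mathcal{N}_h} = \sum_{(h,r,t) \in \mathcal{N}_h} \pi(h,r,t)\, \mathbf{e}_t
```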

where π(h,r,t) controls the decay factor for each propagation along edge (h,r,t), indicating how much information is propagated from t to h conditioned on relation r.

Knowledge-Aware Attention: We implement π(h,r,t) via a relational attention mechanism, which is formulated as follows:
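Concretely (per the paper):

```latex
\pi(h,r,t) = \big( \mathbf{W}_r \mathbf{e}_t \big)^{\top} \tanh\big( \mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r \big)
```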

where tanh is chosen as the nonlinear activation function. This makes the attention score depend on the distance between the entities in relation r’s space, so that closer entities propagate more information. For simplicity, only the inner product is used on these representations; further exploration of the attention module is left as future work.

The softmax function is then used to normalize the coefficients for all triplets related to h:
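Written out:

```latex
\pi(h,r,t) = \frac{\exp\big(\pi(h,r,t)\big)}{\sum_{(h,r',t') \in \mathcal{N}_h} \exp\big(\pi(h,r',t')\big)}
```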

The final attention scores suggest which neighbor nodes should be given more focus in order to capture the collaborative signal. When we visualize the attention flow during forward propagation, the parts of the data it concentrates on can be viewed as explanations for a recommendation.

Information Aggregation: The final phase is to aggregate the entity representation eₕ and its ego-network representations as the new representation of entity h. We implement three types of aggregators:

  1. GCN Aggregator sums two representations up and applies a nonlinear transformation as follows:
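Per the paper, with W a trainable weight matrix and LeakyReLU the activation:

```latex
f_{\text{GCN}} = \text{LeakyReLU}\big( \mathbf{W} ( \mathbf{e}_h + \mathbf{e}_{\mathcal{N}_h} ) \big)
```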

2. GraphSage Aggregator concatenates two representations, followed by a nonlinear transformation:
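In the same notation, with ∥ denoting concatenation:

```latex
f_{\text{GraphSage}} = \text{LeakyReLU}\big( \mathbf{W} ( \mathbf{e}_h \,\Vert\, \mathbf{e}_{\mathcal{N}_h} ) \big)
```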

3. Bi-Interaction Aggregator is carefully designed by us to consider two kinds of feature interactions between eₕ and its ego-network:
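Namely an additive and an element-wise (⊙) interaction, each with its own weight matrix (per the paper):

```latex
f_{\text{Bi-Interaction}} =
\text{LeakyReLU}\big( \mathbf{W}_1 ( \mathbf{e}_h + \mathbf{e}_{\mathcal{N}_h} ) \big)
+ \text{LeakyReLU}\big( \mathbf{W}_2 ( \mathbf{e}_h \odot \mathbf{e}_{\mathcal{N}_h} ) \big)
```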

To sum up, the advantage of the embedding propagation layer is that it explicitly uses the first-order connection information to connect the representations of users, items, and knowledge entities.
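To make a single propagation step concrete, below is a minimal PyTorch sketch of knowledge-aware attention followed by Bi-Interaction aggregation. This is our own illustrative re-implementation, not the authors’ code: the class name, tensor shapes, and the assumption that entity and relation embeddings share one dimension are simplifications.

```python
# Minimal sketch of one attentive embedding propagation step (Bi-Interaction).
# Assumption: entity and relation embeddings share the same dimension `dim`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentivePropagationLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # W1 and W2 correspond to the two transforms of the Bi-Interaction aggregator.
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)

    def forward(self, e_h, e_r, e_t, w_r):
        """
        e_h, e_t: (n_triplets, dim) head/tail embeddings of h's ego-network
        e_r:      (n_triplets, dim) relation embeddings
        w_r:      (n_triplets, dim, dim) per-relation projection matrices (TransR-style)
        All triplets here are assumed to share the same head entity h.
        """
        # Project head and tail into the relation space: W_r e_h, W_r e_t.
        h_proj = torch.bmm(w_r, e_h.unsqueeze(-1)).squeeze(-1)
        t_proj = torch.bmm(w_r, e_t.unsqueeze(-1)).squeeze(-1)

        # Knowledge-aware attention: pi(h,r,t) = (W_r e_t)^T tanh(W_r e_h + e_r),
        # normalized with softmax over h's ego-network.
        scores = (t_proj * torch.tanh(h_proj + e_r)).sum(dim=-1)
        pi = F.softmax(scores, dim=0)

        # Information propagation: attention-weighted sum of the tail embeddings.
        e_neigh = (pi.unsqueeze(-1) * e_t).sum(dim=0)

        # Bi-Interaction aggregation of e_h and its ego-network representation.
        e_h_single = e_h[0]  # all rows share the same head entity
        return F.leaky_relu(self.w1(e_h_single + e_neigh)) + \
               F.leaky_relu(self.w2(e_h_single * e_neigh))


if __name__ == "__main__":
    dim, n = 8, 5  # toy sizes
    layer = AttentivePropagationLayer(dim)
    e_h = torch.randn(1, dim).expand(n, dim)
    e_r = torch.randn(n, dim)
    e_t = torch.randn(n, dim)
    w_r = torch.randn(n, dim, dim)
    print(layer(e_h, e_r, e_t, w_r).shape)  # torch.Size([8])
```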

High-order Propagation: To explore the high-order connectivity information and gather the information propagated from higher-hop neighbors, we can stack more propagation layers. Formally, the representation of an entity at the l-th step is formulated recursively as follows:
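In the paper’s notation, with f the aggregator chosen above:

```latex
\mathbf{e}_h^{(l)} = f\big( \mathbf{e}_h^{(l-1)}, \mathbf{e}_{\mathcal{N}_h}^{(l-1)} \big)
```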

wherein the information propagated within the l-ego network for the entity h is defined as follows:
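That is, the attention-weighted sum over h’s ego-network using the previous layer’s representations:

```latex
\mathbf{e}_{\mathcal{N}_h}^{(l-1)} = \sum_{(h,r,t) \in \mathcal{N}_h} \pi(h,r,t)\, \mathbf{e}_t^{(l-1)}
```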

3. Model Prediction

After performing L layers, we obtain multiple representations for the user node u, and analogously for the item node i. Since the output of the l-th layer is the message aggregation of the tree structure of depth l rooted at u (or i), the outputs of different layers emphasize connectivity information of different orders. We hence adopt the layer-aggregation mechanism and concatenate the representations from each step into a single vector, as follows:
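Per the paper:

```latex
\mathbf{e}_u^{*} = \mathbf{e}_u^{(0)} \,\Vert\, \cdots \,\Vert\, \mathbf{e}_u^{(L)},
\qquad
\mathbf{e}_i^{*} = \mathbf{e}_i^{(0)} \,\Vert\, \cdots \,\Vert\, \mathbf{e}_i^{(L)}
```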

where ∥ is the concatenation operation. By doing this, we not only enrich the initial embeddings through the embedding propagation operations but also control the strength of propagation by adjusting L.
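The matching score for a user-item pair is then predicted, as in the paper, by the inner product of the aggregated representations:

```latex
\hat{y}(u, i) = {\mathbf{e}_u^{*}}^{\top} \mathbf{e}_i^{*}
```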

4. Optimization

We optimize the recommendation model with the BPR loss. It assumes that observed interactions, which reveal more about a user’s preferences, should be assigned higher prediction values than unobserved ones:
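With σ the sigmoid function (per the paper):

```latex
\mathcal{L}_{\mathrm{CF}} = \sum_{(u,i,j) \in O} -\ln \sigma\big( \hat{y}(u,i) - \hat{y}(u,j) \big)
```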

where O = {(u,i,j) | (u,i) ∈ R⁺, (u,j) ∈ R⁻} denotes the training set, with R⁺ the observed interactions and R⁻ the sampled unobserved ones.

Finally, we have the objective function as follows:
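Combining the TransR loss and the BPR loss (as in the paper):

```latex
\mathcal{L}_{\mathrm{KGAT}} = \mathcal{L}_{\mathrm{KG}} + \mathcal{L}_{\mathrm{CF}} + \lambda \lVert \Theta \rVert_2^2
```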

where Θ is the model parameter set, and E is the embedding table for all entities and relations. L₂ regularization parameterized by λ on Θ is conducted to prevent overfitting.

Experiments

The suggested methodology is tested using three real-world datasets, focusing in particular on the embedding propagation layer, with an aim to answer the following questions:

  1. How does KGAT perform compared with state-of-the-art knowledge-aware recommendation methods?
  2. How do different components (i.e., knowledge graph embedding, attention mechanism, and aggregator selection) affect KGAT?
  3. Can KGAT provide reasonable explanations about user preferences toward items?

Dataset Description

The effectiveness of KGAT is evaluated on the following benchmark datasets:

  1. Amazon-book: Amazon book is a widely used dataset containing data of over 70k users and 24k books.
  2. Last-FM: A music listening dataset collected from Last.fm online music systems, containing over 23k users and 48k tracks with over 3 million interactions between users and tracks.
  3. Yelp Dataset: The yelp dataset contains data about reviews of restaurants by various users. It contains data from over 45k users and 45k restaurants, with over 1 million interactions between them.

In each dataset, only users and items with at least ten interactions are retained (the 10-core setting); the rest are filtered out. This is done to ensure the quality of the dataset.

Experimental Settings

To evaluate the effectiveness of top-K recommendation and preference ranking, two evaluation protocols are calculated: recall@K and NDCG@K. By default, K is set to 20. We report the average metrics for all users in the test set.

Performance Comparison

To answer the question about the performance of KGAT compared to other state-of-the-art models, we compare the performance of KGAT with the following models: FM, NFM, path-based models (MCRec and RippleNet), and graph neural network-based (GC-MC). The results are illustrated in the graphs below:

Performance comparison over the sparsity distribution of user groups on different datasets. The background histograms indicate the density of each user group, while the lines demonstrate the performance w.r.t. ndcg@20.

We observe that KGAT consistently yields the best performance on all the datasets. In particular, KGAT improves over the strongest baselines w.r.t. recall@20 by 8.95%, 4.93%, and 7.18% on Amazon-book, Last-FM, and Yelp2018, respectively.

KGAT is able to explicitly explore the high-order connectivity by stacking multiple attentive embedding propagation layers and hence captures collaborative signals effectively. This verifies the significance of capturing collaborative signals to transfer knowledge. KGAT’s improvement over GC-MC also confirms the effectiveness of the attention mechanism.

Study of KGAT

We answer the second question “How do different components affect KGAT?” by exploring the influence of varying layer numbers and using different aggregators on the performance of KGAT. We get the following observations:

  • Effect of Model Depth: The performance can be significantly improved by deepening the KGAT, but only to a certain point: KGAT-2 and KGAT-3 significantly outperform KGAT-1, whilst KGAT-4 outperforms KGAT-3 only marginally.
  • Effect of Aggregators: To explore the impact of aggregators, we study KGAT-1 variants using different aggregators, namely GCN, GraphSage, and Bi-Interaction. We observe that KGAT-1 (GCN) is consistently superior to KGAT-1 (GraphSage), possibly because GraphSage forgoes the interaction between the entity representation and its ego-network representation. This illustrates how crucial feature interaction is to information aggregation and propagation.

Case Study

By taking advantage of the attention mechanism, we can use high-order connectivity reasoning to infer user preferences for the target item and provide reasonable explanations. We randomly select one user from Amazon-Book, together with one relevant item (from the test set, unseen in the training phase). Based on the attention scores, we derive the behavior-based and attribute-based high-order connectivity connecting the user-item pair, as shown in the figure below.

High-order connectivity of users from the Amazon-Book dataset

KGAT captures the behavior-based and attribute-based high-order connectivity, which plays a key role in inferring user preferences. Hence the retrieved paths can be viewed as evidence of why the item meets the user’s preference.

Conclusion

We investigated high-order connectivity with semantic relations in the CKG for knowledge-aware recommendation. To model the high-order connectivities in the CKG explicitly and end-to-end, a new framework, KGAT, was developed. At its core is the attentive embedding propagation layer, which adaptively propagates the embeddings from a node’s neighbors to update the node’s representation. Extensive experiments on three real-world datasets demonstrate the rationality and effectiveness of KGAT. This summarizes the usage of the Knowledge Graph Attention Network for better recommendations.

We hope you enjoyed reading this blog. Thank you.

The video link is also added here for your convenience.
