Mixture-Kernel Graph Attention Networks for Situation Recognition
This is a blog post accompanying our latest work on the use of Graph Neural Networks for Situation Recognition.
Situation recognition is the task of identifying the salient action in an image, the entities participating in it, and how they relate to each other. To understand the task better, consider the image shown below.
In our task, the situation in the image is described using a role-frame as shown below.
The role-frame describes the salient action (jumping) and the entities participating in it. Situation recognition can be seen as a supertask of tasks such as captioning and visual question answering. For example, given a role-frame, we can generate a caption by filling in a template.
Graph Neural Networks
Graph neural networks are a generalization of convolutional neural networks to graph-structured data.
Consider a graph G with a set of nodes A and a set of edges E. Convolution on a graph is achieved using two types of operations, ACCUMULATE and COMBINE.
ACCUMULATE: The accumulate function gathers information (messages) from all the neighbors of a given node.
COMBINE: The combine function updates the state of a node using its current state and the incoming message.
For a general graph neural network like the one proposed in , the above ACCUMULATE and COMBINE take the following form:
where x_a and m_a are the states and incoming messages for the nodes respectively and “gated update” refers to an LSTM combination. We will refer to W as the kernel matrix throughout the rest of the post.
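To make the two operations concrete, here is a minimal NumPy sketch of one round of message passing. It is an illustration under our own assumptions, not the exact formulation from the referenced work: the message is taken to be the sum of W applied to each neighbor's state, and `combine` uses a simplified scalar gate `z` as a stand-in for the gated update. The function names and the fully connected toy graph are hypothetical.

```python
import numpy as np

def accumulate(x, adj, W):
    """ACCUMULATE: m_a = sum of W @ x_a' over the neighbours a' of a."""
    # x: (num_nodes, dim) node states; adj[a, a'] = 1 if a' is a neighbour of a.
    return adj @ (x @ W.T)

def combine(x, m, z=0.5):
    """COMBINE: simplified scalar gate standing in for the gated update (assumption)."""
    return (1.0 - z) * x + z * np.tanh(m)

def gnn_step(x, adj, W, z=0.5):
    """One round of message passing over all nodes."""
    return combine(x, accumulate(x, adj, W), z)

rng = np.random.default_rng(0)
num_nodes, dim = 4, 8
x = rng.normal(size=(num_nodes, dim))
# Toy graph: fully connected, no self-loops (an assumption for this sketch).
adj = np.ones((num_nodes, num_nodes)) - np.eye(num_nodes)
W = rng.normal(scale=0.1, size=(dim, dim))  # the single kernel matrix

x_next = gnn_step(x, adj, W)
print(x_next.shape)  # (4, 8)
```

Note that the same kernel W is shared by every edge and every input, which is the property the rest of this post revisits.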
For a more in-depth understanding of graph neural networks refer to this post
Situation Recognition as a Graph Inference Problem
To model dependencies between the semantic roles, we transform the task of situation recognition into a graph-based inference problem. Given an instance from the dataset, we instantiate a graph G = (A, B), where A is the set of nodes and B is the set of edges. The nodes in the graph represent the roles associated with the image. The edges in the graph, directed or undirected, encode the dependencies between the roles. The features for the nodes are extracted from a VGG-16 network pretrained to predict the nouns in a given image. For example, given the image of the jumping horse (Figure 1), the network is trained to predict jockey, land, fence and outdoors.
Why not use a simple GNN for situation recognition?
In , the authors use a Gated GNN (GGNN) for the task of situation recognition. This approach, however, has two major shortcomings. First, since the GGNN has only a single kernel, the underlying assumption is that the mechanism of interaction between the roles in an image is the same regardless of context. Second, since the edges are equally weighted during training and inference, the model fails to account for variable interaction between the roles.
Mixture Kernel GNN to the rescue
To address the limitations stated above, we propose an extension and generalization of the GNN approach introduced in . The model has two key components: an attention module that decides the structure of the graph, and a kernel construction module that builds a kernel conditioned on the given image and verb. We will explain both of these modules later. First, let's take a look at the propagation mechanism. In our model, the ACCUMULATE function takes the form shown below:
where Wₖ are the basis kernels used for modeling the interactions between the nodes, cₖ are the weights associated with the basis kernels, and αₐₐ' is the weight of the edge between a and a'. The weights for the kernels are predicted by a VGG-16 network pretrained on the task of verb prediction given an image.
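As a rough illustration of this propagation step, the sketch below assumes the message has the form m_a = Σ_a' α_aa' Σ_k c_k W_k x_a', which is our reading of the symbols defined above rather than a verbatim copy of the equation. The function name, shapes, and the hard-coded `c` and `alpha` values are hypothetical placeholders for what the model would predict.

```python
import numpy as np

def mixture_accumulate(x, alpha, c, basis):
    """m_a = sum_{a'} alpha[a, a'] * sum_k c[k] * (basis[k] @ x[a']) (assumed form)."""
    # x: (N, dim) node states, alpha: (N, N) edge weights,
    # c: (K,) kernel weights, basis: (K, dim, dim) basis kernels W_k.
    W = np.tensordot(c, basis, axes=1)  # image-specific kernel, shape (dim, dim)
    return alpha @ (x @ W.T)

rng = np.random.default_rng(0)
N, dim, K = 3, 4, 2
x = rng.normal(size=(N, dim))
basis = rng.normal(size=(K, dim, dim))
c = np.array([0.7, 0.3])              # placeholder for the predicted kernel weights
alpha = np.array([[0.0, 0.5, 0.5],
                  [0.9, 0.0, 0.1],
                  [0.4, 0.6, 0.0]])   # placeholder for the learned edge weights

m = mixture_accumulate(x, alpha, c, basis)
```

Because the mixture is linear, the basis kernels can be collapsed into a single matrix once per image before any messages are passed, so the cost per propagation step stays the same as with a single kernel.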
The COMBINE step is formulated as a gated update similar to .
where rₐ and zₐ are the reset and update gates, and Wz, Wᵣ, Wₕ are the weights of the update functions. Such a state update mechanism allows information to be combined gradually, ensuring that information from previous time steps is not lost.
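The gated update can be sketched as follows. We assume the standard GRU-style equations here (gates computed from the concatenation of message and state, biases omitted); the exact form used in the model may differ, and the function name and weight shapes are our own choices for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_combine(x, m, W_z, W_r, W_h):
    """GRU-style COMBINE (assumed form, no biases).

    x, m: (N, dim) current states and incoming messages.
    W_z, W_r, W_h: (dim, 2 * dim) weights of the update functions.
    """
    mx = np.concatenate([m, x], axis=-1)
    z = sigmoid(mx @ W_z.T)                                   # update gate
    r = sigmoid(mx @ W_r.T)                                   # reset gate
    h = np.tanh(np.concatenate([m, r * x], axis=-1) @ W_h.T)  # candidate state
    # Interpolate between the old state and the candidate.
    return (1.0 - z) * x + z * h

rng = np.random.default_rng(0)
N, dim = 3, 4
x = rng.normal(size=(N, dim))
m = rng.normal(size=(N, dim))
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(dim, 2 * dim)) for _ in range(3))

x_new = gated_combine(x, m, W_z, W_r, W_h)
```

When the update gate z is close to zero, the node keeps its previous state almost unchanged, which is how information from earlier time steps is preserved.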
The overall model is shown below:
Dynamic Graph Structure
The interaction between the nodes in the graph varies depending on which roles are involved. To model such dependent interactions, we learn the edge weights using an attention mechanism similar to .
where Wₐₜₜₙ is the attention kernel, a is the attention mechanism and Nₐ is the set of all neighbors of a in the graph.
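For intuition, here is a small sketch of how such edge weights could be computed. It assumes a graph-attention-style scoring function, where a is a single vector applied to the concatenated projected states followed by a LeakyReLU, and a softmax over each node's neighbors. This choice of a, the `slope` parameter, and all names are assumptions on our part, not necessarily what the model uses.

```python
import numpy as np

def attention_weights(x, adj, W_attn, a_vec, slope=0.2):
    """alpha[a, a'] = softmax over neighbours a' of a(W_attn x_a, W_attn x_a')."""
    h = x @ W_attn.T                    # projected states, (N, d)
    d = h.shape[1]
    # a_vec . [h_a ; h_a'] splits into a term for a and a term for a'.
    e = (h @ a_vec[:d])[:, None] + (h @ a_vec[d:])[None, :]
    e = np.where(e > 0, e, slope * e)   # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)   # non-neighbours get zero weight
    # Assumes every node has at least one neighbour.
    e = e - e.max(axis=1, keepdims=True)
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, dim, d = 4, 8, 5
x = rng.normal(size=(N, dim))
adj = np.ones((N, N)) - np.eye(N)       # toy graph: fully connected, no self-loops
W_attn = rng.normal(size=(d, dim))
a_vec = rng.normal(size=2 * d)

alpha = attention_weights(x, adj, W_attn, a_vec)
```

Each row of `alpha` sums to one over the neighbors of that node, so the attention module redistributes a fixed budget of influence among the incoming edges instead of weighting them all equally.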
Depending on the action in the image, the interaction between the role nodes can differ. For example, the interaction between agent and place for skidding is different from what it would be in the case of repairing.
To incorporate such image-dependent interaction, we model the kernel matrix as a convex combination of basis kernels. For a given image, the model is then trained to learn a set of membership weights for each of the basis kernels as
where K is the number of basis kernels.
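One simple way to obtain valid membership weights is a softmax over K scores computed from image features, since the softmax output is non-negative and sums to one. The sketch below assumes exactly that; `W_c` is a hypothetical linear layer and `img_feat` stands in for features from the verb-prediction network, so treat this as an illustration rather than the model's actual head.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def kernel_weights(img_feat, W_c):
    """Membership weights c over the K basis kernels (non-negative, sum to 1)."""
    # W_c: (K, feat_dim), a hypothetical linear layer on top of the image features.
    return softmax(W_c @ img_feat)

rng = np.random.default_rng(0)
K, feat_dim, dim = 3, 16, 4
img_feat = rng.normal(size=feat_dim)    # placeholder for CNN image features
W_c = rng.normal(size=(K, feat_dim))
basis = rng.normal(size=(K, dim, dim))  # basis kernels W_1 ... W_K

c = kernel_weights(img_feat, W_c)
W = np.einsum("k,kij->ij", c, basis)    # convex combination of the basis kernels
```

Two images with different actions produce different `c`, and therefore different kernels `W`, while all images share the same K basis kernels.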
For quantitative results, implementation details and ablation studies, refer to our paper at ICCV 2019.