Scene Graph Generation, Compression, and Classification on Action Genome Dataset
A step-by-step tutorial for applying graph ML to perform scene graph generation, graph compression, and action classification tasks on the Action Genome dataset.
This blog post was co-authored by Zhuoyi Huang and Tina Li as part of the Stanford CS224W course project in Autumn 2021.
Introduction
In this tutorial, we will walk you through the detailed steps to perform scene graph generation, graph compression, and action classification tasks on the Action Genome dataset. We will preprocess a subset of frames from the dataset, generate a scene graph for each frame, and classify each scene graph by an action label, which we choose to be “using the phone.”
Before we start, let’s clarify some key definitions and the models used for each task.
Task 1: Scene Graph Generation
Scene graph generation takes an image as input and generates a visually grounded scene graph. A scene graph is a directed graph in which the nodes represent objects and the edges represent their pairwise relationships. Specifically, we split the scene graph generation task into two sub-tasks: node prediction and edge prediction.
Figure 1 shows an overview of scene graph generation, where the bottom-right corner is a generated scene graph with objects (blue nodes) and relationships (red nodes).
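To make this structure concrete, here is a minimal sketch (not part of the original pipeline) of how a tiny scene graph could be stored as a PyTorch Geometric (PyG) Data object, which we will rely on later in this tutorial; the label indices are hypothetical placeholders:

import torch
from torch_geometric.data import Data

# Toy scene graph: person -(holding)-> phone, person -(sitting on)-> chair.
node_labels = torch.tensor([0, 1, 2])        # hypothetical object ids: 0=person, 1=phone, 2=chair
edge_index = torch.tensor([[0, 0],           # source nodes of the directed edges
                           [1, 2]])          # target nodes
edge_labels = torch.tensor([0, 1])           # hypothetical relationship ids: 0=holding, 1=sitting on

scene_graph = Data(x=node_labels.unsqueeze(-1).float(),
                   edge_index=edge_index,
                   edge_attr=edge_labels)
print(scene_graph)  # Data(x=[3, 1], edge_index=[2, 2], edge_attr=[2])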
Node Prediction
First, we apply an object detection benchmark adapted from Mask R-CNN and detect the bounding boxes (red rectangles) and object labels (texts on the top-left corner of rectangles) for each image.
Edge Prediction
Second, given the objects as nodes, we apply a scene graph benchmark model to predict the relationships between any two nodes as the edges. For example, has, behind, and on are some possible relationships between nodes in figure 2. The predicted nodes and edges together form the scene graph of this image.
Task 2: Graph Compression
Taking the generated scene graphs as inputs, graph compression reduces each original full graph to a smaller graph representation.
Effective graph compression can help to simplify the visualization and clearly represent the high-level structure of large graph data. Graph compression also helps to improve downstream task performance, e.g., graph classification task across action labels.
We apply the Multi-kernel Inductive Attention Graph Autoencoder (MIAGAE), a recently proposed model adapted from Graph Autoencoder for Graph Compression and Representation Learning (2021), which consists of an encoder E and a decoder D. To learn to compress and reconstruct graphs, E first learns to eliminate nodes and edges to obtain a smaller compressed graph; D then learns to add nodes back and reconstruct the original graph.
As shown in figure 4, the encoder has pairs of Multi-kernel Inductive Graph convolution layers and Similarity Attention Graph Pooling layers. The decoder has inductive un-pooling layers. We will discuss the implementation details in later steps.
Task 3: Action Classification
Finally, we classify the scene graphs for the action label “using the phone” to evaluate the performance of our scene graph generation and graph compression models. Specifically, frames can either be classified as “using a phone” or “not using a phone.”
A Graph Convolutional Network (GCN) is an approach for semi-supervised learning on graph-structured data, first proposed by Semi-Supervised Classification with Graph Convolutional Networks (2017). We will apply a GCN classification model consisting of 2 graph convolutional layers and 4 layers of multilayer perceptron (MLP) to output the final classification result.
Action Genome Dataset Overview
Our dataset is proposed by Action Genome: Actions as Composition of Spatio-temporal Scene Graphs (2019). Action Genome is a video database with an aim to bridge human actions and human-object relationships.
Here is a list of important statistics about the dataset.
- 10K Videos
- 265K Labeled Frames
- 157 Action Categories
- 583K Bounding Boxes of Interacted Objects
- 0.4M Object Instances with 35 Object Categories
- 1.7M Human-Object Relationship Instances with 25 Relationship Categories
We choose the Action Genome dataset because it is a large dataset proposed fairly recently with clean labels and annotations. We can take advantage of this useful dataset and apply a wide range of graph ML techniques to solve real-world challenges.
Step by Step Google Colab Tutorial
Now, we have looked at an overview of our tasks and the Action Genome dataset. Let’s start our implementation, from data preprocessing to training and testing. Please refer to the Google Colab below and follow steps 1–5.
1. Data Preprocessing on Action Genome
The first step is to prepare the Action Genome data on which we will perform scene graph generation, graph compression, and action classification.
The frames can be directly downloaded from Action Genome’s website. For your convenience, we also uploaded the frames onto a Google Drive folder. Please add a shortcut to your Drive, so you can mount it following the Colab.
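If you are working in your own Colab notebook, mounting Drive takes two lines:

# Mount Google Drive so the shared frame folder (added as a shortcut) is visible under /content/drive.
from google.colab import drive
drive.mount('/content/drive')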
For the purpose of this tutorial, we use a small subset of frames in ActionGenome/dataset/ag/frames_bicls_2 (400 images) from the full dataset (265K images).
2. Environment Configuration
For scene graph generation, we use requirements similar to those of the maskrcnn-benchmark repository. Please refer to its Q&A page if you encounter issues.
For graph compression and action classification, we mainly use PyG (PyTorch Geometric), which is useful for implementing Graph ML methods and can be applied to a wide range of applications involving structured data.
Here is a detailed list of requirements:
- PyG (PyTorch Geometric)
- PyTorch ≥ 1.2.0
- torchvision ≥ 0.4.0
- CUDA ≥ 10.0
- cocoapi
- apex
- ninja
- yacs
- cython
- tqdm
- OpenCV
- matplotlib
- GCC ≥ 4.9
3. Scene Graph Generation
Scene graphs are a representation of image information in graph form, which encodes objects as nodes and their pairwise relationships as edges.
3.1 Inference Using Pre-trained Model
In this tutorial, we focus on how to apply a pre-trained scene graph generation model to implement graph compression and action classification pipelines using PyG.
To generate scene graphs on the 400 frames of Action Genome, we use a pre-trained state-of-the-art Neural Motifs model based on Neural Motifs: Scene Graph Parsing with Global Context (2018). We also use the SUM fusion function and Total Direct Effect (TDE) analysis framework based on Unbiased Scene Graph Generation from Biased Training (2020).
As shown in figure 7, the final predicate logits Y are generated by the SUM fusion function, which sums the inputs from the three branches I (image), X (object features), and Z (object labels). For evaluation, we use the Relationship Retrieval metrics and the Scene Graph Detection (SGDet) task, which detects scene graphs from scratch. The model was trained with the conventional cross-entropy losses on object labels and predicate labels.
The Total Direct Effect (TDE) method directly separates the bias from existing models without training additional layers to model the bias. Please take a look at the post Eliminating Bias from Scene Graph Generation for details on this unbiased scene graph generation model.
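Roughly speaking (this is our paraphrase of the 2020 paper, so treat the exact form as an assumption), the unbiased predicate logits are obtained by subtracting a counterfactual prediction, in which the object features are wiped out, from the factual prediction:

Y_{\text{TDE}} = Y(I{=}i,\, X{=}x,\, Z{=}z) \;-\; Y(I{=}i,\, X{=}\bar{x},\, Z{=}z)

where \bar{x} denotes the counterfactual object features (e.g., mean or zeroed features), so the context bias carried by the image and object labels alone is removed.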
Below is a command for running the pre-trained model we choose (MOTIFS, SUM, and TDE). If you want to try other models, update the corresponding arguments. This command runs inference and saves custom_data_info.json and custom_prediction.json files for visualization and compression steps later.
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
--master_port 10027 --nproc_per_node=1 tools/relation_test_net.py \
--config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" \
MODEL.ROI_RELATION_HEAD.USE_GT_BOX False \
MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False \
MODEL.ROI_RELATION_HEAD.PREDICTOR CausalAnalysisPredictor \
MODEL.ROI_RELATION_HEAD.CAUSAL.EFFECT_TYPE TDE \
MODEL.ROI_RELATION_HEAD.CAUSAL.FUSION_TYPE sum \
MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER motifs \
TEST.IMS_PER_BATCH 1 \
DTYPE "float16" \
GLOVE_DIR /content/drive/MyDrive/cs224w/glove \
MODEL.PRETRAINED_DETECTOR_CKPT /content/drive/MyDrive/cs224w/checkpoints/upload_causal_motif_sgdet \
OUTPUT_DIR /content/drive/MyDrive/cs224w/checkpoints/upload_causal_motif_sgdet \
TEST.CUSTUM_EVAL True \
TEST.CUSTUM_PATH /content/drive/MyDrive/cs224w/ActionGenome/dataset/ag/frames_bicls \
DETECTED_SGG_DIR /content/drive/MyDrive/cs224w/ActionGenome/dataset/ag/anno_frames_bicls
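Once inference finishes, the saved JSON files can be converted into per-frame graphs for the later steps. Below is a minimal sketch of that conversion; the field names (bbox_labels, rel_pairs, rel_labels) follow the output format of the scene graph benchmark codebase as we used it, so double-check them against your own custom_prediction.json:

import json
import torch
from torch_geometric.data import Data

# Paths are placeholders; point them at the DETECTED_SGG_DIR used in the command above.
with open('custom_data_info.json') as f:
    data_info = json.load(f)
with open('custom_prediction.json') as f:
    predictions = json.load(f)

def to_scene_graph(image_idx, box_topk=10, rel_topk=20):
    """Build a PyG Data object from the top-k detections of one frame."""
    pred = predictions[str(image_idx)]
    labels = torch.tensor(pred['bbox_labels'][:box_topk])       # object class ids
    rel_pairs = torch.tensor(pred['rel_pairs'][:rel_topk]).t()  # [2, num_rels] subject/object indices
    rel_labels = torch.tensor(pred['rel_labels'][:rel_topk])    # predicate class ids

    # Keep only relationships whose endpoints survived the box_topk cut.
    mask = (rel_pairs < box_topk).all(dim=0)
    return Data(x=labels.unsqueeze(-1).float(),
                edge_index=rel_pairs[:, mask],
                edge_attr=rel_labels[mask])

graph_0 = to_scene_graph(0)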
3.2. Visualization of Generated Scene Graphs
The pre-trained model gives accurate bounding boxes, object detections, and relationship predictions. It can precisely detect people, objects, and their pairwise relationships in indoor settings.
We defined a visualize_scene_graph method, where image_idx ranges from 0 to 399 to select one of the 400 frames, box_topk selects the number of bounding boxes to show, and rel_topk selects the number of relationships to show. This method displays the results of both object detection (bounding boxes and object labels) and edge prediction (relationship labels), i.e., a scene graph.
visualize_scene_graph(image_idx, box_topk, rel_topk)
Figure 9 shows an example of the generated scene graph for a frame with label 1 for the action “using the phone”. All of the box and relationship labels are precise, and the bounding boxes for smaller objects (e.g., ear, phone) have accurate borders. This example shows that the model performs well on people and key objects, so it can be applied to custom datasets for human-object interactions.
Relationship labels are directed edges, as shown in figure 10. The “phone in hand” relationship label has the highest score and is an important edge prediction for future compression and classification steps.
4. Graph Compression
Now, we have converted 400 frames to scene graph structures with node and edge labels. Next, we can implement our training and testing pipelines for graph compression using PyG.
4.1 Model Definition
The Multi-kernel Inductive Attention Graph Autoencoder (MIAGAE) model consists of an encoder network E and a decoder network D. Instead of compressing nodes/edges separately, the MIAGAE model utilizes the node similarity and graph structure to compress all nodes and edges as a whole.
4.2 Encoder Network E
The encoder uses multi-kernel inductive graph convolution (MI-Conv) and similarity attention graph pooling (SimAGPool) layers.
(1) MI-Conv uses multiple kernels, each with its corresponding transformation weights W1 and W2, which is defined by the SGAT layer (class SGAT) in our Colab. We use formulas below for features extracted by the first kernel and final features aggregating all m kernels, where σ is a non-linear activation function (we use ReLU) and Aggre is an aggregate function combining multi-kernel results (we use addition).
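The formula images are not reproduced here; a GraphSAGE-style sketch of what each kernel computes (our reading of the MIAGAE design, so treat the exact form as an assumption) is

h_i^{(1)} = \sigma\big(W_1^{(1)} x_i + W_2^{(1)} \cdot \mathrm{mean}_{j \in \mathcal{N}(i)} x_j\big), \qquad h_i = \mathrm{Aggre}\big(h_i^{(1)}, \ldots, h_i^{(m)}\big)

where x_i is the input feature of node i, N(i) is its neighborhood, and the aggregation over the m kernels is the element-wise addition mentioned above.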
For each MI-Conv layer, we initialize a gat_list to store multiple SAGEAttn layers, which are defined by class SAGEAttn in our Colab. SAGEAttn is a single-kernel Inductive-Conv layer inspired by GraphSAGE from Inductive Representation Learning on Large Graphs (2017). The forward process follows the GraphSAGE-style update sketched below.
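Here is a minimal sketch of such a single-kernel inductive layer written with PyG’s MessagePassing; the class name and the exact attention weighting are our assumptions based on the description above, not the Colab’s actual implementation:

import torch
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing

class SAGEAttnSketch(MessagePassing):
    """Single-kernel GraphSAGE-style layer with a simple per-edge attention weight (illustrative only)."""
    def __init__(self, in_dim, out_dim):
        super().__init__(aggr='mean')
        self.w_self = torch.nn.Linear(in_dim, out_dim)   # W1: transforms the node's own feature
        self.w_neigh = torch.nn.Linear(in_dim, out_dim)  # W2: transforms the aggregated neighbors
        self.att = torch.nn.Linear(2 * in_dim, 1)        # scores each (target, source) feature pair

    def forward(self, x, edge_index):
        neigh = self.propagate(edge_index, x=x)          # attention-weighted mean over neighbors
        return F.relu(self.w_self(x) + self.w_neigh(neigh))

    def message(self, x_i, x_j):
        alpha = torch.sigmoid(self.att(torch.cat([x_i, x_j], dim=-1)))
        return alpha * x_j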
(2) SimAGPool selects the most representative nodes with the most information by using the similarity and topology among nodes, which is defined by the Pooling layer (class Pooling) in our Colab. It achieves downsampling on graph data and adaptively selects a subset of nodes/edges to form a new but smaller graph.
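A rough sketch of the idea (a simplified stand-in, not the actual Pooling class): score every node, keep the top fraction, and drop edges whose endpoints were removed.

import torch
import torch.nn.functional as F

def sim_pool_sketch(x, edge_index, keep_ratio=0.85):
    """Keep the top keep_ratio nodes by a similarity score and filter edges accordingly (illustrative only)."""
    # Placeholder score: mean cosine similarity to all nodes, standing in for the learned similarity attention.
    sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1).mean(dim=1)
    num_keep = max(1, int(keep_ratio * x.size(0)))
    keep = torch.topk(sim, num_keep).indices

    # Re-index surviving nodes and drop edges touching removed nodes.
    new_id = torch.full((x.size(0),), -1, dtype=torch.long)
    new_id[keep] = torch.arange(num_keep)
    src, dst = new_id[edge_index[0]], new_id[edge_index[1]]
    mask = (src >= 0) & (dst >= 0)
    return x[keep], torch.stack([src[mask], dst[mask]])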
4.3 Decoder Network D
The decoder uses inductive un-pooling layers, which are defined by the SGConv layer (class SGConv) in our Colab. SGConv is the decoder-side counterpart of the encoder’s SGAT layer, and the inductive un-pooling layer has the same parameter settings as the Inductive-Conv layer in E.
The number of SGConv layers is also the same as the number of SimAGPool layers in E. SGConv takes two inputs: (1) the output graph from the previous layer and (2) the edge information of the newly added nodes, which matches the edges eliminated by the corresponding SimAGPool layer in E.
4.4 Training/Testing Pipeline for Graph Compression
We use the following hyperparameters and run the command below to train the compression model.
- batch size: 40
- number of epochs: 300
- learning rate: 1e-4
- number of samples for train set: 280
- number of samples for test set: 120
- compression rate: 0.85
python train_compression.py \
--d ag \
--batch 40 \
--e 300 \
--lr 1e-4 \
--n_train 280 \
--n_test 120 \
--c_rate 0.85 \
--model_dir '/content/drive/MyDrive/cs224w/Graph_AE/data/model'
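Under the hood, train_compression.py iterates over batches of scene graphs, encodes each batch into a smaller graph, decodes it back, and minimizes a reconstruction loss. A highly simplified sketch of such a loop (the actual script’s loss terms and interfaces may differ) looks like this:

import torch
import torch.nn.functional as F
from torch_geometric.loader import DataLoader

def train_compression_sketch(encoder, decoder, train_graphs, epochs=300, batch_size=40, lr=1e-4):
    """Minimal reconstruction-training loop; encoder/decoder stand in for MIAGAE's E and D."""
    loader = DataLoader(train_graphs, batch_size=batch_size, shuffle=True)
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for batch in loader:
            optimizer.zero_grad()
            z, small_edge_index, unpool_info = encoder(batch.x, batch.edge_index)
            x_rec = decoder(z, small_edge_index, unpool_info)
            loss = F.mse_loss(x_rec, batch.x)   # node-feature reconstruction loss
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f'epoch {epoch}: avg loss {total / len(loader):.4f}')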
With the compression rate set to 0.85, the model reaches fairly low training and test losses of around 0.35. This suggests the encoder and decoder work well: the model can compress the original graph into a smaller structure without losing much of the original graph information.
5. Action Classification
Lastly, to evaluate the performance of our models, we apply a GCN classification model to output the final classification result. Our model takes the scene graphs after compression and classifies the action as either “using a phone” or “not using a phone.”
5.1 Training/Testing Pipeline for Action Classification
The classifier model has 2 graph convolutional layers with node feature dimension 64, followed by 4 layers of MLP with batch normalization.
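A minimal PyG sketch matching that description (the pooling choice and MLP widths beyond the 64-dimensional GCN layers are our own assumptions):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class ActionClassifierSketch(torch.nn.Module):
    """2 GCN layers (dim 64) + 4-layer MLP with batch norm, binary output (illustrative only)."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2),      # "using a phone" vs. "not using a phone"
        )

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)       # one embedding per scene graph
        return self.mlp(x)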
We use the same hyperparameters as in step 4.4 and run the command below to train the classifier.
python train_classifier.py \
--d ag \
--n_skip 0 \
--batch 40 \
--e 300 \
--lr 1e-4 \
--n_train 280 \
--n_test 120 \
--c_rate 0.85 \
--model_dir '/content/drive/MyDrive/cs224w/Graph_AE/data/model'
With the compression rate set to 0.85, the final train accuracy is around 0.60 and the test accuracy is around 0.56. This suggests our classifier performs moderately better than random guessing on the 400 scene graphs.
Analysis and Limitations
The graph-based downstream tasks (i.e., action classification) on these graph structures could be further improved given more time and additional computational resources beyond Google Colab.
We expect the accuracies to be higher if we run our models on the full Action Genome dataset (265K frames across all 157 action categories), since we already obtain reasonable results using a small subset (400 frames) to classify a binary label (the “using the phone” action). We look forward to further improving classification performance as more data is used for training and testing.
More importantly, it is fairly difficult to classify an action from a single frame alone. One possible solution is to utilize a sequence of consecutive frames from a video and synthesize the temporal and spatial information for high-level action classification. Many other graph ML techniques can also help here, such as object tracking and dynamic scene graphs.
For the purpose of this tutorial, we focus on how to convert customized image datasets into graph structures (i.e., scene graph generation) and how to perform graph compression on such complex, real-world graphs.
Conclusion
We have covered a detailed, step-by-step tutorial on using graph ML techniques and PyG to perform scene graph generation, graph compression, and action classification tasks on 400 labeled frames of the Action Genome dataset.
Now, you have learned to (1) represent any custom image dataset as a scene graph structure (with node and edge labels), (2) compress a complicated graph into a simpler structure, and (3) classify a graph by a particular label.
You can apply what you learned from this tutorial to various domains and answer challenging questions involving structured data with graph ML.