Scene Graph Generation, Compression, and Classification on Action Genome Dataset
A step-by-step tutorial for applying graph ML to perform scene graph generation, graph compression, and action classification tasks on the Action Genome dataset.
This blog post was co-authored by Zhuoyi Huang and Tina Li as part of the Stanford CS224W course project in Autumn 2021.
Introduction
In this tutorial, we will walk you through the detailed steps to perform scene graph generation, graph compression, and action classification tasks on the Action Genome dataset. We will preprocess a subset of frames from the dataset, generate a scene graph for each frame, and classify each scene graph by an action label, which we choose to be “using the phone.”
Before we start, let’s clarify some key definitions and the models used for each task.
Task 1: Scene Graph Generation
Scene graph generation takes an image as input and generates a visually grounded scene graph. A scene graph is a directed graph in which the nodes represent objects and the edges represent their pairwise relationships. Specifically, we split the scene graph generation task into two sub-tasks: node prediction and edge prediction.
Figure 1 shows an overview of scene graph generation, where the bottom-right corner is a generated scene graph with objects (blue nodes) and relationships (red nodes).
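To make this structure concrete, here is a minimal sketch (not part of the original pipeline) of how a tiny scene graph could be stored as a PyTorch Geometric (PyG) Data object, which we will rely on later in this tutorial; the label indices are hypothetical placeholders:

import torch
from torch_geometric.data import Data

# Toy scene graph: person -(holding)-> phone, person -(sitting on)-> chair.
node_labels = torch.tensor([0, 1, 2])        # hypothetical object ids: 0=person, 1=phone, 2=chair
edge_index = torch.tensor([[0, 0],           # source nodes of the directed edges
                           [1, 2]])          # target nodes
edge_labels = torch.tensor([0, 1])           # hypothetical relationship ids: 0=holding, 1=sitting on

scene_graph = Data(x=node_labels.unsqueeze(-1).float(),
                   edge_index=edge_index,
                   edge_attr=edge_labels)
print(scene_graph)  # Data(x=[3, 1], edge_index=[2, 2], edge_attr=[2])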
Node Prediction
First, we apply an object detection benchmark adapted from Mask R-CNN and detect the bounding boxes (red rectangles) and object labels (texts on the top-left corner of rectangles) for each image.
Edge Prediction
Second, given the objects as nodes, we apply a scene graph benchmark model to predict the relationships between any two nodes as the edges. For example, has, behind, and on are some possible relationships between nodes in figure 2. The predicted nodes and edges together form the scene graph of this image.
Task 2: Graph Compression
Taking the generated scene graphs as inputs, graph compression reduces each original full graph to a smaller graph representation.
Effective graph compression can help to simplify the visualization and clearly represent the high-level structure of large graph data. Graph compression also helps to improve downstream task performance, e.g., graph classification task across action labels.
We apply the Multi-kernel Inductive Attention Graph Autoencoder (MIAGAE), a recently proposed model adapted from Graph Autoencoder for Graph Compression and Representation Learning (2021), which consists of an encoder E and a decoder D. To learn to compress and reconstruct graphs, E first learns to eliminate nodes and edges to obtain a smaller compressed graph; D then learns to add nodes back and reconstruct the original graph.
As shown in figure 4, the encoder has pairs of Multi-kernel Inductive Graph convolution layers and Similarity Attention Graph Pooling layers. The decoder has inductive un-pooling layers. We will discuss the implementation details in later steps.
Task 3: Action Classification
Finally, we classify the scene graphs for the action label “using the phone” to evaluate the performance of our scene graph generation and graph compression models. Specifically, frames can either be classified as “using a phone” or “not using a phone.”
A Graph Convolutional Network (GCN) is an approach for semi-supervised learning on graph-structured data, first proposed by Semi-Supervised Classification with Graph Convolutional Networks (2017). We will apply a GCN classification model consisting of 2 graph convolutional layers and 4 layers of multilayer perceptron (MLP) to output the final classification result.
Action Genome Dataset Overview
Our dataset is proposed by Action Genome: Actions as Composition of Spatio-temporal Scene Graphs (2019). Action Genome is a video database with an aim to bridge human actions and human-object relationships.
Here is a list of important statistics about the dataset.
- 10K Videos
- 265K Labeled Frames
- 157 Action Categories
- 583K Bounding Boxes of Interacted Objects
- 0.4M Object Instances with 35 Object Categories
- 1.7M Human-Object Relationship Instances with 25 Relationship Categories
We choose the Action Genome dataset because it is a large dataset proposed fairly recently with clean labels and annotations. We can take advantage of this useful dataset and apply a wide range of graph ML techniques to solve real-world challenges.
Step by Step Google Colab Tutorial
Now, we have looked at an overview of our tasks and the Action Genome dataset. Let’s start our implementation, from data preprocessing to training and testing. Please refer to the Google Colab below and follow steps 1–5.
1. Data Preprocessing on Action Genome
The first step is to prepare the Action Genome data on which we will perform scene graph generation, graph compression, and action classification.
The frames can be directly downloaded from Action Genome’s website. For your convenience, we also uploaded the frames onto a Google Drive folder. Please add a shortcut to your Drive, so you can mount it following the Colab.
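If you are working in your own Colab notebook, mounting Drive takes two lines:

# Mount Google Drive so the shared frame folder (added as a shortcut) is visible under /content/drive.
from google.colab import drive
drive.mount('/content/drive')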
For the purpose of this tutorial, we use a small subset of frames in ActionGenome/dataset/ag/frames_bicls_2 (400 images) from the full dataset (265K images).
2. Environment Configuration
For scene graph generation, we use requirements similar to those of the maskrcnn-benchmark repository. Please refer to its Q&A page if you encounter issues.
For graph compression and action classification, we mainly use PyG (PyTorch Geometric), which is useful for implementing Graph ML methods and can be applied to a wide range of applications involving structured data.
Here is a detailed list of requirements:
- PyG (PyTorch Geometric)
- PyTorch ≥ 1.2.0
- torchvision ≥ 0.4.0
- CUDA ≥ 10.0
- cocoapi
- apex
- ninja
- yacs
- cython
- tqdm
- OpenCV
- matplotlib
- GCC ≥ 4.9
3. Scene Graph Generation
Scene graphs are a representation of image information in graph form, which encodes objects as nodes and their pairwise relationships as edges.
3.1 Inference Using Pre-trained Model
In this tutorial, we focus on how to apply a pre-trained scene graph generation model to implement graph compression and action classification pipelines using PyG.
To generate scene graphs on the 400 frames of Action Genome, we use a pre-trained state-of-the-art Neural Motifs model based on Neural Motifs: Scene Graph Parsing with Global Context (2018). We also use the SUM fusion function and Total Direct Effect (TDE) analysis framework based on Unbiased Scene Graph Generation from Biased Training (2020).
As shown in figure 7, the final predicate logits Y are generated by the SUM fusion function, which sums the inputs from the three branches I (image), X (object features), and Z (object labels). For evaluation, we use the Relationship Retrieval metrics and the Scene Graph Detection (SGDet) task, which detects scene graphs from scratch. The model was trained with the conventional cross-entropy losses on object labels and predicate labels.
The Total Direct Effect (TDE) method directly separates the bias from existing models without training additional layers to model the bias. Please take a look at the post Eliminating Bias from Scene Graph Generation for details on this unbiased scene graph generation model.
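Roughly speaking (this is our paraphrase of the 2020 paper, so treat the exact form as an assumption), the unbiased predicate logits are obtained by subtracting a counterfactual prediction, in which the object features are wiped out, from the factual prediction:

Y_{\text{TDE}} = Y(I{=}i,\, X{=}x,\, Z{=}z) \;-\; Y(I{=}i,\, X{=}\bar{x},\, Z{=}z)

where \bar{x} denotes the counterfactual object features (e.g., mean or zeroed features), so the context bias carried by the image and object labels alone is removed.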
Below is a command for running the pre-trained model we choose (MOTIFS, SUM, and TDE). If you want to try other models, update the corresponding arguments. This command runs inference and saves custom_data_info.json and custom_prediction.json files for visualization and compression steps later.
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
--master_port 10027 --nproc_per_node=1 tools/relation_test_net.py \
--config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" \
MODEL.ROI_RELATION_HEAD.USE_GT_BOX False \
MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False \
MODEL.ROI_RELATION_HEAD.PREDICTOR CausalAnalysisPredictor \
MODEL.ROI_RELATION_HEAD.CAUSAL.EFFECT_TYPE TDE \
MODEL.ROI_RELATION_HEAD.CAUSAL.FUSION_TYPE sum \
MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER motifs \
TEST.IMS_PER_BATCH 1 \
DTYPE "float16" \
GLOVE_DIR /content/drive/MyDrive/cs224w/glove \
MODEL.PRETRAINED_DETECTOR_CKPT /content/drive/MyDrive/cs224w/checkpoints/upload_causal_motif_sgdet \
OUTPUT_DIR /content/drive/MyDrive/cs224w/checkpoints/upload_causal_motif_sgdet \
TEST.CUSTUM_EVAL True \
TEST.CUSTUM_PATH /content/drive/MyDrive/cs224w/ActionGenome/dataset/ag/frames_bicls \
DETECTED_SGG_DIR /content/drive/MyDrive/cs224w/ActionGenome/dataset/ag/anno_frames_bicls
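Once inference finishes, the saved JSON files can be converted into per-frame graphs for the later steps. Below is a minimal sketch of that conversion; the field names (bbox_labels, rel_pairs, rel_labels) follow the output format of the scene graph benchmark codebase as we used it, so double-check them against your own custom_prediction.json:

import json
import torch
from torch_geometric.data import Data

# Paths are placeholders; point them at the DETECTED_SGG_DIR used in the command above.
with open('custom_data_info.json') as f:
    data_info = json.load(f)
with open('custom_prediction.json') as f:
    predictions = json.load(f)

def to_scene_graph(image_idx, box_topk=10, rel_topk=20):
    """Build a PyG Data object from the top-k detections of one frame."""
    pred = predictions[str(image_idx)]
    labels = torch.tensor(pred['bbox_labels'][:box_topk])       # object class ids
    rel_pairs = torch.tensor(pred['rel_pairs'][:rel_topk]).t()  # [2, num_rels] subject/object indices
    rel_labels = torch.tensor(pred['rel_labels'][:rel_topk])    # predicate class ids

    # Keep only relationships whose endpoints survived the box_topk cut.
    mask = (rel_pairs < box_topk).all(dim=0)
    return Data(x=labels.unsqueeze(-1).float(),
                edge_index=rel_pairs[:, mask],
                edge_attr=rel_labels[mask])

graph_0 = to_scene_graph(0)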
3.2. Visualization of Generated Scene Graphs
The pre-trained model gives accurate bounding boxes, object detections, and relationship predictions. It can precisely detect people, objects, and their pairwise relationships in indoor settings.
We defined a visualize_scene_graph method, where image_idx ranges from 0 to 399 to select one of the 400 frames, box_topk selects the number of bounding boxes to show, and rel_topk selects the number of relationships to show. This method displays the results of both object detection (bounding boxes and object labels) and edge prediction (relationship labels), i.e., a scene graph.
visualize_scene_graph(image_idx, box_topk, rel_topk)
Figure 9 shows an example of the generated scene graph for a frame with label 1 for the action “using the phone”. All of the box and relationship labels are precise, and the bounding boxes for smaller objects (e.g., ear, phone) have accurate borders. This example shows that the model performs well on people and key objects, so it can be applied to custom datasets for human-object interactions.
Relationship labels are directed edges, as shown in figure 10. The “phone in hand” relationship label has the highest score and is an important edge prediction for future compression and classification steps.
4. Graph Compression
Now, we have converted 400 frames to scene graph structures with node and edge labels. Next, we can implement our training and testing pipelines for graph compression using PyG.
4.1 Model Definition
The Multi-kernel Inductive Attention Graph Autoencoder (MIAGAE) model consists of an encoder network E and a decoder network D. Instead of compressing nodes/edges separately, the MIAGAE model utilizes the node similarity and graph structure to compress all nodes and edges as a whole.
4.2 Encoder Network E
The encoder uses multi-kernel inductive graph convolution (MI-Conv) and similarity attention graph pooling (SimAGPool) layers.
(1) MI-Conv uses multiple kernels, each with its corresponding transformation weights W1 and W2, which is defined by the SGAT layer (class SGAT) in our Colab. We use formulas below for features extracted by the first kernel and final features aggregating all m kernels, where σ is a non-linear activation function (we use ReLU) and Aggre is an aggregate function combining multi-kernel results (we use addition).
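The formula images are not reproduced here; a GraphSAGE-style sketch of what each kernel computes (our reading of the MIAGAE design, so treat the exact form as an assumption) is

h_i^{(1)} = \sigma\big(W_1^{(1)} x_i + W_2^{(1)} \cdot \mathrm{mean}_{j \in \mathcal{N}(i)} x_j\big), \qquad h_i = \mathrm{Aggre}\big(h_i^{(1)}, \ldots, h_i^{(m)}\big)

where x_i is the input feature of node i, N(i) is its neighborhood, and the aggregation over the m kernels is the element-wise addition mentioned above.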
For each MI-Conv layer, we initialize a gat_list to store multiple SAGEAttn layers, which are defined by class SAGEAttn in our Colab. SAGEAttn is a single-kernel Inductive-Conv layer inspired by GraphSAGE from Inductive Representation Learning on Large Graphs (2017). The forward process follows the GraphSAGE-style update sketched below.
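Here is a minimal sketch of such a single-kernel inductive layer written with PyG’s MessagePassing; the class name and the exact attention weighting are our assumptions based on the description above, not the Colab’s actual implementation:

import torch
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing

class SAGEAttnSketch(MessagePassing):
    """Single-kernel GraphSAGE-style layer with a simple per-edge attention weight (illustrative only)."""
    def __init__(self, in_dim, out_dim):
        super().__init__(aggr='mean')
        self.w_self = torch.nn.Linear(in_dim, out_dim)   # W1: transforms the node's own feature
        self.w_neigh = torch.nn.Linear(in_dim, out_dim)  # W2: transforms the aggregated neighbors
        self.att = torch.nn.Linear(2 * in_dim, 1)        # scores each (target, source) feature pair

    def forward(self, x, edge_index):
        neigh = self.propagate(edge_index, x=x)          # attention-weighted mean over neighbors
        return F.relu(self.w_self(x) + self.w_neigh(neigh))

    def message(self, x_i, x_j):
        alpha = torch.sigmoid(self.att(torch.cat([x_i, x_j], dim=-1)))
        return alpha * x_j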
(2) SimAGPool selects the most representative nodes with the most information by using the similarity and topology among nodes, which is defined by the Pooling layer (class Pooling) in our Colab. It achieves downsampling on graph data and adaptively selects a subset of nodes/edges to form a new but smaller graph.
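A rough sketch of the idea (a simplified stand-in, not the actual Pooling class): score every node, keep the top fraction, and drop edges whose endpoints were removed.

import torch
import torch.nn.functional as F

def sim_pool_sketch(x, edge_index, keep_ratio=0.85):
    """Keep the top keep_ratio nodes by a similarity score and filter edges accordingly (illustrative only)."""
    # Placeholder score: mean cosine similarity to all nodes, standing in for the learned similarity attention.
    sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1).mean(dim=1)
    num_keep = max(1, int(keep_ratio * x.size(0)))
    keep = torch.topk(sim, num_keep).indices

    # Re-index surviving nodes and drop edges touching removed nodes.
    new_id = torch.full((x.size(0),), -1, dtype=torch.long)
    new_id[keep] = torch.arange(num_keep)
    src, dst = new_id[edge_index[0]], new_id[edge_index[1]]
    mask = (src >= 0) & (dst >= 0)
    return x[keep], torch.stack([src[mask], dst[mask]])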
4.3 Decoder Network D
The decoder uses inductive un-pooling layers, which are defined by the SGConv layer (class SGConv) in our Colab. SGConv is the decoder-side counterpart of the encoder’s SGAT layer, and the inductive un-pooling layer has the same parameter settings as the Inductive-Conv layer in E.
The number of SGConv layers is also the same as the number of SimAGPool layers in E. SGConv takes two inputs: (1) the output graph from the previous layer and (2) the edge information of the newly added nodes, which matches the edges eliminated by the corresponding SimAGPool layer in E.
4.4 Training/Testing Pipeline for Graph Compression
We use the following hyperparameters and run the command below to train the compression model.
- batch size: 40
- number of epochs: 300
- learning rate: 1e-4
- number of samples for train set: 280
- number of samples for test set: 120
- compression rate: 0.85
python train_compression.py \
--d ag \
--batch 40 \
--e 300 \
--lr 1e-4 \
--n_train 280 \
--n_test 120 \
--c_rate 0.85 \
--model_dir '/content/drive/MyDrive/cs224w/Graph_AE/data/model'
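Under the hood, train_compression.py iterates over batches of scene graphs, encodes each batch into a smaller graph, decodes it back, and minimizes a reconstruction loss. A highly simplified sketch of such a loop (the actual script’s loss terms and interfaces may differ) looks like this:

import torch
import torch.nn.functional as F
from torch_geometric.loader import DataLoader

def train_compression_sketch(encoder, decoder, train_graphs, epochs=300, batch_size=40, lr=1e-4):
    """Minimal reconstruction-training loop; encoder/decoder stand in for MIAGAE's E and D."""
    loader = DataLoader(train_graphs, batch_size=batch_size, shuffle=True)
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for batch in loader:
            optimizer.zero_grad()
            z, small_edge_index, unpool_info = encoder(batch.x, batch.edge_index)
            x_rec = decoder(z, small_edge_index, unpool_info)
            loss = F.mse_loss(x_rec, batch.x)   # node-feature reconstruction loss
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f'epoch {epoch}: avg loss {total / len(loader):.4f}')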
With the compression rate set to 0.85, the model reaches fairly low training and test losses of around 0.35. This suggests the encoder and decoder work well: the model can compress the original graph into a smaller structure without losing much of the original graph information.
5. Action Classification
Lastly, to evaluate the performance of our models, we apply a GCN classification model to output the final classification result. Our model takes the scene graphs after compression and classifies the action as either “using a phone” or “not using a phone.”
5.1 Training/Testing Pipeline for Action Classification
The classifier model has 2 graph convolutional layers with node feature dimension 64, followed by 4 layers of MLP with batch normalization.
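A minimal PyG sketch matching that description (the pooling choice and MLP widths beyond the 64-dimensional GCN layers are our own assumptions):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class ActionClassifierSketch(torch.nn.Module):
    """2 GCN layers (dim 64) + 4-layer MLP with batch norm, binary output (illustrative only)."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2),      # "using a phone" vs. "not using a phone"
        )

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)       # one embedding per scene graph
        return self.mlp(x)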
We use the same hyperparameters as in step 4.4 and run the command below to train the classifier.
python train_classifier.py \
--d ag \
--n_skip 0 \
--batch 40 \
--e 300 \
--lr 1e-4 \
--n_train 280 \
--n_test 120 \
--c_rate 0.85 \
--model_dir '/content/drive/MyDrive/cs224w/Graph_AE/data/model'
With the compression rate set to 0.85, the final train accuracy is around 0.60 and the test accuracy is around 0.56. This suggests our classifier performs moderately better than random guessing on the 400 scene graphs.
Analysis and Limitations
The graph-based downstream tasks (i.e., action classification) on these graph structures could be further improved given more time and additional computational resources beyond Google Colab.
We expect the accuracies to be higher if we run our models on the full Action Genome dataset (265K frames across all 157 action categories), since we already obtain reasonable results using a small subset (400 frames) to classify a binary label (the “using the phone” action). We look forward to further improving classification performance as more data is used for training and testing.
More importantly, it is fairly difficult to classify an action from a single frame alone. One possible solution is to utilize a sequence of consecutive frames from a video and synthesize the temporal and spatial information for high-level action classification. Many other graph ML techniques can also help here, such as object tracking and dynamic scene graphs.
For the purpose of this tutorial, we focus on how to convert customized image datasets into graph structures (i.e., scene graph generation) and how to perform graph compression on such complex, real-world graphs.
Conclusion
We have covered a detailed, step-by-step tutorial on using graph ML techniques and PyG to perform scene graph generation, graph compression, and action classification tasks on 400 labeled frames of the Action Genome dataset.
Now, you have learned to (1) represent any custom image dataset as a scene graph structure (with node and edge labels), (2) compress a complicated graph into a simpler structure, and (3) classify a graph by a particular label.
You can apply what you learned from this tutorial to various domains and answer challenging questions involving structured data with graph ML.