Eliminating Bias from Scene Graph Generation

Saketh Vishnubhatla
OffNote Labs
Aug 8, 2020

Generating Scene Graphs, the problem of Bad Bias and Causality

A typical Scene Graph generated from an image

Visual Question Answering (VQA) is one of the key areas of research in the computer vision community. To perform VQA well, we need information about all the objects in an image and how these objects are related to each other. This is where Scene Graphs (SGs) come in.

Another key area where SGs help is in generating captions for images or videos. One advantage of using SGs for caption generation is that the generated captions are more likely to be grounded in the image (or video): since each node in a scene graph is an object from the image, a caption generated from the scene graph is more likely to mention objects that actually appear in it.

What is a Scene Graph (SG)?

A SG is a directed graph in which each node represents an object from the image and each edge represents the relationship between two objects.
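Concretely, an SG is little more than a set of nodes plus a list of (subject, predicate, object) triplets. Here is a minimal sketch; the objects and relations are made up purely for illustration:

scene_graph = {
    "nodes": ["man", "horse", "hat"],
    "edges": [
        ("man", "riding", "horse"),   # man --riding--> horse
        ("man", "wearing", "hat"),    # man --wearing--> hat
    ],
}

# Each directed edge reads as a subject-predicate-object triplet.
for subj, pred, obj in scene_graph["edges"]:
    print(f"{subj} --{pred}--> {obj}")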

Now, how do we generate a SG?
* Detect the objects in the image.
* Predict the relationships between the detected objects (see the sketch below).
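In code, this two-stage pipeline might look like the following sketch. Both detect_objects and classify_predicate are hypothetical placeholders, standing in for a trained object detector (e.g., Faster R-CNN) and a relationship classifier respectively:

from itertools import permutations

def generate_scene_graph(image, detect_objects, classify_predicate):
    """Two-stage scene graph generation (hypothetical interfaces).

    detect_objects(image)           -> list of detections, each with a .label
    classify_predicate(image, s, o) -> predicate string for the pair, or None
    """
    detections = detect_objects(image)               # step 1: find the objects
    edges = []
    for subj, obj in permutations(detections, 2):    # step 2: score each pair
        predicate = classify_predicate(image, subj, obj)
        if predicate is not None:
            edges.append((subj.label, predicate, obj.label))
    return detections, edges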

Note how most of the objects are grounded in the image.

Why is generating good SGs hard?

The go-to dataset for SGs has been the Visual Genome dataset. Here is a sample from the dataset, showing how a scene graph is actually stored in JSON format:

[...
  {
    "image_id": 2407890,
    "objects": [...
      {
        "object_id": 1023838,
        "x": 324,
        "y": 320,
        "w": 142,
        "h": 255,
        "name": "cat",
        "synsets": ["cat.n.01"]
      },
      {
        "object_id": 5071,
        "x": 359,
        "y": 362,
        "w": 72,
        "h": 81,
        "name": "table",
        "synsets": ["table.n.01"]
      },
    ...],
    "relationships": [...
      {
        "relationship_id": 15947,
        "predicate": "wears",
        "synsets": ["wear.v.01"],
        "subject_id": 1023838,
        "object_id": 5071
      }
    ...]
  },
...]

For each image, a list of the objects in the image is stored, along with a list of relationships: each relationship connects a subject and an object through a relation predicate. Here is an example of a relationship:

“Motorcycle on street”: motorcycle (subject), on (predicate), street (object)
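To make the format concrete, here is a small sketch that resolves each relationship record into such a triplet. It assumes the annotations above have been saved locally as scene_graphs.json (the file name is an assumption):

import json

with open("scene_graphs.json") as f:
    images = json.load(f)

for entry in images[:1]:
    # Index objects by id so relationships can be resolved to names.
    objects = {o["object_id"]: o["name"] for o in entry["objects"]}
    for rel in entry["relationships"]:
        subj = objects.get(rel["subject_id"], "?")
        obj = objects.get(rel["object_id"], "?")
        print(subj, rel["predicate"], obj)   # e.g. "motorcycle on street"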

One typical problem is that the relation predicates in such datasets are mostly prepositional. SG models trained on such datasets therefore learn to predict “car on street” rather than “car parked on street”. Prepositional relationships are trivial; we would like to predict more intricate relationships.

A bar graph denoting the frequency of predicates sampled from the Visual Genome dataset. The most frequent predicates happen to be prepositional, and finer relationships appear at the tail end.
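This long tail is easy to measure yourself. A minimal sketch, under the same scene_graphs.json assumption as above:

from collections import Counter
import json

with open("scene_graphs.json") as f:
    images = json.load(f)

# Count how often each predicate appears across the dataset.
freq = Counter(
    rel["predicate"].lower()
    for entry in images
    for rel in entry["relationships"]
)

# The head of the distribution is dominated by prepositions ("on", "has",
# "in"), while finer predicates ("parked on", "standing on") sit in the tail.
for predicate, count in freq.most_common(10):
    print(f"{predicate:15s} {count}")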

Unbiased Scene Graph Generation

Input Image
(Left) A biased scene graph. (Right) An unbiased scene graph generated by the proposed unbiased prediction from the same model.

In the paper titled “Unbiased Scene Graph Generation from Biased Training” by Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang, a framework is proposed to remove bad bias from the predictions. What are good and bad bias? Good bias is common-sense knowledge, e.g. “person reads book” rather than “person eats book”. Bad bias (or long-tailed bias) is the tendency to prefer a frequent, generic predicate like “on” over a finer one like “parked on” or “standing on”. The problem with traditional debiasing techniques, such as upsampling or elaborately designed losses, is that they cannot distinguish between these two biases. The paper discusses an approach based on causal inference to eliminate the bad bias, which can be adopted and used on top of many existing Scene Graph Generation (SGG) models (e.g., MOTIFS, VCTree).

How do we eliminate bad bias?

Humans understand an image by looking at both the content and the context. For SGG models, content refers to the visual features of the subject and the object, each individually, while context refers to the visual features of the subject-object union region and the pairwise object classes (labels). We would want the model to predict a relationship between subject and object by focusing on the content, not just the context. Why is that?

Here is a nice illustration from the paper to give you a broad idea:

As you can see in the above image (top), a typical scene graph model, after training, outputs “on” as the most probable predicate. One reason for this could be that the model has seen many instances of “dog” and “on” together in the dataset and learned to predict “on”. But what if we focused on the dog’s visual features, its straight legs in the picture? This increases the chance of the model predicting “standing on”. But how would we teach the model to look only at content?

(Biased output based on content + context) - (Output based on context only) = Output based on content

To see the effect of context alone on the predicates, one can wipe out the visual features of the subject and object from the image. This shows how context impacts the final predicates. Finally, comparing the biased predicates with the purely context-based predicates gives us the unbiased predicates, capturing finer relationships. Those specific visual cues of objects are the key to more fine-grained and informative unbiased predictions. For example, even if the overall prediction is biased towards a relationship like “dog on surfboard”, the straight legs of the dog would push the prediction towards “standing on” rather than “sitting on” or plain “on”.
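Here is a toy numeric example of the subtraction above; all logit values are made up purely for illustration:

import numpy as np

predicates = ["on", "standing on", "sitting on"]

# Hypothetical logits from the full (content + context) prediction: biased
# towards the frequent, generic "on".
logits_full = np.array([6.0, 4.5, 3.0])

# Hypothetical logits with the subject/object visual features wiped out,
# i.e. the prediction driven by context alone.
logits_context_only = np.array([5.5, 2.5, 2.0])

# Subtracting removes the context-induced bias; what is left is the effect
# of the content (the dog's straight legs).
logits_content = logits_full - logits_context_only
print(predicates[int(np.argmax(logits_full))])     # -> "on"
print(predicates[int(np.argmax(logits_content))])  # -> "standing on"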

We humans are capable of analyzing visual images based on the main effect (content) rather than merely the side effect (context). But models are usually likelihood-based, hence they cannot distinguish between the main effect and the side effect.

What does the paper suggest?

The paper proposes a causal model based on counterfactual thinking to distinguish between content and context. With counterfactual thinking, one essentially answers the question: “If I had not seen the content, would I have still made the same prediction?”

Viewing Scene Graph Generation as a Causal Graph

At a higher level, each SGG model can be seen as a causal graph, with each node representing data features and each link representing data flow. Let us consider each node in more detail. Node I refers to the input image, fed to a pretrained Faster R-CNN that outputs a set of bounding boxes for the objects in the image. Node X refers to the visual features of the subject and the object extracted from the image, Node Z to the one-hot vectors of the subject-object labels, and Node Y to the final predicate logits.
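As a rough sketch in code (the architecture and dimensions here are invented for illustration; real SGG models fuse X and Z in far more elaborate ways):

import torch
import torch.nn as nn

class CausalSGGSketch(nn.Module):
    """Hypothetical sketch of the causal graph: I -> X -> Y and I -> Z -> Y.

    Node I (the image plus Faster R-CNN) is assumed to have already produced
    pair_features (node X) and pair_labels (node Z) for each object pair.
    """

    def __init__(self, feat_dim=512, num_classes=150, num_predicates=50):
        super().__init__()
        self.label_embed = nn.Linear(2 * num_classes, feat_dim)    # Z branch
        self.classifier = nn.Linear(2 * feat_dim, num_predicates)  # node Y

    def forward(self, pair_features, pair_labels):
        # pair_features: visual features of a subject-object pair (node X)
        # pair_labels:   concatenated one-hot subject/object labels (node Z)
        z = self.label_embed(pair_labels)
        return self.classifier(torch.cat([pair_features, z], dim=-1))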

Other relevant terms in this context are:

1. Intervention: it wipes out all incoming links of a node and forces the node to take a certain value.

Representation of an intervention at node X

2. Counterfactual: it means counter to the facts. Refer to the above diagram (c), wherein node X is intervened on but node Z still remains the same as if X had existed as before.
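Continuing the hypothetical CausalSGGSketch from above, an intervention and its counterfactual might look like this (the mean-feature baseline is one common choice; a zero vector is another):

import torch

model = CausalSGGSketch()
pair_features = torch.randn(8, 512)   # node X, observed visual features
pair_labels = torch.randn(8, 300)     # node Z (one-hot vectors in practice)

# do(X = x_bar): wipe the visual features, while node Z counterfactually
# keeps the value it took when X was observed, as in diagram (c).
x_bar = pair_features.mean(dim=0, keepdim=True).expand_as(pair_features)

y_factual = model(pair_features, pair_labels)   # factual prediction
y_counterfactual = model(x_bar, pair_labels)    # counterfactual prediction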

Notations

A brief discussion on notation used in the paper.

Y_x(u): notation for the predicate logits in the biased generation (the top graph in Fig. (c))

This notation is used in the paper to indicate the set of predicate logits output by node Y when X takes its observed value x; u refers to the input image (Node I).

Y_x̄(u): predicate logits after intervening on variable X

The above notation indicates the set of output predicate logits when variable X is intervened on, i.e. do(X = x̄), where x̄ is a dummy value (such as the mean feature or a zero vector).

Y_x̄,z(u): notation for the predicate logits generated upon intervention on X, with Z counterfactually kept at its original value z

This is the set of predicate logits upon intervention on X, but with counterfactuality imposed on node Z: Z keeps the value z = Z_x(u) it would have taken had X retained its original value.

Total Direct Effect

The key idea is that the likelihood tends to be biased, hence the unbiased prediction lies in the difference between the observed outcome, the original biased prediction (Fig. (b)), and its counterfactual alternate, which captures the context-specific bias we would like to remove.

So the final output predicate logits can be calculated as follows:

y_e = Y_x(u) - Y_x̄,z(u): the final unbiased predicate logits, the Total Direct Effect (TDE)
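In code, the whole debiasing step reduces to one subtraction at inference time. A minimal sketch, reusing the hypothetical model and baseline from the snippets above:

def total_direct_effect(model, pair_features, pair_labels, x_bar):
    # Y_x(u): factual, biased logits (content + context).
    y_factual = model(pair_features, pair_labels)
    # Y_x̄,z(u): counterfactual logits with X wiped but Z kept (context only).
    y_counterfactual = model(x_bar, pair_labels)
    # TDE = Y_x(u) - Y_x̄,z(u): the context-specific bias is subtracted out.
    return y_factual - y_counterfactual

The predicate with the highest TDE logit becomes the final, unbiased prediction; note that no retraining of the underlying SGG model is needed.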

In Summary

The advantage of this approach over other debiasing methods is that it helps the model distinguish between good and bad bias, and it doesn’t require any additional training layers: it can be used on top of many existing scene graph models. Distinguishing between the two biases helps overcome the scarcity of action predicates in most scene graph datasets and lets the model predict fine-grained predicates instead of trivial ones.

Saketh is participating in the short-term research program at OffNote Labs. He is working on the problems of Image and Dense Video Captioning.

Connect with him on LinkedIn!
