Introducing 🧱 CRAFT — A Benchmark for Causal Reasoning About Forces and inTeractions

Tayfun Ates
Published in HUCVL Stories · Dec 8, 2020
Marble Experiment by Kaplamino on YouTube.

Watch the above video for a couple of seconds. Although some of the actions in this video are hard to anticipate before they happen, we can predict many of them, even without having seen the video before, because we understand the environment and can reason about the causal relationships between the events the objects take part in. We can estimate how high a marble bounces after it hits the static ground, or how far a marble travels after being hit by another, even though we do not know the exact values of the dynamic and static object properties. In other words, we do not need accurate information about the force dynamics between the objects to make these estimates. Our predictions are based on approximate guesses about the environment we are observing at the moment, not on exact Newtonian calculations. This human ability to understand and make approximate predictions about an environment consisting of different objects and their interactions is known as intuitive physics (Kubricht et al., 2017).

Although recent advances in machine learning have narrowed the gap between humans and computers, there are still research areas in which artificial systems do not perform as well as humans. The ability to understand physical actions and reason about causal relationships is one of these areas, and it deserves more attention from AI researchers if this gap is to be closed. In the last couple of years, a new research direction has emerged that aims to bring similar capabilities to robots by combining recent discoveries in cognitive science and machine learning. Improving physical reasoning skills, for example by teaching counterfactual situations, can let robots understand and estimate what would happen if they performed an action that changes the scene, without actually performing it. The Jenga-playing robot of Fazeli et al. (2019) is one recent example in this direction.

In recent years, AI researchers have become increasingly interested in building models with reasoning capabilities about intuitive physics. While Mottaghi et al. (2016) studied the problem of predicting whether a configuration of 3D objects is stable, Lerer et al. (2016) tried to estimate where objects will fall if the configuration is not stable. Janner et al. (2019) developed a model that builds a stack configuration by placing objects one by one using a planning algorithm, and Mottaghi et al. (2016) tried to infer the motion trajectories of query objects under certain forces from static images. Very recently, Bakhtin et al. (2019) and Allen et al. (2020) created the PHYRE and Tools benchmarks, respectively, which consist of different 2D environments. In these benchmarks, a model must reason about the environment and select a specific action by estimating whether it solves the task associated with that environment.

In 🧱CRAFT, similar to PHYRE and Tools, we have different 2D environments consisting of static and dynamic objects. However, we also integrate a language component, which is missing in PHYRE and Tools, by framing our benchmark as a visual question answering task. Our dataset contains synthetically generated videos of physical interactions between objects, accompanied by questions that test strong reasoning capabilities. Answering CRAFT questions requires detecting objects; understanding and tracking the relations between objects, which can be attributed to causing, enabling, or preventing certain types of events; and, lastly, estimating what would have happened if the environment had been changed. There are other visual question answering benchmarks that test physical reasoning (Wagner et al., 2018 and Yi et al., 2020), but they lack the visual variation of PHYRE and Tools. In that sense, CRAFT combines the best of both worlds.

The 🧱CRAFT Dataset

In our recent paper, which will be presented at the 2nd Shared Visual Representations in Human & Machine Intelligence (SVRHM) Workshop at NeurIPS 2020, we introduce CRAFT, a new visual question answering benchmark for Causal Reasoning About Forces and inTeractions. The first version of CRAFT includes 38K video-question pairs that are automatically generated from 3K videos.

Example CRAFT questions from a sample scene. There are 10 different scene layouts and 65 different question types, which are divided into five distinct categories. Besides tasks that question descriptive attributes and possibly require temporal reasoning, CRAFT proposes new challenges, including more complex tasks that need single or multiple counterfactual analyses or an understanding of object intentions for deep causal reasoning.

More examples from our benchmark can be found at our project website.

Video Generation: CRAFT visuals are generated using the Box2D physics engine (Catto, 2010). We created 10-second videos for 10 distinct scene layouts, at a resolution of 256×256 pixels.
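As a rough illustration, the snippet below shows how such a clip could be produced with pybox2d, the Python binding of Box2D. It is a minimal sketch under the numbers given above (10-second clips, 256×256 frames), not the authors' actual generation pipeline; render_frame is a hypothetical rasterizer.

```python
# Minimal sketch of a CRAFT-style 2D simulation with pybox2d (not the authors'
# actual pipeline). Scene elements and dynamic objects are regular Box2D bodies.
from Box2D import b2World, b2PolygonShape

FPS = 60                       # assumed frame rate
N_FRAMES = 10 * FPS            # CRAFT clips are 10 seconds long

world = b2World(gravity=(0.0, -10.0))

# A static scene element, e.g. the ground (static objects are drawn in black).
world.CreateStaticBody(position=(0, 0), shapes=b2PolygonShape(box=(50, 1)))

# A dynamic object, e.g. a small circle dropped into the scene.
ball = world.CreateDynamicBody(position=(0, 20))
ball.CreateCircleFixture(radius=1.0, density=1.0, friction=0.3, restitution=0.5)

frames = []
for _ in range(N_FRAMES):
    world.Step(1.0 / FPS, 8, 3)  # advance the physics simulation by one frame
    # frames.append(render_frame(world, size=(256, 256)))  # hypothetical 256x256 rasterizer
```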

Variations within each scene layout. While generating the videos, the attributes of the static scene elements are sampled from a uniform distribution. Opaque illustrations demonstrate the mean positions whereas transparent drawings show the extreme cases.

Objects: Static objects (ramp, platform, basket, left wall, right wall, ground) and dynamic objects (cube, triangle, circle) are the two main types of objects in CRAFT environments. Dynamic objects come in 8 colors (gray, red, blue, green, brown, purple, cyan, yellow), whereas static objects are drawn in black. Furthermore, CRAFT uses 2 object sizes (small, large).
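For concreteness, here is one way a single object annotation could be represented. The schema below is our own assumption made for illustration; only the attribute values come from the description above.

```python
# Hypothetical object annotation schema; attribute values follow the text above.
from dataclasses import dataclass

DYNAMIC_SHAPES = ("cube", "triangle", "circle")
STATIC_TYPES = ("ramp", "platform", "basket", "left wall", "right wall", "ground")
COLORS = ("gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow")
SIZES = ("small", "large")

@dataclass
class SceneObject:
    kind: str       # one of DYNAMIC_SHAPES or STATIC_TYPES
    color: str      # one of COLORS for dynamic objects, "black" for static ones
    size: str       # "small" or "large"
    dynamic: bool   # True for cube/triangle/circle, False for static scene elements
```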

Events: We automatically detect different types of events in our simulations: Start, End, Collision, Touch Start, Touch End, and In Basket. Each scene contains a single basket (also called the container), and the In Basket event is triggered when an object enters it. Although our tasks only ask about collision events, CRAFT expects models to distinguish between colliding and touching. The CRAFT dataset also contains the causal graph of the events in each simulation, which lets our question generator extract causal relationships automatically.
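The toy example below illustrates the kind of event log and causal graph this implies. The field names and the graph encoding are assumptions for illustration, not the released annotation format.

```python
# Toy event log and causal graph for a single simulation (illustrative only).
EVENT_TYPES = ("Start", "End", "Collision", "Touch Start", "Touch End", "In Basket")

events = [
    {"id": 0, "type": "Start",     "frame": 0,   "objects": []},
    {"id": 1, "type": "Collision", "frame": 143, "objects": ["small red circle", "ramp"]},
    {"id": 2, "type": "In Basket", "frame": 310, "objects": ["small red circle"]},
    {"id": 3, "type": "End",       "frame": 600, "objects": []},
]

# Directed edges point from an event to the events it gives rise to, so the
# question generator can read causal chains directly off the graph.
causal_graph = {0: [1], 1: [2], 2: [], 3: []}
```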

Question Generation: Our question generator extends the CLEVR engine of Johnson et al. (2017) to the temporal domain so that questions can be generated for video inputs. For each CRAFT task, there is a ground-truth functional program that is sufficient to answer the question at hand. These tasks fall into five categories: descriptive, counterfactual, cause, enable, and prevent.

Descriptive Questions: CRAFT’s descriptive tasks require predicting object attributes or counting the objects that satisfy certain conditions. Temporal analysis is also needed to answer some of the descriptive questions correctly. In addition to understanding the relationships between dynamic objects, models must infer the relationships between static and dynamic objects across the 10 layouts. A sketch of a functional program for the first template is given after the examples below.

  • How many objects enter the basket before the <size> <color> <shape> enters the basket?
  • How many objects fall to the ground?
  • After falling to the ground, does the <size> <color> <shape> collide with other objects?
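As promised above, here is a rough sketch of what a CLEVR-style functional program for the first template might look like. The module names are hypothetical and chosen only to make the composition explicit.

```python
# Hypothetical functional program for:
# "How many objects enter the basket before the <size> <color> <shape> enters the basket?"
program = [
    {"function": "filter_size",      "inputs": ["scene"], "value": "<size>"},
    {"function": "filter_color",     "inputs": [0],       "value": "<color>"},
    {"function": "filter_shape",     "inputs": [1],       "value": "<shape>"},  # unique query object
    {"function": "in_basket_event",  "inputs": [2]},   # when the query object enters the basket
    {"function": "events_before",    "inputs": [3]},   # all events earlier in the video
    {"function": "filter_in_basket", "inputs": [4]},   # keep only "In Basket" events
    {"function": "count_objects",    "inputs": [5]},   # the answer is an integer
]
```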

Counterfactual Questions: CRAFT’s counterfactual tasks require understanding what would happen if an object were removed from the scene. Moreover, some counterfactual tasks require performing this analysis multiple times, removing multiple objects in turn. To answer these questions, models must somehow predict the resulting counterfactual video(s) and reason over them, as sketched after the examples below.

  • Does the <size-2> <color-2> <shape-2> enter the basket, if the <size> <color> <shape> is removed?
  • Does the <size> <color> <shape> fall to the ground, if any other single one of the objects is removed?
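A minimal sketch of this counterfactual analysis, assuming hypothetical helpers run_simulation and object_falls_to_ground, looks as follows:

```python
# Counterfactual analysis sketch: remove an object, re-simulate, and check the outcome.
# run_simulation and object_falls_to_ground are hypothetical helpers.
def counterfactual_falls(scene, removed_object, target_object):
    """Would target_object still fall to the ground if removed_object were removed?"""
    modified_scene = [obj for obj in scene if obj is not removed_object]
    events = run_simulation(modified_scene)       # re-run physics on the edited scene
    return object_falls_to_ground(events, target_object)
```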

Cause, Enable, and Prevent Questions: CRAFT extends the question categories above, inspired by theories from cognitive science that model how humans learn, experience, and perform physical reasoning about the events occurring in an environment. Among these are mental model theory (Khemlani et al., 2014), causal model theory (Sloman et al., 2009), and force dynamics theory (Wolff and Barbey, 2015), whose main aim is to represent causal relationships, such as cause, enable, and prevent, between two entities. The first entity is called the affector, and the entity that the affector acts on is called the patient. To our knowledge, CRAFT is the first benchmark that integrates these complex relations into a visual question answering setting to help machine learning models get closer to human intelligence.

Besides investigating the counterfactual video(s), models must also analyze the original video in order to predict the correct answer for cause, enable, and prevent questions in CRAFT. The affector and the patient objects are explicitly specified in the question text. In cause, enable, and prevent questions that require counting, there is a single affector but possibly multiple patients for the same question and video pair. One possible counterfactual reading of "prevent" is sketched after the examples below.

  • Does the <size> <color> <shape> enable the <size-2> <color-2> <shape-2> to fall to the ground?
  • How many objects does the <size> <color> <shape> prevent from colliding with the basket?
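As mentioned above, one illustrative counterfactual reading of "prevent" is the following. This is a sketch of the idea, not necessarily CRAFT's exact operationalization; run_simulation_without and event_happens are hypothetical helpers passed in by the caller.

```python
# Illustrative counterfactual contrast for "prevent": the patient's event does not
# happen in the original video, but does happen once the affector is removed.
def prevents(original_events, affector, patient, event_happens, run_simulation_without):
    counterfactual_events = run_simulation_without(affector)
    return (not event_happens(original_events, patient)
            and event_happens(counterfactual_events, patient))
```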

Results

So far, we have conducted experiments with some simple baseline models. Our first baseline is the Most Frequent Answer (MFA) model, which extracts the most frequent answer in the dataset’s training split and outputs it for every question. Our second baseline is the Answer Type-Based Most Frequent Answer model, which extracts the most frequent answer for each answer type (such as color, shape, or boolean) in the same split and then outputs the corresponding value for every question of that type. Our third baseline is a blind LSTM model, built on the well-known Long Short-Term Memory architecture (Hochreiter and Schmidhuber, 1997), which is trained on CRAFT’s textual data only. Our last artificial baseline is an LSTM-CNN model, which extracts natural language features with an LSTM and visual features with a Residual Network (He et al., 2016); the types of visual features we have experimented with so far are first-frame features and last-frame features. This model concatenates the visual and textual features and produces an answer accordingly. A sketch of the two frequency-based baselines follows.
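The two frequency-based baselines can be written in a few lines, assuming training data is given as (question, answer, answer_type) triples. This mirrors the description above rather than the exact implementation.

```python
# Frequency-based baselines, assuming (question, answer, answer_type) training triples.
from collections import Counter

def most_frequent_answer(train):
    """MFA: always predict the single most common answer in the training split."""
    best = Counter(answer for _, answer, _ in train).most_common(1)[0][0]
    return lambda question, answer_type: best

def answer_type_based_mfa(train):
    """Per-type MFA: predict the most common answer among questions with the same answer type."""
    per_type = {}
    for _, answer, answer_type in train:
        per_type.setdefault(answer_type, Counter())[answer] += 1
    fallback = most_frequent_answer(train)
    return lambda question, answer_type: (
        per_type[answer_type].most_common(1)[0][0]
        if answer_type in per_type
        else fallback(question, answer_type)
    )
```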

Besides these artificial baselines, we also conducted a small experiment with human subjects. We sampled 522 question and video pairs and tested 12 adults. In the experiment, each participant watched a video, provided an answer for the corresponding question, and then continued with the next one. In total, we obtained answers for 489 question and video pairs.

The results of the artificial models and the human subjects are provided below:

Performance of all baseline models on CRAFT training, validation and test splits. We report the average accuracy. C, CF, D, E and P columns stand for Cause, Counterfactual, Descriptive, Enable and Prevent tasks, respectively.

Although the artificial models are very simple baselines, there is still a large gap between them and humans. As future work, we are excited to be working on different state-of-the-art models in order to close this gap. We also plan to scale up our experiment with human subjects to obtain more reliable data and to extend CRAFT in different directions. We welcome any discussions and possible collaborations to broaden the impact of our research project.

References

James R. Kubricht, Keith J. Holyoak, and Hongjing Lu. Intuitive physics: Current research and controversies. Trends Cogn. Sci., 21(10):749–759, 2017.

Nima Fazeli, Miquel Oller, Jiajun Wu, Zheng Wu, Joshua B. Tenenbaum, and Alberto Rodriguez. See, feel, act: Hierarchical learning for complex manipulation skills with multisensory fusion. Science Robotics, 4(26), 2019.

Roozbeh Mottaghi, Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. Newtonian scene understanding: Unfolding the dynamics of objects in static images. In CVPR, pages 3521–3529, 2016.

Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. ICML, 2016.

Michael Janner, Sergey Levine, William T. Freeman, Joshua B. Tenenbaum, Chelsea Finn, and Jiajun Wu. Reasoning about physical interactions with object-oriented prediction and planning. In ICLR, 2019.

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning. In NeurIPS, pages 5082–5093, 2019.

Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. arXiv preprint arXiv:1907.09620, 2020.

Misha Wagner, Hector Basevi, Rakshith Shetty, Wenbin Li, Mateusz Malinowski, Mario Fritz, and Ales Leonardis. Answering visual what-if questions: From actions to predicted scene descriptions. In ECCV Workshops, 2018.

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. In ICLR, 2020.

Erin Catto. Box2D v2.0.1 user manual. 2010.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pages 2901–2910, 2017.

Sangeet S. Khemlani, Aron K. Barbey, and Philip N. Johnson-Laird. Causal reasoning with mental models. Front. Hum. Neurosci., 8:849, 2014.

Steven Sloman, Aron K. Barbey, and Jared M. Hotaling. A causal model theory of the meaning of cause, enable, and prevent. Cognitive Science, 33(1):21–50, 2009.

Phillip Wolff and Aron K. Barbey. Causal reasoning with forces. Front. Hum. Neurosci., 9:1, 2015.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
