Towards Declarative Visual Reasoning. . . Or Not?
Exploring how SingularityNET will advance the research and applicability of tasks that require reasoning and pattern recognition.
Compositional Visual Reasoning
Deep Neural Networks (DNNs) have shown undeniably impressive results on a variety of hard-to-solve problems, especially in image analysis and pattern recognition.
There are still many tasks, however, for which Deep Neural Networks are not well suited. These are tasks of a compositional or combinatorial nature, in particular those involving exponentially large spaces of objects; roughly speaking, tasks that require “reasoning.” Although attempts are being made to extend DNN models, other approaches perform better on such tasks; one prominent example is algorithmic induction.
Furthermore, some tasks require both reasoning and pattern recognition. One example is Visual Question Answering (VQA), in which humans still outperform DNN solutions by a large margin: state-of-the-art DNN models score less than 70% on the standard COCO VQA benchmark, while humans score more than 83%. Moreover, such DNN models are heavily criticised for exploiting dataset biases and directly mapping inputs to outputs instead of explicitly modeling the underlying reasoning processes.
The CLEVR dataset is specifically designed to facilitate compositional reasoning. It contains arbitrarily nested questions like “What size is the sphere that is left of the green metal thing that is left of the small cylinder?”, albeit over a simplistic visual domain. Such questions cannot be put into a fixed-size embedding as is usually done in COCO VQA models. Consequently, the state-of-the-art models for this benchmark are the neural module networks (for details, see the paper mentioned earlier). Some of these models are even designed to be transparent and interpretable.
However, these models are based on a seq2seq mapping of questions to programs composed of a sequence of neural modules with partially hard-coded behavior. For example, the union block in the program below merges two attention maps by taking their element-wise maximum. The program itself looks like:
00 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
01 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
02 = {'inputs': [0], 'function': 'filter_size', 'value_inputs': ['small']}
03 = {'inputs': [2], 'function': 'filter_color', 'value_inputs': ['blue']}
04 = {'inputs': [1], 'function': 'filter_size', 'value_inputs': ['large']}
05 = {'inputs': [4], 'function': 'filter_color', 'value_inputs': ['purple']}
06 = {'inputs': [5], 'function': 'filter_material', 'value_inputs': ['metal']}
07 = {'inputs': [3, 6], 'function': 'union', 'value_inputs': []}
08 = {'inputs': [7], 'function': 'filter_shape', 'value_inputs': ['cube']}
09 = {'inputs': [8], 'function': 'count', 'value_inputs': []}
The filter modules in the example above accept the attention maps produced by previous modules; these maps are used to mask the input image (represented by its high-level ResNet-101 features), and each filter produces a new attention map. At the end of the program there is typically a query module (which also takes feature and attention maps as input, but produces a new feature map as output), followed by a classifier.
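To make this concrete, here is a rough sketch (our own PyTorch-style illustration, not the architecture from the paper; layer sizes and names are invented) of how a filter module and the union operation might look:

import torch
import torch.nn as nn

class FilterModule(nn.Module):
    """Hypothetical filter module: masks image features with the incoming
    attention map and predicts a new attention map (one value per spatial cell)."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, features, attention):
        # features: (B, C, H, W) ResNet feature map; attention: (B, 1, H, W) in [0, 1]
        return self.net(features * attention)

def union(att_a, att_b):
    """The union module merges two attention maps by an element-wise maximum."""
    return torch.max(att_a, att_b)

# toy usage
feats = torch.randn(1, 1024, 14, 14)
full_attention = torch.ones(1, 1, 14, 14)      # 'scene' starts from uniform attention
att_small = FilterModule()(feats, full_attention)   # e.g. filter_size[small]
att_blue = FilterModule()(feats, att_small)         # e.g. filter_color[blue]
merged = union(att_small, att_blue)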
The programs are run step by step by an execution engine written in Python. This is, in essence, an imperative domain-specific language into which a fairly simple model translates questions, without any use of explicit knowledge about visual concepts and their relations (even though mere classifiers can benefit from the use of knowledge graphs). Does this really correspond to the “underlying reasoning processes”?
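A toy version of such an engine (a sketch, not the actual code) might simply walk the program and dispatch each step to the corresponding module; here modules is an assumed lookup table from (function, value_inputs) pairs to callables:

def execute(program, modules, image_features):
    """Minimal sketch of an imperative execution engine for programs like the one above."""
    outputs = []
    for step in program:
        module = modules[(step['function'], tuple(step['value_inputs']))]
        inputs = [outputs[i] for i in step['inputs']]
        outputs.append(module(image_features, *inputs))
    return outputs[-1]   # the last module's output goes to the answer classifier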
It is therefore tempting to replace the imperative-style domain-specific execution engine with a declarative, knowledge-based, general-purpose reasoning system, making the approach even more transparent and interpretable.
Declarativization of Visual Reasoning
As we discussed in one of our previous posts, cognitive architectures can be used to integrate the DNN image analysis capabilities with general-purpose reasoning engines over knowledge bases to enable cognitive VQA.
For simple questions, the imperative question-answering programs can be easily converted to declarative queries.
Consider, for instance, the question “What color is the cylinder?” The corresponding program will look like:
00 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
01 = {'inputs': [0], 'function': 'filter_shape', 'value_inputs': ['cylinder']}
02 = {'inputs': [1], 'function': 'query_color', 'value_inputs': []}
The query_color module takes as input the image features masked by the attention map produced by the filter_shape[cylinder] module and outputs a new feature map, which is then passed to the final classifier. The multinomial classifier calculates the probabilities of all possible answers.
However, we can explicitly introduce the query variable $X, replace the query module with the corresponding filter module filter_color[$X], and ask the reasoning engine to find a value of $X that produces a non-empty final attention map.
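Concretely, the search for $X could be carried out by looping over the known color concepts and keeping those whose filter yields a non-empty attention map; the color list, the filter_color callable and the emptiness threshold below are illustrative assumptions:

COLORS = ['gray', 'red', 'blue', 'green', 'brown', 'purple', 'cyan', 'yellow']

def ground_color(filter_color, image_features, cylinder_attention, threshold=0.5):
    """Ground $X: keep the colors whose filter still attends to something when
    applied on top of the attention map produced for 'cylinder'."""
    groundings = []
    for color in COLORS:
        attention = filter_color(color, image_features, cylinder_attention)
        if float(attention.max()) > threshold:   # non-empty final attention map
            groundings.append(color)
    return groundings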
Therefore, the question “What color is the cylinder?” can be represented declaratively using OpenCog’s Atomese language:
(AndLink
  (InheritanceLink
    (VariableNode "$X")
    (ConceptNode "color"))
  (EvaluationLink
    (GroundedPredicateNode "py:filter")
    (ConceptNode "cylinder"))
  (EvaluationLink
    (GroundedPredicateNode "py:filter")
    (VariableNode "$X")))
In the example above, the predicates are grounded in deep neural networks, which can be the same filter modules as in neural module networks, processing high-level image features (omitted here for brevity).
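As a rough illustration of such grounding (the actual OpenCog binding details are omitted), a Python function behind “py:filter” could run the corresponding neural filter and turn the resulting attention map into a truth value; filter_modules and the threshold are assumptions:

# Hypothetical grounding of (GroundedPredicateNode "py:filter"). In OpenCog such a
# function receives the argument Atoms and is expected to return a truth value;
# here the AtomSpace wiring is omitted and plain Python types are used instead.
def filter_predicate(concept_name, filter_modules, image_features, attention, threshold=0.5):
    """Run the neural filter for `concept_name` (e.g. 'cylinder', or a candidate
    grounding of $X) and report whether anything in the scene matches."""
    new_attention = filter_modules[concept_name](image_features, attention)
    strength = float(new_attention.max())
    return strength > threshold   # OpenCog would wrap this as a TruthValue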
OpenCog’s reasoning system (the Pattern Matcher or Probabilistic Logic Networks, PLN) will try to find a grounding for the variable $X for which the whole conjunction is true. This is why we don’t need both filter-attribute and query-attribute modules: a query-attribute module followed by a classifier is the imperative way to determine an appropriate answer. The knowledge about ontological relations between concepts (e.g., “red” and “color”) is explicitly utilized through the InheritanceLink.
In one of our previous posts, we used models which deal with bounding boxes. OpenCog’s reasoning system had to find the bounding boxes and apply grounded predicates to them to satisfy the declarative query. Consequently, grounded predicates could produce just one truth value per bounding box, and truth values of logical expressions over these predicates could be easily calculated with the use of PLN.
However, most models for CLEVR deal with dense attention maps. This raises some interesting questions: should the declarative reasoning system attend to each element of the dense attention maps? Should it necessarily deal with extracted objects? Or should it operate on whole attention maps as tensor truth values?
All of these options are possible, each with its own strengths and weaknesses, and possibly they should be combined in some way. Humans, for instance, can deliberately attend to individual pixels in an image, but we don’t usually reason consciously about each pixel.
For the sake of simplicity, however, let’s assume that we choose to deal with attention maps as a whole. In such a scenario, we will be able to see some additional differences between the use of module networks in imperative and declarative fashions. Consider the program for a slightly more complicated question like “What color is the large cylinder?”
The first filter, size[large], takes the image feature maps and the default attention map (covering the whole image) and produces an attention map that highlights large objects. The next filter, shape[cylinder], takes the image features masked by the attention map produced by the previous filter and outputs a new attention map, which is then used to mask the image features once again; these features are passed to the query_color module, which produces new features fed to the final classifier. Note also that the result can differ for a different order of filters (especially if we consider two attribute filters), which seems somewhat strange.
In contrast, a less imperative form of this program would be the same as above, but with the additional conjunct (EvaluationLink (GroundedPredicateNode "py:filter") (ConceptNode "large")). That is, the attention maps would be produced by all filters independently and then And'ed together (which can be done pixel-wise).
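A minimal sketch of such a pixel-wise conjunction, assuming attention maps are tensors with values in [0, 1]; element-wise minimum is one possible fuzzy And, element-wise product is another:

import torch

def and_attention(*attention_maps):
    """Fuzzy pixel-wise conjunction of attention maps treated as tensor truth values."""
    result = attention_maps[0]
    for att in attention_maps[1:]:
        result = torch.minimum(result, att)
    return result

# e.g. for "What color is the large cylinder?": run filter[large], filter[cylinder]
# and filter[$X] independently, then And their attention maps pixel-wise.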
Query-attribute modules can also appear in the middle of a program rather than at its end. For example, for the question “Is the small gray object made of the same material as the big cube?” there will be two query_material modules executed after the consecutive filtering for small+gray and large+cube. They produce two feature maps, which are then passed to a comparison module:
00 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
01 = {'inputs': [0], 'function': 'filter_size', 'value_inputs': ['small']}
02 = {'inputs': [1], 'function': 'filter_color', 'value_inputs': ['gray']}
03 = {'inputs': [2], 'function': 'query_material', 'value_inputs': []}
04 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
05 = {'inputs': [4], 'function': 'filter_size', 'value_inputs': ['large']}
06 = {'inputs': [5], 'function': 'filter_shape', 'value_inputs': ['cube']}
07 = {'inputs': [6], 'function': 'query_material', 'value_inputs': []}
08 = {'inputs': [3, 7], 'function': 'equal_material', 'value_inputs': []}
This program can be converted into a declarative form. For example:
(AndLink
  (Inheritance (Variable "$X") (Concept "material"))
  (Inheritance (Variable "$Y") (Concept "material"))
  (EqualLink (Variable "$X") (Variable "$Y"))
  (AndLink
    (Evaluation (GroundedPredicate "py:filter") (Concept "small"))
    (Evaluation (GroundedPredicate "py:filter") (Concept "gray"))
    (Evaluation (GroundedPredicate "py:filter") (Variable "$X")))
  (AndLink
    (Evaluation (GroundedPredicate "py:filter") (Concept "large"))
    (Evaluation (GroundedPredicate "py:filter") (Concept "cube"))
    (Evaluation (GroundedPredicate "py:filter") (Variable "$Y"))))
Since attention maps no longer need to be passed from filter to filter, this expression could even be simplified to use only one variable.
Here, the internal AndLinks deal with tensor truth values, while the external AndLink is a traditional PLN conjunction, so EqualLink compares concepts rather than feature maps as the equal_material module does.
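To picture this two-level evaluation, assume that a tensor truth value is collapsed to a scalar one, for instance by taking the maximum of the attention map; the outer conjunction and the equality check then operate on ordinary values and concept names. A sketch (function names are our own):

def to_scalar_tv(attention_map):
    """Collapse a tensor truth value (the attention map produced by an inner,
    pixel-wise AndLink) into an ordinary scalar truth value."""
    return float(attention_map.max())

def outer_conjunction(att_x, att_y, material_x, material_y, threshold=0.5):
    """Outer PLN-style AndLink: both inner conjunctions must attend to something,
    and the EqualLink compares the grounded concepts, not feature maps."""
    return (to_scalar_tv(att_x) > threshold and
            to_scalar_tv(att_y) > threshold and
            material_x == material_y)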
Thus, this is not only a computationally different representation of the question but also a different formalization of it. A question about the similarity of colors can imply either a positive or a negative answer for two green objects of different shades, and humans can answer such a question by relying both on visual features and on the names of colors.
Is a declarative formalization of questions more natural?
In terms of specifying what we want to find, the declarative way is definitely better. However, it might not be better at describing how humans actually find the answer.
For instance, consider the question “What color is the tiny matte block left of the blue block?”
In neural module networks, the notion of “left” is represented by a neural network that also accepts feature maps masked by the attention output of the previous module and produces another attention map.
In the case of “large blue block” there is no real need to pass attention maps from the “large” filter to the “blue” filter, and further to the “cube” filter: the attention maps can be calculated in parallel and then fused. However, in the case of “left of the blue block” it is essential to use the attention map produced by “blue block” inside “left of” and to pass the resulting attention map further, which is definitely not a pixel-wise operation over attention maps as truth values, and which looks very much like the sequential execution of an imperative program.
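The contrast is easy to see in code. In the hypothetical sketch below (module names are assumptions), the attribute filters for “large blue block” run independently and are fused pixel-wise, whereas the “left of the blue block” chain has to pass attention maps from stage to stage:

import torch

def large_blue_block(filters, image_features, full_attention):
    """Attribute filters can be applied independently and fused pixel-wise."""
    maps = [filters[name](image_features, full_attention)
            for name in ('large', 'blue', 'cube')]
    fused = maps[0]
    for att in maps[1:]:
        fused = torch.minimum(fused, att)
    return fused

def block_left_of_blue_block(filters, relations, image_features, full_attention):
    """'tiny matte block left of the blue block' is inherently sequential: the relation
    module consumes the attention map for 'blue block', and its output feeds the rest."""
    att_blue = filters['blue'](image_features, full_attention)
    att_blue_block = filters['cube'](image_features, att_blue)
    att_left = relations['left'](image_features, att_blue_block)  # new region, not a pixel-wise And
    att_tiny = filters['small'](image_features, att_left)
    return filters['rubber'](image_features, att_tiny)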
Does it mean that this question cannot be represented declaratively? Apparently, it does not.
“Left of” and similar relation modules act as functions returning new objects (selected pixels), not as predicates (uniformly applied to a given set of objects). In logic, “left of” and other relations are usually formalized as predicates with two arguments. However, such a predicate would also have to accept two attention maps (or feature maps masked by attention maps). This is not really a problem in OpenCog, because grounded predicate nodes (or rather grounded schema nodes, which are similar to grounded predicates in that both accept Atoms as input, but produce Atoms as output instead of truth values) can be nested, for example:
(EvaluationLink
  (GroundedPredicateNode "py:relation")
  (ConceptNode "left")
  (ExecutionOutputLink
    (GroundedSchemaNode "py:filter")
    (ConceptNode "cube"))
  (ExecutionOutputLink
    (GroundedSchemaNode "py:filter")
    (ConceptNode "cylinder")))
This expression will be evaluated to true if there is a cube to the left of a cylinder. Isn’t that an imperative program?
Apparently, there is some piece of procedural knowledge here (which is still introspectable, except for the DNNs themselves), but it is really more declarative. Of course, instead of a concrete concept there can be a variable, which will be grounded by a general reasoning system not specifically designed for this task.
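Under the hood, the nested links would simply call one grounded Python function with the outputs of others; a hypothetical sketch of such groundings (the AtomSpace wiring and the actual module implementations are omitted):

def filter_schema(concept_name, filters, image_features):
    """Grounding for (GroundedSchemaNode "py:filter"): returns an attention map
    (wrapped as an Atom in OpenCog), not a truth value."""
    return filters[concept_name](image_features)

def relation_predicate(relation_name, attention_a, attention_b, relations, threshold=0.5):
    """Grounding for (GroundedPredicateNode "py:relation"): scores a spatial relation
    between two attended regions, e.g. whether the cube is left of the cylinder."""
    score = relations[relation_name](attention_a, attention_b)
    return float(score) > threshold   # wrapped as a TruthValue in OpenCog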
Furthermore, we can have general knowledge that if X is left of Y, then Y is right of X, so we can infer one specific fact from another. In fact, for the question “Is the red cube left of the green pyramid?” one can first attend to the red cube and then examine the image to the right of it. However, this example also shows that “left” is not used here as a predicate of two arguments, but rather as an instruction to examine a part of the image. At the same time, humans can also first find the two objects independently and then check their relative location.
Interestingly, the use of functional mappings instead of predicates to represent relations corresponds to Skolemization in logic. Indeed, if we want to prove that for every X there exists a Y such that p(X, Y) holds, we can instead prove that p(X, f(X)) holds for every X. Instead of searching for Y for every given X using a general reasoning engine, we compute it directly. Just as our cognitive system can know that left(X,Y) :- right(Y,X), it can also know that left(X,Y) :- Y=f(X), and use this knowledge to efficiently satisfy a query containing “left” as a predicate of two arguments (because it will know that in order to prove left(X,Y), it is enough to calculate Y=f(X)). Moreover, the cognitive system can have both “left” as a grounded predicate and “f” as a grounded schema, so in principle it will be able to follow either line of reasoning, just as humans can.
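A small sketch of the two representations of “left” (the relation modules below are hypothetical): one as a two-argument predicate, the other as a Skolem-style function computing the region to the left of a given object:

def left_predicate(attention_x, attention_y, relations, threshold=0.5):
    """'left' as a two-argument predicate: check the relation for given X and Y."""
    return float(relations['left'](attention_x, attention_y)) > threshold

def left_skolem(attention_x, relations):
    """'left' Skolemized: directly compute the region to the left of X as a function
    of X (a hypothetical relation module returning an attention map). Knowing that
    left(X, Y) can be proved by checking whether Y falls inside this region lets the
    reasoner replace a search over Y with a single forward computation."""
    return relations['left_region'](attention_x)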
Similarly, a classifier y=f(x) that outputs class labels y for given patterns x can be considered a Skolem function for some predicate p(x, y), whose truth value can be calculated for a pair of x and y. Here one can also find an interesting connection with the relation between probabilistic discriminative and generative models. A reasoning system that knows that a query p(x, y) can be satisfied by computing y=f(x), or that the discriminative model approximates the posterior probabilities specified by the generative model, can decide to use the classifier even when the query is posed declaratively, with variables to be grounded.
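For instance, in this minimal sketch (with an assumed joint scoring function), the classifier is just the Skolem function that picks the label directly instead of searching over candidates:

def p(x, y, joint_score):
    """A predicate over pattern/label pairs, e.g. derived from a generative model."""
    return joint_score(x, y) > 0.5

def classify(x, joint_score, labels):
    """The classifier as a Skolem function for p: instead of searching for a label y
    that satisfies p(x, y), compute the best y directly."""
    return max(labels, key=lambda y: joint_score(x, y))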
What is important is that OpenCog has trainable inference control capabilities, as highlighted in this post, so depending on the circumstances, it can potentially learn to choose a better line of reasoning.
Of course, the tricky part is to integrate all these capabilities together in one end-to-end trainable system that doesn’t require too much specific hand-coded knowledge.
The SingularityNET platform will be able to facilitate this research by bringing together AI services and agents of different kinds, including deep neural networks, real-world knowledge bases, and general reasoning systems.
How Can You Get Involved?
If you would like to learn more about SingularityNET, we have a passionate and talented community which you can connect with by visiting our Community Forum. Feel free to say hello and to introduce yourself here. We are proud of our developers and researchers who are actively publishing their research for the benefit of the community; you can read the research here.
For any additional information, please refer to our roadmaps and subscribe to our newsletter to stay informed about all of our developments.