Combining Reinforcement Learning with Neuro-Symbolic Planning

Ulzhalgas Rakhman
7 min read · Dec 18, 2022


This work was implemented as the final project for the Deep Reinforcement Learning (AI 611) class at KAIST.

Deep Symbolic Reinforcement Learning Architecture

Welcome to my blog for everyone interested in Reinforcement Learning. In this post, I will elaborate on the Deep Symbolic Reinforcement Learning framework and evaluate it on a more realistic game environment (e.g., Sokoban).

Classic Symbolic AI

First, let’s discuss symbolic AI planning. Symbolic AI is a branch of Artificial Intelligence in which a task is described in terms of ‘objects’ and ‘predicates.’ The objects are simply items in a world, and the predicates are relationships between them. A great illustration of Symbolic AI is a family tree (please see the figure below).

Image from https://medium.com/@vbanda/good-old-fashioned-artificial-intelligence-b60800313dee

We can infer new relationships between objects by applying symbolic reasoning to a list of known relations. For example, given that Homer is the father of Bart and also the father of Lisa, we can infer that Bart and Lisa are siblings. A huge benefit of using Symbolic AI is the interpretability of its representation.
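As a minimal sketch (not from the original post), here is how such relational inference could be written in plain Python; the facts and rule names are purely illustrative:

```python
# Minimal sketch of symbolic inference over a family tree.
# Facts are (predicate, subject, object) triples; rules derive new relations.

facts = {
    ("father", "Homer", "Bart"),
    ("father", "Homer", "Lisa"),
    ("mother", "Marge", "Bart"),
    ("mother", "Marge", "Lisa"),
}

def parents(kb):
    """Rule: parent(X, Y) :- father(X, Y) or mother(X, Y)."""
    return {("parent", x, y) for (p, x, y) in kb if p in ("father", "mother")}

def siblings(kb):
    """Rule: sibling(A, B) :- parent(X, A), parent(X, B), A != B."""
    par = parents(kb)
    return {
        ("sibling", a, b)
        for (_, x, a) in par
        for (_, y, b) in par
        if x == y and a != b
    }

derived = parents(facts) | siblings(facts)
print(("sibling", "Bart", "Lisa") in derived)  # True: inferred, never stated explicitly
```

The key point is that the sibling relation is never written down as a fact; it is derived by reasoning over the stated relations, and every derivation step is inspectable.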

Nowadays, symbolic AI in robotics is a well-studied subject. AI researchers have widely used symbolic representations and AI planners for various robotic tasks such as manipulation, navigation, human-robot interaction, and robot-robot interaction.

Two Popular Ways to Build Intelligent Agents

In general, reinforcement learning (RL) and symbolic planning have both been used to build intelligent autonomous agents.

The essential benefit of Symbolic AI is that the reasoning process can be easily understood: a symbolic method can explain why a certain decision was reached and what the reasoning steps were. However, a major drawback of Symbolic AI is the laborious task of manually encoding the rules and knowledge the system needs.

Symbolic Planning vs. Reinforcement Learning

On the other hand, RL relies on learning from interactions with the real world, which often requires an excessive amount of experience.

Why Neuro-Symbolic Representation for RL?

Although Deep Reinforcement Learning (DRL) systems provide amazing outcomes, they have a number of disadvantages:

  1. Requirement for very large training sets
  2. Poor performance on a new task
  3. Failure to fully exploit the statistical regularities present in the training data.
  4. They are opaque, making it challenging to deduce a human-comprehensible series of justifications for the decisions the system takes.

Deep reinforcement learning systems inherit from deep learning the need for very large training sets, which motivates integrating symbolic reasoning with deep learning. They are also fragile, in the sense that a trained network that excels at one task frequently fails miserably at another, even when the two are extremely similar. DRL techniques do not fully exploit the statistical regularities in the training data because they do not leverage high-level processes such as planning, causal reasoning, or analogical reasoning. Finally, extracting a human-understandable chain of justifications for an action choice is challenging. In contrast, classical AI encodes knowledge using language-like propositional representations.

Towards Deep Symbolic Reinforcement Learning

The method in this paper takes some aspects of machine learning and combines them with classical AI: it uses machine learning to learn symbolic representations, and then applies symbolic reasoning on top of those learned symbols for action selection.

The stages of Deep Symbolic Reinforcement Learning

The system has three stages: low-level symbol generation, representation building, and reinforcement learning.

A convolutional neural network (state autoencoder), trained on 5000 randomly generated pictures of game objects, is used in the first stage. The second stage learns to track each object between frames in order to capture its dynamics. Finally, we reach the reinforcement learning stage, where objects are represented relative to one another based on their locations. This drastically condenses the state space, under the assumption that objects (both in the game and in the real world) that are far apart tend to have little impact on each other.
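As a rough sketch of the third-stage representation (my own illustration, with made-up function names), the symbolic state can be built from the type and position of each detected object, keeping only pairs that are close enough to plausibly interact:

```python
# Sketch of the third-stage representation: describe each pair of nearby
# objects by their type pair and relative offset instead of raw pixels.
# Object detections (type label + grid position) are assumed to come from
# the first two stages; all names here are illustrative.

from itertools import combinations

def relative_state(objects, max_dist=3):
    """objects: list of (type_label, (x, y)) detections.
    Returns a frozenset of (type_i, type_j, dx, dy) tuples for pairs within
    max_dist, reflecting the assumption that distant objects barely interact."""
    interactions = set()
    for (ti, (xi, yi)), (tj, (xj, yj)) in combinations(objects, 2):
        dx, dy = xj - xi, yj - yi
        if abs(dx) <= max_dist and abs(dy) <= max_dist:
            interactions.add((ti, tj, dx, dy))
    return frozenset(interactions)

state = relative_state([("agent", (2, 2)), ("cross", (3, 2)), ("circle", (7, 7))])
print(state)  # only the agent-cross pair survives the locality cutoff
```

Because only type pairs and small relative offsets are kept, the number of distinct states stays small even when the raw image space is enormous.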

There are four principles for building Deep Symbolic Reinforcement Learning (DSRL) systems:

  1. Conceptual abstraction: high-dimensional raw input is first mapped into a lower-dimensional conceptual state space, on which higher-level symbolic methods then operate.
  2. Compositional structure: the representation allows elements to be combined and recombined in an open-ended way.
  3. Common sense priors: an end-to-end reinforcement learning system cannot be expected to perform without any prior assumptions about the domain, so the system is built on top of common sense priors.
  4. Causal reasoning: the causal structure of the domain is identified, and symbolic rules are articulated in terms of both the domain and the common sense priors.

Overview of the symbolic representations extracted after each of the three main steps.

As mentioned before, given the geometric simplicity of the games, the convolutional autoencoder is enough to extract the individual objects from any given frame. We then label persistent objects across frames to build spatio-temporal representations.
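Here is a small illustrative sketch of such frame-to-frame tracking, assuming a simple greedy nearest-neighbour match between detections of the same type (names, integer labels, and thresholds are my own, not the paper’s code):

```python
# Sketch of stage two: match each object detected in the current frame to
# the nearest unmatched object of the same type in the previous frame, so
# the same persistent label can be carried across frames.

def track(prev, curr, max_jump=2.0):
    """prev: dict label -> (type, (x, y)) from the previous frame (integer labels).
    curr: list of (type, (x, y)) detections in the current frame.
    Returns a dict label -> (type, (x, y)), reusing labels where possible."""
    assigned = set()
    result = {}
    next_label = max(prev, default=-1) + 1
    for obj_type, (x, y) in curr:
        best, best_d = None, max_jump
        for label, (p_type, (px, py)) in prev.items():
            d = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
            if p_type == obj_type and label not in assigned and d <= best_d:
                best, best_d = label, d
        if best is None:                 # no plausible match: new persistent object
            best, next_label = next_label, next_label + 1
        assigned.add(best)
        result[best] = (obj_type, (x, y))
    return result
```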

Reinforcement Learning

  • The main idea is to learn several Q functions for the different interactions and query those that are relevant for the current situation
  • The update rule for the interaction between objects of types i and j is the standard tabular Q-learning update:

$$Q^{ij}(s^{ij}_t, a_t) \leftarrow Q^{ij}(s^{ij}_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q^{ij}(s^{ij}_{t+1}, a') - Q^{ij}(s^{ij}_t, a_t) \right]$$

where α is the learning rate, γ is the temporal discount factor, and each state s^{ij}_t represents an interaction between object types i and j at time step t.
  • To choose the next action, we sum the Q values obtained from the currently relevant Q functions at that time step and pick the action with the highest total

Now that stages one and two of the system pipeline are in place, we can use their output to learn an efficient policy for gameplay. For reinforcement learning, we train a separate Q function for each interaction between two object types. The key idea is to learn a collection of Q functions for the various interactions and to query only those that are relevant to the current situation. Given the simplicity of the game and the compact state space that emerges from the sparse symbolic representation, we can approximate the optimal policy with tabular Q-learning. To select the next action, we sum the Q values from the currently relevant Q functions and choose the action with the highest total, i.e., the one expected to yield the highest return.
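Here is a minimal sketch of this per-interaction Q-learning scheme, assuming the symbolic interaction states from the previous stages are available as a dictionary keyed by object-type pairs; all names and hyperparameter values are illustrative, not the paper’s code:

```python
# Minimal sketch of per-interaction tabular Q-learning: one Q table per pair
# of object types, updated independently, and summed at action-selection time.

from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["up", "down", "left", "right"]

# Q[(type_i, type_j)][(state, action)] -> value
Q = defaultdict(lambda: defaultdict(float))

def select_action(interactions):
    """interactions: dict (type_i, type_j) -> symbolic state s_ij.
    Sum the relevant Q functions and pick the greedy action (epsilon-greedy)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    totals = {
        a: sum(Q[pair][(s, a)] for pair, s in interactions.items())
        for a in ACTIONS
    }
    return max(totals, key=totals.get)

def update(interactions, action, reward, next_interactions):
    """Apply the tabular Q-learning update to every currently relevant table."""
    for pair, s in interactions.items():
        s_next = next_interactions.get(pair)
        best_next = 0.0
        if s_next is not None:
            best_next = max(Q[pair][(s_next, a)] for a in ACTIONS)
        td_target = reward + GAMMA * best_next
        Q[pair][(s, action)] += ALPHA * (td_target - Q[pair][(s, action)])
```

Keeping one small table per type pair means each table is updated by every encounter between those two types, which is what lets the agent reuse experience across positions in the grid.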

There are three questions regarding the paper that I wanted to investigate:

  1. Does considering the types of neighbouring objects during object tracking improve the symbolic agent?
  2. Does the proposed method work in a more complex domain (e.g., Sokoban)?
  3. Does the symbolic agent outperform pure RL methods other than DQN (e.g., PPO, SAC)?

To answer these questions, I needed to run additional experiments, which can be considered improvements over the original evaluation.

Experimental Evaluations

(1) Replication Results

The four different game environments. The agent is represented by the “+” symbol. The static objects return a positive or negative reward depending on their shape (“x” and “o”, respectively).

The authors’ prototype system learns to play four variants of a simple game. An agent moves around a square space populated by circles and crosses. It is rewarded positively for each cross it “collects” and negatively for each circle. Four variations are used: the first has only circles in a fixed grid, the second both circles and crosses in a fixed grid, the third only circles but in a random layout, and the fourth both circles and crosses in a random layout.

Comparison between DQN and the symbolic approach: average percentage of positive-reward objects collected over 200 games in the grid environment (left), in the random environment (middle), and the replicated results (right)

In order to assess agent performance, precision and recall metrics were used. Precision is defined as the proportion of collected objects that are positive (crosses), and recall is defined as the proportion of available positive objects that are actually collected. According to the initial replication results, the agent’s precision increases with training to about 70%. Comparing the effectiveness of DQN with the DSRL system is interesting: in the grid scenario DQN performs well, but when objects are placed at random, the DQN agent struggles to learn an effective policy within 1000 epochs.
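For concreteness, here is a small sketch of the two metrics as defined above (function names are illustrative):

```python
# Sketch of the two evaluation metrics used for the collection game.
def precision(pos_collected, neg_collected):
    """Fraction of all collected objects that were positive (crosses)."""
    total = pos_collected + neg_collected
    return pos_collected / total if total else 0.0

def recall(pos_collected, pos_available):
    """Fraction of the available positive objects that were collected."""
    return pos_collected / pos_available if pos_available else 0.0

print(precision(7, 3), recall(7, 10))  # 0.7 0.7
```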

(2) Improvement Results

I considered the Sokoban domain to validate that the method works on a more realistic task.

Sokoban game environment
Performance of Symbolic approach vs. PPO on Sokoban task

These are the results of the experiment. On the left is the performance of the symbolic agent using Sokoban images as input; on the right is the performance of PPO using the Sokoban display.

Conclusion

To conclude, let’s answer the questions posed earlier, based on the additional experiments.

  1. Does considering the types of neighbouring objects during object tracking improve the symbolic agent? -> NO
  2. Does the proposed method work in a more complex domain (e.g., Sokoban)? -> YES
  3. Does the symbolic agent outperform pure RL methods other than DQN (e.g., PPO, SAC)? -> YES

Applying the symbolic agent to robot manipulation remains future work.

