Graph Policy Gradients for Large Scale Robot Control
In the recent past, deep learning and deep reinforcement learning have proved to be extremely valuable tools for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks. These include, but are not limited to, learning to play video games and learning complex control policies for robot tasks. While these works are impressive, they typically concern themselves with controlling a single entity. However, in many real-world applications, especially in robotics, there is a need to control multiple robots interacting with each other in cooperative or competitive settings. In such scenarios, as the number of robots increases, the dimensionality of both the input space and the control space grows, making it much harder to learn meaningful policies.
Another key problem with applying reinforcement learning to multi-robot problems is the increase in the entropy of the system. In reinforcement learning, the robot/agent attempts to learn a task by executing it repeatedly. For single-agent systems, such as an agent learning to play a video game, one might need 5–10 million examples to learn meaningful behavior. As the number of agents increases, the randomness of the system grows, and learning meaningful behavior for a large number of robots becomes almost computationally intractable.
To solve these problems for large-scale multi-robot control, we draw inspiration from convolutional neural networks (CNNs). CNNs consist of sequentially composed layers, where each layer comprises banks of linear time-invariant filters that extract local features, along with pooling operations that reduce dimensionality. However, CNNs cannot be directly applied to irregular data whose elements have arbitrary pairwise relationships defined by an underlying graph. To overcome this limitation, a new architecture has emerged: the graph convolutional network (GCN). Similar to CNNs, GCNs consist of sequentially composed layers that regularize the linear transform in each layer to be a graph convolution with a bank of graph filters, whose weights are learned by minimizing some cost function.
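To make the graph-filter idea concrete, here is a minimal sketch of one graph-convolution layer, assuming the graph is represented by a shift operator S (for example, a normalized adjacency matrix over the robots). The layer is a polynomial in S whose coefficient matrices are the learnable filter weights; the names and shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def graph_conv_layer(S, X, H):
    """One graph-convolution layer (a bank of graph filters).

    S : (N, N) graph shift operator, e.g. a normalized adjacency matrix.
    X : (N, F_in) node features, one row per robot.
    H : list of K+1 weight matrices, each (F_in, F_out); these are learned.

    Output row i mixes features from robots up to K hops away from robot i.
    """
    Z = np.zeros((X.shape[0], H[0].shape[1]))
    X_shifted = X.copy()
    for H_k in H:
        Z += X_shifted @ H_k       # accumulate the k-th filter tap
        X_shifted = S @ X_shifted  # shift features one more hop along the graph
    return np.tanh(Z)              # pointwise nonlinearity
```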
Thus, the idea behind this paper is simple: we use graph convolutional networks to learn filters for a small number of robots. For example, when doing formation flying, we train with only 3 or 5 robots. The GCN learns filters that extract relevant information at this small scale. Each layer of a GCN aggregates information from one-hop neighbors, so a two-layer GCN aggregates information at each node from its two-hop neighborhood.
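The receptive-field claim is easy to check numerically. The toy setup below (five robots on a line, self-loops added to the adjacency matrix) is our own illustration, not an experiment from the paper: one shift reaches one-hop neighbors, two stacked shifts reach exactly two hops.

```python
import numpy as np

# Five robots arranged in a line: robot i is connected to robot i+1.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

S = A + np.eye(5)         # self-loops so each hop also keeps a robot's own features
one_hop = S != 0          # which robots influence robot i after one layer
two_hop = (S @ S) != 0    # influence after two stacked layers

print(one_hop[0])  # robot 0 sees itself and robot 1
print(two_hop[0])  # robot 0 now also sees robot 2, but nothing farther
```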
During inference, we scale up to many robots (100+) by simply reusing the same filter weights at every robot node. An additional advantage of this algorithm is that execution is decentralized: every robot has its own copy of the filter weights and relies only on information from its local neighbors to execute its trajectory. The performance of our algorithm is captured in the following video:
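The sketch below shows the scale-up step under the same assumptions as above: the weight bank H learned on a small team is reused verbatim on a large team, and only the shift operator S changes with team size. The random proximity graph here is an illustrative stand-in for the real communication graph. Because each robot's output depends only on the rows of S it touches, execution needs only local neighbor information.

```python
import numpy as np

def graph_filter(S, X, H):
    # sum_k S^k X H_k, as in the layer sketch above
    Z = np.zeros((X.shape[0], H[0].shape[1]))
    Xs = X.copy()
    for H_k in H:
        Z += Xs @ H_k
        Xs = S @ Xs
    return np.tanh(Z)

rng = np.random.default_rng(0)
H = [rng.standard_normal((4, 2)) for _ in range(3)]  # one shared weight bank

def random_team(n):
    # illustrative stand-in for a proximity graph over n robots
    A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
    A = A + A.T
    deg = np.maximum(A.sum(1), 1.0)
    return A / deg[:, None]  # row-normalized adjacency as the shift operator

# Same weights H, different team sizes: train small, deploy large.
small = graph_filter(random_team(3), rng.standard_normal((3, 4)), H)
large = graph_filter(random_team(100), rng.standard_normal((100, 4)), H)
print(small.shape, large.shape)  # (3, 2) (100, 2)
```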
https://www.youtube.com/watch?v=RefiX9UCCw8
For more details, please refer to our full paper, which can be found at: