The Effect of Samplers on “Max of two Quadratics”-Learning

Siphelele Danisa
InstaDeep
May 12, 2022

This piece gives a glimpse into a corner of the work that I got to do while working as an intern at InstaDeep. The post will focus on the effect of samplers on multiagent soft Q-learning in the presence of game-theoretic pathologies. We will start by introducing the context, and then slowly explore the key concepts as well as the results.

Single-agent reinforcement learning deals with a specific case of learning wherein an agent interacts with its environment to achieve a pre-defined goal. The idea is that the agent makes an observation, takes an action, then receives a signal (reward, say) defining how good/bad the action is. Multiagent reinforcement learning lifts this idea to a context where we have more than one agent in the environment, and the following picture captures the essence of this space.

From J K Terry’s “Multi-Agent Deep Reinforcement Learning in 13 Lines of Code Using PettingZoo”

Here the agents are all able to make some observations and take actions, and they receive a reward signal as before. The main difference is that there are many possible contexts here such as cooperative, competitive, or mixed games. These determine the properties of the reward function. For example, if the agents are in a cooperative setting, then the reward function will be a function of all the actions, and the agents are required to collectively optimise it. However, if the context is competitive, then the function needs to reflect the zero-sum nature of competitive games (we use the word zero-sum loosely here since we can have rewards summing to any fixed number). There is a broad mixed spectrum in-between.

In this post, we investigate the convergence of “naive” multiagent soft Q-learning (Wei, E. et al. 2018) in continuous games where learning is most likely to be affected by a game-theoretic pathology known as relative overgeneralisation — we will unpack what we mean by this in what follows. While this problem will occur more often in instances where learners are learning independently, it is also a major problem in joint-learner problems when information is not used efficiently in the attempts to solve whatever problem is at hand. This pathology leads to learners that get stuck in suboptimal equilibria, which is undesirable.

What makes this problem hard to solve is that it speaks directly to the challenge of translating single-agent algorithms to multiple agents. There is no prescriptive way to do this because multiagent settings are varied and introduce idiosyncratic challenges, and it is often hard to find a one-size-fits-all approach (loosely speaking). It is worth remarking that while there are methods that solve the task comparably to the naive approach, it is hard to find well-motivated methods that are consistent in their convergence. Given that our interest was in energy-based models, the first step was to understand the elements that make up these models. This matters because, while multiagent learning now has a large body of published research, a lot remains to be done. This write-up focuses on one of these elements, sampling, as will be discussed below.

Unpacking what has already been said, in this research we were curious about ways in which we could improve the learning of agents in the above-mentioned context. In particular, we wanted to know if it was possible to observe improvements in results from the perspective of efficient sampling, focusing on the simpler naive model. An advantage of this is that a simpler mechanism to improve learning will allow the development of simpler methods that achieve better results. After reaching this point, it may also be possible to see if there is a way to extend this towards the broader non-stationarity problem in multiagent learning. The non-stationarity problem in multiagent reinforcement learning usually arises due to the fact that the definition of optimality is based on the joint action, which makes the dynamics non-stationary from the perspective of individual agents. One of our goals was to follow this journey towards developing simple mechanisms where coordination came out naturally in the face of the aforementioned adversity (natural in the sense of requiring only well-justifiable assumptions, where this is possible).

The plan for the remaining part is to do the following:

  1. Define relative overgeneralisation and give examples.
  2. Introduce entropy regularised reinforcement learning and introduce an algorithm that lifts this to the multiagent case.
  3. Introduce reproducing kernel Hilbert spaces, paving the way for the methods studied.
  4. Lastly, look into the experimental results and draw conclusions.

Relative Overgeneralisation and Examples

Miscoordination due to sub-optimal joint-policies results in values corresponding to optimal actions being underestimated, which is known as action shadowing. Palmer, G. et al. 2019 define relative overgeneralisation as a type of action shadowing, occurring in games where a suboptimal Nash Equilibrium yields a higher payoff on average when each selected action is paired with an arbitrary action chosen by the other player.

First, consider the following payoff matrix.

In this simple context, agents that use an average-based learning scheme will reason that since ∑_j (A,j) < ∑_j (C,j) and ∑_j (B,j) < ∑_j (C,j), then C should be preferred whatever the other agent does. In this notation, A, B and C index the rows (columns) of the matrix, and j is the summation variable running over the other player's actions. Both agents will therefore move towards the shadow equilibrium (C, C). If an alternative move is played with some small probability, then the expectation is that player 1 will transition from C to B, and so will player 2. However, they will move no further, as they will have reached the so-called Pareto-dominated suboptimal Nash equilibrium, which essentially traps them in suboptimality.
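For concreteness, the classic climbing game from the cooperative multiagent learning literature has exactly this structure. The payoff values below are the standard climbing-game ones, used purely as an illustrative stand-in for the matrix above, and a few lines of NumPy reproduce the average-based reasoning.

```python
import numpy as np

# Standard climbing-game payoffs (row player = player 1, column player = player 2).
# (A, A) = 11 is the optimal joint action; (C, C) = 5 is the shadow equilibrium,
# and (B, B) = 7 is the Pareto-dominated Nash equilibrium mentioned above.
payoffs = np.array([
    [ 11, -30,   0],   # player 1 plays A
    [-30,   7,   6],   # player 1 plays B
    [  0,   0,   5],   # player 1 plays C
])
actions = ["A", "B", "C"]

# Average-based reasoning: score each action by its mean payoff over the
# other player's actions, as in the sums described above.
print({a: round(float(m), 2) for a, m in zip(actions, payoffs.mean(axis=1))})
# -> {'A': -6.33, 'B': -5.67, 'C': 1.67}: C looks best to player 1 on average
print({a: round(float(m), 2) for a, m in zip(actions, payoffs.mean(axis=0))})
# -> {'A': -6.33, 'B': -7.67, 'C': 3.67}: and likewise to player 2
```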

Secondly, the following diagram is another example from Wei, E. et al. 2018 — which is our starting point — that demonstrates a simple case where this pathology can be observed in continuous coordination-based games.

In this setting, the agents will similarly tend to learn to move towards the suboptimal equilibrium, N, due to the average payoff there being higher in comparison to that corresponding to the optimal equilibrium, M. Next, we introduce entropy regularised reinforcement learning briefly.
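To make the continuous case concrete, the sketch below defines a reward surface with this shape. The centres, widths and heights are illustrative assumptions only (not the constants used by Wei, E. et al. 2018): a wide, low quadratic around the suboptimal equilibrium N and a narrow, higher quadratic at the optimal equilibrium M.

```python
import numpy as np

def max_of_two_quadratics(a1, a2):
    """Illustrative 'max of two quadratics' reward for a pair of continuous actions."""
    # Wide, low quadratic: peak 0 at N = (-5, -5), the suboptimal equilibrium.
    f1 = 0.0 - 0.2 * ((a1 + 5.0) ** 2 + (a2 + 5.0) ** 2)
    # Narrow, higher quadratic: peak 2 at M = (5, 5), the optimal equilibrium.
    f2 = 2.0 - 5.0 * ((a1 - 5.0) ** 2 + (a2 - 5.0) ** 2)
    return np.maximum(f1, f2)

# Averaging agent 1's payoff over agent 2's actions favours the wide basin at N:
grid = np.linspace(-10.0, 10.0, 201)
A1, A2 = np.meshgrid(grid, grid, indexing="ij")
avg_over_partner = max_of_two_quadratics(A1, A2).mean(axis=1)
print(grid[np.argmax(avg_over_partner)])  # -5.0, i.e. agent 1 is pulled towards N
```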

Entropy Regularised Reinforcement Learning

First, we consider the following picture.

We can interpret a) as a context where there is a sequence of states and, for each state, an action is taken, much like the picture we had for reinforcement learning. What is missing from this particular diagram, however, is the notion of optimality: that is, loosely speaking, a sense of how good the actions taken are. We can introduce optimality variables O, as seen in b), which are essentially indicator functions of the optimality condition: they are set to one if an action is optimal and to zero otherwise. This gives a model that is essentially the same as what we had in the case of single-agent reinforcement learning. From this point, we can study reinforcement learning as inference on a graphical model, and we get an entropy regularised objective.
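Following Levine (2018), and under the standard simplifications that the dynamics in the learned distribution are fixed to the true dynamics and the action prior is uniform (assumptions made here for the sake of a concrete formula), this objective can be written as

$$
D_{\mathrm{KL}}\big(\hat{p}(\tau)\,\|\,p(\tau)\big) \;=\; -\sum_{t}\mathbb{E}_{(s_t,a_t)\sim\hat{p}}\Big[r(s_t,a_t) + \mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big] \;+\; \mathrm{const},
$$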

where the p-hat denotes the learned trajectory distribution, p is the true distribution that we want to match (both taken over trajectories), and any constant terms do not depend on the policy (Levine, S. 2018). It is possible to have a more in-depth discussion of this setting, but we only summarise results from (Haarnoja, T. et al. 2017):

  1. The associated policies take the form of energy-based models, where the model is defined as the exponentiation of some (energy) function.
  2. The associated Q-function (the soft Q-function) satisfies a soft version of the classical Bellman equation (see Chapter 3 of Sutton, R. S. and Barto, A. G.'s Reinforcement Learning, 2nd edition, for the classical equations), the state-value function similarly, and these soft analogues are written out just below this list.
  3. If we make very mild assumptions about the finiteness and boundedness of these value functions, it can be shown that there is an iterative procedure that converges to optimality.
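Written out, the relations behind points 1 and 2 (following Haarnoja et al., 2017, with temperature α and discount γ) are

$$
\pi(a_t\mid s_t)\;\propto\;\exp\!\Big(\tfrac{1}{\alpha}\,Q_{\mathrm{soft}}(s_t,a_t)\Big),
$$

$$
Q_{\mathrm{soft}}(s_t,a_t)\;=\;r(s_t,a_t)\;+\;\gamma\,\mathbb{E}_{s_{t+1}}\big[V_{\mathrm{soft}}(s_{t+1})\big],
\qquad
V_{\mathrm{soft}}(s_t)\;=\;\alpha\log\int\exp\!\Big(\tfrac{1}{\alpha}\,Q_{\mathrm{soft}}(s_t,a')\Big)\,da'.
$$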

The problem, however, is that given the first point, it is hard to sample from these general policies in the learning process. It is possible to overcome this by using Stein Variational Gradient Descent, which we discuss below, to minimise the KL divergence between an easy-to-sample model that we are learning and the target. In order to generalise this to multiple agents, we follow the algorithm below (Wei, E. et al. 2018).

At this point, we would like to discuss reproducing kernel Hilbert spaces, hopefully orienting the reader appropriately for when we discuss the methods. Without this high-level discussion, it may be difficult to grasp the core of what this work is based on.

Towards Sampling: Introduction to Reproducing Kernel Hilbert Spaces

In what follows, we introduce reproducing kernel Hilbert spaces (RKHS), and ways to think about them as the foundation of the methods discussed here.

Concretely, we can define a positive-definite kernel, k, from a symmetric function by imposing a mild positivity condition. It can be shown that if we are given a positive-definite kernel, k, then there is a Hilbert space, H, and a map ϕ: X → H, such that k(x,y) = ⟨ϕ(x), ϕ(y)⟩, where this inner product is taken in H. These ϕ-maps are precisely feature maps as we know them. In fact, we can go the other way around as well, so this is actually an equivalence of statements.

Reproducing kernel Hilbert spaces, while seemingly too abstract, are useful because (amongst other things) they give us a way to do computations that (implicitly) involve feature maps without having to go through the trouble of explicitly defining the feature maps or having to work in spaces that have a large dimension (see example on page 27).
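As a standard illustration of this shortcut, the snippet below checks that the degree-2 polynomial kernel on two-dimensional inputs computes the same inner product as an explicit three-dimensional feature map, without the kernel path ever constructing that map. The particular kernel and feature map are textbook choices used only for this check.

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 (homogeneous) polynomial kernel on 2-D inputs: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y) ** 2)

def poly2_features(x):
    """Explicit feature map phi with k(x, y) = <phi(x), phi(y)> in 3 dimensions."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = poly2_kernel(x, y)                                    # evaluated directly
feature_value = float(np.dot(poly2_features(x), poly2_features(y)))  # via explicit features
assert np.isclose(kernel_value, feature_value)                       # both equal (x . y)^2 = 1.0
```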

There are three key spaces involved here: X, the space of data points, the space of images of feature maps (call this curly-F), and the reproducing kernel Hilbert space, and they are related by the following diagram.

What we see here is that when working with reproducing kernel Hilbert spaces, we are implicitly doing computations with feature maps, but we’ve found a shortcut where we work in a different space (instead of that defined by images of feature maps), and this shortcut turns out to be much more efficient. The connections are shown by arrows with associated functions as seen in the above definition.

The most compact, but rather abstract, definition essentially characterises reproducing kernel Hilbert spaces as those spaces (of functions from X → ℝ) whose evaluation functionals are continuous. Functions in these spaces are point-wise well defined. Loosely speaking, in this context continuity comes with boundedness, and boundedness brings to mind notions of magnitude (of points) and distance (between points). Furthermore, given a kernel, the space it generates is unique up to isometric isomorphisms, and given a reproducing kernel Hilbert space the kernel is unique.

All these characterisations give us tools to think about reproducing kernel Hilbert spaces, and we can go back and forth as suggested by the image below.

MIT 9.520: Statistical Learning Theory and Applications, Spring 2011.

In what follows next, we introduce the methods we looked into. The key thing here is that we are trying to model a target distribution, as best we can, using a finite number of samples ('particles'). Doing this allows us to sample from the distribution efficiently. The first step, with these methods, is to sample essentially at random; after this, one imagines lifting these particles to a space whose points are images of feature maps and doing the modelling computations with these images. However, instead of working in that space, we work in the reproducing kernel Hilbert space, which is much easier to work in, in a pretty much equivalent sense. This is where the machinery discussed above becomes relevant. The following picture is a visual example of how these methods work.

Kernelised Stein Discrepancy based Neural Stein Sampler

This is a toy example with a 2D Gaussian mixture, trained with one of the methods (Hu, T. et al. 2018). The red contours represent the target distribution and the green dots are generated samples. From left to right: the initial random samples, then the samples after 1000, 2000 and 5000 iterations respectively. We see that the initial samples are not very good, but they improve with the iterations. The updates performed at these steps are what we discuss below, as these are (mostly) the distinguishing factors between the methods. In what follows, we finally introduce the methods and algorithms.

Stein Variational Gradient Descent

This quick overview of Stein Variational Gradient Descent (SVGD) follows (Liu, Q. 2016). Let p(x) be a positive density function, on a subset of d-dimensional Euclidean space, that we want to approximate with a set of n particles. The particles are initialised from a simple distribution q_0 and, given a step size ϵ as well as an appropriate velocity field ϕ, they are iteratively updated as x ← x + ϵϕ(x). Here, we have the following:

  1. The velocity field (or perturbation direction) is chosen to maximally decrease the KL divergence between the distribution of the updated particles and the target.
  2. The set of perturbation directions is taken to be the unit ball of a vector-valued reproducing kernel Hilbert space, which is essentially a product space of regular reproducing kernel Hilbert spaces.
  3. It turns out that the above objective is a linear functional of ϕ, and that the maximal decrease in KL divergence is obtained by maximising the expectation of (∇_x log p(x))ᵀϕ(x) + ∇_x ⋅ ϕ(x) over q, where ᵀ denotes the transpose.

The algorithm is as follows
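As a concrete illustration, a single SVGD update can be sketched in a few lines of NumPy. The RBF kernel with a median-heuristic ('dynamic') bandwidth, the function names, and the toy Gaussian target are implementation choices for this sketch rather than details taken from the experiments.

```python
import numpy as np

def rbf_kernel_and_grad(particles):
    """RBF kernel matrix and its gradients, with a median-heuristic ('dynamic') width."""
    diffs = particles[:, None, :] - particles[None, :, :]          # x_i - x_j
    sq_dists = (diffs ** 2).sum(-1)
    h = np.median(sq_dists) / np.log(len(particles) + 1) + 1e-8    # bandwidth heuristic
    K = np.exp(-sq_dists / h)
    grad_K = (2.0 / h) * diffs * K[..., None]                      # d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(particles, grad_log_p, step_size=1e-2):
    """One SVGD update: x_i <- x_i + eps * phi*(x_i), with
    phi*(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]."""
    K, grad_K = rbf_kernel_and_grad(particles)
    scores = grad_log_p(particles)                  # (n, d): grad_x log p at each particle
    phi = (K @ scores + grad_K.sum(axis=1)) / len(particles)
    return particles + step_size * phi

# Example: 50 particles drifting towards a standard Gaussian, for which grad log p(x) = -x.
particles = np.random.randn(50, 2) * 3.0
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)
```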

Kernelised Stein Discrepancy

Working in a d-dimensional (vector-valued) reproducing kernel Hilbert space associated with a kernel k(⋅,⋅), it can be shown that if the function space over which the optimisation above is performed is the unit ball of the RKHS, then taking the supremum (still within the unit ball) of the expectation that appeared in the previous section defines a good discrepancy measure between distributions: the so-called Kernelised Stein Discrepancy (KSD). The KSD has a closed form, KSD(p, q) = E[u_q(x, x′)], where the expectation is taken over independent pairs of samples x, x′ from the distribution whose fit we are testing (p here), and where u_q is the so-called Stein kernel.
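The standard closed form of this Stein kernel from the kernelised Stein discrepancy literature, which is the form assumed here, is

$$
u_q(x,x')\;=\;S_q(x)^{\top}k(x,x')\,S_q(x')\;+\;S_q(x)^{\top}\nabla_{x'}k(x,x')\;+\;\nabla_{x}k(x,x')^{\top}S_q(x')\;+\;\mathrm{tr}\!\big(\nabla_{x}\nabla_{x'}k(x,x')\big),
$$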

with S_q(x) = ∇_x log q(x) denoting the gradient of the log-probability, playing the same role as the score seen above. The KSD essentially measures the goodness of fit of n samples to a density q(x).

The setup here is the following. We let q(x) be an unnormalised target density with support on a subset of d-dimensional Euclidean space, and p_z(z) be a reference distribution that generates noise in d_0-dimensional space. We let G_θ be a sampler parametrised by θ. If p_θ(x) is the underlying density of the samples x = G_θ(z), then we want to learn the network parameters such that p_θ approximates the target density well, where this target corresponds to the density p(x) mentioned in the previous section (the one we want to approximate). The algorithm is given below.
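To make the setup concrete, here is a rough PyTorch sketch of the idea: estimate the KSD from the sampler's own outputs using the closed form above with an RBF kernel, and update θ to reduce it. The network G, the Gaussian toy target, the fixed bandwidth h and the simple pairing of samples are all illustrative choices, not the exact algorithm of Hu et al. (2018).

```python
import torch

def stein_kernel_rbf(x, xp, score_q, h=1.0):
    """Closed-form u_q(x, x') for the RBF kernel k(x, x') = exp(-||x - x'||^2 / h).

    score_q(x) must return grad_x log q(x) (q may be unnormalised).
    x, xp: tensors of shape (n, d) holding paired samples.
    """
    d = x.shape[-1]
    diff = x - xp
    sq = (diff ** 2).sum(-1)
    k = torch.exp(-sq / h)
    sx, sxp = score_q(x), score_q(xp)
    term1 = (sx * sxp).sum(-1) * k                       # S_q(x)^T k S_q(x')
    term2 = (sx * diff).sum(-1) * (2.0 / h) * k          # S_q(x)^T grad_{x'} k
    term3 = -(sxp * diff).sum(-1) * (2.0 / h) * k        # grad_x k^T S_q(x')
    term4 = (2.0 * d / h - 4.0 * sq / h ** 2) * k        # tr(grad_x grad_{x'} k)
    return term1 + term2 + term3 + term4

def ksd_loss(sampler, score_q, n=256, d_noise=2):
    """Monte-Carlo estimate of the KSD between the sampler's output distribution and q."""
    z = torch.randn(n, d_noise)
    x = sampler(z)
    xp = torch.roll(x, shifts=1, dims=0)     # pair each sample with a different one
    return stein_kernel_rbf(x, xp, score_q).mean()

# Toy usage: train a small network G_theta to sample from a standard Gaussian target.
G = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
score_q = lambda x: -x                       # grad_x log q(x) for q = N(0, I)
opt = torch.optim.Adam(G.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = ksd_loss(G, score_q)
    loss.backward()
    opt.step()
```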

Going back and Approximating Instead: Learning the Stein Discrepancy

In this section, we explore yet another method similar to SVGD. In higher dimensions, using kernel methods tends to be difficult, and learning Stein's discrepancy seems like a possible way to circumvent these difficulties. So we ask ourselves, "Can we use function approximation to approximate this?" And the answer is, "Yes, we can." To quote a known villain, "Reality is what [we would] want it to be." In particular, we can use a critic f_ϕ whose parameters maximise an L2-regularised objective of the form we saw in the SVGD section, but with the kernel replaced by the learned critic. The model can then be trained to minimise the same objective minus the regularising part. This resembles the min-max framework of generative adversarial networks (Goodfellow, I. et al. 2014). The formal algorithm is as follows.
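Schematically, and keeping the notation of the SVGD section (samples drawn from q, target p), the two objectives take roughly the following form, following Grathwohl et al. (2020); the regularisation weight λ is a hyperparameter. The critic is trained to solve

$$
\max_{\phi}\;\mathbb{E}_{x\sim q}\Big[\big(\nabla_x\log p(x)\big)^{\top}f_{\phi}(x)\;+\;\nabla_x\!\cdot\!f_{\phi}(x)\Big]\;-\;\lambda\,\mathbb{E}_{x\sim q}\big[\lVert f_{\phi}(x)\rVert_2^2\big],
$$

while the model (here, the soft Q-function supplying ∇_x log p) is trained to minimise the first, unregularised expectation.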

For our purposes, we shall focus on the model aspect of things. By this, I mean that we shall retain the updates for the Q-function as we had them before, using this as the critic. The reason is that the updates implicitly minimise the discrepancy.

Results and Discussion Thereof

In terms of the dynamics of the game itself, we expect the following behaviour from the agents. This is from a successful run that discovers the optimal equilibrium.

Below, we provide not this behaviour itself but the associated learning curves of the methods, which allow us to comment not only on whether there was convergence (which we know is the case when we achieve a desired score) but also on the rate of convergence. For this reason, they are the preferred mechanism for displaying the results.

First, we start with individual plots for each method's runs, i.e. we plot the reward curve for one of the runs where the agents converged to optimality, if any such run exists, as well as a suboptimal reward curve, where one was obtained. We select these randomly, just to give a clear picture of what individual curves looked like before we look at the averaging below. The first plots, for SVGD, are the following.

The following is the LSD variant, where the reward signal is quite noisy, as we can see.

Lastly, the results for KSD are as follows.

In the following, we did 10 separate runs, then computed the mean for the plots. The number of iterations was 35000 to give a fair chance of convergence, and to also test for stability of the mechanisms. The first method that we look into is SVGD whose results are the following.

This curve shows that the average reward went slightly above the zero reward reference, which is the suboptimal equilibrium. This ‘going over’ the reference simply means that some run achieved optimality, and as it turns out there was exactly one such run. The next result we look at is that of the learned Stein discrepancy mechanism.

Here, unfortunately, the results were not as good as we would have liked. This might have been caused by a couple of things. The first is that we did not really train the critic as detailed in the original algorithm, and we only focused on the model part. Secondly, it could just be the fact that the method is approximation-based, and need not necessarily trump the other options in lower dimensions (even if it should beat them by miles in higher dimensions), especially given the first point. The following is a plot showing results from the method based on the Kernelised Stein Discrepancy.

We see that the performance is similar to that of SVGD. However, the plots below show that the convergence was much faster on average; this result is consistent with the theory, so the observation is one that should give us confidence.

It would seem that modified opponent-modelling-type approaches are necessary to make a significant difference in the learning from the original method used for multiagent soft Q-learning. I am interested to see what exactly the difference was in the runs that were successful in the experiments above. That is, to answer the question: What was different in the starting behaviour that led to a win? Identifying the essence of what is required for functionality would provide insights towards a reliable prescriptive mechanism for extending single-agent algorithms to multiple agents — for cooperative settings.

The key takeaway here is that while improving sampling will aid learning very often, essentially without extra cost, it will not automatically lead to better performance in cases where there are underlying game-theoretic pathologies. In such cases, we need to think about the pathology in consideration, and possibly devise a method that accounts for the kind of ‘thinking’ that is needed given the pathology. The general policies that are possible in multiagent reinforcement learning certainly benefit from the study of samplers. With the advent of scalable methods, it looks like there are promising prospects towards the direction of scalable joint policy searching, especially in the context of entropy regularised multiagent reinforcement learning.

Remarks on Further Strides to Take on This Track

This study is incomplete as it stands, but it certainly offers some direction into an interesting and useful research track. When it comes to things I would have liked to do (in this corner of the investigation) the list is as follows:
1. An experiment-based study on the usage of different kernels. This study used a Gaussian Kernel with a dynamic width.
2. Continuing the study on methods of training energy-based models, in particular using Variational Entropy Regularized Approximate maximum likelihood (Grathwohl, W. et al. 2021) as well as on samplers outside the Stein class.
3. Studying reasonable modifications to value functions that could permit learning in the desired sense. The question we could then ask is: why would these be preferred over factorising the policy itself?
4. A further study on the way in which policies are built. It seems that this is the biggest culprit when it comes to ensuring convergence to the global optimum, and this is an avenue we explored subsequently and will write about at a later stage.
5. A study on the samplers themselves as well as possible generalisations of the methods, as a much bigger theoretical task.

Remarks on Internship and Thanks

I am beyond grateful to InstaDeep for the opportunity to not only develop my own research skills but to do work that may serve as a basis for future projects in the company. It has really been an honour and privilege to work with the Cape Town office — a team so full of ambition and zeal for what they do. I really enjoyed being in an environment that pushed for excellence in all aspects, but that also cared about my well-being and growth. The Cape Town office really opened a space for me to focus on research that has also inspired many projects that I see myself doing in the future. It has also been a privilege to get to interact with members across the different offices, exchanging ideas, and helping each other grow. Of course, I had a lot to learn, but I am grateful for the eagerness and passion that the company and its employees harbour.

The support that InstaDeep is giving to students in the country (and continent) is certainly something that I do not take for granted at all. The company’s strides towards cultivating talent in the continent, and the freedom it gives its offices, are what truly set it apart from many others. I also think that the kind of work that emerges in systems like this is really that which allows companies to not only go with specific agendas but to also start thinking about how they can make a difference in the communities that surround them. I am very excited to have been working in a space that seeks to do the latter — i.e., that does not only function in its own bubble, but, instead, is also constantly trying to make a positive difference in society. I hope that the company will continue to give support to students (and graduates, generally) to pursue research, as they have done for me. I, once again, feel honoured to have been a part of such a dynamic hub of ideas and innovation.

Bibliography

Grathwohl, W., Kelly, J., Hashemi, M., Norouzi, M., Swersky, K. and Duvenaud, D. No MCMC for Me: Amortized Sampling for Fast and Stable Training of Energy-Based Models (2021). Available online : arXiv:2010.04230v3 [cs.LG].

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. Generative Adversarial Networks (2014). Available online : arXiv:1406.2661v1 [stat.ML].

Grathwohl, W., Wang, K., Jacobsen, J., Duvenaud, D. and Zemel, R. Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling (2020). Available online : arXiv:2002.05616v4 [stat.ML].

Liu, Q. Stein Variational Gradient Descent: Theory and Applications (2016). Available online : link.

Palmer, G., Savani, R. and Tuyls, K. Negative Update Intervals in Deep Multi-Agent Reinforcement Learning (2019). Available online : arXiv:1809.05096v3 [cs.MA].

Haarnoja, T., Tang, H., Abbeel, P. and Levine, S. Reinforcement Learning with Deep Energy-Based Policies (2017). Available online : arXiv:1702.08165v2 [cs.LG].

Hu, T., Chen, Z., Sun, H., Bai, J., Ye, M. and Cheng, G. Stein Neural Sampler (2018). Available online : arXiv:1810.03545 [stat.ML].

Tian, Z., Wen, Y., Gong, Z., Punakkath, F., Zou, S. and Wang, Y. A Regularized Opponent Model with Maximum Entropy Objective (2019). Available online : arXiv:1905.08087v2 [cs.MA].

Wei, E., Wicke, D., Freelan, D. and Luke, S. Multiagent Soft Q-Learning (2018). Available online : arXiv:1804.09817v1 [cs.AI].

Wen, Y., Yang, Y., Luo, R., Wang, J. and Pan, W. Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning (2019). Available online : arXiv:1901.09207v2 [cs.LG].
