Reinforcement learning with Skinner

A friendly introduction to the problem of reinforcement learning with examples from neuroscience

--

Reinforcement learning has entered the spotlight recently with accomplishments such as AlphaGo, and is arguably one of our best shots at Artificial General Intelligence, or at least at more general intelligence. In this post, I will trace some of its history back to the study of operant conditioning by B. F. Skinner.

The real question is not whether machines think but whether men do — B. F. Skinner

B. F. Skinner working with an operant conditioning chamber of his own creation.

Skinner wanted to understand how animals develop adaptive behavior, and what rules underlie learning. Many other scientists were interested in conditioning around that time, such as Ivan Pavlov, famous for showing that after repeatedly pairing a bell with food, dogs salivated at the sound of the bell alone.

The main difference between Skinner and his contemporaries was his thoroughness in running tightly controlled experiments. Skinner developed a chamber, now called the operant conditioning chamber or simply the Skinner box, in which the animals, typically rats and pigeons, could be isolated from external sound, smell, and light, and stimulated precisely for each experiment. Marvin Minsky jokingly compared Skinner's meticulousness with Pavlov's experiments in a lab full of caged dogs, which were far less careful and controlled.

Rat inside a conditioning chamber. There are two lights that can be used to stimulate the animal, and two levers the animal can use to respond. The sucrose solution is controlled by an automated system. Image taken from Malkki et al. 2010

Reward and repeat

The animals would receive a specific stimulus such as a light, sound, or smell, and the information in the stimulus could be used to gain food or water (a reinforcer). But to obtain the reinforcer, the animal needed to execute a specific action, choosing correctly from a small set of possible actions. In a discrimination task, for instance, a single light would turn on: if the light was green, the animal would be rewarded for pressing the lever right below it; if the light was red, it would be rewarded for pressing the opposite (contralateral) lever.

T-maze for an operant conditioning task. Image from Smith & Graybiel, 2013.

After some trial and error, the animals started behaving in ways that increased their rate of reward, as if they understood the rules governing the rewards, as if they understood that red means "the other lever". Moreover, if the animals were rewarded at a higher rate, they generally learned faster. Many controlled environments inspired by the Skinner box have since been created. Take for example the T-maze shown here, with a starting location and a decision point. Depending on the sound played (the tone cue), either the left or the right arm contains the reinforcer. The animal eventually learns to choose the correct arm for each sound, increasing its rate of reward.

Responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation — Thorndike’s Law of Effect

The study of operant conditioning is still very active, with many branches in development, such as the dynamics of habit formation: how much training it takes for a behavior to lose flexibility (becoming resistant to devaluation), and what underlying processes are involved. Without getting too deep into the possible algorithms our brains use, this post focuses on delineating the problem itself. Of special interest to those studying artificial intelligence are sequential tasks, in which many actions need to be taken before a reward is obtained.

Sequential tasks

A great example of a sequential task is a maze. There are many other sequential tasks, in which the contingencies at each step depend on the previous ones, but in a maze the sequential structure is laid out in space, which makes it as clear as it can be. Imagine the animal explores the maze until it finds the reward (and is then removed from the maze to start again). Following the law of effect strictly, the animal would try to repeat the same quasi-random path around the maze until it found the reward again in exactly the same way, which is clearly inefficient. In fact, experiments show that animals become more efficient with training, up to the trial in which they go directly to the reward without making any "mistakes".

The problem the animals contend with is the credit assignment problem: how to reinforce the actions that truly help bring about the reward, without reinforcing the actions that just happened to be performed close to it? In fact, there are many recorded cases of pigeons and cats making repeated, completely unnecessary movements before pressing levers (e.g. Guthrie & Horton, 1946), cases in which credit assignment was evidently not optimal. The problem is a big one, and every advance in this direction is a potentially huge improvement for the reinforcement learning systems we build.

To be clear, this is not a marginal problem: it is the central complication tackled by Reinforcement Learning. In this setting actions are distant from rewards, and the "perfect response" may not even be well defined. Compare this with supervised learning, where the correct response is specified and shown at every step. This additional difficulty is exactly what makes Reinforcement Learning so broad, and arguably our best shot at Artificial General Intelligence.

Reinforcement Learning Formalism — A sketch

The RL setting resembles the Skinner box. An Agent has access to one state from a specified set of States (in the previous example, this could be the left green light being on) and may choose some Action (pressing the left lever, pressing the right lever, not pressing, …). Then, after acting in the environment, the Agent receives a Reward (e.g. food, or nothing) and perceives itself in a new State.
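To make this loop concrete, here is a minimal Python sketch of the state, action, reward cycle on a made-up two-light, two-lever task loosely inspired by the Skinner box (the environment and all names are invented for illustration, not taken from any library):

```python
import random

class TwoLeverBox:
    """Toy discrimination task: green light rewards the left lever,
    red light rewards the right lever. Purely illustrative."""

    STATES = ["green_light", "red_light"]
    ACTIONS = ["press_left", "press_right"]

    def reset(self):
        self.state = random.choice(self.STATES)
        return self.state

    def step(self, action):
        # Reward 1 for the "correct" lever, 0 otherwise.
        correct = "press_left" if self.state == "green_light" else "press_right"
        reward = 1 if action == correct else 0
        next_state = random.choice(self.STATES)  # a new trial begins
        self.state = next_state
        return next_state, reward

env = TwoLeverBox()
state = env.reset()
for _ in range(5):
    action = random.choice(TwoLeverBox.ACTIONS)  # a random (untrained) agent
    next_state, reward = env.step(action)
    print(state, action, reward)
    state = next_state
```

A learning agent would replace the random choice with a policy that it keeps improving based on the rewards it receives.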

Learning is: increasing the rate of rewards

Image from Sutton & Barto, 2018

To increase the number of rewards collected during a task, the agent must have some account of which action is best in each state. This gives rise to an optimal policy, a rule for deciding actions that achieves the maximum expected rate of reward. An optimal policy is guaranteed to exist when at least one of the two following criteria is met:

  1. The task is finite, or
  2. Rewards later in the future are less valuable than rewards closer to the present (there is a discount rate).
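For concreteness, under the second criterion the quantity the agent maximizes is the discounted return, written here in the standard Sutton & Barto (2018) notation:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma < 1
```

With a discount rate γ smaller than one, this sum stays bounded even for tasks that never end, which is what makes "maximum expected return" well defined in that case.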

In the case of operant conditioning, the tasks are obviously finite, but that does not imply a lack of discounting. In fact, there is a very active contemporary discussion around delay discounting and its implications for human life, for example its relationship with drug abuse (Bickel & Marsch, 2001).

On the other hand, for artificial agents trained on a continuing task (like playing Minecraft, which is not finite), it is important to use a discount factor strictly smaller than one, to ensure that an optimal policy exists for the agent to learn.

Using the algorithm

Okay, we could not end an introduction to Reinforcement Learning without a little peek at the equations that make it work in computers. I will bypass the formalism and instead give a small, intuitive derivation of an algorithm that can be used to find the optimal policy, to illustrate how quickly we can go from theory to algorithm. Before going into the figure, we only need to build a small intuition about the value of actions:

  1. The best action is the one that maximizes the expected future reward.
  2. If we know the expected future reward for taking each action, then we can always choose the best action.
  3. If we can always choose the best action, then we have reached the optimal policy.

The idea is then to find this value function, which outputs the expected return for taking an action in a state. You will see that we start with the definition of the value function in (1) and end up with the algorithm in (5).
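For reference, the usual definition of the action-value function, presumably the starting point (1) of the figure, is (in Sutton & Barto's notation):

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s,\ A_t = a \right]
```

That is, the expected return obtained by taking action a in state s and following the policy π afterwards.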

The SARSA algorithm with some simplifications. The last equation can be used online, in a loop of interaction with the environment. Here the discount rate is set to 1, and the Q-function is called V for simplicity.
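Since the derivation itself lives in the image, here is what the final update (5) plausibly looks like given the caption's conventions (discount rate equal to 1, Q written as V); this is the standard SARSA update rule:

```latex
V(s, a) \leftarrow V(s, a) + \alpha \left[ r + V(s', a') - V(s, a) \right]
```

Here α is a learning rate, r is the reward just received, and (s', a') are the next state and the action actually chosen in it.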

With this algorithm, it is possible to learn iteratively by interacting with the environment. At each step, the agent observes its state s and takes an action a, updating its values according to the received reward and the next state-action pair. Remember the maze problem? Because it has a discrete and finite set of states (the decision points at the bifurcations) and actions (e.g. go left, go right), we can solve it with a Q-table like the one below.
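As an illustration, here is a minimal, self-contained sketch of tabular SARSA in Python on a made-up corridor "maze" with a reward at the far end (the environment, the hyperparameters, and all names are invented for this example, not taken from any library):

```python
import random
from collections import defaultdict

class CorridorMaze:
    """Toy maze: states 0..length-1 in a line, reward at the last state."""

    def __init__(self, length=6):
        self.length = length
        self.actions = [0, 1]  # 0 = step left, 1 = step right

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos - 1) if action == 0 else self.pos + 1
        done = self.pos == self.length - 1
        # Small cost per step so that shorter paths earn a higher return.
        reward = 1.0 if done else -0.01
        return self.pos, reward, done

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """Mostly pick the highest-valued action, sometimes explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(q[(state, a)] for a in actions)
    return random.choice([a for a in actions if q[(state, a)] == best])

def sarsa(env, episodes=200, alpha=0.5, gamma=1.0):
    q = defaultdict(float)  # the Q-table, indexed by (state, action)
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(q, state, env.actions)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(q, next_state, env.actions)
            # SARSA update: move Q(s, a) toward r + gamma * Q(s', a').
            target = reward + (0.0 if done else gamma * q[(next_state, next_action)])
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action = next_state, next_action
    return q

q_table = sarsa(CorridorMaze())
print({k: round(v, 2) for k, v in sorted(q_table.items())})
```

After training, the greedy action at every interior state is "step right", which is the optimal policy for this toy maze.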

Using this Q-table, the agent will take the left turn at the third bifurcation, since the value of action 1 is the highest there. The table is updated at each step, eventually converging to an optimal policy. Here I show a very simple agent that learns how to balance a pole using this algorithm!

You can see that it needs a lot of repetitions to perform acceptably, but that is partly because this basic setup leaves plenty of room for improvement.

Truncated Q-table for the discretized CartPole. With a not-so-fine discretization of 20 bins per dimension, there is a humongous total of 20⁴ = 160,000 states. In these cases, function approximators are the way to go.
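As a sketch of how such a discretization might be done, here is one way to map the four continuous CartPole observations to bin indices with NumPy; the observation bounds below are rough values I picked for the example, not taken from any particular implementation:

```python
import numpy as np

# Rough, hand-picked bounds for the 4 observation dimensions:
# cart position, cart velocity, pole angle, pole angular velocity.
LOW = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([2.4, 3.0, 0.21, 3.0])
N_BINS = 20

# N_BINS - 1 edges per dimension split each axis into N_BINS bins.
EDGES = [np.linspace(low, high, N_BINS - 1) for low, high in zip(LOW, HIGH)]

def discretize(observation):
    """Map a continuous 4-d observation to a tuple of bin indices,
    usable as a key into a Q-table with up to 20**4 = 160,000 entries."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, EDGES))

print(discretize([0.1, -0.5, 0.02, 1.0]))  # a tuple of four indices in [0, 19]
```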

Because the states are continuous, we could improve a lot over the discretization by using a continuous function approximator such as a neural network instead! Nevertheless, the formulation of the problem stays the same, and introducing that problem was the central aim of this blog post.

Conclusion

This post was a very short bridging introduction to reinforcement learning and operant conditioning. I intend to write follow-ups going deeper into the theory and math underlying both, showing increasingly better and more complex algorithms and relating them to neuroscience.

I hope to get you as interested in Reinforcement Learning as I am! I believe (as a lot of people do) that neuroscience has a lot to offer the field of Artificial Intelligence, especially with high-level insights. Please comment and give feedback, and thank you for reading!

Further reading

This post by neptune.ai has a lot of interesting resources to dive deeper into the field of reinforcement learning, from tutorials to full courses.

If you want to learn more about neuroscience, try The Spike here on Medium.

References

Bickel, W. K., & Marsch, L. A. (2001). Toward a behavioral economic understanding of drug dependence: delay discounting processes. Addiction, 96(1), 73–86.

Dam, G., Kording, K., & Wei, K. (2013). Credit assignment during movement reinforcement learning. PLoS One, 8(2), e55352.

Guthrie, E. R., & Horton, G. P. (1946). Cats in a puzzle box.

Malkki, H. A., Donga, L. A., De Groot, S. E., Battaglia, F. P., & Pennartz, C. M. (2010). Appetitive operant conditioning in mice: heritability and dissociability of training stages. Frontiers in Behavioral Neuroscience, 4, 171.

Smith, K. S., & Graybiel, A. M. (2013). A dual operator view of habitual behavior reflecting cortical and striatal dynamics. Neuron, 79(2), 361–374.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
