Original Research: Human Reinforcement Learning in Static vs Dynamic Environments

The Curious Learner

In this series on Original Research, I will be sharing my findings from some of the mini-projects that I have carried out on my own.

In life, we are accustomed to the idea that practicing on variations of the same task prepares us for scenarios that we have yet to encounter. For example, students learning algebra should solve different types of problems to understand the underlying mathematical concepts, and pilots learning to fly a plane should be exposed to different in-flight emergencies to know what to do when something out of the norm occurs.

These examples are the embodiment of function learning, which is essentially reinforcement learning based on specific functions (Schulz et al., 2016). However, what happens when there is an unexpected change in the function that has been learnt? Will people be able to adapt to the change? Do people really adapt better if they have been learning in an environment with variations?

This study attempted to investigate these questions by examining how reinforcement learning works in a spatially correlated multi-armed bandit task that undergoes a function change. The function change is represented by a change in either the state function or reward function. Before the function change, participants were also subjected to either a static environment (where the ‘training’ function remains the same throughout) or a dynamic environment (where the ‘training’ function had slight variations).

Research Questions & Hypotheses

RQ1: Does a dynamic environment really prepare people for function changes?

RQ2: Do people learn to adapt faster when the state or reward function is changed?


Experimental Design

149 participants were recruited through MTurk to play a Submarine Commander game in which enemy ships worth different amounts of points were hidden throughout the map, based on the functions of the assigned group. The participants' task was to locate and fire at ships to earn as many points as possible over 9 blocks. Participants were randomly assigned to one of four groups (Static-StateChange, Dynamic-StateChange, Static-RewardChange and Dynamic-RewardChange), which determined the functions they experienced for the first 8 blocks. In the 9th block, all groups experienced a change that resulted in the same final function. Participants received a base payment for completing the study, and a bonus payment depending on the points they earned.

In the images below, each coloured state function on the participant's display represents where the torpedoes will land in a single block, while the reward function on the right represents the number of points earned for the distance travelled by the torpedo. It is important to remember that the participants are not actually able to see these functions, but may figure them out through trial and error.
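To give a feel for the setup, a smooth, spatially correlated function can be simulated by smoothing random noise with a Gaussian kernel. This is only an illustrative sketch: the actual functions used in the study are not reproduced here, and the position range, bandwidth and 0–100 point scale are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_function(n_positions=100, bandwidth=10, rng=rng):
    """Generate a smooth, spatially correlated function by
    convolving white noise with a Gaussian kernel.
    (Illustrative only -- the study's actual functions differed.)"""
    noise = rng.normal(size=n_positions)
    xs = np.arange(-3 * bandwidth, 3 * bandwidth + 1)
    kernel = np.exp(-0.5 * (xs / bandwidth) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(noise, kernel, mode="same")
    # Rescale to a 0-100 point range for readability
    lo, hi = smoothed.min(), smoothed.max()
    return 100 * (smoothed - lo) / (hi - lo)

reward_function = smooth_function()  # points earned per distance travelled
state_function = smooth_function()   # where torpedoes land per position
```

Because neighbouring positions have similar values, a participant who samples a few positions can generalise to nearby ones, which is what makes trial-and-error learning of these functions feasible.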

For the State Change condition, participants in the Dynamic Environment experienced a different state function on every block, and received the red state function in the final block. Participants in the Static Environment, on the other hand, were given one randomly chosen state function throughout the blocks, and then received the same red state function as the Dynamic Environment participants in the final block. The reward function for this group never changed, and the high points were mostly in the mid-distance region.

For the Reward Change condition, participants in the Dynamic Environment experienced a different reward function on every block, and received the red reward function in the final block. As the state function for this group never changed, the location of the high points was always changing, appearing in the mid-distance region in the final block. Participants in the Static Environment, on the other hand, were given one randomly chosen reward function throughout the blocks, and then received the same red reward function as the Dynamic Environment participants in the final block.

Analysis of Block Scores

The most straightforward way of testing the hypotheses is to compare the scores in the final block, which is the block where all groups experienced a change. However, the results showed no significant differences between the different groups and conditions.

I was then interested in how the different groups performed over the 9 blocks, and conducted a Repeated-Measures ANOVA to test for any differences. The plots below clearly show that all groups improved over the blocks, suggesting that participants learnt how to obtain higher points over time. This includes the Dynamic groups, despite the changing functions on every block. A closer examination of the plots showed a sudden drop in scores for the Static groups from Block 8 to 9, while the Dynamic groups stayed more or less the same. The score difference between Blocks 8 and 9 was then computed for further investigation.

A comparison of the score difference between Blocks 8 and 9 revealed a significant difference between the Static and Dynamic groups (below). The Dynamic groups were not affected by the function change in the final block, but the Static groups, who were experiencing a function change for the first time, suffered a shock in performance in the final block.
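The shape of this comparison can be sketched with simulated data. The scores, group sizes and the Welch's t-test below are all hypothetical stand-ins (the study's actual test statistics and per-group data are not reproduced here); the simulation is only built to mimic the qualitative pattern of a Static-group drop at the function change.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical block scores (participants x 9 blocks), with a general
# upward learning trend and a shock for the Static groups at Block 9
static_scores = rng.normal(60, 5, size=(40, 9)) + np.linspace(0, 15, 9)
static_scores[:, 8] -= 20                      # drop at the function change
dynamic_scores = rng.normal(60, 5, size=(40, 9)) + np.linspace(0, 15, 9)

# Per-participant score difference between Block 9 and Block 8
static_diff = static_scores[:, 8] - static_scores[:, 7]
dynamic_diff = dynamic_scores[:, 8] - dynamic_scores[:, 7]

# Compare the Block 8 -> 9 change between environments (Welch's t-test)
t_stat, p_value = stats.ttest_ind(static_diff, dynamic_diff, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Taking the within-participant difference first, then comparing groups, keeps the test focused on the reaction to the change rather than on overall score levels.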

Analysis of Exploration Behaviour

Another interesting dependent measure was how much participants explored for high points over the course of the experiment. One would expect the Static groups to explore less over time, once they figured out that the high points rarely changed location. The Exploration Index was calculated as the standard deviation of the positions chosen by a participant within a block.
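The index is straightforward to compute; a minimal sketch (with made-up position choices) is:

```python
import numpy as np

def exploration_index(positions):
    """Exploration Index: standard deviation of the positions a
    participant chose within one block. Higher = more exploration."""
    return np.std(positions)

# A participant who samples widely scores higher than one who
# keeps firing at roughly the same spot (positions are hypothetical).
wide = exploration_index([5, 40, 75, 20, 90])
narrow = exploration_index([50, 52, 49, 51, 50])
```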

As can be seen from the plots above, exploration behaviour generally decreased over the blocks as the participants became familiar with the task. There was a significant interaction effect between Blocks and Type of Environment, indicating that the type of environment may influence exploration behaviour over time. In this case, it was most likely due to the decrease in exploration behaviour by the Static groups over time.

Interestingly, a comparison of exploration behaviour between the groups in the final block revealed that the Reward Change groups seemed to explore less (above), suggesting that a reward change may be easier to adapt to than a state change. This seems to be in line with the hypothesis for the second research question.

Analysis of Learning Over Trials

To better understand how participants in each group learnt within each block, across all the blocks, the mean scores for every trial were analysed (below).

The Repeated-Measures ANOVA results showed significant interactions between Trials and Type of Environment, and between Trials and Type of Function Change, indicating that both factors may influence learning within a block. As can be seen from the trends in the plots, participants generally became better over trials and blocks, but the Dynamic groups (top right plot) usually required the first 2 trials to gather information and find the high points, even as they progressed to later blocks.
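The within-block learning curves come from aggregating trial-level scores by trial, separately per block. A sketch with simulated data (participant counts, trial counts and the score model below are all hypothetical, chosen only to show the aggregation step):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical trial-level data: participant, block, trial, score,
# with scores improving both within a block and across blocks
records = []
for pid in range(10):
    for block in range(1, 10):
        for trial in range(1, 11):
            score = 40 + 2 * trial + block + rng.normal(0, 5)
            records.append((pid, block, trial, score))
df = pd.DataFrame(records, columns=["participant", "block", "trial", "score"])

# Mean learning curve: average score at each trial, one row per block
learning_curves = df.groupby(["block", "trial"])["score"].mean().unstack("trial")
```

Each row of `learning_curves` is then one of the within-block curves plotted above, making it easy to see whether early trials are being spent on information gathering.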


Conclusion

Despite having to learn a new variation on every block, the Dynamic groups were still able to improve over time, a sign that learning is possible in an ever-changing environment. Even though the Dynamic groups did not obtain a significantly higher average score than the Static groups in the final block, they arguably did better in the sense that they were unaffected by the change from Block 8 to 9, in contrast to the large drop in the Static groups. This finding suggests that a dynamic environment may help build a more adaptable causal structure that is resilient to change.


Acknowledgements

I would like to thank my supervisor, Dr Maarten Speekenbrink, for showing interest in the ideas I proposed, guiding me in the design of the study and helping me interpret the results. I would also like to thank my other supervisor, Dr Eric Schulz, for his patience in helping me solve the problems I encountered, and for the encouragement and valuable feedback that helped me learn, despite being based outside the UK and working on his PhD thesis at the time of this project. Funding for the study came from the UCL Department of Experimental Psychology.

This research was presented at the 3rd Advances in Decision Analysis Conference of the Decision Analysis Society, held on June 19–21, 2019 at Bocconi University, Milan, Italy, and won the Best Poster Award.


References

  • Gläscher, J., Daw, N., Dayan, P., & O’Doherty, J. P. (2010). States versus rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4), 585–595.
  • Schulz, E., Konstantinidis, E., & Speekenbrink, M. (2016). Putting bandits into context: How function learning supports decision making. bioRxiv, 081091.
