There are plenty of papers demonstrating the superior performance of their reinforcement learning (RL) algorithms. One paper, "Deep Reinforcement Learning that Matters", published in 2018, shows that some claims of out-performance may simply be due to the randomness of the RL algorithm. The experiments in the paper show that even the same RL algorithm with the same hyper-parameters can produce drastically different results under different random seeds.
In RL, the randomness comes from multiple sources:
1. Randomness in the action space of the gym
2. Randomness of the starting state of the gym
3. Randomness of the initial value of the parameters of the deep learning network
4. Random component in the expected value of the actions
The general belief is that a sufficient number of trials can lead the RL algorithm to converge to a stable state where the impact of the randomness of the initial parameters is minimized. However, there is no guideline on how many trials are sufficient for convergence.
The difference between a reproducible result and a deterministic result
A reproducible result does not require removing the randomness completely. As long as the sequence of random numbers can be easily reproduced, the final outcome usually can be reproduced as well. A deterministic result, on the other hand, requires the randomness to be removed completely. In some test environments, the agent may be required to generate deterministic results.
Modern computers use a pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), to generate random number sequences. A PRNG is an algorithm for generating a sequence of numbers whose properties approximate those of a sequence of truly random numbers. The common practice is to set the seed of the PRNG to generate a reproducible sequence of random numbers.
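This seeding practice can be illustrated with Python's built-in PRNG (a Mersenne Twister): the numbers look random, but re-seeding with the same value replays the exact same sequence.

```python
import random

# Seeding the PRNG makes the "random" sequence reproducible:
# the same seed always yields the same numbers.
random.seed(42)
first_run = [random.random() for _ in range(5)]

random.seed(42)
second_run = [random.random() for _ in range(5)]

assert first_run == second_run  # identical sequences from the same seed
```

The same principle underlies seeding numpy, TensorFlow, and the gym environments discussed below.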
The benefit of introducing randomness to RL
The randomness of the starting state of the gym ensures the RL algorithm is trained on various states, so it can construct an optimal policy under all scenarios.
The random component in the expected value of the actions is essential for learning. It facilitates the balance between exploration and exploitation. Without the randomness, the agent could easily get stuck on bad actions and never explore other options. Removing it also does not quite fit the theory behind some RL algorithms, e.g. PPO (the definition expects actions sampled according to the policy, not deterministic actions, although PPO's policy clipping might help fix this).
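The contrast between sampling from the policy and always taking the best-looking action can be sketched in plain Python. The probabilities below are a hypothetical policy output, not taken from any real model.

```python
import random

def sample_action(probs, rng):
    # Draw an action index according to the policy's probabilities
    # (exploration), rather than always taking the argmax.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.6, 0.3, 0.1]   # hypothetical policy output over 3 actions
rng = random.Random(0)    # seeded so the stochastic run is reproducible

sampled = [sample_action(probs, rng) for _ in range(1000)]
greedy = [max(range(len(probs)), key=probs.__getitem__)] * 1000

explored = set(sampled)   # the stochastic policy tries several actions
exploited = set(greedy)   # the deterministic argmax only ever picks action 0
```

Note that seeding the sampler keeps the run reproducible without giving up exploration, which is exactly the distinction between a reproducible and a deterministic result.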
In some circumstances, the multiple sources of randomness in RL algorithms are not desirable, especially when the purpose of the experiments is to compare performance. The random seeds need to be fixed between experiments to properly measure performance. Here the gym environment "CartPole-v1" is used to demonstrate how to reproduce the result. The RL algorithms are implemented with the Python package stable-baselines 2.8.0.
import gym
from stable_baselines import PPO2
from stable_baselines.common import set_global_seeds
from stable_baselines.common.vec_env import DummyVecEnv

def make_env(env_id, rank, seed=0):
    # helper returning a thunk that creates one seeded environment
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    return _init

# set the seed
seed = 1
env_id = 'CartPole-v1'
n_cpu = 1
env = DummyVecEnv([make_env(env_id, 0) for i in range(n_cpu)])
set_global_seeds(seed)
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000, seed=seed)

# set the seed for testing
seed = 2
env.envs[0].seed(seed)
obs = env.reset()
for count in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
The function set_global_seeds calls the TensorFlow function tf.set_random_seed to generate deterministic initial values for the coefficients of the deep learning network.
Set the deterministic sequence of random numbers for the starting positions:

env.envs[0].seed(seed)

The parameter deterministic switches the predict function to deterministic mode, generating deterministic actions without randomness in the testing phase:

model.predict(obs, deterministic=True)

However, the settings above only control three sources of randomness:
1. Randomness of the starting state of the gym
2. Randomness of the initial parameters of the deep learning network
3. Random component in the expected value of the actions
In the new version of stable-baselines, 2.9.0, the PPO constructor has two extra parameters, seed and n_cpu_tf_sess. In this new setting, the function tf_util.make_session sets two parameters of the TensorFlow session (tf.Session), inter_op_parallelism_threads and intra_op_parallelism_threads, to 1. There are several possible forms of parallelism when running a TensorFlow graph, and these options provide some control over multi-core CPU parallelism. In order to get a reproducible result, parallelism is not allowed. Further work is needed to understand this issue.
tf_config = tf.ConfigProto(
    allow_soft_placement=True,
    inter_op_parallelism_threads=num_cpu,
    intra_op_parallelism_threads=num_cpu)
Furthermore, the randomness of the action space is controlled by calling action_space.seed. The parameter seed of the function learn will be removed in the new version 2.9.0.
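The idea behind seeding the action space separately can be shown with a minimal stand-alone sketch. The class below is an illustrative stand-in for gym's Discrete action space, not the gym API itself: the space keeps its own RNG, so seeding it makes its samples reproducible independently of the environment's other randomness.

```python
import random

class DiscreteActionSpace:
    """Minimal stand-in for gym's Discrete action space (illustrative only)."""
    def __init__(self, n):
        self.n = n
        self._rng = random.Random()

    def seed(self, seed):
        # Analogous in spirit to gym's action_space.seed(seed): the space
        # owns its RNG, separate from the environment dynamics RNG.
        self._rng.seed(seed)

    def sample(self):
        # Draw a random action index in [0, n)
        return self._rng.randrange(self.n)

space = DiscreteActionSpace(2)
space.seed(7)
run_a = [space.sample() for _ in range(10)]
space.seed(7)
run_b = [space.sample() for _ in range(10)]
assert run_a == run_b  # re-seeding the action space replays its samples
```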
In order to complete the experiment, the PPO2 constructor in version 2.8.0 is modified by adding a few lines:
self.seed = 2
self.set_random_seed(self.seed)
n_cpu = 1
self.sess = tf_util.make_session(num_cpu=n_cpu, graph=self.graph)

After making all these changes, the actions in the train/learn step are deterministically random, and the results of the RL experiment are reproducible.
