Reinforcement Learning: Dealing with Sparse Reward Environments

  1. Curiosity-Driven Methods:
    a. Intrinsic Curiosity Model [3,4]
    b. Curiosity in model-based RL [5]
  2. Curriculum Learning
    a. Automatic generation of easy goals [6]
    b. Learning to select easy tasks [7]
  3. Auxiliary Tasks [8]

Sparse Reward Task

Reward Shaping

Curiosity-Driven Methods

Intrinsic curiosity-driven exploration by self-supervised prediction

  1. The forward dynamics model predicts the next state s_t+1 from the current state s_t and the selected action a_t. The model cannot predict the correct future state if the current state or the taken action has not been encountered before, so the deviation between its prediction and the actual next state serves as a measure of novelty. If the agent continually improves its prediction while at the same time seeking out states where its prediction is still wrong, it will keep taking actions that lead it to new states (a minimal sketch of both models follows below).
  2. The inverse model predicts the action a_t that caused the transition from state s_t to s_t+1, where the two states are the inputs of the network. The idea of the inverse model is that the ICM is encouraged to embed only those features of the observations that are relevant for predicting the corresponding action. In this way, the agent does not focus on information in the input space that does not influence its choice of action.
    Note: If you have direct access to the full state space, the inverse model is no longer required.
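To make the interplay of the two models concrete, here is a minimal PyTorch-style sketch (my own simplification, not the authors' code); the encoder architecture, hidden sizes, and the scaling factor eta are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Minimal sketch of an Intrinsic Curiosity Module (hyperparameters are illustrative)."""

    def __init__(self, obs_dim, n_actions, feat_dim=32, eta=0.1):
        super().__init__()
        self.eta = eta  # scaling of the intrinsic reward (assumed value)
        # phi: encodes raw observations into a feature space
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # inverse model: predicts a_t from phi(s_t) and phi(s_t+1)
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        # forward model: predicts phi(s_t+1) from phi(s_t) and a one-hot encoding of a_t
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, feat_dim))

    def forward(self, s_t, s_next, a_t):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = F.one_hot(a_t, self.inverse[-1].out_features).float()

        # L_I: the inverse model should recover the action that was actually taken
        logits = self.inverse(torch.cat([phi_t, phi_next], dim=-1))
        L_I = F.cross_entropy(logits, a_t)

        # L_F: the forward model should predict the next feature vector
        phi_next_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        L_F = F.mse_loss(phi_next_pred, phi_next.detach())

        # intrinsic reward: prediction error of the forward model in feature space
        r_intrinsic = self.eta * (phi_next_pred - phi_next).pow(2).sum(dim=-1).detach()
        return r_intrinsic, L_I, L_F
```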
  1. L_I: measures the discrepancy between the action predicted by the inverse dynamics model and the real action a_t; we try to minimize this term of the objective function.
  2. L_F: decreasing this term of the objective function improves the prediction of the forward dynamics model.
  3. R: the expected cumulative extrinsic reward, which the policy is trained to maximize.
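Putting the three terms together, the ICM paper by Pathak et al. combines them into a single optimization problem (λ weights the policy's reward maximization against the intrinsic losses, and β trades off the forward against the inverse loss); the sketch below restates that objective in the paper's notation:

```latex
% Overall ICM objective (Pathak et al.):
% theta_P: policy parameters, theta_I: inverse-model parameters, theta_F: forward-model parameters
\min_{\theta_P,\,\theta_I,\,\theta_F}
\Big[ -\lambda\, \mathbb{E}_{\pi(s_t;\theta_P)}\Big[\textstyle\sum_t r_t\Big]
      + (1-\beta)\, L_I + \beta\, L_F \Big],
\qquad 0 \le \beta \le 1,\ \lambda > 0
```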

Planning to Explore via Self-Supervised World Models

  1. The authors point out that methods like the intrinsic curiosity model rely on a model-free exploration policy, which requires a large amount of data when it is adapted to a specific downstream task.
  2. Another downside of previous curiosity methods is that curiosity is computed retrospectively, i.e., for states that have already been visited. The agent therefore tends to seek out states it has already seen instead of genuinely new ones.
  3. Instead of seeking out states where the agent's prediction of the next state has a high error compared to the real future state, Sekar et al. use an ensemble of dynamics models and take the disagreement between their predicted next states as the exploration signal (a sketch of this disagreement reward follows below).
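As an illustration of this idea, the following sketch (my own simplification, not the Plan2Explore implementation) computes an intrinsic reward as the variance across the next-state predictions of an ensemble of one-step dynamics models; the network sizes and toy dimensions are assumptions:

```python
import torch
import torch.nn as nn

def make_dynamics_model(obs_dim, act_dim, hidden=64):
    """One member of the ensemble: predicts the next state from (state, action)."""
    return nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))

def disagreement_reward(ensemble, s_t, a_t):
    """Intrinsic reward = disagreement (variance) of the ensemble's next-state predictions."""
    x = torch.cat([s_t, a_t], dim=-1)
    preds = torch.stack([model(x) for model in ensemble])  # (n_models, batch, obs_dim)
    variance = preds.var(dim=0)                            # per-dimension disagreement
    return variance.mean(dim=-1)                           # one scalar reward per batch element

# Usage: an ensemble of K models trained on the same transitions but from
# different initializations; the planner is rewarded where they disagree.
ensemble = [make_dynamics_model(obs_dim=8, act_dim=2) for _ in range(5)]
s = torch.randn(16, 8)   # batch of states (toy dimensions, assumed)
a = torch.randn(16, 2)   # batch of actions
r_intrinsic = disagreement_reward(ensemble, s, a)
```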

Curriculum learning

Automatic Goal Generation for Reinforcement Learning

Auxiliary Tasks

  1. Pixel Changes: The idea is that rapidly changing pixels are an indicator of significant events. The agent tries to control the change of pixels by choosing the right actions (see the sketch after this list).
  2. Network Features: Here, the agent tries to control the activations of the hidden layers of its value and policy networks. Because those networks can typically extract high-level features, it can be useful if the agent learns to control their activations.
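As a rough illustration of the pixel-change signal (my own sketch; the grid size and the use of average pooling are assumptions, not the exact UNREAL recipe), a per-region pseudo-reward can be computed like this:

```python
import torch
import torch.nn.functional as F

def pixel_change_reward(frame_t, frame_next, grid=4):
    """Auxiliary pseudo-reward: mean absolute pixel change, averaged over grid cells.

    frame_t, frame_next: tensors of shape (C, H, W) with H == W divisible by `grid`.
    Returns a (grid, grid) tensor of per-cell changes that an auxiliary policy
    could be trained to maximize.
    """
    diff = (frame_next - frame_t).abs().mean(dim=0, keepdim=True)  # (1, H, W), averaged over channels
    cell = diff.shape[-1] // grid
    # average-pool the absolute difference into a coarse grid of regions
    return F.avg_pool2d(diff.unsqueeze(0), kernel_size=cell).squeeze()

# Toy usage with random 84x84 RGB frames
f0, f1 = torch.rand(3, 84, 84), torch.rand(3, 84, 84)
changes = pixel_change_reward(f0, f1, grid=4)  # 4x4 grid of pixel-change pseudo-rewards
```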
  1. Pixel Control: An auxiliary policy is trained to maximize the change in pixel intensity in different regions of the input image.
  2. Reward Prediction: Given three frames from the replay buffer, the network tries to predict the reward for the next, unseen frame. Because the reward is sparse, skewed sampling is applied so that frames where a reward was given are over-represented (a sampling sketch follows after this list). The purpose of the reward predictor is solely to shape the feature layers of the agent that convert the high-dimensional input space into a low-dimensional latent space.
  3. Value Function Replay (it is unclear whether Jaderberg et al. see value function replay as an additional task or rather just an optimization strategy for training): The agent's value function is additionally trained on samples from the replay buffer, on top of the on-policy value function training in A3C. The value iteration is performed over varying frame lengths and exploits the newly discovered features shaped by the reward predictor.
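A minimal sketch of the skewed sampling used for the reward-prediction task (the 50/50 split and the buffer interface are assumptions for illustration):

```python
import random

def sample_reward_prediction_batch(replay_buffer, batch_size=32, p_rewarding=0.5):
    """Sample 3-frame sequences such that roughly half of them end in a non-zero reward.

    `replay_buffer` is assumed to be a list of (frames, reward) pairs, where `frames`
    are the three frames preceding the step whose reward should be predicted.
    """
    rewarding = [item for item in replay_buffer if item[1] != 0]
    non_rewarding = [item for item in replay_buffer if item[1] == 0]
    n_rewarding = int(batch_size * p_rewarding)

    batch = random.choices(rewarding, k=min(n_rewarding, batch_size)) if rewarding else []
    batch += random.choices(non_rewarding, k=batch_size - len(batch)) if non_rewarding else []
    random.shuffle(batch)
    return batch  # each element: (three_frames, reward_of_the_following_frame)
```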

Conclusion

References
