Reinforcement Learning in NIPS 2018

Yuxi Li
12 min read · Sep 23, 2018


This is a selection of papers and workshops about reinforcement learning (RL) and related topics at NIPS 2018. The RL papers are categorized by topic: value function, policy, reward, model, exploration, unsupervised learning, hierarchical RL, multi-agent RL, learning to learn, safety, applications, and more. A few papers about deep learning, machine learning, and AI are also included. Within each topic, papers are ordered alphabetically by the first author's last name. The invited talks and tutorials are all excellent, and there are not many of them, so I do not list them here. There is also an AutoML Challenge.

See my previous blog post, Resources for Deep Reinforcement Learning.

Deep Learning

  • Jianlong Chang, Jie Gu, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Structure-aware convolutional neural networks.
  • Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction.
  • Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric Xing. Deep generative models with learnable knowledge constraints.
  • Nan Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Laurent Charlin, Michael Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding.
  • Louis Kirsch, Julius Kunze, and David Barber. Modular networks: Learning to decompose neural computation.
  • Ashish Kumar, Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Visual memory for robust path following.
  • Isaac Lage, Andrew Ross, Samuel J Gershman, Been Kim, and Finale Doshi-Velez. Human-in-the-loop interpretability prior.
  • Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study.
  • David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks.
  • Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Josh Tenenbaum, and Daniel Yamins. A flexible neural representation for physics prediction.
  • Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (no, it is not about internal covariate shift).
  • Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Generative adversarial examples.
  • Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units.
  • Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study.

Machine Learning

  • Bart van Merrienboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. Automatic differentiation in ML: Where we are and where we should be going.

AI

  • Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming.
  • Ofir Marom and Benjamin Rosman. Zero-shot transfer with deictic object-oriented representation in reinforcement learning.
  • Rasmus Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks.
  • Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Tim Lillicrap. Relational recurrent neural networks.
  • Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene?
  • Aviv Tamar, Pieter Abbeel, Ge Yang, Thanard Kurutach, and Stuart Russell. Learning plannable representations with causal InfoGAN.
  • Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural-Symbolic VQA: Disentangling reasoning from vision and language understanding.
  • Lisa Zhang, Gregory Rosenblatt, Ethan Fetaya, Renjie Liao, William Byrd, Matthew Might, Raquel Urtasun, and Richard Zemel. Neural guided constraint logic programming for program synthesis.

Value Function

  • Steven Hansen, Alexander Pritzel, Pablo Sprechmann, Andre Barreto, and Charles Blundell. Fast deep reinforcement learning using online adjustments from the past.
  • Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael Jordan. Is Q-learning provably efficient?
  • Tyler Lu, Craig Boutilier, and Dale Schuurmans. Non-delusional Q-learning and value-iteration.
  • Motoya Ohnishi, Masahiro Yukawa, Mikael Johansson, and Masashi Sugiyama. Continuous-time value function approximation in reproducing kernel Hilbert spaces.
  • Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors.
  • Tom Zahavy, Matan Harush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. Learn what not to learn: Action elimination with deep reinforcement learning.

Policy

  • Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning.
  • Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients.
  • Nathan Kallus. Balanced policy evaluation and learning.
  • Shauharda Khadka and Kagan Tumer. Evolutionary reinforcement learning.
  • Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation.
  • Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning.
  • Alberto Maria Metelli, Matteo Papini, Francesco Faccio, and Marcello Restelli. Policy optimization via importance sampling.
  • Paavo Parmas. Total stochastic gradient algorithms and applications in reinforcement learning.
  • Wen Sun, Geoffrey Gordon, Byron Boots, and J. Bagnell. Dual policy iteration.
  • Qiang Liu, Lihong Li, Ziyang Tang, and Denny Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation.
  • Andrea Tirinzoni, Marek Petrik, Xiangli Chen, and Brian Ziebart. Policy-conditioned uncertainty sets for robust Markov decision processes.
  • Jack Umenberger and Thomas B Schön. Learning convex bounds for linear quadratic control policy synthesis.
  • Angela Zhou and Nathan Kallus. Confounding-robust policy improvement.

Reward

Imitation learning; inverse reinforcement learning; reward shaping; etc.

  • Stuart Armstrong and Sören Mindermann. Impossibility of deducing preferences and rationality from human policy.
  • Justin Fu, Sergey Levine, Dibya Ghosh, Larry Yang, and Avi Singh. An event-based framework for task specification and control.
  • Mike Gimelfarb, Scott Sanner, and Chi-Guhn Lee. Reinforcement learning with multiple experts: A Bayesian model combination approach.
  • Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching inverse reinforcement learners via features and demonstrations.
  • Jiexi Huang, Fa Wu, Doina Precup, and Yang Cai. Learning safe policies with expert guidance.
  • Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Maximum causal Tsallis entropy imitation learning.
  • Jan Leike, Borja Ibarz, Dario Amodei, Geoffrey Irving, and Shane Legg. Reward learning from human preferences and demonstrations in Atari.
  • Jorge Armando Mendez, Shashank Shivkumar, and Eric Eaton. Lifelong inverse reinforcement learning.
  • Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K Ho, and Sanjit Seshia. Learning task specifications from demonstrations.
  • Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods.

Model

  • Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for end-to-end planning and control.
  • Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Data-efficient model-based reinforcement learning with deep probabilistic dynamics models.
  • Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J. Zico Kolter. End-to-end differentiable physics for learning and control.
  • David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution.
  • Nick Haber, Damian Mrowca, Stephanie Wang, Li Fei-Fei, and Daniel Yamins. Learning to play with intrinsically-motivated, self-aware agents.
  • Amir-massoud Farahmand. Iterative value-aware model learning.

Exploration

  • Maria Dimakopoulou, Ian Osband, and Benjamin Van Roy. Scalable coordinated exploration in concurrent reinforcement learning.
  • Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes.
  • Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies.
  • Zhang-Wei Hong, Tzu-Yun Shann, Shih-Yang Su, Yi-Hsiang Chang, Tsu-Jui Fu, and Chun-Yi Lee. Diversity-driven exploration strategy for deep reinforcement learning.
  • Raksha Kumaraswamy, Matthew Schlegel, Adam White, and Martha White. Context-dependent upper-confidence bounds for directed exploration.
  • Vashisht Madhavan, Felipe Petroski Such, Jeff Clune, Kenneth Stanley, and Joel Lehman. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents.
  • Jungseul Ok, Damianos Tranos, and Alexandre Proutiere. Exploration in structured reinforcement learning.
  • Ian Osband, John S Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning.

Exploration: Bandits

  • Ilai Bistritz and Amir Leshem. Distributed multi-player bandits — a Game of Thrones approach.
  • Mario Bravo, David Leslie, and Panayotis Mertikopoulos. Bandit learning in concave n-person games.
  • Lixing Chen, Jie Xu, and Zhuo Lu. Contextual combinatorial multi-armed bandits with volatile arms and submodular reward.
  • Shi Dong and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling for large action spaces.
  • Bianca Dumitrascu, Barbara Engelhardt, and Karen Feng. PG-TS: Improved Thompson sampling for logistic contextual bandits.
  • Dylan Foster and Akshay Krishnamurthy. Contextual bandits with surrogate losses: Margin bounds and efficient algorithms.
  • Dalin Guo and Angela J Yu. Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task.
  • Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits.
  • Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene?
  • Yi Qi, Hongning Wang, and Qingyun Wu. Bandit learning with implicit feedback.
  • Virag Shah, Jose Blanchet, and Ramesh Johari. Bandit learning with positive externalities.
  • Han Shao, Xiaotian Yu, Irwin King, and Michael Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs.
  • Siwei Wang and Longbo Huang. Multi-armed bandits with compensation.
  • Julian Zimmert and Yevgeny Seldin. Factored bandits.

Unsupervised Learning/Self-Supervised Learning

  • Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching YouTube.
  • Vikash Goel, Jameson Weng, and Pascal Poupart. Unsupervised video object segmentation for deep reinforcement learning.
  • Piotr Mirowski, Matt Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. Learning to navigate in cities without a map.
  • Ashvin Nair, Vitchyr Pong, Shikhar Bahl, Sergey Levine, Steven Lin, and Murtaza Dalal. Visual goal-conditioned reinforcement learning by representation learning.
  • Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning.

Hierarchical RL

  • Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning.
  • Matthew Riemer, Miao Liu, and Gerald Tesauro. Learning abstract options.

Multi-agent RL

  • Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, and Mehryar Mohri. Policy regret in repeated games.
  • Yanlin Han and Piotr Gmytrasiewicz. Learning others’ intentional models in multi-agent settings using interactive POMDPs.
  • Edward Hughes, Joel Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, Heather Roff, and Thore Graepel. Inequity aversion improves cooperation in intertemporal social dilemmas.
  • Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation.
  • Marc Lanctot, Sriram Srinivasan, Vinicius Zambaldi, Julien Perolat, Karl Tuyls, Remi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments.
  • Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Credit assignment for collective multiagent RL with global rewards.
  • Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning.
  • Hoi-To Wai, Zhaoran Wang, Zhuoran Yang, and Mingyi Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization.
  • Yan Zhang, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A deep Bayesian policy reuse approach against non-stationary agents.
  • Zhengyuan Zhou, Panayotis Mertikopoulos, Susan Athey, Nicholas Bambos, Peter W Glynn, and Yinyu Ye. Multi-agent online learning with asynchronous feedback loss.

Learning to Learn

Few/One/Zero-shot Learning; Transfer Learning; Multi-task Learning; Learning to Optimize; Learning to Reinforcement Learn; AutoML

  • Aniket Bajpai, Sankalp Garg, and Mausam. Transfer of deep reactive policies for MDP planning.
  • Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction.
  • Tao Chen, Adithyavairavan Murali, and Abhinav Gupta. Hardware conditioned policies for multi-robot transfer learning.
  • Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs.
  • Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients.
  • Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport.
  • Shichen Liu, Mingsheng Long, Jianmin Wang, and Michael Jordan. Generalized zero-shot learning with deep calibration network.
  • Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization.
  • Ofir Marom and Benjamin Rosman. Zero-shot transfer with deictic object-oriented representation in reinforcement learning.
  • Massimiliano Pontil, Giulia Denevi, Carlo Ciliberto, and Dimitris Stamos. Learning to learn around a common mean.
  • Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization.
  • Sungryull Sohn, Junhyuk Oh, and Honglak Lee. Multitask reinforcement learning for zero-shot generalization with subtask dependencies.
  • Bradly Stadie, Ge Yang, Pieter Abbeel, Yuhuai Wu, Yan Duan, Xi Chen, Rein Houthooft, and Ilya Sutskever. The importance of sampling in meta-reinforcement learning.
  • Andrea Tirinzoni, Rafael Rodriguez, and Marcello Restelli. Transfer of value functions via variational methods.
  • Rasul Tutunov, Dongho Kim, and Haitham Bou Ammar. Distributed multitask reinforcement learning with quadratic convergence.
  • Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, and Swarat Chaudhuri. Synthesis of differentiable functional programs for lifelong learning.
  • Tongzhou Wang, Yi Wu, David Moore, and Stuart Russell. Meta-learning MCMC proposals.
  • Catherine Wong, Neil Houlsby, Yifeng Lu, and Andrea Gesmundo. Transfer learning with neural AutoML.
  • Ju Xu and Zhanxing Zhu. Reinforced continual learning.
  • Kelvin Xu, Chelsea Finn, and Sergey Levine. Uncertainty-aware few-shot learning with probabilistic model-agnostic meta-learning.
  • Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning.
  • Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning.
  • Yu Zhang, Ying Wei, and Qiang Yang. Learning to multitask.
  • Han Zhao, Shanghang Zhang, Guanhang Wu, José M. F. Moura, Joao P Costeira, and Geoffrey Gordon. Adversarial multiple source domain adaptation.

Safety

  • Yinlam Chow, Ofir Nachum, Mohammad Ghavamzadeh, and Edgar Duenez-Guzman. A Lyapunov-based approach to safe reinforcement learning.
  • Jiexi Huang, Fa Wu, Doina Precup, and Yang Cai. Learning safe policies with expert guidance.
  • Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning.

Applications

  • Sander Dieleman, Aaron van den Oord, and Karen Simonyan. The challenge of realistic music generation: modelling raw audio at scale.
  • Yuanxiang Gao, Li Chen, and Baochun Li. Post: Device placement with cross-entropy minimization and proximal policy optimization.
  • Aaron Havens, Zhanhong Jiang, and Soumik Sarkar. Online robust policy learning in the presence of unknown adversaries.
  • Zehong Hu, Yitao Liang, Yang Liu, and Jie Zhang. Inference aided reinforcement learning for incentive mechanism design in crowdsourcing.
  • Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits.
  • Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvári. TopRank: A practical algorithm for online stochastic ranking.
  • Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control.
  • Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning.
  • Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric Xing. Hybrid retrieval-generation reinforced agent for medical image report generation.
  • Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V Le, and Ni Lao. Memory augmented policy optimization for program synthesis with generalization.
  • Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation.
  • Renato Negrinho, Matthew Gormley, and Geoffrey Gordon. Learning beam search policies via imitation learning.
  • Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, Russ Tedrake, and John C Duchi. Scalable end-to-end autonomous vehicle testing via rare-event simulation.
  • Afshin Oroojlooy, Lawrence Snyder, and Martin Takac. Reinforcement learning for solving the vehicle routing problem.
  • Laurent Orseau, Levi Lelis, Tor Lattimore, and Theophane Weber. Single-agent policy tree search with guarantees.
  • Yu-Shao Peng, Kevin Tang, Hsuan-Tien Lin, and Edward Chang. Exploring sparse features in deep reinforcement learning towards fast disease diagnosis.
  • Pierre Thodoroff, Audrey Durand, Joelle Pineau, and Doina Precup. Temporal regularization for Markov decision process.
  • Utkarsh Upadhyay, Abir De, and Manuel Gomez Rodriguez. Deep reinforcement learning of marked temporal point processes.
  • Yining Wang, Xi Chen, and Yuan Zhou. Near-optimal policies for dynamic multinomial logit assortment selection models.
  • Romain Warlop, Alessandro Lazaric, and Jérémie Mary. Fighting boredom in recommender systems with linear reinforcement learning.

More Topics

  • Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction.
  • Simyung Chang, John Yang, Jaeseok Choi, and Nojun Kwak. Genetic-gated networks for deep reinforcement learning.
  • Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert Schapire. On oracle-efficient PAC RL with rich observations.
  • Nishant Desai, Andrew Critch, and Stuart J Russell. Negotiable reinforcement learning for pareto optimal sequential decision-making.
  • Mathieu Fehr, Olivier Buffet, Vincent Thomas, and Jilles Dibangoye. rho-POMDPs have Lipschitz-continuous epsilon-optimal value functions.
  • Mahdi Imani, Seyede Fatemeh Ghoreishi, and Ulisses M. Braga-Neto. Bayesian control of large MDPs with uncertain dynamics in data-poor environments.
  • Jongmin Lee, Geon hyeong Kim, Pascal Poupart, and Kee-Eung Kim. Monte-Carlo tree search for constrained POMDPs.
  • Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model.
  • Josef Urban, Cezary Kaliszyk, Henryk Michalewski, and Miroslav Olšák. Reinforcement learning of theorem proving.

Workshops
