Petuum’s Papers at ICML 2018

We’re thrilled to have five papers with authors from Petuum being presented this week at the 35th International Conference on Machine Learning in Sweden. This achievement is a testament to the cutting-edge machine learning work we’re doing at Petuum to support our goal of industrializing artificial intelligence. For more details on each paper, along with the day and time of its presentation, please see below. If you’re attending the conference, please stop by to say hello and hear more about our work!

DiCE: The Infinitely Differentiable Monte Carlo Estimator
Authors:
Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson

Wednesday, July 11, 11–11:20 a.m. @ Victoria (oral)
Wednesday, July 11, 6:15–9:00 p.m. @ Hall B #102 (poster)

The score function estimator is widely used for estimating gradients of stochastic objectives in stochastic computation graphs (SCG), e.g., in reinforcement learning and meta-learning. While deriving the first-order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher-order derivatives is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order of derivative involves increasingly cumbersome graph manipulations. Lastly, to match the first-order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for estimators of higher-order derivatives. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct estimators of derivatives of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and numerical evaluation of the DiCE derivative estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://goo.gl/xkkGxN.
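
For readers new to the score function estimator mentioned above, here is a minimal sketch of the first-order identity it relies on: the gradient of an expected cost equals the expectation of the cost times the gradient of the log-probability. This is not DiCE itself and is not taken from the paper's code. The Bernoulli variable, the `toy_cost` function, and all other names are assumptions chosen for illustration, and the expectation is enumerated exactly rather than sampled.

```python
def bernoulli_prob(x, theta):
    """Probability of outcome x in {0, 1} under Bernoulli(theta)."""
    return theta if x == 1 else 1.0 - theta


def score(x, theta):
    """Score function: d/dtheta log p(x; theta)."""
    return 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)


def expected_cost(cost, theta):
    """Exact E[cost(x)], computed by enumerating both outcomes."""
    return sum(bernoulli_prob(x, theta) * cost(x) for x in (0, 1))


def score_function_gradient(cost, theta):
    """Exact E[cost(x) * score(x, theta)], the quantity sampled estimators target."""
    return sum(
        bernoulli_prob(x, theta) * cost(x) * score(x, theta) for x in (0, 1)
    )


def toy_cost(x):
    # Hypothetical cost used only for this illustration.
    return 3.0 if x == 1 else 1.0


theta = 0.3
eps = 1e-6
sf_grad = score_function_gradient(toy_cost, theta)
# Central finite difference of the expected cost, used as an independent check.
fd_grad = (
    expected_cost(toy_cost, theta + eps) - expected_cost(toy_cost, theta - eps)
) / (2 * eps)
print(sf_grad, fd_grad)
```

Both printed values should be close to 2, because the expected cost here is 3θ + (1 − θ). The abstract's point is that repeating this construction for second and higher derivatives is where the surrogate-loss approach breaks down.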

Gated Path Planning Networks
Authors:
Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, Eric Xing, and Ruslan Salakhutdinov

Wednesday, July 11, 2:10–2:20 p.m. @ A1 (oral)
Wednesday, July 11, 6:15–9:00 p.m. @ Hall B #134 (poster)

Value Iteration Networks (VINs) are effective differentiable path planning modules that can be used by agents to perform navigation while still maintaining end-to-end differentiability of the entire architecture. Despite their effectiveness, they suffer from several disadvantages including training instability, random seed sensitivity, and other optimization problems. In this work, we reframe VINs as recurrent-convolutional networks, which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation. From this perspective, we argue that standard gated recurrent update equations could potentially alleviate the optimization issues plaguing VINs. The resulting architecture, which we call the Gated Path Planning Network, is shown to empirically outperform VINs on a variety of metrics such as learning speed, hyperparameter sensitivity, iteration count, and even generalization. Furthermore, we show that this performance gap is consistent across different maze transition types and maze sizes, and we even show success on a challenging 3D environment, where the planner is only provided with first-person RGB images.
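
To make the "recurrent update with a max over actions" view concrete, below is a small sketch of plain tabular value iteration on a grid maze, the classical procedure that VINs approximate with learned convolutions. It is neither a VIN nor a Gated Path Planning Network. The grid layout, the reward of 1 for entering the goal, the discount factor, and all function names are assumptions made for this example.

```python
def value_iteration(walls, goal, gamma=0.9, iterations=20):
    """Tabular value iteration on a grid; walls[r][c] is True for blocked cells."""
    rows, cols = len(walls), len(walls[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    value = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):  # the recurrent part: same update every step
        new_value = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if walls[r][c] or (r, c) == goal:
                    continue  # walls and the terminal goal keep value 0
                q_values = []
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    blocked = not (0 <= nr < rows and 0 <= nc < cols) or walls[nr][nc]
                    if blocked:
                        nr, nc = r, c  # bumping into an obstacle leaves the agent in place
                    reward = 1.0 if (nr, nc) == goal else 0.0
                    q_values.append(reward + gamma * value[nr][nc])
                new_value[r][c] = max(q_values)  # max over the action "channels"
        value = new_value
    return value


walls = [
    [False, False, False],
    [False, True, False],
    [False, False, False],
]
values = value_iteration(walls, goal=(0, 2))
for row in values:
    print([round(v, 3) for v in row])
```

Each sweep looks only at a cell's four neighbours, which is why it can be read as a local convolution, and the `max` over per-action values plays the role of the max-pooling activation the abstract describes. The gated variant proposed in the paper replaces this fixed update with learned gated recurrent equations.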

Nonoverlap-Promoting Variable Selection
Authors:
Pengtao Xie, Hongbao Zhang, Yichen Zhu, and Eric Xing

Wednesday, July 11, 11:20–11:30 a.m. @ K1+K2 (oral)
Wednesday, July 11, 6:15–9:00 p.m. @ Hall B #82 (poster)

Variable selection is a classic problem in machine learning (ML), widely used to find important explanatory factors and to improve the generalization performance and interpretability of ML models. In this paper, we consider variable selection for models where multiple responses are to be predicted based on the same set of covariates. Since each response is relevant to a unique subset of covariates, we desire the selected variables for different responses to have small overlap. We propose a regularizer that simultaneously encourages orthogonality and sparsity, which jointly bring about an effect of reducing overlap. We apply this regularizer to four model instances and develop efficient algorithms to solve the regularized problems. We provide a formal analysis of why the proposed regularizer can reduce generalization error. Experiments on both simulation studies and real-world datasets demonstrate the effectiveness of the proposed regularizer in selecting less-overlapped variables and improving generalization performance.
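
The intuition that orthogonality plus sparsity reduces overlap can be checked on a toy case. The sketch below is an illustration only: the penalty used here (squared distance of the Gram matrix from the identity plus an L1 term) is an assumed stand-in, not necessarily the regularizer defined in the paper, and the two weight sets are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(vectors):
    """Squared Frobenius distance between the Gram matrix and the identity."""
    total = 0.0
    for i, u in enumerate(vectors):
        for j, v in enumerate(vectors):
            target = 1.0 if i == j else 0.0
            total += (dot(u, v) - target) ** 2
    return total


def l1_penalty(vectors):
    """Sum of absolute values of all weights (encourages sparsity)."""
    return sum(abs(a) for v in vectors for a in v)


def combined_penalty(vectors, l1_weight=1.0):
    return orthogonality_penalty(vectors) + l1_weight * l1_penalty(vectors)


def support_overlap(u, v, tol=1e-12):
    """Jaccard overlap between the sets of nonzero coordinates."""
    su = {i for i, a in enumerate(u) if abs(a) > tol}
    sv = {i for i, a in enumerate(v) if abs(a) > tol}
    union = su | sv
    return len(su & sv) / len(union) if union else 0.0


s = 2 ** -0.5
# Both pairs are orthonormal, but only the first selects disjoint covariates.
disjoint = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
overlapping = [[s, s, 0.0, 0.0], [s, -s, 0.0, 0.0]]

for name, vectors in (("disjoint", disjoint), ("overlapping", overlapping)):
    print(name, support_overlap(*vectors), round(combined_penalty(vectors), 3))
```

Orthogonality alone cannot tell the two pairs apart, since both have zero orthogonality penalty. The L1 term is what makes the disjoint pair cheaper, which is the joint effect the abstract refers to.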

Transformation Autoregressive Networks
Authors:
Junier Oliva, Kumar Avinava Dubey, Manzil Zaheer, Barnabás Póczos, Ruslan Salakhutdinov, Eric Xing, and Jeff Schneider

Friday, July 13, 10:10–10:20 a.m. @ Victoria (oral)
Friday, July 13, 6:15–9:00 p.m. @ Hall B #161 (poster)

The fundamental task of general density estimation p(x) has been of keen interest to machine learning. In this work, we attempt to systematically characterize methods for density estimation. Broadly speaking, most of the existing methods can be categorized as using either: a) autoregressive models to estimate the conditional factors of the chain rule, p(x_i | x_{i-1}, …); or b) non-linear transformations of variables of a simple base distribution. To better study the characteristics of these categories, we propose multiple methods for each category. For example, we propose RNN-based transformations to model non-Markovian transformations of variables. Further, through a comprehensive study over both real-world and synthetic data, we show that jointly leveraging transformations of variables and autoregressive conditional models results in a considerable improvement in performance. We illustrate the use of our models in outlier detection and image modeling. Finally, we introduce a novel data-driven framework for learning a family of distributions.
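
The two ingredients named in the abstract, chain-rule conditionals and a change of variables, can be combined in a few lines. The following is a deliberately simplified sketch, not the paper's model: it assumes a first-order (Markov) Gaussian conditional and a single shared affine transformation, whereas the paper studies richer, learned versions of both. The parameter names `shift`, `scale`, and `weight` are hypothetical.

```python
import math


def normal_logpdf(x, mean=0.0, std=1.0):
    """Log-density of a univariate Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def autoregressive_logpdf(z, weight):
    """Chain rule: log p(z) = sum_i log p(z_i | z_{i-1}), with z_i ~ N(weight * z_{i-1}, 1)."""
    total, prev = 0.0, 0.0
    for z_i in z:
        total += normal_logpdf(z_i, mean=weight * prev)
        prev = z_i
    return total


def transformed_logpdf(x, shift, scale, weight):
    """Change of variables: z = (x - shift) / scale, applied to every dimension."""
    z = [(x_i - shift) / scale for x_i in x]
    # log |det dz/dx| for an elementwise affine map with a shared scale.
    log_det = -len(x) * math.log(abs(scale))
    return autoregressive_logpdf(z, weight) + log_det


x = [3.0, 2.5, 4.0]
print(transformed_logpdf(x, shift=1.0, scale=2.0, weight=0.5))
```

As a sanity check on the Jacobian term, in one dimension the result matches the log-density of a Gaussian with mean `shift` and standard deviation `scale`.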

Orthogonality-Promoting Distance Metric Learning: Convex Relaxation and Theoretical Analysis
Authors:
Pengtao Xie, Wei Wu, Yichen Zhu, and Eric Xing

Friday, July 13, 5–5:20 p.m. @ A9 (oral)
Friday, July 13, 6:15–9:00 p.m. @ Hall B #72 (poster)

Distance metric learning (DML), which learns a distance metric from labeled “similar” and “dissimilar” data pairs, is widely utilized. Recently, several works have investigated orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to orthogonal, to achieve three effects: (1) high balancedness: achieving comparable performance on both frequent and infrequent classes; (2) high compactness: using a small number of projection vectors to achieve a “good” metric; (3) good generalizability: alleviating overfitting to training data. While showing promising results, these approaches suffer from three problems. First, they involve solving non-convex optimization problems where achieving the global optimum is NP-hard. Second, there is no theoretical understanding of why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original nonconvex problems so that the global optimum is guaranteed to be achievable; (2) providing a formal analysis of OPR’s capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the nonconvex methods.
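
As background for the "projection vectors" in the abstract, a projection-based metric can be written equivalently as a Mahalanobis-style distance with a positive semidefinite matrix M = AᵀA. The sketch below only demonstrates that equivalence. The matrix `A` and the points are hypothetical, and whether the paper's convex relaxation is formulated over M in exactly this way is an assumption here.

```python
def projected_distance(A, x, y):
    """Squared distance ||A (x - y)||^2, where the rows of A are projection vectors."""
    diff = [a - b for a, b in zip(x, y)]
    projections = [sum(w * d for w, d in zip(row, diff)) for row in A]
    return sum(p * p for p in projections)


def gram(A):
    """M = A^T A, which is positive semidefinite by construction."""
    dim = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(dim)] for i in range(dim)]


def mahalanobis_distance(M, x, y):
    """Squared distance (x - y)^T M (x - y) for a symmetric matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))


A = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # two hypothetical projection vectors in 3-D
x, y = [1.0, 2.0, 3.0], [0.0, 0.0, 1.0]

print(projected_distance(A, x, y), mahalanobis_distance(gram(A), x, y))
```

The two numbers agree, which is why a problem posed over the projection vectors can also be posed over M. Constraints on M, such as positive semidefiniteness, are convex, whereas the same problem expressed in terms of A generally is not.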