OpenAI’s GPT-3 Paper Shares NeurIPS 2020 Best Paper Award With Works from Politecnico di Milano, CMU and UC Berkeley

Published in SyncedReview, Dec 7, 2020

OpenAI’s groundbreaking GPT-3 language model paper, a no-regret learning dynamics study from Politecnico di Milano & Carnegie Mellon University, and a UC Berkeley work on data summarization have been named the NeurIPS 2020 Best Paper Award winners. The organizing committee made the announcements this morning, along with their Test of Time Award, to kick off the thirty-fourth Conference on Neural Information Processing Systems.

More than 18,000 participants are anticipated at this year’s virtual gathering. In a blog post, NeurIPS 2020 organizers say they have endeavoured to ensure the virtual event is as accessible as possible for attendees in different time zones and with varied Internet speed and access.

The organizers designed a schedule with two six-hour sessions per day: the first starts at 5am PT and the second at 5pm PT. Paper authors could choose a session to make their presentations compatible with their preferred time zone. The organizers have also enabled users to choose their preferred bandwidth.

Best Paper Award Winners

Language Models are Few-Shot Learners
(NeurIPS link)
Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Institution: OpenAI

Abstract: We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.

Reasons given by the awards committee: Artificial intelligence systems trained to estimate the likelihood of the next word in a sequence are known as “language models”. Language models were first described in the 1950s as a theoretical construct for connecting the then-new field of information theory with natural language. This paper describes GPT-3, the largest and most sophisticated language model ever constructed. It demonstrates that, if you make a language model accurate enough by using unprecedented amounts of compute and data, it gains the ability to solve a wide variety of tasks without additional training, using only simple, natural language prompts. Example tasks include answering trivia questions, generating essays, determining if a movie review is positive or negative, and translating between French and English. The authors note that GPT-3 is better at some tasks than others, and devote most of the paper to carefully cataloging its strengths and weaknesses. The authors also consider potentially harmful implications of the technology, such as cheap generation of almost undetectable fake news and the model’s tendency to reflect the biases of its training data on sensitive topics such as race, gender, and religion.
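The abstract's phrase "tasks and few-shot demonstrations specified purely via text" can be made concrete with a small sketch. The helper name `build_few_shot_prompt` and the `Input:`/`Output:` layout are assumptions made for illustration, not the format used in the paper; the point is only that the model receives a handful of worked examples as plain text and no weights are updated.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a plain-text prompt from a task description, a few
    (input, output) demonstrations, and one unanswered query.

    The layout is a hypothetical one chosen for readability.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final example is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


demo_prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(demo_prompt)
```

A language model would be asked to continue the text after the final `Output:`; changing the task means changing only the description and the demonstrations.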

No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium
(NeurIPS link)
Authors: Andrea Celli (Polimi), Alberto Marchesi (Polimi), Gabriele Farina (CMU) and Nicola Gatti (Polimi)
Institutions: Politecnico di Milano and Carnegie Mellon University

Abstract: The existence of simple, uncoupled no-regret dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form (that is, tree-form) games generalize normal-form games by modeling both sequential and simultaneous moves, as well as private information. Because of the sequential nature and presence of partial information in the game, extensive-form correlation has significantly different properties than the normal-form counterpart, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to normal-form correlated equilibrium. However, it is currently unknown whether EFCE emerges as the result of uncoupled agent dynamics. In this paper, we give the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. First, we introduce a notion of trigger regret in extensive-form games, which extends that of internal regret in normal-form games. When each player has low trigger regret, the empirical frequency of play is close to an EFCE. Then, we give an efficient no-trigger-regret algorithm. Our algorithm decomposes trigger regret into local subproblems at each decision point for the player, and constructs a global strategy of the player from the local solutions at each decision point.

Reasons given by the awards committee: Our decisions impact others and their decisions impact us. To settle on a rational way to behave, we need to cut through this interdependence to reach what economists call an equilibrium. Creating automated procedures for finding equilibria is notoriously difficult. This paper provides the first approach for finding so-called correlated equilibria for general interactions using a learning approach. Correlated equilibria require a trusted external mediator that makes decision recommendations to the decision-makers. The canonical example of a correlated equilibrium is a stoplight. The stoplight tells approaching cars whether it is safe to go. Even in the absence of relevant laws, we should follow the stoplight’s recommendations because we know that everyone can reason that it is in their best interest to do so: driving through the red light is a risky proposition. The paper shows that such equilibria can be arrived at by learning algorithms acting completely independently (no external traffic engineer is needed), even when the decisions involve multiple steps and the decision-makers are partly in the dark about the state of the world. Such an approach could have powerful implications in the modern “gig economy”, where centralized supervision of self-interested actors is the norm.
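The stoplight example can be checked numerically. The sketch below is not the paper's algorithm and does not handle extensive-form games; it only verifies the defining condition of a normal-form correlated equilibrium for a two-driver intersection. The payoff numbers and the function name `is_correlated_equilibrium` are assumptions chosen for illustration.

```python
from itertools import product

GO, STOP = 0, 1

# Hypothetical payoffs (driver 0, driver 1); the values are illustrative only.
PAYOFFS = {
    (GO, GO): (-10, -10),  # both enter the intersection and crash
    (GO, STOP): (1, 0),
    (STOP, GO): (0, 1),
    (STOP, STOP): (0, 0),
}


def is_correlated_equilibrium(payoffs, dist, tol=1e-9):
    """Return True if no driver can gain, in expectation, by swapping a
    recommended action for another one.

    `dist` maps joint action pairs to the mediator's probabilities.
    """
    actions = (GO, STOP)
    for player in (0, 1):
        for rec, dev in product(actions, actions):
            if rec == dev:
                continue
            gain = 0.0
            for other in actions:
                joint = (rec, other) if player == 0 else (other, rec)
                swapped = (dev, other) if player == 0 else (other, dev)
                prob = dist.get(joint, 0.0)
                gain += prob * (payoffs[swapped][player] - payoffs[joint][player])
            if gain > tol:
                return False
    return True


# A fair stoplight: exactly one driver is told to go, each half of the time.
stoplight = {(GO, STOP): 0.5, (STOP, GO): 0.5}
# A broken signal that tells both drivers to go.
always_go = {(GO, GO): 1.0}

print(is_correlated_equilibrium(PAYOFFS, stoplight))
print(is_correlated_equilibrium(PAYOFFS, always_go))
```

Under these assumed payoffs the fair stoplight passes the check, because a driver told to stop knows the other was told to go, while the broken signal fails it, because either driver would rather stop than crash.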

Improved Guarantees and a Multiple-Descent Curve for Column Subset Selection and the Nystrom Method
(NeurIPS link)
Authors: Michał Dereziński, Rajiv Khanna, Michael W. Mahoney
Institution: University of California, Berkeley

Abstract: The Column Subset Selection Problem (CSSP) and the Nystrom method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing. A fundamental question in this area is: how well can a data subset of size k compete with the best rank k approximation? We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees which go beyond the standard worst-case analysis. Our approach leads to significantly better bounds for datasets with known rates of singular value decay, e.g., polynomial or exponential decay. Our analysis also reveals an intriguing phenomenon: the approximation factor as a function of k may exhibit multiple peaks and valleys, which we call a multiple-descent curve. A lower bound we establish shows that this behavior is not an artifact of our analysis, but rather it is an inherent property of the CSSP and Nystrom tasks. Finally, using the example of a radial basis function (RBF) kernel, we show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.

Reasons given by the awards committee: As the availability of large datasets expands, so does society’s dependence on being able to summarize complex data succinctly. Data summarization is the problem of identifying important examples and attributes in data to help characterize it efficiently. It can be used to select a representative subset of gene variants from a genetics dataset or the most informative documents from a text database. Prior work has shown that data summarization is an intractable problem: there are data sets for which no known algorithm can provide a good summary in a reasonable time frame. This paper shows that these analyses are far too pessimistic. The datasets that make the data summarization problem intractable are pathological and, in fact, interpretable summaries can be generated far more cheaply for real-world data. The work suggests that future systems will be able to create data summaries that are accurate, interpretable, and efficiently generated, greatly aiding our ability to absorb and process complex datasets.
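The abstract's question, how well a subset of k columns can compete with the best rank-k approximation, can be written as a ratio. The notation below is a sketch of one standard way to state it; the symbols are assumptions introduced here, with A the data matrix, S a set of k column indices, P_S the orthogonal projection onto the span of those columns, and the norm taken to be the Frobenius norm.

```latex
% Error of approximating A using only the columns indexed by S
\mathrm{Er}_A(S) = \lVert A - P_S A \rVert_F^2

% Error of the best possible rank-k approximation
\mathrm{OPT}_k = \min_{B \,:\, \operatorname{rank}(B) \le k} \lVert A - B \rVert_F^2

% Approximation factor studied as a function of k, with |S| = k
\frac{\mathrm{Er}_A(S)}{\mathrm{OPT}_k} \;\ge\; 1
```

The ratio is at least 1 because P_S A has rank at most k. The multiple-descent curve described in the abstract refers to how this factor can rise and fall repeatedly as k grows.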

Test of Time Award Winner

The Test of Time Award is presented to a paper from 10 years ago that has had a particularly significant and lasting impact on the AI community.

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
(NeurIPS link)
Authors: Benjamin Recht, Christopher Ré, Stephen Wright, Feng Niu
Institution: University of Wisconsin-Madison

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

Reasons given by the awards committee: Machine learning is the problem of turning exemplar data into a model, stored in a computer, that can be used to make decisions or take actions. At the core of modern machine-learning systems is the stochastic gradient method — usually known as “SGD” — which searches the space of possible models to find one that matches up well with the exemplar data. This paper described an implementation of SGD that can be run in parallel across a collection of fast computers, all of them making repeated small changes to the model without any coordination or synchronization. This approach, which the authors dubbed Hogwild!, outperformed alternative parallelization schemes that required synchronization. The paper also presented a theoretical analysis of Hogwild!’s convergence rate, showing that linear speedup in the number of processors could be attained (to within a constant factor) even when a large number of processors were used. The paper has been cited almost 2000 times, attesting to its influence not only on machine learning but also on the fields of computer systems and optimization, both of which contributed to the development and understanding of the Hogwild! approach.
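The update pattern described above, several workers writing to one shared model with no locks, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the synthetic sparse regression problem, the function names, and all hyperparameters are hypothetical, and CPython's global interpreter lock means the threads here show the unsynchronized access pattern without delivering the speedups reported in the paper.

```python
import random
import threading


def make_sparse_data(n_examples, dim, nnz, seed=0):
    """Build a noiseless regression set where each example touches `nnz` coordinates."""
    rng = random.Random(seed)
    w_true = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(n_examples):
        idx = rng.sample(range(dim), nnz)
        vals = [rng.uniform(-1.0, 1.0) for _ in idx]
        y = sum(w_true[i] * v for i, v in zip(idx, vals))
        data.append((idx, vals, y))
    return data


def mean_squared_error(w, data):
    total = 0.0
    for idx, vals, y in data:
        pred = sum(w[i] * v for i, v in zip(idx, vals))
        total += (pred - y) ** 2
    return total / len(data)


def worker(w, data, steps, lr, seed):
    """Run SGD steps against the shared list `w` without taking any lock."""
    rng = random.Random(seed)
    for _ in range(steps):
        idx, vals, y = data[rng.randrange(len(data))]
        err = sum(w[i] * v for i, v in zip(idx, vals)) - y
        for i, v in zip(idx, vals):
            # Unsynchronized read-modify-write: another thread may overwrite this.
            w[i] -= lr * err * v


def hogwild_style_sgd(data, dim, n_threads=4, steps=5000, lr=0.1):
    w = [0.0] * dim  # shared model, visible to every thread
    threads = [
        threading.Thread(target=worker, args=(w, data, steps, lr, t))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


data = make_sparse_data(n_examples=400, dim=20, nnz=2)
initial_loss = mean_squared_error([0.0] * 20, data)
w = hogwild_style_sgd(data, dim=20)
final_loss = mean_squared_error(w, data)
print(initial_loss, final_loss)
```

Because each example touches only two of the twenty coordinates, two workers rarely write to the same entry at the same moment, which is the sparsity condition the abstract relies on.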

NeurIPS 2020 continues through December 12. With 9,467 submitted papers, this has been another record-breaking year for NeurIPS — with 38 percent more paper submissions than 2019. A total of 1,903 papers were accepted, compared to 1,428 last year.

Over the course of the week, participants can virtually join the Expo, where top industry sponsors will provide talks, panels, and demos of academic interest. Tutorials will cover current lines of inquiry while general sessions will include talks, posters, and demonstrations. A full agenda can be found by visiting the NeurIPS conference schedule page.

Reporter: Yuan Yuan | Editor: Michael Sarazen

Written by Synced