One Model To Solve All Problems: The Story of Lukasz Kaiser

AI Frontiers · Oct 15, 2018

Łukasz Kaiser joined Google Brain in 2013, moving from the French National Center for Scientific Research (CNRS). At Google Brain, he has co-designed neural models for machine translation, parsing, and other algorithmic and generative tasks, and co-authored the TensorFlow system and the Tensor2Tensor library.

Advanced seq2seq learning

Deep learning researchers started eyeing natural language processing (NLP), the AI research field concerned with the interaction between computers and human language. It was, however, a difficult challenge.

“When neural networks first came out, they were built for image recognition, to process inputs with the same pixel dimensions. Sentences are not like images,” Kaiser says.

The magic of deep learning did not reach NLP until Google Brain researchers Ilya Sutskever, Oriol Vinyals, and Quoc Le proposed sequence-to-sequence learning in their 2014 paper Sequence to Sequence Learning with Neural Networks. It is an end-to-end encoder-decoder architecture, built on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells, that maps sequential data such as sentences into a fixed-length vector and then decodes an output sequence from it.

That means a neural network does not need to be told anything about grammar or words; it can be trained purely on input and output sequences, as long as it has enough training data. A parse tree, for example, can be linearized into a bracketed sequence such as “(S (NP John ) (VP runs ) )” and treated as just another target sequence.
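To make the setup concrete, here is a minimal encoder-decoder sketch in TensorFlow's Keras API. The vocabulary size, layer width, and variable names are illustrative, not taken from the paper.

```python
import tensorflow as tf

vocab_size, d_model = 8000, 256  # illustrative sizes

# Encoder: read the source token sequence and keep the final LSTM state.
enc_in = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(vocab_size, d_model)(enc_in)
_, state_h, state_c = tf.keras.layers.LSTM(d_model, return_state=True)(enc_emb)

# Decoder: generate the target sequence conditioned on the encoder's final state.
dec_in = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(vocab_size, d_model)(dec_in)
dec_out = tf.keras.layers.LSTM(d_model, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = tf.keras.layers.Dense(vocab_size)(dec_out)

model = tf.keras.Model([enc_in, dec_in], logits)
```

The only interface between input and output here is that single fixed-length state, which is exactly the bottleneck attention was later introduced to relieve.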

However, the model was far from ideal: it performed poorly when trained only on standard human-annotated parsing datasets of roughly one million tokens, and it was plagued by data inefficiency.

Three months after the seq2seq learning paper was published, Kaiser and his Google Brain colleagues took a step forward by proposing an attention-enhanced seq2seq model, which achieved state-of-the-art results when trained on a large synthetic corpus. The attention mechanism turned out to be an important extension: it lets the model focus on the most relevant words of the input at each decoding step. As a result, the model handles long sentences well and delivers comparable performance from a relatively small dataset. Other techniques for training on small datasets without sacrificing performance include dropout, confidence penalties, and layer normalization.
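A rough NumPy sketch of one such attention step is shown below. The weight matrices here are random placeholders standing in for learned parameters; the function names and shapes are illustrative, not the paper's code.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states):
    """Bahdanau-style additive attention over encoder states.

    decoder_state: (d,) current decoder hidden state
    encoder_states: (T, d) one hidden state per source position
    """
    d = decoder_state.shape[-1]
    rng = np.random.default_rng(0)
    # Placeholder parameters; in a trained model these are learned.
    W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v   # one score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: how much to "attend" to each word
    context = weights @ encoder_states        # weighted sum fed back into the decoder
    return context, weights
```

Instead of squeezing the whole sentence into one vector, the decoder recomputes this weighted summary at every step.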

The result was impressive: the new model, trained on a small human-annotated parsing dataset, matched the performance of standard parsers such as the BerkeleyParser. When trained on a larger dataset of high-confidence parse trees, it achieved an F1 score of 92.5 on Section 23 of the WSJ, a new state of the art at the time.

Sutskever's and Kaiser's research laid the groundwork for Google Neural Machine Translation (GNMT), an end-to-end learning system for automated translation. Launched in September 2016, Google Translate began using NMT in preference to its older statistical methods. It improves translation quality by learning from millions of examples, and today Google Translate supports over 100 languages.

So how good is Google Translate? On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. In side-by-side human evaluations on a set of simple sentences, it reduces translation errors by an average of 60% compared with Google's previous phrase-based production system.

Attention is all you need

The encoder-decoder architecture based on recurrent or convolutional neural networks long dominated sequence modeling, but it has two problems: the sequential nature of recurrent networks precludes parallelization within training examples, driving up computational cost and training time; and it is difficult for these models to learn dependencies between distant positions.

“When I go to see an old friend, I start talking about things we talked about 20 years ago; I immediately recall what's needed there. This thing, which many people call long-term dependencies, seems to be one really important thing to tackle,” said Kaiser.

In 2017, Kaiser's research team, together with a collaborator from the University of Toronto, released the paper Attention Is All You Need. The paper proposed a new, simple network architecture, the Transformer, based solely on attention mechanisms. The new model is still an encoder-decoder, but it dispenses with recurrent and convolutional building blocks entirely.

The Transformer uses two types of attention functions: scaled dot-product attention, which computes the attention function on a set of queries simultaneously, packed together into a matrix; and multi-head attention, which runs several attention layers in parallel so the model can jointly attend to information from different representation subspaces at different positions.
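The core computation is compact enough to write down. Here is a NumPy sketch of scaled dot-product attention; multi-head attention runs several copies of this on linearly projected Q, K, and V and concatenates the results. The function name and shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)      # (..., q_len, k_len)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over key positions
    return weights @ V                                  # (..., q_len, d_v)
```

Because every query attends to every key in one matrix multiplication, the whole sequence can be processed in parallel rather than token by token.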

As a result, the Transformer can be trained significantly faster than architectures based on previous seq2seq learning models. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, the model achieves a new state of the art.

I just need one model

Kaiser has witnessed the success of deep learning on many tasks, but deep learning alone is not going to achieve the general intelligence that Kaiser and many other AI researchers may spend their whole lives pursuing. Why? Because each model is only good at one task. Humans, by comparison, have a flair for generalizing one skill to many other tasks.

“Can we create a unified deep learning model to solve tasks across multiple domains?” This is the question that arose in his mind.

In 2017, Kaiser's team, jointly with the University of Toronto, released the research paper One Model To Learn Them All, which presented MultiModel, a single model that yields good results on a number of problems spanning multiple domains, including image classification, multiple translation tasks, image captioning, speech recognition, and English parsing. Tesla Director of AI Andrej Karpathy tweeted, “One Model To Learn Them All is another step in Google's attempt to turn all of itself into one big neural network.”

The model brings together building blocks from different neural network architectures, such as convolutional layers, the attention mechanism, sparsely-gated mixture-of-experts layers, and an encoder-decoder backbone. It includes four modality nets, for language (text data), images, audio, and categorical data.
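A simplified sketch of that layout in Keras is shown below. The layer sizes and the shared body are stand-ins; the real MultiModel's body mixes convolutional, attention, and mixture-of-experts blocks rather than the single convolution used here.

```python
import tensorflow as tf

d_model = 512  # illustrative width of the shared representation space

# Modality nets: map each input type into the same d_model representation space.
text_in = tf.keras.Input(shape=(None,), name="tokens")
text_feat = tf.keras.layers.Embedding(32000, d_model)(text_in)

image_in = tf.keras.Input(shape=(64, 64, 3), name="image")
conv = tf.keras.layers.Conv2D(d_model, 3, strides=2, activation="relu")(image_in)
image_feat = tf.keras.layers.Reshape((-1, d_model))(conv)

# One shared body (same weights) processes every modality once it is in the common space.
shared_body = tf.keras.layers.Conv1D(d_model, 3, padding="same", activation="relu")

text_logits = tf.keras.layers.Dense(32000)(shared_body(text_feat))          # e.g. translation
pooled = tf.keras.layers.GlobalAveragePooling1D()(shared_body(image_feat))
image_logits = tf.keras.layers.Dense(1000)(pooled)                          # e.g. classification
```

The key design choice is that only the thin modality nets are task-specific; everything in the middle is shared across domains.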

Also worth noting, the paper suggests that jointly training a task such as parsing with a seemingly unrelated dataset such as ImageNet brings improvements, thanks to transfer learning, a machine learning technique for carrying knowledge across tasks. The authors explain that “there are computational primitives shared between different tasks that allow for some transfer learning even between such seemingly unrelated tasks as ImageNet and parsing.”

While its results do not improve over the state of the art (for example, 86% accuracy on ImageNet versus 95% for the state of the art), the paper demonstrates for the first time that a single deep learning model can jointly learn many large-scale tasks from multiple domains.

Tensor2Tensor

Kaiser is a major contributor to the development of TensorFlow, Google's open-source library for large-scale machine learning. Packed with convenience functions and utilities, TensorFlow is now one of the most widely used machine learning systems among researchers and application developers. It represents computation as data flow graphs, maps the nodes of a graph across many machines in a cluster, and runs on a wide range of computational devices, including CPUs, GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs).
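The flavor of that graph-based model is easy to show. Here is a minimal TensorFlow 1.x-style sketch (the style in use when this article was written); the tensor shapes and names are arbitrary.

```python
import tensorflow as tf  # TensorFlow 1.x graph mode

# Building the graph only declares nodes; nothing is computed yet.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
w = tf.Variable(tf.random_normal([3, 1]), name="w")
y = tf.matmul(x, w)

# The session maps the graph onto available devices (CPU, GPU, or TPU) and executes it.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```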

However, TensorFlow still has room for improvement when it comes to ease of use. “TensorFlow is now used by many people. It is a great system, at least as a foundation for machine learning. But we found it is still quite hard for people to get into machine learning, start their first model, and get their system working.”

Following the MultiModel research, Kaiser's team released an open-source repository on GitHub named Tensor2Tensor, or T2T for short. It is a library of deep learning models and datasets built entirely on TensorFlow, with a strong focus on making deep learning more accessible and on accelerating ML research.

Tensor2Tensor currently supports CPUs, GPUs, and TPUs in single- and multi-device configurations. It also ships with sets of hyperparameters, such as the number of hidden layers and the optimizer's learning rate, that Google is confident perform well.

The repository packages many easy-to-use academic datasets along with state-of-the-art deep learning models (as code) for tasks ranging from image problems like classification and generation, to speech recognition, to NLP problems like summarization and translation. Users don't need to know where to download the data or how to preprocess it; it's all handled in the code.
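For a sense of how that looks in practice, here is the kind of workflow the T2T documentation describes, using the library's Python registry. The module path and problem name follow the public README and may differ across T2T versions.

```python
from tensor2tensor import problems

# Every bundled dataset/task is registered by name.
print(problems.available())

# Fetching a registered problem gives an object that knows how to
# download, preprocess, and encode its own data.
ende = problems.problem("translate_ende_wmt32k")
```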

Kaiser keeps thinking about the bigger picture of general intelligence. With a model that can solve many different tasks, he believes he is moving toward the AI of the future. “Does this model understand the world? Does it really give us something more general than the specific intelligence we have now? It's very hard to answer, but we are on a path, and maybe in a few years we can say more.”

At the upcoming AI Frontiers Conference, held from November 9 to November 11, Kaiser will present a hands-on tutorial on seq2seq learning with T2T, showing how to use the open-source library to train state-of-the-art models for translation, image generation, and any task of your choice.

The AI Frontiers Conference brings together AI thought leaders to showcase cutting-edge research and products. This year, our speakers include Ilya Sutskever (co-founder of OpenAI), Jay Yagnik (VP of Google AI), Kai-Fu Lee (CEO of Sinovation Ventures), Mario Munich (SVP at iRobot), Quoc Le (Google Brain), Pieter Abbeel (professor at UC Berkeley), and more.

Buy tickets at aifrontiers.com. For questions and media inquiries, please contact info@aifrontiers.com.


We showcase cutting-edge AI research results that are deployed at large scale. Join our weekly online meetups at http://meetup.com/aifrontiers