Gato — the generalist agent by DeepMind
Gato can perform more than 600 tasks, including playing video games, chatting, captioning images, and moving robotic arms in real-world and simulated environments.
Gato handles multiple modalities, multiple tasks, and multiple environments within a single generalist agent.
DeepMind trained this generalist on 604 different tasks. On more than 450 of them, it reached at least 50% of the expert score.
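To make the headline number concrete, here is a minimal sketch of how such a figure can be computed: count the tasks whose expert-normalized score reaches the threshold. The function name and the example scores are made up for illustration; only the 450-of-604 figure comes from the text above.

```python
def fraction_above_threshold(scores, threshold=0.5):
    """Share of tasks whose expert-normalized score meets the threshold.

    A score of 1.0 means expert level, 0.5 means half the expert score.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for s in scores if s >= threshold)
    return passed / len(scores)


# Made-up scores, for illustration only.
example_scores = [0.9, 0.55, 0.2, 1.1, 0.49]
example_share = fraction_above_threshold(example_scores)

# The headline figure from the text: 450 of 604 tasks, about three quarters.
headline_share = 450 / 604
```

Note that the threshold is a fraction of the expert score, not a fraction of attempts, so a task can count here while still being well below expert level.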
The model is transformative in the diversity of tasks it can perform. These include:
- Playing video games
- Captioning images with text
- Controlling robot arms
The model uses a single Transformer-based architecture. It is a 1.18B-parameter model with shared embeddings.
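One way to picture how a single Transformer can serve such different tasks is that every input is turned into tokens in one shared integer range. The sketch below shows this for continuous values such as joint torques. The constants (a 32,000-entry text vocabulary, 1,024 bins, mu-law companding with mu = 100 and M = 256) are our reading of the Gato paper and are not stated in this post, so treat them as assumptions. The function names and example values are hypothetical.

```python
import math

TEXT_VOCAB = 32_000    # assumed size of the text vocabulary
NUM_BINS = 1_024       # assumed number of bins for continuous values
MU, M = 100.0, 256.0   # assumed mu-law constants


def mu_law(x):
    """Squash a real value into [-1, 1] with mu-law companding."""
    y = math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)
    return max(-1.0, min(1.0, y))


def continuous_to_token(x):
    """Map a continuous value to a token id placed after the text vocabulary."""
    y = mu_law(x)
    bin_index = min(NUM_BINS - 1, int((y + 1.0) / 2.0 * NUM_BINS))
    return TEXT_VOCAB + bin_index


# Hypothetical step: a few text token ids followed by two joint torques.
text_tokens = [17, 942, 3051]   # made-up ids, not from a real tokenizer
torques = [0.0, -0.25]
sequence = text_tokens + [continuous_to_token(t) for t in torques]
```

Once everything is a flat sequence of integers, text and control data can share the same embedding table and the same network, which is the point of the single-architecture design.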
The agent is trained using the following data:
- 85.3% is control environment data: DMLab, ALE Atari, BabyAI, and Meta-World are a few examples of the wide variety.
- 14.7% is vision and language data, mainly MassiveText and M3W.
These inputs include button presses, robot arm movements, text, and images from real-world or simulated environments.
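The 85.3% / 14.7% split above can be read as sampling weights over data sources. The following is a minimal sketch of that idea, assuming each training example is drawn independently according to those weights. The names `MIXTURE` and `sample_sources` are hypothetical, and the actual training pipeline is certainly more involved.

```python
import random
from collections import Counter

# Mixture weights taken from the split described above.
MIXTURE = {"control": 0.853, "vision_language": 0.147}


def sample_sources(n, seed=0):
    """Draw n data-source labels according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Over many draws the observed shares approach the mixture weights.
counts = Counter(sample_sources(10_000))
control_share = counts["control"] / 10_000
```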
Gato is over 100x smaller than GPT-3 and roughly 1000x smaller than the largest NLP models in terms of parameters. This shows in its text generation capabilities, which are inferior to those of models like GPT-3.
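The size gap is easy to check with a little arithmetic. The sketch below uses the 1.18B figure from this post and GPT-3's published size of 175B parameters; the variable names are ours.

```python
GATO_PARAMS = 1.18e9   # from this post
GPT3_PARAMS = 175e9    # GPT-3's published parameter count

# How many times larger GPT-3 is than Gato, by parameter count.
gpt3_ratio = GPT3_PARAMS / GATO_PARAMS

# Size a model would need to be 1000x larger than Gato.
thousand_x_params = 1000 * GATO_PARAMS
```

The ratio comes out at about 148, and a model 1000x the size of Gato would have a little over a trillion parameters.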
We are likely to soon see similar models scaled up in terms of data, parameters, compute, and networking.
We expect this to lead to even better performance on the benchmarks.
It may also allow more transfer learning between tasks.
The main challenges are computing and networking limitations. For example, the authors mention that some design decisions were driven by network limitations.
Multi-modal learning is a key trend in AI.