A New Frontier of AI? Language-Cognizant Agents Prompting Video Generation Imagination Models

nullonesix · Published in Artificialis · 5 min read · Feb 13, 2023

In my last article I talked about how efficient image generation makes it possible to prompt video generation models with requests like “produce a video of a player expertly completing the first level of Mario” or “Mario jumping over a chasm”, and how language can be used to either expand on or summarize a topic, thereby providing a means of navigating abstract action spaces at multiple resolutions.

I want to expand on this further now that we have AIs making movies.

Let’s take Tetris as an example. Everyone knows Tetris, and many might mistakenly think it’s an easy game for AI to master. It is easy if you cheat, say, by telling the AI to avoid leaving holes in the stack. But if the AI is given no a priori knowledge about Tetris and is simply asked to maximize the number of line clears, learning can be quite challenging. AI research papers on Tetris typically simplify the game, for example by shrinking the playfield from 20x10 to 10x10 or by using smaller pieces.
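
To make the “cheating” versus “no a priori knowledge” distinction concrete, here is a minimal sketch of the two reward signals, assuming the board is a 20x10 NumPy array with zeros for empty cells; the function names and the 0.5 penalty weight are illustrative, not taken from any particular paper:

```python
import numpy as np

def count_holes(board: np.ndarray) -> int:
    """Count empty cells that sit below a filled cell in the same column --
    the 'holes' a hand-written Tetris heuristic penalizes."""
    holes = 0
    for col in board.T:                      # iterate over columns
        filled = np.flatnonzero(col)         # row indices of filled cells
        if filled.size:
            top = filled[0]                  # highest filled cell in the column
            holes += int(np.sum(col[top:] == 0))
    return holes

def heuristic_reward(board: np.ndarray, lines_cleared: int) -> float:
    """The 'cheating' shaped reward: reward line clears, penalize holes."""
    return lines_cleared - 0.5 * count_holes(board)

def sparse_reward(lines_cleared: int) -> float:
    """The unshaped objective the article argues is hard to learn from."""
    return float(lines_cleared)
```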

That said, modern AI has progressed to the point where using probabilistic game tree search and giving the AI access to the game engine can yield decent performance. This is still cheating a little, because you’re giving it access to the game engine, but there are alternatives to this that will probably work reasonably well:

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games — the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled, our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.

So suppose we have MuZero starting to learn to play Tetris. It’s still going to be embarrassingly inefficient while doing so and hence ill-prepared for real-world challenges such as maximizing the amount of money it earns.

Can the situation be improved by using video generation models? Well, yes: we can prompt a video generation model for an expert demonstration of Tetris play, extract the game actions from that demonstration, and then execute those actions.
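
A rough sketch of that pipeline might look like the following. Everything here (`video_model`, `inverse_dynamics`, `env`) is a hypothetical placeholder object rather than a real API; the inverse dynamics model is the usual trick for labelling consecutive frames with the actions that most likely produced them:

```python
def act_from_imagined_demo(video_model, inverse_dynamics, env,
                           prompt="an expert clearing lines in Tetris"):
    """Prompt a video model for a demonstration, extract actions, replay them.

    All three arguments are assumed placeholder interfaces, not real libraries.
    """
    frames = video_model.generate(prompt)          # imagined demonstration frames
    # Label each consecutive pair of frames with the inferred action.
    actions = [inverse_dynamics.predict(f0, f1)
               for f0, f1 in zip(frames, frames[1:])]
    obs = env.reset()
    total_reward = 0.0
    for a in actions:                              # replay the actions in the real game
        obs, reward, done, _ = env.step(a)
        total_reward += reward
        if done:
            break
    return total_reward
```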

But this is only the most basic kind of synthesis. Besides, what if we would like superhuman performance? Moreover, what if we want the model to generate its own prompts?

Well, we can always bootstrap off of human demonstrations and then use reinforcement learning from there to reach superhuman performance, but again the reinforcement learning part is going to be inefficient. What we’re really after is efficient reinforcement learning that is holistically integrated with language modelling and video generation, so that the agent can use natural language to create thought experiments it can learn from, explain its own actions, and follow instructions.
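
As a hedged sketch of that two-phase recipe, here is behavioral cloning on demonstrations followed by REINFORCE fine-tuning, written in PyTorch; the network shape (a flattened 20x10 board mapped to 7 Tetris actions), the learning rate, and the data tensors are assumptions made purely for illustration:

```python
import torch
import torch.nn as nn

# Illustrative policy: flattened 20x10 board -> logits over 7 Tetris actions.
policy = nn.Sequential(nn.Flatten(), nn.Linear(200, 128), nn.ReLU(),
                       nn.Linear(128, 7))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def behavioral_cloning_step(states, expert_actions):
    """Phase 1: imitate (human or imagined) demonstrations."""
    loss = nn.functional.cross_entropy(policy(states), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def reinforce_step(states, actions, returns):
    """Phase 2: REINFORCE fine-tuning toward more line clears.
    `returns` are discounted sums of the line-clear reward."""
    logp = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(logp * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```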

Linking language to objects or phenomena in the world is called the symbol grounding problem, and there has been some progress in this area circa 2020:

Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few- and one-shot learning. Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt (“This is a dax”), the agent can re-identify the object and manipulate it as instructed (“Put the dax on the bed”). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word “dax” with long-term lexical and motor knowledge acquired across episodes (i.e. “bed” and “putting”). We find that, under certain training conditions and with a particular memory writing mechanism, the agent’s one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for ‘fast-mapping’, a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users.

Let’s take a moment to step back. You can say we have 3 components (sketched as minimal interfaces after the list):

  1. Language in the form of language models.
  2. Imagination in the form of video generation.
  3. Agency in the form of reinforcement learning.
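
As a minimal sketch, the three components could be written down as interfaces like these; the class names and method signatures are purely illustrative, not an existing library:

```python
from typing import Protocol, List, Any

class Language(Protocol):
    def complete(self, prompt: str) -> str: ...            # e.g. emit objective code or a new prompt

class Imagination(Protocol):
    def generate_video(self, prompt: str) -> List[Any]: ...  # imagined demonstration frames

class Agency(Protocol):
    def imitate(self, frames: List[Any]) -> None: ...       # learn from an (imagined) video
    def act(self, observation: Any) -> Any: ...              # act in the real environment
```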

We should be able to prompt agents the way we prompt video generation and language models. That is, we should be able to say “learn to play Tetris optimally and give me a demonstration of this strategy” to the agent.

What happens next?

The language component generates code for an objective function, which for Tetris should be something like the number of lines cleared. The imagination component generates a video of optimal Tetris play. The agency component imitates this video and then engages in prompt tuning with the language and imagination components, trying to generate better imaginings of optimal play that are still imitable. Then this whole system is trained together across a variety of games. Finally, you have a truly prompt-able AI, one that can not only write your code for you or make beautiful art, but also play video games for you.
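
Here is a hedged sketch of that loop, wiring together the interfaces sketched above. Everything in it (the prompts, the scoring, the prompt-tuning step, the old-gym-style `env` API) is illustrative rather than a description of an existing system:

```python
def promptable_agent(language, imagination, agency, env, rounds=10):
    # 1. Language writes the objective (for Tetris: number of lines cleared).
    #    In a fuller system this generated code would define env's reward;
    #    here it is produced only to illustrate the language component's role.
    objective_code = language.complete(
        "Write a Python reward function for Tetris that counts cleared lines.")

    prompt = "a perfect game of Tetris, clearing every possible line"
    best_score = float("-inf")
    for _ in range(rounds):
        # 2. Imagination proposes a demonstration of optimal play.
        frames = imagination.generate_video(prompt)
        # 3. Agency imitates the imagined demonstration...
        agency.imitate(frames)
        # ...and is scored in the real environment.
        score = evaluate(agency, env)
        best_score = max(best_score, score)
        # 4. Prompt tuning: ask the language model for a prompt that should
        #    yield an imagining that is both better and still imitable.
        prompt = language.complete(
            f"The prompt '{prompt}' produced a score of {score}. "
            "Suggest a Tetris prompt for a better but still imitable demonstration.")
    return best_score

def evaluate(agency, env, episodes=3):
    """Average episodic return of the current agent in the real game."""
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, _ = env.step(agency.act(obs))
            total += reward
    return total / episodes
```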

End notes:

  • there’s still a lot missing with respect to optimal synthesis between the 3 components
  • imagination component ought to be updated with new discoveries from the agency component
  • language and imagination components can simply be thought of as stores of knowledge, but then so can the agency component
  • the key takeaways are that RL agents can be prompt-able, that video generation can be equated with imagination, and that there are 3 components (agency, language, imagination) denoted by commonplace words which can help one think about the big picture of their interactions
  • integrating these components has a different feel to it than when doing typical reinforcement learning
  • Tetris is amusingly tricky and optimal Tetris play with no a priori knowledge is still an open problem for AI

Comment below with your thoughts! I will be working on this for the near future so I look forward to hearing them!
