Voyager for Minecraft Under the Hood

An overview of the Voyager Minecraft AI agent

Alexey Potapov
TrueAGI
6 min read · Jul 7, 2023


Introduction

In the previous posts, we introduced prototypes of our goal-oriented and exploring agents and discussed the limitations of imperative agents.

In this post, we analyze the capabilities and behavior of Voyager, a GPT-4-powered agent capable of exploratory behavior and of achieving goals such as acquiring a diamond.

A High-Level Overview of Voyager

Voyager uses an explicit symbolic representation of its goals and has a library of skills implemented programmatically as pieces of code. It infers which skills to use to achieve goals or to satisfy the preconditions of other skills. This use of information about the Minecraft world to achieve explicit goals via subsymbolic skills is, in general, much closer to our agents than to end-to-end reinforcement learning.

The difference is that Voyager uses GPT-4 both to learn new skills and to orchestrate them, which sounds quite advanced. While our agents also use machine learning to train the vision subsystem and some basic motor skills, they don't generate brand-new skills. They also rely on a fairly limited knowledge base, while GPT-4 has vast knowledge about the Minecraft world, including not only game rules, crafting recipes, information about mining drops, etc., but also tricks, techniques, and good practices from various tutorials, whose discovery required extensive play experience.

It should also be noted that Voyager relies on Mineflayer, which provides quite high-level path-finding functions, as well as on its own additional handcrafted control primitives. Thus, both in terms of skills and prior knowledge, it is very far from learning from scratch. But while learning from scratch is interesting in the context of certain research, such use of GPT-4 can be justified from a utility standpoint. As we stated in the previous posts, Minecraft is complex enough that any means of achieving good results in it (which human players don't consider too cheaty) is fair.
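To give a sense of how much Mineflayer already provides out of the box, here is a minimal sketch of navigating to a location with the mineflayer-pathfinder plugin. The server address, bot name, and target coordinates are illustrative; only the plugin's public API is used.

const mineflayer = require("mineflayer");
const { pathfinder, Movements, goals } = require("mineflayer-pathfinder");

// Illustrative bot options; a real setup would point to an actual server.
const bot = mineflayer.createBot({ host: "localhost", username: "voyager_test" });
bot.loadPlugin(pathfinder);

bot.once("spawn", () => {
  bot.pathfinder.setMovements(new Movements(bot));
  // A single high-level call handles path search, obstacle avoidance, and jumping.
  bot.pathfinder.setGoal(new goals.GoalNear(100, 64, 100, 1));
});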

Digging Deeper

First of all, let’s note that Voyager’s learning of a skill library is an expensive process, requiring many calls to GPT-4 to generate valid code for useful skills (in fact, GPT-3.5 was initially used for a considerable portion of the calls during training to reduce the cost).

Let’s consider what is learned using the following example (taken from the library of pretrained skills).

async function mineFiveCoalOres(bot) {
  // Equip the wooden pickaxe
  const woodenPickaxe = bot.inventory.findInventoryItem(mcData.itemsByName.wooden_pickaxe.id);
  await bot.equip(woodenPickaxe, "hand");

  // Find 5 coal_ore blocks
  const coalOres = await exploreUntil(bot, new Vec3(1, 0, 1), 60, () => {
    const coalOres = bot.findBlocks({
      matching: block => block.name === "coal_ore",
      maxDistance: 32,
      count: 5
    });
    return coalOres.length >= 5 ? coalOres : null;
  });
  if (!coalOres) {
    bot.chat("Could not find enough coal ores.");
    return;
  }

  // Mine the 5 coal_ore blocks
  await mineBlock(bot, "coal_ore", 5);
  bot.chat("5 coal ores mined.");
}

Here, the main function is findBlocks from Mineflayer (as is findInventoryItem), while exploreUntil and mineBlock belong to Voyager’s hand-coded control primitives, which wrap pathfinder, collectBlock, and other Mineflayer utilities. Although the code of mineFiveCoalOres is fairly lengthy, it simply says “find 5 coal ore blocks, mine and collect them” in a purely imperative way.
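Note that exploreUntil itself is not generated by GPT-4; it is one of Voyager’s handcrafted primitives. Roughly, the idea is a timed loop: keep moving in a given direction and re-run a check until it succeeds or the time budget runs out. The sketch below is our own reconstruction of that idea, not Voyager’s actual code; the step size and the pathfinder goal used for movement are assumptions.

const { goals } = require("mineflayer-pathfinder");

// A rough reconstruction of the idea behind exploreUntil (not Voyager's code):
// keep stepping in `direction` (a Vec3) and re-running `check` until it returns
// something non-null or `maxTimeSeconds` elapses.
async function exploreUntilSketch(bot, direction, maxTimeSeconds, check) {
  const deadline = Date.now() + maxTimeSeconds * 1000;
  while (Date.now() < deadline) {
    const result = check();
    if (result) return result; // e.g. an array of found block positions
    // Step a few blocks further along the exploration direction, then look again.
    const target = bot.entity.position.plus(direction.scaled(8));
    await bot.pathfinder.goto(new goals.GoalNear(target.x, target.y, target.z, 2));
  }
  return null; // time budget exhausted, nothing found
}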

As we discussed, the imperative approach has various problems. For example, the code requires a wooden pickaxe, even if the agent has a stone pickaxe, which would typically be more appropriate to use. Reasoning based on declarative knowledge would not require specifying which tool to use inside this particular skill. A single piece of knowledge, that coal ore can be mined with any pickaxe, is enough and can be reused in any other skill or reasoning context. This is the strength of modular and compositional knowledge.
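As a toy illustration of that point (not Voyager code), a single declarative fact table about which tools can mine which ores lets one generic routine pick the best available pickaxe, instead of baking a particular tool into every skill. The table contents and the helper name below are assumptions made for the example.

// Declarative facts: which pickaxes are valid for which ores (simplified).
const MINEABLE_WITH = {
  coal_ore: ["wooden_pickaxe", "stone_pickaxe", "iron_pickaxe", "diamond_pickaxe"],
  iron_ore: ["stone_pickaxe", "iron_pickaxe", "diamond_pickaxe"],
  diamond_ore: ["iron_pickaxe", "diamond_pickaxe"]
};

// A generic routine reuses the facts for any ore, rather than hard-coding
// "equip the wooden pickaxe" inside mineFiveCoalOres and every similar skill.
function chooseTool(bot, mcData, blockName) {
  const validTools = MINEABLE_WITH[blockName] || [];
  for (const name of [...validTools].reverse()) { // prefer the strongest tier
    const item = bot.inventory.findInventoryItem(mcData.itemsByName[name].id);
    if (item) return item;
  }
  return null; // nothing suitable in the inventory: a planner would craft one
}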

Let’s explore the list of skills in the library: collectBamboo, collectFiveCactusBlocks, … cookPorkchops, cookSevenMutton, … craftBucket, craftChest, craftIronAxe (and a huge number of other crafting skills), equipIronSword, … mineFiveCoalOres, mineFiveIronOres, … smeltFiveRawIron, … One skill that sounds more interesting is fillBucketWithWater, but it, too, is a pretty straightforward wrapper around exploreUntil, findBlock, pathfinder, etc. The main question is: do we really need all these “skills”? An agent capable of declarative reasoning can figure out on the fly how to perform any of these tasks, represented as goals, without expensive training, search, or huge tensor computations. It should also be noted that many example skills (i.e., pieces of code) are fed to GPT-4 in order to make it capable of writing similar skills.

Another major issue with completing all these tasks via stand-alone imperative skills is that the execution of each of them is blocking. Voyager doesn’t reason, plan, and replan at runtime taking the current situation into account, whereas humans exercise much finer-grained symbolic control over their motor skills. If the agent is executing mineFiveCoalOres in the context of a larger imperative skill of crafting an iron pickaxe in order to get a diamond, and it runs past a diamond item lying on the ground, it will simply ignore that diamond. Such an imperative skill will never try to pick up a diamond before mining diamond ore unless this possibility is explicitly coded (see the sketch below). This behavior doesn’t look too intelligent, but does it matter if Voyager can acquire a diamond?
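To make the contrast concrete, here is a hedged sketch of what finer-grained control might look like: mining one block at a time and checking between blocks whether an opportunistic sub-goal, such as a dropped diamond nearby, is worth an interruption. This is not how Voyager works; nearbyDroppedItem is a hypothetical helper (not a Mineflayer API), and mineBlock is Voyager’s control primitive called one block at a time.

const { goals } = require("mineflayer-pathfinder");

// A sketch of interruptible execution (not Voyager's behavior): between
// individual blocks, check whether a more valuable opportunity has appeared.
async function mineWithOpportunities(bot, blockName, count) {
  for (let mined = 0; mined < count; mined++) {
    const drop = nearbyDroppedItem(bot, "diamond"); // hypothetical helper, not a Mineflayer API
    if (drop) {
      // Interrupt the current plan, walk over to pick up the diamond, then resume.
      const p = drop.position;
      await bot.pathfinder.goto(new goals.GoalNear(p.x, p.y, p.z, 1));
    }
    await mineBlock(bot, blockName, 1); // Voyager's control primitive, one block at a time
  }
}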

Field Trials

Unexpected situations leading to impasses occur more often than those leading to shortcuts, and imperative skills are prone to getting stuck, so one may wonder how Voyager overcomes this. Of course, Voyager doesn’t try to come up with the exploreUntil function by itself; it uses polished handcrafted code. Otherwise, it would get stuck far too often. It also executes each skill for a limited amount of time, so it doesn’t stay stuck forever (see the sketch below). Still, such cases are not infrequent: for example, in our third run Voyager got stuck for almost 10 minutes even though it was using the teleport cheat (for local movement).

Voyager got stuck for 10 minutes even using the teleport cheat
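The time cap itself is simple to picture. A minimal sketch of such a cut-off could look like the following; the actual mechanism and limits used in Voyager may differ.

// Race a skill against a timer so a stuck skill cannot block the agent forever.
function withTimeout(skillPromise, seconds) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("skill timed out")), seconds * 1000)
  );
  return Promise.race([skillPromise, timeout]);
}

// Usage: give mineFiveCoalOres a fixed budget and let the caller replan on failure.
// withTimeout(mineFiveCoalOres(bot), 120).catch(err => bot.chat(err.message));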

It also appears that Voyager uses findBlocks with maxDistance=32, meaning that it sees through all other blocks and can “notice” coal, iron, or diamond ore up to 32 blocks away, so it doesn’t need to search for blocks visually but goes directly to them using Mineflayer’s pathfinder. There is a chance that no block will be found, in which case Voyager relocates to try a different place, but this behavior is also hand-coded.

If we allow our agent to “notice” iron and diamond ore (but not other blocks) in a way similar to findBlocks, without the teleport and other cheats, it is able to acquire a diamond faster than Voyager.

Voyager obtaining a diamond (frames while the server is paused are not recorded; real time is about 1 hour)

In the end, is Voyager’s acquisition of a diamond a considerable achievement? We would say: not really. It is not practical at the moment. Voyager pauses the Minecraft server and spends an order of magnitude more time on thinking than on playing the game itself, and GPT-4 API calls cost money. Its training is much more expensive than writing a better skill library manually (especially considering that the example skills are written manually anyway), and its behavior is still far from perfect even with cheats. Nor does its architecture seem like a step towards AGI. It still has merit as an example of:

  • The use of information possessed by GPT-4 about a specific domain;
  • GPT-4’s capability to write programs with calls to custom libraries described via prompts;
  • The possibility of using GPT-4 in a way that connects domain knowledge to program code.

It would be interesting to see whether GPT-4 could make a real contribution as part of a neural-symbolic architecture with declarative knowledge and a reasoning component.

Special thanks to:
