Paper Explained: Reinforcement Learning Friendly Vision-Language Model for Minecraft (CLIP4MC)
Recently, I came across an interesting application of CLIP, a Vision-Language Model (VLM), in defining a reward function that lets a Minecraft agent learn actions through Reinforcement Learning. Since this paper lies at the intersection of two of my interests, vision-language models and reinforcement learning, I wanted to explain it here.
Introduction
Minecraft is a sandbox game that offers a vast environment for AI-driven exploration and task completion. The challenge in building autonomous agents for Minecraft lies in defining effective reward functions without manual engineering or human feedback, which is both costly and time-consuming.
CLIP4MC, a reinforcement learning (RL) friendly vision-language model, enhances reward learning for Minecraft tasks by improving task completion awareness and video-text alignment. This post explores CLIP4MC’s advancements over MineCLIP and its impact on RL training.
Problem Statement
Developing an autonomous embodied agent that excels across a wide spectrum of tasks in Minecraft is challenging due to:
- The impracticality of manually designing rewards for all possible tasks.
- The high cost of learning a reward model from human feedback.
Background: MineCLIP and Its Limitations
MineDojo introduced MineCLIP, a vision-language model that leverages internet-scale multimodal knowledge bases to provide reward functions for reinforcement learning. It uses a correlation score between visual observations and language prompts as a dense reward signal, eliminating the need for task-specific reward design.
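To make this concrete, here is a minimal sketch of turning CLIP-style video-text similarity into a dense reward. The encoders, feature sizes, and the 16-frame window below are stand-ins for illustration, not MineCLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

# Stand-in encoders: in practice these are the pre-trained CLIP video and text
# towers; simple linear layers keep the sketch self-contained and runnable.
video_encoder = torch.nn.Linear(16 * 512, 512)   # 16 frames x 512-d frame features
text_encoder = torch.nn.Linear(300, 512)         # toy prompt feature size

def clip_style_reward(frame_feats: torch.Tensor, prompt_feats: torch.Tensor) -> float:
    """Dense reward = cosine similarity between video and prompt embeddings."""
    v = F.normalize(video_encoder(frame_feats.flatten()), dim=-1)
    t = F.normalize(text_encoder(prompt_feats), dim=-1)
    return float(v @ t)

# Toy usage: score the agent's recent frames against the task prompt.
frames = torch.randn(16, 512)   # per-frame features for a sliding window
prompt = torch.randn(300)       # embedded task prompt, e.g. "shear a sheep"
print(f"dense reward: {clip_style_reward(frames, prompt):.3f}")
```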
However, MineCLIP has limitations:
- It does not ensure that target entities are within the agent’s field of vision, so the reward signal can be noisy or misleading when the target is not actually visible.
- It lacks task completion awareness, making rewards less informative.
CLIP4MC: Key Improvements
CLIP4MC enhances MineCLIP with:
- Enhanced Data Processing: A robust filtering and cleaning pipeline ensures better text-video alignment.
- Task Completion Awareness: Rewards are adjusted based on how well an action contributes to goal completion.
CLIP4MC Architecture
CLIP4MC refines video-text alignment using a two-tower architecture (see the sketch after this list):
- Video Encoder: Extracts meaningful representations from video frames.
- Text Encoder: Converts textual descriptions into embeddings.
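A minimal sketch of this layout, assuming (as in MineCLIP) that the video encoder applies a frame-level encoder followed by temporal pooling; the modules and dimensions below are toy stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Frame-level encoder followed by temporal aggregation (here, mean pooling)."""
    def __init__(self, frame_dim: int = 512, embed_dim: int = 512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, embed_dim)  # stand-in for CLIP's image tower

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # (B, T, frame_dim)
        per_frame = self.frame_proj(frames)                     # (B, T, embed_dim)
        return per_frame.mean(dim=1)                            # aggregate over time

class TextEncoder(nn.Module):
    """Stand-in for CLIP's text tower: embeds token ids and pools them."""
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # (B, L)
        return self.embed(tokens).mean(dim=1)                  # (B, embed_dim)

video_emb = VideoEncoder()(torch.randn(2, 16, 512))            # 2 clips, 16 frames each
text_emb = TextEncoder()(torch.randint(0, 1000, (2, 8)))       # 2 tokenized prompts
print(video_emb.shape, text_emb.shape)                         # torch.Size([2, 512]) twice
```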
Contrastive Learning and Task Completion Weighting
CLIP4MC employs a contrastive loss to pull matched video-text pairs together in the embedding space while pushing mismatched pairs apart.
Additionally, task completion weighting ensures that the similarity score increases as the agent progresses toward task completion, so the learned reward rises as the task nears its goal. A minimal sketch of both ideas follows.
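The paper's exact objective is not reproduced here; the sketch below pairs a standard InfoNCE-style contrastive loss with a per-sample progress weight in [0, 1]. The specific weighting scheme (scaling each matched pair's loss by its progress score) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(video_emb, text_emb, progress, temperature=0.07):
    """InfoNCE over a batch of video-text pairs, with each matched pair's loss
    scaled by a task-progress weight in [0, 1] (illustrative weighting)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0))
    # Per-pair losses in both directions; clips closer to completion
    # contribute more to the alignment signal.
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.T, targets, reduction="none")
    per_pair = 0.5 * (loss_v2t + loss_t2v)
    return (progress * per_pair).mean()

# Toy usage with random embeddings and made-up progress scores.
loss = weighted_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512),
                                 progress=torch.tensor([0.2, 0.5, 0.8, 1.0]))
print(loss.item())
```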
Training Pipeline
Reward Generation
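The trained model is used to generate rewards during RL in the same spirit as MineCLIP: at each step, the similarity between a sliding window of recent frames and the task prompt is emitted as a dense reward. A rough sketch follows; `env`, `policy`, and `score_fn` are hypothetical placeholders (an old-style Gym step API is assumed), not the paper's training code.

```python
from collections import deque
import torch

def rollout_with_vlm_reward(env, policy, score_fn, prompt, window=16):
    """Sketch of reward generation during a rollout: score_fn is the trained
    video-text model, queried on a sliding window of the most recent frames."""
    frames = deque(maxlen=window)
    obs = env.reset()
    done = False
    while not done:
        frames.append(obs)                      # keep only the last `window` frames
        action = policy(obs)
        obs, _, done, info = env.step(action)   # old-style Gym 4-tuple (assumption)
        if len(frames) == window:
            # Dense intrinsic reward from video-text similarity.
            reward = score_fn(torch.stack(list(frames)), prompt)
            policy.observe(reward)              # hypothetical policy-update hook
    return policy
```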
Dataset Details
The dataset originates from YouTube videos in MineDojo and undergoes:
- Transcript cleaning.
- Keyword-based filtering.
- Video partitioning and selection.
- Ensuring text-video alignment.
Each sample includes the following fields (a sketch of one such record follows):
- Video ID.
- Frame sequence size.
- Video transcript (caption/prompt).
- Start and end positions of the cropped clip.
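One way to represent such a sample in code; the field names below are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ClipSample:
    """One video-text pair as described above (field names are illustrative)."""
    video_id: str       # YouTube video identifier from MineDojo
    num_frames: int     # frame sequence size of the clip
    transcript: str     # cleaned caption / prompt aligned with the clip
    start_frame: int    # start position of the cropped clip
    end_frame: int      # end position of the cropped clip

sample = ClipSample("example_video_id", 16, "the player shears a sheep", 1200, 1216)
```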
Experimental Setup
CLIP4MC was tested on eight programmatic tasks from MineDojo and compared against:
- MineCLIP [official].
- MineCLIP [Ours]: Trained on a cleaned dataset without the swap operation.
Training details (the layer freezing and learning-rate schedule are sketched after this list):
- 640K video-text pairs.
- 16 equidistant RGB frames per sample.
- Cosine learning rate annealing with 320 warm-up gradient steps.
- Fine-tuning only the last two layers of pre-trained CLIP encoders.
- Training performed on a 4×A100 GPU node with FP16 mixed precision.
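Two of these choices, freezing all but the last two encoder layers and combining a 320-step warm-up with cosine annealing, can be sketched as follows. The stand-in layer stack, learning rate, and total step count are placeholder values, not the paper's exact configuration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Stand-in encoder: a stack of transformer blocks playing the role of a
# pre-trained CLIP tower (the real model would come from the CLIP checkpoint).
encoder = nn.ModuleList([nn.TransformerEncoderLayer(d_model=512, nhead=8)
                         for _ in range(12)])

# Freeze everything except the last two layers.
for layer in encoder[:-2]:
    for p in layer.parameters():
        p.requires_grad = False
trainable = [p for layer in encoder[-2:] for p in layer.parameters()]

# Cosine annealing with 320 warm-up gradient steps (lr and total steps are illustrative).
optimizer = torch.optim.AdamW(trainable, lr=1.5e-4)
total_steps, warmup_steps = 10_000, 320
scheduler = SequentialLR(
    optimizer,
    schedulers=[LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_steps),
                CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)],
    milestones=[warmup_steps],
)
```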
Evaluation Metrics
CLIP4MC was evaluated on:
- Success Rate: Percentage of tasks successfully completed.
- Reward Correlation: Alignment between intrinsic rewards and task completion.
- Retrieval Performance: Accuracy in matching text descriptions with relevant video segments.
Results
Key Findings
- Higher RL Success Rate: CLIP4MC outperforms MineCLIP in Minecraft task completion.
- Improved Task Understanding: More accurate video-text alignment.
- Stronger Reward Learning: Higher correlation between rewards and task completion.
Reward Analysis
Scatter plots indicate that CLIP4MC’s rewards correlate more strongly with entity size and task progress, with Pearson correlation coefficients of 0.81, 0.66, and 0.62. This suggests that CLIP4MC captures task completion more effectively than MineCLIP.
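The reported coefficients are ordinary Pearson correlations between the emitted reward and a progress proxy such as on-screen entity size. For reference, this is how such a coefficient is computed; the arrays below are dummy data, not the paper's measurements.

```python
import numpy as np

# Dummy stand-ins: per-frame reward and the corresponding on-screen entity size.
rewards = np.array([0.12, 0.18, 0.25, 0.31, 0.40, 0.52])
entity_size = np.array([0.05, 0.09, 0.16, 0.22, 0.30, 0.41])

pearson_r = np.corrcoef(rewards, entity_size)[0, 1]
print(f"Pearson correlation: {pearson_r:.2f}")
```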
Conclusion
CLIP4MC significantly enhances reward learning for reinforcement learning in open-ended environments like Minecraft. By integrating improved data processing, task completion awareness, and contrastive loss, it overcomes the limitations of MineCLIP. This results in better reward modeling, improved task success rates, and more precise video-text alignment.
For researchers and developers in embodied AI and reinforcement learning, CLIP4MC represents a robust and scalable approach to autonomous agent training in Minecraft and beyond.