Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
Gallery
γ-World Overview
A comprehensive overview of γ-World: interactive multi-agent world generation across diverse scenes and configurations.
Two-Agent Interaction
Qualitative results of two-agent interaction. Each agent is independently controllable while sharing the same evolving world.
Four-Agent Generalization
Benefiting from the permutation-symmetric simplex agent encoding, γ-World generalizes from two to four players without additional training.
Real-World Robotics Coordination
γ-World extends to real-world multi-robot coordination scenarios, demonstrating practical applicability beyond virtual environments.
Abstract
World models for interactive video generation have largely focused on single-agent settings, where future observations are rolled out from a single action stream, user input, or controllable viewpoint. However, many simulated worlds are inherently populated: multiple players, robots, or embodied agents act simultaneously within a shared, evolving environment. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable and permutation-symmetric, and the model should support efficient inference while maintaining consistency across time and perspectives.
In this paper, we present γ-World, a generative multi-agent world model for interactive simulation. γ-World introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering.
To support efficient cross-agent interaction, we further propose Sparse Hub Attention, where learnable hub tokens mediate communication across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. Finally, we distill a bidirectional multi-agent teacher into a block-causal student. The resulting causal model supports KV caching for streaming and achieves real-time, action-responsive rollouts at 24 FPS.
Experiments in multiplayer virtual environments show that γ-World improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
Method
Architecture overview. γ-World takes per-agent action streams and produces a shared, multi-view rollout. Two key designs make it scalable to many agents:
Simplex Rotary Agent Encoding
A parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. Each agent receives a distinct phase while remaining permutation-equivalent, eliminating the need for learned per-slot identities or a fixed agent ordering.
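The geometric idea can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`simplex_vertices`, `rotate_pairs`, `dot`), the choice of one rotary channel pair per simplex coordinate, and the way angles are applied are all assumptions made for illustration. It builds the vertices of a regular simplex by centring the standard basis, so every pair of agents is equally far apart, and then uses one vertex as a vector of RoPE-style rotation angles.

```python
import math

def simplex_vertices(num_agents):
    """Return num_agents unit vectors that form a regular simplex (hypothetical helper)."""
    if num_agents < 2:
        raise ValueError("need at least two agents")
    n = num_agents
    # Centre the standard basis of R^n on its centroid, then rescale to unit length.
    scale = math.sqrt(n / (n - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / n) for j in range(n)]
            for i in range(n)]

def rotate_pairs(x, angles):
    """Rotate each feature pair (x[2k], x[2k+1]) by angles[k], as RoPE does."""
    out = []
    for k, theta in enumerate(angles):
        a, b = x[2 * k], x[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))
```

With `V = simplex_vertices(4)`, every vertex has unit norm and every pair of distinct vertices has inner product -1/3, so no agent is singled out by the geometry and relabeling agents only permutes the vertices. Because the rotations are applied pairwise, the score between a rotated query and a rotated key depends only on the difference of their angle vectors, which is the usual relative-position property of rotary encodings.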
Sparse Hub Attention
Learnable hub tokens mediate communication across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents and enabling efficient scaling to four or more agents.
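One way to picture hub-mediated communication is as an attention mask. The sketch below is an assumption about the connectivity pattern, not the paper's exact design: the function name `hub_attention_mask`, the token layout (agent tokens first, hub tokens last), and the rule that hubs attend to everything are all hypothetical. Agent tokens may attend within their own agent and to hubs, but never directly to another agent.

```python
def hub_attention_mask(num_agents, tokens_per_agent, num_hubs):
    """Return mask[q][k], True where query token q may attend to key token k.

    Assumed layout: agent tokens grouped agent by agent, followed by hub tokens.
    """
    n_agent_tokens = num_agents * tokens_per_agent
    total = n_agent_tokens + num_hubs

    def owner(idx):
        # Agent index for agent tokens, None for hub tokens.
        return idx // tokens_per_agent if idx < n_agent_tokens else None

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            oq, ok = owner(q), owner(k)
            # Allowed if either side is a hub, or both belong to the same agent.
            mask[q][k] = oq is None or ok is None or oq == ok
    return mask
```

For example, `hub_attention_mask(3, 2, 1)` yields a 7 by 7 mask in which token 0 (agent 0) can see token 1 (same agent) and token 6 (the hub) but not token 2 (agent 1). Information from one agent reaches another only by passing through the hub tokens.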
Efficiency: Sparse Hub Attention
Sparse Hub Attention scales linearly with the number of agents, while dense attention scales quadratically.
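The scaling claim can be checked with a back-of-the-envelope count of query-key pairs. This is a sketch under stated assumptions, not a measurement: it assumes each agent contributes the same number of tokens, the number of hub tokens is fixed, agents exchange information only through hubs, and hubs also attend to each other. The function names and the example sizes are hypothetical.

```python
def cross_agent_pairs_dense(num_agents, tokens_per_agent):
    """Query-key pairs between tokens of different agents under dense attention."""
    n, t = num_agents, tokens_per_agent
    return n * (n - 1) * t * t  # grows quadratically in the number of agents

def cross_agent_pairs_hub(num_agents, tokens_per_agent, num_hubs):
    """Pairs used for cross-agent communication when routed through hub tokens."""
    n, t, h = num_agents, tokens_per_agent, num_hubs
    # Agent-to-hub and hub-to-agent pairs, plus hub-to-hub pairs (assumed).
    return 2 * n * t * h + h * h  # grows linearly in the number of agents
```

With 100 tokens per agent and 8 hubs, going from two to four agents raises the dense count from 20,000 to 120,000 pairs, while the hub count rises only from 3,264 to 6,464.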