WorldPrediction: A Benchmark for High-Level World Modeling and Long-Horizon Procedural Planning

¹Meta FAIR, ²HKUST, ³ISIR, Sorbonne Université

Abstract

World models predict the future world states that result from actions, enabling AI agents to plan in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different models. In contrast to prior work, which focuses primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with high temporal and semantic abstraction and long-horizon procedural planning in skilled human activities. Given initial and final world states, the task is to distinguish the correct action, or the correctly ordered action sequence, from a set of counterfactual distractors. This discriminative task setup enables us to incorporate different types of world models and planners and to compare them thoroughly across different hypotheses.

The benchmark represents all states and actions using visual observations. To prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" (identical actions recorded in different contexts) as candidates for selection. The benchmark is grounded in the formal framework of a partially observable semi-MDP, which improves the reliability and robustness of the evaluation. We report both model and human baselines to establish initial reference points on WorldPrediction.
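
The discriminative setup described above amounts to multiple-choice selection: a model scores each candidate given the initial and final states, and the highest-scoring candidate is taken as its answer. The following is a minimal sketch of that protocol under stated assumptions, not the benchmark's actual evaluation code. The names `pick_candidate`, `accuracy`, and `toy_score` are hypothetical, and plain strings stand in for the visual observations that the benchmark uses for states and actions.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration only. A candidate is a tuple of actions:
# length 1 for single-action selection, longer for ordered action sequences.
Plan = Tuple[str, ...]
# (initial state, final state, candidates, index of the correct candidate)
Sample = Tuple[str, str, Sequence[Plan], int]
Scorer = Callable[[str, Plan, str], float]


def pick_candidate(score: Scorer, initial: str, final: str,
                   candidates: Sequence[Plan]) -> int:
    """Return the index of the highest-scoring candidate (first one on ties)."""
    scores = [score(initial, plan, final) for plan in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(score: Scorer, samples: Sequence[Sample]) -> float:
    """Fraction of samples where the top-scoring candidate is the correct one."""
    correct = sum(
        pick_candidate(score, s0, s1, cands) == gold
        for s0, s1, cands, gold in samples
    )
    return correct / len(samples)


def toy_score(initial: str, plan: Plan, final: str) -> float:
    """Toy stand-in for a world model: each action appends its name to the state."""
    predicted = initial + "".join(plan)
    return 1.0 if predicted == final else 0.0


samples = [
    # Single action: pick the action that turns "" into "a".
    ("", "a", [("b",), ("a",), ("c",)], 1),
    # Action sequence: same actions in every candidate, only the order differs.
    ("x", "xabc", [("a", "c", "b"), ("c", "b", "a"), ("a", "b", "c")], 2),
]

print(accuracy(toy_score, samples))
```

Because any model only needs to expose a scoring function, the same loop can in principle compare very different kinds of world models and planners, which is the motivation the abstract gives for the discriminative formulation.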

BibTeX

@article{TBD,
  author    = {Author et al.},
  title     = {WorldPrediction: A Benchmark for High-Level World Modeling and Long-Horizon Procedural Planning},
  journal   = {Conference},
  year      = {Year},
}