WorldPrediction: A Benchmark for High-Level World Modeling and Long-Horizon Procedural Planning

World models predict the future world states that result from actions, enabling AI agents to plan in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different models. In contrast to prior work that focuses primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with high temporal and semantic abstraction and long-horizon procedural planning in skilled human activities. Given initial and final world states, the task is to distinguish the correct action, or the correctly ordered action sequence, from a set of counterfactual distractors. This discriminative task setup enables us to incorporate different types of world models and planners and to compare them thoroughly across different hypotheses.

The benchmark represents all states and actions using visual observations. To prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" (identical actions recorded in different contexts) as candidates for selection. The benchmark is grounded in a formal framework of partially observable semi-Markov decision processes (semi-MDPs), which makes the evaluation more reliable and robust. We report both model and human baselines to establish initial reference points on WorldPrediction.
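
The discriminative setup described above can be illustrated with a minimal sketch. It is not the benchmark's actual evaluation code: the names `select_candidate`, `accuracy`, and `toy_score`, the sample dictionary layout, and the use of integers in place of visual observations are all assumptions made for illustration. The only idea taken from the text is that a model assigns a score to each candidate given the initial and final states, and the highest-scoring candidate is compared against the ground truth.

```python
from typing import Callable, Sequence


def select_candidate(initial, final, candidates: Sequence, score: Callable) -> int:
    """Return the index of the candidate that the scorer rates highest."""
    scores = [score(initial, final, candidate) for candidate in candidates]
    # On ties, max() keeps the first index with the maximal score.
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(samples: Sequence[dict], score: Callable) -> float:
    """Fraction of samples where the selected candidate is the labeled one."""
    correct = sum(
        select_candidate(s["initial"], s["final"], s["candidates"], score) == s["label"]
        for s in samples
    )
    return correct / len(samples)


def toy_score(initial: int, final: int, action: int) -> float:
    """Hypothetical scorer: states are integers and an action adds an offset.

    A real scorer would compare visual observations; this stand-in only
    rewards actions that move the initial state close to the final state.
    """
    return -abs(initial + action - final)


# Hypothetical samples: one correct action and two counterfactual distractors each.
samples = [
    {"initial": 2, "final": 5, "candidates": [1, 3, -2], "label": 1},
    {"initial": 0, "final": -4, "candidates": [-4, 4, 0], "label": 0},
]

if __name__ == "__main__":
    print(accuracy(samples, toy_score))
```

Because the protocol only requires a score per candidate, any model that can rate how well a candidate explains the transition between the two states can be plugged in as `score`, which is what allows different kinds of world models and planners to be compared on the same footing.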
