SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models

NeurIPS 2025

Sen Wang1, Jingyi Tian1, Le Wang1, Zhimin Liao1, Jiayi Li1, Huaiyi Dong1, Kun Xia1, Sanping Zhou1, Wei Tang2, Gang Hua3
1 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence,
National Engineering Research Center for Visual Information and Applications,
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
2 University of Illinois at Chicago
3 Amazon.com, Inc.
† Corresponding author

SAMPO is a scale-wise autoregressive world model for video prediction and robotic control.

Abstract

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle to produce visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines scale-wise visual autoregressive modeling for intra-frame generation with temporal causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism.

Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality while delivering 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating that it generalizes to unseen tasks and benefits from larger model sizes.
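As a concrete illustration of the attention pattern described in the abstract, the sketch below builds a block-causal mask in PyTorch: tokens attend bidirectionally within their own scale and causally to all earlier scales and frames, which is what allows an entire scale to be decoded in one parallel step. The function name and the block layout in the example are our own illustrative assumptions, not code from the SAMPO release.

import torch

def block_causal_mask(block_sizes: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Blocks are token groups in generation order: frame by frame, and
    coarse-to-fine scales within each frame. Tokens attend bidirectionally
    inside their own block and causally to every earlier block.
    """
    total = sum(block_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in block_sizes:
        end = start + n
        # own block (bidirectional) plus all previously generated blocks
        mask[start:end, :end] = True
        start = end
    return mask

# Hypothetical layout: one observed frame with scales of 1, 4, and 16 tokens,
# then one future frame with sparser scales of 1 and 4 tokens.
mask = block_causal_mask([1, 4, 16, 1, 4])
print(mask.shape)  # torch.Size([26, 26])

Because the mask is block-lower-triangular at the granularity of scales, all tokens of a given scale see the same context and can therefore be sampled simultaneously, which is the source of the rollout speedup reported above.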

Pipeline

The overall framework of SAMPO. Observed and future frames are discretized by a multi-scale tokenizer into dense and sparse token maps, respectively, which are then predicted autoregressively across time, following a coarse-to-fine decoding order within each frame. Motion prompts extracted from the observed frames are injected alongside the visual tokens to guide dynamic modeling.
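To make the decoding order concrete, here is a schematic rollout loop. The component interfaces (tokenizer.encode_dense, model.extract_motion_prompts, tokenizer.decode) are hypothetical stand-ins for SAMPO's tokenizer, motion prompt module, and transformer, used only to show the control flow, not the released API.

import torch

@torch.no_grad()
def rollout(model, tokenizer, obs_frames, actions, num_future, scales=(1, 2, 4)):
    """Predict num_future frames, decoding each scale in one parallel step."""
    context = tokenizer.encode_dense(obs_frames)        # dense tokens for observed frames
    prompts = model.extract_motion_prompts(obs_frames)  # trajectory-aware motion cues
    predicted = []
    for t in range(num_future):
        frame_tokens = []                               # sparse token maps, coarse to fine
        for s in scales:
            # All tokens of scale s are decoded in parallel; they attend
            # bidirectionally within the scale and causally to everything
            # generated earlier (see the mask sketch above).
            logits = model(context, frame_tokens, prompts, actions[t], scale=s)
            frame_tokens.append(logits.argmax(dim=-1))  # greedy decoding for brevity
        context = context + frame_tokens                # frame becomes past context
        predicted.append(tokenizer.decode(frame_tokens))
    return predicted

Greedy argmax decoding is used here only for brevity; any per-scale sampling strategy slots into the same loop without changing the coarse-to-fine, frame-by-frame order.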

Video prediction results on BAIR and RoboNet

Visual planning performance on the VP² benchmark

Ablation Studies & Model Analysis

Visualization Examples

BibTeX

@article{wang2025sampo,
  title={SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models},
  author={Wang, Sen and Tian, Jingyi and Wang, Le and Liao, Zhimin and Li, Jiayi and Dong, Huaiyi and Xia, Kun and Zhou, Sanping and Tang, Wei and Hua, Gang},
  journal={arXiv preprint arXiv:2509.15536},
  year={2025}
}