# Astra<img src="./assets/images/logo.png" alt="logo" style="height: 1em; vertical-align: baseline; margin: 0 0.1em;">: General Interactive World Model with Autoregressive Denoising (ICLR 2026)
<div align="center"> <div style="margin-top: 0; margin-bottom: -20px;"> <img src="./assets/images/logo-text-2.png" width="50%" /> </div> <h3 style="margin-top: 0;"> 📄 [<a href="https://arxiv.org/pdf/2512.08931" target="_blank">arXiv</a>] 🏠 [<a href="https://eternalevan.github.io/Astra-project/" target="_blank">Project Page</a>] 🤗 [<a href="https://huggingface.co/EvanEternal/Astra" target="_blank">Huggingface</a>] </h3> </div> <div align="center">Yixuan Zhu<sup>1</sup>, Jiaqi Feng<sup>1</sup>, Wenzhao Zheng<sup>1 †</sup>, Yuan Gao<sup>2</sup>, Xin Tao<sup>2</sup>, Pengfei Wan<sup>2</sup>, Jie Zhou <sup>1</sup>, Jiwen Lu<sup>1</sup>
<!-- <br> -->(† Project leader)
<sup>1</sup>Tsinghua University, <sup>2</sup>Kuaishou Technology.
</div>

## 📖 Introduction
TL;DR: Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
Astra is an interactive, action-driven world model that predicts long-horizon future videos across diverse real-world scenarios. Built on an autoregressive diffusion transformer with temporal causal attention, Astra supports streaming prediction while preserving strong temporal coherence. Astra introduces noise-augmented history memory to stabilize long rollouts, an action-aware adapter for precise control signals, and a mixture of action experts to route heterogeneous action modalities. Through these key innovations, Astra delivers consistent, controllable, and high-fidelity video futures for applications such as autonomous driving, robot manipulation, and camera motion.
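To make the autoregressive denoising loop concrete, below is a minimal illustrative sketch. It is not the actual Astra API: the method names (`encode_action`, `denoise_step`), the tensor layout, and the noise scale are all hypothetical stand-ins for the mechanisms described above.

```python
import torch

@torch.no_grad()
def rollout(model, history, actions, num_chunks, denoise_steps, noise_std=0.1):
    """Stream future latent chunks, conditioning each on noise-augmented history."""
    outputs = []
    for i in range(num_chunks):
        # Noise-augmented history memory: lightly perturb past latents so the
        # model stays robust to its own prediction errors over long rollouts.
        noisy_history = history + noise_std * torch.randn_like(history)

        # Action-aware conditioning for this chunk (camera pose, trajectory, ...);
        # a mixture of action experts would route heterogeneous modalities here.
        action_emb = model.encode_action(actions[i])

        # Denoise the next chunk from pure noise, attending causally to history.
        x = torch.randn_like(history[:, -1:])
        for t in reversed(range(denoise_steps)):
            x = model.denoise_step(x, t, history=noisy_history, action=action_emb)

        outputs.append(x)
        # The finished chunk joins the history for the next autoregressive step.
        history = torch.cat([history, x], dim=1)
    return torch.cat(outputs, dim=1)
```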
<div align="center"> <img src="./assets/images/pipeline.png" alt="Astra Pipeline" width="90%"> </div>Gallery
### Astra+Wan2.1
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/715a5b66-3966-4923-aa00-02315fb07761" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/1451947e-1851-4b57-a666-a44ffea7b10c" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/c7156c4d-d51d-493c-995e-5113c3d49abb" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/f7550916-e224-497a-b0b9-84479607c962" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/d899d704-c706-4e64-a24b-eea174d2173d" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/c1d8beb2-3102-468a-8019-624d89fba125" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/2aabc10b-f945-4d9d-b24a-baed17fcfe14" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/5c03e6ae-0fc2-4e09-a5b5-f37d04e7bbf8" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> </tr> </table>🔥 News
- [2026.1.26]: Our paper has been accepted to ICLR 2026! 🎉
- [2025.12.09]: Released the inference code and model checkpoint.
- [2025.11.17]: Released the project page.
## 🎯 TODO List
- [x] Release dataset preprocessing tools
- [ ] Release full inference pipelines for additional scenarios:
  - [ ] 🚗 Autonomous driving
  - [ ] 🤖 Robotic manipulation
  - [ ] 🛸 Drone navigation / exploration
- [ ] Open-source training scripts:
  - [x] ⬆️ Action-conditioned autoregressive denoising training
  - [ ] 🔄 Multi-scenario joint training pipeline
- [ ] Provide unified evaluation toolkit
## ⚙️ Run Astra (Inference and Training)
Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:
### Inference
#### Step 1: Set up the environment
DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```
Install DiffSynth-Studio:

```bash
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```
#### Step 2: Download the pretrained checkpoints
- Download the pre-trained Wan2.1 models:

  ```bash
  cd script
  python download_wan2.1.py
  ```
- Download the pre-trained Astra checkpoint:

  Please download it from [Hugging Face](https://huggingface.co/EvanEternal/Astra) and place it in `models/Astra/checkpoints`, or script the download as sketched below.
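If you prefer to script the download, here is a minimal sketch using `huggingface_hub`; the repo id matches the Hugging Face link above, and the local directory follows the layout expected by the inference script:

```python
# Download the Astra checkpoint from Hugging Face into the expected directory.
# Requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EvanEternal/Astra",
    local_dir="models/Astra/checkpoints",
)
```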
#### Step 3: Test the example image
```bash
python infer_demo.py \
  --dit_path ../models/Astra/checkpoints/diffusion_pytorch_model.ckpt \
  --wan_model_path ../models/Wan-AI/Wan2.1-T2V-1.3B \
  --condition_image ../examples/condition_images/garden_1.png \
  --cam_type 4 \
  --prompt "A sunlit European street lined with historic buildings and vibrant greenery creates a warm, charming, and inviting atmosphere. The scene shows a picturesque open square paved with red bricks, surrounded by classic narrow townhouses featuring tall windows, gabled roofs, and dark-painted facades. On the right side, a lush arrangement of potted plants and blooming flowers adds rich color and texture to the foreground. A vintage-style streetlamp stands prominently near the center-right, contributing to the timeless character of the street. Mature trees frame the background, their leaves glowing in the warm afternoon sunlight. Bicycles are visible along the edges of the buildings, reinforcing the urban yet leisurely feel. The sky is bright blue with scattered clouds, and soft sun flares enter the frame from the left, enhancing the scene’s inviting, peaceful mood." \
  --output_path ../examples/output_videos/output_moe_framepack_sliding.mp4
```
Inference can run on a single 24GB GPU, such as an NVIDIA RTX 3090.
#### Step 4: Test your own images
To test with your own images, prepare the target images and their corresponding text prompts. We recommend input images close to 832×480 (width × height, roughly 16:9), which matches the resolution of the generated video and helps achieve better generation quality; see the sketch below. For prompt generation, you can refer to the Prompt Extension section of Wan2.1 for guidance on crafting captions.
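As a starting point, here is a minimal Pillow sketch that center-crops and resizes an image to 832×480; the file names are placeholders:

```python
# Center-crop an image to the 832x480 target aspect ratio, then resize.
# Requires `pip install pillow`; file paths are placeholders.
from PIL import Image

def prepare_image(path, size=(832, 480)):
    img = Image.open(path).convert("RGB")
    target_ratio = size[0] / size[1]
    w, h = img.size
    if w / h > target_ratio:  # too wide: crop the sides
        new_w = round(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:  # too tall: crop the top and bottom
        new_h = round(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize(size, Image.LANCZOS)

prepare_image("my_image.jpg").save("my_image_832x480.png")
```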
Then run the inference script with your own paths and prompt (the flags mirror Step 3; the values below are placeholders):

```bash
python infer_demo.py \
  --dit_path path/to/your/dit_ckpt \
  --wan_model_path ../models/Wan-AI/Wan2.1-T2V-1.3B \
  --condition_image path/to/your/image.png \
  --cam_type 4 \
  --prompt "Your caption here" \
  --output_path path/to/your/output.mp4
```