
# Astra<img src="./assets/images/logo.png" alt="logo" style="height: 1em; vertical-align: baseline; margin: 0 0.1em;">: General Interactive World Model with Autoregressive Denoising (ICLR 2026)

<div align="center"> <div style="margin-top: 0; margin-bottom: -20px;"> <img src="./assets/images/logo-text-2.png" width="50%" /> </div> <h3 style="margin-top: 0;"> 📄 [<a href="https://arxiv.org/pdf/2512.08931" target="_blank">arXiv</a>] &nbsp;&nbsp; 🏠 [<a href="https://eternalevan.github.io/Astra-project/" target="_blank">Project Page</a>] &nbsp;&nbsp; 🤗 [<a href="https://huggingface.co/EvanEternal/Astra" target="_blank">Huggingface</a>] </h3> </div> <div align="center">

Yixuan Zhu<sup>1</sup>, Jiaqi Feng<sup>1</sup>, Wenzhao Zheng<sup>1 †</sup>, Yuan Gao<sup>2</sup>, Xin Tao<sup>2</sup>, Pengfei Wan<sup>2</sup>, Jie Zhou <sup>1</sup>, Jiwen Lu<sup>1</sup>

<!-- <br> -->

(† Project leader)

<sup>1</sup>Tsinghua University, <sup>2</sup>Kuaishou Technology.

</div>

## 📖 Introduction

TL;DR: Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.

Astra is an interactive, action-driven world model that predicts long-horizon future videos across diverse real-world scenarios. Built on an autoregressive diffusion transformer with temporal causal attention, Astra supports streaming prediction while preserving strong temporal coherence. Astra introduces noise-augmented history memory to stabilize long rollouts, an action-aware adapter for precise control signals, and a mixture of action experts to route heterogeneous action modalities. Through these key innovations, Astra delivers consistent, controllable, and high-fidelity video futures for applications such as autonomous driving, robot manipulation, and camera motion.

<div align="center"> <img src="./assets/images/pipeline.png" alt="Astra Pipeline" width="90%"> </div>
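The autoregressive rollout with noise-augmented history can be pictured with a minimal conceptual sketch. This is not Astra's actual implementation: `denoise_step` stands in for the diffusion transformer, actions are passed as a plain argument rather than through the action-aware adapter, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_history(frames, sigma=0.1):
    """Noise-augmented history memory (sketch): perturbing past frames
    during conditioning reduces error accumulation over long rollouts."""
    return [f + sigma * rng.standard_normal(f.shape) for f in frames]

def rollout(denoise_step, first_frame, actions, window=8):
    """Autoregressive rollout: each new frame is denoised conditioned on
    a sliding window of (noise-augmented) past frames plus the next action.
    `denoise_step` is a placeholder for the diffusion transformer."""
    frames = [first_frame]
    for a in actions:
        context = noisy_history(frames[-window:])
        frames.append(denoise_step(context, a))
    return frames
```

The sliding window mirrors the streaming prediction described above: temporal causal attention only ever sees past frames, so generation can continue indefinitely at constant memory.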

## Gallery

### Astra+Wan2.1

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/715a5b66-3966-4923-aa00-02315fb07761" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/1451947e-1851-4b57-a666-a44ffea7b10c" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/c7156c4d-d51d-493c-995e-5113c3d49abb" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/f7550916-e224-497a-b0b9-84479607c962" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/d899d704-c706-4e64-a24b-eea174d2173d" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/c1d8beb2-3102-468a-8019-624d89fba125" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/2aabc10b-f945-4d9d-b24a-baed17fcfe14" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> <td> <video src="https://github.com/user-attachments/assets/5c03e6ae-0fc2-4e09-a5b5-f37d04e7bbf8" style="width:100%; height:180px; object-fit:cover;" controls autoplay loop muted></video> </td> </tr> </table>

## 🔥 News

  • [2026.1.26]: Our paper has been accepted to ICLR 2026! 🎉
  • [2025.12.09]: Released the inference code and model checkpoint.
  • [2025.11.17]: Released the project page.

## 🎯 TODO List

  • [x] Release dataset preprocessing tools

  • [ ] Release full inference pipelines for additional scenarios:

    • [ ] 🚗 Autonomous driving
    • [ ] 🤖 Robotic manipulation
    • [ ] 🛸 Drone navigation / exploration
  • [ ] Open-source training scripts:

    • [x] ⬆️ Action-conditioned autoregressive denoising training
    • [ ] 🔄 Multi-scenario joint training pipeline
  • [ ] Provide unified evaluation toolkit


## ⚙️ Run Astra (Inference and Training)

Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:

### Inference

#### Step 1: Set up the environment

DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Then clone and install Astra (built on DiffSynth-Studio):

```bash
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```

#### Step 2: Download the pretrained checkpoints

1. Download the pre-trained Wan2.1 models:

```bash
cd script
python download_wan2.1.py
```

2. Download the pre-trained Astra checkpoint:

Please download it from [Hugging Face](https://huggingface.co/EvanEternal/Astra) and place it in `models/Astra/checkpoints`.
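As a convenience, the expected checkpoint location can be scripted. The commented `snapshot_download` call is a hedged sketch using the `huggingface_hub` library; the repo id matches the Hugging Face link above, but verify the exact file layout on the model page before relying on it.

```python
from pathlib import Path

def astra_ckpt_path(root: str = "models") -> Path:
    """Expected checkpoint location (matches --dit_path in Step 3)."""
    return Path(root) / "Astra" / "checkpoints" / "diffusion_pytorch_model.ckpt"

# To fetch the weights (requires `pip install huggingface_hub` and network access):
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id="EvanEternal/Astra",
#                   local_dir=str(astra_ckpt_path().parent))
```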

#### Step 3: Test the example image

```bash
python infer_demo.py \
  --dit_path ../models/Astra/checkpoints/diffusion_pytorch_model.ckpt \
  --wan_model_path ../models/Wan-AI/Wan2.1-T2V-1.3B \
  --condition_image ../examples/condition_images/garden_1.png \
  --cam_type 4 \
  --prompt "A sunlit European street lined with historic buildings and vibrant greenery creates a warm, charming, and inviting atmosphere. The scene shows a picturesque open square paved with red bricks, surrounded by classic narrow townhouses featuring tall windows, gabled roofs, and dark-painted facades. On the right side, a lush arrangement of potted plants and blooming flowers adds rich color and texture to the foreground. A vintage-style streetlamp stands prominently near the center-right, contributing to the timeless character of the street. Mature trees frame the background, their leaves glowing in the warm afternoon sunlight. Bicycles are visible along the edges of the buildings, reinforcing the urban yet leisurely feel. The sky is bright blue with scattered clouds, and soft sun flares enter the frame from the left, enhancing the scene's inviting, peaceful mood." \
  --output_path ../examples/output_videos/output_moe_framepack_sliding.mp4
```

Inference can be run on a single 24 GB GPU, such as the NVIDIA RTX 3090.

#### Step 4: Test your own images

To test with your own custom images, prepare the target images and their corresponding text prompts. We recommend input images close to 832×480 (width × height, roughly 16:9), which matches the resolution of the generated video and helps achieve better generation quality. For prompt generation, you can refer to the Prompt Extension section of Wan2.1 for guidance on crafting captions.

```bash
python infer_demo.py \
  --dit_path path/to/your/dit_ckpt \
```
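The 832×480 recommendation can be met with a cover-resize followed by a center crop. Below is an illustrative stdlib-only helper for computing the geometry (pair it with `PIL.Image.resize` and `crop`); it is not part of the Astra codebase.

```python
def fit_to_target(w: int, h: int, tw: int = 832, th: int = 480):
    """Return (resize_w, resize_h) and a center-crop box that scale an
    image to cover tw x th while preserving aspect ratio, then crop to fit."""
    scale = max(tw / w, th / h)                 # cover, don't letterbox
    rw, rh = round(w * scale), round(h * scale)
    left, top = (rw - tw) // 2, (rh - th) // 2  # center the crop window
    return (rw, rh), (left, top, left + tw, top + th)
```

For a 1920×1080 input this resizes to 853×480 and trims ~10 px from each side; portrait inputs are cropped vertically instead.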
