WISA
World Simulator Assistant for Physics-Aware Text-to-Video Generation
This is the official reproduction of WISA, designed to enhance Text-to-Video models by improving their ability to simulate the real world.
WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
Jing Wang*, Ao Ma*, Ke Cao*, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng‡, Yuhui Yin, Xiaodan Liang‡(*Equal Contribution, ‡Corresponding Authors)
📰 News
- [2025.05.15] 🔥 We are excited to announce the official release of WISA's codebase and model weights on GitHub! This implementation is built upon the powerful finetrainers framework.
- [2025.03.28] We have uploaded the WISA-80K dataset to Hugging Face, including processed video clips and annotations.
- [2025.03.12] We have released our paper WISA and created a dedicated project homepage.
🚀 Quick Start
1. Environment Set Up
Clone this repository and install packages.
git clone https://github.com/360CVGroup/WISA.git
cd WISA
conda create -n wisa python=3.10
conda activate wisa
pip install -r requirements.txt
2. Download Pretrained Weights
1. Download Text-to-Video Pretrained Models
Please download the CogVideoX and Wan2.1 checkpoints from ModelScope and put them in ./pretrain_models/.
mkdir ./pretrain_models
cd ./pretrain_models
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B-Diffusers --local_dir ./Wan2.1-T2V-14B-Diffusers
modelscope download ZhipuAI/CogVideoX-5b --local_dir ./CogVideoX-5b-Diffusers
2. Download WISA Pretrained Lora and Physical-block Weight
Please download the weights from Hugging Face and put them in ./pretrain_models/WISA/.
git lfs install
git clone https://huggingface.co/qihoo360/WISA
cd ..
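After both download steps, a quick shell check can confirm the expected layout (a sketch; the directory names assume the commands above were run exactly as written):

```shell
# Sanity check (illustrative): verify the three expected model directories
# exist under ./pretrain_models/ after the download steps above.
for d in Wan2.1-T2V-14B-Diffusers CogVideoX-5b-Diffusers WISA; do
  if [ -d "./pretrain_models/$d" ]; then
    echo "found:   $d"
  else
    echo "missing: $d"
  fi
done
```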
3. Generate Video
You can edit MODEL_TYPE, GEN_TYPE, PROMPT_PATH, OUTPUT_FILE, and LORA_PATH in inference.sh for different inference settings.
Then run
sh inference.sh
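For reference, the variables at the top of inference.sh might be set roughly as follows (the values below are illustrative assumptions, not the script's actual defaults; check inference.sh for the options it accepts):

```shell
# Illustrative settings only -- the accepted values are defined in inference.sh.
MODEL_TYPE="wan"                       # base model to use (assumed value)
GEN_TYPE="wisa"                        # generation mode (assumed value)
PROMPT_PATH="./prompts/example.txt"    # text prompts to render (hypothetical path)
OUTPUT_FILE="./outputs"                # where generated videos are saved (hypothetical path)
LORA_PATH="./pretrain_models/WISA"     # WISA LoRA + physical-block weights
echo "model=${MODEL_TYPE} mode=${GEN_TYPE} output=${OUTPUT_FILE}"
```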
✨ Training
1. Download WISA-80K
Download the WISA-80K dataset from Hugging Face.
2. Precomputing Latents and Text Embeddings (Optional)
This project supports precomputing and saving the latent codes of videos and text embeddings to avoid loading the VAE and Text Encoder onto the GPU during training, thereby reducing GPU memory usage. This operation is essential when training Wan2.1-14B; otherwise, it will result in an out-of-memory (OOM) error.
Step 1: Add the following parameters to the dataset_cmd in your training script (e.g., examples/training/sft/wan/crush_smol_lora/train_wisa.sh), and make sure you have sufficient storage space available.
dataset_cmd=(
--dataset_config $TRAINING_DATASET_CONFIG
--dataset_shuffle_buffer_size 10
--precomputation_items 2000 # Number of samples to precompute
--enable_precomputation # Flag to activate precomputation
--precomputation_once
--precomputation_dir ./cache/path # Directory for cached outputs
--hash_save # Enable hash-based filename storage
--first_samples
)
Step 2: Configure the dataset paths in examples/training/sft/wan/crush_smol_lora/training_wisa.json and run
sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh
Note: Process the data in batches to prevent CPU cache overload (recommended maximum: 12,000 samples per batch).
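The batching advice above can be sketched as a loop that splits the dataset into chunks of at most 12,000 samples (the total count and the per-batch launch step are illustrative placeholders):

```shell
# Split precomputation into batches of at most 12000 samples (illustrative).
TOTAL=80000    # e.g. the full WISA-80K dataset
BATCH=12000    # recommended per-batch maximum
START=0
while [ "$START" -lt "$TOTAL" ]; do
  REMAIN=$(( TOTAL - START ))
  if [ "$REMAIN" -lt "$BATCH" ]; then COUNT=$REMAIN; else COUNT=$BATCH; fi
  echo "precompute items ${START}..$(( START + COUNT - 1 ))"
  # here you would launch the training script, setting --precomputation_items
  # to $COUNT and pointing it at this batch's slice of the dataset
  START=$(( START + COUNT ))
done
```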
Step 3: Disable the --enable_precomputation flag
dataset_cmd=(
--dataset_config $TRAINING_DATASET_CONFIG
--dataset_shuffle_buffer_size 10
--precomputation_items 2000 # Number of samples to precompute
# --enable_precomputation # Flag to activate precomputation
--precomputation_once
--precomputation_dir ./cache/path # Directory for cached outputs
--hash_save # Enable hash-based filename storage
--first_samples
)
3. Start Training
sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh
Validation is currently disabled: a bug in the validation phase produces video generation artifacts, so validation results deviate significantly from those obtained at test time.
👍 Acknowledgement
This work stands on the shoulders of groundbreaking research and open-source contributions. We extend our deepest gratitude to the authors and contributors of the following projects:
- CogVideoX - For their pioneering work in video generation
- Wan2.1 - For their foundational contributions to large-scale video models
Special thanks to the finetrainers framework for enabling efficient model training - your excellent work has been invaluable to this project.
BibTeX
@misc{wang2025wisa,
      title={WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation},
      author={Jing Wang and Ao Ma and Ke Cao and Jun Zheng and Zhanjie Zhang and Jiasong Feng and Shanyuan Liu and Yuhang Ma and Bo Cheng and Dawei Leng and Yuhui Yin and Xiaodan Liang},
      year={2025},
      eprint={2502.08153},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.08153},
}