SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
<p align="center"> Yicheng Xiao*, Wenhu Zhang*, Lin Song ✉️, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi ✉️ </p> <p align="center"> 🧭 Fine-grained spatial editing | 🧪 Benchmarking | 🎥 Camera and object manipulation </p> <p align="center"> <a href="https://github.com/EasonXiao-888/SpatialEdit/blob/main/pdf/SpatialEditing.pdf"><img src='https://img.shields.io/badge/SpatialEdit-Paper-red?logo=bookstack&logoColor=red'></a> <a href="https://huggingface.co/datasets/EasonXiao-888/SpatialEdit-500K"><img src="https://img.shields.io/badge/SpatialEdit500K-Data-yellow?logo=huggingface&logoColor=yellow"></a> <a href="https://huggingface.co/EasonXiao-888/SpatialEdit-16B"><img src="https://img.shields.io/badge/SpatialEdit16B-Model-blue?logo=huggingface&logoColor=yellow"></a> <a href="https://huggingface.co/datasets/EasonXiao-888/SpatialEdit-Bench"><img src="https://img.shields.io/badge/SpatialEditBench-Data-green?logo=huggingface&logoColor=yellow"></a> </p>
🎬 Demo
The following demo showcases our method on fine-grained spatial editing from spatially controlled endpoints.
https://github.com/user-attachments/assets/b42c7a51-1220-4690-9fcb-5892672cb87d
🚀 Application Gallery
🧊 3D Point Control
<p align="center"> <img src="assets/application/3dpoint/01.gif" width="23%" /> <img src="assets/application/3dpoint/02.gif" width="23%" /> <img src="assets/application/3dpoint/11.gif" width="23%" /> <img src="assets/application/3dpoint/12.gif" width="23%" /> </p>
✨ The first and third examples show point clouds with only a single given viewpoint. The second and fourth examples are augmented by our model, which synthesizes richer spatial observations from the sparse input view.
🎥 Conditional-Frame-Based Video Generation
✨ Given the first frame, our editing model first performs spatial editing to produce the final frame of the video. The video generation model then synthesizes a coherent transition sequence while preserving scene realism and thematic consistency.
📷 Camera Trajectory Transformation
<p align="center"> <img src="assets/application/camera/input.png" width="31%" /> <img src="assets/application/camera/output.png" width="31%" /> <img src="assets/application/camera/video.gif" width="31%" /> </p>
🚶 Object Moving
<p align="center"> <img src="assets/application/moving/input.png" width="31%" /> <img src="assets/application/moving/output.png" width="31%" /> <img src="assets/application/moving/video.gif" width="31%" /> </p> <p align="center"> <img src="assets/application/moving/input2.png" width="31%" /> <img src="assets/application/moving/output2.png" width="31%" /> <img src="assets/application/moving/video2.gif" width="31%" /> </p>
🔄 Object Rotation
<p align="center"> <img src="assets/application/rotation/input.png" width="31%" /> <img src="assets/application/rotation/output.png" width="31%" /> <img src="assets/application/rotation/video.gif" width="31%" /> </p>
📝 Abstract
Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite.
Our contributions are three-fold:
- We introduce SpatialEdit-Bench, a comprehensive benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis.
- To address the data bottleneck for scalable training, we construct SpatialEdit-500K, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations.
- Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks.
🔗 Resources
| Resource | Description | Link |
| --- | --- | --- |
| 🧪 Training Data | SpatialEdit-500K synthetic training set for scalable fine-grained spatial editing | 🤗 Hugging Face |
| 🧠 Model Weights | SpatialEdit-16B checkpoints for image spatial editing | 🤗 Hugging Face |
| 🖼️ Benchmark Images | SpatialEdit-Bench benchmark images and evaluation assets | 🤗 Hugging Face |
🌍 Overview
SpatialEdit focuses on spatially grounded image editing, where the goal is not just to change appearance, but to control object motion, rotation, 3D viewpoint, framing, and camera movement with precision.

📏 SpatialEdit-Bench
SpatialEdit-Bench evaluates both object-centric and camera-centric edits. The benchmark is designed to score whether an edited image is visually plausible while also satisfying the requested spatial transformation.

🏗️ SpatialEdit-500K Data Engine
To support scalable training and controlled evaluation, SpatialEdit-500K is built with a synthetic rendering pipeline that systematically varies object pose, placement, and camera trajectories over diverse scenes.

🎨 Visual Comparisons
Qualitative comparisons highlight the advantage of SpatialEdit on fine-grained spatial manipulation tasks.


⚙️ Installation
Create a Python environment and install the dependencies:
```shell
pip install -r requirements.txt
pip install accelerate peft gradio pillow
```
Notes:
- `flash_attn` in `requirements.txt` requires a compatible CUDA and PyTorch environment.
- Some config files still contain placeholder or internal paths and should be updated before running inference.
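Before attempting the `flash_attn` build, it can help to confirm the packages it builds against are importable. A minimal, generic environment check (not a utility shipped with this repo):

```python
import importlib.util

def flash_attn_prereqs() -> dict:
    """Report whether torch and flash_attn are importable in this environment.
    Generic sanity check; a True value does not guarantee CUDA compatibility."""
    return {
        "torch": importlib.util.find_spec("torch") is not None,
        "flash_attn": importlib.util.find_spec("flash_attn") is not None,
    }

print(flash_attn_prereqs())
```

If `torch` reports `False`, install PyTorch first, since `flash_attn` compiles against it.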
📦 Prerequisites
Before running the code, please download the required external checkpoints first:
- VGGT: required for camera-level benchmark evaluation.
- YOLO26x: required for framing evaluation. The current benchmark script expects `yolo26x.pt`.
- Qwen3-VL-8B-Instruct: used as the vision-language backbone in the current config.
- Wan2.1-T2V-1.3B: download the `Wan2.1_VAE.pth` weights used by the VAE configuration.
🧪 Quick Demo
The repo currently provides a simple local inference entry point:
```shell
python spatialedit_demo.py
```
Before running, update the checkpoint paths in `spatialedit_demo.py`:
- `ckpt_path_PT`
- `ckpt_path_CT`
- `device`

The example input image is located at `validation/JD_Dog.jpeg`.
🏃 Benchmark Inference
To generate edited outputs for SpatialEdit-Bench, use:
```shell
torchrun --nnodes 1 --nproc_per_node 8 SpatialEdit-Bench/eval_inference.py \
    --config configs/spatialedit_base_config.py \
    --ckpt-path /path/to/checkpoint_or_lora \
    --save-path /path/to/save_dir \
    --meta-file /path/to/SpatialEdit_Bench_Meta_File.json \
    --bench-data-dir /path/to/SpatialEdit_Bench_Data \
    --basesize 1024 \
    --num-inference-steps 50 \
    --guidance-scale 5.0 \
    --seed 42
```
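`torchrun` launches `eval_inference.py` once per GPU (8 processes here), so each rank typically processes a disjoint slice of the benchmark metadata. A sketch of that common sharding pattern (a hypothetical helper, not code from the repo):

```python
def shard_for_rank(items, rank: int, world_size: int):
    """Return the slice items[rank::world_size], the usual way a
    distributed inference script divides work across processes."""
    return items[rank::world_size]

# Example: 10 benchmark entries split across the 8 processes torchrun starts.
items = list(range(10))
shards = [shard_for_rank(items, r, 8) for r in range(8)]
# Every item lands on exactly one rank.
assert sorted(x for s in shards for x in s) == items
```

Striding by `world_size` keeps shard sizes within one item of each other even when the item count does not divide evenly.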
You can also adapt the provided launcher script: `SpatialEdit-Bench/scripts/dist_inference.sh`.
📊 Benchmark Evaluation
📷 Camera-Level Evaluation
Camera-level evaluation measures viewpoint reconstruction and framing fidelity:
```shell
bash SpatialEdit-Bench/scripts/dist_camera_eval.sh
```
Update the placeholder paths in the script before running:
- `VGGT`
- `YOLO`
- `EVAL_DATA`
- `META_DATA_FILE`
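Viewpoint reconstruction is typically scored by comparing the camera pose recovered from the edited image (e.g. via VGGT) against the requested transformation. As an illustration only (the repo's own metric may differ), the geodesic angle between two rotation matrices is a standard error measure:

```python
import math

def rotation_angle_deg(R_est, R_gt):
    """Geodesic distance between two 3x3 rotation matrices (plain nested
    lists): the angle of the relative rotation R_est^T @ R_gt, in degrees."""
    # trace(R_est^T @ R_gt) equals the elementwise dot product of the matrices.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp into acos's domain for numerical safety.
    c = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(c))

I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Rz90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90° rotation about the z-axis
print(rotation_angle_deg(I, Rz90))  # → 90.0
```

A perfect reconstruction gives 0°; the metric is symmetric in its two arguments.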
🧩 Object-Level Evaluation
Object-level evaluation scores edit faithfulness and benchmark statistics:
```shell
bash SpatialEdit-Bench/scripts/dist_object_eval.sh
```
Update the script paths and evaluation backend first:
- `META_FILE`
- `SAVE`
- `BENCH_DATA_DIR`
- `BACKBONE`
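Object-level faithfulness for moving edits often reduces to comparing where an object ends up against where it was asked to go. As a generic illustration (not the repo's actual scorer), bounding-box IoU between the target and edited placements:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Object asked to move to `target`; the edit placed it at `edited` (10px short).
target = (150, 100, 250, 200)
edited = (140, 100, 240, 200)
print(round(box_iou(target, edited), 3))  # → 0.818
```

An IoU near 1.0 means the edit landed the object where requested; disjoint boxes score 0.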
💡 Notes
- `configs/spatialedit_base_config.py` currently contains internal absolute paths and should be replaced with your local model paths.
- The benchmark scripts assume access to external benchmark metadata, source images, and model checkpoints.
- The repo already includes example evaluation utilities under `SpatialEdit-Bench/camera_level_eval` and `SpatialEdit-Bench/object_level_eval`.
❤️ Acknowledgement
Code in this repository builds upon several excellent open-source projects. We sincerely thank ReCamMaster and TexVerse for their outstanding contributions.
We also extend our gratitude to Yanbing Zhang for his valuable support throughout this project.
Additionally, our resource construction pipeline and experiments have contributed to the development of the image editing model in JoyAI-Image.