EditWorld

[ACM Multimedia 2025 Datasets Track] EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

News

August 1, 2025

  • EditWorld has been accepted to the ACM Multimedia 2025 Datasets Track.

June 23, 2024

  • After consulting with the sponsors, we have released a training dataset that has not been manually rechecked. The dataset is available at EditWorld_data. Best of luck with your research!

Overview

This repository contains the official implementation of EditWorld. In this work, we introduce a new task, world-instructed image editing, which defines and categorizes instructions grounded in various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLaVA, and SDXL). We also propose a new post-edit method for world-instructed image editing.

World Instruction vs. Traditional Instruction

[Figure: first_img]

Generated Results of Our EditWorld:

[Figure: sample1]

Planning

  • [√] Providing the full pipeline of text-to-image generation for the EditWorld dataset.
  • [√] Releasing the evaluation dataset.
  • [√] Releasing the basic training dataset.
  • [ ] Releasing checkpoints.
  • [ ] Releasing training and post-edit code.

Codebase

Text-to-image generation branch

First, we use GPT-3.5 to generate textual quadruples:

python gpt_script/text_img_gen_aigcbest_full.py --define_json gpt_script/define_sample_history/define_sample.json --output_path gpt_script/gen_sample_history/ --output_json text_gen.json

Then, we convert the text output produced by GPT into a dict:

python tools/deal_text2json.py --input_json gpt_script/gen_sample_history/text_gen.json --output_json text_gen_full.json

Finally, we generate the input-instruction-output triples from the textual quadruples:

python t2i_branch_base.py --text_json text_gen_full.json --save_path datasets/editworld/generated_img/

Note that t2i_branch_base.py is a fast, basic version of the text-to-image generation branch; we will improve this part in the future.
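
The three steps above can also be chained from a single driver script. The following is a minimal sketch and not part of the repository: the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and it assumes that `text_gen_full.json` is written to the working directory, as the commands above suggest. The script paths and flags are copied from those commands.

```python
import os
import subprocess
import sys


def build_pipeline_commands(
    define_json="gpt_script/define_sample_history/define_sample.json",
    gen_dir="gpt_script/gen_sample_history/",
    save_path="datasets/editworld/generated_img/",
):
    """Return the three text-to-image branch commands as argument lists.

    Hypothetical helper: only the script paths and flags come from the README.
    """
    text_gen = os.path.join(gen_dir, "text_gen.json")
    return [
        # Step 1: ask GPT-3.5 for textual quadruples.
        [sys.executable, "gpt_script/text_img_gen_aigcbest_full.py",
         "--define_json", define_json,
         "--output_path", gen_dir,
         "--output_json", "text_gen.json"],
        # Step 2: convert the GPT text output into a dict stored as JSON.
        [sys.executable, "tools/deal_text2json.py",
         "--input_json", text_gen,
         "--output_json", "text_gen_full.json"],
        # Step 3: generate the input-instruction-output image triples.
        [sys.executable, "t2i_branch_base.py",
         "--text_json", "text_gen_full.json",
         "--save_path", save_path],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

To execute the steps, call `run_pipeline(build_pipeline_commands())` from the repository root. With `check=True`, a failing step raises `subprocess.CalledProcessError`, so later steps never run on missing input.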

Video branch

The video_script directory contains the code for downloading videos from InternVid.

Dataset

Dataset structure

To obtain the training dataset file train.json, use the script tools/obtain_datasetjson.py. The dataset is organized as follows:

datasets/
├── editworld/
│   ├── generated_img/
│   │   ├── group_0/
│   │   │   ├── sample0_ori.png
│   │   │   ├── sample0_tar.png
│   │   │   ...
│   │   │   └── img_txt.json
│   │   └── group_1/
│   │   ...
│   ├── video_img/
│   │   ├── group_0/
│   │   │   ├── sample0_ori.png
│   │   │   ├── sample0_tar.png
│   │   │   ...
│   │   │   └── img_txt.json
│   │   └── group_1/
│   │   ...
│   └── human_select_img/
│       ├── group_0/
│       │   ├── sample0_ori.png
│       │   ├── sample0_tar.png
│       │   ...
│       │   └── img_txt.json
│       └── group_1/
│       ...
└── train.json
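
To sanity-check a local copy against this layout, the `_ori`/`_tar` pairs can be enumerated with a short script. This is a minimal sketch and not a substitute for tools/obtain_datasetjson.py: the function name `collect_pairs` and the record keys are hypothetical, and because the schema of img_txt.json is not described here, the sketch records its path without parsing it.

```python
from pathlib import Path

ORI_SUFFIX = "_ori.png"
TAR_SUFFIX = "_tar.png"


def collect_pairs(root):
    """List (original, target) image pairs under root/<branch>/group_*/.

    Hypothetical helper; `root` is expected to be datasets/editworld.
    """
    root = Path(root)
    records = []
    for ori in sorted(root.glob("*/group_*/*" + ORI_SUFFIX)):
        # "sample0_ori.png" -> "sample0" -> "sample0_tar.png"
        stem = ori.name[: -len(ORI_SUFFIX)]
        tar = ori.with_name(stem + TAR_SUFFIX)
        if not tar.exists():
            continue  # skip samples whose target image is missing
        records.append({
            "branch": ori.parent.parent.name,  # e.g. generated_img
            "group": ori.parent.name,          # e.g. group_0
            "original": ori.relative_to(root).as_posix(),
            "target": tar.relative_to(root).as_posix(),
            "annotation": (ori.parent / "img_txt.json").relative_to(root).as_posix(),
        })
    return records
```

Calling `collect_pairs("datasets/editworld")` returns one record per complete pair, which makes it easy to spot groups with missing target images before building train.json.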

Evaluation dataset link

Our evaluation dataset is available at editworld_test.

Quantitative Comparison of CLIP Score and MLLM Score

IP2P: InstructPix2Pix; MB: MagicBrush. Bold results are the best.

CLIP Score of Text-to-image Branch

| Category | IP2P | MB | EditWorld | w/o post-edit |
|-----------------|--------|--------|------------|---------------|
| Long-Term | 0.2140 | 0.1870 | 0.2244 | **0.2294** |
| Physical-Trans | 0.2186 | 0.2101 | 0.2385 | **0.2467** |
| Implicit-Logic | 0.2390 | 0.2432 | **0.2542** | 0.2440 |
| Story-Type | 0.2063 | 0.2070 | **0.2534** | 0.2354 |
| Real-to-Virtual | 0.2285 | 0.2344 | **0.2524** | 0.2435 |

CLIP Score of Video Branch

| Category | IP2P | MB | EditWorld | w/o post-edit |
|----------------|--------|--------|------------|---------------|
| Spatial-Trans | 0.2175 | 0.1997 | **0.2420** | 0.2286 |
| Physical-Trans | 0.2315 | 0.2278 | 0.2467 | **0.2483** |
| Story-Type | 0.2318 | 0.2262 | 0.2365 | **0.2399** |
| Exaggeration | 0.2416 | 0.2328 | **0.2443** | 0.2433 |

MLLM Score of Both Branches

| Category | IP2P | MB | EditWorld | w/o post-edit |
|---------------|--------|--------|------------|---------------|
| Text-to-image | 0.8763 | 0.8455 | 0.8958 | **0.9060** |
| Video | 0.9493 | 0.9715 | **0.9920** | 0.9891 |
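
This README does not spell out how the CLIP score is computed. A common convention, assumed here, is the cosine similarity between the CLIP image embedding of the edited image and the CLIP text embedding of the target description. The sketch below illustrates only that final similarity step on plain Python lists; the function name `clip_style_score` is hypothetical, and extracting real embeddings from a CLIP model is out of scope.

```python
import math


def clip_style_score(image_embedding, text_embedding):
    """Cosine similarity between two embedding vectors.

    Assumption: this mirrors the usual CLIP-score convention, not
    necessarily the exact evaluation protocol used for the tables above.
    """
    if len(image_embedding) != len(text_embedding):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(image_embedding, text_embedding))
    norm_image = math.sqrt(sum(a * a for a in image_embedding))
    norm_text = math.sqrt(sum(b * b for b in text_embedding))
    if norm_image == 0.0 or norm_text == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_image * norm_text)
```

Under this convention the score lies in [-1, 1], and a higher value means the edited image is closer to the target description in CLIP's embedding space.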

Citation

@article{yang2024editworld,
  title={EditWorld: Simulating World Dynamics for Instruction-Following Image Editing},
  author={Yang, Ling and Zeng, Bohan and Liu, Jiaming and Li, Hong and Xu, Minghao and Zhang, Wentao and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2405.14785},
  year={2024}
}
