# X2Edit (AAAI 2026)
## X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Jian Ma<sup>1</sup>, Xujie Zhu<sup>2</sup>, Zihao Pan<sup>2</sup>, Qirong Peng<sup>1</sup>, Xu Guo<sup>3</sup>, Chen Chen<sup>1</sup>, Haonan Lu<sup>1</sup> <br>
<sup>1</sup>OPPO AI Center, <sup>2</sup>Sun Yat-sen University, <sup>3</sup>Tsinghua University <br>
<div align="center"> <img src="assets/X2Edit images.jpg" alt="X2Edit image generation results"> </div>

## News
- 2025/09/16: We release a dataset built using Qwen-Image and Qwen-Image-Edit. This sub-dataset focuses specifically on subject-driven generation with facial consistency, a key requirement for maintaining stable subject identity across generated content. Asian-portrait and NonAsian-portrait
- 2025/08/25: Support Qwen-Image for training and inference. Checkpoint
<div align="center"> <img src="assets/qwen-image1.png" alt="X2Edit image generation results with Qwen-Image"> </div> <div align="center"> <img src="assets/qwen-image0.png"> </div>

## Environment
Prepare the environment and install the required libraries:

```shell
$ cd X2Edit
$ conda create --name X2Edit python==3.11
$ conda activate X2Edit
$ pip install -r requirements.txt
```
Clone LaMa into `data_pipeline` and rename it to `lama`. Clone SAM and GroundingDINO into `SAM`, then rename them to `segment_anything` and `GroundingDINO`, respectively.
## Data Construction

![dataset detail](./assets/dataset_detail.jpg)
X2Edit provides executable scripts for each data construction workflow shown in the figure. We organize the dataset using the WebDataset format; please replace the dataset paths in the scripts with your own. The Qwen models referenced below can be selected from Qwen2.5-VL-72B, Qwen3-8B, and Qwen2.5-VL-7B. In addition, we also use aesthetic scoring models for screening: please download SigLIP and aesthetic-predictor-v2-5, then change the corresponding paths in `siglip_v2_5.py`.
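The WebDataset convention packs each sample as same-prefix files inside a tar shard. Below is a minimal sketch of that layout using only the standard library; the actual scripts may rely on the `webdataset` package, and the file keys and field names here are illustrative, not the repo's real schema:

```python
import io
import json
import tarfile

def write_shard(path, samples):
    """Pack (key, image_bytes, metadata) samples into a WebDataset-style tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, meta in samples:
            for suffix, payload in ((".jpg", image_bytes),
                                    (".json", json.dumps(meta).encode())):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def read_shard(path):
    """Yield (key, image_bytes, metadata) samples back out of a shard."""
    samples = {}
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            key, _, suffix = member.name.rpartition(".")
            samples.setdefault(key, {})[suffix] = tar.extractfile(member).read()
    for key, parts in samples.items():
        yield key, parts["jpg"], json.loads(parts["json"])
```

Files sharing a prefix (`000001.jpg`, `000001.json`) are grouped into one sample, which is what lets shards be streamed sequentially during training.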
- Subject Addition & Deletion → use `expert_subject_deletion.py` and `expert_subject_deletion_filter.py`: The former constructs deletion-type data, while the latter uses the fine-tuned Qwen2.5-VL-7B to further screen the constructed deletion-type data. Before executing, download RAM, GroundingDINO, SAM, Randeng-Deltalm, InfoXLM, RMBG and LaMa.
- Normal Editing Tasks → use `step1x_data.py`: Please download the checkpoint Step1X-Edit. The language model we use is Qwen2.5-VL-72B.
- Subject-Driven Generation → use `kontext_subject_data.py`: Please download the checkpoints FLUX.1-Kontext, DINOv2, CLIP, OPUS-MT-zh-en and shuttle-3-diffusion. The language model we use is Qwen3-8B.
- Style Transfer → use `kontext_style_transfer.py`: Please download the checkpoints FLUX.1-Kontext, DINOv2, CLIP, OPUS-MT-zh-en and shuttle-3-diffusion. The language model we use is Qwen3-8B.
- Style Change → use `expert_style_change.py`: Please download the checkpoints FLUX.1-dev and OmniConsistency. We use Qwen2.5-VL-7B to score.
- Text Change → use `expert_text_change_ch.py` for Chinese and `expert_text_change_en.py` for English: Please download the checkpoint textflux. We use Qwen2.5-VL-7B to score.
- Complex Editing Tasks → use `bagel_data.py`: Please download the checkpoint Bagel. We use Qwen2.5-VL-7B to score.
- High Fidelity Editing Tasks → use `gpt4o_data.py`: Please download the checkpoint OPUS-MT-zh-en and use your own GPT-4o API. We use Qwen2.5-VL-7B to score.
- High Resolution Data Construction → use `kontext_data.py`: Please download the checkpoints FLUX.1-dev and OPUS-MT-zh-en. We use Qwen2.5-VL-7B to score.
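All of the expert pipelines above follow the same screening pattern: generate candidate edits, score them with a judge model (Qwen2.5-VL-7B or the aesthetic predictor), and keep only candidates above a threshold. A minimal, model-free sketch of that filter; the threshold value and field names are illustrative, not the repo's actual settings:

```python
def screen_samples(samples, score_fn, threshold=5.5):
    """Keep only candidate edits whose judge score clears the threshold.

    samples:   iterable of dicts describing candidate edits
    score_fn:  stand-in for the Qwen2.5-VL-7B / aesthetic-predictor scorer
    threshold: minimum acceptable score (illustrative value, not the repo's)
    """
    kept = []
    for sample in samples:
        score = score_fn(sample)
        if score >= threshold:
            # Record the score alongside the sample for later auditing.
            kept.append({**sample, "score": score})
    return kept
```

In the real pipelines `score_fn` would wrap an MLLM call, so filtering is typically batched per shard rather than per sample.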
## Inference
We provide inference scripts for editing images at resolutions of 1024 and 512. In addition, you can choose the base model of X2Edit from FLUX.1-Krea, FLUX.1-dev, FLUX.1-schnell, PixelWave and shuttle-3-diffusion, and choose a LoRA for integration with MoE-LoRA from Turbo-Alpha, AntiBlur, Midjourney-Mix2, Super-Realism and Chatgpt-Ghibli. Choose the model you like and download it. For the MoE-LoRA, we will open-source a unified checkpoint that can be used for both 512 and 1024 resolutions.
Before executing the script, download Qwen3-8B (used to select the task type for the input instruction), the base model (FLUX.1-Krea, FLUX.1-dev, FLUX.1-schnell or shuttle-3-diffusion), the MLLM and Alignet. All scripts follow analogous command patterns: simply replace the script filename while keeping the parameter configuration consistent.
```shell
$ python infer.py --device cuda --pixel 1024 --num_experts 12 --base_path BASE_PATH --qwen_path QWEN_PATH --lora_path LORA_PATH --extra_lora_path EXTRA_LORA_PATH
$ python infer_qwen.py --device cuda --pixel 1024 --num_experts 12 --base_path BASE_PATH --qwen_path QWEN_PATH --lora_path LORA_PATH --extra_lora_path EXTRA_LORA_PATH  # for Qwen-Image backbone
```
device: The device used for inference. default: cuda<br>
pixel: The resolution of the edited image, chosen from 1024 and 512. default: 1024<br>

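The flags shared by `infer.py` and `infer_qwen.py` can be reproduced with a small argument-parser front end. This is a sketch reconstructed from the commands above; any default or `required` setting beyond `--device` and `--pixel` is an assumption, not the repo's actual behavior:

```python
import argparse

def build_parser():
    """CLI mirroring the inference flags shown in the commands above."""
    p = argparse.ArgumentParser(description="X2Edit inference (sketch)")
    p.add_argument("--device", default="cuda",
                   help="device used for inference")
    p.add_argument("--pixel", type=int, default=1024, choices=[512, 1024],
                   help="resolution of the edited image")
    p.add_argument("--num_experts", type=int, default=12,
                   help="number of experts in MoE-LoRA (default assumed)")
    p.add_argument("--base_path", required=True,
                   help="path to the base model checkpoint")
    p.add_argument("--qwen_path", required=True,
                   help="path to Qwen3-8B for task-type selection")
    p.add_argument("--lora_path", required=True,
                   help="path to the MoE-LoRA checkpoint")
    p.add_argument("--extra_lora_path", default=None,
                   help="optional extra LoRA to integrate (assumed optional)")
    return p
```

For example, `build_parser().parse_args(["--base_path", "b", "--qwen_path", "q", "--lora_path", "l"])` falls back to `cuda` and 1024 for the unspecified flags.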