EdiVal
[ICLR 2026] Official code for *EdiVal-Agent: Automated, object-centric evaluation for multi-turn instruction-based image editing*.
<!--- BADGES: START --->
<!--- BADGES: END --->

Want to run EdiVal-Agent on your own images? Jump to the **Bring Your Own Images** section for a step-by-step walkthrough.
Project Website • Hugging Face Repository
Welcome to the official repository for EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing (arXiv:2509.13399). With the toolkit in this repo you can generate fresh multi-turn instructions, run your own editing models (or ours) against the benchmark, evaluate instruction-following, consistency, and quality across turns, and reproduce every experiment from the paper with the provided scripts and notebooks.
<details>
<summary><strong>Table of Contents</strong></summary>

- Overview
- Overall Multi-Turn Editing Leaderboard
- Repository Structure
- Getting Started
- Data Hub
- Instruction Generation Pipeline
- Image Generation
- Evaluation
- Analysis & Reproduction
- Bring Your Own Images
- Citation
- Support & Updates

</details>
Overview
- Goal: benchmark instruction-following, consistency, and perceptual quality in sequential (multi-turn) image editing.
- Inputs: 512×512 images and curated 3-turn editing instructions.
- Outputs: multipass & singlepass generations, instruction-following scores, consistency metrics, and quality scores (including optional HPSv3).
Overall Multi-Turn Editing Leaderboard
| Rank | Creator | Model | Technique | Score | Release |
| --- | --- | --- | --- | --- | --- |
| 1🥇 | <img src="assets/bytedance.png" alt="ByteDance logo" height="20"/> ByteDance | Seedream 4.0 | Unknown | 59.76 | Sep 2025 |
| 2🥈 | <img src="assets/google.svg" alt="Google logo" height="20"/> Google | Nano Banana | Unknown | 56.24 | Aug 2025 |
| 3🥉 | <img src="assets/openai.svg" alt="OpenAI logo" height="20"/> OpenAI | GPT-Image-1 | Unknown | 53.81 | Jul 2025 |
| 4 | <img src="assets/flux.svg" alt="Black Forest Labs logo" height="20"/> Black Forest Labs | FLUX.1-Kontext-max | Flow Matching | 53.04 | Jun 2025 |
| 5 | <img src="assets/google.svg" alt="Google logo" height="20"/> Google | Gemini 2.0 Flash | Unknown | 47.94 | Feb 2025 |
| 6 | <img src="assets/ablibaba.svg" alt="Alibaba logo" height="20"/> Alibaba | Qwen-Image-Edit | Flow Matching | 41.93 | Aug 2025 |
| 7 | <img src="assets/stepfun.svg" alt="StepFun logo" height="20"/> StepFun | Step1X-Edit | Flow Matching | 38.98 | Apr 2025 |
| 8 | <img src="assets/flux.svg" alt="Black Forest Labs logo" height="20"/> Black Forest Labs | FLUX.1-Kontext-dev | Flow Matching | 38.71 | Jun 2025 |
| 9 | <img src="assets/ominigen.png" alt="VectorSpaceLab logo" height="20"/> VectorSpaceLab | OmniGen | Flow Matching | 29.91 | Sep 2024 |
| 10 | - | UltraEdit | Diffusion | 22.89 | Jul 2024 |
| 11 | - | AnyEdit | Diffusion | 22.50 | Nov 2024 |
| 12 | - | MagicBrush | Diffusion | 19.41 | Jun 2023 |
| 13 | - | InstructPix2Pix | Diffusion | 12.99 | Dec 2023 |
Repository Structure
- `env_setup/` – Conda environment specification (`env.yaml`) and bootstrap script (`setup_edival.sh`).
- `generate_instructions/` – Full instruction pipeline: object parsing, grounding filter, CSV export, and candidate pools.
- `generate.py` – Runs your editor to generate edited images, with the Qwen-Image-Edit model as an example.
- `baseline_generate/` – Historical baseline scripts retained for comparison, including GPT-Image-1 and the Flux models.
- `detector/` – Instruction-following, consistency, and quality evaluation modules.
- `eval.py` / `eval_bash.sh` – Core evaluator and batch helper.
- `example_evaluate_results/` – Reference outputs for sanity checking; your output should have a similar structure.
- `analysis.ipynb` – Notebook used to analyze your final results in `example_evaluate_results/`.
- `oai_instruction_generation_output.csv` – Sample 3-turn instruction CSV (the instructions used in the paper).
- `update_hps_scores.py` – Utility to backfill HPSv3 scores into evaluation JSONs. Use this script if you need the HPSv3 quality score, since the HPSv3 environment conflicts with the other metrics' dependencies.
Getting Started
```bash
# 1. Create the environment (once)
bash env_setup/setup_edival.sh

# 2. Activate for every new shell
conda activate edival
```
The bootstrap script installs PyTorch (CUDA 12.1), GroundingDINO (editable mode), diffusers, vLLM dependencies, and all evaluation packages. Modify env_setup/env.yaml if you need different versions.
Data Hub
All assets live in the Hugging Face repository C-Tianyu/EdiVal:
- `input_images_resize_512.zip` – canonical 512×512 image set.
- `baseline_generations/*` – pre-generated outputs: GPT-Image-1, Nano Banana, SeedDream v4, etc.
Download the resources you need and place them at the repository root (paths can be overridden via CLI flags).
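Once the archive is in place, a quick sanity check of its contents can be done from Python. This is a minimal sketch: the archive name comes from the list above, but nothing about the internal layout is assumed.

```python
import zipfile

def list_zip_images(path="input_images_resize_512.zip"):
    """Return the image filenames contained in the downloaded archive."""
    with zipfile.ZipFile(path) as zf:
        return [n for n in zf.namelist()
                if n.lower().endswith((".png", ".jpg", ".jpeg"))]
```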
Instruction Generation Pipeline
All scripts reside in generate_instructions/. Candidate vocabularies are stored in generate_instructions/candidate_pools/.
Before you start: set `OPENAI_API_KEY` (and optionally `OPENAI_API_BASE` if you use a custom endpoint).
- **Object Extraction**

  ```bash
  export OPENAI_API_KEY=sk-...
  python generate_instructions/oai_all_objects.py \
    --input-dir input_images_resize_512 \
    --output-dir generate_instructions/oai_all_objects
  ```

  Produces rich JSON metadata (`<index>_input_raw.json`) for every image.

- **Grounding Filter**

  ```bash
  python generate_instructions/grounding_filter.py \
    --input-dir generate_instructions/oai_all_objects \
    --output-dir generate_instructions/grounding_all_objects \
    --image-dir input_images_resize_512 \
    --num-gpus 2 \
    --box-threshold 0.35 \
    --text-threshold 0.35
  ```

  Uses GroundingDINO to keep only visually grounded objects, adding bounding boxes and counts.

- **CSV Export**

  ```bash
  export OPENAI_API_KEY=sk-...
  python generate_instructions/oai_instruction_generator.py \
    --grounding-dir generate_instructions/grounding_all_objects \
    --input-images input_images_resize_512 \
    --output oai_instruction_generation_output.csv \
    --seed 42
  ```

  Generates the multi-turn instruction CSV used downstream. Candidate pools come from `generate_instructions/candidate_pools/*.txt`; regenerate them with `generate_instructions/candidate_pools/generate_objects_txt.py` if you need to refresh vocabularies.
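To inspect the exported CSV before running an editor over it, the rows can be loaded with the standard library. A minimal sketch — the filename matches the export step above, but the column names are not assumed here:

```python
import csv

def load_instructions(path="oai_instruction_generation_output.csv"):
    """Load the multi-turn instruction CSV into a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

Printing `load_instructions()[0].keys()` shows the column layout of your generated file.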
Image Generation
Run Qwen/Qwen-Image-Edit (or your own editor) over the instructions:
```bash
python generate.py \
  --csv oai_instruction_generation_output.csv \
  --zip input_images_resize_512.zip \
  --output-dir your_generations \
  --num-gpus 2
```
Outputs land in `your_generations/multipass` and `your_generations/singlepass`. To plug in a custom model, implement a class that exposes `generate_single_edit` and point to it via:
```bash
python generate.py --editor-class my_module:MyCustomGenerator ...
```
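A minimal sketch of such a class follows. Only the method name `generate_single_edit` comes from this README; the constructor argument and the method signature here are assumptions about the expected interface, so check `generate.py` for the exact contract.

```python
# my_module.py — pluggable editor sketch; the signature below is an assumption.
class MyCustomGenerator:
    def __init__(self, device="cuda"):
        self.device = device  # load your model/pipeline weights here

    def generate_single_edit(self, image, instruction):
        """Apply one text instruction to `image` and return the edited image."""
        # Placeholder: a real editor would run its pipeline here.
        return image
```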
Evaluation
- **Single Folder**

  ```bash
  python eval.py \
    --generation_folder your_generations \
    --modes multipass singlepass
  ```

  Computes instruction-following, consistency, and quality metrics (HPSv3 if available) and writes JSON summaries to `evaluate_results/your_generations/`.

- **Batch Mode**

  ```bash
  bash eval_bash.sh
  ```

  Adjust `BASE_DIR`, `JOBS`, and `GPU_GROUPS` at the top of the script to suit your hardware; the helper loops over all subfolders in `BASE_DIR`.

- **Fill in HPSv3 (Optional)**

  ```bash
  bash env_setup/setup_hpsv3.sh && conda activate hps && python update_hps_scores.py \
    --results_root evaluate_results/your_generations \
    --num_gpus 2 \
    --batch_size 4
  ```

  Backfills missing HPSv3 scores into the evaluation JSONs.
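Once a run finishes, the summary JSONs can be gathered programmatically for your own post-processing. A minimal sketch — the directory path follows the convention above, but the JSON schema itself is not assumed:

```python
import json
from pathlib import Path

def collect_results(root="evaluate_results/your_generations"):
    """Map each evaluation JSON under `root` to its parsed contents."""
    return {
        str(p.relative_to(root)): json.loads(p.read_text())
        for p in sorted(Path(root).rglob("*.json"))
    }
```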
Analysis & Reproduction
- `analysis.ipynb` – Aggregates evaluation outputs and generates the plots/tables from the paper.
- `example_evaluate_results/` – Reference outputs to verify your setup.
Launch Jupyter or VS Code within the `edival` environment to explore the results interactively.
Bring Your Own Images
Want to evaluate your own dataset? Follow the same three-stage pipeline used for EdiVal's release.
- Prepare Inputs
- Collect/source your raw images (ideally resized to 512×512 for parity with the benchmark).
- Place the ZIP (or directory) alongs
