EdiVal
[ICLR 2026] Official code for *EdiVal-Agent: Automated, object-centric evaluation for multi-turn instruction-based image editing*.
<!--- BADGES: START --->
<!--- BADGES: END --->

Want to run EdiVal-Agent on your own images? Jump to the **Bring Your Own Images** section for a step-by-step walkthrough.
Project Website • Hugging Face Repository
Welcome to the official repository for EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing (arXiv:2509.13399). With the toolkit in this repo you can generate fresh multi-turn instructions, run your own editing models (or ours) against the benchmark, evaluate instruction-following, consistency, and quality across turns, and reproduce every experiment from the paper with the provided scripts and notebooks.
<details>
<summary><strong>Table of Contents</strong></summary>

- Overview
- Overall Multi-Turn Editing Leaderboard
- Repository Structure
- Getting Started
- Data Hub
- Instruction Generation Pipeline
- Image Generation
- Evaluation
- Analysis & Reproduction
- Bring Your Own Images
- Citation
- Support & Updates

</details>
Overview
- Goal: benchmark instruction-following, consistency, and perceptual quality in sequential (multi-turn) image editing.
- Inputs: 512×512 images and curated 3-turn editing instructions.
- Outputs: multipass & singlepass generations, instruction-following scores, consistency metrics, and quality scores (including optional HPSv3).
Overall Multi-Turn Editing Leaderboard
| Rank | Creator | Model | Technique | Score | Release |
| --- | --- | --- | --- | --- | --- |
| 1🥇 | <img src="assets/bytedance.png" alt="ByteDance logo" height="20"/> ByteDance | Seedream 4.0 | Unknown | 59.76 | Sep 2025 |
| 2🥈 | <img src="assets/google.svg" alt="Google logo" height="20"/> Google | Nano Banana | Unknown | 56.24 | Aug 2025 |
| 3🥉 | <img src="assets/openai.svg" alt="OpenAI logo" height="20"/> OpenAI | GPT-Image-1 | Unknown | 53.81 | Jul 2025 |
| 4 | <img src="assets/flux.svg" alt="Black Forest Labs logo" height="20"/> Black Forest Labs | FLUX.1-Kontext-max | Flow Matching | 53.04 | Jun 2025 |
| 5 | <img src="assets/google.svg" alt="Google logo" height="20"/> Google | Gemini 2.0 Flash | Unknown | 47.94 | Feb 2025 |
| 6 | <img src="assets/ablibaba.svg" alt="Alibaba logo" height="20"/> Alibaba | Qwen-Image-Edit | Flow Matching | 41.93 | Aug 2025 |
| 7 | <img src="assets/stepfun.svg" alt="StepFun logo" height="20"/> StepFun | Step1X-Edit | Flow Matching | 38.98 | Apr 2025 |
| 8 | <img src="assets/flux.svg" alt="Black Forest Labs logo" height="20"/> Black Forest Labs | FLUX.1-Kontext-dev | Flow Matching | 38.71 | Jun 2025 |
| 9 | <img src="assets/ominigen.png" alt="VectorSpaceLab logo" height="20"/> VectorSpaceLab | OmniGen | Flow Matching | 29.91 | Sep 2024 |
| 10 | - | UltraEdit | Diffusion | 22.89 | Jul 2024 |
| 11 | - | AnyEdit | Diffusion | 22.50 | Nov 2024 |
| 12 | - | MagicBrush | Diffusion | 19.41 | Jun 2023 |
| 13 | - | InstructPix2Pix | Diffusion | 12.99 | Dec 2023 |
Repository Structure
- `env_setup/` – Conda environment specification (`env.yaml`) and bootstrap script (`setup_edival.sh`).
- `generate_instructions/` – Full instruction pipeline: object parsing, grounding filter, CSV export, and candidate pools.
- `generate.py` – Runs your editor to generate edited images, with the Qwen-Image-Edit model as an example.
- `baseline_generate/` – Historical baseline scripts retained for comparison, including GPT-Image-1 and the Flux models.
- `detector/` – Instruction-following, consistency, and quality evaluation modules.
- `eval.py` / `eval_bash.sh` – Core evaluator and batch helper.
- `example_evaluate_results/` – Reference outputs for sanity checking; your output should have a similar structure.
- `analysis.ipynb` – Notebook used to analyze your final results in `example_evaluate_results/`.
- `oai_instruction_generation_output.csv` – Sample 3-turn instruction CSV (the instructions used in the paper).
- `update_hps_scores.py` – Utility to backfill HPSv3 scores into evaluation JSONs. Use this script if you need the HPSv3 quality score, since the HPSv3 environment conflicts with the other metrics' dependencies.
Getting Started
```bash
# 1. Create the environment (once)
bash env_setup/setup_edival.sh

# 2. Activate for every new shell
conda activate edival
```
The bootstrap script installs PyTorch (CUDA 12.1), GroundingDINO (editable mode), diffusers, vLLM dependencies, and all evaluation packages. Modify env_setup/env.yaml if you need different versions.
Data Hub
All assets live in the Hugging Face repository C-Tianyu/EdiVal:
- `input_images_resize_512.zip` – canonical 512×512 image set.
- `baseline_generations/*` – pre-generated outputs: GPT-Image-1, Nano Banana, SeedDream v4, etc.
Download the resources you need and place them at the repository root (paths can be overridden via CLI flags).
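Once the archive is in place, a quick sanity check of its contents can be done from Python. This is a minimal sketch: the archive name comes from the list above, but nothing about the internal layout is assumed.

```python
import zipfile

def list_zip_images(path="input_images_resize_512.zip"):
    """Return the image filenames contained in the downloaded archive."""
    with zipfile.ZipFile(path) as zf:
        return [n for n in zf.namelist()
                if n.lower().endswith((".png", ".jpg", ".jpeg"))]
```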
Instruction Generation Pipeline
All scripts reside in generate_instructions/. Candidate vocabularies are stored in generate_instructions/candidate_pools/.
Before you start: set `OPENAI_API_KEY` (and optionally `OPENAI_API_BASE` if you use a custom endpoint).
- **Object Extraction**

  ```bash
  export OPENAI_API_KEY=sk-...
  python generate_instructions/oai_all_objects.py \
    --input-dir input_images_resize_512 \
    --output-dir generate_instructions/oai_all_objects
  ```

  Produces rich JSON metadata (`<index>_input_raw.json`) for every image.

- **Grounding Filter**

  ```bash
  python generate_instructions/grounding_filter.py \
    --input-dir generate_instructions/oai_all_objects \
    --output-dir generate_instructions/grounding_all_objects \
    --image-dir input_images_resize_512 \
    --num-gpus 2 \
    --box-threshold 0.35 \
    --text-threshold 0.35
  ```

  Uses GroundingDINO to keep only visually grounded objects, adding bounding boxes and counts.

- **CSV Export**

  ```bash
  export OPENAI_API_KEY=sk-...
  python generate_instructions/oai_instruction_generator.py \
    --grounding-dir generate_instructions/grounding_all_objects \
    --input-images input_images_resize_512 \
    --output oai_instruction_generation_output.csv \
    --seed 42
  ```

  Generates the multi-turn instruction CSV used downstream. Candidate pools come from `generate_instructions/candidate_pools/*.txt`; regenerate them with `generate_instructions/candidate_pools/generate_objects_txt.py` if you need to refresh vocabularies.
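To inspect the exported CSV before running an editor over it, the rows can be loaded with the standard library. A minimal sketch — the filename matches the export step above, but the column names are not assumed here:

```python
import csv

def load_instructions(path="oai_instruction_generation_output.csv"):
    """Load the multi-turn instruction CSV into a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

Printing `load_instructions()[0].keys()` shows the column layout of your generated file.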
Image Generation
Run Qwen/Qwen-Image-Edit (or your own editor) over the instructions:
```bash
python generate.py \
  --csv oai_instruction_generation_output.csv \
  --zip input_images_resize_512.zip \
  --output-dir your_generations \
  --num-gpus 2
```
Outputs land in `your_generations/multipass` and `your_generations/singlepass`. To plug in a custom model, implement a class that exposes `generate_single_edit` and point to it via:
```bash
python generate.py --editor-class my_module:MyCustomGenerator ...
```
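A minimal sketch of such a class follows. Only the method name `generate_single_edit` comes from this README; the constructor argument and the method signature here are assumptions about the expected interface, so check `generate.py` for the exact contract.

```python
# my_module.py — pluggable editor sketch; the signature below is an assumption.
class MyCustomGenerator:
    def __init__(self, device="cuda"):
        self.device = device  # load your model/pipeline weights here

    def generate_single_edit(self, image, instruction):
        """Apply one text instruction to `image` and return the edited image."""
        # Placeholder: a real editor would run its pipeline here.
        return image
```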
Evaluation
- **Single Folder**

  ```bash
  python eval.py \
    --generation_folder your_generations \
    --modes multipass singlepass
  ```

  Computes instruction-following, consistency, and quality metrics (HPSv3 if available) and writes JSON summaries to `evaluate_results/your_generations/`.

- **Batch Mode**

  ```bash
  bash eval_bash.sh
  ```

  Adjust `BASE_DIR`, `JOBS`, and `GPU_GROUPS` at the top of the script to suit your hardware; the helper loops over all subfolders in `BASE_DIR`.

- **Fill in HPSv3 (Optional)**

  ```bash
  bash env_setup/setup_hpsv3.sh && conda activate hps && python update_hps_scores.py \
    --results_root evaluate_results/your_generations \
    --num_gpus 2 \
    --batch_size 4
  ```

  Backfills missing HPSv3 scores into the evaluation JSONs.
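Once a run finishes, the summary JSONs can be gathered programmatically for your own post-processing. A minimal sketch — the directory path follows the convention above, but the JSON schema itself is not assumed:

```python
import json
from pathlib import Path

def collect_results(root="evaluate_results/your_generations"):
    """Map each evaluation JSON under `root` to its parsed contents."""
    return {
        str(p.relative_to(root)): json.loads(p.read_text())
        for p in sorted(Path(root).rglob("*.json"))
    }
```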
Analysis & Reproduction
- `analysis.ipynb` – Aggregates evaluation outputs and generates the plots/tables from the paper.
- `example_evaluate_results/` – Reference outputs to verify your setup.
Launch Jupyter or VS Code within the `edival` environment to explore the results interactively.
Bring Your Own Images
Want to evaluate your own dataset? Follow the same three-stage pipeline used for EdiVal's release.
- Prepare Inputs
- Collect/source your raw images (ideally resized to 512×512 for parity with the benchmark).
- Place the ZIP (or directory) alongs
