Official repository for "🪩DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement"

🏆 EMNLP 2025 Outstanding Paper Award (7 of 3,200 accepted papers)

Paper | Code

</div>

📰 News

[2025-11] 🎉 Our paper received an EMNLP 2025 Outstanding Paper Award (7 of 3,200 accepted papers)!
[2025-08] Paper accepted to EMNLP 2025 Main Conference
[2025-06] Initial release of DiscoSG-DS dataset and code

🌟 Highlights

DiscoSG addresses the critical gap in discourse-level text scene graph parsing for Vision-Language Models (VLMs):

🎯 Novel Task: First benchmark for discourse-level (multi-sentence) text scene graph parsing
📊 Rich Dataset: DiscoSG-DS with 400 expert-annotated + 8,430 synthesized instances
🚀 Efficient Method: DiscoSG-Refiner achieves 86× faster inference than GPT-4o with comparable performance
🔧 Practical Impact: Significant improvements on downstream VLM tasks including caption evaluation and hallucination detection

Why Discourse-Level Parsing?

Traditional scene graph parsers are designed for single-sentence captions and fail to capture:

✅ Cross-sentence coreference (e.g., "woman" → "she")
✅ Long-range dependencies between sentences
✅ Implicit relationships across discourse
✅ Global graph coherence

📊 Dataset: DiscoSG-DS

Dataset Composition

The DiscoSG-DS dataset is located in the DiscoSG_dataset folder:

| Split | Human-Annotated | Synthesized | Total |
|-------|----------------|-------------|-------|
| Train | 300 | 8,430 | 8,730 |
| Test (Random) | 100 | - | 100 |
| Test (Length) | 100 | - | 100 |

Comparison with Existing Benchmarks

| Dataset | # Inst. | Avg Len | Avg Trp | Avg Obj | Avg Rel | Total Trp |
|---------|--------:|--------:|--------:|--------:|--------:|----------:|
| VG | 2,966,195 | 5.34 | 1.53 | 1.69 | 1.22 | 4,533,271 |
| FACTUAL | 40,369 | 6.08 | 1.76 | 2.12 | 1.57 | 71,124 |
| TSGBench | 2,034 | 12.23 | 5.81 | 5.63 | 3.65 | 11,820 |
| DiscoSG-Human | 400 | 181.15 | 20.49 | 10.11 | 6.54 | 8,195 |
| DiscoSG-Synthetic | 8,430 | 163.07 | 19.41 | 10.06 | 6.39 | 163,640 |

Legend:

Avg Len: Average tokens per instance
Avg Trp/Obj/Rel: Average triples/objects/relations per graph
Total Trp: Total triples across dataset

Key Insight: DiscoSG instances contain 3× more triples and 30× longer text than existing datasets, capturing complex discourse-level relationships across an average of 9.3 sentences per caption.
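
As a quick sanity check, the ratios behind this claim can be recomputed from the comparison table. The sketch below hard-codes the Avg Len and Avg Trp values for the human-annotated split; the helper name `ratios` is illustrative and not part of the repository.

```python
# Per-instance averages copied from the comparison table (DiscoSG-Human row as target).
avg_len = {"VG": 5.34, "FACTUAL": 6.08, "TSGBench": 12.23, "DiscoSG-Human": 181.15}
avg_trp = {"VG": 1.53, "FACTUAL": 1.76, "TSGBench": 5.81, "DiscoSG-Human": 20.49}


def ratios(table, target="DiscoSG-Human"):
    """Return how many times larger the target value is than each other dataset's value."""
    return {name: table[target] / value for name, value in table.items() if name != target}


len_ratios = ratios(avg_len)
trp_ratios = ratios(avg_trp)

for name in len_ratios:
    print(f"{name}: {len_ratios[name]:.1f}x longer text, {trp_ratios[name]:.1f}x more triples")
```

With these numbers, the smallest gap is to TSGBench (about 3.5× the triples and 14.8× the length), while FACTUAL and VG captions are roughly 30× and 34× shorter than DiscoSG-Human captions.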

Dataset Creation Pipeline

Our dataset creation follows a two-stage process combining human expertise with active learning:

Stage 1: Initial Set Creation

[Initial Set Creation Pipeline]

Two-stage annotation process for quality control
Creates seed training set for bootstrapping
Establishes baseline teacher model (M₀)

Stage 2: Active Learning

[Active Learning Loop]

Four-step iterative process:

Batch Selection: Random sampling from unlabeled data
Draft Annotation: Use current model (Mᵢ) to generate draft annotations
Two-Stage Review: Human correction and validation
Model Update: Retrain model (Mᵢ → Mᵢ₊₁) with new annotations
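
The four steps above can be sketched as a simple loop. This is only an illustration of the control flow: `train_fn`, `review_fn`, the batch size, and the number of rounds are hypothetical placeholders, not the settings or code used to build DiscoSG-DS.

```python
import random


def active_learning(unlabeled, seed_set, train_fn, review_fn, batch_size=2, rounds=2, seed=0):
    """Select a batch, draft with M_i, review by humans, retrain to M_{i+1}."""
    rng = random.Random(seed)
    pool = list(unlabeled)
    labeled = list(seed_set)
    model = train_fn(labeled)  # M_0, trained on the Stage 1 seed set
    for _ in range(rounds):
        if not pool:
            break
        batch = rng.sample(pool, min(batch_size, len(pool)))          # 1. batch selection
        drafts = [(caption, model(caption)) for caption in batch]     # 2. draft annotation
        reviewed = [(c, review_fn(c, graph)) for c, graph in drafts]  # 3. human review
        labeled.extend(reviewed)
        pool = [c for c in pool if c not in batch]
        model = train_fn(labeled)                                     # 4. model update
    return model, labeled


# Toy stand-ins (assumptions): a "model" that drafts an empty graph and a
# "review" step that accepts the draft unchanged.
def toy_train(labeled):
    return lambda caption: []


def toy_review(caption, draft):
    return draft


captions = ["c1", "c2", "c3", "c4", "c5"]
final_model, final_labeled = active_learning(captions, [("c0", [])], toy_train, toy_review)
print(len(final_labeled))
```

In practice `train_fn` would fine-tune the teacher parser and `review_fn` would be the two-stage human correction described above.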

🔧 Method: DiscoSG-Refiner

DiscoSG-Refiner is a lightweight iterative framework that refines draft scene graphs through a novel 4-step refinement process.

Architecture Overview

Caption 
→ [Step 1] Initial Graph
→ [Step 2] Deletion
→ [Step 3] Insertion
→ [Step 4] Refined Graph

Four-Step Refinement Process

Step 1: Initial Graph Generation from Sentence-Level Parsing

| Component | Content |
|-----------|---------|
| Caption | A group of people are seen walking on a concrete pier towards a ferry terminal . . . In the distance, tall buildings loom, indicating that the location is near a city . . . (details omitted for brevity) |
| Caption (split) | S1: A group of people are seen...<br>S2: In the distance, tall buildings loom... |
| Sentence-level Parsing | G1: (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete)<br>G2: (buildings, is, tall) |
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |

Process:

Split multi-sentence caption into individual sentences
Parse each sentence independently using sentence-level parser
Merge sentence-level graphs into initial draft
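
A minimal sketch of this split, parse, and merge step is shown here. The regex sentence splitter, the order-preserving deduplication, and the lookup-table `toy_parser` are assumptions made for illustration; the repository's sentence-level parser is a trained model and may merge graphs differently.

```python
import re
from typing import Callable

Triple = tuple[str, str, str]


def split_sentences(caption: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]


def build_initial_graph(caption: str, parse_sentence: Callable[[str], list[Triple]]) -> list[Triple]:
    """Parse each sentence independently and merge the results into one draft graph."""
    merged: list[Triple] = []
    seen: set[Triple] = set()
    for sentence in split_sentences(caption):
        for triple in parse_sentence(sentence):
            if triple not in seen:  # drop exact duplicates, keep first-seen order
                seen.add(triple)
                merged.append(triple)
    return merged


# Toy parser: a lookup table reproducing G1 and G2 from the example table.
TOY_PARSES = {
    "A group of people are seen walking on a concrete pier towards a ferry terminal.": [
        ("people", "walk on", "pier"),
        ("people", "walk towards", "ferry terminal"),
        ("people", "move towards", "destination"),
        ("pier", "is", "concrete"),
    ],
    "In the distance, tall buildings loom.": [("buildings", "is", "tall")],
}


def toy_parser(sentence: str) -> list[Triple]:
    return TOY_PARSES.get(sentence, [])


caption = (
    "A group of people are seen walking on a concrete pier towards a ferry terminal. "
    "In the distance, tall buildings loom."
)
initial_graph = build_initial_graph(caption, toy_parser)
print(initial_graph)
```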

Step 2: Encoder-Based Deletion Prediction

| Component | Content |
|-----------|---------|
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
| Deletion Prediction | ✅ (people, walk on, pier)<br>✅ (people, walk towards, ferry terminal)<br>❌ (people, move towards, destination)<br>✅ (pier, is, concrete)<br>✅ (buildings, is, tall) |
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |

Process:

Encode caption and each graph triple
Binary classifier predicts KEEP/DELETE for each triple
Removes redundant or incorrect triples
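
The KEEP/DELETE decision can be pictured as thresholding a per-triple score. In the sketch below, `keep_score` stands in for the trained encoder classifier; the lexical-overlap scorer and the 0.5 threshold are purely illustrative assumptions and are not how DiscoSG-Refiner scores triples.

```python
def predict_deletions(caption, triples, keep_score, threshold=0.5):
    """Split triples into kept and deleted lists using a per-triple keep score."""
    kept, deleted = [], []
    for triple in triples:
        if keep_score(caption, triple) >= threshold:
            kept.append(triple)
        else:
            deleted.append(triple)
    return kept, deleted


def toy_keep_score(caption, triple):
    # Toy stand-in (assumption): keep a triple only if its last element
    # appears verbatim in the caption.
    return 1.0 if triple[2].lower() in caption.lower() else 0.0


caption = (
    "A group of people are seen walking on a concrete pier towards a ferry terminal. "
    "In the distance, tall buildings loom."
)
initial_graph = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("people", "move towards", "destination"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]

kept, deleted = predict_deletions(caption, initial_graph, toy_keep_score)
print("kept:", kept)
print("deleted:", deleted)
```

On this shortened caption the toy scorer happens to reproduce the decision in the table, but only because "destination" never appears in the text; a learned classifier is needed for redundancy and paraphrase cases.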

Step 3: Decoder-Based Insertion Generation

| Component | Content |
|-----------|---------|
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
| Insertion Input | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall) |
| Insertion Output | (people, is, group of) |
| 3. Insertion. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |

Process:

Encode caption and current graph state after deletion
Decoder generates missing triples
Adds complementary information to graph
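
One way to picture the insertion step is as a text-to-text call: the caption and the current graph are linearized into a single input string, and the generated string is parsed back into triples. The linearization format, the prompt layout, and the `toy_generate` function below are assumptions for illustration; the actual input format of the DiscoSG-Refiner decoder may differ.

```python
import re


def linearize(triples):
    """Render triples as '(s, r, o), (s, r, o), ...'."""
    return ", ".join(f"({s}, {r}, {o})" for s, r, o in triples)


def parse_triples(text):
    """Parse '(s, r, o)' groups back into tuples; malformed groups are skipped."""
    triples = []
    for match in re.finditer(r"\(([^()]*)\)", text):
        parts = [p.strip() for p in match.group(1).split(",")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


def insert_missing(caption, graph, generate):
    """Ask the generator for missing triples and append the ones not already present."""
    prompt = f"caption: {caption} graph: {linearize(graph)}"
    generated = parse_triples(generate(prompt))
    return graph + [t for t in generated if t not in graph]


def toy_generate(prompt):
    # Toy stand-in (assumption) for the decoder: returns the insertion output
    # shown in the example table.
    return "(people, is, group of)"


after_deletion = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
after_insertion = insert_missing("A group of people are seen...", after_deletion, toy_generate)
print(after_insertion)
```

This simple comma-based format breaks if a subject, relation, or object itself contains a comma, so a real implementation would need a more robust serialization.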

Step 4: Refinement

| Component | Content |
|-----------|---------|
| Caption | A group of people are seen walking on a concrete pier towards a ferry terminal . . . In the distance, tall buildings loom, indicating that the location is near a city . . . (details omitted for brevity) |
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
| 3. Insertion. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |
| 4. Refined. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |

Process:

Execute deletion (Step 2)
Execute insertion (Step 3)
(Optional) Repeat refinement cycle for further improvement
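
Put together, the refinement cycle is a loop that alternates deletion and insertion. The sketch below assumes a fixed round budget and an early stop when a round changes nothing; both choices, along with the toy `toy_delete` and `toy_insert` steps, are illustrative assumptions rather than the repository's behavior.

```python
def refine(caption, initial_graph, delete_step, insert_step, max_rounds=2):
    """Alternate deletion and insertion until the graph stops changing or rounds run out."""
    graph = list(initial_graph)
    for _ in range(max_rounds):
        after_delete = delete_step(caption, graph)
        after_insert = insert_step(caption, after_delete)
        if after_insert == graph:  # nothing changed this round: stop early
            break
        graph = after_insert
    return graph


# Toy steps (assumptions) that replay the worked example from the tables.
SPURIOUS = ("people", "move towards", "destination")
MISSING = ("people", "is", "group of")


def toy_delete(caption, graph):
    return [t for t in graph if t != SPURIOUS]


def toy_insert(caption, graph):
    return graph if MISSING in graph else graph + [MISSING]


initial_graph = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    SPURIOUS,
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
refined = refine("A group of people are seen...", initial_graph, toy_delete, toy_insert)
print(refined)
```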

Key Advantages

| Feature | DiscoSG-Refiner | GPT-4o | Traditional Parsers |
|---------|----------------|---------|---------------------|
| Speed | ⚡ 86× faster | Baseline | Fast but inaccurate |
| Accuracy | 🎯 Comparable | Highest | Poor on discourse |
| Cost | 💰 Low | Very High | Low |
| Open Source | ✅ Yes | ❌ No | ✅ Yes |

🚀 Quick Start

Prerequisites

# Clone the repository
git clone https://github.com/ShaoqLin/DiscoSG.git
cd DiscoSG

# Install dependencies
pip install -r requirements.txt

1. Dataset Configuration

Configure dataset paths in the following files:

detailcap_discosg_mr.py (line 64):

# Replace with your path to DiscoSG_datasets directory
dataset_path = "path/to/DiscoSG_datasets"

dataset_utils.py (lines 136, 167):

# Replace with your path to DiscoSG_datasets directory
dataset_path = "path/to/DiscoSG_datasets"

💡 Note: Use the same dataset settings as DetailCaps and CapArena

2. Fast Inference with Reusable Graphs

For quick inference, use pre-computed graphs from the reusable_graph directory:

python detailcap_discosg_mr.py \
  --original_parse_dict reusable_graph/original_parse.json \
  --sub_sentence_parse_dict reusable_graph/sub_sentence_parse.json \
  --combined_parse_dict reusable_graph/combined_parse.json

Available parameters:

parser.add_argument("--original_parse_dict", type=str, default=None,
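
For orientation, the three flags used in the command above could be declared with `argparse` roughly as follows. Only the first declaration appears in the source, and it is cut off there; the other two declarations and all help strings are assumptions that mirror it, so check `detailcap_discosg_mr.py` for the real definitions.

```python
import argparse


def build_parser():
    """Declare the three pre-computed graph flags (help texts are assumed, not from the repo)."""
    parser = argparse.ArgumentParser(description="Sketch of the reusable-graph flags")
    parser.add_argument("--original_parse_dict", type=str, default=None,
                        help="path to a pre-computed parse JSON (assumed meaning)")
    parser.add_argument("--sub_sentence_parse_dict", type=str, default=None,
                        help="path to a pre-computed sub-sentence parse JSON (assumed meaning)")
    parser.add_argument("--combined_parse_dict", type=str, default=None,
                        help="path to a pre-computed combined parse JSON (assumed meaning)")
    return parser


# Parse the same arguments as the command line shown above.
args = build_parser().parse_args([
    "--original_parse_dict", "reusable_graph/original_parse.json",
    "--sub_sentence_parse_dict", "reusable_graph/sub_sentence_parse.json",
    "--combined_parse_dict", "reusable_graph/combined_parse.json",
])
print(args.original_parse_dict)
```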
