DiscoSG
[EMNLP 2025 Outstanding Paper Award] Official repo for DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Official repository for "🪩DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement"
🏆 EMNLP 2025 Outstanding Paper Award (7 of 3,200 accepted papers)
</div>
📰 News
- [2025-11] 🎉 Our paper received an EMNLP 2025 Outstanding Paper Award (7 of 3,200 accepted papers)!
- [2025-08] Paper accepted to EMNLP 2025 Main Conference
- [2025-06] Initial release of DiscoSG-DS dataset and code
🌟 Highlights
DiscoSG addresses the critical gap in discourse-level text scene graph parsing for Vision-Language Models (VLMs):
- 🎯 Novel Task: First benchmark for discourse-level (multi-sentence) text scene graph parsing
- 📊 Rich Dataset: DiscoSG-DS with 400 expert-annotated + 8,430 synthesized instances
- 🚀 Efficient Method: DiscoSG-Refiner achieves 86× faster inference than GPT-4o with comparable performance
- 🔧 Practical Impact: Significant improvements on downstream VLM tasks including caption evaluation and hallucination detection
Why Discourse-Level Parsing?
Traditional scene graph parsers are designed for single-sentence captions and fail to capture:
- ✅ Cross-sentence coreference (e.g., "woman" → "she")
- ✅ Long-range dependencies between sentences
- ✅ Implicit relationships across discourse
- ✅ Global graph coherence
📊 Dataset: DiscoSG-DS
Dataset Composition
The DiscoSG-DS dataset is located in the DiscoSG_dataset folder:
| Split | Human-Annotated | Synthesized | Total |
|-------|----------------|-------------|-------|
| Train | 300 | 8,430 | 8,730 |
| Test (Random) | 100 | - | 100 |
| Test (Length) | 100 | - | 100 |
Comparison with Existing Benchmarks
| Dataset | # Inst. | Avg Len | Avg Trp | Avg Obj | Avg Rel | Total Trp |
|---------|--------:|--------:|--------:|--------:|--------:|----------:|
| VG | 2,966,195 | 5.34 | 1.53 | 1.69 | 1.22 | 4,533,271 |
| FACTUAL | 40,369 | 6.08 | 1.76 | 2.12 | 1.57 | 71,124 |
| TSGBench | 2,034 | 12.23 | 5.81 | 5.63 | 3.65 | 11,820 |
| DiscoSG-Human | 400 | 181.15 | 20.49 | 10.11 | 6.54 | 8,195 |
| DiscoSG-Synthetic | 8,430 | 163.07 | 19.41 | 10.06 | 6.39 | 163,640 |
Legend:
- Avg Len: Average tokens per instance
- Avg Trp/Obj/Rel: Average triples/objects/relations per graph
- Total Trp: Total triples across dataset
Key Insight: DiscoSG instances contain 3× more triples and 30× longer text than existing datasets, capturing complex discourse-level relationships across an average of 9.3 sentences per caption.
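As a quick sanity check on the comparison table, the snippet below recomputes how DiscoSG-Human relates to each earlier benchmark. The numbers are copied from the table; the `ratios` helper is only illustrative and not part of this repository. With these figures, the triple count is roughly 3.5× that of TSGBench and the text length is roughly 30× that of FACTUAL.

```python
# Per-instance averages copied from the comparison table above.
avg_len = {"VG": 5.34, "FACTUAL": 6.08, "TSGBench": 12.23, "DiscoSG-Human": 181.15}
avg_trp = {"VG": 1.53, "FACTUAL": 1.76, "TSGBench": 5.81, "DiscoSG-Human": 20.49}


def ratios(table, target="DiscoSG-Human"):
    """Return target/baseline ratios for every other dataset in the table."""
    return {
        name: round(table[target] / value, 2)
        for name, value in table.items()
        if name != target
    }


len_ratios = ratios(avg_len)  # how many times longer DiscoSG-Human captions are
trp_ratios = ratios(avg_trp)  # how many times more triples per graph

for name in len_ratios:
    print(f"{name}: {len_ratios[name]}x length, {trp_ratios[name]}x triples")
```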
Dataset Creation Pipeline
Our dataset creation follows a two-stage process combining human expertise with active learning:
Stage 1: Initial Set Creation
![Initial Set Creation Pipeline](figs/fig1_init_set_creation.png)
- Two-stage annotation process for quality control
- Creates seed training set for bootstrapping
- Establishes baseline teacher model (M₀)
Stage 2: Active Learning
![Active Learning Loop](figs/fig2_acti_learning.png)
Four-step iterative process:
- Batch Selection: Random sampling from unlabeled data
- Draft Annotation: Use current model (Mᵢ) to generate draft annotations
- Two-Stage Review: Human correction and validation
- Model Update: Retrain model (Mᵢ → Mᵢ₊₁) with new annotations
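The four steps above can be sketched as a simple loop. This is a minimal illustration, not the pipeline used to build DiscoSG-DS: `active_learning`, `train`, `review`, the batch size, and the number of rounds are all hypothetical placeholders, and the human review stage is stood in for by a plain function.

```python
import random


def active_learning(unlabeled, seed_set, train, review, batch_size=2, rounds=2, seed=0):
    """Hypothetical loop: sample a batch, draft with M_i, review, retrain to M_{i+1}."""
    rng = random.Random(seed)
    pool = list(unlabeled)
    labeled = list(seed_set)          # (caption, graph) pairs from Stage 1
    model = train(labeled)            # baseline teacher model M_0
    for _ in range(rounds):
        if not pool:
            break
        # 1. Batch selection: random sampling from the unlabeled pool.
        batch = rng.sample(pool, min(batch_size, len(pool)))
        for caption in batch:
            pool.remove(caption)
        # 2. Draft annotation with the current model M_i.
        drafts = [(caption, model(caption)) for caption in batch]
        # 3. Review: `review` stands in for human correction and validation.
        labeled.extend((caption, review(caption, graph)) for caption, graph in drafts)
        # 4. Model update: retrain on everything labeled so far.
        model = train(labeled)
    return model, labeled


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def toy_train(labeled):
    return lambda caption: [("caption", "has length", str(len(caption)))]


def toy_review(caption, draft):
    return draft  # a human annotator would correct the draft here


final_model, final_labeled = active_learning(
    ["a dog runs", "a cat sleeps", "birds fly"],
    [("people walk", [("people", "walk", "somewhere")])],
    toy_train,
    toy_review,
)
print(len(final_labeled))
```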
🔧 Method: DiscoSG-Refiner
DiscoSG-Refiner is a lightweight iterative framework that refines draft scene graphs through a novel 4-step refinement process.
<img src="figs/fig0_discosg_main.png" alt="DiscoSG main fig" style="max-width:35%;height:auto;display:block;margin:0 auto;" />
Architecture Overview
Caption
→ [Step 1] Initial Graph
→ [Step 2] Deletion
→ [Step 3] Insertion
→ [Step 4] Refined Graph
Four-Step Refinement Process
Step 1: Initial Graph Generation from Sentence-Level Parsing
| Component | Content |
|-----------|---------|
| Caption | A group of people are seen walking on a concrete pier towards a ferry terminal . . . In the distance, tall buildings loom, indicating that the location is near a city . . . (details omitted for brevity) |
| Caption (split) | S1: A group of people are seen...<br>S2: In the distance, tall buildings loom... |
| Sentence-level Parsing | G1: (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete)<br>G2: (buildings, is, tall) |
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
Process:
- Split multi-sentence caption into individual sentences
- Parse each sentence independently using sentence-level parser
- Merge sentence-level graphs into initial draft
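A minimal sketch of this split, parse, and merge step follows. It assumes a naive punctuation-based sentence splitter and a lookup-table stand-in for the sentence-level parser; `split_sentences`, `build_initial_graph`, and `toy_parser` are hypothetical names, and dropping exact duplicate triples during the merge is an assumption rather than documented behavior.

```python
import re


def split_sentences(caption):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]


def build_initial_graph(caption, parse_sentence):
    """Parse each sentence independently and merge the triples in order."""
    merged, seen = [], set()
    for sentence in split_sentences(caption):
        for triple in parse_sentence(sentence):
            if triple not in seen:  # assumption: exact duplicates are dropped
                seen.add(triple)
                merged.append(triple)
    return merged


# Toy stand-in for a sentence-level parser (not a real model).
TOY_PARSES = {
    "People walk on a concrete pier.": [
        ("people", "walk on", "pier"),
        ("pier", "is", "concrete"),
    ],
    "Tall buildings loom in the distance.": [("buildings", "is", "tall")],
}


def toy_parser(sentence):
    return TOY_PARSES.get(sentence, [])


caption = "People walk on a concrete pier. Tall buildings loom in the distance."
initial_graph = build_initial_graph(caption, toy_parser)
print(initial_graph)
```

Because each sentence is parsed in isolation, the draft cannot resolve cross-sentence links such as coreference, which is what the later deletion and insertion steps are meant to repair.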
Step 2: Encoder-Based Deletion Prediction
| Component | Content |
|-----------|---------|
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
| Deletion Prediction | ✅ (people, walk on, pier)<br>✅ (people, walk towards, ferry terminal)<br>❌ (people, move towards, destination)<br>✅ (pier, is, concrete)<br>✅ (buildings, is, tall) |
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
Process:
- Encode caption and each graph triple
- Binary classifier predicts KEEP/DELETE for each triple
- Removes redundant or incorrect triples
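The KEEP/DELETE decision can be illustrated as a per-triple filter. This is only a sketch under stated assumptions: `apply_deletion` and `toy_keep_probability` are hypothetical, the 0.5 threshold is arbitrary, and the hard-coded scores stand in for an encoder-based classifier that would also condition on the caption.

```python
def apply_deletion(triples, keep_probability, threshold=0.5):
    """Split triples into kept and deleted lists using a per-triple KEEP score."""
    kept, deleted = [], []
    for triple in triples:
        if keep_probability(triple) >= threshold:
            kept.append(triple)
        else:
            deleted.append(triple)
    return kept, deleted


# Toy scores standing in for the classifier output (assumption, not model output).
TOY_SCORES = {("people", "move towards", "destination"): 0.1}


def toy_keep_probability(triple):
    return TOY_SCORES.get(triple, 0.9)


draft = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("people", "move towards", "destination"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
kept, deleted = apply_deletion(draft, toy_keep_probability)
print(kept)
print(deleted)
```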
Step 3: Decoder-Based Insertion Generation
| Component | Content |
|-----------|---------|
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
| Insertion Input | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall) |
| Insertion Output | (people, is, group of) |
| 3. Insertion. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |
Process:
- Encode caption and current graph state after deletion
- Decoder generates missing triples
- Adds complementary information to graph
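In the same spirit, the insertion step can be sketched as appending generated triples to the post-deletion graph. `apply_insertion` and `toy_generate_missing` are hypothetical names; the generator below returns a fixed triple in place of a decoder, and skipping triples that are already present is an assumption.

```python
def apply_insertion(graph, generate_missing):
    """Append generated triples that are not already in the graph."""
    result = list(graph)
    present = set(graph)
    for triple in generate_missing(graph):
        if triple not in present:  # assumption: avoid re-adding existing triples
            present.add(triple)
            result.append(triple)
    return result


# Toy generator standing in for the decoder (assumption, not model output).
def toy_generate_missing(graph):
    return [("people", "is", "group of")]


after_deletion = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
after_insertion = apply_insertion(after_deletion, toy_generate_missing)
print(after_insertion)
```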
Step 4: Refinement
| Component | Content |
|-----------|---------|
| Caption | A group of people are seen walking on a concrete pier towards a ferry terminal . . . In the distance, tall buildings loom, indicating that the location is near a city . . . (details omitted for brevity) |
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
| 3. Insertion. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |
| 4. Refined. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |
Process:
- Execute deletion (Step 2)
- Execute insertion (Step 3)
- (Optional) Repeat refinement cycle for further improvement
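Put together, one refinement cycle is a deletion pass followed by an insertion pass, optionally repeated. The sketch below is illustrative only: `refine`, `toy_delete`, and `toy_insert` are hypothetical, and stopping early once a cycle no longer changes the graph is an assumption, not something the repository documents.

```python
def refine(caption, draft, delete_step, insert_step, max_rounds=1):
    """Run deletion then insertion, repeating up to max_rounds times."""
    graph = list(draft)
    for _ in range(max_rounds):
        after_deletion = delete_step(caption, graph)
        updated = insert_step(caption, after_deletion)
        if updated == graph:  # assumption: stop once a cycle changes nothing
            break
        graph = updated
    return graph


# Toy steps reproducing the worked example above (not real model calls).
def toy_delete(caption, graph):
    return [t for t in graph if t != ("people", "move towards", "destination")]


def toy_insert(caption, graph):
    extra = ("people", "is", "group of")
    return graph if extra in graph else graph + [extra]


draft_graph = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("people", "move towards", "destination"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
refined_graph = refine("A group of people ...", draft_graph, toy_delete, toy_insert, max_rounds=3)
print(refined_graph)
```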
Key Advantages
| Feature | DiscoSG-Refiner | GPT-4o | Traditional Parsers |
|---------|----------------|---------|---------------------|
| Speed | ⚡ 86× faster | Baseline | Fast but inaccurate |
| Accuracy | 🎯 Comparable | Highest | Poor on discourse |
| Cost | 💰 Low | Very High | Low |
| Open Source | ✅ Yes | ❌ No | ✅ Yes |
🚀 Quick Start
Prerequisites
# Clone the repository
git clone https://github.com/ShaoqLin/DiscoSG.git
cd DiscoSG
# Install dependencies
pip install -r requirements.txt
1. Dataset Configuration
Configure dataset paths in the following files:
detailcap_discosg_mr.py (line 64):
# Replace with your path to DiscoSG_datasets directory
dataset_path = "path/to/DiscoSG_datasets"
dataset_utils.py (lines 136, 167):
# Replace with your path to DiscoSG_datasets directory
dataset_path = "path/to/DiscoSG_datasets"
💡 Note: Use the same dataset settings as DetailCaps and CapArena
2. Fast Inference with Reusable Graphs
For quick inference, use pre-computed graphs from the reusable_graph directory:
python detailcap_discosg_mr.py \
--original_parse_dict reusable_graph/original_parse.json \
--sub_sentence_parse_dict reusable_graph/sub_sentence_parse.json \
--combined_parse_dict reusable_graph/combined_parse.json
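Before launching the command above, it can help to confirm that the three pre-computed graph files are in place. The helper below is a hypothetical convenience and not part of the repository; it only checks for the file names used in the command and assumes the script is run from the repository root.

```python
from pathlib import Path

# File names taken from the command above.
EXPECTED_FILES = [
    "original_parse.json",
    "sub_sentence_parse.json",
    "combined_parse.json",
]


def missing_graph_files(graph_dir="reusable_graph"):
    """Return the expected graph files that are not present in graph_dir."""
    return [name for name in EXPECTED_FILES if not (Path(graph_dir) / name).is_file()]


missing = missing_graph_files()
if missing:
    print("Missing pre-computed graphs:", ", ".join(missing))
else:
    print("All pre-computed graphs found.")
```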
Available parameters:
parser.add_argument("--original_parse_dict", type=str, default=None,
