
DiscoSG

[EMNLP 2025 Outstanding Paper Award] Official repo for DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement


<div align="center">


Official repository for "🪩DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement"

🏆 EMNLP 2025 Outstanding Paper Award (7 of 3,200 accepted papers)

Paper | Code

</div>

📰 News

  • [2025-11] 🎉 Our paper received an EMNLP 2025 Outstanding Paper Award (7 of 3,200 accepted papers)!
  • [2025-08] Paper accepted to EMNLP 2025 Main Conference
  • [2025-06] Initial release of DiscoSG-DS dataset and code

🌟 Highlights

DiscoSG addresses the critical gap in discourse-level text scene graph parsing for Vision-Language Models (VLMs):

  • 🎯 Novel Task: First benchmark for discourse-level (multi-sentence) text scene graph parsing
  • 📊 Rich Dataset: DiscoSG-DS with 400 expert-annotated + 8,430 synthesized instances
  • 🚀 Efficient Method: DiscoSG-Refiner achieves 86× faster inference than GPT-4o with comparable performance
  • 🔧 Practical Impact: Significant improvements on downstream VLM tasks including caption evaluation and hallucination detection

Why Discourse-Level Parsing?

Traditional scene graph parsers are designed for single-sentence captions and fail to capture:

  • ✅ Cross-sentence coreference (e.g., "woman" → "she")
  • ✅ Long-range dependencies between sentences
  • ✅ Implicit relationships across discourse
  • ✅ Global graph coherence

📊 Dataset: DiscoSG-DS

Dataset Composition

The DiscoSG-DS dataset is located in the DiscoSG_dataset folder:

| Split | Human-Annotated | Synthesized | Total |
|-------|----------------|-------------|-------|
| Train | 300 | 8,430 | 8,730 |
| Test (Random) | 100 | - | 100 |
| Test (Length) | 100 | - | 100 |

Comparison with Existing Benchmarks

| Dataset | # Inst. | Avg Len | Avg Trp | Avg Obj | Avg Rel | Total Trp |
|---------|--------:|--------:|--------:|--------:|--------:|----------:|
| VG | 2,966,195 | 5.34 | 1.53 | 1.69 | 1.22 | 4,533,271 |
| FACTUAL | 40,369 | 6.08 | 1.76 | 2.12 | 1.57 | 71,124 |
| TSGBench | 2,034 | 12.23 | 5.81 | 5.63 | 3.65 | 11,820 |
| DiscoSG-Human | 400 | 181.15 | 20.49 | 10.11 | 6.54 | 8,195 |
| DiscoSG-Synthetic | 8,430 | 163.07 | 19.41 | 10.06 | 6.39 | 163,640 |

Legend:

  • Avg Len: Average tokens per instance
  • Avg Trp/Obj/Rel: Average triples/objects/relations per graph
  • Total Trp: Total triples across dataset

Key Insight: DiscoSG instances contain 3× more triples and 30× longer text than existing datasets, capturing complex discourse-level relationships across an average of 9.3 sentences per caption.
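The ratios behind this insight can be recomputed from the comparison table. The sketch below divides the DiscoSG-Human averages by each baseline's averages; the text does not state which baseline each headline ratio refers to, so the script simply prints all of them. The names `stats` and `ratios` are illustrative only.

```python
# Average tokens (Avg Len) and average triples (Avg Trp) per instance,
# copied from the comparison table for the three existing benchmarks.
stats = {
    "VG": (5.34, 1.53),
    "FACTUAL": (6.08, 1.76),
    "TSGBench": (12.23, 5.81),
}
disco_len, disco_trp = 181.15, 20.49  # DiscoSG-Human row


def ratios(baseline):
    """Return (length ratio, triple ratio) of DiscoSG-Human over a baseline."""
    base_len, base_trp = stats[baseline]
    return round(disco_len / base_len, 1), round(disco_trp / base_trp, 1)


for name in stats:
    length_ratio, triple_ratio = ratios(name)
    print(f"{name}: {length_ratio}x longer, {triple_ratio}x more triples")
```

On these numbers, the length ratio is about 30× against FACTUAL and the triple ratio is about 3.5× against TSGBench.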

Dataset Creation Pipeline

Our dataset creation follows a two-stage process combining human expertise with active learning:

Stage 1: Initial Set Creation

[Initial Set Creation Pipeline]

<!-- Expected image content: - Fixed Validation Set - Init. Set Creation (Two-stage annotation) - Init Finetuned Teacher Model (M₀) -->
  • Two-stage annotation process for quality control
  • Creates seed training set for bootstrapping
  • Establishes baseline teacher model (M₀)

Stage 2: Active Learning

[Active Learning Loop]

<!-- Expected image content: - Seed Training Set - Batch Selection → Draft Annotation → Two-stage Review → Model Update cycle - Finetuned Teacher Model (Mᵢ) → Updated Teacher Model (Mᵢ₊₁) -->

Four-step iterative process:

  1. Batch Selection: Random sampling from unlabeled data
  2. Draft Annotation: Use current model (Mᵢ) to generate draft annotations
  3. Two-Stage Review: Human correction and validation
  4. Model Update: Retrain model (Mᵢ → Mᵢ₊₁) with new annotations
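
The four steps above can be sketched as a single loop iteration. This is a minimal illustration, not the authors' pipeline: `active_learning_round` and the `toy_*` callables are hypothetical stand-ins, the "model" is just a version counter, and human review is simulated by returning the draft unchanged.

```python
import random


def active_learning_round(model, unlabeled, batch_size, draft, review, retrain, rng):
    """Run one iteration: select a batch, draft with M_i, review, retrain to M_{i+1}."""
    # 1. Batch selection: random sampling from the unlabeled pool.
    batch = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
    # 2. Draft annotation with the current model.
    drafts = [(text, draft(model, text)) for text in batch]
    # 3. Two-stage review (human correction and validation in the real process).
    reviewed = [(text, review(text, graph)) for text, graph in drafts]
    # 4. Model update with the newly reviewed annotations.
    new_model = retrain(model, reviewed)
    chosen = set(batch)
    remaining = [text for text in unlabeled if text not in chosen]
    return new_model, reviewed, remaining


# Toy stand-ins (assumptions) so the loop runs end to end.
def toy_draft(model, text):
    return [("caption", "drafted by", f"M{model}")]


def toy_review(text, graph):
    return graph


def toy_retrain(model, reviewed):
    return model + 1


model, pool, labeled = 0, [f"caption {i}" for i in range(5)], []
rng = random.Random(0)
for _ in range(2):
    model, reviewed, pool = active_learning_round(
        model, pool, 2, toy_draft, toy_review, toy_retrain, rng
    )
    labeled.extend(reviewed)
print(model, len(labeled), len(pool))  # 2 4 1
```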

🔧 Method: DiscoSG-Refiner

DiscoSG-Refiner is a lightweight iterative framework that refines draft scene graphs through a novel 4-step refinement process.

<img src="figs/fig0_discosg_main.png" alt="DiscoSG main fig" style="max-width:35%;height:auto;display:block;margin:0 auto;" />

Architecture Overview

Caption 
→ [Step 1] Initial Graph
→ [Step 2] Deletion
→ [Step 3] Insertion
→ [Step 4] Refined Graph

Four-Step Refinement Process

Step 1: Initial Graph Generation from Sentence-Level Parsing

| Component | Content |
|-----------|---------|
| Caption | A group of people are seen walking on a concrete pier towards a ferry terminal . . . In the distance, tall buildings loom, indicating that the location is near a city . . . (details omitted for brevity) |
| Caption (split) | S1: A group of people are seen...<br>S2: In the distance, tall buildings loom... |
| Sentence-level Parsing | G1: (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete)<br>G2: (buildings, is, tall) |
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |

Process:

  • Split multi-sentence caption into individual sentences
  • Parse each sentence independently using sentence-level parser
  • Merge sentence-level graphs into initial draft
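
A minimal sketch of this split, parse, and merge step follows. The regex splitter, the `toy_parser` lookup, and the removal of exact duplicate triples during merging are all assumptions made for illustration; the repository's own splitter and sentence-level parser may behave differently.

```python
import re


def split_sentences(caption):
    """Naive split on sentence-final punctuation (an assumption, not the repo's splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]


def build_initial_graph(caption, parse_sentence):
    """Parse each sentence independently and merge the triples into one draft graph."""
    merged = []
    for sentence in split_sentences(caption):
        for triple in parse_sentence(sentence):
            if triple not in merged:  # drop exact duplicates (assumption)
                merged.append(triple)
    return merged


def toy_parser(sentence):
    """Hypothetical stand-in for a sentence-level parser: keyword lookup."""
    lookup = {
        "pier": [("people", "walk on", "pier"), ("pier", "is", "concrete")],
        "buildings": [("buildings", "is", "tall")],
    }
    return [t for key, triples in lookup.items() if key in sentence for t in triples]


caption = "A group of people walk on a concrete pier. In the distance, tall buildings loom."
initial_graph = build_initial_graph(caption, toy_parser)
print(initial_graph)
```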

Step 2: Encoder-Based Deletion Prediction

| Component | Content |
|-----------|---------|
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
| Deletion Prediction | ✅ (people, walk on, pier)<br>✅ (people, walk towards, ferry terminal)<br>❌ (people, move towards, destination)<br>✅ (pier, is, concrete)<br>✅ (buildings, is, tall) |
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |

Process:

  • Encode caption and each graph triple
  • Binary classifier predicts KEEP/DELETE for each triple
  • Removes redundant or incorrect triples
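
The KEEP/DELETE decision can be sketched as a filter over the draft triples. Here `keep_probability` stands in for the encoder-based classifier, the 0.5 threshold is an assumption, and `toy_keep_probability` is a hypothetical heuristic (it only checks whether the triple's object appears in the caption) used so the example runs without a trained model.

```python
def apply_deletion(caption, triples, keep_probability, threshold=0.5):
    """Split triples into (kept, deleted) using a per-triple keep score."""
    kept, deleted = [], []
    for triple in triples:
        target = kept if keep_probability(caption, triple) >= threshold else deleted
        target.append(triple)
    return kept, deleted


def toy_keep_probability(caption, triple):
    """Hypothetical scorer: high score if the triple's object occurs in the caption."""
    return 0.9 if triple[2] in caption else 0.1


caption = (
    "A group of people are seen walking on a concrete pier towards a ferry terminal. "
    "In the distance, tall buildings loom."
)
draft = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("people", "move towards", "destination"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
kept, deleted = apply_deletion(caption, draft, toy_keep_probability)
print(deleted)  # [('people', 'move towards', 'destination')]
```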

Step 3: Decoder-Based Insertion Generation

| Component | Content |
|-----------|---------|
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
| Insertion Input | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall) |
| Insertion Output | (people, is, group of) |
| 3. Insertion. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |

Process:

  • Encode caption and current graph state after deletion
  • Decoder generates missing triples
  • Adds complementary information to graph
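
The insertion step can be sketched in the same style. `generate_missing` stands in for the decoder, and `toy_generator` is a hypothetical rule that only knows about the "group of" attribute from the running example; skipping triples already in the graph is an assumption.

```python
def apply_insertion(caption, graph, generate_missing):
    """Append generated triples that are not already present in the graph."""
    result = list(graph)
    for triple in generate_missing(caption, graph):
        if triple not in result:  # avoid re-adding existing triples (assumption)
            result.append(triple)
    return result


def toy_generator(caption, graph):
    """Hypothetical stand-in for the decoder: proposes one attribute triple."""
    if "group of people" in caption.lower():
        return [("people", "is", "group of")]
    return []


caption = "A group of people are seen walking on a concrete pier towards a ferry terminal."
after_deletion = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
after_insertion = apply_insertion(caption, after_deletion, toy_generator)
print(after_insertion[-1])  # ('people', 'is', 'group of')
```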

Step 4: Refinement

| Component | Content |
|-----------|---------|
| Caption | A group of people are seen walking on a concrete pier towards a ferry terminal . . . In the distance, tall buildings loom, indicating that the location is near a city . . . (details omitted for brevity) |
| 1. Init. | (people, walk on, pier), (people, walk towards, ferry terminal), (people, move towards, destination), (pier, is, concrete), (buildings, is, tall) |
| 2. Deletion. | (people, walk on, pier), (people, walk towards, ferry terminal), ~~(people, move towards, destination)~~, (pier, is, concrete), (buildings, is, tall) |
| 3. Insertion. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |
| 4. Refined. | (people, walk on, pier), (people, walk towards, ferry terminal), (pier, is, concrete), (buildings, is, tall), (people, is, group of) |

Process:

  • Execute deletion (Step 2)
  • Execute insertion (Step 3)
  • (Optional) Repeat refinement cycle for further improvement
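
Putting the cycle together, a refinement loop might look like the sketch below. The `refine` function, its `max_rounds` parameter, and the early stop when a round changes nothing are assumptions for illustration; `toy_delete` and `toy_insert` are hypothetical stand-ins for the deletion and insertion models.

```python
def refine(caption, graph, delete_step, insert_step, max_rounds=2):
    """Alternate deletion and insertion until the graph stops changing or rounds run out."""
    current = list(graph)
    for _ in range(max_rounds):
        after_delete = delete_step(caption, current)
        updated = insert_step(caption, after_delete)
        if updated == current:  # no change in this round, stop early (assumption)
            break
        current = updated
    return current


def toy_delete(caption, graph):
    """Hypothetical deletion model: drops one known-bad triple."""
    bad = {("people", "move towards", "destination")}
    return [t for t in graph if t not in bad]


def toy_insert(caption, graph):
    """Hypothetical insertion model: adds one known-missing triple."""
    missing = ("people", "is", "group of")
    return graph if missing in graph else graph + [missing]


initial = [
    ("people", "walk on", "pier"),
    ("people", "walk towards", "ferry terminal"),
    ("people", "move towards", "destination"),
    ("pier", "is", "concrete"),
    ("buildings", "is", "tall"),
]
refined = refine("A group of people are seen walking...", initial, toy_delete, toy_insert)
print(refined)
```

With these stand-ins the second round makes no changes, so the loop exits and returns the same graph as the "4. Refined." row.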

Key Advantages

| Feature | DiscoSG-Refiner | GPT-4o | Traditional Parsers |
|---------|----------------|---------|---------------------|
| Speed | ⚡ 86× faster | Baseline | Fast but inaccurate |
| Accuracy | 🎯 Comparable | Highest | Poor on discourse |
| Cost | 💰 Low | Very High | Low |
| Open Source | ✅ Yes | ❌ No | ✅ Yes |


🚀 Quick Start

Prerequisites

# Clone the repository
git clone https://github.com/ShaoqLin/DiscoSG.git
cd DiscoSG

# Install dependencies
pip install -r requirements.txt

1. Dataset Configuration

Configure dataset paths in the following files:

detailcap_discosg_mr.py (line 64):

# Replace with your path to DiscoSG_datasets directory
dataset_path = "path/to/DiscoSG_datasets"

dataset_utils.py (lines 136, 167):

# Replace with your path to DiscoSG_datasets directory
dataset_path = "path/to/DiscoSG_datasets"

💡 Note: Use the same dataset settings as DetailCaps and CapArena

2. Fast Inference with Reusable Graphs

For quick inference, use pre-computed graphs from the reusable_graph directory:

python detailcap_discosg_mr.py \
  --original_parse_dict reusable_graph/original_parse.json \
  --sub_sentence_parse_dict reusable_graph/sub_sentence_parse.json \
  --combined_parse_dict reusable_graph/combined_parse.json
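
Before launching the command, it may help to confirm that the three pre-computed files are present. The helper below is a hypothetical pre-flight check and not part of the repository; it takes the file names from the command above and assumes it is run from the repository root.

```python
from pathlib import Path

# File names as they appear in the command above.
GRAPH_FILES = ("original_parse.json", "sub_sentence_parse.json", "combined_parse.json")


def missing_graph_files(directory="reusable_graph"):
    """Return the names of expected graph files that are absent from the directory."""
    root = Path(directory)
    return [name for name in GRAPH_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_graph_files()
    if missing:
        print("Missing pre-computed graphs:", ", ".join(missing))
    else:
        print("All pre-computed graphs found.")
```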

Available parameters:

parser.add_argument("--original_parse_dict", type=str, default=None, 
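
The parameter listing is cut off in this copy. As a rough sketch of how the three flags from the command could be declared with `argparse`: only `type=str` and `default=None` for `--original_parse_dict` appear in the source; applying the same settings to the other two flags, the `build_parser` helper, and the description string are assumptions.

```python
import argparse


def build_parser():
    """Declare the three pre-computed graph flags used in the command above."""
    parser = argparse.ArgumentParser(description="Sketch of the reusable-graph flags")
    for flag in (
        "--original_parse_dict",
        "--sub_sentence_parse_dict",
        "--combined_parse_dict",
    ):
        # type=str and default=None are confirmed only for the first flag;
        # reusing them for the other two is an assumption.
        parser.add_argument(flag, type=str, default=None)
    return parser


args = build_parser().parse_args(
    ["--original_parse_dict", "reusable_graph/original_parse.json"]
)
print(args.original_parse_dict)  # reusable_graph/original_parse.json
print(args.combined_parse_dict)  # None
```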
             