
FactualSceneGraph

[ACL 2023 Findings] The FACTUAL dataset and the textual scene graph parser trained on it.


[ACL 2023 Findings] FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

<p align="center"> <strong>Faithful and Consistent Textual Scene Graph Parsing</strong><br/> Official repository for the ACL 2023 Findings paper, with code, datasets, pretrained models, and evaluation tools. </p> <p align="center"> <a href="https://aclanthology.org/2023.findings-acl.398"> <img src="https://img.shields.io/badge/Paper-FACTUAL-blue.svg" alt="FACTUAL"> </a> <a href="https://arxiv.org/abs/2305.17497"> <img src="https://img.shields.io/badge/arXiv-2305.17497-b31b1b.svg" alt="arXiv 2305.17497"> </a> <a href="https://arxiv.org/abs/2506.15583"> <img src="https://img.shields.io/badge/Paper-DiscoSG--Refiner-purple.svg" alt="DiscoSG-Refiner"> </a> <a href="https://pypi.org/project/FactualSceneGraph/"> <img src="https://img.shields.io/pypi/v/FactualSceneGraph?color=green" alt="PyPI version"> </a> <a href="https://pepy.tech/projects/FactualSceneGraph"> <img src="https://static.pepy.tech/badge/FactualSceneGraph" alt="Downloads"> </a> <a href="https://opensource.org/licenses/MIT"> <img src="https://img.shields.io/badge/License-MIT-green.svg" alt="MIT License"> </a> </p> <p align="center"> <img src="https://github.com/zhuang-li/FACTUAL/blob/main/logo/monash_logo.png" alt="Monash University Logo" height="72" /> <img src="https://github.com/zhuang-li/FACTUAL/blob/main/logo/adobe_logo.png" alt="Adobe Logo" height="72" /> <img src="https://github.com/zhuang-li/FACTUAL/blob/main/logo/wuhan_logo.png" alt="Wuhan University Logo" height="72" /> </p>

Overview

FACTUAL is a benchmark and toolkit for faithful, consistent, and practical textual scene graph parsing. It provides:

  • pretrained parsers for converting text into scene graphs,
  • benchmark datasets for training and evaluation,
  • evaluation tools including SPICE, Soft-SPICE, and Set Match,
  • support for both single-sentence and multi-sentence scene graph parsing.

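This repository now also includes discourse-level multi-sentence parsing. Alongside the default single-sentence mode, there are two multi-sentence options, sentence_merge and DiscoSG-Refiner: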

| Mode | Best for | Description |
|---|---|---|
| default | Single sentences | Standard one-pass parsing |
| sentence_merge | Multi-sentence descriptions | Split → parse → merge → deduplicate |
| DiscoSG-Refiner | Highest-quality multi-sentence parsing | Sentence merging followed by iterative refinement |


Highlights

  • 40,369 FACTUAL scene graph instances with lemmatized predicates
  • 2.9M cleaned Visual Genome scene graph instances for pretraining
  • Pretrained Flan-T5 models in multiple sizes
  • Advanced discourse-level parsing for long, multi-sentence descriptions
  • Unified evaluation toolkit for scene graph parsing and caption evaluation
  • Supports CPU, CUDA, and Apple Silicon (MPS)
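
To make the evaluation side of the toolkit more concrete, here is a minimal sketch of a set-based triple comparison. It assumes that Set Match amounts to an F1 score over exactly matching triples; the function name `set_match_f1` is hypothetical, and the toolkit's own implementation may normalise or match triples differently.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def set_match_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """F1 over exact triple overlap (an assumed reading of Set Match)."""
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        # Covers empty inputs as well as graphs with nothing in common.
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative graphs only: the prediction misses one of four gold triples.
gold_graph = [
    ("pigs", "is", "2"),
    ("pigs", "fly on", "sky"),
    ("bags", "on back of", "pigs"),
    ("bags", "is", "2"),
]
predicted_graph = [
    ("pigs", "is", "2"),
    ("pigs", "fly on", "sky"),
    ("bags", "is", "2"),
]

score = set_match_f1(predicted_graph, gold_graph)
print(round(score, 3))
```

Here precision is 3/3 and recall is 3/4, so the score is 6/7.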

Installation

pip install FactualSceneGraph

For DiscoSG-Refiner

The discosg dependency is installed automatically. If installation fails or you want the latest version:

pip install --upgrade FactualSceneGraph

For development or the newest GitHub version:

pip install git+https://github.com/zhuang-li/FACTUAL.git

Notes

  • Python: 3.8+
  • Apple Silicon: use device='mps'
  • CUDA: install a CUDA-enabled PyTorch build before installing this package
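
If you would rather not hard-code the device string, a small helper can pick one at runtime. This is a sketch, not part of the package: `pick_device` and `detect_device` are hypothetical names, and the CUDA-then-MPS-then-CPU preference order is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon (MPS), then CPU (assumed priority)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```

The returned string can then be passed as the `device` argument of `SceneGraphParser`.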

Main Dependencies

  • Core: torch, transformers, nltk, spacy
  • Refinement: discosg, peft, huggingface-hub
  • Evaluation: sentence-transformers, pandas, numpy

Quick Start

1) Minimal example

from factual_scene_graph.parser.scene_graph_parser import SceneGraphParser

parser = SceneGraphParser('lizhuang144/flan-t5-base-VG-factual-sg', device='cpu')
result = parser.parse(
    ["2 beautiful pigs are flying on the sky with 2 bags on their backs"],
    beam_size=1,
    return_text=True
)

print(result[0])
# ( pigs , is , 2 ) , ( pigs , is , beautiful ) , ( bags , on back of , pigs ) ,
# ( pigs , fly on , sky ) , ( bags , is , 2 )

2) Raw Transformers usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lizhuang144/flan-t5-base-VG-factual-sg")
model = AutoModelForSeq2SeqLM.from_pretrained("lizhuang144/flan-t5-base-VG-factual-sg")

text = tokenizer(
    "Generate Scene Graph: 2 pigs are flying on the sky with 2 bags on their backs",
    max_length=200,
    return_tensors="pt",
    truncation=True
)

generated_ids = model.generate(
    text["input_ids"],
    attention_mask=text["attention_mask"],
    use_cache=True,
    decoder_start_token_id=tokenizer.pad_token_id,
    num_beams=1,
    max_length=200,
    early_stopping=True
)

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
# Output: `( pigs , is , 2 ) , ( bags , on back of , pigs ), ( bags , is , 2 ) , ( pigs , fly on , sky )`

In this output format, the predicate `is` corresponds to an attribute-style relation: ( pigs , is , 2 ) attaches the attribute 2 to the entity pigs.
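
When `return_text=True`, the graph comes back as a single string, so downstream code usually needs to split it into triples. The sketch below shows one way to do that with a regular expression. `parse_triples` and `split_attributes` are hypothetical helpers, not package functions, and triples whose fields contain commas or parentheses are skipped rather than repaired.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Matches the text between one pair of parentheses.
TRIPLE_PATTERN = re.compile(r"\(([^()]*)\)")


def parse_triples(graph_text: str) -> List[Triple]:
    """Turn '( a , b , c ) , ( d , e , f )' into a list of 3-tuples."""
    triples = []
    for body in TRIPLE_PATTERN.findall(graph_text):
        parts = [part.strip() for part in body.split(",")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def split_attributes(triples: List[Triple]) -> Tuple[List[Triple], List[Triple]]:
    """Separate attribute triples (predicate 'is') from all other relations."""
    attributes = [t for t in triples if t[1] == "is"]
    relations = [t for t in triples if t[1] != "is"]
    return attributes, relations


text = (
    "( pigs , is , 2 ) , ( pigs , is , beautiful ) , "
    "( bags , on back of , pigs ) , ( pigs , fly on , sky ) , ( bags , is , 2 )"
)
attributes, relations = split_attributes(parse_triples(text))
print(attributes)
print(relations)
```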


Multi-Sentence Parsing

Modern VLMs often generate long, rich descriptions instead of single-sentence captions. FACTUAL now supports multi-sentence scene graph parsing in two ways.

Option A — sentence_merge

Use this for efficient parsing of long descriptions, detailed captions, or short stories.

from factual_scene_graph.parser.scene_graph_parser import SceneGraphParser

parser = SceneGraphParser(
    'lizhuang144/flan-t5-base-VG-factual-sg',
    parser_type='sentence_merge',
    device='cpu'
)

descriptions = [
    """The image captures a serene scene in a park. A gravel path, dappled with sunlight
    filtering through the tall trees on either side, winds its way towards a white bridge.""",

    """A bustling urban scene unfolds in the city center. People walk along the sidewalks
    carrying shopping bags. Cars and buses navigate through the busy streets.""",
]

results = parser.parse(
    descriptions,
    beam_size=5,
    batch_size=8,
    return_text=True
)

How it works

  1. Split text into sentences with NLTK
  2. Parse each sentence efficiently in batches
  3. Merge graphs automatically
  4. Deduplicate repeated entities and relations
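
The four steps above can be sketched in plain Python. This is an illustration of the idea, not the package's implementation: a regular expression stands in for the NLTK sentence splitter, a lookup table (`FAKE_PARSES`, invented for this example) stands in for the model, batching is omitted, and deduplication is reduced to dropping exact repeated triples.

```python
import re
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def split_sentences(text: str) -> List[str]:
    """Step 1: regex stand-in for an NLTK sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def merge_graphs(graphs: List[List[Triple]]) -> List[Triple]:
    """Steps 3 and 4: concatenate graphs, dropping exact duplicates in order."""
    seen = set()
    merged = []
    for graph in graphs:
        for triple in graph:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


def sentence_merge(text: str, parse_sentence: Callable[[str], List[Triple]]) -> List[Triple]:
    """Split, parse each sentence (step 2), then merge and deduplicate."""
    graphs = [parse_sentence(sentence) for sentence in split_sentences(text)]
    return merge_graphs(graphs)


# Invented per-sentence outputs standing in for the real parser.
FAKE_PARSES = {
    "The dog runs in the park.": [("dog", "run in", "park")],
    "The park has green grass.": [("park", "have", "grass"), ("grass", "is", "green")],
    "The grass is green.": [("grass", "is", "green")],
}

description = "The dog runs in the park. The park has green grass. The grass is green."
merged = sentence_merge(description, lambda sentence: FAKE_PARSES.get(sentence, []))
print(merged)
```

The repeated ( grass , is , green ) triple from the third sentence appears only once in the merged graph.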

Option B — DiscoSG-Refiner

Use this when you want the highest-quality multi-sentence scene graphs.

from factual_scene_graph.parser.scene_graph_parser import SceneGraphParser

parser = SceneGraphParser(
    'lizhuang144/flan-t5-base-VG-factual-sg',
    parser_type='DiscoSG-Refiner',
    refiner_checkpoint_path='sqlinn/DiscoSG-Refiner-Large-t5-only',
    device='cuda'
)

result = parser.parse(
    ["""The image captures a bustling urban scene, likely in a European city.
    The setting appears to be a pedestrian-friendly square or plaza.
    There are numerous people of various ages and attire walking around."""],
    task='delete_before_insert',
    refinement_rounds=2,
    max_input_len=1024,
    max_output_len=512,
    batch_size=2,
    beam_size=5,
    return_text=True
)

Why use it

  • iterative multi-round refinement,
  • automatic handling of single-sentence and multi-sentence inputs in the same batch,
  • graceful fallback when the input is too long,
  • strong performance for research and high-precision applications.

Which one should I choose?

| Parser type | Recommended use | Strength |
|---|---|---|
| default | Simple captions or short text | Fastest single-pass parser |
| sentence_merge | Long descriptions and multi-sentence text | Fast, stable, efficient |
| DiscoSG-Refiner | Research and highest-quality parsing | Best quality through iterative refinement |

Example: mixed batch with automatic refinement

from factual_scene_graph.parser.scene_graph_parser import SceneGraphParser

parser = SceneGraphParser(
    'lizhuang144/flan-t5-base-VG-factual-sg',
    parser_type='DiscoSG-Refiner',
    refiner_checkpoint_path='sqlinn/DiscoSG-Refiner-Large-t5-only',
    device='mps'
)

mixed_descriptions = [
    "A red car is parked.",
    """The dog runs in the park. The park has green grass and tall trees.
    Children are playing on the swings.""",
    "The cat sleeps on the sofa.",
    """The restaurant is busy tonight. Waiters serve food to customers at tables.
    The kitchen staff prepares fresh meals. Soft music plays in the background."""
]

results = parser.parse(
    mixed_descriptions,
    max_input_len=1024,
    max_output_len=512,
    beam_size=5,
    batch_size=4,
    refinement_rounds=2,
    task='delete_before_insert',
    return_text=True
)

Recommended settings

  • max_input_len between 512 and 1024 for long descriptions
  • refinement_rounds of 2 or 3 for a good quality/speed balance
  • device='mps' for Apple Silicon, device='cuda' for NVIDIA GPUs
  • tune batch_size based on memory budget

Safety behavior

The parser checks whether max_input_len is sufficient before running refinement. If the input is too long, refinement is skipped automatically and the parser falls back to sentence merging.

parser.parse(["Very long description..."], beam_size=5, max_input_len=64)
# Warning: Skipping DiscoSG refinement because max_input_len is too small.
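
The control flow behind this fallback can be pictured roughly as follows. This is a sketch under stated assumptions, not the parser's code: `parse_with_fallback`, `stub_merge`, and `stub_refine` are hypothetical, and input length is measured in whitespace tokens here, whereas the real check presumably uses the model tokenizer.

```python
import warnings
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Assumed length measure: whitespace tokens, not the model tokenizer."""
    return len(text.split())


def parse_with_fallback(
    text: str,
    max_input_len: int,
    merge: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], List[str]],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Always build the merged graph; refine it only if the input fits the budget."""
    draft = merge(text)
    if count_tokens(text) > max_input_len:
        warnings.warn("Skipping refinement because max_input_len is too small.")
        return draft
    return refine(text, draft)


# Hypothetical stubs standing in for the sentence-merge and refinement steps.
def stub_merge(text: str) -> List[str]:
    return ["( dog , run in , park )"]


def stub_refine(text: str, draft: List[str]) -> List[str]:
    return draft + ["( park , have , grass )"]


short_text = "The dog runs in the park."
print(parse_with_fallback(short_text, 64, stub_merge, stub_refine))  # refined
print(parse_with_fallback(short_text, 3, stub_merge, stub_refine))   # fallback
```

The point of the design is that a too-small budget degrades quality instead of raising an error: the caller still receives the sentence-merge graph.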

Datasets

FACTUAL Scene Graph Dataset

The FACTUAL Scene Graph dataset contains 40,369 instances with lemmatized predicates/relations.

Storage

  • Local: data/factual_sg/factual_sg.csv
  • Hugging Face: load_dataset('lizhuang144/FACTUAL_Scene_Graph')

Splits

| Split type | Train | Dev | Test |
|---|---|---|---|
| Random | data/factual_sg/random/train.csv | data/factual_sg/random/dev.csv | data/factual_sg/random/test.csv |
| Length | data/factual_sg/length/train.csv | data/factual_sg/length/dev.csv | data/factual_sg/length/test.csv |

Fields

  • image_id: Visual Genome image ID
  • region_id: Visual Genome region ID
  • caption: region caption
  • scene_graph: scene graph for the region caption
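
As a rough illustration of working with these fields, the sketch below reads rows with the standard `csv` module and checks that the four columns are present. It assumes the CSV header uses exactly the field names listed above. The sample row is invented for this example and is not taken from the dataset, and `load_rows` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List, TextIO

EXPECTED_FIELDS = ["image_id", "region_id", "caption", "scene_graph"]

# Invented sample row, for illustration only (not real dataset content).
SAMPLE_CSV = (
    "image_id,region_id,caption,scene_graph\n"
    '1,100,"a red car is parked","( car , is , red )"\n'
)


def load_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Read all rows and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = [name for name in EXPECTED_FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["region_id"], row["caption"], row["scene_graph"])
```

To read the local file instead, open data/factual_sg/factual_sg.csv with `newline=""` and `encoding="utf-8"` and pass the handle to `load_rows`.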

Related resource

  • Visual Genome images/regions: `load_dataset(