
<p align="center"> <img src="assets/logo.png"/> </p> <!-- icon -->


GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

English | 中文


📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the paper and best practice.

It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. GraphGen also incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
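The ECE-based gap identification described above can be sketched as follows. This is a minimal illustrative implementation of expected calibration error, assuming binary correctness labels and equal-width confidence bins; the function name and binning scheme are assumptions, not GraphGen's actual code.

```python
# Illustrative sketch of expected calibration error (ECE).
# Samples are bucketed into equal-width confidence bins; ECE is the
# bin-weighted gap between average confidence and empirical accuracy.
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model probabilities in [0, 1]; correct: 0/1 outcomes."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each sample falls in exactly one bin; the top edge is
        # inclusive only for the last bin so conf == 1.0 is counted.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - avg_conf)
    return ece
```

A large ECE over a region of the graph signals knowledge the model is confidently wrong about, which is exactly where generating more QA pairs pays off.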

After data generation, you can use LLaMA-Factory and xtuner to fine-tune your LLMs.
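LLaMA-Factory can consume Alpaca-style JSON, so a thin conversion step is usually enough to bridge generated QA pairs into a fine-tuning run. A minimal sketch, assuming the generated records expose "question" and "answer" fields (adapt the keys to your actual output schema):

```python
import json

# Hypothetical bridge: convert QA pairs into Alpaca-style records
# that LLaMA-Factory (and similar SFT tooling) can load directly.
def qa_to_alpaca(qa_pairs, out_path):
    records = [
        {"instruction": qa["question"], "input": "", "output": qa["answer"]}
        for qa in qa_pairs
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```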

📌 Latest Updates

  • 2026.02.04: We now support HuggingFace Datasets as an input data source for data generation.
  • 2026.01.15: LLM benchmark synthesis now supports single/multiple-choice, fill-in-the-blank, and true-or-false questions, ideal for education 🌟🌟
  • 2025.12.26: Added knowledge graph evaluation metrics for accuracy (entity/relation), consistency (conflict detection), and structural robustness (noise, connectivity, degree distribution).
<details> <summary>History</summary>
  • 2025.12.16: Added rocksdb as a key-value storage backend and kuzudb as a graph database backend.
  • 2025.12.16: Added vllm as a local inference backend.
  • 2025.12.16: Refactored the data generation pipeline with Ray to improve the efficiency of distributed execution and resource management.
  • 2025.12.01: Added search support for the NCBI and RNAcentral databases, enabling extraction of DNA and RNA data from these bioinformatics resources.
  • 2025.10.30: We support several new LLM clients and inference backends, including Ollama, HTTP, HuggingFace Transformers, and SGLang.
  • 2025.10.23: We now support VQA (Visual Question Answering) data generation. Run: bash scripts/generate/generate_vqa.sh.
  • 2025.10.21: We now support PDF as an input format for data generation via MinerU.
  • 2025.09.29: We auto-update the Gradio demo on Hugging Face and ModelScope.
  • 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
  • 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
  • 2025.04.21: We have released the initial version of GraphGen.
</details>

Effectiveness of GraphGen

Pretrain

Inspired by Kimi-K2's technical report (Improving Token Utility with Rephrasing) and ByteDance Seed's Reformulation for Pretraining Data Augmentation (MGA framework), GraphGen adds a rephrase pipeline that uses LLM-driven reformulation to generate diverse variants of the same corpus instead of redundant repetition.

Setup: Qwen3-0.6B trained from scratch on SlimPajama-6B.

| Method | ARC-E | ARC-C | HellaSwag | GSM8K | TruthfulQA-MC1 | TruthfulQA-MC2 | Average |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SlimPajama-6B trained for 2 epochs | 25.55 | 21.08 | 24.48 | 0.08 | 24.36 | 49.90 | 24.24 |
| SlimPajama-6B + Executive-Summary Rephrase trained for 1 epoch | 26.43 | 22.70 | 24.75 | 1.36 | 26.19 | 51.90 | 25.56 (↑1.32) |
| SlimPajama-6B + Cross-Domain Rephrase trained for 1 epoch | 28.79 | 20.22 | 24.46 | 0.00 | 24.97 | 52.41 | 25.14 (↑0.90) |

Both rephrase methods lift the average by ~1 point over the baseline with zero additional data — all gains come from how the same knowledge is expressed.
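The rephrase step described above can be sketched as a style-controlled prompt loop. The prompt wording and style names below are assumptions for illustration, not GraphGen's actual prompts; `llm` stands for any chat-completion callable.

```python
# Hypothetical style-controlled rephrase templates; one LLM call per
# style turns a single passage into several surface-form variants.
REPHRASE_STYLES = {
    "executive_summary": (
        "Rewrite the passage as a concise executive summary that keeps "
        "every fact but drops redundancy:\n\n{passage}"
    ),
    "cross_domain": (
        "Rewrite the passage for a reader in an unrelated domain, using "
        "analogies while preserving all factual content:\n\n{passage}"
    ),
}

def rephrase_variants(passage, llm):
    """Return one rephrased variant of `passage` per style."""
    return {
        style: llm(template.format(passage=passage))
        for style, template in REPHRASE_STYLES.items()
    }
```

Each variant carries the same knowledge in a fresh surface form, so one epoch over the variants can stand in for repeated epochs over identical text.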

SFT

Here are post-training results in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:---:|:---:|:---:|:---:|
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| | AIME25 | 22.7 | 7.2 |

⚙️ Support List

We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types. Users can mix and match these to fit their synthetic-data needs.

| Inference Server | API Server | Inference Client | Data Source | Data Modal | Data Type |
|---|---|---|---|---|---|
| ![hf-icon]HF<br>![sg-icon]SGLang<br>![vllm-icon]vllm | ![sif-icon]Silicon<br>![oai-icon]OpenAI<br>![az-icon]Azure | HTTP<br>![ol-icon]Ollama<br>![oai-icon]OpenAI | Files (CSV, JSON, PDF, TXT, etc.)<br>Databases (![uniprot-icon]UniProt, [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines ([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs ([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |

<!-- links -->
