
<p align="center"> <img src="assets/logo.png"/> </p> <!-- icon -->


GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

English | 中文


📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the paper and best practice.

It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. GraphGen also incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
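The ECE-based gap identification described above can be sketched as follows. This is a minimal illustrative implementation of expected calibration error, assuming binary correctness labels and equal-width confidence bins; the function name and binning scheme are assumptions, not GraphGen's actual code.

```python
# Illustrative sketch of expected calibration error (ECE).
# Samples are bucketed into equal-width confidence bins; ECE is the
# bin-weighted gap between average confidence and empirical accuracy.
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model probabilities in [0, 1]; correct: 0/1 outcomes."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each sample falls in exactly one bin; the top edge is
        # inclusive only for the last bin so conf == 1.0 is counted.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - avg_conf)
    return ece
```

A large ECE over a region of the graph signals knowledge the model is confidently wrong about, which is exactly where generating more QA pairs pays off.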

After data generation, you can use LLaMA-Factory and xtuner to fine-tune your LLMs.
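LLaMA-Factory can consume Alpaca-style JSON, so a thin conversion step is usually enough to bridge generated QA pairs into a fine-tuning run. A minimal sketch, assuming the generated records expose "question" and "answer" fields (adapt the keys to your actual output schema):

```python
import json

# Hypothetical bridge: convert QA pairs into Alpaca-style records
# that LLaMA-Factory (and similar SFT tooling) can load directly.
def qa_to_alpaca(qa_pairs, out_path):
    records = [
        {"instruction": qa["question"], "input": "", "output": qa["answer"]}
        for qa in qa_pairs
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```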

📌 Latest Updates

  • 2026.02.04: We now support HuggingFace Datasets as an input data source for data generation.
  • 2026.01.15: LLM benchmark synthesis now supports single/multiple-choice, fill-in-the-blank, and true-or-false questions, ideal for education 🌟🌟
  • 2025.12.26: Added knowledge graph evaluation metrics for accuracy (entity/relation), consistency (conflict detection), and structural robustness (noise, connectivity, degree distribution).
<details> <summary>History</summary>
  • 2025.12.16: Added rocksdb as a key-value storage backend and kuzudb as a graph database backend.
  • 2025.12.16: Added vllm as a local inference backend.
  • 2025.12.16: Refactored the data generation pipeline with Ray to improve the efficiency of distributed execution and resource management.
  • 2025.12.01: Added search support for the NCBI and RNAcentral databases, enabling extraction of DNA and RNA data from these bioinformatics resources.
  • 2025.10.30: We support several new LLM clients and inference backends, including Ollama, HTTP, HuggingFace Transformers, and SGLang.
  • 2025.10.23: We now support VQA (Visual Question Answering) data generation. Run: bash scripts/generate/generate_vqa.sh.
  • 2025.10.21: We now support PDF as an input format for data generation via MinerU.
  • 2025.09.29: We auto-update the Gradio demo on Hugging Face and ModelScope.
  • 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
  • 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
  • 2025.04.21: We have released the initial version of GraphGen.
</details>

Effectiveness of GraphGen

Pretrain

Inspired by Kimi-K2's technical report (Improving Token Utility with Rephrasing) and ByteDance Seed's Reformulation for Pretraining Data Augmentation (MGA framework), GraphGen adds a rephrase pipeline that uses LLM-driven reformulation to generate diverse variants of the same corpus instead of redundant repetition.

Setup: Qwen3-0.6B trained from scratch on SlimPajama-6B.

| Method | ARC-E | ARC-C | HellaSwag | GSM8K | TruthfulQA-MC1 | TruthfulQA-MC2 | Average |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SlimPajama-6B trained for 2 epochs | 25.55 | 21.08 | 24.48 | 0.08 | 24.36 | 49.90 | 24.24 |
| SlimPajama-6B + Executive-Summary Rephrase trained for 1 epoch | 26.43 | 22.70 | 24.75 | 1.36 | 26.19 | 51.90 | 25.56 (↑1.32) |
| SlimPajama-6B + Cross-Domain Rephrase trained for 1 epoch | 28.79 | 20.22 | 24.46 | 0.00 | 24.97 | 52.41 | 25.14 (↑0.90) |

Both rephrase methods lift the average by ~1 point over the baseline with zero additional data — all gains come from how the same knowledge is expressed.
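The rephrase step described above can be sketched as a style-controlled prompt loop. The prompt wording and style names below are assumptions for illustration, not GraphGen's actual prompts; `llm` stands for any chat-completion callable.

```python
# Hypothetical style-controlled rephrase templates; one LLM call per
# style turns a single passage into several surface-form variants.
REPHRASE_STYLES = {
    "executive_summary": (
        "Rewrite the passage as a concise executive summary that keeps "
        "every fact but drops redundancy:\n\n{passage}"
    ),
    "cross_domain": (
        "Rewrite the passage for a reader in an unrelated domain, using "
        "analogies while preserving all factual content:\n\n{passage}"
    ),
}

def rephrase_variants(passage, llm):
    """Return one rephrased variant of `passage` per style."""
    return {
        style: llm(template.format(passage=passage))
        for style, template in REPHRASE_STYLES.items()
    }
```

Each variant carries the same knowledge in a fresh surface form, so one epoch over the variants can stand in for repeated epochs over identical text.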

SFT

Here are post-training results in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:---:|:---:|:---:|:---:|
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| | AIME25 | 22.7 | 7.2 |

⚙️ Support List

We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types. Users can mix and match these to fit their synthetic-data needs.

| Inference Server | API Server | Inference Client | Data Source | Data Modal | Data Type |
|---|---|---|---|---|---|
| ![hf-icon]HF<br>![sg-icon]SGLang<br>![vllm-icon]vllm | ![sif-icon]Silicon<br>![oai-icon]OpenAI<br>![az-icon]Azure | HTTP<br>![ol-icon]Ollama<br>![oai-icon]OpenAI | Files (CSV, JSON, PDF, TXT, etc.)<br>Databases (![uniprot-icon]UniProt, [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines ([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs ([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |

<!-- links -->
