SurveyX
Academic Survey Paper Generation.
🤔 What is SurveyX?

SurveyX is an advanced academic survey automation system that leverages the power of Large Language Models (LLMs) to generate high-quality, domain-specific academic papers and surveys. By simply providing a paper title and keywords for literature retrieval, users can request comprehensive academic papers or surveys tailored to specific topics.
🆚 Full Version vs. Offline Open Source Version
The open-source code in this repository provides offline processing only. To use the full feature set, please log in to our website.
Missing features in the open-source version:
- Real-time online search: you can only generate surveys from your own uploaded .md references. The open-source version lacks access to our paper database, web crawler system, keyword expansion algorithms, and dual-layer semantic filtering for literature acquisition.
- Multimodal document parsing: the generated survey will not include image understanding or illustrations drawn from the references.
🛠️ How to Use the Offline Open Source Version (This repo)
1. Prerequisites
- Python 3.10+ (Anaconda recommended)
- All Python dependencies listed in requirements.txt
- A LaTeX environment (for PDF compilation), e.g. on Debian/Ubuntu:
  sudo apt update && sudo apt install texlive-full
- All reference documents converted to Markdown (.md) format and placed together in a single folder before running the pipeline
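Before installing, it can be useful to confirm the environment meets these requirements. The helper below is a minimal pre-flight sketch (not part of SurveyX itself); it assumes `pdflatex` is the LaTeX compiler you intend to use.

```python
# Hypothetical pre-flight check for the offline pipeline (not part of the repo):
# verifies the Python version and that a LaTeX compiler is on PATH.
import shutil
import sys


def check_prerequisites() -> list:
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if shutil.which("pdflatex") is None:
        problems.append("No LaTeX compiler (pdflatex) found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it once before starting the pipeline; an empty result means both checks passed.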
2. Installation
- Clone the repository:
git clone https://github.com/IAAR-Shanghai/SurveyX.git
cd SurveyX
- Install Python dependencies:
pip install -r requirements.txt
3. LLM Configuration
Edit src/configs/config.py to provide your LLM API URL, token, and model information before running the pipeline.
Example:
REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."
DEFAULT_EMBED_ONLINE_MODEL = "BAAI/bge-base-en-v1.5"
EMBED_REMOTE_URL = "https://api.siliconflow.cn/v1/embeddings"
EMBED_TOKEN = "your embed token here"
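To illustrate how these config values are typically consumed, here is a minimal sketch of building a chat-completions request from `REMOTE_URL` and `TOKEN`. The payload shape follows the standard OpenAI chat-completions format; the model name `"gpt-4o"` is a placeholder assumption, not something the repo prescribes.

```python
# Sketch (assumption, not the repo's actual client code): build a chat-completions
# request using the REMOTE_URL and TOKEN values from src/configs/config.py.
import json
import urllib.request

REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."  # placeholder from the example above, not a real token


def build_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Construct (but do not send) a POST request for one user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        REMOTE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Actually sending the request requires a valid token:
# resp = urllib.request.urlopen(build_request("Summarize RAG in one line."))
```

The same bearer-token pattern applies to the embedding endpoint (`EMBED_REMOTE_URL` / `EMBED_TOKEN`).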
4. Workflow
Each run creates a unique result folder under outputs/, named by its task id: outputs/<task_id> (e.g., outputs/2025-06-18-0935_keyword/).
Run the full pipeline:
python tasks/offline_run.py --title "Your Survey Title" --key_words "keyword1, keyword2, ..." --ref_path "path/to/your/reference/dir"
Or run step by step:
export task_id="your_task_id"
python tasks/workflow/03_gen_outlines.py --task_id $task_id
python tasks/workflow/04_gen_content.py --task_id $task_id
python tasks/workflow/05_post_refine.py --task_id $task_id
python tasks/workflow/06_gen_latex.py --task_id $task_id
Note: Your local reference documents must be in Markdown (.md) format and placed in a single directory.
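Since the pipeline only accepts Markdown references, a quick validation of the `--ref_path` directory can save a failed run. This helper is a hedged sketch (`validate_ref_dir` is a hypothetical name, not a function in the repo):

```python
# Hypothetical helper (not part of SurveyX): confirm the reference directory
# exists and contains at least one Markdown (.md) file before launching
# tasks/offline_run.py.
from pathlib import Path


def validate_ref_dir(ref_path: str) -> list:
    """Return the Markdown references found, raising if the directory is unusable."""
    ref_dir = Path(ref_path)
    if not ref_dir.is_dir():
        raise FileNotFoundError(f"{ref_path} is not a directory")
    md_files = sorted(ref_dir.glob("*.md"))
    if not md_files:
        raise ValueError(f"No .md references found in {ref_path}")
    return md_files
```

For example, `validate_ref_dir("path/to/your/reference/dir")` returns the list of `.md` files the pipeline will see, or raises with a clear message if the folder is empty or missing.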
5. Output
All results are saved under outputs/<task_id>/:
- survey.pdf: final compiled survey
- outlines.json: generated outline
- latex/: LaTeX sources
- tmp/: intermediate files
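After a batch of runs, the newest result can be located programmatically. This is a small sketch under the folder layout described above (`latest_survey` is a hypothetical helper, not repo code); it relies on the date-prefixed task-id naming so that lexicographic order matches chronological order.

```python
# Sketch (assumption): find the most recent outputs/<task_id>/survey.pdf.
# Task folders are named with a date prefix (e.g. 2025-06-18-0935_keyword),
# so sorting folder names descending yields the newest run first.
from pathlib import Path
from typing import Optional


def latest_survey(outputs_dir: str = "outputs") -> Optional[Path]:
    """Return the newest outputs/<task_id>/survey.pdf, or None if absent."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return None
    for task_dir in sorted(root.iterdir(), reverse=True):
        pdf = task_dir / "survey.pdf"
        if pdf.is_file():
            return pdf
    return None
```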
Example Papers
| Title | Keywords |
| --- | --- |
| A Survey of NoSQL Database Systems for Flexible and Scalable Data Management | NoSQL, Database Systems, Flexibility, Scalability, Data Management |
| Vector Databases and Their Role in Modern Data Management and Retrieval A Survey | Vector Databases, Data Management, Data Retrieval, Modern Applications |
| Graph Databases A Survey on Models, Data Modeling, and Applications | Graph Databases, Data Modeling |
| A Survey on Large Language Model Integration with Databases for Enhanced Data Management and Survey Analysis | Large Language Models, Database Integration, Data Management, Survey Analysis, Enhanced Processing |
| A Survey of Temporal Databases Real-Time Databases and Data Management Systems | Temporal Databases, Real-Time Databases, Data Management |
| From BERT to GPT-4: A Survey of Architectural Innovations in Pre-trained Language Models | Transformer, BERT, GPT-3, self-attention, masked language modeling, cross-lingual transfer, model scaling |
| Unsupervised Cross-Lingual Word Embedding Alignment: Techniques and Applications | low-resource NLP, few-shot learning, data augmentation, unsupervised alignment, synthetic corpora, NLLB, zero-shot transfer |
| Vision-Language Pre-training: Architectures, Benchmarks, and Emerging Trends | multimodal learning, CLIP, Whisper, cross-modal retrieval, modality fusion, video-language models, contrastive learning |
| Efficient NLP at Scale: A Review of Model Compression Techniques | model compression, knowledge distillation, pruning, quantization, TinyBERT, edge computing, latency-accuracy tradeoff |
| Domain-Specific NLP: Adapting Models for Healthcare, Law, and Finance | domain adaptation, BioBERT, legal NLP, clinical text analysis, privacy-preserving NLP, terminology extraction, few-shot domain transfer |
| Attention Heads of Large Language Models: A Survey | attention head, attention mechanism, large language model, LLM, transformer architecture, neural networks, natural language processing |
| Controllable Text Generation for Large Language Models: A Survey | controlled text generation, text generation, large language model, LLM, natural language processing |
| A survey on evaluation of large language models | evaluation of large language models, large language models assessment, natural language processing, AI model evaluation |
| Large language models for generative information extraction: a survey | information extraction, large language models, LLM, natural language processing, generative AI, text mining |
| Internal consistency and self feedback of LLM | Internal consistency, self feedback, large language model, LLM, natural language processing, model evaluation, AI reliability |
| Review of Multi Agent Offline Reinforcement Learning | multi agent, offline policy, reinforcement learning, decentralized learning, cooperative agents, policy optimization |
| Reasoning of large language model: A survey | reasoning of large language models, large language models, LLM, natural language processing, AI reasoning, transformer models |
| Hierarchy Theorems in Computational Complexity: From Time-Space Tradeoffs to Oracle Separations | P vs NP, NP-completeness, polynomial hierarchy, space complexity, oracle separation, Cook-Levin theorem |
| Classical Simulation of Quantum Circuits: Complexity Barriers and Implications | BQP, quantum supremacy, Shor's algorithm, post-quantum cryptography, QMA, hidden subgroup problem |
| Kernelization: Theory, Techniques, and Limits | fixed-parameter tractable (FPT), kernelization, treewidth, W-hierarchy, ETH (Exponential Time Hypothesis), parameterized reduction |
| Optimal Inapproximability Thresholds for Combinatorial Optimization Problems | PCP theorem, approximation ratio, Unique Games Conjecture, APX-hardness, gap-preserving reduction, LP relaxation |
| [Hardness in P: When Poly
