SurveyX
Academic Survey Paper Generation.
🤔 What is SurveyX?

SurveyX is an advanced academic survey automation system that leverages the power of Large Language Models (LLMs) to generate high-quality, domain-specific academic papers and surveys. By simply providing a paper title and keywords for literature retrieval, users can request comprehensive academic papers or surveys tailored to specific topics.
🆚 Full Version vs. Offline Open Source Version
The open-source code in this repository provides offline processing only. To use the full feature set, please log in to our website.
Missing features in the open-source version:
- Real-time online search: you can only generate surveys from your own uploaded .md references. The open-source version lacks access to our paper database, web crawler system, keyword expansion algorithms, and dual-layer semantic filtering for literature acquisition.
- Multimodal document parsing: the generated survey will not include image understanding or illustrations drawn from the references.
🛠️ How to Use the Offline Open Source Version (This repo)
1. Prerequisites
- Python 3.10+ (Anaconda recommended)
- All Python dependencies listed in requirements.txt
- A LaTeX environment (for PDF compilation), e.g. on Debian/Ubuntu:
  sudo apt update && sudo apt install texlive-full
- All reference documents converted to Markdown (.md) format and placed together in a single folder before running the pipeline
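Before installing, it can be useful to confirm the environment meets these requirements. The helper below is a minimal pre-flight sketch (not part of SurveyX itself); it assumes `pdflatex` is the LaTeX compiler you intend to use.

```python
# Hypothetical pre-flight check for the offline pipeline (not part of the repo):
# verifies the Python version and that a LaTeX compiler is on PATH.
import shutil
import sys


def check_prerequisites() -> list:
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if shutil.which("pdflatex") is None:
        problems.append("No LaTeX compiler (pdflatex) found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it once before starting the pipeline; an empty result means both checks passed.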
2. Installation
- Clone the repository:
git clone https://github.com/IAAR-Shanghai/SurveyX.git
cd SurveyX
- Install Python dependencies:
pip install -r requirements.txt
3. LLM Configuration
Edit src/configs/config.py to provide your LLM API URL, token, and model information before running the pipeline.
Example:
REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."
DEFAULT_EMBED_ONLINE_MODEL = "BAAI/bge-base-en-v1.5"
EMBED_REMOTE_URL = "https://api.siliconflow.cn/v1/embeddings"
EMBED_TOKEN = "your embed token here"
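To illustrate how these config values are typically consumed, here is a minimal sketch of building a chat-completions request from `REMOTE_URL` and `TOKEN`. The payload shape follows the standard OpenAI chat-completions format; the model name `"gpt-4o"` is a placeholder assumption, not something the repo prescribes.

```python
# Sketch (assumption, not the repo's actual client code): build a chat-completions
# request using the REMOTE_URL and TOKEN values from src/configs/config.py.
import json
import urllib.request

REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."  # placeholder from the example above, not a real token


def build_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Construct (but do not send) a POST request for one user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        REMOTE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Actually sending the request requires a valid token:
# resp = urllib.request.urlopen(build_request("Summarize RAG in one line."))
```

The same bearer-token pattern applies to the embedding endpoint (`EMBED_REMOTE_URL` / `EMBED_TOKEN`).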
4. Workflow
Each run creates a unique result folder under outputs/, named by its task id: outputs/<task_id> (e.g., outputs/2025-06-18-0935_keyword/).
Run the full pipeline:
python tasks/offline_run.py --title "Your Survey Title" --key_words "keyword1, keyword2, ..." --ref_path "path/to/your/reference/dir"
Or run step by step:
export task_id="your_task_id"
python tasks/workflow/03_gen_outlines.py --task_id $task_id
python tasks/workflow/04_gen_content.py --task_id $task_id
python tasks/workflow/05_post_refine.py --task_id $task_id
python tasks/workflow/06_gen_latex.py --task_id $task_id
Note: Your local reference documents must be in Markdown (.md) format and placed in a single directory.
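Since the pipeline only accepts Markdown references, a quick validation of the `--ref_path` directory can save a failed run. This helper is a hedged sketch (`validate_ref_dir` is a hypothetical name, not a function in the repo):

```python
# Hypothetical helper (not part of SurveyX): confirm the reference directory
# exists and contains at least one Markdown (.md) file before launching
# tasks/offline_run.py.
from pathlib import Path


def validate_ref_dir(ref_path: str) -> list:
    """Return the Markdown references found, raising if the directory is unusable."""
    ref_dir = Path(ref_path)
    if not ref_dir.is_dir():
        raise FileNotFoundError(f"{ref_path} is not a directory")
    md_files = sorted(ref_dir.glob("*.md"))
    if not md_files:
        raise ValueError(f"No .md references found in {ref_path}")
    return md_files
```

For example, `validate_ref_dir("path/to/your/reference/dir")` returns the list of `.md` files the pipeline will see, or raises with a clear message if the folder is empty or missing.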
5. Output
All results are saved under outputs/<task_id>/:
- survey.pdf: final compiled survey
- outlines.json: generated outline
- latex/: LaTeX sources
- tmp/: intermediate files
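After a batch of runs, the newest result can be located programmatically. This is a small sketch under the folder layout described above (`latest_survey` is a hypothetical helper, not repo code); it relies on the date-prefixed task-id naming so that lexicographic order matches chronological order.

```python
# Sketch (assumption): find the most recent outputs/<task_id>/survey.pdf.
# Task folders are named with a date prefix (e.g. 2025-06-18-0935_keyword),
# so sorting folder names descending yields the newest run first.
from pathlib import Path
from typing import Optional


def latest_survey(outputs_dir: str = "outputs") -> Optional[Path]:
    """Return the newest outputs/<task_id>/survey.pdf, or None if absent."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return None
    for task_dir in sorted(root.iterdir(), reverse=True):
        pdf = task_dir / "survey.pdf"
        if pdf.is_file():
            return pdf
    return None
```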
Example Papers
| Title | Keywords |
| --- | --- |
| A Survey of NoSQL Database Systems for Flexible and Scalable Data Management | NoSQL, Database Systems, Flexibility, Scalability, Data Management |
| Vector Databases and Their Role in Modern Data Management and Retrieval A Survey | Vector Databases, Data Management, Data Retrieval, Modern Applications |
| Graph Databases A Survey on Models, Data Modeling, and Applications | Graph Databases, Data Modeling |
| A Survey on Large Language Model Integration with Databases for Enhanced Data Management and Survey Analysis | Large Language Models, Database Integration, Data Management, Survey Analysis, Enhanced Processing |
| A Survey of Temporal Databases Real-Time Databases and Data Management Systems | Temporal Databases, Real-Time Databases, Data Management |
| From BERT to GPT-4: A Survey of Architectural Innovations in Pre-trained Language Models | Transformer, BERT, GPT-3, self-attention, masked language modeling, cross-lingual transfer, model scaling |
| Unsupervised Cross-Lingual Word Embedding Alignment: Techniques and Applications | low-resource NLP, few-shot learning, data augmentation, unsupervised alignment, synthetic corpora, NLLB, zero-shot transfer |
| Vision-Language Pre-training: Architectures, Benchmarks, and Emerging Trends | multimodal learning, CLIP, Whisper, cross-modal retrieval, modality fusion, video-language models, contrastive learning |
| Efficient NLP at Scale: A Review of Model Compression Techniques | model compression, knowledge distillation, pruning, quantization, TinyBERT, edge computing, latency-accuracy tradeoff |
| Domain-Specific NLP: Adapting Models for Healthcare, Law, and Finance | domain adaptation, BioBERT, legal NLP, clinical text analysis, privacy-preserving NLP, terminology extraction, few-shot domain transfer |
| Attention Heads of Large Language Models: A Survey | attention head, attention mechanism, large language model, LLM, transformer architecture, neural networks, natural language processing |
| Controllable Text Generation for Large Language Models: A Survey | controlled text generation, text generation, large language model, LLM, natural language processing |
| A survey on evaluation of large language models | evaluation of large language models, large language models assessment, natural language processing, AI model evaluation |
| Large language models for generative information extraction: a survey | information extraction, large language models, LLM, natural language processing, generative AI, text mining |
| Internal consistency and self feedback of LLM | Internal consistency, self feedback, large language model, LLM, natural language processing, model evaluation, AI reliability |
| Review of Multi Agent Offline Reinforcement Learning | multi agent, offline policy, reinforcement learning, decentralized learning, cooperative agents, policy optimization |
| Reasoning of large language model: A survey | reasoning of large language models, large language models, LLM, natural language processing, AI reasoning, transformer models |
| Hierarchy Theorems in Computational Complexity: From Time-Space Tradeoffs to Oracle Separations | P vs NP, NP-completeness, polynomial hierarchy, space complexity, oracle separation, Cook-Levin theorem |
| Classical Simulation of Quantum Circuits: Complexity Barriers and Implications | BQP, quantum supremacy, Shor's algorithm, post-quantum cryptography, QMA, hidden subgroup problem |
| Kernelization: Theory, Techniques, and Limits | fixed-parameter tractable (FPT), kernelization, treewidth, W-hierarchy, ETH (Exponential Time Hypothesis), parameterized reduction |
| Optimal Inapproximability Thresholds for Combinatorial Optimization Problems | PCP theorem, approximation ratio, Unique Games Conjecture, APX-hardness, gap-preserving reduction, LP relaxation |
| [Hardness in P: When Poly
