CodeWiki
Open-source framework for holistic, structured repository-level documentation across multilingual codebases
Quick Start
1. Install CodeWiki
# Install from source
pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git
# Verify installation
codewiki --version
2. Configure Your Environment
CodeWiki supports multiple models via an OpenAI-compatible SDK layer.
codewiki config set \
--api-key YOUR_API_KEY \
--base-url https://api.anthropic.com \
--main-model claude-sonnet-4 \
--cluster-model claude-sonnet-4 \
--fallback-model glm-4p5
3. Generate Documentation
# Navigate to your project
cd /path/to/your/project
# Generate documentation
codewiki generate
# Generate with HTML viewer for GitHub Pages
codewiki generate --github-pages --create-branch
That's it! Your documentation will be generated in ./docs/ with comprehensive repository-level analysis.
What is CodeWiki?
CodeWiki is an open-source framework for automated repository-level documentation across eight programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.
Key Innovations
| Innovation | Description | Impact |
|------------|-------------|--------|
| Hierarchical Decomposition | Dynamic programming-inspired strategy that preserves architectural context | Handles codebases of arbitrary size (86K-1.4M LOC tested) |
| Recursive Agentic System | Adaptive multi-agent processing with dynamic delegation capabilities | Maintains quality while scaling to repository-level scope |
| Multi-Modal Synthesis | Generates textual documentation, architecture diagrams, data flows, and sequence diagrams | Comprehensive understanding from multiple perspectives |
Supported Languages
🐍 Python • ☕ Java • 🟨 JavaScript • 🔷 TypeScript • ⚙️ C • 🔧 C++ • 🪟 C# • 🎯 Kotlin
CLI Commands
Configuration Management
# Set up your API configuration
codewiki config set \
--api-key <your-api-key> \
--base-url <provider-url> \
--main-model <model-name> \
--cluster-model <model-name> \
--fallback-model <model-name>
# Configure max token settings
codewiki config set --max-tokens 32768 --max-token-per-module 36369 --max-token-per-leaf-module 16000
# Configure max depth for hierarchical decomposition
codewiki config set --max-depth 3
# Show current configuration
codewiki config show
# Validate your configuration
codewiki config validate
Documentation Generation
# Basic generation
codewiki generate
# Custom output directory
codewiki generate --output ./documentation
# Create git branch for documentation
codewiki generate --create-branch
# Generate HTML viewer for GitHub Pages
codewiki generate --github-pages
# Enable verbose logging
codewiki generate --verbose
# Full-featured generation
codewiki generate --create-branch --github-pages --verbose
Customization Options
CodeWiki supports customization for language-specific projects and documentation styles:
# C# project: only analyze .cs files, exclude test directories
codewiki generate --include "*.cs" --exclude "Tests,Specs,*.test.cs"
# Focus on specific modules with architecture-style docs
codewiki generate --focus "src/core,src/api" --doc-type architecture
# Add custom instructions for the AI agent
codewiki generate --instructions "Focus on public APIs and include usage examples"
Pattern Behavior (Important!)
- --include: When specified, ONLY these patterns are used (replaces defaults completely)
  - Example: --include "*.cs" will analyze ONLY .cs files
  - If omitted, all supported file types are analyzed
  - Supports glob patterns: *.py, src/**/*.ts, *.{js,jsx}
- --exclude: When specified, patterns are MERGED with default ignore patterns
  - Example: --exclude "Tests,Specs" will exclude these directories AND still exclude .git, __pycache__, node_modules, etc.
  - Default patterns include: .git, node_modules, __pycache__, *.pyc, bin/, dist/, and many more
  - Supports multiple formats:
    - Exact names: Tests, .env, config.local
    - Glob patterns: *.test.js, *_test.py, *.min.*
    - Directory patterns: build/, dist/, coverage/
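The replace-versus-merge rule described above can be illustrated with a small Python sketch. This is not CodeWiki's implementation: the default lists, the function names `effective_patterns` and `is_selected`, and the matching logic (plain `fnmatch` on path components) are all assumptions made for illustration, and path-aware globs such as src/**/*.ts are not handled here.

```python
from fnmatch import fnmatch

# Hypothetical defaults for illustration only; the real lists are longer.
DEFAULT_INCLUDE = ["*.py", "*.java", "*.js", "*.ts", "*.c", "*.cpp", "*.cs", "*.kt"]
DEFAULT_EXCLUDE = [".git", "node_modules", "__pycache__", "*.pyc", "bin/", "dist/"]


def effective_patterns(include=None, exclude=None):
    """Include patterns REPLACE the defaults; exclude patterns are MERGED with them."""
    inc = list(include) if include else list(DEFAULT_INCLUDE)
    exc = list(DEFAULT_EXCLUDE) + list(exclude or [])
    return inc, exc


def is_selected(path, include=None, exclude=None):
    """Return True if a '/'-separated relative path would be analyzed (simplified)."""
    inc, exc = effective_patterns(include, exclude)
    parts = path.split("/")
    for pattern in exc:
        name = pattern.rstrip("/")  # treat "build/" like "build"
        # A path is excluded if any of its components matches the pattern.
        if any(fnmatch(part, name) for part in parts):
            return False
    # Include patterns are checked against the file name only in this sketch.
    return any(fnmatch(parts[-1], pattern) for pattern in inc)
```

Under these assumptions, `is_selected("src/main.py", include=["*.cs"])` is false because the include list was replaced, while `is_selected("node_modules/x/index.js", exclude=["Tests"])` is also false because the default excludes still apply.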
Setting Persistent Defaults
Save your preferred settings as defaults:
# Set include patterns for C# projects
codewiki config agent --include "*.cs"
# Exclude test projects by default (merged with default excludes)
codewiki config agent --exclude "Tests,Specs,*.test.cs"
# Set focus modules
codewiki config agent --focus "src/core,src/api"
# Set default documentation type
codewiki config agent --doc-type architecture
# View current agent settings
codewiki config agent
# Clear all agent settings
codewiki config agent --clear
| Option | Description | Behavior | Example |
|--------|-------------|----------|---------|
| --include | File patterns to include | Replaces defaults | *.cs, *.py, src/**/*.ts |
| --exclude | Patterns to exclude | Merges with defaults | Tests,Specs, *.test.js, build/ |
| --focus | Modules to document in detail | Standalone option | src/core,src/api |
| --doc-type | Documentation style | Standalone option | api, architecture, user-guide, developer |
| --instructions | Custom agent instructions | Standalone option | Free-form text |
Token Settings
CodeWiki allows you to configure maximum token limits for LLM calls. This is useful for:
- Adapting to different model context windows
- Controlling costs by limiting response sizes
- Optimizing for faster response times
# Set max tokens for LLM responses (default: 32768)
codewiki config set --max-tokens 16384
# Set max tokens for module clustering (default: 36369)
codewiki config set --max-token-per-module 40000
# Set max tokens for leaf modules (default: 16000)
codewiki config set --max-token-per-leaf-module 20000
# Set max depth for hierarchical decomposition (default: 2)
codewiki config set --max-depth 3
# Override at runtime for a single generation
codewiki generate --max-tokens 16384 --max-token-per-module 40000 --max-depth 3
| Option | Description | Default |
|--------|-------------|---------|
| --max-tokens | Maximum output tokens for LLM response | 32768 |
| --max-token-per-module | Input tokens threshold for module clustering | 36369 |
| --max-token-per-leaf-module | Input tokens threshold for leaf modules | 16000 |
| --max-depth | Maximum depth for hierarchical decomposition | 2 |
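To build intuition for how a token threshold and a depth limit might interact, here is a minimal Python sketch. It is an assumption about the general idea, not CodeWiki's actual algorithm: the `Module` class, the `plan` function, and the stopping rule are hypothetical, and only the two default values (36369 and 2) come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical module node: a name, its own token count, and sub-modules."""
    name: str
    tokens: int = 0
    children: list = field(default_factory=list)

    def total_tokens(self) -> int:
        return self.tokens + sum(child.total_tokens() for child in self.children)


def plan(module, max_token_per_module=36369, max_depth=2, depth=0):
    """Return the names of the units that would each be handled in one pass.

    A module is split into its children only while it exceeds the token
    threshold, the depth limit has not been reached, and it has children.
    """
    fits = module.total_tokens() <= max_token_per_module
    if fits or depth >= max_depth or not module.children:
        return [module.name]
    units = []
    for child in module.children:
        units.extend(plan(child, max_token_per_module, max_depth, depth + 1))
    return units


repo = Module("repo", children=[
    Module("core", children=[Module("core.a", 30000), Module("core.b", 20000)]),
    Module("api", 10000),
])
print(plan(repo))               # splits "core" further at the default depth
print(plan(repo, max_depth=1))  # stops one level down
```

In this toy model, raising the depth limit yields smaller units for large modules, while a lower limit keeps oversized modules whole.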
Configuration Storage
- API keys: Securely stored in system keychain (macOS Keychain, Windows Credential Manager, Linux Secret Service)
- Settings & Agent Instructions: ~/.codewiki/config.json
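If you want to inspect the stored settings from a script, a sketch like the following may help. It assumes only that the file at the path above is valid JSON; its keys are not documented here, so the helper `load_settings` (a hypothetical name) returns whatever the file contains. API keys are not expected in this file, since they are kept in the system keychain.

```python
import json
from pathlib import Path

# Path taken from the section above; the file's schema is not assumed.
DEFAULT_CONFIG_PATH = Path.home() / ".codewiki" / "config.json"


def load_settings(path=DEFAULT_CONFIG_PATH):
    """Return the stored settings as a dict, or {} if the file does not exist."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    for key, value in load_settings().items():
        print(f"{key}: {value}")
```

For day-to-day use, `codewiki config show` remains the supported way to view the configuration.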
Documentation Output
Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.
Textual Documentation
- Repository overview with architecture guide
- Module-level documentation with API references
- Usage examples and implementation patterns
- Cross-module interaction analysis
Visual Artifacts
- System architecture diagrams (Mermaid)
- Data flow visualizations
- Dependency graphs and module relationships
- Sequence diagrams for complex interactions
Output Structure
./docs/
├── overview.md # Repository overview (start here!)
├── module1.md # Module documentation
├── module2.md # Additional modules...
├── module_tree.json # Hierarchical module structure
├── first_module_tree.json # Initial clustering result
├── metadata.json # Generation metadata
└── index.html # Interactive viewer (with --github-pages)
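A quick way to confirm that a run produced the files listed above is a small check script. This is a sketch, not part of CodeWiki: `missing_outputs` is a hypothetical helper, and it looks only for the fixed file names shown in the tree (per-module file names vary by repository).

```python
from pathlib import Path

# Fixed file names from the output tree above.
EXPECTED = ["overview.md", "module_tree.json", "first_module_tree.json", "metadata.json"]


def missing_outputs(docs_dir="./docs", github_pages=False):
    """Return the expected output files that are absent from docs_dir."""
    expected = EXPECTED + (["index.html"] if github_pages else [])
    root = Path(docs_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    print("All expected files present" if not missing else f"Missing: {missing}")
```

Pass `github_pages=True` when the run used `--github-pages`, so that index.html is checked as well.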
Experimental Results
CodeWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.
Performance by Language Category
| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|-------------------|---------------------|----------|-------------|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
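The Improvement column reads as an absolute difference in percentage points (CodeWiki score minus DeepWiki score), not a relative change. The short Python check below recomputes it from the two score columns; the function name `improvements` is ours, and the scores are copied from the table.

```python
# (CodeWiki, DeepWiki) scores in percent, copied from the table above.
scores = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
}


def improvements(table):
    """Return the CodeWiki-minus-DeepWiki difference in percentage points."""
    return {name: round(ours - theirs, 2) for name, (ours, theirs) in table.items()}


for name, delta in improvements(scores).items():
    print(f"{name}: {delta:+.2f} points")
```

The negative value for the Systems category shows that DeepWiki scored higher on C and C++ in this evaluation.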
