StageRAG
A blueprint for building production-ready RAG systems that minimize hallucination, featuring switchable 3-step (Speed) and 4-step (Precision) pipelines.
StageRAG: A Framework for Building Hallucination-Resistant RAG Applications
StageRAG is a lightweight, production-ready RAG framework designed to give you precise control over the speed-versus-accuracy trade-off. It allows you to build high-factuality applications while gracefully managing uncertainty in LLM responses.
🌟 Features
- Dual-Mode Pipelines: Dynamically switch between two processing modes based on your needs:
- Speed Mode: 3-step pipeline (1B + 3B models, ~3-5s response)
- Precision Mode: 4-step pipeline (3B model, ~6-12s response)
- Easy Knowledge Base Integration: Deploy with your own data by providing a JSONL file in the standard conversation format. The system automatically builds vector indices and handles retrieval.
- Built-in Confidence Scoring: Every answer includes multi-component confidence evaluation (retrieval quality, answer structure, relevance, uncertainty detection). Programmatically handle low-confidence responses to reduce hallucinations.
- Optimized for Smaller Models: Built on Llama 3.2 1B and 3B models with 4-bit quantization support, requiring only 5-10GB GPU memory while maintaining quality.
📋 Prerequisites
1. Get Llama Model Access
You must request access to both Llama models:
- Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- Visit https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- Click "Access gated model" and accept the license
- Wait for approval (usually instant)
2. Login to HuggingFace
pip install huggingface-hub
huggingface-cli login
# Enter your HuggingFace token when prompted
Get your token from: https://huggingface.co/settings/tokens
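If you prefer to authenticate from Python instead of the CLI, the huggingface_hub library provides a login helper (it prompts for the same token):

from huggingface_hub import login

# Prompts for your HuggingFace token; alternatively pass it explicitly, e.g. login(token="hf_...")
login()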
3. System Requirements
- Python >= 3.8
- CUDA-capable GPU (recommended) or CPU (a quick environment check is sketched below)
- 5GB+ RAM for 4-bit mode, 10GB+ for full precision
- Internet connection for initial model download
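To quickly check whether your machine meets these requirements, you can query PyTorch for GPU availability and memory (a minimal sketch; it assumes PyTorch is already installed, which the model stack requires):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB memory")
else:
    print("No CUDA GPU detected; expect to run on CPU (slower).")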
🚀 Installation
Clone Repository
git clone https://github.com/darrencxl0301/StageRAG.git
cd StageRAG
Install Dependencies
# Install main dependencies
pip install -r requirements.txt
# Install the package in development (editable) mode
pip install -e .
# Alternatively, install via setup.py
python setup.py install
📊 Download Sample Dataset
Option 1: Automatic Download (Recommended)
python scripts/download_data.py
This downloads the sample dataset from darren0301/domain-mix-qa-1k to data/data.jsonl.
Option 2: Manual Download
from datasets import load_dataset
import json
dataset = load_dataset("darren0301/domain-mix-qa-1k")
with open("data/data.jsonl", "w") as f:
    for item in dataset["train"]:
        json.dump({"conversations": item["conversations"]}, f)
        f.write("\n")
Option 3: Use Your Own Data
Create a JSONL file with this format:
{"conversations": [{"role": "user", "content": "What is EPF?"}, {"role": "assistant", "content": "EPF is the Employees Provident Fund..."}]}
{"conversations": [{"role": "user", "content": "How to apply for leave?"}, {"role": "assistant", "content": "To apply for leave..."}]}
💻 Usage
Interactive Chat Demo
# Basic usage (CPU)
python demo/interactive_demo.py --rag_dataset data/data.jsonl
# With GPU and 4-bit quantization (recommended)
python demo/interactive_demo.py --rag_dataset data/data.jsonl --use_4bit --device cuda
Interactive Commands:
- mode speed - Switch to speed mode (3-step)
- mode precision - Switch to precision mode (4-step)
- cache stats - View cache performance
- search <query> - Test RAG retrieval
- quit or q - Exit
Basic Usage Example
python demo/basic_usage.py --rag_dataset data/data.jsonl
Programmatic Usage
from stagerag import StageRAGSystem
import argparse
# Setup configuration
args = argparse.Namespace(
    rag_dataset='data/data.jsonl',
    device='cuda',
    use_4bit=True,
    cache_size=1000,
    temperature=0.7,
    top_p=0.85,
    max_new_tokens=512,
    max_seq_len=2048,
    disable_rag=False,
    rag_threshold=0.3,
    seed=42
)
# Initialize system
system = StageRAGSystem(args)
# Process query
result = system.process_query(
    "What are the EPF contribution rates?",
    mode="speed"
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']['overall_confidence']:.3f}")
print(f"Time: {result['processing_time']:.2f}s")
🧪 Testing
# Install test dependencies
pip install pytest pytest-cov
# Run all tests
pytest tests/ -v
# Run specific test files
pytest tests/test_cache.py -v
pytest tests/test_confidence.py -v
pytest tests/test_rag.py -v
# Run with detailed output
pytest tests/test_cache.py -vv
# Run with coverage report
pytest tests/ --cov=stagerag --cov-report=html
⚙️ Configuration
Command Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| --rag_dataset | Required | Path to JSONL knowledge base |
| --device | cuda | Device to use (cuda/cpu) |
| --use_4bit | False | Enable 4-bit quantization |
| --cache_size | 1000 | LRU cache size |
| --temperature | 0.7 | Sampling temperature (0.0-1.0) |
| --top_p | 0.85 | Top-p nucleus sampling |
| --max_new_tokens | 512 | Max tokens to generate |
| --disable_rag | False | Disable RAG retrieval |
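For example, the following invocation combines several of the documented flags for a GPU deployment with more conservative sampling (the specific values are illustrative, not recommended defaults):

python demo/interactive_demo.py --rag_dataset data/data.jsonl --device cuda --use_4bit --cache_size 2000 --temperature 0.3 --top_p 0.9 --max_new_tokens 256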
Confidence Weights
Edit stagerag/config.py to adjust confidence evaluation:
weights = {
    'retrieval': 0.25,      # RAG retrieval quality
    'basic_quality': 0.25,  # Answer structure/length
    'relevance': 0.25,      # Keyword relevance
    'uncertainty': 0.25     # Uncertainty detection
}
📁 Project Structure
StageRAG/
├── stagerag/ # Main package
│ ├── __init__.py # Package exports
│ ├── main.py # StageRAGSystem class
│ ├── cache.py # LRU cache implementation
│ ├── confidence.py # Confidence evaluator
│ ├── rag.py # RAG retrieval system
│ ├── prompts.py # Prompt templates
│ └── config.py # Configuration dataclasses
├── demo/ # Usage examples
│ ├── interactive_demo.py
│ └── basic_usage.py
├── scripts/ # Utility scripts
│ └── download_data.py # HuggingFace dataset downloader
├── tests/ # Test suite
│ ├── test_cache.py
│ ├── test_confidence.py
│ └── test_rag.py
├── data/ # Knowledge base (created on first run)
│ └── data.jsonl
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── setup.py # Package configuration
└── README.md
🎯 Architecture
Speed Mode (3-step Pipeline)
User Input → [1B] Normalize → [3B] RAG Filter → [1B] Generate Answer → Response
Precision Mode (4-step Pipeline)
User Input → [1B] Normalize → [3B] RAG Retrieve → [3B] Synthesize → [3B] Final Answer → Response
📊 Performance Benchmarks
| Mode | Avg Time | Avg Confidence | Use Case |
|------|----------|----------------|----------|
| Speed | 3.3s | 0.72 | Real-time chat |
| Precision | 7.8s | 0.83 | Complex queries, critical decisions |
Tested on NVIDIA RTX 3090 GPU with 4-bit quantization
📦 Dataset
Sample dataset: darren0301/domain-mix-qa-1k
Contains 1,000 domain-specific Q&A pairs covering:
- Logical & Mathematical Reasoning
- Specialized Medical Domain Knowledge
- Open-Ended General Instruction Following
- Employee benefits information
- Practical, Real-World Q&A
🤝 Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
📚 Citation
@software{stagerag2024,
  author = {Darren Chai Xin Lun},
  title = {StageRAG: A Framework for Building Hallucination-Resistant RAG Applications},
  year = {2024},
  url = {https://github.com/darrencxl0301/StageRAG},
  note = {Dataset: https://huggingface.co/datasets/darren0301/domain-mix-qa-1k}
}
🙏 Acknowledgments
- Built with Llama 3.2 models by Meta
- FAISS for vector similarity search
- Sentence Transformers for embeddings
- HuggingFace for model hosting
📧 Contact
Darren Chai Xin Lun
- GitHub: @darrencxl0301
- HuggingFace: @darren0301
⭐ If you find this project helpful, please give it a star!
