StageRAG
A blueprint for building production-ready RAG systems that minimize hallucination, featuring switchable 3-step (Speed) and 4-step (Precision) pipelines.
StageRAG: A Framework for Building Hallucination-Resistant RAG Applications
StageRAG is a lightweight, production-ready RAG framework designed to give you precise control over the speed-versus-accuracy trade-off. It allows you to build high-factuality applications while gracefully managing uncertainty in LLM responses.
🌟 Features
- Dual-Mode Pipelines: Dynamically switch between two processing modes based on your needs:
- Speed Mode: 3-step pipeline (1B + 3B models, ~3-5s response)
- Precision Mode: 4-step pipeline (3B model, ~6-12s response)
- Easy Knowledge Base Integration: Deploy with your own data by providing a JSONL file in the standard conversation format. The system automatically builds vector indices and handles retrieval.
- Built-in Confidence Scoring: Every answer includes multi-component confidence evaluation (retrieval quality, answer structure, relevance, uncertainty detection). Programmatically handle low-confidence responses to reduce hallucinations.
- Optimized for Smaller Models: Built on Llama 3.2 1B and 3B models with 4-bit quantization support, requiring only 5-10GB GPU memory while maintaining quality.
📋 Prerequisites
1. Get Llama Model Access
You must request access to both Llama models:
- Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- Visit https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- Click "Access gated model" and accept the license
- Wait for approval (usually instant)
2. Login to HuggingFace
pip install huggingface-hub
huggingface-cli login
# Enter your HuggingFace token when prompted
Get your token from: https://huggingface.co/settings/tokens
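If you prefer to authenticate from Python instead of the CLI, the huggingface_hub library provides a login helper (it prompts for the same token):

from huggingface_hub import login

# Prompts for your HuggingFace token; alternatively pass it explicitly, e.g. login(token="hf_...")
login()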
3. System Requirements
- Python >= 3.8
- CUDA-capable GPU (recommended) or CPU (a quick environment check is sketched below)
- 5GB+ RAM for 4-bit mode, 10GB+ for full precision
- Internet connection for initial model download
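To quickly check whether your machine meets these requirements, you can query PyTorch for GPU availability and memory (a minimal sketch; it assumes PyTorch is already installed, which the model stack requires):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB memory")
else:
    print("No CUDA GPU detected; expect to run on CPU (slower).")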
🚀 Installation
Clone Repository
git clone https://github.com/darrencxl0301/StageRAG.git
cd StageRAG
Install Dependencies
# Install main dependencies
pip install -r requirements.txt
# Install the package in development (editable) mode
pip install -e .
# Alternatively, install via setup.py
python setup.py install
📊 Download Sample Dataset
Option 1: Automatic Download (Recommended)
python scripts/download_data.py
This downloads the sample dataset from darren0301/domain-mix-qa-1k to data/data.jsonl.
Option 2: Manual Download
from datasets import load_dataset
import json
dataset = load_dataset("darren0301/domain-mix-qa-1k")
with open("data/data.jsonl", "w") as f:
    for item in dataset["train"]:
        json.dump({"conversations": item["conversations"]}, f)
        f.write("\n")
Option 3: Use Your Own Data
Create a JSONL file with this format:
{"conversations": [{"role": "user", "content": "What is EPF?"}, {"role": "assistant", "content": "EPF is the Employees Provident Fund..."}]}
{"conversations": [{"role": "user", "content": "How to apply for leave?"}, {"role": "assistant", "content": "To apply for leave..."}]}
💻 Usage
Interactive Chat Demo
# Basic usage (CPU)
python demo/interactive_demo.py --rag_dataset data/data.jsonl
# With GPU and 4-bit quantization (recommended)
python demo/interactive_demo.py --rag_dataset data/data.jsonl --use_4bit --device cuda
Interactive Commands:
- mode speed - Switch to speed mode (3-step)
- mode precision - Switch to precision mode (4-step)
- cache stats - View cache performance
- search <query> - Test RAG retrieval
- quit or q - Exit
Basic Usage Example
python demo/basic_usage.py --rag_dataset data/data.jsonl
Programmatic Usage
from stagerag import StageRAGSystem
import argparse
# Setup configuration
args = argparse.Namespace(
    rag_dataset='data/data.jsonl',
    device='cuda',
    use_4bit=True,
    cache_size=1000,
    temperature=0.7,
    top_p=0.85,
    max_new_tokens=512,
    max_seq_len=2048,
    disable_rag=False,
    rag_threshold=0.3,
    seed=42
)
# Initialize system
system = StageRAGSystem(args)
# Process query
result = system.process_query(
    "What are the EPF contribution rates?",
    mode="speed"
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']['overall_confidence']:.3f}")
print(f"Time: {result['processing_time']:.2f}s")
🧪 Testing
# Install test dependencies
pip install pytest pytest-cov
# Run all tests
pytest tests/ -v
# Run specific test files
pytest tests/test_cache.py -v
pytest tests/test_confidence.py -v
pytest tests/test_rag.py -v
# Run with detailed output
pytest tests/test_cache.py -vv
# Run with coverage report
pytest tests/ --cov=stagerag --cov-report=html
⚙️ Configuration
Command Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| --rag_dataset | Required | Path to JSONL knowledge base |
| --device | cuda | Device to use (cuda/cpu) |
| --use_4bit | False | Enable 4-bit quantization |
| --cache_size | 1000 | LRU cache size |
| --temperature | 0.7 | Sampling temperature (0.0-1.0) |
| --top_p | 0.85 | Top-p nucleus sampling |
| --max_new_tokens | 512 | Max tokens to generate |
| --disable_rag | False | Disable RAG retrieval |
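For example, the following invocation combines several of the documented flags for a GPU deployment with more conservative sampling (the specific values are illustrative, not recommended defaults):

python demo/interactive_demo.py --rag_dataset data/data.jsonl --device cuda --use_4bit --cache_size 2000 --temperature 0.3 --top_p 0.9 --max_new_tokens 256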
Confidence Weights
Edit stagerag/config.py to adjust confidence evaluation:
weights = {
    'retrieval': 0.25,      # RAG retrieval quality
    'basic_quality': 0.25,  # Answer structure/length
    'relevance': 0.25,      # Keyword relevance
    'uncertainty': 0.25     # Uncertainty detection
}
📁 Project Structure
StageRAG/
├── stagerag/ # Main package
│ ├── __init__.py # Package exports
│ ├── main.py # StageRAGSystem class
│ ├── cache.py # LRU cache implementation
│ ├── confidence.py # Confidence evaluator
│ ├── rag.py # RAG retrieval system
│ ├── prompts.py # Prompt templates
│ └── config.py # Configuration dataclasses
├── demo/ # Usage examples
│ ├── interactive_demo.py
│ └── basic_usage.py
├── scripts/ # Utility scripts
│ └── download_data.py # HuggingFace dataset downloader
├── tests/ # Test suite
│ ├── test_cache.py
│ ├── test_confidence.py
│ └── test_rag.py
├── data/ # Knowledge base (created on first run)
│ └── data.jsonl
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── setup.py # Package configuration
└── README.md
🎯 Architecture
Speed Mode (3-step Pipeline)
User Input → [1B] Normalize → [3B] RAG Filter → [1B] Generate Answer → Response
Precision Mode (4-step Pipeline)
User Input → [1B] Normalize → [3B] RAG Retrieve → [3B] Synthesize → [3B] Final Answer → Response
📊 Performance Benchmarks
| Mode | Avg Time | Avg Confidence | Use Case |
|------|----------|----------------|----------|
| Speed | 3.3s | 0.72 | Real-time chat |
| Precision | 7.8s | 0.83 | Complex queries, critical decisions |
Tested on NVIDIA RTX 3090 GPU with 4-bit quantization
📦 Dataset
Sample dataset: darren0301/domain-mix-qa-1k
Contains 1,000 domain-specific Q&A pairs covering:
- Logical & Mathematical Reasoning
- Specialized Medical Domain Knowledge
- Open-Ended General Instruction Following
- Employee benefits information
- Practical, Real-World Q&A
🤝 Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
📚 Citation
@software{stagerag2024,
  author = {Darren Chai Xin Lun},
  title = {StageRAG: A Framework for Building Hallucination-Resistant RAG Applications},
  year = {2024},
  url = {https://github.com/darrencxl0301/StageRAG},
  note = {Dataset: https://huggingface.co/datasets/darren0301/domain-mix-qa-1k}
}
🙏 Acknowledgments
- Built with Llama 3.2 models by Meta
- FAISS for vector similarity search
- Sentence Transformers for embeddings
- HuggingFace for model hosting
📧 Contact
Darren Chai Xin Lun
- GitHub: @darrencxl0301
- HuggingFace: @darren0301
⭐ If you find this project helpful, please give it a star!
