VectorSmuggle
Testing platform for covert data exfiltration techniques where sensitive documents are embedded into vector representations and tunneled out under the guise of legitimate RAG operations — bypassing traditional security controls and evading detection through semantic obfuscation.
"The smuggle is real!"
A proof of concept demonstrating vector-based data exfiltration techniques in AI/ML environments. This project illustrates potential risks in RAG systems and provides tools and concepts for defensive analysis.
📋 Overview
VectorSmuggle demonstrates techniques for covert data exfiltration through vector embeddings, showcasing how sensitive information can be hidden within seemingly legitimate RAG operations. This research tool helps security professionals understand and defend against attack vectors in AI/ML systems.
Key Features
- 🎭 Steganographic Techniques: Embedding obfuscation and data hiding
- 📄 Multi-Format Support: Process 15+ document formats (PDF, Office, email, databases)
- 🕵️ Evasion Capabilities: Behavioral camouflage and detection avoidance
- 🔍 Enhanced Query Engine: Data reconstruction and analysis
- 🐳 Production-Ready: Full containerization and Kubernetes deployment
- 📊 Analysis Tools: Comprehensive forensic and risk assessment capabilities
🏗️ Architecture
graph TB
A[Document Sources] --> B[Multi-Format Loaders]
B --> C[Content Preprocessors]
C --> D[Steganography Engine]
D --> E[Evasion Layer]
E --> F[Vector Stores]
F --> G[Enhanced Query Engine]
G --> H[Analysis & Recovery Tools]
subgraph "Core Modules"
B
C
D
E
G
H
end
subgraph "External Services"
F
I[OpenAI API]
J[Monitoring Systems]
end
🚀 Quick Start
Prerequisites
- Python 3.11+
- OpenAI API key (or Ollama with nomic-embed-text:latest as fallback)
- Docker (optional)
- Kubernetes cluster (optional)
Installation
# Clone repository
git clone https://github.com/jaschadub/VectorSmuggle.git
cd VectorSmuggle
# Set up virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your API keys and settings
Basic Usage
# Embed documents with steganographic techniques
python scripts/embed.py --files sample_docs/*.pdf --techniques noise,rotation,fragmentation
# Query and reconstruct data
python scripts/query.py --mode recovery --export results.json
Interactive Demo
For a comprehensive demonstration of VectorSmuggle's capabilities, try the interactive quickstart demo:
# Run the complete workflow demonstration
cd examples
python quickstart_demo.py
# With deterministic results
python quickstart_demo.py --seed 42
# Test specific techniques
python quickstart_demo.py --techniques noise rotation fragmentation
The quickstart demo covers:
- ✅ End-to-end workflow: Document loading → Steganographic embedding → Vector storage → Query reconstruction
- ✅ Multiple techniques: Noise injection, rotation, scaling, and fragmentation across models
- ✅ Real sample data: Processes documents from sample_docs/ (financial, HR, technical files)
- ✅ Integrity verification: Validates successful encoding and decoding of hidden data
- ✅ Performance metrics: Shows processing times, success rates, and data statistics
Expected runtime: 10-30 seconds | Sample output: 6 documents → 45 chunks → 45 steganographic embeddings
See examples/README.md for detailed setup instructions, troubleshooting, and expected outputs.
📚 Documentation
Research Documentation
- 📖 Research Methodology - Research approach and validation
- ⚔️ Attack Vectors - Comprehensive attack analysis
- 🛡️ Defense Strategies - Countermeasures and detection
- ⚖️ Compliance Impact - Regulatory implications
- 🔬 Vector-Payload Dissociation - Dissociation technique analysis
Technical Documentation
- 🏗️ System Architecture - Design and components
- 📋 API Reference - Module documentation
- ⚙️ Configuration Guide - Setup and options
- 🔧 Troubleshooting - Common issues
- 🗄️ Multi-Database Architecture - Multi-DB testing design
Usage Guides
- 🚀 Quick Start Guide - Getting started
- 🎯 Advanced Usage - Complex scenarios
- 🔒 Security Testing - Testing procedures
- 🚢 Deployment Guide - Production deployment
- 🧪 Large-Scale Testing - Large-scale validation framework
- 🗄️ Multi-Database Setup - Multi-DB testing setup
- 🔀 Payload Dissociation Testing - Dissociation test guide
🔧 Core Components
Steganographic Engine
Advanced techniques for hiding data within vector embeddings:
from steganography import EmbeddingObfuscator, MultiModelFragmenter
# Apply noise-based steganography
obfuscator = EmbeddingObfuscator(noise_level=0.01)
hidden_embeddings = obfuscator.obfuscate(embeddings, techniques=["noise", "rotation"])
# Fragment across multiple models
fragmenter = MultiModelFragmenter()
fragments = fragmenter.fragment_and_embed(sensitive_data)
Multi-Format Document Processing
Support for diverse document types:
from loaders import DocumentLoaderFactory
factory = DocumentLoaderFactory()
documents = factory.load_documents([
"financial_report.pdf",
"employee_data.xlsx",
"emails.mbox",
"database_export.sqlite"
])
Evasion Capabilities
Sophisticated detection avoidance:
from evasion import BehavioralCamouflage, TrafficMimicry
# Simulate legitimate user behavior
camouflage = BehavioralCamouflage(legitimate_ratio=0.8)
camouflage.generate_cover_story("data analysis project")
# Mimic normal traffic patterns (the await below must run inside an async function)
mimicry = TrafficMimicry(base_interval=300.0)
await mimicry.execute_with_timing(upload_operation)
Enhanced Query Engine
Advanced data reconstruction:
from query import AdvancedQueryEngine, DataRecoveryTools
engine = AdvancedQueryEngine(vector_store, llm, embeddings)
recovery = DataRecoveryTools(embeddings)
# Multi-strategy search and reconstruction
results = engine.multi_strategy_search("sensitive financial data")
reconstructed = recovery.recover_data(results)
🔍 Analysis Tools
Risk Assessment
Comprehensive security risk evaluation:
from analysis.risk_assessment import VectorExfiltrationRiskAssessor
assessor = VectorExfiltrationRiskAssessor()
assessment = assessor.perform_comprehensive_assessment(
documents, embeddings, config
)
print(f"Risk Level: {assessment.overall_risk_level}")
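To make the idea of a risk level concrete, the following is a minimal sketch of one way a document set could be scored by counting sensitive-data patterns. It is not the `VectorExfiltrationRiskAssessor` implementation: the patterns, weights, thresholds, and the `score_documents` and `risk_level` helpers are all illustrative assumptions.

```python
import re

# Hypothetical patterns and weights, chosen for illustration only.
SENSITIVE_PATTERNS = {
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 5),
    "email": (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), 1),
    "api_key": (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), 8),
}


def score_documents(texts):
    """Return (total_score, per-pattern match counts) for a list of strings."""
    counts = {name: 0 for name in SENSITIVE_PATTERNS}
    total = 0
    for text in texts:
        for name, (pattern, weight) in SENSITIVE_PATTERNS.items():
            hits = len(pattern.findall(text))
            counts[name] += hits
            total += hits * weight
    return total, counts


def risk_level(total_score):
    """Map a numeric score to a coarse label (thresholds are assumptions)."""
    if total_score >= 10:
        return "HIGH"
    if total_score >= 3:
        return "MEDIUM"
    return "LOW"


if __name__ == "__main__":
    total, counts = score_documents(["SSN 123-45-6789, contact a@b.com"])
    print(risk_level(total), counts)
```

A real assessment would likely weigh more signals than pattern counts, such as document classification labels or where the embeddings are sent.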
Forensic Analysis
Digital forensics for incident investigation:
from analysis.forensic_tools import EvidenceCollector, TimelineReconstructor
collector = EvidenceCollector()
evidence = collector.collect_vector_store_evidence(vector_data)
reconstructor = TimelineReconstructor()
timeline = reconstructor.reconstruct_timeline(evidence)
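As a rough illustration of what timeline reconstruction involves, the sketch below sorts evidence records by time and groups them into bursts of activity separated by quiet gaps. The record shape (a dict with an ISO 8601 `timestamp` and an `action`), the five-minute gap, and the standalone `reconstruct_timeline` function are assumptions for this example and do not reflect the project's `TimelineReconstructor` internals.

```python
from datetime import datetime, timedelta


def reconstruct_timeline(events, burst_gap=timedelta(minutes=5)):
    """Sort events chronologically and group them into bursts.

    A new burst starts whenever the gap since the previous event
    exceeds `burst_gap`.
    """
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    bursts = []
    for event in ordered:
        ts = datetime.fromisoformat(event["timestamp"])
        if bursts and ts - bursts[-1]["end"] <= burst_gap:
            # Close enough to the previous event: extend the current burst.
            bursts[-1]["events"].append(event)
            bursts[-1]["end"] = ts
        else:
            bursts.append({"start": ts, "end": ts, "events": [event]})
    return bursts
```

Bursts of vector-store writes outside normal working patterns are one thing an investigator might look for in the grouped output.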
Detection Signatures
Generate security detection rules:
from analysis.detection_signatures import StatisticalSignatureGenerator
generator = StatisticalSignatureGenerator()
generator.establish_baseline(clean_embeddings)
signatures = generator.generate_statistical_signatures()
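The baseline-then-signature idea can be sketched with the standard library alone. The example below builds a baseline from the L2 norms of known-clean embeddings and flags vectors whose norm deviates by more than a z-score threshold. The function names, the choice of norm as the only statistic, and the threshold of 3.0 are assumptions for illustration, not the behavior of `StatisticalSignatureGenerator`.

```python
import math
import statistics


def l2_norm(vector):
    """Euclidean length of a vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in vector))


def build_norm_baseline(clean_embeddings):
    """Return (mean, sample standard deviation) of norms; needs >= 2 vectors."""
    norms = [l2_norm(v) for v in clean_embeddings]
    return statistics.mean(norms), statistics.stdev(norms)


def z_score(value, mean, std):
    """Absolute z-score, treating any deviation from a zero-spread baseline as infinite."""
    if std == 0:
        return 0.0 if value == mean else math.inf
    return abs(value - mean) / std


def flag_outliers(embeddings, baseline, z_threshold=3.0):
    """Return indices of embeddings whose norm is far from the baseline."""
    mean, std = baseline
    return [
        index
        for index, vector in enumerate(embeddings)
        if z_score(l2_norm(vector), mean, std) > z_threshold
    ]
```

A norm check on its own would miss norm-preserving transforms such as rotation, so a practical signature set would need additional statistics (for example per-dimension distributions or nearest-neighbor distances).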
Baseline Generation
Create legitimate traffic patterns:
from analysis.baseline_generator import BaselineDatasetGenerator
generator = BaselineDatasetGenerator()
dataset = generator.generate_baseline_dataset(
num_users=50, days=7
)
🐳 Deployment
Docker Deployment
# Development environment
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d
# Production environment
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
Kubernetes Deployment
# Deploy to Kubernetes
kubectl apply -f k8s/ -n vectorsmuggle
# Check deployment status
kubectl get pods -n vectorsmuggle
kubectl rollout status deployment/vectorsmuggle -n vectorsmuggle
Automated Deployment
# Full deployment with monitoring
./scripts/deploy/deploy.sh --environment production --platform kubernetes --build
# Health check and validation
./scripts/deploy/health-check.sh --detailed --export health-report.json
⚙️ Configuration
Environment Variables
# Core settings
OPENAI_API_KEY=sk-...
VECTOR_DB=qdrant
CHUNK_SIZE=512
# Embedding fallback settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text:latest
# Steganography settings
STEGO_ENABLED=true
STEGO_TECHNIQUES=noise,rotation,fragmentation
STEGO_NOISE_LEVEL=0.01
# Evasion settings
EVASION_TRAFFIC_MIMICRY=true
EVASION_BEHAVIORAL_CAMOUFLAGE=true
EVASION_LEGITIMATE_RATIO=0.8
# Query settings
QUERY_CACHE_ENABLED=true
QUERY_MULTI_STEP_REASONING=true
QUERY_CONTEXT_RECONSTRUCTION=true
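The variables above are plain strings, so booleans, numbers, and comma-separated lists need parsing before use. The sketch below shows one way to load a subset of them into a typed object. The `Settings` class, the `load_settings` helper, and the default values are assumptions for illustration; the project's actual configuration loader may differ.

```python
import os
from dataclasses import dataclass, field

TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass
class Settings:
    # Defaults mirror the example values above; they are assumptions.
    vector_db: str = "qdrant"
    chunk_size: int = 512
    ollama_base_url: str = "http://localhost:11434"
    stego_techniques: list = field(default_factory=list)
    stego_noise_level: float = 0.01
    query_cache_enabled: bool = True


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    defaults = Settings()
    # Split the comma-separated list and drop empty entries.
    techniques = [t.strip() for t in env.get("STEGO_TECHNIQUES", "").split(",") if t.strip()]
    chunk_size = int(env.get("CHUNK_SIZE", defaults.chunk_size))
    if chunk_size <= 0:
        raise ValueError("CHUNK_SIZE must be a positive integer")
    cache_flag = str(env.get("QUERY_CACHE_ENABLED", "true")).strip().lower()
    return Settings(
        vector_db=env.get("VECTOR_DB", defaults.vector_db),
        chunk_size=chunk_size,
        ollama_base_url=env.get("OLLAMA_BASE_URL", defaults.ollama_base_url),
        stego_techniques=techniques,
        stego_noise_level=float(env.get("STEGO_NOISE_LEVEL", defaults.stego_noise_level)),
        query_cache_enabled=cache_flag in TRUE_VALUES,
    )
```

Passing a dict instead of reading `os.environ` directly keeps the parser easy to test.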
Embedding Model Fallback
VectorSmuggle includes automatic fallback support for embedding models:
- Primary: OpenAI embeddings (requires API key)
- Fallback: Ollama with nomic-embed-text:latest (local)
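The README does not say how the fallback is triggered. One plausible shape, sketched below, is to try the primary embedder and switch to the fallback when the primary is missing or raises. The `embed_with_fallback` helper and the callable-based interface are hypothetical; in practice `primary` and `fallback` would wrap the OpenAI and Ollama clients respectively.

```python
def embed_with_fallback(texts, primary, fallback):
    """Embed `texts` with `primary`, falling back if it is unavailable.

    `primary` may be None (for example when no API key is configured).
    Returns (embeddings, source) where source is "primary" or "fallback".
    """
    if primary is not None:
        try:
            return primary(texts), "primary"
        except Exception:
            # Broad catch is deliberate in this sketch; real code would
            # narrow it to network and authentication errors and log it.
            pass
    return fallback(texts), "fallback"


if __name__ == "__main__":
    def failing_primary(texts):
        raise RuntimeError("simulated API outage")

    def local_fallback(texts):
        # Stand-in embedder: one dummy two-dimensional vector per text.
        return [[float(len(t)), 0.0] for t in texts]

    vectors, source = embed_with_fallback(["hello"], failing_primary, local_fallback)
    print(source, vectors)
```

Note that the two models generally produce vectors of different dimensions, so embeddings from the primary and the fallback should not be mixed in the same collection.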
