# Skwaq

Skwaq (pronounced *skwak*) is a self-improving, multi-agent AI system for researching software vulnerabilities.
Website | Benchmark Progress | Specification | Code Atlas
Skwaq uses a team of 18 specialized AI agents to investigate source code and binaries for security vulnerabilities. It builds a code property graph in LadybugDB, traces how untrusted user input propagates through code (taint analysis), and uses multi-agent debate to reason about exploitability. The agents are powered by RustyClawd, a Rust-based agentic LLM framework.
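The taint-analysis step can be pictured as reachability from untrusted input sources to dangerous sinks over the program graph. A minimal sketch, assuming a toy call graph with hypothetical function names (plain BFS; this is an illustration of the general technique, not Skwaq's actual engine):

```python
from collections import deque

def tainted_paths(call_graph, sources, sinks):
    """Return sinks reachable from any taint source (BFS over calls)."""
    seen, queue, hits = set(sources), deque(sources), []
    while queue:
        fn = queue.popleft()
        if fn in sinks:
            hits.append(fn)
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return hits

# Hypothetical call graph: untrusted input enters at read_request.
graph = {
    "read_request": ["parse_header"],
    "parse_header": ["copy_field"],
    "copy_field": ["strcpy"],   # reaches a dangerous sink
    "log_stats": ["printf"],    # a sink, but not reachable from input
}
print(tainted_paths(graph, ["read_request"], {"strcpy", "printf"}))  # ['strcpy']
```

In the real pipeline the agents reason over a much richer code property graph, but the source-to-sink framing is the same.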
What makes it unique: skwaq improves itself. A built-in benchmark harness (Skwaq Gym) measures detection accuracy against 6 industry benchmarks, and a self-improvement loop uses AI agents to analyze their own failures and propose better investigation strategies — with an overfitting-reviewer agent that rejects ~66% of proposals to prevent building to the benchmark.
The name comes from the Lushootseed word for Raven — the trickster who reveals hidden truths.
## Quick Start

```bash
# Analyze a binary
skwaq ingest binary /usr/bin/target
skwaq analyze --quick
skwaq report --sarif

# Check binary hardening
skwaq checksec /usr/bin/target

# View findings
skwaq viz findings
skwaq report --json
```
## Install

### From Source

```bash
git clone https://github.com/rysweet/skwaq
cd skwaq
cargo build --release
# Binary at ./target/release/skwaq
```
### Prerequisites

- Rust 1.70+ (for building)
- Ghidra (optional, for decompilation): set `GHIDRA_INSTALL_DIR`
- Python 3.10+ (optional, for angr symbolic execution)
- Semgrep (optional, for pattern matching): `pip install semgrep`

Run `skwaq doctor` to check what's available.
## Commands

### Ingestion

```bash
skwaq ingest binary <path>   # Ingest ELF/PE binary
skwaq ingest source <path>   # Ingest source code (coming soon)
```

### Binary Inspection

```bash
skwaq checksec <binary>   # Binary hardening assessment
skwaq strings <binary>    # Extract printable strings
skwaq symbols <binary>    # List symbols and imports
skwaq surface             # Show attack surface
skwaq xrefs <function>    # Cross-references
```

### Analysis

```bash
skwaq analyze --quick                # Pattern detection + taint analysis
skwaq analyze --investigation <id>   # Analyze a specific investigation
skwaq agents list                    # List installed agents and their role cards
```
`skwaq analyze --quick` prints a SEMANTIC column for discovered and final findings. This surfaces stable vulnerability classes such as `buffer_overflow`, `format_string`, and `command_injection`, even when later cycles challenge the initial coarse finding.

`skwaq agents list` includes each agent's structured role title and any declared output schema, which is useful for verifying which specialization cards and schema-backed contracts are active in the current checkout, including debate-stage schemas such as `exploit-analyst-v1` and `defense-analyst-v1`.

When structured exploit/defense outputs parse successfully, the deep debate pipeline emits confidence-threshold hints in its weighted summary so the final synthesizer can bias ambiguous findings toward rejection unless direct code evidence is strong. If structured parsing fails, the debate summary marks those hints unavailable and falls back to direct code review.
`HIGH_CONFIDENCE_CONFIRM` is intentionally exploitability-led: it requires a strong exploit-side signal plus supporting defense agreement; a merely net-positive score does not automatically promote a finding to confirmed.

When a `threshold_hint` is present, it is the auto-confirm/auto-reject gate: `REVIEW_REQUIRED` means the synthesizer should not auto-confirm from raw category pairs alone, even if the debate text includes `CONFIRMED`, `VULNERABLE`, `MITIGATED`, or `DOWNGRADED`.
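The gating logic above can be sketched as a small decision function. The field names and thresholds below are hypothetical stand-ins, not Skwaq's actual API; the sketch only shows the shape of the rule (hint overrides everything, and confirm requires a strong exploit-side signal):

```python
# Illustrative sketch of the exploitability-led confirm gate.
# Scores and thresholds are invented for this example.
def synthesize(exploit_score, defense_score, threshold_hint=None):
    if threshold_hint == "REVIEW_REQUIRED":
        return "REVIEW"  # never auto-confirm from raw category pairs
    # Confirm needs a strong exploit-side signal plus defense
    # agreement; a net-positive total alone is not enough.
    if exploit_score >= 0.8 and defense_score >= 0.5:
        return "HIGH_CONFIDENCE_CONFIRM"
    if exploit_score < 0.3:
        return "REJECT"
    return "REVIEW"

print(synthesize(0.9, 0.6))                     # strong on both sides
print(synthesize(0.9, 0.6, "REVIEW_REQUIRED"))  # the hint overrides
print(synthesize(0.2, 0.9))                     # weak exploit signal
```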
### Investigation

```bash
skwaq investigate list         # List investigations
skwaq annotate <addr> "note"   # Add annotation
skwaq hypothesize "theory"     # Record hypothesis
```
### Reporting

```bash
skwaq report           # Markdown report (default)
skwaq report --sarif   # SARIF for CI/CD
skwaq report --json    # JSON output
```
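For CI/CD consumers, it may help to see the generic shape of a SARIF log. The skeleton below follows the SARIF 2.1.0 specification's `runs`/`results` structure; the rule ID, message, and file location are invented for illustration and are not taken from Skwaq's actual output:

```python
import json

# Minimal SARIF 2.1.0 skeleton: the shape any tool's SARIF output
# follows. Rule ID, message, and location here are invented.
log = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {
            "name": "skwaq",
            "rules": [{"id": "CWE-120", "name": "BufferOverflow"}],
        }},
        "results": [{
            "ruleId": "CWE-120",
            "level": "error",
            "message": {"text": "Unbounded strcpy of user-controlled input"},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "src/parse.c"},
                "region": {"startLine": 42},
            }}],
        }],
    }],
}
print(log["runs"][0]["results"][0]["ruleId"])  # CWE-120
```

GitHub code scanning, Azure DevOps, and most SARIF viewers accept logs of this shape directly.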
### Visualization

```bash
skwaq viz findings    # Findings table
skwaq viz callgraph   # Call graph tree
```
### Knowledge Base

```bash
skwaq kb init              # Seed the CWE catalog and validate knowledge packs
skwaq kb search "buffer"   # Search initialized CWE + knowledge-pack entries
skwaq kb search "cwe-119 buffer overflow" --json
```

Run `skwaq kb init` once per workspace before searching. `kb search` uses the same shared backend as agent knowledge lookup and can return mixed CWE and knowledge-pack results; use `--json` for automation.
### System

```bash
skwaq doctor          # Check prerequisites
skwaq config show     # Show configuration
skwaq gym preflight   # Verify Copilot benchmark readiness
skwaq version         # Show version
```
## Gym Self-Improvement

```bash
skwaq gym run fixtures --quick   # Baseline benchmark
skwaq gym improve fixtures       # Run improvement cycle
skwaq gym run fixtures --quick   # Verify improvement
skwaq gym compare                # Show score delta
skwaq gym case-diff              # Per-case outcome changes
```

The `gym improve` command analyzes detection failures, proposes targeted fixes (new patterns, CWE mappings, taint rules), reviews them for overfitting via an LLM reviewer agent, and applies accepted patches. See docs/gym-self-improvement.md for the full guide.

See docs/detection-coverage.md for how semantic classification, CWE family mapping, and scoring interact.
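The propose-review-apply cycle can be sketched with a toy reviewer. The real overfitting-reviewer is an LLM agent; the heuristic below (rejecting rules keyed to a single benchmark case name) is only a stand-in to show the filtering step, and the proposal format is invented:

```python
# Toy stand-in for the overfitting-reviewer step of `gym improve`:
# keep proposals that generalize, drop ones pinned to one test case.
def looks_overfit(patch):
    """Hypothetical heuristic: reject rules keyed to a single case."""
    return "case_" in patch["pattern"]

proposals = [
    {"pattern": r"strcpy\s*\(", "cwe": "CWE-120"},    # generalizes
    {"pattern": "case_0417_buf", "cwe": "CWE-120"},   # overfit to one case
]
accepted = [p for p in proposals if not looks_overfit(p)]
print(len(accepted))  # 1
```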
### Model Comparison with Profiles

```bash
skwaq gym profile create opus --backend copilot --model claude-opus-4.6
skwaq gym run fixtures --quick --profile opus
skwaq gym dashboard --tui --profile opus
skwaq gym profiles   # List all profiles
```

Profiles provide isolated state (results DB, memory graph, telemetry) for reproducible multi-model evaluation. See docs/gym-profiles.md for the full reference.
## Architecture

Three Rust crates:

- `skwaq-core`: Binary parsing (goblin), graph database (LadybugDB/SQLite), analysis engine, 18 agent definitions, LLM client via RustyClawd, durable agent memory
- `skwaq-gym`: Benchmark harness, 6 industry adapters, self-improvement loop with failure-analyst and overfitting-reviewer agents
- `skwaq` (cli): clap-based CLI with 20+ commands

```
CLI (clap) -> Analysis Engine -> Graph DB (LadybugDB)
                   |                    |
            18 LLM Agents         Binary Parser
           (via RustyClawd)          (goblin)
```

See the website for the full multi-agent pipeline diagram and benchmark results.
## Configuration

Create `skwaq.toml` in your project directory:

```toml
[llm]
reasoning = "copilot"       # default; or "anthropic" (requires ANTHROPIC_API_KEY)
decompilation = "copilot"   # backend for decompile-* stages; no hidden fallback

[llm.copilot]
model = "claude-opus-4.6"   # default model for Copilot backend

[llm.ollama]
host = "http://localhost:11434"
model = "llama3.1"

[binary]
ghidra_path = "/opt/ghidra"
```
### LLM Backend

Skwaq supports three LLM backends:

| Backend | Config | Auth |
|---------|--------|------|
| GitHub Copilot | `reasoning = "copilot"` | `gh auth login` (needs `copilot` scope) |
| Azure AI Foundry | `reasoning = "azure"` | `az login` or `AZURE_OPENAI_API_KEY` |
| Anthropic | `reasoning = "anthropic"` | `ANTHROPIC_API_KEY` |

```toml
# Azure AI Foundry (GPT-5.4)
[llm]
reasoning = "azure"

[llm.azure]
endpoint = "https://your-resource.cognitiveservices.azure.com/"
deployment = "gpt-54-skwaq"
api_version = "2024-10-21"
```

```toml
# GitHub Copilot (Claude Opus)
[llm]
reasoning = "copilot"

[llm.copilot]
model = "claude-opus-4.6"
```

Use `skwaq gym preflight` to verify your LLM configuration before benchmark runs.
## Dashboard & Telemetry

```bash
skwaq gym dashboard --live   # Real-time TUI with active jobs, ETA, agent stats
skwaq gym dashboard --tui    # Static snapshot
skwaq gym telemetry query    # Query OTEL spans
```

The dashboard shows per-suite F1/precision/recall, which model produced each result, active jobs with progress and ETA, agent call stats, and API health. OTEL spans are exported to `~/.skwaq/telemetry/spans.jsonl`.
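Because the export is JSON Lines (one JSON object per line), it is easy to query with ad-hoc scripts. A minimal sketch, using inline sample data; the span field names (`name`, `duration_ms`) are illustrative, not a documented schema:

```python
import io
import json

# Stand-in for open(os.path.expanduser("~/.skwaq/telemetry/spans.jsonl")).
# Field names are hypothetical examples of OTEL-style span records.
sample = io.StringIO(
    '{"name": "agent.call", "duration_ms": 812}\n'
    '{"name": "gym.case", "duration_ms": 1540}\n'
)
spans = [json.loads(line) for line in sample]
agent_calls = [s for s in spans if s["name"] == "agent.call"]
print(len(agent_calls), agent_calls[0]["duration_ms"])  # 1 812
```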
## Running from Any Directory

Set `SKWAQ_ROOT` to use the installed binary from anywhere:

```bash
export SKWAQ_ROOT=/path/to/skwaq
skwaq gym dashboard --live
```
## Infrastructure Setup

Deploy Azure AI Foundry models (idempotent):

```bash
bash infra/azure/setup.sh
```
### BinPool Benchmark

The BinPool suite requires a manual download from Zenodo. After downloading, extract to `~/.local/share/skwaq/gym/cache/binpool/binpool_artifact/` and run `skwaq gym setup`.
## Latest Benchmark Results (GPT-5.4 via Azure AI Foundry)

| Suite | F1% | Precision% | Recall% | TP | FP | FN |
|-------|-----|------------|---------|----|----|----|
| CGC | 94.3 | 100.0 | 89.2 | 497 | 0 | 60 |
| CyberGym | 94.7 | 100.0 | 89.8 | 531 | 0 | 60 |
| CyberSecEval | 93.9 | 100.0 | 88.6 | 441 | 0 | 57 |
| Fixtures | 94.1 | 100.0 | 88.9 | 160 | 0 | 20 |
| OWASP | 90.2 | 100.0 | 82.1 | 533 | 0 | 116 |
| Juliet | 59.0 | 100.0 | 41.8 | 341 | 0 | 474 |

100% precision across all suites (zero false positives).
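The percentage columns follow directly from the TP/FP/FN counts: precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 is their harmonic mean. A quick check against two rows of the table above:

```python
def metrics(tp, fp, fn):
    """Percent precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision * 100, 1), round(recall * 100, 1), round(f1 * 100, 1)

print(metrics(497, 0, 60))   # CGC row:    (100.0, 89.2, 94.3)
print(metrics(341, 0, 474))  # Juliet row: (100.0, 41.8, 59.0)
```

With FP = 0 everywhere, precision is pinned at 100% and F1 is driven entirely by recall, which is why Juliet's lower recall dominates its F1.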
## Pattern-Only Detection (Latest, 2026-03-31)

| Suite | Cases | F1% | P% | R% |
|-------|-------|-----|----|----|
| Fixtures | 99 | 93.7 | 98.1 | 89.3 |
| OWASP | 500 | 93.8 | 100.0 | 88.3 |
| CyberSecEval | 578 | 91.8 | 100.0 | 84. |
