<p align="center"> <img src="docs/images/banner.png" alt="A.T.L.A.S" width="100%"> </p> <p align="center"> <img src="https://img.shields.io/badge/license-Source%20Available-blue" alt="License"> <img src="https://img.shields.io/badge/python-3.10+-green" alt="Python"> <img src="https://img.shields.io/badge/platform-K3s%20%7C%20K8s-blue" alt="K8s"> <img src="https://img.shields.io/badge/GPU-RTX%205060%20Ti%2016GB-green" alt="GPU"> <img src="https://img.shields.io/badge/status-v3.0-blue" alt="Status"> </p> <h1 align="center">A.T.L.A.S</h1> <p align="center"><b>Adaptive Test-time Learning and Autonomous Specialization</b></p>

A.T.L.A.S achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted -- no data leaves the machine, no API keys required, no usage metering. One GPU, one box.


Benchmark Results

Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

| Benchmark | Score | Tasks | Method |
|-----------|-------|-------|--------|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair, V3 Score |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning, V2 Score |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding, V2 Score |

*pass@1-v(k=3) = one solution submitted per task, generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not equivalent to standard pass@1. See methodology.

<details> <summary><b>V3 ablation breakdown</b></summary>

| Condition | Configuration | Pass Rate | Delta |
|-----------|---------------|-----------|-------|
| A | Baseline (no V3) | 54.9% | -- |
| B | +Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | +Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | +Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |

Phase 3 uses self-generated test cases for internal verification -- the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md
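The deltas in the table can be re-derived from the pass rates. The short check below is only a worked example: the `baseline_of` mapping is an assumption about which condition each row is compared against (D is read as Phase 1 + Phase 3, so it is compared with B), and all variable names are illustrative.

```python
# Pass rates (percent) copied from the ablation table above.
pass_rates = {"A": 54.9, "B": 67.3, "C": 67.3, "D": 74.6}

# Assumption: each condition is compared against the configuration it extends.
baseline_of = {"B": "A", "C": "B", "D": "B"}


def delta_pp(condition: str) -> float:
    """Difference in percentage points between a condition and its baseline."""
    base = baseline_of[condition]
    return round(pass_rates[condition] - pass_rates[base], 1)


deltas = {cond: delta_pp(cond) for cond in baseline_of}

# Total gain of the full pipeline over the no-V3 baseline.
total_gain_pp = round(pass_rates["D"] - pass_rates["A"], 1)

# Share of Phase 3 rescues attributed to PR-CoT (36 of 42 tasks).
pr_cot_share = round(100 * 36 / 42, 1)

print(deltas, total_gain_pp, pr_cot_share)
```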

</details>

Cost and Performance Context

| System | LCB pass@1 | Est. cost/task | Notes |
|--------|-----------|----------------|-------|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |

<details> <summary><b>Methodology notes & sources</b></summary>

Methodology notes: ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model, reported as pass@1-v(k=3). Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head. API costs assume ~2,000 input + ~4,000 output tokens per task at current pricing. ATLAS cost = electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost -- the pipeline takes longer per task than a single API call, but no data leaves the machine.

Sources: Artificial Analysis LCB Leaderboard | AA Benchmarking Methodology | LiveCodeBench Paper (arXiv) | LCB Dataset (HuggingFace) | Pricing: OpenAI, Anthropic, DeepSeek

</details>

How It Works

flowchart LR
  subgraph Phase1["Phase 1: Generate"]
    PS[PlanSearch<br/>Constraint extraction<br/>+ diverse plans]
    BF[Budget Forcing<br/>Thinking token<br/>control]
  end

  subgraph Verify["Score + Test"]
    GL[Geometric Lens<br/>C x energy scoring<br/>5120-dim self-embeddings]
    SB[Sandbox<br/>Code execution]
  end

  subgraph Phase3["Phase 3: Repair"]
    ST[Self-Test Gen<br/>Model-generated<br/>I/O pairs]
    PR[PR-CoT Repair<br/>Multi-perspective<br/>chain-of-thought]
  end

  PS --> BF
  BF -->|k=3 candidates| GL
  GL -->|energy-sorted| SB
  SB -->|all fail| ST
  ST --> PR
  PR -->|repaired code| SB

  style GL fill:#2d5016,color:#fff
  style PS fill:#1a3a5c,color:#fff
  style BF fill:#1a3a5c,color:#fff
  style SB fill:#2d5016,color:#fff
  style ST fill:#5c3a1a,color:#fff
  style PR fill:#5c3a1a,color:#fff

A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s) and 5120-dim self-embeddings for Lens scoring. The Geometric Lens C(x) energy field selects the best candidate (87.8% accuracy on mixed-result tasks). Failed tasks enter Phase 3, where the model generates its own test cases and iteratively repairs solutions via PR-CoT -- real tests are used only for final scoring.
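The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual V3 code: every function passed in (`generate`, `score_energy`, `run_sandbox`, `make_self_tests`, `repair`) is a hypothetical stand-in, the repair round limit is invented for the example, and the sketch assumes that lower energy means a more promising candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Candidate:
    code: str
    energy: float  # Lens score; ASSUMPTION: lower energy = more promising


def solve_task(
    problem: str,
    generate: Callable[[str, int], Sequence[str]],   # Phase 1 stand-in: k candidates
    score_energy: Callable[[str, str], float],       # Geometric Lens stand-in
    run_sandbox: Callable[[str], bool],              # True if the code passes the sandbox checks
    make_self_tests: Callable[[str], List[str]],     # Phase 3 stand-in: model-generated I/O pairs
    repair: Callable[[str, str, List[str]], str],    # Phase 3 stand-in: PR-CoT repair
    k: int = 3,
    max_repair_rounds: int = 2,                      # hypothetical limit, not from the README
) -> Optional[str]:
    """Return the first candidate that passes the sandbox, or None."""
    candidates = [
        Candidate(code, score_energy(problem, code))
        for code in generate(problem, k)
    ]
    if not candidates:
        return None

    # Try candidates in energy order, stopping at the first success.
    candidates.sort(key=lambda c: c.energy)
    for cand in candidates:
        if run_sandbox(cand.code):
            return cand.code

    # All candidates failed: build self-tests and repair the top-ranked one.
    self_tests = make_self_tests(problem)
    current = candidates[0].code
    for _ in range(max_repair_rounds):
        current = repair(problem, current, self_tests)
        if run_sandbox(current):
            return current
    return None
```

Passing the stages in as callables keeps the sketch independent of any particular server or model; in the real pipeline these would be backed by the llama-server, the Lens, and the sandbox service.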

Full architecture: docs/ARCHITECTURE.md


Quick Start

Before you begin: ATLAS was developed and tested on specific hardware. Read the Hardware & Reproduction section below to check compatibility and tune variables for your setup before running.

git clone https://github.com/itigges22/ATLAS.git && cd ATLAS

cp atlas.conf.example atlas.conf    # set MODEL_PATH, DATA_DIR, GPU device
sudo ./scripts/install.sh
./scripts/verify-install.sh

# Run V3 benchmark
python3 benchmark/v3_runner.py

See docs/SETUP.md for full installation instructions.


Hardware & Reproduction

| Resource | Minimum | Tested |
|----------|---------|--------|
| GPU VRAM | 16 GB | RTX 5060 Ti 16 GB |
| System RAM | 14 GB | 16 GB |
| Python | 3.10+ | 3.11 |
| OS | RHEL 9 / Ubuntu 24 | RHEL 9 (Proxmox VM) |
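A quick pre-flight check against these minimums might look like the sketch below. It is not part of the ATLAS scripts: `check_requirements` and `MINIMUMS` are hypothetical names, and the RAM and VRAM figures are expected to be supplied by you (for example from your system monitor or `nvidia-smi`), in the same nominal GB units the table uses.

```python
import sys

# Minimums copied from the hardware table above.
MINIMUMS = {"python": (3, 10), "ram_gb": 14, "vram_gb": 16}


def check_requirements(python_version, ram_gb, vram_gb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < MINIMUMS["python"]:
        problems.append("Python 3.10+ is required")
    if ram_gb < MINIMUMS["ram_gb"]:
        problems.append(f"System RAM {ram_gb} GB is below the 14 GB minimum")
    if vram_gb < MINIMUMS["vram_gb"]:
        problems.append(f"GPU VRAM {vram_gb} GB is below the 16 GB minimum")
    return problems


if __name__ == "__main__":
    # Example values for the tested configuration; replace with your own.
    for problem in check_requirements(sys.version_info, ram_gb=16, vram_gb=16):
        print("WARNING:", problem)
```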

<details> <summary><b>Reproduction details</b></summary>

V3 results were produced on RHEL 9 running as a Proxmox VM with an RTX 5060 Ti 16GB passed through via VFIO. Other NVIDIA GPUs with 16GB+ VRAM should work, though you may need to adjust driver versions and VRAM allocation.

The pipeline is not yet plug-and-play on arbitrary hardware -- V3.1 will improve portability. That said, Claude Code can be used to retrofit the pipeline to your specific setup (different GPU, OS, VRAM budget).

Key variables to tune for your hardware:

  • --parallel slots (default 2 -- reduce to 1 if VRAM is tight)
  • KV cache quantization (Q4_0 -- see ARCHITECTURE.md for VRAM breakdown)
  • Context per slot (default 20480 tokens)
  • CUDA driver version (tested on CUDA 12.8)

Full VRAM budget breakdown is documented in docs/ARCHITECTURE.md. Community reproduction attempts are welcome -- open an issue with your hardware config and results.
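To get a feel for how slot count and per-slot context affect VRAM before consulting the full breakdown, the KV cache size can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not the project's own accounting: the layer count, KV head count, and head dimension used in the example are assumptions about Qwen3-14B that should be checked against the GGUF metadata, and the estimate ignores model weights, compute buffers, and the draft model used for speculative decoding.

```python
# Q4_0 packs 32 values into an 18-byte block (16 bytes of 4-bit values
# plus a 2-byte scale), i.e. 0.5625 bytes per element.
Q4_0_BYTES_PER_ELEMENT = 18 / 32


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, slots, ctx_per_slot,
                   bytes_per_element=Q4_0_BYTES_PER_ELEMENT):
    """Estimate KV cache size in bytes for a given slot configuration."""
    total_ctx = slots * ctx_per_slot
    # Both K and V store n_kv_heads * head_dim values per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * total_ctx


def to_gib(n_bytes):
    return n_bytes / 2**30


# ASSUMED model dimensions (verify against your GGUF file).
ASSUMED_DIMS = dict(n_layers=40, n_kv_heads=8, head_dim=128)

two_slots = kv_cache_bytes(**ASSUMED_DIMS, slots=2, ctx_per_slot=20480)
one_slot = kv_cache_bytes(**ASSUMED_DIMS, slots=1, ctx_per_slot=20480)
print(f"2 slots: {to_gib(two_slots):.2f} GiB, 1 slot: {to_gib(one_slot):.2f} GiB")
```

Under these assumptions, dropping from two slots to one halves the KV cache. In upstream llama.cpp the corresponding server options are `--parallel`, `--ctx-size` (the total shared across slots), and `--cache-type-k` / `--cache-type-v`; whether the patched server exposes them unchanged is not confirmed here.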

</details>

Project Structure

benchmark/       Benchmark suite (V2 runner, V3 pipeline, datasets)
benchmark/v3/    V3 subsystems (16 modules: PlanSearch, BudgetForcing, PR-CoT, etc.)
rag-api/         Core API: Geometric Lens, confidence router, RAG, cache
llama-server/    Patched llama.cpp server (spec decode + self-embeddings)
manifests/       K3s deployment manifests
scripts/         Installation and management scripts
tests/           Test suite (infrastructure, integration, V3)
docs/            Architecture, setup, configuration, troubleshooting
api-portal/      API key management portal (JWT auth, web UI)
sandbox/         Isolated code execution environment

Documentation

| Document | Description |
|----------|-------------|
| ARCHITECTURE.md | System architecture, component deep-dives, data flows |
| V3_ABLATION_STUDY.md | V3 ablation results and phase contribution analysis |
| SETUP.md | Installation and deployment guide |
| CONFIGURATION.md | Configuration reference (including all V3 toggles) |
| TROUBLESHOOTING.md | Common issues and solutions |
| API.md | API endpoint documentation |

<details> <summary><b>Historical documentation</b></summary>

| Document | Description |
|----------|-------------|
| V2_5_ABLATION_STUDY.md | V2.5 Geometric Lens ablation (embedding source discovery) |
| V2_TO_V2_5_MIGRATION.md | V2 to V2.5 two-server sidecar migration and V3 restoration |

</details>

Roadmap

V3.0 -- Complete (2026-03-05)

74.6% LCB pass@1-v(k=3) on frozen Qwen3-14B-Q4_K_M. PlanSearch + BudgetForcing + Geometric Lens + PR-CoT repair pipeline. Full ablation report.

Known Limitations

These are actively being addressed in V3.1:

  1. LCB-only optimization. V3 phases were designed and tuned for LiveCodeBench. GPQA Diamond (47.0%) and SciCode (14.7%) results are included, but the pipeline was not optimized for those benchmarks. Cross-domain generalization is a V3.1 priority.

  2. Phase 2 (Geometric Lens routing) contributed +0.0pp. C(x) was retrained on self-embeddings for V3 (fixing the V2 nomic embedding failure), but the training dataset was only ~60 samples
