BC-250
AMD BC-250 (PS5 APU) setup guide — Ollama + Vulkan inference, poor man's AI assistant via Signal, stable-diffusion.cpp image generation
██████╗ ██████╗ ██████╗ ███████╗ ██████╗
██╔══██╗██╔════╝ ╚════██╗██╔════╝██╔═████╗
██████╔╝██║ █████╗ █████╔╝███████╗██║██╔██║
██╔══██╗██║ ╚════╝██╔═══╝ ╚════██║████╔╝██║
██████╔╝╚██████╗ ███████╗███████║╚██████╔╝
╚═════╝ ╚═════╝ ╚══════╝╚══════╝ ╚═════╝
<div align="center">
GPU-accelerated AI home server on an obscure AMD APU — Vulkan inference, autonomous intelligence, Signal chat
Zen 2 · GFX1013 ("RDNA 1.5", informal) · 16 GB unified · Vulkan · 35B MoE @ 37.5 tok/s · 256K alloc / 32K practical filled ctx · 330 autonomous jobs/cycle · 130 dashboard pages
The BC-250, powered by an ATX supply and cooled by a broken AIO radiator with three fans just sitting on top of it. It has somehow run 24/7 without issues so far.
</div>

A complete guide to running a 35-billion-parameter language model (Mixture-of-Experts architecture), FLUX.2 image generation, and 330 autonomous jobs on the AMD BC-250 — a crypto-mining board built around AMD's Cyan Skillfish APU (Zen 2 + GFX1013 GPU, 16 GB GDDR6), often associated by the community with the PS5's silicon lineage (Phoronix, LLVM AMDGPU), repurposed as a headless AI server with a community-patched BIOS.
The guide covers: a 35B MoE at 37.5 tok/s (tokens/second) with a 256K allocation ceiling and 32K practical filled context (64K allocation default); FLUX.2-klein-9B as the preferred image model from side-by-side testing; hardware-specific driver workarounds; memory-tuning notes; and real-world benchmarks on this niche hardware. If you're new to LLM terminology, see the glossary below.
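As a back-of-the-envelope check on why these numbers fit (or don't fit) in 16 GB of unified memory, weight size is roughly parameters × bits per weight, and KV-cache size grows linearly with context. The sketch below is illustrative only: it ignores GGUF metadata and the higher-precision tensors real quantizations keep, and the model shape used for the KV cache (40 layers, 8 KV heads, head_dim 128) is a hypothetical 14B-class dense configuration, not measured from any model in this guide.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF weight size in GB: parameter count x bits per weight.

    Treat as a lower bound -- real files add metadata and keep some
    tensors at higher precision.
    """
    return params_billion * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: K and V tensors per layer, KV head, and context slot."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

# 35B of weights at IQ2_M (~2.5 bits/weight): weights alone
w = weights_gb(35, 2.5)                        # ~10.9 GB

# Hypothetical dense shape: 40 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
# Shows why context allocation, not weights, sets the practical ceiling:
kv_32k = kv_cache_gb(40, 8, 128, 32 * 1024)    # ~5.4 GB -- tight but plausible
kv_256k = kv_cache_gb(40, 8, 128, 256 * 1024)  # ~42.9 GB -- far beyond 16 GB
print(round(w, 1), round(kv_32k, 1), round(kv_256k, 1))
```

Under these assumptions, weights plus a filled 32K cache already approach the 16 GB ceiling, which is consistent with a large allocation limit being usable only when the context is mostly empty.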
What makes this unusual: This document describes one public, real-world LLM inference deployment on BC-250 / GFX1013 hardware — GFX10-era silicon informally called "RDNA 1.5" by the community. ROCm's userspace libraries don't ship GFX1013 support. OpenCL/rusticl was not functional in this configuration. On this Fedora 43 / Mesa 25.3.4 stack, Vulkan was the only GPU compute path that proved usable — and even that required working around two kernel memory bottlenecks (GTT cap + TTM pages_limit) before 14B models would run.

Disclaimer: Unless otherwise stated, performance figures in this document are local measurements from one BC-250 board running Fedora 43, Mesa 25.3.4, and Ollama 0.18.0 with specific model quantizations. They are not vendor benchmarks and may not be reproducible on different software stacks.

<details><summary><b>Quick glossary — LLM inference terms used in this document</b></summary>

| Term | What it means |
|------|---------------|
| LLM | Large Language Model — a neural network trained on text that generates responses token by token. Think of it as a stateless function: prompt in, text out. |
| Token | The basic unit LLMs operate on. Roughly ¾ of a word in English. "Hello world" ≈ 2 tokens. |
| tok/s | Tokens per second — the generation throughput. Higher = faster responses. |
| Parameters (3B, 14B, 35B) | The number of trained weights in the model. More parameters generally means better quality but more memory and slower inference. A 14B model has 14 billion floating-point weights. |
| Quantization (Q4_0, IQ2_M, Q4_K_M) | Compressing model weights from 16-bit floats to fewer bits. Q4 = 4 bits per weight (~4× smaller). IQ2_M ≈ 2.5 bits (~6× smaller). Trades precision for memory — like choosing between float32 and int8 for a DSP pipeline. |
| GGUF | File format for quantized models (from llama.cpp). Contains weights + metadata. Analogous to a firmware binary with embedded config. |
| Context window / context length | How many tokens the model can "see" at once (prompt + response). A 64K context = ~48K words. The model has no memory between calls — everything must fit in this window. |
| KV cache | Key-Value cache — working memory allocated during inference to store attention state for each token in the context. Grows linearly with context length. This is the main VRAM consumer beyond model weights. |
| Prefill | The phase where the model processes your entire prompt before generating the first output token. Speed measured in tok/s. Often compute-heavy at short prompts; at larger contexts, memory traffic becomes a major limiter. |
| Generation | The phase where the model produces output tokens one at a time. Each new token requires reading all model weights once. Bottlenecked by memory bandwidth × parameter count. |
| TTFT | Time To First Token — wall-clock delay from sending a prompt to receiving the first output token. Includes model load time (if cold) + prefill time. |
| MoE (Mixture of Experts) | Architecture where only a subset of parameters activate per token. A 35B MoE with 3B active means 35B total weights in memory, but only 3B are used for each token's computation — faster than a 35B dense model, with quality closer to 35B than 3B. |
| Dense model | A standard model where all parameters activate for every token. A 14B dense model does 14B operations per token. |
| Ollama | Local LLM inference server. Wraps llama.cpp with an HTTP API. Manages model loading, KV cache, and GPU offload. |
| Think mode / thinking tokens | Some models (DeepSeek-R1, Qwen3) generate internal reasoning tokens before the visible answer. These consume the output budget and context window but aren't shown to the user. |

</details>

░░ Contents
| § | Section | What you'll find |
|:---:|---------|------------------|
| | PART I ─ HARDWARE & SETUP | |
| 1 | Hardware Overview | Specs, memory architecture, power |
| 2 | Driver & Compute Stack | What works (Vulkan), what doesn't (ROCm) |
| 3 | Ollama + Vulkan Setup | Install, GPU memory tuning (GTT + TTM) |
| 4 | Models & Benchmarks | Model compatibility, speed, memory budget |
| 4.10 | ↳ Ollama vs llama.cpp | TG: +45% MoE, +7% dense; 32K+ only via Ollama |
| | PART II ─ AI STACK | |
| 5 | Signal Chat Bot | Chat, vision analysis, audio transcription, smart routing |
| 6 | Image Generation | FLUX.2-klein-9B, synchronous pipeline |
| | PART III ─ MONITORING & INTEL | |
| 7 | Netscan Ecosystem | 330 jobs, queue-runner v7, 130-page dashboard |
| 8 | Career Intelligence | Two-phase scanner, salary, patents |
| | PART IV ─ COMPREHENSIVE BENCHMARKS | |
| B1 | Methodology | 5-phase suite, prompt standardization, scoring criteria |
| B2 | Statistical Validation | CV < 1.5%, single-run reliability proof |
| B3 | Generation Speed | tok/s, prefill, TTFT, VRAM (31 of 33 models) |
| B4 | Quality Assessment | 5 tasks × 3 runs, per-task breakdown, tier analysis |
| B5 | Context Scaling | Filled-context sweep, degradation, ceiling grid |
| B6 | Long-Context Quality | Fact retrieval, multi-hop reasoning, synthesis @ 16K+32K |
| B7 | Cold-Start Timing | TTFT, load speed, Signal chat latency profile |
| B8 | Quantization Impact | Q4_K_M vs Q8_0 comparison |
| B9 | Image Generation | 8 models, resolution scaling, video, upscaling |
| B10 | Model Recommendations | Best model per use case |
| | PART V ─ REFERENCE | |
| 9 | Repository Structure | File layout, deployment paths |
| 10 | Troubleshooting | Common issues and fixes |
| 11 | Known Limitations | What's broken, what to watch out for |
| 12 | Software Versions | Pinned versions of all components |
| 13 | References | Links to all upstream projects and models |
| A | OpenClaw Archive | Original architecture, why it was ditched |
PART I — Hardware & Setup
1. Hardware Overview
The AMD BC-250 is a crypto-mining board built by ASRock Rack around AMD's Cyan Skillfish APU — Zen 2 CPU (6c/12t) and GFX1013 GPU (24 CUs) with 16 GB GDDR6 unified memory. The Cyan Skillfish silicon is widely associated with the same hardware family as Sony's PS5 APU (Oberon), and a common community theory is that these are salvaged/binned PS5 dies that didn't meet Sony's specs. This is plausible but not publicly confirmed by AMD — treat it as informed speculation, not established fact. Based on reseller listings and community discussion, these boards were deployed in multi-board rack mining systems by ASRock Rack. After the racks were decommissioned, individual boards became available on AliExpress.
GFX1013 vs PS5: The PS5's Oberon is RDNA 2 (GFX10.3, gfx1030+). For practical purposes, the BC-250's Cyan Skillfish (gfx1013) behaves like a GFX10.1-era variant with fewer CUs than a full PS5 APU and an older ISA — though exact die-level comparisons are speculative without official AMD documentation. Unusually for GFX10.1, it retains hardware ray-tracing extensions (VK_KHR_ray_tracing_pipeline, VK_KHR_ray_query). The community label "RDNA 1.5" (used throughout this document) reflects this hybrid positioning: a GFX10.1 instruction set with ray-tracing hardware more typical of RDNA 2. This is informal shorthand, not an official AMD designation.
BIOS is not stock. The board ships with
