SoyLM
Local-first NotebookLM alternative powered by Nemotron. YouTube transcripts, Playwright JS rendering, FTS5 RAG, DDG search, SSE streaming.
Local-first RAG system powered by a single 9B-parameter LLM. No vector database, no embedding models, no cloud APIs — just SQLite FTS5, BM25 ranking, and Nemotron-Nano-9B-v2-Japanese served locally via vLLM.
Overview
SoyLM is a self-contained Retrieval-Augmented Generation application that runs entirely on local hardware. Upload documents, URLs, or YouTube videos as sources. The LLM analyzes each source into structured summaries stored in SQLite, then enables grounded Q&A with source citations — all through a single 9B model handling every stage of the pipeline.
What makes it different
- No vector database, no embeddings. Retrieval uses SQLite FTS5 full-text search with BM25 ranking. The LLM extracts bilingual keywords (JA↔EN) from the user's query, which are used as FTS5 MATCH terms. This eliminates the need for separate embedding models, vector stores, and the associated infrastructure.
- Single model for the entire pipeline. One Nemotron-Nano-9B instance handles source analysis, keyword extraction, and answer generation. No multi-model orchestration.
- Minimal footprint. ~1,900 lines total (Python + HTML/JS). No React, no Node.js build step, no external search infrastructure. Two Python files, two HTML templates, one SQLite database.
- Thinking transparency. Nemotron's chain-of-thought reasoning tokens are streamed to the user in real-time via SSE, making the model's thought process visible before the final answer arrives.
Features
| Feature | Details |
|---|---|
| Source ingestion | Files (.txt, .md, .py, .pdf, etc.), web URLs, YouTube transcripts, paste text, DuckDuckGo web search for URL discovery |
| Web fetching | httpx with automatic Playwright (headless Chromium) fallback for JS-rendered pages; same-domain link crawling |
| YouTube | Automatic transcript extraction via youtube-transcript-api, with oEmbed metadata |
| Source analysis | LLM-generated structured JSON (summary, key points, topics, entities, language) with FTS5 trigger auto-indexing |
| RAG search | Bilingual LLM keyword extraction (JA↔EN) → SQLite FTS5 MATCH with BM25 ranking |
| Streaming | SSE with separated thinking (real-time) and content (complete block) channels |
| Deduplication | SHA-256 content hashing prevents duplicate sources within a notebook |
| Chat history | Persistent chat logs per notebook with JSON export |
Architecture
Browser (Jinja2 SSR + vanilla JS)
├── Sources panel ← upload / manage / DDG search
├── Chat (SSE) ← streaming Q&A with thinking
└── Chat history ← logs + JSON export
│
▼
FastAPI backend
├── app.py (~810 LOC) ← routes, RAG logic, LLM calls
├── search.py (~220 LOC) ← URL fetch, Playwright, YouTube, DDG
├── Nemotron-Nano-9B (vLLM, OpenAI-compatible API)
└── SQLite (soylm.db, WAL mode, FTS5 virtual table)
RAG pipeline
User query
│
├─ 1. Keyword extraction (LLM, thinking disabled)
│      "Chromebookのセットアップ方法" (how to set up a Chromebook)
│ → "Chromebook, setup, セットアップ"
│
├─ 2. FTS5 search (SQLite, BM25 ranking)
│ SELECT ... FROM sources_fts
│ WHERE sources_fts MATCH '"Chromebook" OR "setup" OR "セットアップ"'
│ ORDER BY rank
│
├─ 3. Context assembly
│ Top-N sources → full text + structured metadata
│
└─ 4. Generation (LLM, streaming, thinking enabled)
System prompt + 【ソースデータ】(source data) [1]..[N] + 【質問】(question)
→ Thinking tokens (streamed real-time)
→ Answer with citations [1], [2] (sent as complete block)
The keyword extraction step is what makes cross-lingual retrieval work without embeddings: a Japanese query is decomposed into both Japanese and English noun terms, and the combined set is used as FTS5 search terms. Sources in either language can match queries in either language.
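The retrieval step can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not SoyLM's actual schema: the table layout is invented, and the trigram tokenizer (SQLite 3.34+) is used here so CJK substrings match without word boundaries — the project's real tokenizer choice is an assumption.

```python
import sqlite3

# In-memory database with an FTS5 virtual table standing in for SoyLM's
# sources index (illustrative schema, not the project's actual one).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE sources_fts USING fts5(title, body, tokenize='trigram')"
)
db.executemany(
    "INSERT INTO sources_fts (title, body) VALUES (?, ?)",
    [
        ("Chromebook guide (EN)", "How to set up a Chromebook: setup steps."),
        ("Chromebook ガイド (JA)", "Chromebookのセットアップ手順を説明します。"),
        ("Pasta", "A note about cooking pasta."),
    ],
)

def fts_search(keywords: list[str], limit: int = 5) -> list[str]:
    """OR together the LLM-extracted bilingual keywords as quoted FTS5
    phrases and rank matches with BM25 (FTS5's built-in `rank`)."""
    match = " OR ".join(f'"{kw}"' for kw in keywords)
    rows = db.execute(
        "SELECT title FROM sources_fts WHERE sources_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (match, limit),
    ).fetchall()
    return [title for (title,) in rows]
```

A Japanese query decomposed into `["Chromebook", "setup", "セットアップ"]` matches both the English and the Japanese source through the same index, while the unrelated document is excluded.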
Source loading pipeline
Input (file / URL / YouTube / paste)
│
├─ Deduplication (SHA-256 hash check)
│
├─ Text extraction
│ ├── Files: UTF-8 decode / PyMuPDF for PDFs
│ ├── URLs: httpx → Playwright fallback (if < 500 chars)
│ ├── YouTube: youtube-transcript-api → oEmbed metadata
│ └── Paste: direct text
│
└─ LLM analysis → structured JSON
{ summary, key_points, topics, entities, language, full_text }
→ SQLite INSERT triggers automatic FTS5 indexing
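The deduplication step at the top of the pipeline can be sketched as follows. This is illustrative only: SoyLM keeps hashes in SQLite per notebook, whereas this stand-in class holds them in memory.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 over the normalized source text, used as the dedup key."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

class Notebook:
    """Minimal stand-in for per-notebook dedup (illustrative; SoyLM
    persists hashes in its SQLite database, not in memory)."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def add_source(self, text: str) -> bool:
        """Return True if the source is new, False if it is a duplicate."""
        h = content_hash(text)
        if h in self._hashes:
            return False
        self._hashes.add(h)
        return True
```

Because the hash covers the extracted text rather than the file bytes, the same article fetched twice from different URLs would also be caught.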
Streaming architecture
Thinking tokens and content tokens are separated at the SSE level:
- Thinking — streamed chunk-by-chunk in real-time as the model reasons (the reasoning_content field from vLLM)
- Content — collected server-side and sent as a single complete block after thinking finishes
This design ensures the final answer is coherent and not interleaved with partial reasoning, while maintaining full transparency into the model's chain-of-thought process.
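The server-side split can be sketched as a generator over streaming deltas. The delta shape (a reasoning_content field for thinking tokens, content for answer tokens) follows vLLM's reasoning output convention, but the SSE event names here are illustrative, not SoyLM's actual wire format.

```python
import json
from typing import Iterable, Iterator

def sse_events(deltas: Iterable[dict]) -> Iterator[str]:
    """Route streaming deltas into two SSE channels: `thinking` chunks
    are forwarded immediately; `content` chunks are buffered and sent
    as one complete block at the end."""
    answer_parts: list[str] = []
    for delta in deltas:
        if delta.get("reasoning_content"):
            yield f"event: thinking\ndata: {json.dumps(delta['reasoning_content'])}\n\n"
        if delta.get("content"):
            answer_parts.append(delta["content"])
    yield f"event: content\ndata: {json.dumps(''.join(answer_parts))}\n\n"

# Two thinking chunks stream out first, then one complete content block.
deltas = [
    {"reasoning_content": "The user asks about setup. "},
    {"reasoning_content": "Source [1] covers it."},
    {"content": "See the setup guide [1]."},
]
events = list(sse_events(deltas))
```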
Setup
Prerequisites
- Python 3.11+
- NVIDIA GPU with vLLM serving Nemotron-Nano-9B-v2-Japanese
vLLM configuration
Critical flags for Nemotron-Nano-9B (Mamba2+Attention hybrid architecture):
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese \
--mamba_ssm_cache_dtype float32 \
--max-model-len 16384 \
--dtype auto
| Flag | Requirement | Reason |
|---|---|---|
| --mamba_ssm_cache_dtype float32 | Mandatory | Without this, the Mamba2 SSM cache uses reduced precision and produces degraded outputs |
| --enable-prefix-caching | Do NOT enable | Corrupts SSM state on Mamba2 hybrid models — see Appendix 1 |
Recommended version: vLLM v0.15.1. Do not upgrade to v0.18.0+ — see Appendix 2.
Install
git clone https://github.com/soy-tuber/SoyLM.git
cd SoyLM
uv venv && uv pip install -r requirements.txt
playwright install chromium
Run
# vLLM must be running on an OpenAI-compatible endpoint
uvicorn app:app --host 0.0.0.0 --port 8080
Open http://localhost:8080
Environment variables
| Variable | Default | Description |
|---|---|---|
| NEMOTRON_BASE | http://localhost:8000/v1 | vLLM endpoint (any OpenAI-compatible API) |
| NEMOTRON_MODEL | nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese | Model name |
| STREAM_MAX_TOKENS | 8192 | Max tokens per streaming response |
SoyLM connects to any OpenAI-compatible endpoint — it does not manage the vLLM process. Use systemd, a process manager, or a gateway service for vLLM lifecycle management.
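The configuration surface above amounts to plain environment-variable lookups with the table's defaults (a sketch; SoyLM's actual variable handling may differ):

```python
import os

# Defaults mirror the environment-variables table; setting the
# variables overrides them at startup.
NEMOTRON_BASE = os.environ.get("NEMOTRON_BASE", "http://localhost:8000/v1")
NEMOTRON_MODEL = os.environ.get(
    "NEMOTRON_MODEL", "nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese"
)
STREAM_MAX_TOKENS = int(os.environ.get("STREAM_MAX_TOKENS", "8192"))
```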
Usage
1. Create a notebook
2. Add sources — upload files, enter URLs, paste YouTube links, paste text, or search DuckDuckGo for URLs
3. Click Load Sources — the LLM analyzes each source and generates structured summaries
4. Ask a question — matching sources are retrieved automatically via FTS5 + BM25
5. Click Generate — the model thinks (visible in real-time), then delivers a grounded answer with [1], [2] source citations
Design rationale
Why FTS5 + BM25 instead of vector search
Most RAG systems use vector search (FAISS, Chroma, Qdrant, pgvector, etc.) with a separate embedding model. SoyLM deliberately avoids this:
- Infrastructure cost. Vector databases require a separate embedding model (often another GPU or API call per document), a vector store process, and index management. FTS5 runs inside SQLite — zero additional infrastructure.
- Predictability. BM25 ranks by exact term frequency. For a system grounded in specific source documents (not open-domain semantic search), exact matching with known keywords is more predictable than cosine similarity in embedding space.
- Cross-lingual retrieval via LLM. Instead of multilingual embeddings, SoyLM uses the LLM itself to extract bilingual keywords from the query. This is a single lightweight LLM call (~64 tokens) that produces both Japanese and English search terms, enabling cross-lingual retrieval through the same FTS5 index.
- No chunking required. Vector search typically requires splitting documents into fixed-size chunks and embedding each. SoyLM stores full documents with LLM-generated metadata, and FTS5 searches across the complete text. The LLM's context window (up to 16K tokens) handles the full source content.
The trade-off: FTS5 cannot match semantically similar terms that don't share surface forms. In practice, the LLM keyword extraction compensates for this by generating synonyms and translations.
Why a single 9B model for everything
Nemotron-Nano-9B-v2-Japanese is a Mamba2+Attention hybrid that handles Japanese and English natively. Using one model for all pipeline stages eliminates:
- Model coordination and routing logic
- Multiple GPU memory allocations
- Latency from cross-model API calls
The model's built-in thinking mode (enable_thinking via chat template) provides chain-of-thought reasoning without requiring a larger model or separate reasoning step. With --mamba_ssm_cache_dtype float32 and prefix caching disabled, output quality is production-grade.
Inference parameters
| Parameter | Value | Rationale |
|---|---|---|
| temperature | 0.1 | Low temperature for factual grounding — reduces hallucination while allowing slight variation |
| max_tokens (streaming) | 8192 | Sufficient for detailed answers with citations |
| max_tokens (utility calls) | 64–2048 | Minimal allocation for keyword extraction and source analysis |
| enable_thinking | true (chat) / false (utility) | Thinking enabled only for user-facing generation; disabled for keyword extraction and analysis to save tokens |
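The table above can be expressed as a small parameter factory per pipeline stage. This is a sketch, not SoyLM's actual code; in particular, passing enable_thinking through chat_template_kwargs in extra_body follows vLLM's OpenAI-compatible API convention, but the project's real request shape is an assumption.

```python
def request_params(stage: str) -> dict:
    """Build OpenAI-compatible request parameters for one pipeline
    stage, following the inference-parameters table (illustrative)."""
    base = {
        "temperature": 0.1,  # low temperature for factual grounding
        "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
    }
    if stage == "chat":
        # User-facing generation: streaming, thinking enabled.
        base["max_tokens"] = 8192
        base["stream"] = True
        base["extra_body"]["chat_template_kwargs"]["enable_thinking"] = True
    elif stage == "keywords":
        # Keyword extraction: minimal token budget, no thinking.
        base["max_tokens"] = 64
    elif stage == "analysis":
        # Source analysis: larger budget for the structured JSON output.
        base["max_tokens"] = 2048
    else:
        raise ValueError(f"unknown stage: {stage}")
    return base
```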
File structure
SoyLM/
├── app.py # FastAPI backend, RAG logic, LLM interface
├── search.py # URL fetch, Playwright fallback, YouTube, DDG
├── tools.py              # Tool definitions (reserved for future use)
