# govllm
How do you justify a model choice six months after go-live?
Self-hosted LLM governance monitoring for regulated environments. Continuous scoring against EU AI Act, GDPR, and ANSSI — not a one-shot benchmark.
Built out of a question I couldn't find a good answer to while working on LLM deployment in the French public sector. Directly applicable to AI Act Article 9 requirements (ongoing risk management) and NIS2 operational continuity constraints.

## What it does
govllm scores LLM outputs continuously against configurable governance profiles. Each response is evaluated by a local LLM-as-a-judge across criteria mapped to regulatory frameworks. The best-performing model per use case is selected automatically — based on your governance criteria, not raw performance metrics.
```
Request → Governance profile → LLM-as-a-judge scoring → Dynamic routing → Model A / B / C / D
   ↑                                                            │
   └──────────────────── metrics refine criteria ───────────────┘
```
No data leaves your infrastructure. Local models via Ollama. Observable via Grafana and Prometheus.
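The scoring-and-routing step above amounts to a weighted average over the active profile's criteria, followed by an argmax across models. A minimal sketch — the criterion names, weights, and scores here are illustrative, not the shipped defaults:

```python
def governance_score(judge_scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion judge scores under the active profile."""
    active = {c: w for c, w in weights.items() if c in judge_scores and w > 0}
    total_weight = sum(active.values())
    if total_weight == 0:
        return 0.0
    return sum(judge_scores[c] * w for c, w in active.items()) / total_weight

# Hypothetical profile: weights for two active criteria.
profile = {"data_privacy": 0.6, "prompt_injection": 0.4}

# Judge scores per model for the current use case (illustrative values).
scores = {
    "qwen2.5:1.5b": {"data_privacy": 0.9, "prompt_injection": 0.7},
    "gemma3:1b": {"data_privacy": 0.6, "prompt_injection": 0.95},
}

# Route to the model with the best governance score for this profile.
best = max(scores, key=lambda m: governance_score(scores[m], profile))
```

Note that the weighting makes the choice profile-dependent: a security-heavy profile could flip the ranking toward gemma3 here.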
## Architecture
```
User
  │
  ▼
Frontend :5173 (Vue 3 + ECharts)
  │
  ├──► llm-gateway :8001 ──► LiteLLM ──► Ollama (qwen / gemma / llama / deepseek)
  │            │
  │            └──── Redis pub/sub
  │
  ├──► observability :8002 ──► Prometheus / Grafana / Langfuse
  │
  └──► evaluation :8003 ──► Local judge (Ollama) ──► Benchmark · Matrix · Score
```
Three independent FastAPI microservices share a `back/shared/` layer (Pydantic schemas + config) and communicate via HTTP and Redis pub/sub.
## Screenshots

### Model × use case matrix

Score heatmap per model and use case — auto-routes traffic to the best performer per governance profile.

### Governance profiles & judge configuration

Activate a full compliance profile in one click. Criteria, weights and use cases are configurable from the UI.
## Quickstart

Prerequisites: Docker, docker compose, `uv`.

```sh
git clone https://github.com/JehanneDussert/govllm
cd govllm
cp infra/.env.example infra/.env
# Fill in Langfuse keys

make dev    # hot reload — code changes reflected immediately
# or
make prod   # built images + nginx front

make pull-models
```
Services:

| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Gateway | http://localhost:8001/docs |
| Observability | http://localhost:8002/docs |
| Evaluation | http://localhost:8003/docs |
| Langfuse | http://localhost:3000 |
| Grafana | http://localhost:3001 |
| Prometheus | http://localhost:9090 |
## Governance profiles

Four built-in profiles, each activating a targeted set of criteria and weights:

| Profile | Frameworks | Focus |
|---|---|---|
| AI Act Compliance | EU AI Act Art. 5, 13, 14 | Transparency, human oversight, non-manipulation |
| Data Protection | GDPR, ANSSI | Data privacy, leakage prevention, traceability |
| Security | ANSSI, OWASP LLM Top 10 | Prompt injection, robustness, adversarial inputs |
| Accessibility & Inclusion | RGAA, FALC | Language clarity, cognitive load, inclusive design |
Profiles are applied at runtime — switching a profile updates which criteria are active and their weights without restarting any service. Custom profiles can be created from the Settings view.
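In spirit, activating a profile just swaps which weight set is live — a toy in-memory sketch of that mechanism (profile IDs, criteria, and weights are made up for the example; the real service keeps this state behind the config API):

```python
class JudgeConfig:
    """Holds governance profiles; switching one swaps active criteria at runtime."""

    def __init__(self, profiles: dict):
        self.profiles = profiles
        self.active = None

    def activate(self, profile_id: str) -> None:
        if profile_id not in self.profiles:
            raise KeyError(f"unknown profile: {profile_id}")
        self.active = profile_id  # no restart: next evaluation reads this

    def active_weights(self) -> dict:
        # Only criteria with a non-zero weight are scored by the judge.
        return {c: w for c, w in self.profiles[self.active].items() if w > 0}

config = JudgeConfig({
    "data_protection": {"data_privacy": 0.5, "data_leakage": 0.5, "fairness": 0.0},
    "security": {"prompt_injection": 0.6, "robustness": 0.4},
})
config.activate("security")
```

Because the active profile is read at evaluation time rather than baked into the services, a switch takes effect on the very next request.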
## Judge criteria

The evaluation layer runs a local LLM-as-a-judge after each response. The system prompt is displayed in full in the Settings view. All criteria are configurable from the UI; custom criteria can be added.

| Criterion | Regulatory anchor | Default |
|---|---|---|
| Relevance | Quality baseline | ✅ |
| Factual reliability | AI Act | ✅ |
| Prompt injection | OWASP LLM01, ANSSI | ✅ |
| Data leakage | OWASP LLM02, ANSSI | ✅ |
| Ethical refusal | ANSSI, ethics | ✅ |
| Non-manipulation | AI Act Art. 5 | — |
| Human oversight | AI Act Art. 14 | — |
| Explicability | AI Act Art. 13 | — |
| Transparency | AI Act | — |
| Data privacy | GDPR | — |
| Language clarity | RGAA, FALC | — |
| Cognitive load | RGAA | — |
| Fairness | AI Act, ethics | — |
| Robustness | ANSSI | — |
The judge model runs locally (`ollama/gemma3:1b` by default). Evaluation calls are filtered from the traces view so only user interactions appear.
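One common way to keep judge calls out of the traces view is to tag evaluation requests and filter on that tag. A sketch under that assumption — the `is_judge` metadata flag is illustrative, not necessarily the project's actual field name:

```python
def user_traces(traces: list) -> list:
    """Drop traces tagged as judge/evaluation calls; keep user interactions."""
    return [t for t in traces if not t.get("metadata", {}).get("is_judge", False)]

traces = [
    {"id": "t1", "model": "ollama/qwen2.5:1.5b", "metadata": {}},
    {"id": "t2", "model": "ollama/gemma3:1b", "metadata": {"is_judge": True}},
    {"id": "t3", "model": "ollama/llama3.2:3b", "metadata": {"is_judge": False}},
]
visible = user_traces(traces)
```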
## Model × use case matrix

Scores accumulate per use case in Redis. The matrix view shows which model performs best per task under the active governance profile:

| Use case | qwen2.5:1.5b | llama3.2:3b | gemma3:1b | deepseek-r1:1.5b |
|---|---|---|---|---|
| Summary | 0.84 | 0.71 | 0.69 | 0.72 |
| Translation | 0.79 | 0.88 | 0.74 | 0.71 |
| Code | 0.72 | 0.85 | 0.82 | 0.77 |
| Administrative writing | 0.88 | 0.82 | 0.71 | — |
→ llama3.2 leads on code and translation, qwen2.5 on summaries and administrative writing. The smart router reads this matrix at inference time and routes to the best-scoring model for the active profile and use case.
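The routing lookup is an argmax over a row of the matrix, skipping models that have no score yet for that use case. A minimal sketch using the administrative-writing row above (the fallback model is an assumption for the example):

```python
matrix = {
    "administrative_writing": {
        "qwen2.5:1.5b": 0.88,
        "llama3.2:3b": 0.82,
        "gemma3:1b": 0.71,
        "deepseek-r1:1.5b": None,  # no samples accumulated yet
    },
}

def route(matrix: dict, use_case: str, default: str = "qwen2.5:1.5b") -> str:
    """Return the best-scoring model for a use case; fall back if the row is empty."""
    row = {m: s for m, s in matrix.get(use_case, {}).items() if s is not None}
    if not row:
        return default  # unknown use case or no scores yet
    return max(row, key=row.get)
```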
## Multi-model benchmark

```sh
curl http://localhost:8003/benchmark/results
```

```json
{
  "models": [
    { "model": "ollama/qwen2.5:1.5b", "sample_size": 12, "avg_latency_ms": 4.2, "avg_eval_score": 0.84 },
    { "model": "ollama/gemma3:1b", "sample_size": 9, "avg_latency_ms": 2.1, "avg_eval_score": 0.82 },
    { "model": "ollama/llama3.2:3b", "sample_size": 14, "avg_latency_ms": 8.7, "avg_eval_score": 0.76 },
    { "model": "ollama/deepseek-r1:1.5b", "sample_size": 7, "avg_latency_ms": 5.3, "avg_eval_score": 0.71 }
  ],
  "winner": "ollama/qwen2.5:1.5b",
  "window": "last 50 traces"
}
```
The winner is determined by eval score when scores are available for all models, and by latency otherwise.
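That winner rule — highest eval score when every model has one, lowest latency otherwise — can be sketched as:

```python
def pick_winner(models: list) -> str:
    """Prefer avg eval score if all models have one; otherwise lowest latency."""
    if all(m.get("avg_eval_score") is not None for m in models):
        return max(models, key=lambda m: m["avg_eval_score"])["model"]
    return min(models, key=lambda m: m["avg_latency_ms"])["model"]

results = [
    {"model": "ollama/qwen2.5:1.5b", "avg_latency_ms": 4.2, "avg_eval_score": 0.84},
    {"model": "ollama/gemma3:1b", "avg_latency_ms": 2.1, "avg_eval_score": 0.82},
]
```

Falling back to latency keeps the benchmark usable while eval scores are still accumulating for newly added models.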
## Stack

| Layer | Technology |
|---|---|
| Inference | Ollama — qwen2.5:1.5b · gemma3:1b · llama3.2:3b · deepseek-r1:1.5b |
| Proxy | LiteLLM |
| Backend | FastAPI · Python 3.11 · uv |
| Tracing | Langfuse v2 |
| Metrics | Prometheus + Grafana |
| Event bus | Redis |
| Reverse proxy | Caddy |
| Frontend | Vue 3 · TypeScript · ECharts |
| Infra | Docker Compose |
## API endpoints

### llm-gateway — :8001

```
POST /chat      # chat completion (streaming SSE + non-streaming)
GET  /health
```

### observability — :8002

```
GET /metrics?window=24h   # latency p50/p95/p99, error rate, request count per model
GET /traces?limit=50      # production traces with eval scores (judge traces filtered)
```

### evaluation — :8003

```
GET  /benchmark/results           # multi-model benchmark across all configured models
GET  /matrix                      # use case × model score matrix
GET  /matrix/routing              # recommended model for active profile + use case
GET  /config/judge                # judge configuration
PUT  /config/judge                # update judge configuration
POST /config/judge/profile/{id}   # activate a governance profile
POST /eval/score                  # trigger async evaluation (returns 202 immediately)
GET  /eval/result/{trace_id}      # poll for evaluation result
```
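Client-side, the 202-then-poll pattern for async evaluation looks like this — a sketch with an injected `fetch` callable standing in for the HTTP GET to `/eval/result/{trace_id}` (the response shape shown is an assumption):

```python
import time

def wait_for_result(fetch, trace_id: str, timeout_s: float = 30.0,
                    interval_s: float = 0.5) -> dict:
    """Poll until the evaluation result is ready or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch(trace_id)  # e.g. GET /eval/result/{trace_id}
        if result is not None and result.get("status") == "done":
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"evaluation {trace_id} not ready after {timeout_s}s")
```

Polling keeps the chat path fast: `POST /eval/score` returns immediately, and the UI fills in the score once the judge finishes.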
## Project structure

```
govllm/
├── .env.example
├── Makefile
├── back/
│   ├── shared/src/shared/   # config.py, schemas.py
│   ├── llm-gateway/         # chat endpoint, Redis publisher
│   ├── observability/       # metrics, traces, Grafana proxy
│   └── evaluation/          # judge, benchmark, matrix, eval runner, profiles
├── front/
│   └── src/
│       ├── views/           # Chat, Metrics, Traces, Benchmark, Matrix, Settings
│       ├── components/      # MessageScore (async judge display)
│       ├── stores/          # chat.ts, judge.ts
│       └── api/client.ts
└── infra/
    ├── docker-compose.yml
    ├── docker-compose.dev.yml
    ├── docker-compose.prod.yml
    ├── litellm_config.yaml
    ├── prometheus.yml
    └── grafana/provisioning/
```
## Key design decisions

**Governance from metrics.** Model selection is driven by governance criteria, not performance alone. The score matrix accumulates from real production usage — not synthetic benchmarks.

**Local evaluation judge.** Scoring runs on Ollama — sovereign and usable in air-gapped or regulated environments (public sector, healthcare, finance). No response data is sent to external APIs.

**Profile-driven routing.** Switching a governance profile at runtime updates which criteria are active and their weights. The routing layer reads the active profile from Redis at inference time and recommends the best-scoring model for that profile and use case.

**Shared schema layer.** All three microservices share `back/shared/src/shared/` for Pydantic schemas and config — a single source of truth for data contracts.
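The shared-contract idea in miniature — illustrated here with stdlib dataclasses so the example is self-contained (the project uses Pydantic; the field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class EvalScore:
    """One shape, imported by all three services: what the gateway publishes
    over Redis is exactly what the evaluation service consumes."""
    trace_id: str
    model: str
    criteria: dict = field(default_factory=dict)  # criterion -> score in [0, 1]

    def overall(self) -> float:
        return sum(self.criteria.values()) / len(self.criteria) if self.criteria else 0.0

score = EvalScore(trace_id="t1", model="ollama/gemma3:1b",
                  criteria={"relevance": 0.8, "data_leakage": 1.0})
```

With one definition imported everywhere, a schema change breaks at import time in every service at once instead of drifting silently between them.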
**Judge traces filtered.** Evaluation calls to LiteLLM are excluded from the traces view so only user interactions appear.

**Dev/prod parity via compose overrides.** `make dev` mounts source volumes with `--reload`. `make prod` builds images and serves the front via nginx. Same base compose file, no drift.
## Roadmap

### Governance

- [ ] Governance-driven routing — enforce model selection based on governance profile scores, block non-compliant models automatically
- [ ] Drift detection — automatic score trend alerts, quarantine on threshold breach
- [ ] Audit log export — consolidated compliance report (`/audit/export`) for CISO review
- [ ] Judge specialisation — assign different judge models per regulatory criterion
