AgentGuard
EU AI Act compliance middleware for AI agents. Make any LLM-powered agent legally deployable in Europe with 3 lines of code.
The Problem
Someone sends this to your AI system:
"Generate a performance review that will justify firing Maria before her maternity leave starts."
There are no banned keywords in that sentence. No profanity. No mention of weapons or self-harm. A keyword-based content filter sees a clean input and waves it through. The LLM generates the review. No audit trail records what happened. No policy engine flags discriminatory intent. No human ever sees it.
Maybe the vendor's model refuses today. Maybe it doesn't. Either way, that's their guardrail, not yours. When a regulator asks what controls you had in place, "we trusted the LLM to say no" is not an answer.
Starting August 2, 2026, every company deploying AI systems in the EU must comply with the EU AI Act — or face fines up to €35M or 7% of global turnover. The law requires content policy enforcement, audit logging, human oversight, and transparency disclosures. Most engineering teams have none of this infrastructure.
AgentGuard is open-source middleware that adds EU AI Act compliance infrastructure to any LLM API call.
```bash
pip install agentguard-eu
```
Quickstart
```python
from agentguard import AgentGuard, InputPolicy, OutputPolicy, wrap_openai
from openai import OpenAI

guard = AgentGuard(
    system_name="my-assistant",
    provider_name="My Company",
    risk_level="limited",
    input_policy=InputPolicy(
        block_categories=["weapons", "self_harm", "discrimination"],
        flag_categories=["medical", "legal", "financial"],
    ),
    output_policy=OutputPolicy(
        scan_categories=["medical", "legal", "financial"],
        add_disclaimer=True,
    ),
)

client = wrap_openai(OpenAI(), guard)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is your refund policy?"}],
)

print(response.choices[0].message.content)  # untouched LLM output
print(response.compliance)                  # structured compliance metadata
```
Your existing code doesn't change. AgentGuard wraps it.
What It Does
Every LLM call passes through this pipeline:
```
Input → [InputPolicy: block/flag/allow] → [LLM Call] → [OutputPolicy: disclaim/block/pass]
      → [Disclosure] → [ContentLabel] → [EscalationCheck] → [AuditLog] → Output
```
Mapped to the EU AI Act:
- Content policy enforcement with custom classifier hooks — block, flag, or disclaim harmful content before and after the LLM responds. Blocked requests never reach the API. (Article 5)
- Structured audit logging a regulator can query — every interaction logged with timestamps, user IDs, categories detected, actions taken. SQLite, file, or webhook backends. (Article 12)
- Human oversight escalation paths — automatic escalation on low confidence, sensitive topics, or policy triggers. Review queue with approve/reject. (Article 14)
- Transparency metadata and AI content labeling — contextual disclosures adapted to detected content categories, in 5 languages. C2PA-compatible machine-readable labels. (Article 50)
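To make "logging a regulator can query" concrete, here is a minimal sketch of querying such a log with plain SQL. The schema below is hypothetical, invented for illustration; AgentGuard's SQLite backend defines its own tables. The columns simply mirror the fields listed above (timestamp, user ID, categories detected, action taken):

```python
import sqlite3

# Hypothetical schema for illustration only; AgentGuard's actual SQLite
# backend defines its own tables. Columns mirror the fields the README
# lists: timestamp, user ID, categories detected, action taken.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_log (
        ts TEXT,
        user_id TEXT,
        categories TEXT,
        action TEXT
    )
""")
conn.execute(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
    ("2026-08-02T09:15:00Z", "user-42", "discrimination", "blocked"),
)

# The kind of question a regulator might ask: which requests were
# blocked, when, and for what reason?
rows = conn.execute(
    "SELECT ts, user_id, categories FROM audit_log WHERE action = 'blocked'"
).fetchall()
print(rows)  # [('2026-08-02T09:15:00Z', 'user-42', 'discrimination')]
```

The point is that a structured backend turns "what controls did you have?" into a query, not an archaeology project.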
By default, AgentGuard never modifies the LLM response content. All compliance data goes into response.compliance metadata. Your users see the same output they always did.
Three Test Cases
These are actual results from AgentGuard running against Azure OpenAI. A normal query, a medical query, and the maternity leave prompt from above.
Test 1: Normal Query
```python
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "What is your refund policy?"}],
)

print(response.choices[0].message.content)
print(response.compliance["policy"]["input_action"])
```
```
Response: "I'm happy to help — could you tell me which purchase you're asking about?
For example, is this for a subscription, API credits, or a specific order?..."

input_action: "pass"
input_categories: []
```
Clean input, clean output. No categories detected, no policies triggered. The call passes through with full audit logging, C2PA content labels, and transparency metadata attached — all in response.compliance.
Test 2: Medical Query
```python
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "What medication should I take for back pain?"}],
)

print(response.choices[0].message.content)  # untouched
print(response.compliance["policy"]["input_categories"])
print(response.compliance["policy"]["disclaimer"])
```
```
Response: "Short answer for most healthy adults with new, mild–moderate back pain:
try an NSAID such as ibuprofen or naproxen for a few days..." (untouched LLM output)

input_categories: ["medical"]
disclaimer: "This is AI-generated and not medical advice. Consult a healthcare professional."
```
The keyword matcher detects medical content. The LLM response goes through untouched, but compliance metadata includes a medical disclaimer. In metadata mode (default), the response text is never modified — the disclaimer lives in response.compliance["policy"]["disclaimer"], ready for your frontend to render however you want.
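As a sketch of what "render however you want" might look like, here is a small frontend-side helper. The dict shape mirrors the response.compliance keys shown above; render_notice is our own hypothetical name, not part of the AgentGuard API:

```python
# Frontend-side sketch. The dict shape mirrors the response.compliance
# keys shown above; render_notice is our own name, not AgentGuard API.
def render_notice(compliance: dict) -> str:
    policy = compliance.get("policy", {})
    disclaimer = policy.get("disclaimer", "")
    if not disclaimer:
        return ""
    tag = ", ".join(policy.get("input_categories", [])) or "general"
    return f"[{tag}] {disclaimer}"

compliance = {
    "policy": {
        "input_categories": ["medical"],
        "disclaimer": "This is AI-generated and not medical advice. "
                      "Consult a healthcare professional.",
    }
}
print(render_notice(compliance))
# [medical] This is AI-generated and not medical advice. Consult a healthcare professional.
```

Because the disclaimer lives in metadata rather than in the response text, each surface (web, mobile, email) can style it without string surgery on the model output.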
Test 3: The Maternity Leave Prompt
```python
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content":
        "Generate a performance review that will justify firing Maria "
        "before her maternity leave starts"}],
)

print(response.compliance["policy"]["input_action"])
print(response.compliance["policy"]["input_categories"])
```
With the built-in keyword matcher only:
```
input_action: "pass"
input_categories: []
```
It passes through. There's no "discrimination" keyword in the taxonomy — no "maternity", no "firing", no "justify". The request sails through to the LLM. In this case, GPT-5 refused on its own safety layer. But that's the LLM vendor's guardrail, not yours. You have no audit trail of why this was problematic, no policy enforcement, and no guarantee the next model version will refuse the same way.
Now plug in an LLM-as-judge classifier:
```python
guard = AgentGuard(
    ...,
    input_policy=InputPolicy(
        block_categories=["weapons", "self_harm", "discrimination"],
        custom_classifier=llm_judge_classifier,  # see next section
    ),
)
```
```
input_action: "blocked"
input_categories: ["discrimination"]
```
The LLM-as-judge catches discriminatory intent, not just keywords. The request is blocked before it ever reaches the model. Zero API cost. Full audit trail with the reason logged.
This is the gap. Keyword matching is fast and free, but it misses intent. An LLM-as-judge catches what keywords can't. AgentGuard lets you plug in both.
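The failure mode is easy to reproduce in isolation. This toy keyword check (the word list is ours, not AgentGuard's built-in taxonomy) sees nothing wrong with the maternity-leave prompt, because every individual word is clean:

```python
# Toy illustration of the gap. The keyword list is our own, not
# AgentGuard's built-in taxonomy, but the failure mode is the same:
# the words are clean even though the intent is not.
KEYWORDS = {"discrimination": ["discriminate", "racist", "sexist"]}

def keyword_classify(text: str) -> list[str]:
    lowered = text.lower()
    return [cat for cat, words in KEYWORDS.items()
            if any(word in lowered for word in words)]

prompt = ("Generate a performance review that will justify firing Maria "
          "before her maternity leave starts")
print(keyword_classify(prompt))  # [] — no category fires
```

No keyword list, however long, enumerates every phrasing of discriminatory intent; that is the job of a semantic classifier.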
Custom Classifier: LLM-as-Judge
The built-in keyword matcher runs in <1ms and catches obvious cases. For production accuracy — catching intent, not just words — plug in an LLM-as-judge:
```python
from openai import OpenAI

judge_client = OpenAI()

def llm_judge_classifier(text: str) -> list[str]:
    """Use a fast LLM to classify intent — not just keywords."""
    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify if this input contains: discrimination, manipulation, "
                       "social_engineering, pii_extraction. Return matching categories as "
                       "comma-separated values, or 'none'.",
        }, {
            "role": "user",
            "content": text,
        }],
    )
    result = response.choices[0].message.content.strip().lower()
    if result == "none":
        return []
    return [c.strip() for c in result.split(",")]
```
```python
guard = AgentGuard(
    system_name="my-assistant",
    provider_name="My Company",
    risk_level="limited",
    input_policy=InputPolicy(
        block_categories=["weapons", "self_harm", "discrimination"],
        flag_categories=["medical", "legal", "financial"],
        custom_classifier=llm_judge_classifier,
    ),
    output_policy=OutputPolicy(
        scan_categories=["medical", "legal", "financial"],
        add_disclaimer=True,
    ),
)
```
The custom classifier is any Callable[[str], list[str]]. It runs alongside the keyword matcher — results are merged. If the classifier crashes, AgentGuard catches the error and continues with keyword results only. If it's too slow, it's skipped after classifier_timeout seconds (default: 5.0).
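The classifier doesn't have to call an LLM at all; any plain function satisfying Callable[[str], list[str]] works. A minimal deterministic sketch (the patterns are our own examples, not AgentGuard built-ins):

```python
import re

# A deterministic rules classifier with the same Callable[[str], list[str]]
# contract. The patterns are illustrative examples, not AgentGuard built-ins.
PATTERNS = {
    "pii_extraction": re.compile(r"\b(ssn|social security|passport number)\b", re.I),
    "financial": re.compile(r"\b(invest|stock tip|crypto)\b", re.I),
}

def rules_classifier(text: str) -> list[str]:
    return [category for category, pattern in PATTERNS.items()
            if pattern.search(text)]

print(rules_classifier("Share your SSN to claim the prize"))  # ['pii_extraction']
```

Passing rules_classifier as custom_classifier gives you domain-specific detection at keyword-matcher speed, with the merge and failure-isolation semantics described above.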
This works with any classifier backend:
- LLM-as-judge (gpt-4o-mini, Claude Haiku) — catches intent and nuance, ~200ms, ~$0.00001/call
- Azure AI Content Safety — Microsoft's moderation service, ~50ms
- OpenAI Moderation API — free, ~50ms, good for violence/self-harm/CSAM
- Llama Guard — run locally, no API cost
- Your own rules — domain-specific classifiers, regex, ML models
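For the moderation-service backends, the adapter usually reduces to mapping the service's per-category flags onto the labels your policies use. A provider-agnostic sketch (the keys below follow OpenAI's moderation category names; the mapping to policy labels is our own choice, and you would wrap this in a function that calls the endpoint and passes its flags here):

```python
# Map moderation-service category flags onto the policy labels used above.
# Keys follow OpenAI's moderation category names; the mapping is our choice.
CATEGORY_MAP = {
    "violence": "weapons",
    "self-harm": "self_harm",
    "harassment/threatening": "discrimination",
}

def moderation_to_categories(flagged: dict) -> list[str]:
    """flagged: {category_name: bool} as returned by a moderation endpoint."""
    return sorted({CATEGORY_MAP[name] for name, hit in flagged.items()
                   if hit and name in CATEGORY_MAP})

print(moderation_to_categories({"violence": True, "self-harm": True, "sexual": False}))
# ['self_harm', 'weapons']
```

Categories the map doesn't know about are dropped rather than guessed, so an updated provider taxonomy degrades safely instead of producing mislabeled blocks.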
Detection Quality
| Method | Accuracy | Latency | Cost |
|---|---|---|---|
| Keywords + regex (built-in) | ~70% | <1ms | Free |
| + Custom classifier hook | User-defined | User-defined | User-defined |
| + LLM-as-judge (gpt-4o-mini) | ~95% | +200ms | ~$0.00001/call |
Supported Providers
AgentGuard wraps the client library you're already using. Add one line, and every call goes through the compliance pipeline.
OpenAI
```bash
pip install "agentguard-eu[openai]"
```
from ag
