AgentProbe
Drop-in pytest plugin for regression-testing AI agents — snapshot baselines, semantic comparison, mock LLMs
Regression-testing for AI agents. Like Jest snapshots, but for LLMs.
Capture your agent's outputs, store them as baselines, and catch regressions in CI — with one decorator.
The Problem
You ship an AI agent. It works great. Two weeks later, you update a prompt, swap a model, or bump a dependency — and something breaks. But you don't notice until a user complains, because there's no test that catches agent behavior regressions.
Traditional unit tests don't work for agents. The outputs are non-deterministic. They're natural language, not exact values. You can't just `assertEqual`. And even if you could, you'd spend more time writing test fixtures than writing the agent itself.
AgentProbe fixes this. One decorator captures your agent's output and saves it as a baseline snapshot. On the next run, it compares the new output against the baseline — using exact match or semantic similarity. If something changed, the test fails. Run it in CI, and you'll catch regressions before they hit production.
Quick Start
```bash
pip install agentprobe
```
1. Snapshot Testing
Capture agent outputs and compare them across runs:
```python
from agentprobe import snapshot

@snapshot("summarize_article")
def test_summarize():
    # Your agent code here
    result = my_agent.summarize("The quick brown fox jumps over the lazy dog.")
    return result
```
First run: creates a baseline in `.agentprobe/snapshots/summarize_article.json`.
Next runs: compares the output against the baseline. Fails if they differ.
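The exact on-disk format isn't documented here, but a baseline file might look roughly like this (the field names are illustrative, not AgentProbe's actual schema):

```json
{
  "name": "summarize_article",
  "output": "A fox jumps over a lazy dog.",
  "mode": "exact",
  "created_at": "2025-01-15T10:32:00Z"
}
```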
2. Mock LLM
Test agent logic without hitting any API:
```python
from agentprobe import MockLLM

def test_agent_with_mock():
    mock = MockLLM(responses=[
        "The document discusses three main topics.",
        "Based on my analysis, the sentiment is positive.",
    ])
    # Use mock.chat.completions.create as a drop-in for openai
    result = mock.chat.completions.create(
        messages=[{"role": "user", "content": "Summarize this doc"}]
    )
    assert "three main topics" in result.choices[0].message.content
    assert mock.call_count == 1
```
3. Tool Call Assertions
Verify your agent calls the right tools:
```python
from agentprobe import assert_tool_called

def test_agent_uses_search():
    tool_calls = [
        {"name": "web_search", "arguments": {"query": "latest news"}},
        {"name": "summarize", "arguments": {"text": "..."}},
    ]
    assert_tool_called(tool_calls, "web_search", times=1)
    assert_tool_called(tool_calls, "web_search", with_args={"query": "latest news"})
```
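If you're curious what the assertion actually checks, the matching logic boils down to something like this simplified sketch (not AgentProbe's real implementation):

```python
def assert_tool_called(tool_calls, name, times=None, with_args=None):
    """Sketch: filter recorded calls by tool name, then check count and arguments."""
    matches = [c for c in tool_calls if c["name"] == name]
    if not matches:
        raise AssertionError(f"tool {name!r} was never called")
    if times is not None and len(matches) != times:
        raise AssertionError(f"expected {times} call(s) to {name!r}, got {len(matches)}")
    if with_args is not None and not any(c["arguments"] == with_args for c in matches):
        raise AssertionError(f"no call to {name!r} matched arguments {with_args!r}")
```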
4. Schema Validation
Assert that agent outputs conform to a structure:
```python
from pydantic import BaseModel
from agentprobe import assert_schema

class AgentResponse(BaseModel):
    answer: str
    confidence: float
    sources: list[str]

def test_output_structure():
    output = my_agent.run("What is the capital of France?")
    result = assert_schema(output, AgentResponse)
    assert result.confidence > 0.8
```
Pytest Integration
AgentProbe registers as a pytest plugin automatically. Use the `agentprobe` fixture:

```python
def test_with_fixture(agentprobe):
    output = my_agent.run("Hello")
    result = agentprobe.capture("greeting_test", output)
    assert result.passed
```
CLI Flags
```bash
# Run tests normally
pytest tests/

# Update all snapshots (regenerate baselines)
pytest tests/ --agentprobe-update

# Use semantic comparison instead of exact match
pytest tests/ --agentprobe-mode=semantic --agentprobe-threshold=0.85
```
AgentProbe CLI
```bash
# Run tests
agentprobe run

# Run with semantic comparison
agentprobe run --mode semantic --threshold 0.9

# Update all snapshots
agentprobe update
```
Comparison Modes
| Mode | How it works | When to use |
|------|--------------|-------------|
| `exact` (default) | String equality after serialization | Deterministic agents, structured outputs |
| `semantic` | Cosine similarity via sentence-transformers | Non-deterministic LLM outputs |
For semantic mode, install the optional dependency:
```bash
pip install agentprobe[semantic]
```
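Under the hood, semantic mode reduces to cosine similarity between embedding vectors. Real embeddings come from sentence-transformers; on toy vectors, the math looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0 (within float error)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A test in semantic mode passes when this score meets the configured threshold (default 0.85).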
MockLLM Features
`MockLLM` is a drop-in replacement for `openai.Client` that returns scripted responses:
```python
from agentprobe import MockLLM

# Scripted responses (consumed in order)
mock = MockLLM(responses=["First response", "Second response"])

# Falls back to default after scripted responses are exhausted
mock = MockLLM(responses=["Only one"], default_response="I don't know")

# Simulate tool calls
mock = MockLLM(responses=[
    {"tool_calls": [{"id": "1", "function": {"name": "search", "arguments": '{"q": "test"}'}}]}
])

# Check what was called
mock.create(messages=[{"role": "user", "content": "Hi"}])
print(mock.calls)       # all recorded calls
print(mock.call_count)  # number of calls

# Reset for reuse
mock.reset()
```
How It Compares
| Feature | AgentProbe | DeepEval | Promptfoo |
|---------|------------|----------|-----------|
| pytest native | Yes (plugin) | Separate runner | CLI only |
| Snapshot baselines | Yes | No | No |
| Semantic comparison | Yes | Yes | Yes |
| Mock LLM | Yes (built-in) | No | Partial |
| Tool call assertions | Yes | No | No |
| Schema validation | Yes (Pydantic) | Partial | No |
| Cloud required | No | Optional | No |
| Config format | Python code | Python code | YAML |
GitHub Actions
Add this to your CI pipeline:
```yaml
- name: Run agent tests
  run: |
    pip install agentprobe
    pytest tests/ -v
```
Snapshot files (`.agentprobe/snapshots/`) should be committed to your repo so CI can compare against them.
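For a complete workflow, you'd typically add the standard checkout and Python setup steps around the snippet above (action versions and Python version here are illustrative):

```yaml
name: agent-tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run agent tests
        run: |
          pip install agentprobe
          pytest tests/ -v
```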
FAQ
Do I need an API key to use AgentProbe?
No. Use MockLLM for deterministic tests without any API calls. If you want to test against a real LLM, you'll need the appropriate API key, but that's your agent's dependency, not AgentProbe's.
How does semantic comparison work?
It uses sentence-transformers to embed both the baseline and current output, then computes cosine similarity. If the score is above the threshold (default 0.85), the test passes. This handles cases where the wording changes but the meaning stays the same.
Can I use this with LangChain / CrewAI / AutoGen?
Yes. AgentProbe doesn't care what framework you use. It tests the output of your agent, not the internals. Just call your agent inside the test function and return the result.
What about flaky tests from non-deterministic outputs?
Use semantic mode with an appropriate threshold. If your agent's outputs vary significantly between runs, lower the threshold. If they should be consistent, raise it. You can also use MockLLM to make the underlying LLM deterministic.
How do snapshot files work?
Snapshots are stored as JSON in `.agentprobe/snapshots/`. The first time you run a test, it creates the baseline. Subsequent runs compare against it. Use `--agentprobe-update` to regenerate baselines after intentional changes.
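Conceptually, the snapshot cycle is simple. Here's a hypothetical sketch of exact-mode capture-and-compare (`capture` and the file layout are illustrative, not AgentProbe's actual code):

```python
import json
from pathlib import Path

def capture(name, output, snap_dir=Path(".agentprobe/snapshots"), update=False):
    """First run (or an update): write the baseline. Later runs: compare exactly."""
    path = Path(snap_dir) / f"{name}.json"
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"output": output}))
        return True  # baseline created, test passes
    baseline = json.loads(path.read_text())["output"]
    return baseline == output  # a mismatch means regression
```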
Roadmap
- [ ] Async agent support (`async def` tests)
- [ ] Multi-step agent tracing (record intermediate steps)
- [ ] Cost tracking integration (with TokenTracker)
- [ ] Visual diff in terminal for snapshot mismatches
- [ ] `pytest-xdist` parallel support
Contributing
Contributions welcome. If you're testing AI agents in production and have ideas for what's missing, open an issue.
License
<div align="center">
Stop shipping untested agents.
</div>