# AgentUnit

AgentUnit is a pytest-inspired harness for evaluating, monitoring, and benchmarking autonomous agents, multi-agent systems, and retrieval-augmented generation (RAG) workflows. It helps you describe repeatable scenarios, connect them to your agent stack, and score results with both heuristic and LLM-backed metrics, standardising how teams define scenarios, run experiments, and report outcomes across adapters, model providers, and deployment targets.
## Overview
- **Scenario-centric design** – describe datasets, adapters, and policies once, then reuse them in local runs, CI jobs, and production monitors.
- **Extensible adapters** – plug into LangGraph, CrewAI, PromptFlow, OpenAI Swarm, Anthropic Bedrock, Phidata, and custom agents through a consistent interface.
- **Comprehensive metrics** – combine exact-match assertions, RAGAS quality scores, and operational metrics with optional OpenTelemetry traces.
- **Production-first tooling** – export JSON, Markdown, and JUnit reports, gate releases with regression detection, and surface telemetry in existing observability stacks.
## Installation
AgentUnit requires Python 3.10 or later. The recommended workflow uses Poetry for dependency management.
```bash
git clone https://github.com/aviralgarg05/agentunit.git
cd agentunit
poetry install
poetry shell
```
To use pip instead:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
Optional integrations are published as extras; install only what you need:
```bash
poetry install --extras "promptflow crewai langgraph"
# or with pip
pip install "agentunit[promptflow,crewai,langgraph]"
```
### Run CI locally

Prerequisites: Python 3.10+ and Poetry.

```bash
poetry install --with dev
poetry check
poetry run ruff check .
poetry run ruff format --check .
poetry run pytest
```
### Optional Extras
| Extra | Includes | Use Case |
|-------|----------|----------|
| `promptflow` | `promptflow>=1.0.0` | Azure PromptFlow integration |
| `crewai` | `crewai>=0.201.1` | CrewAI multi-agent orchestration |
| `langgraph` | `langgraph>=1.0.0a4` | LangGraph state machines |
| `openai` | `openai>=1.0.0` | OpenAI models and Swarm |
| `anthropic` | `anthropic>=0.18.0` | Claude/Bedrock integration |
| `phidata` | `phidata>=2.0.0` | Phidata agents |
| `all` | All above extras | Complete installation |
Refer to the adapters guide for per-adapter requirements and feature support matrices.
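Because integrations are optional, importing an extra-gated adapter fails when its extra is missing. A minimal guard sketch, assuming a `LangGraphAdapter` class (the class name is an assumption; docs/adapters.md lists what each extra actually provides):

```python
# Hypothetical guard for an extra-gated adapter; the class name is an
# assumption - see docs/adapters.md for the adapters each extra exports.
try:
    from agentunit.adapters import LangGraphAdapter
except ImportError:
    LangGraphAdapter = None  # install with: pip install "agentunit[langgraph]"
```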
## Quickstart

### 2-Minute Copy-Paste Example

Create a file `example_suite.py`:
```python
from agentunit import Scenario, DatasetCase, Runner
from agentunit.adapters import MockAdapter
from agentunit.metrics import ExactMatch

# Define test cases
cases = [
    DatasetCase(
        id="math_1",
        query="What is 2 + 2?",
        expected_output="4",
    ),
    DatasetCase(
        id="capital_1",
        query="What is the capital of France?",
        expected_output="Paris",
    ),
]

# Create scenario
scenario = Scenario(
    name="Basic Q&A Test",
    adapter=MockAdapter(),  # Replace with your adapter
    dataset=cases,
    metrics=[ExactMatch()],
)

# Run evaluation
runner = Runner()
results = runner.run(scenario)

# Print results
print(f"Success rate: {results.success_rate:.1%}")
print(f"Average latency: {results.avg_latency:.2f}s")
```
Run it:

```bash
python example_suite.py
```
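`MockAdapter` is a stand-in; to evaluate a real system you swap in an adapter that calls your agent stack. The sketch below illustrates that shape only and is not the confirmed adapter contract (the `execute` method name is an assumption; see docs/adapters.md for the real interface):

```python
from agentunit import DatasetCase

# Hypothetical adapter sketch: the method name and signature are
# assumptions for illustration; consult docs/adapters.md for the real API.
class LookupAdapter:
    """Toy adapter that answers from a fixed table instead of a live agent."""

    ANSWERS = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }

    def execute(self, case: DatasetCase) -> str:
        # A real adapter would invoke your agent stack here.
        return self.ANSWERS.get(case.query, "unknown")
```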
### YAML Configuration Example

Create `example_suite.yaml`:
```yaml
name: "Customer Support Q&A"
description: "Evaluate customer support agent responses"

adapter:
  type: "openai"
  config:
    model: "gpt-4"
    temperature: 0.7
    max_tokens: 500

dataset:
  cases:
    - input: "How do I reset my password?"
      expected: "Use the 'Forgot Password' link on the login page"
      metadata:
        category: "account"
    - input: "What are your business hours?"
      expected: "Monday-Friday 9AM-5PM EST"
      metadata:
        category: "general"

metrics:
  - "exact_match"
  - "semantic_similarity"
  - "latency"

timeout: 30
retries: 2
```
Run it with the CLI:

```bash
agentunit example_suite.yaml \
  --json results.json \
  --markdown results.md \
  --junit results.xml
```
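The JSON export can double as a release gate in CI. A minimal sketch, assuming the report carries a top-level success rate (the `success_rate` key is hypothetical; inspect your generated `results.json` for the actual schema):

```python
import json
import sys

# Hypothetical CI gate: "success_rate" is an assumed key; check the
# structure of your own results.json before relying on it.
with open("results.json") as fh:
    report = json.load(fh)

THRESHOLD = 0.9
rate = report.get("success_rate", 0.0)
if rate < THRESHOLD:
    sys.exit(f"Evaluation gate failed: {rate:.1%} < {THRESHOLD:.0%}")
```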
## Getting started
- Follow the Quickstart above for a 2-minute runnable example.
- Review Writing Scenarios for dataset and adapter templates plus helper constructors for popular frameworks.
- Consult the CLI reference to orchestrate suites from the command line and export results for CI, dashboards, or audits.
- Explore the adapters guide for concrete adapter implementations and feature support.
- Check the metrics catalog for all available evaluation metrics.
## CLI Usage

AgentUnit exposes an `agentunit` CLI entry point once installed. Typical usage:
```bash
agentunit path.to.suite \
  --metrics faithfulness answer_correctness \
  --json reports/results.json \
  --markdown reports/results.md \
  --junit reports/results.xml
```
Programmatic runners are available through `agentunit.core.Runner` for notebook- or script-driven workflows.
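A minimal script-driven sketch, reusing the quickstart objects (only the fields already shown in the quickstart are assumed to exist on the result object):

```python
from agentunit import Runner
from example_suite import scenario  # the suite defined in the quickstart

# Run the same scenario from a script or notebook instead of the CLI.
runner = Runner()
results = runner.run(scenario)

print(f"Success rate: {results.success_rate:.1%}")
print(f"Average latency: {results.avg_latency:.2f}s")
```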
## Documentation map

| Topic | Reference |
| --- | --- |
| Quick evaluation walkthrough | Quickstart |
| Scenario and adapter authoring | docs/writing-scenarios.md |
| Adapter implementations guide | docs/adapters.md |
| Metrics catalog and reference | docs/metrics-catalog.md |
| CLI options and examples | docs/cli.md |
| Architecture overview | docs/architecture.md |
| Framework-specific guides | docs/platform-guides.md |
| No-code builder guide | docs/nocode-quickstart.md |
| OpenTelemetry integration | docs/telemetry.md |
| Performance testing | docs/performance-testing.md |
| Comparison to other tools | docs/comparison.md |
| Templates | docs/templates/README.md |
Use the table above as the canonical navigation surface; every document cross-links back to related topics for clarity.
## Development workflow
- Install dependencies (Poetry or pip).
- Run the test suite:

  ```bash
  # Run all tests (unit + integration)
  poetry run python3 -m pytest tests -v

  # Run only unit tests (skip integration tests)
  poetry run python3 -m pytest -m "not integration" -v

  # Run only integration tests (requires framework dependencies)
  poetry run python3 -m pytest tests/integration/ -v
  ```

- Execute targeted suites during active development, then run the full matrix before opening a pull request.
**Integration tests:** The `tests/integration/` directory contains tests that verify AgentUnit works with real framework implementations (LangGraph, etc.). These tests are automatically skipped if the required dependencies are not installed. See `tests/integration/README.md` for details.
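A minimal sketch of the dependency-guard pattern such tests typically rely on, using pytest's built-in `importorskip` (the test body is illustrative, not copied from the repo):

```python
import pytest

# Skip this module cleanly when the optional framework is absent,
# instead of failing at import time.
langgraph = pytest.importorskip("langgraph")


def test_langgraph_is_importable():
    assert langgraph is not None
```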
Latest verification (2025-10-24): 144 passed, 10 skipped, 32 warnings. Warnings originate from third-party dependencies (langchain pydantic shim deprecations and `datetime.utcnow` usage). Track upstream fixes or pin patched releases as needed.
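If those third-party warnings clutter local runs, pytest's `-W` flag can suppress them until upstream fixes land; a broad example (prefer narrowing the filter to the specific messages in your own output):

```bash
# Suppress DeprecationWarnings for a quieter local run; narrow the filter
# to specific messages where possible.
poetry run pytest -W "ignore::DeprecationWarning" tests -v
```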
## Running CI Checks Locally

Before opening a pull request, you can run the same checks locally that are executed in CI.

Requirements:

- Python 3.10 or higher
- Poetry installed

Install dependencies (including dev tools):

```bash
poetry install --with dev
```

Run all checks (same as CI):

```bash
poetry run ruff check .
poetry run ruff format --check .
poetry run pytest tests -v
```
## Contributing
We welcome contributions! Please see CONTRIBUTING.md for:
- Development setup and workflow
- Code style and linting guidelines
- Testing requirements
- Pull request process
- Issue labels and tags for open source events
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
Security disclosures and sensitive topics should follow the responsible disclosure guidelines outlined in SECURITY.md.
## Research & Citation
AgentUnit is designed as a research-grade framework. If you use AgentUnit in your research, please see CITATION.md for citation information and reproducibility standards.
## License
AgentUnit is released under the MIT License. See LICENSE for the full text.
Need an overview for stakeholders? Start with docs/architecture.md. Ready to extend the platform? Explore the templates under docs/templates/README.md.