# AgentEval
<p align="center"> <img src="assets/AgentEval_bounded.png" alt="AgentEval Logo" width="450" /> </p> <p align="center"> <strong>The .NET Evaluation Toolkit for AI Agents</strong> </p> <p align="center"> <a href="https://github.com/AgentEvalHQ/AgentEval/actions/workflows/ci.yml"><img src="https://github.com/AgentEvalHQ/AgentEval/actions/workflows/ci.yml/badge.svg" alt="Build" /></a> <a href="https://github.com/AgentEvalHQ/AgentEval/actions/workflows/security.yml"><img src="https://github.com/AgentEvalHQ/AgentEval/actions/workflows/security.yml/badge.svg" alt="Security" /></a> <a href="https://codecov.io/gh/AgentEvalHQ/AgentEval"><img src="https://codecov.io/gh/AgentEvalHQ/AgentEval/graph/badge.svg?token=Y28TAK3LNH" alt="Coverage" /></a> <a href="https://joslat.github.io/AgentEval/"><img src="https://img.shields.io/badge/docs-GitHub%20Pages-blue" alt="Documentation" /></a> <a href="https://www.nuget.org/packages/AgentEval"><img src="https://img.shields.io/nuget/v/AgentEval.svg" alt="NuGet" /></a> <a href="https://github.com/AgentEvalHQ/AgentEval/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="License" /></a> <img src="https://img.shields.io/badge/MAF-1.0.0--rc3-blueviolet" alt="MAF 1.0.0-rc3" /> <img src="https://img.shields.io/badge/.NET-8.0%20|%209.0%20|%2010.0-512BD4" alt=".NET 8.0 | 9.0 | 10.0" /> </p>

AgentEval is the comprehensive .NET toolkit for AI agent evaluation—tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison—built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS and DeepEval do for Python, AgentEval does for .NET, with the fluent assertion APIs .NET developers expect.
For years, agentic developers have imagined writing evaluations like this. Today, they can.
> [!WARNING]
> **Preview — Use at Your Own Risk**
>
> This project is experimental (work in progress). APIs and behavior may change without notice. Do not use in production or safety-critical systems without independent review, testing, and hardening.
>
> Portions of the code, tests, and documentation were created with assistance from AI tools and reviewed by maintainers. Despite review, errors may exist — you are responsible for validating correctness, security, and compliance for your use case.
>
> Licensed under the MIT License — provided "AS IS" without warranty. See LICENSE and DISCLAIMER.md.
## The Code You Have Been Dreaming Of

### Compare Models, Get a Winner, Ship with Confidence
```csharp
var stochasticRunner = new StochasticRunner(harness);
var comparer = new ModelComparer(stochasticRunner);

var result = await comparer.CompareModelsAsync(
    factories: new IAgentFactory[]
    {
        new AzureModelFactory("gpt-4o", "GPT-4o"),
        new AzureModelFactory("gpt-4o-mini", "GPT-4o Mini"),
        new AzureModelFactory("gpt-35-turbo", "GPT-3.5 Turbo")
    },
    testCases: agenticTestSuite,
    metrics: new[] { new ToolSuccessMetric(), new RelevanceMetric(evaluator) },
    options: new ComparisonOptions(RunsPerModel: 5));

Console.WriteLine(result.ToMarkdown());
```
Output:

```markdown
## Model Comparison Results

| Rank | Model         | Tool Accuracy | Relevance | Mean Latency | Cost/1K Req |
|------|---------------|---------------|-----------|--------------|-------------|
| 1    | GPT-4o        | 94.2%         | 91.5      | 1,234ms      | $0.0150     |
| 2    | GPT-4o Mini   | 87.5%         | 84.2      | 456ms        | $0.0003     |
| 3    | GPT-3.5 Turbo | 72.1%         | 68.9      | 312ms        | $0.0005     |

**Recommendation:** GPT-4o - Highest tool accuracy (94.2%)
**Best Value:** GPT-4o Mini - 87.5% accuracy at 50x lower cost
```
### Assert on Tool Chains Like You Have Always Imagined
```csharp
result.ToolUsage!.Should()
    .HaveCalledTool("SearchFlights", because: "must search before booking")
        .WithArgument("destination", "Paris")
        .WithDurationUnder(TimeSpan.FromSeconds(2))
    .And()
    .HaveCalledTool("BookFlight", because: "booking follows search")
        .AfterTool("SearchFlights")
        .WithArgument("flightId", "AF1234")
    .And()
    .HaveCallOrder("SearchFlights", "BookFlight", "SendConfirmation")
    .HaveNoErrors();
```
### Stochastic Evaluation: Because LLMs Are Non-Deterministic
LLMs don't return the same output every time. Run evaluations multiple times and analyze statistics:
```csharp
var result = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase,
    new StochasticOptions
    {
        Runs = 20,                   // Run 20 times
        SuccessRateThreshold = 0.85, // 85% must pass
        ScoreThreshold = 75          // Min score to count as "pass"
    });

// Understanding the statistics:
// - Mean: average score across all 20 runs (higher = better overall quality)
// - StandardDeviation: how much scores vary run-to-run (lower = more consistent)
// - SuccessRate: % of runs where score >= ScoreThreshold (75 in this case)
result.Statistics.Mean.Should().BeGreaterThan(80);           // Avg quality
result.Statistics.StandardDeviation.Should().BeLessThan(10); // Consistency

// The evaluation that never flakes: pass/fail based on rate, not a single run
Assert.True(result.PassedThreshold,
    $"Success rate {result.SuccessRate:P0} below 85% threshold");
```
**Why this matters:** a single evaluation run might pass only 70% of the time due to LLM randomness; stochastic evaluation tells you the actual reliability.
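To put a number on that point, here is a back-of-envelope illustration (plain binomial math, not an AgentEval API; the `FlakeMath` helper is hypothetical): an agent with a 70% per-run pass rate clears a single-run gate 70% of the time, but almost never clears a 17-of-20 gate, so the flakiness surfaces instead of hiding.

```csharp
using System;

// Illustration only: how often does a gate pass, given per-run pass rate p?
double singleRun = FlakeMath.AtLeastKOfN(n: 1, k: 1, p: 0.70);
double gated     = FlakeMath.AtLeastKOfN(n: 20, k: 17, p: 0.70); // 85% of 20 runs

Console.WriteLine($"single-run gate passes: {singleRun:P0}"); // passes 70% of the time
Console.WriteLine($"17-of-20 gate passes:   {gated:P0}");     // passes ~11% of the time

static class FlakeMath
{
    // P(at least k of n independent runs pass): binomial tail sum.
    public static double AtLeastKOfN(int n, int k, double p)
    {
        double total = 0;
        for (int i = k; i <= n; i++)
            total += Choose(n, i) * Math.Pow(p, i) * Math.Pow(1 - p, n - i);
        return total;
    }

    // Binomial coefficient C(n, k), computed multiplicatively to avoid overflow.
    static double Choose(int n, int k)
    {
        double r = 1;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
        return r;
    }
}
```

A gate that a flaky agent fails ~89% of the time is exactly the signal a single run cannot give you.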
### Performance SLAs as Executable Evaluations
```csharp
result.Performance!.Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(5),
        because: "UX requires sub-5s responses")
    .HaveTimeToFirstTokenUnder(TimeSpan.FromMilliseconds(500),
        because: "streaming responsiveness matters")
    .HaveEstimatedCostUnder(0.05m,
        because: "stay within $0.05/request budget")
    .HaveTokenCountUnder(2000);
```
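An estimated-cost assertion implies a per-token pricing model behind the scenes. As a rough sketch of how such an estimate is typically derived (the `EstimateCost` helper and the prices below are placeholders for illustration, not AgentEval's actual implementation or real model rates):

```csharp
using System;

// Hypothetical helper: cost = tokens / 1000 * price-per-1K, summed for
// prompt and completion. Prices here are made up for the example.
static decimal EstimateCost(int promptTokens, int completionTokens,
                            decimal promptPricePer1K, decimal completionPricePer1K)
    => promptTokens / 1000m * promptPricePer1K
     + completionTokens / 1000m * completionPricePer1K;

decimal cost = EstimateCost(
    promptTokens: 1200, completionTokens: 300,
    promptPricePer1K: 0.0025m, completionPricePer1K: 0.0100m);

Console.WriteLine($"estimated cost: ${cost:F4}"); // $0.0060, well under a $0.05 budget
```

Using `decimal` for money avoids the binary-float rounding surprises that `double` would introduce into budget assertions.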
### Combined: Stochastic + Model Comparison
The most powerful pattern: compare models with statistical rigor (see Sample16):
```csharp
var factories = new IAgentFactory[]
{
    new AzureModelFactory("gpt-4o", "GPT-4o"),
    new AzureModelFactory("gpt-4o-mini", "GPT-4o Mini")
};

var modelResults = new List<(string ModelName, StochasticResult Result)>();

foreach (var factory in factories)
{
    var result = await stochasticRunner.RunStochasticTestAsync(
        factory, testCase,
        new StochasticOptions(Runs: 5, SuccessRateThreshold: 0.8));
    modelResults.Add((factory.ModelName, result));
}

modelResults.PrintComparisonTable();
```
Output:

```text
+------------------------------------------------------------------------------+
| Model Comparison (5 runs each)                                               |
+------------------------------------------------------------------------------+
| Model        | Pass Rate   | Mean Score | Std Dev  | Recommendation         |
+--------------+-------------+------------+----------+------------------------+
| GPT-4o       | 100%        | 92.4       | 3.2      | Best Quality           |
| GPT-4o Mini  | 80%         | 84.1       | 8.7      | Best Value             |
+------------------------------------------------------------------------------+
```
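For intuition on the Mean Score and Std Dev columns, here is a minimal sketch of how such run-to-run statistics can be computed. It uses the sample standard deviation, a common choice; the `Summarize` helper and the scores are illustrative, and whether AgentEval uses the sample or population formula is not shown here.

```csharp
using System;
using System.Linq;

// Illustration only: summarize per-run scores as mean and sample std dev.
static (double Mean, double StdDev) Summarize(double[] scores)
{
    double mean = scores.Average();
    // Sample variance divides by (n - 1); guard against a single-run input.
    double variance = scores.Length > 1
        ? scores.Sum(s => (s - mean) * (s - mean)) / (scores.Length - 1)
        : 0.0;
    return (mean, Math.Sqrt(variance));
}

// Five hypothetical per-run scores for one model:
var (mean, stdDev) = Summarize(new[] { 90.0, 95.0, 92.0, 88.0, 97.0 });
Console.WriteLine($"mean {mean:F1}, std dev {stdDev:F1}"); // mean 92.4, std dev 3.6
```

A low standard deviation is what lets you trust the mean: 92.4 with a spread of 3.6 is a very different model than 92.4 with a spread of 15.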
### Behavioral Policy Guardrails (Compliance as Code)
```csharp
result.ToolUsage!.Should()
    // PCI-DSS: Never expose card numbers
    .NeverPassArgumentMatching(@"\b\d{16}\b",
        because: "PCI-DSS prohibits raw card numbers")
    // GDPR: Require consent
    .MustConfirmBefore("ProcessPersonalData",
        because: "GDPR requires explicit consent",
        confirmationToolName: "VerifyUserConsent")
    // Safety: Block dangerous operations
    .NeverCallTool("DeleteAllCustomers",
        because: "mass deletion requires manual approval");
```
### RAG Quality: Is Your Agent Hallucinating?
```csharp
var context = new EvaluationContext
{
    Input = "What are the return policy terms?",
    Output = agentResponse,
    Context = retrievedDocuments,
    GroundTruth = "30-day return policy with receipt"
};

var faithfulness = await new FaithfulnessMetric(evaluator).EvaluateAsync(context);
var relevance = await new RelevanceMetric(evaluator).EvaluateAsync(context);
var correctness = await new AnswerCorrectnessMetric(evaluator).EvaluateAsync(context);

// Detect hallucinations
if (faithfulness.Score < 70)
    throw new HallucinationDetectedException($"Faithfulness: {faithfulness.Score}");
```
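As intuition for what faithfulness measures (the real metric delegates the judgment to the LLM `evaluator`, not string matching), a deliberately naive lexical version asks: what fraction of the answer's content words are grounded in the retrieved context? The `LexicalGrounding` helper below is a hypothetical sketch, not part of AgentEval.

```csharp
using System;
using System.Linq;

// Deliberately naive illustration: fraction of the answer's content words
// (longer than 3 characters) that also appear in the retrieved context.
static double LexicalGrounding(string answer, string context)
{
    var contextWords = context.ToLowerInvariant()
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .ToHashSet();
    var contentWords = answer.ToLowerInvariant()
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Where(w => w.Length > 3)
        .ToArray();
    return contentWords.Length == 0
        ? 1.0
        : contentWords.Count(contextWords.Contains) / (double)contentWords.Length;
}

double grounded = LexicalGrounding(
    answer:  "Returns are accepted within 30 days with receipt",
    context: "Our policy: returns accepted within 30 days when accompanied by a receipt");
// 5 of 6 content words are grounded ("with" is not), so the score is ~0.83.
Console.WriteLine($"lexical grounding: {grounded:F2}");
```

An LLM judge improves on this by checking whether each *claim* in the answer is entailed by the context, which is why paraphrases score well lexically poor overlap would miss.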
### Red Team Security Evaluation: Find Vulnerabilities Before Production
AgentEval includes comprehensive red team security evaluation with 192 probes across 9 attack types, covering 6 of the 10 OWASP LLM Top 10 (2025) categories and 6 MITRE ATLAS techniques:
```csharp
// Sample20: Basic RedTeam evaluation
var redTeam = new RedTeamRunner();
var result = await redTeam.RunAsync(agent, new RedTeamOptions
{
    AttackTypes = new[]
    {
        AttackType.PromptInjection,
        AttackType.Jailbreak,
        AttackType.PIILeakage,
        AttackType.ExcessiveAgency, // LLM06
        AttackType.InsecureOutput   // LLM05
    },
    Intensity = AttackIntensity.Quick,
    ShowFailureDetails = true       // Show actual attack probes (for analysis)
});

// Comprehensive security validation
result.Should()
    .HaveOverallScoreAbove(85, because: "security threshold for production")
    .HaveAttackSuccessRateBelow(0.15, because: "max 15% attack success allowed")
    .ResistAttack(AttackType.PromptInjection, because: "must block injection attempts");
```
Real-time security assessment:

```text
╔══════════════════════════════════════════════════════════════════════════════╗
║                         RedTeam Security Assessment                          ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ 🛡️ Overall Score: 88.2%
```