Agentevals

Readymade evaluators for agent trajectories

🦾⚖️ AgentEvals

Agentic applications give an LLM freedom over control flow in order to solve problems. While this freedom can be extremely powerful, the black-box nature of LLMs can make it difficult to understand how changes in one part of your agent will affect others downstream. This makes evaluating your agents especially important.

This package contains a collection of evaluators and utilities for evaluating the performance of your agents, with a focus on agent trajectory, or the intermediate steps an agent takes as it runs. It is intended to provide a good conceptual starting point for your agent's evals.

If you are looking for more general evaluation tools, please check out the companion package openevals.

Quickstart

To get started, install agentevals:

<details open> <summary>Python</summary>

pip install agentevals

</details> <details> <summary>TypeScript</summary>

npm install agentevals @langchain/core

</details>

This quickstart will use an evaluator powered by OpenAI's o3-mini model to judge your results, so you'll need to set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your_openai_api_key"

Once you've done this, you can run your first trajectory evaluator. We represent the agent's trajectory as a list of OpenAI-style messages:

<details open> <summary>Python</summary>

import json

from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

trajectory_evaluator = create_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT,
    model="openai:o3-mini",
)

# This is a fake trajectory; in reality, you would run your agent to get a real one
outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "SF"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

eval_result = trajectory_evaluator(
    outputs=outputs,
)

print(eval_result)

{
    'key': 'trajectory_accuracy',
    'comment': "The trajectory accurately follows the user's request for weather information in SF. Initially, the assistant recognizes the goal (providing weather details), then it efficiently makes a tool call to get the weather, and finally it communicates the result clearly. All steps demonstrate logical progression and efficiency. Thus, the score should be: true.",
    'score': True,
}

</details> <details> <summary>TypeScript</summary>

import {
  createTrajectoryLLMAsJudge,
  type FlexibleChatCompletionMessage,
  TRAJECTORY_ACCURACY_PROMPT,
} from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  prompt: TRAJECTORY_ACCURACY_PROMPT,
  model: "openai:o3-mini",
});

const outputs = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [
      {
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "SF" }),
        },
      },
    ],
  },
  { role: "tool", content: "It's 80 degrees and sunny in SF." },
  {
    role: "assistant",
    content: "The weather in SF is 80 degrees and sunny.",
  },
] satisfies FlexibleChatCompletionMessage[];

const evalResult = await trajectoryEvaluator({
  outputs,
});

console.log(evalResult);

{
    key: 'trajectory_accuracy',
    score: true,
    comment: '...'
}

</details>

You can see that the evaluator returns a score of true since the overall trajectory is a reasonable path for the agent to take to answer the user's question.

For more details on this evaluator, including how to customize it, see the section on trajectory LLM-as-judge.
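
Because a trajectory is just a list of OpenAI-style message dicts, you can inspect it with plain Python before handing it to an evaluator. The sketch below is illustrative only and is not part of agentevals: the helper name `extract_tool_calls` is hypothetical, and it assumes each tool call stores its arguments as a JSON string, as in the quickstart example.

```python
import json


def extract_tool_calls(trajectory):
    """Return (tool_name, arguments) pairs in the order the agent made them.

    Hypothetical helper for illustration; not part of agentevals.
    """
    calls = []
    for message in trajectory:
        # Only assistant messages carry tool calls; other roles are skipped.
        if message.get("role") != "assistant":
            continue
        for tool_call in message.get("tool_calls") or []:
            function = tool_call["function"]
            # Arguments are stored as a JSON string, so decode them for inspection.
            calls.append((function["name"], json.loads(function["arguments"])))
    return calls


trajectory = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": json.dumps({"city": "SF"})}}
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

print(extract_tool_calls(trajectory))
# [('get_weather', {'city': 'SF'})]
```

A flattened view like this is handy for eyeballing what the agent actually did; the evaluators themselves take the full message list.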

Table of Contents

- Installation
- Evaluators
- Python Async Support
- LangSmith Integration
  - Pytest or Vitest/Jest
  - Evaluate

Installation

You can install agentevals like this:

<details open> <summary>Python</summary>

pip install agentevals

</details> <details> <summary>TypeScript</summary>

npm install agentevals @langchain/core

</details>

For LLM-as-judge evaluators, you will also need an LLM client. By default, agentevals uses LangChain chat model integrations and comes with langchain_openai installed. If you prefer, you can use the OpenAI client directly instead:

<details open> <summary>Python</summary>

pip install openai

</details> <details> <summary>TypeScript</summary>

npm install openai

</details>

It also helps to be familiar with basic evaluation concepts and with LangSmith's pytest integration for running evals, both of which are covered in the LangSmith documentation.

Evaluators

Agent trajectory match

Agent trajectory match evaluators judge the trajectory of an agent's execution either against an expected trajectory or using an LLM. These evaluators expect you to format your agent's trajectory as a list of OpenAI-format dicts or as a list of LangChain BaseMessage instances, and they handle message formatting under the hood.

AgentEvals offers the create_trajectory_match_evaluator/createTrajectoryMatchEvaluator and create_async_trajectory_match_evaluator functions for this task. You can customize their behavior in a few ways:

- Set trajectory_match_mode/trajectoryMatchMode to strict, unordered, subset, or superset to choose the general strategy the evaluator uses to compare trajectories.
- Set tool_args_match_mode and/or tool_args_match_overrides to customize how the evaluator decides whether a tool call in the actual trajectory equals one in the reference. By default, two tool calls are considered equal only if they call the same tool with the same arguments.
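
To make the four modes concrete, here is a minimal pure-Python sketch of how they could be read when each trajectory is reduced to a list of (tool name, arguments) pairs. It is not the agentevals implementation: the function name `tool_calls_match` is hypothetical, and the interpretation of subset (every actual call appears in the reference) and superset (every reference call appears in the actual trajectory) is an assumption based on the mode names.

```python
import json
from collections import Counter


def _freeze(call):
    """Turn a (name, args) pair into a hashable key; args are compared exactly."""
    name, args = call
    return (name, json.dumps(args, sort_keys=True))


def tool_calls_match(actual, reference, mode="strict"):
    """Compare two lists of (tool_name, args) pairs under the given mode.

    Hypothetical sketch for illustration; not the agentevals implementation.
    """
    actual_keys = [_freeze(c) for c in actual]
    reference_keys = [_freeze(c) for c in reference]
    if mode == "strict":
        # Same calls in the same order.
        return actual_keys == reference_keys
    actual_counts = Counter(actual_keys)
    reference_counts = Counter(reference_keys)
    if mode == "unordered":
        # Same calls, in any order.
        return actual_counts == reference_counts
    if mode == "subset":
        # Assumed meaning: every actual call also appears in the reference.
        return not (actual_counts - reference_counts)
    if mode == "superset":
        # Assumed meaning: every reference call also appears in the actual calls.
        return not (reference_counts - actual_counts)
    raise ValueError(f"unknown mode: {mode}")


actual = [
    ("get_weather", {"city": "San Francisco"}),
    ("accuweather_forecast", {"city": "San Francisco"}),
]
reference = [("get_weather", {"city": "San Francisco"})]

for mode in ("strict", "unordered", "subset", "superset"):
    print(mode, tool_calls_match(actual, reference, mode))
# strict False
# unordered False
# subset False
# superset True
```

With one extra tool call in the actual trajectory, only the superset reading passes, which is why the choice of mode matters when an agent is allowed to do more work than the reference shows.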

Strict match

The "strict" trajectory_match_mode compares two trajectories and ensures that they contain the same messages in the same order with the same tool calls. Note that it does allow for differences in message content:

<details open> <summary>Python</summary>

import json
from agentevals.trajectory.match import create_trajectory_match_evaluator

outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "San Francisco"}),
                }
            },
            {
                "function": {
                    "name": "accuweather_forecast",
                    "arguments": json.dumps({"city": "San Francisco"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]
reference_outputs = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "San Francisco"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in San Francisco."},
    {"role": "assistant", "content": "The weather in SF is 80˚ and sunny."},
]

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict"
)

result = evaluator(
    outputs=outputs, reference_outputs=reference_outputs
)

print(result)

{
    'key': 'trajectory_strict_match',
    'score': False,
    'comment': None,
}
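
The False score above is consistent with the extra accuweather_forecast call in outputs; the differing message contents are not what causes the failure. A rough way to locate such a mismatch is to compare the two trajectories message by message on role and tool calls only. The helpers below (`_signature`, `first_strict_mismatch`, and `call`) are hypothetical illustrations, not how agentevals is implemented, and they assume the OpenAI-style dict format used in this example.

```python
import json


def _signature(message):
    """Reduce a message to its role and tool calls; content is deliberately ignored."""
    calls = [
        (tc["function"]["name"], json.loads(tc["function"]["arguments"]))
        for tc in message.get("tool_calls") or []
    ]
    return message["role"], calls


def first_strict_mismatch(outputs, reference_outputs):
    """Return the index of the first differing message, or None if all match.

    Hypothetical diagnostic helper; not part of agentevals.
    """
    for index, (actual, expected) in enumerate(zip(outputs, reference_outputs)):
        if _signature(actual) != _signature(expected):
            return index
    if len(outputs) != len(reference_outputs):
        # One trajectory has extra messages after the shared prefix.
        return min(len(outputs), len(reference_outputs))
    return None


def call(name, **args):
    """Build an OpenAI-style tool call dict (helper for this example only)."""
    return {"function": {"name": name, "arguments": json.dumps(args)}}


outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            call("get_weather", city="San Francisco"),
            call("accuweather_forecast", city="San Francisco"),
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]
reference_outputs = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [call("get_weather", city="San Francisco")],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in San Francisco."},
    {"role": "assistant", "content": "It is sunny and 80 degrees in SF."},
]

print(first_strict_mismatch(outputs, reference_outputs))
# 1
```

The user messages at index 0 differ in wording but still count as matching here; the first mismatch is the assistant message at index 1, where the tool calls differ.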

</details> <details> <summary>TypeScript</summary>

import {
  createTrajectoryMatchEvaluator,
  type FlexibleChatCompletionMessage,
} from "agentevals";

const outputs = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [{
      function: {
        name: "get_weather",
        arguments: JSON.stringify({ city: "San Francisco" })
      },
    }, {
      function: {
        name: "accuweather_forecast",
        arguments: JSON.stringify({"city": "San Francisco"}),
      },
    }]
  },
  { role: "tool", content: "It's 80 degrees and sunny in SF." },
  { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
] satisfies FlexibleChatCompletionMessage[];

const referenceOutputs = [
  { role: "user", content: "What is the weather in San Francisco?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [{
      function: {
        name: "get_weather",
        arguments: JSON.stringify({ city: "San

langchain-ai/agentevals

GitHub Stars: 534
Forks: 36
Category: Development
Languages: Python
Updated: 5h ago
Security Score: 95/100 (audited on Apr 2, 2026; no findings)
