Agentevals
Readymade evaluators for agent trajectories
🦾⚖️ AgentEvals
Agentic applications give an LLM freedom over control flow in order to solve problems. While this freedom can be extremely powerful, the black-box nature of LLMs can make it difficult to understand how changes in one part of your agent will affect others downstream. This makes evaluating your agents especially important.
This package contains a collection of evaluators and utilities for evaluating the performance of your agents, with a focus on agent trajectory, or the intermediate steps an agent takes as it runs. It is intended to provide a good conceptual starting point for your agent's evals.
If you are looking for more general evaluation tools, please check out the companion package openevals.
Quickstart
To get started, install agentevals:
<details open>
<summary>Python</summary>
pip install agentevals
</details>
<details>
<summary>TypeScript</summary>
npm install agentevals @langchain/core
</details>
This quickstart will use an evaluator powered by OpenAI's o3-mini model to judge your results, so you'll need to set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your_openai_api_key"
Once you've done this, you can run your first trajectory evaluator. We represent the agent's trajectory as a list of OpenAI-style messages:
<details open>
<summary>Python</summary>
import json
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
trajectory_evaluator = create_trajectory_llm_as_judge(
prompt=TRAJECTORY_ACCURACY_PROMPT,
model="openai:o3-mini",
)
# This is a fake trajectory; in practice, you would run your agent to get a real one
outputs = [
{"role": "user", "content": "What is the weather in SF?"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": "get_weather",
"arguments": json.dumps({"city": "SF"}),
}
}
],
},
{"role": "tool", "content": "It's 80 degrees and sunny in SF."},
{"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]
eval_result = trajectory_evaluator(
outputs=outputs,
)
print(eval_result)
{
'key': 'trajectory_accuracy',
'reasoning': 'The trajectory accurately follows the user's request for weather information in SF. Initially, the assistant recognizes the goal (providing weather details), then it efficiently makes a tool call to get the weather, and finally it communicates the result clearly. All steps demonstrate logical progression and efficiency. Thus, the score should be: true.',
'score': True
}
</details>
<details>
<summary>TypeScript</summary>
import {
createTrajectoryLLMAsJudge,
type FlexibleChatCompletionMessage,
TRAJECTORY_ACCURACY_PROMPT,
} from "agentevals";
const trajectoryEvaluator = createTrajectoryLLMAsJudge({
prompt: TRAJECTORY_ACCURACY_PROMPT,
model: "openai:o3-mini",
});
const outputs = [
{ role: "user", content: "What is the weather in SF?" },
{
role: "assistant",
content: "",
tool_calls: [
{
function: {
name: "get_weather",
arguments: JSON.stringify({ city: "SF" }),
},
},
],
},
{ role: "tool", content: "It's 80 degrees and sunny in SF." },
{
role: "assistant",
content: "The weather in SF is 80 degrees and sunny.",
},
] satisfies FlexibleChatCompletionMessage[];
const evalResult = await trajectoryEvaluator({
outputs,
});
console.log(evalResult);
{
key: 'trajectory_accuracy',
score: true,
comment: '...'
}
</details>
You can see that the evaluator returns a score of true since the overall trajectory is a reasonable path for the agent to take to answer the user's question.
For more details on this evaluator, including how to customize it, see the section on trajectory LLM-as-judge.
Installation
You can install agentevals like this:
<details open>
<summary>Python</summary>
pip install agentevals
</details>
<details>
<summary>TypeScript</summary>
npm install agentevals @langchain/core
</details>
For LLM-as-judge evaluators, you will also need an LLM client. By default, agentevals uses LangChain chat model integrations and comes with langchain_openai installed. However, if you prefer, you may use the OpenAI client directly:
<details open>
<summary>Python</summary>
pip install openai
</details>
<details>
<summary>TypeScript</summary>
npm install openai
</details>
It is also helpful to be familiar with some evaluation concepts and LangSmith's pytest integration for running evals, which is documented here.
Evaluators
Agent trajectory match
Agent trajectory match evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
These evaluators expect you to format your agent's trajectory as a list of OpenAI-format dicts or as a list of LangChain BaseMessage instances, and handle message formatting under the hood.
AgentEvals offers the create_trajectory_match_evaluator/createTrajectoryMatchEvaluator and create_async_trajectory_match_evaluator methods for this task. You can customize their behavior in a few ways:
- Setting trajectory_match_mode/trajectoryMatchMode to strict, unordered, subset, or superset to provide the general strategy the evaluator will use to compare trajectories
- Setting tool_args_match_mode and/or tool_args_match_overrides to customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference. By default, only tool calls with the same arguments to the same tool are considered equal.
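As a rough illustration of how the four trajectory_match_mode values differ, the sketch below compares only the tool-call names of two trajectories. It is a toy model and not the agentevals implementation: tool_names_match is a hypothetical helper, it ignores arguments and message roles, and it assumes that subset means the actual calls are contained in the reference while superset means the reverse.

```python
from collections import Counter


def tool_names_match(actual: list[str], reference: list[str], mode: str) -> bool:
    """Toy comparison of tool-call names under each match mode (hypothetical helper)."""
    actual_counts, reference_counts = Counter(actual), Counter(reference)
    if mode == "strict":
        # Same calls in the same order.
        return actual == reference
    if mode == "unordered":
        # Same calls, in any order.
        return actual_counts == reference_counts
    if mode == "subset":
        # Assumption: every actual call also appears in the reference.
        return not (actual_counts - reference_counts)
    if mode == "superset":
        # Assumption: every reference call also appears in the actual trajectory.
        return not (reference_counts - actual_counts)
    raise ValueError(f"Unknown mode: {mode}")


actual = ["get_weather", "accuweather_forecast"]
reference = ["get_weather"]

for mode in ("strict", "unordered", "subset", "superset"):
    print(mode, tool_names_match(actual, reference, mode))
```

With one extra tool call in the actual trajectory, only the superset comparison passes in this toy model.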
Strict match
The "strict" trajectory_match_mode compares two trajectories and ensures that they contain the same messages in the same order with the same tool calls. Note that it does allow for differences in message content:
<details open>
<summary>Python</summary>
import json
from agentevals.trajectory.match import create_trajectory_match_evaluator
outputs = [
{"role": "user", "content": "What is the weather in SF?"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": "get_weather",
"arguments": json.dumps({"city": "San Francisco"}),
}
},
{
"function": {
"name": "accuweather_forecast",
"arguments": json.dumps({"city": "San Francisco"}),
}
}
],
},
{"role": "tool", "content": "It's 80 degrees and sunny in SF."},
{"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]
reference_outputs = [
{"role": "user", "content": "What is the weather in San Francisco?"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": "get_weather",
"arguments": json.dumps({"city": "San Francisco"}),
}
}
],
},
{"role": "tool", "content": "It's 80 degrees and sunny in San Francisco."},
{"role": "assistant", "content": "The weather in SF is 80˚ and sunny."},
]
evaluator = create_trajectory_match_evaluator(
trajectory_match_mode="strict"
)
result = evaluator(
outputs=outputs, reference_outputs=reference_outputs
)
print(result)
{
'key': 'trajectory_strict_match',
'score': False,
'comment': None,
}
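To see why this example scores False, it can help to strip both trajectories down to what a strict comparison plausibly looks at. The sketch below is an assumption about the idea, not the library's code: skeleton and tool_call are hypothetical helpers that keep each message's role and parsed tool calls and drop the content.

```python
import json


def tool_call(name: str, **arguments) -> dict:
    """Build an OpenAI-style tool call with JSON-encoded arguments (hypothetical helper)."""
    return {"function": {"name": name, "arguments": json.dumps(arguments)}}


def skeleton(trajectory: list[dict]) -> list[tuple]:
    """Reduce each message to (role, [(tool name, parsed arguments), ...]), ignoring content."""
    result = []
    for message in trajectory:
        calls = [
            (call["function"]["name"], json.loads(call["function"]["arguments"]))
            for call in message.get("tool_calls", [])
        ]
        result.append((message["role"], calls))
    return result


outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            tool_call("get_weather", city="San Francisco"),
            tool_call("accuweather_forecast", city="San Francisco"),
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]
reference_outputs = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [tool_call("get_weather", city="San Francisco")],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in San Francisco."},
    {"role": "assistant", "content": "The weather in SF is 80˚ and sunny."},
]

# The extra accuweather_forecast call makes the skeletons differ.
print(skeleton(outputs) == skeleton(reference_outputs))  # False

# Dropping the extra call leaves only content differences, which are ignored here.
outputs[1]["tool_calls"].pop()
print(skeleton(outputs) == skeleton(reference_outputs))  # True
```

The differing message content never affects the comparison in this sketch; only the additional tool call does.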
</details>
<details>
<summary>TypeScript</summary>
import {
createTrajectoryMatchEvaluator,
type FlexibleChatCompletionMessage,
} from "agentevals";
const outputs = [
{ role: "user", content: "What is the weather in SF?" },
{
role: "assistant",
content: "",
tool_calls: [{
function: {
name: "get_weather",
arguments: JSON.stringify({ city: "San Francisco" })
},
}, {
function: {
name: "accuweather_forecast",
arguments: JSON.stringify({"city": "San Francisco"}),
},
}]
},
{ role: "tool", content: "It's 80 degrees and sunny in SF." },
{ role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
] satisfies FlexibleChatCompletionMessage[];
const referenceOutputs = [
{ role: "user", content: "What is the weather in San Francisco?" },
{
role: "assistant",
content: "",
tool_calls: [{
function: {
name: "get_weather",
arguments: JSON.stringify({ city: "San
