⚖️ OpenEvals
Much like tests in traditional software, evals are an important part of bringing LLM applications to production. The goal of this package is to help provide a starting point for you to write evals for your LLM applications, from which you can write more custom evals specific to your application.
If you are looking for evals specific to evaluating LLM agents, please check out agentevals.
Quickstart
To get started, install `openevals`:

<details open>
<summary>Python</summary>

```bash
pip install openevals
```

</details>

<details>
<summary>TypeScript</summary>

```bash
npm install openevals @langchain/core
```

</details>
This quickstart will use an evaluator powered by OpenAI's gpt-5.4 model to judge your results, so you'll need to set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your_openai_api_key"
Once you've done this, you can run your first eval:
<details open>
<summary>Python</summary>

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT

conciseness_evaluator = create_llm_as_judge(
    # CONCISENESS_PROMPT is just an f-string
    prompt=CONCISENESS_PROMPT,
    model="openai:gpt-5.4",
)

inputs = "How is the weather in San Francisco?"

# These are fake outputs, in reality you would run your LLM-based system to get real outputs
outputs = "Thanks for asking! The current weather in San Francisco is sunny and 90 degrees."

# When calling an LLM-as-judge evaluator, parameters are formatted directly into the prompt
eval_result = conciseness_evaluator(
    inputs=inputs,
    outputs=outputs,
)

print(eval_result)
```

```
{
    'key': 'score',
    'score': False,
    'comment': 'The output includes an unnecessary greeting ("Thanks for asking!") and extra..'
}
```

</details>
<details>
<summary>TypeScript</summary>

```ts
import { createLLMAsJudge, CONCISENESS_PROMPT } from "openevals";

const concisenessEvaluator = createLLMAsJudge({
  // CONCISENESS_PROMPT is just an f-string
  prompt: CONCISENESS_PROMPT,
  model: "openai:gpt-5.4",
});

const inputs = "How is the weather in San Francisco?";

// These are fake outputs, in reality you would run your LLM-based system to get real outputs
const outputs = "Thanks for asking! The current weather in San Francisco is sunny and 90 degrees.";

// When calling an LLM-as-judge evaluator, parameters are formatted directly into the prompt
const evalResult = await concisenessEvaluator({
  inputs,
  outputs,
});

console.log(evalResult);
```

```
{
  key: 'score',
  score: false,
  comment: 'The output includes an unnecessary greeting ("Thanks for asking!") and extra..'
}
```

</details>
This is an example of a reference-free evaluator: other evaluators may accept slightly different parameters, such as a required reference output. LLM-as-judge evaluators will format any passed parameters into their prompt, allowing you to flexibly customize criteria or add other fields.
See the LLM-as-judge section for more information on how to customize the model, the prompt, or the scoring output (for example, float values rather than just True/False)!
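To illustrate the "extra fields" point above, here is a minimal sketch of a custom prompt with a hypothetical `criteria` field (the prompt text below is our own, not one of the prebuilt prompts). Because prompts are plain format strings, the parameter substitution can be shown without calling a model:

```python
# Hypothetical custom prompt: any {placeholder} can be filled by a keyword
# argument passed to the evaluator at call time.
CUSTOM_PROMPT = """You are an expert judge. Grade the response against this criterion: {criteria}

<input>
{inputs}
</input>

<output>
{outputs}
</output>"""

# With openevals, this would be wired up as (requires an OpenAI API key):
# from openevals.llm import create_llm_as_judge
# evaluator = create_llm_as_judge(prompt=CUSTOM_PROMPT, model="openai:gpt-5.4")
# result = evaluator(inputs=..., outputs=..., criteria="politeness")

# The substitution the evaluator performs is plain string formatting:
filled = CUSTOM_PROMPT.format(
    criteria="politeness",
    inputs="How is the weather in San Francisco?",
    outputs="Sunny and 90 degrees.",
)
print(filled.splitlines()[0])
```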
Table of Contents

- [Installation](#installation)
- [Evaluators](#evaluators)
  - [LLM-as-Judge](#llm-as-judge)
    - [Prebuilt prompts](#prebuilt-prompts)
  - Voice (beta)
  - [RAG](#rag)
  - [Extraction and tool calls](#extraction-and-tool-calls)
  - [Code](#code)
  - [Sandboxed code](#sandboxed-code)
  - [Agent trajectory](#agent-trajectory)
  - [Other](#other)
Installation
You can install openevals like this:
<details open>
<summary>Python</summary>

```bash
pip install openevals
```

</details>

<details>
<summary>TypeScript</summary>

```bash
npm install openevals @langchain/core
```

</details>
For LLM-as-judge evaluators, you will also need an LLM client. By default, openevals uses LangChain chat model integrations and ships with langchain_openai installed. However, if you prefer, you may use the OpenAI client directly:
<details open>
<summary>Python</summary>

```bash
pip install openai
```

</details>

<details>
<summary>TypeScript</summary>

```bash
npm install openai
```

</details>
It is also helpful to be familiar with some evaluation concepts.
Evaluators
LLM-as-judge
One common way to evaluate an LLM app's outputs is to use another LLM as a judge. This is generally a good starting point for evals.
This package contains the create_llm_as_judge function, which takes a prompt and a model as input, and returns an evaluator function
that handles converting parameters into strings and parsing the judge LLM's outputs as a score.
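Concretely, the evaluator returns a result dict like the one shown in the quickstart, with `key`, `score`, and `comment` fields. A small sketch of consuming that result (the `summarize` helper here is our own, not part of openevals):

```python
def summarize(eval_result: dict) -> str:
    # "score" may be a boolean pass/fail or a float, depending on how the
    # judge is configured; "comment" carries the judge's reasoning.
    status = "PASS" if eval_result["score"] else "FAIL"
    return f"{eval_result['key']}: {status} - {eval_result['comment']}"

# A result shaped like the quickstart output above:
print(summarize({
    "key": "score",
    "score": False,
    "comment": "The output includes an unnecessary greeting.",
}))
```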
To use the create_llm_as_judge function, you need to provide a prompt and a model. To get started, OpenEvals has some prebuilt prompts in the openevals.prompts module that you can use out of the box. Here's an example:
<details open>
<summary>Python</summary>

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:gpt-5.4",
)
```

</details>
<details>
<summary>TypeScript</summary>

```ts
import { createLLMAsJudge, CORRECTNESS_PROMPT } from "openevals";

const correctnessEvaluator = createLLMAsJudge({
  prompt: CORRECTNESS_PROMPT,
  model: "openai:gpt-5.4",
});
```

</details>
Note that CORRECTNESS_PROMPT is a simple f-string that you can log and edit as needed for your specific use case:
```python
print(CORRECTNESS_PROMPT)
```

```
You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

<Rubric>
A correct answer:
- Provides accurate and complete information
...
<input>
```
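Since the prebuilt prompts are plain strings, one way to tailor them is to append extra rubric text before constructing the judge. A sketch using a stand-in string (the real CORRECTNESS_PROMPT is much longer, and the extra rubric line below is just an example, not part of the package):

```python
# Stand-in for the (much longer) prebuilt prompt string:
BASE_PROMPT = """You are an expert data labeler evaluating model outputs for correctness.

<input>
{inputs}
</input>"""

# Append a domain-specific rubric line before building the evaluator:
custom_prompt = BASE_PROMPT + "\nPenalize answers that omit units of measurement."

# Then pass it along as usual:
# evaluator = create_llm_as_judge(prompt=custom_prompt, model="openai:gpt-5.4")

# The original {placeholders} survive the edit:
assert "{inputs}" in custom_prompt
```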