⚖️ OpenEvals
Much like tests in traditional software, evals are an important part of bringing LLM applications to production. The goal of this package is to help provide a starting point for you to write evals for your LLM applications, from which you can write more custom evals specific to your application.
If you are looking for evals specific to evaluating LLM agents, please check out agentevals.
Quickstart
To get started, install `openevals`:

<details open>
<summary>Python</summary>

```bash
pip install openevals
```

</details>

<details>
<summary>TypeScript</summary>

```bash
npm install openevals @langchain/core
```

</details>
This quickstart will use an evaluator powered by OpenAI's gpt-5.4 model to judge your results, so you'll need to set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your_openai_api_key"
Once you've done this, you can run your first eval:
<details open>
<summary>Python</summary>

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT

conciseness_evaluator = create_llm_as_judge(
    # CONCISENESS_PROMPT is just an f-string
    prompt=CONCISENESS_PROMPT,
    model="openai:gpt-5.4",
)

inputs = "How is the weather in San Francisco?"

# These are fake outputs, in reality you would run your LLM-based system to get real outputs
outputs = "Thanks for asking! The current weather in San Francisco is sunny and 90 degrees."

# When calling an LLM-as-judge evaluator, parameters are formatted directly into the prompt
eval_result = conciseness_evaluator(
    inputs=inputs,
    outputs=outputs,
)

print(eval_result)
```

```
{
    'key': 'score',
    'score': False,
    'comment': 'The output includes an unnecessary greeting ("Thanks for asking!") and extra..'
}
```

</details>
<details>
<summary>TypeScript</summary>

```ts
import { createLLMAsJudge, CONCISENESS_PROMPT } from "openevals";

const concisenessEvaluator = createLLMAsJudge({
  // CONCISENESS_PROMPT is just an f-string
  prompt: CONCISENESS_PROMPT,
  model: "openai:gpt-5.4",
});

const inputs = "How is the weather in San Francisco?";

// These are fake outputs, in reality you would run your LLM-based system to get real outputs
const outputs = "Thanks for asking! The current weather in San Francisco is sunny and 90 degrees.";

// When calling an LLM-as-judge evaluator, parameters are formatted directly into the prompt
const evalResult = await concisenessEvaluator({
  inputs,
  outputs,
});

console.log(evalResult);
```

```
{
  key: 'score',
  score: false,
  comment: 'The output includes an unnecessary greeting ("Thanks for asking!") and extra..'
}
```

</details>
This is an example of a reference-free evaluator: other evaluators may accept slightly different parameters, such as a required reference output. LLM-as-judge evaluators will format any passed parameters into their prompt, allowing you to flexibly customize criteria or add other fields.
See the LLM-as-judge section for more information on how to customize the model, the prompt, or the scoring output (for example, float values rather than just True/False)!
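To illustrate the "extra fields" point above, here is a minimal sketch of a custom prompt with a hypothetical `criteria` field (the prompt text below is our own, not one of the prebuilt prompts). Because prompts are plain format strings, the parameter substitution can be shown without calling a model:

```python
# Hypothetical custom prompt: any {placeholder} can be filled by a keyword
# argument passed to the evaluator at call time.
CUSTOM_PROMPT = """You are an expert judge. Grade the response against this criterion: {criteria}

<input>
{inputs}
</input>

<output>
{outputs}
</output>"""

# With openevals, this would be wired up as (requires an OpenAI API key):
# from openevals.llm import create_llm_as_judge
# evaluator = create_llm_as_judge(prompt=CUSTOM_PROMPT, model="openai:gpt-5.4")
# result = evaluator(inputs=..., outputs=..., criteria="politeness")

# The substitution the evaluator performs is plain string formatting:
filled = CUSTOM_PROMPT.format(
    criteria="politeness",
    inputs="How is the weather in San Francisco?",
    outputs="Sunny and 90 degrees.",
)
print(filled.splitlines()[0])
```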
Table of Contents

- [Installation](#installation)
- [Evaluators](#evaluators)
  - [LLM-as-Judge](#llm-as-judge)
    - [Prebuilt prompts](#prebuilt-prompts)
  - Voice (beta)
  - [RAG](#rag)
  - [Extraction and tool calls](#extraction-and-tool-calls)
  - [Code](#code)
  - [Sandboxed code](#sandboxed-code)
  - [Agent trajectory](#agent-trajectory)
  - [Other](#other)
Installation
You can install openevals like this:
<details open>
<summary>Python</summary>

```bash
pip install openevals
```

</details>

<details>
<summary>TypeScript</summary>

```bash
npm install openevals @langchain/core
```

</details>
For LLM-as-judge evaluators, you will also need an LLM client. By default, openevals uses LangChain chat model integrations and ships with langchain_openai installed. However, if you prefer, you may use the OpenAI client directly:
<details open>
<summary>Python</summary>

```bash
pip install openai
```

</details>

<details>
<summary>TypeScript</summary>

```bash
npm install openai
```

</details>
It is also helpful to be familiar with some evaluation concepts.
Evaluators
LLM-as-judge
One common way to evaluate an LLM app's outputs is to use another LLM as a judge. This is generally a good starting point for evals.
This package contains the create_llm_as_judge function, which takes a prompt and a model as input, and returns an evaluator function
that handles converting parameters into strings and parsing the judge LLM's outputs as a score.
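Concretely, the evaluator returns a result dict like the one shown in the quickstart, with `key`, `score`, and `comment` fields. A small sketch of consuming that result (the `summarize` helper here is our own, not part of openevals):

```python
def summarize(eval_result: dict) -> str:
    # "score" may be a boolean pass/fail or a float, depending on how the
    # judge is configured; "comment" carries the judge's reasoning.
    status = "PASS" if eval_result["score"] else "FAIL"
    return f"{eval_result['key']}: {status} - {eval_result['comment']}"

# A result shaped like the quickstart output above:
print(summarize({
    "key": "score",
    "score": False,
    "comment": "The output includes an unnecessary greeting.",
}))
```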
To use the create_llm_as_judge function, you need to provide a prompt and a model. To get started, OpenEvals has some prebuilt prompts in the openevals.prompts module that you can use out of the box. Here's an example:
<details open>
<summary>Python</summary>

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:gpt-5.4",
)
```

</details>
<details>
<summary>TypeScript</summary>

```ts
import { createLLMAsJudge, CORRECTNESS_PROMPT } from "openevals";

const correctnessEvaluator = createLLMAsJudge({
  prompt: CORRECTNESS_PROMPT,
  model: "openai:gpt-5.4",
});
```

</details>
Note that CORRECTNESS_PROMPT is a simple f-string that you can log and edit as needed for your specific use case:
```python
print(CORRECTNESS_PROMPT)
```

```
You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

<Rubric>
A correct answer:
- Provides accurate and complete information
...
<input>
```
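Since the prebuilt prompts are plain strings, one way to tailor them is to append extra rubric text before constructing the judge. A sketch using a stand-in string (the real CORRECTNESS_PROMPT is much longer, and the extra rubric line below is just an example, not part of the package):

```python
# Stand-in for the (much longer) prebuilt prompt string:
BASE_PROMPT = """You are an expert data labeler evaluating model outputs for correctness.

<input>
{inputs}
</input>"""

# Append a domain-specific rubric line before building the evaluator:
custom_prompt = BASE_PROMPT + "\nPenalize answers that omit units of measurement."

# Then pass it along as usual:
# evaluator = create_llm_as_judge(prompt=custom_prompt, model="openai:gpt-5.4")

# The original {placeholders} survive the edit:
assert "{inputs}" in custom_prompt
```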