OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards

<div align="center"> <img src="./docs/images/logo.svg" alt="Open-Judge Logo" width="500"> <br/> <h3> <em>Holistic Evaluation, Quality Rewards: Driving Application Excellence</em> </h3> <p> 🌟 <em>If you find OpenJudge helpful, please give us a <b>Star</b>!</em> 🌟 </p>

Python 3.10+ · PyPI: py-openjudge

🌐 Website | 🚀 Try Online | 📖 Documentation | 🤝 Contributing | 中文

</div>

OpenJudge is an open-source evaluation framework for AI applications (e.g., AI agents or chatbots) designed to evaluate quality and drive continuous application optimization.

In practice, application excellence depends on a trustworthy evaluation workflow: Collect test data → Define graders → Run evaluation at scale → Analyze weaknesses → Iterate quickly.

OpenJudge provides ready-to-use graders and can generate scenario-specific rubrics (as graders), making this workflow simpler, more rigorous, and easy to integrate into your stack. It can also convert grading results into reward signals to help you fine-tune and optimize your application.

🚀 Try it now! Visit openjudge.me/app to use graders online — no installation required. Test built-in graders, build custom rubrics, and explore evaluation results directly in your browser.


News

  • 2026-03-10 - 🛠️ New Skills - Claude authenticity verification, find skills combo, and more. 👉 Browse Skills

  • 2026-02-12 - 📚 Reference Hallucination Arena - Benchmark for evaluating LLM academic reference hallucination. 👉 Documentation | 📊 Leaderboard

  • 2026-01-27 - 🆕 Paper Review - Automatically review academic papers using LLM-powered evaluation. 👉 Documentation

  • 2026-01-27 - 🖥️ OpenJudge UI - A Streamlit-based visual interface for grader testing and Auto Arena. 👉 Try Online | Run locally: streamlit run ui/app.py


✨ Key Features

📦 Systematic & Quality-Assured Grader Library

Access 50+ production-ready graders featuring a comprehensive taxonomy, rigorously validated for reliable performance.

<table> <tr> <td width="33%" valign="top">

🎯 General

Focus: Semantic quality, functional correctness, structural compliance

Key Graders:

  • Relevance - Semantic relevance scoring
  • Similarity - Text similarity measurement
  • Syntax Check - Code syntax validation
  • JSON Match - Structure compliance
</td> <td width="33%" valign="top">

🤖 Agent

Focus: Agent lifecycle, tool calling, memory, plan feasibility, trajectory quality

Key Graders:

  • Tool Selection - Tool choice accuracy
  • Memory - Context preservation
  • Plan - Strategy feasibility
  • Trajectory - Path optimization
</td> <td width="33%" valign="top">

🖼️ Multimodal

Focus: Image-text coherence, visual generation quality, image helpfulness

Key Graders:

  • Image Coherence - Visual-text alignment
  • Text-to-Image - Generation quality
  • Image Helpfulness - Image contribution
</td> </tr> </table>
  • 🌐 Multi-Scenario Coverage: Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks. 👉 Explore Supported Scenarios
  • 🔄 Holistic Agent Evaluation: Beyond final outcomes, we assess the entire agent lifecycle, including trajectories, memory, reflection, and tool use. 👉 Agent Lifecycle Evaluation
  • Quality Assurance: Every grader comes with benchmark datasets and pytest integration for validation. 👉 View Benchmark Datasets
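
The benchmark-plus-pytest pairing described above can be sketched generically: run a grader over a small labeled dataset and assert a minimum agreement rate. The exact-match grader and dataset below are toy stand-ins for illustration, not OpenJudge's shipped assets.

```python
# Toy validation in the style of grader benchmarking: the grader's
# verdicts are compared against human labels and an agreement
# threshold is enforced. Dataset and grader are illustrative stand-ins.
BENCHMARK = [
    {"response": "4", "reference": "4", "label": 1},
    {"response": "5", "reference": "4", "label": 0},
    {"response": "yes", "reference": "yes", "label": 1},
]

def exact_match_grader(response: str, reference: str) -> int:
    return int(response.strip() == reference.strip())

def test_grader_agreement():
    hits = sum(
        exact_match_grader(ex["response"], ex["reference"]) == ex["label"]
        for ex in BENCHMARK
    )
    accuracy = hits / len(BENCHMARK)
    assert accuracy >= 0.9, f"grader agreement too low: {accuracy:.2f}"
```

Running this under pytest turns the benchmark into a regression gate: any change that degrades grader agreement fails CI.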

🛠️ Flexible Grader Building Methods

Choose the build method that fits your requirements:

  • Customization: Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. 👉 Custom Grader Development Guide
  • Zero-shot Rubrics Generation: Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. 👉 Zero-shot Rubrics Generation Guide
  • Data-driven Rubrics Generation: Ambiguous requirements, but a few labeled examples on hand? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. 👉 Data-driven Rubrics Generation Guide
  • Training Judge Models: Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. 👉 Train Judge Models
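
For the first route above (explicit rules via Python interfaces), a rule-based grader can be little more than a class with an evaluate method. The sketch below is illustrative only: the class and result names are hypothetical, and OpenJudge's real base-class interface may differ.

```python
import json
from dataclasses import dataclass

@dataclass
class GradeResult:
    # Hypothetical result container; OpenJudge's actual result type may differ.
    score: float
    reason: str

class JsonKeysGrader:
    """Rule-based grader: scores whether a response is valid JSON
    containing a set of required keys (in the spirit of the built-in
    JSON Match grader, but written from scratch for illustration)."""

    def __init__(self, required_keys: list[str]):
        self.required_keys = set(required_keys)

    def evaluate(self, response: str) -> GradeResult:
        try:
            data = json.loads(response)
        except json.JSONDecodeError as exc:
            return GradeResult(0.0, f"invalid JSON: {exc}")
        if not isinstance(data, dict):
            return GradeResult(0.0, "top-level value is not an object")
        missing = self.required_keys - data.keys()
        if missing:
            return GradeResult(0.5, f"missing keys: {sorted(missing)}")
        return GradeResult(1.0, "all required keys present")

grader = JsonKeysGrader(["order_id", "status"])
print(grader.evaluate('{"order_id": "A1", "status": "shipped"}').score)  # 1.0
```

Because the rules are deterministic, a grader like this needs no model client and can run in CI at negligible cost.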

🔌 Easy Integration

Using mainstream observability platforms like LangSmith or Langfuse? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We also provide integrations with training frameworks like VERL for RL training. 👉 See Integrations for details

🌐 Online Playground

Explore OpenJudge without writing a single line of code. Our online platform at openjudge.me/app lets you:

  • Test graders interactively — select a built-in grader, input your data, and see results instantly
  • Build custom rubrics — use the zero-shot generator to create graders from task descriptions
  • View leaderboards — compare model performance across evaluation benchmarks at openjudge.me/leaderboard

📥 Installation

💡 Don't want to install anything? Try OpenJudge online — use graders directly in your browser, no setup needed.

pip install py-openjudge

💡 More installation methods can be found in the Quickstart Guide.


🚀 Quickstart

📚 Complete Quickstart can be found in the Quickstart Guide.

Simple Example

A simple example to evaluate a single response:

import asyncio
from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

async def main():
    # 1️⃣ Create model client
    model = OpenAIChatModel(model="qwen3-32b")
    # 2️⃣ Initialize grader
    grader = RelevanceGrader(model=model)
    # 3️⃣ Prepare data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of AI that enables computers to learn from data.",
    }
    # 4️⃣ Evaluate
    result = await grader.aevaluate(**data)
    print(f"Score: {result.score}")   # Score: 4
    print(f"Reason: {result.reason}")

if __name__ == "__main__":
    asyncio.run(main())
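
Since aevaluate is a coroutine, a batch of items can be scored concurrently with asyncio.gather. The sketch below uses a stub grader (hypothetical, defined inline) so the pattern runs without model credentials; in practice you would pass a real grader such as RelevanceGrader.

```python
import asyncio

class StubGrader:
    """Stand-in for a real grader (e.g. RelevanceGrader) so the
    concurrency pattern below is runnable without an API key."""
    async def aevaluate(self, query: str, response: str) -> dict:
        await asyncio.sleep(0)  # pretend this is an LLM call
        return {"score": 4, "reason": f"stub grade for: {query!r}"}

async def grade_batch(grader, items: list[dict]) -> list[dict]:
    # One coroutine per item; gather overlaps the (mock) LLM calls.
    return await asyncio.gather(*(grader.aevaluate(**item) for item in items))

items = [
    {"query": "What is machine learning?", "response": "A subset of AI."},
    {"query": "What is overfitting?", "response": "Memorizing noise in training data."},
]
results = asyncio.run(grade_batch(StubGrader(), items))
print([r["score"] for r in results])  # [4, 4]
```

With a real grader, concurrent evaluation amortizes per-call LLM latency across the whole dataset.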

Evaluate LLM Applications with Built-in Graders

Use multiple built-in graders to comprehensively evaluate your LLM application: 👉 Explore All built-in graders

Business Scenario: Evaluating an e-commerce customer service agent that handles order inquiries. We assess the agent's performance across three dimensions: relevance, hallucination, and tool selection.

import asyncio
from openjudge.models import OpenAIChatModel
from openjudge.graders.common import RelevanceGrader, HallucinationGrader
from openjudge.graders.agent.tool.tool_selection import ToolSelectionGrader
from openjudge.runner import GradingRunner
from openjudge.runner.aggregator import WeightedSumAggregator
from openjudge.analyzer.statistical import DistributionAnalyzer

TOOL_DEFINITIONS = [
    {"name": "query_order", "description": "Query order status and logistics information", "parameters": {"order_id": "str"}},
    {"name": "query_logistics", "description": "Query detailed logistics tracking", "parameters": {"order_id": "str"}},
    {"name": "estimate_delivery", "description": "Estimate delivery time", "parameters": {"order_id": "str"}},
]
# Prepare your dataset
dataset = [{
    "query": "Where ...",  # truncated in the source
    # ... (the remainder of this example is missing; see the Quickstart
    # Guide for the complete runnable version)
}]
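
The multi-grader example above imports WeightedSumAggregator. Independent of OpenJudge's actual runner API, the underlying idea of collapsing several grader scores into one reward signal can be sketched as a weighted average; the helper below is hypothetical and the real aggregator may normalize differently.

```python
def weighted_sum_reward(scores: dict[str, float],
                        weights: dict[str, float],
                        max_score: float = 5.0) -> float:
    """Collapse per-grader scores (e.g. on a 1-5 scale) into a single
    reward in [0, 1]. Hypothetical helper, for illustration only."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * scores[name] for name in weights)
    return (weighted / total_weight) / max_score

# Combine the three dimensions from the e-commerce example:
# relevance, hallucination, and tool selection.
reward = weighted_sum_reward(
    {"relevance": 4, "hallucination": 5, "tool_selection": 3},
    {"relevance": 0.5, "hallucination": 0.3, "tool_selection": 0.2},
)
print(round(reward, 2))  # 0.82
```

A scalar in [0, 1] like this is the shape of reward signal that RL training integrations (e.g. VERL) typically consume.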